python lxml

Date: 2019-11-08
 
1. Parsing HTML and building a DOM

>>> import lxml.etree as etree
>>> html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
>>> dom = etree.fromstring(html)
>>> etree.tostring(dom)
b'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'

To use the BeautifulSoup-based parser instead (this requires the BeautifulSoup package to be installed):

>>> import lxml.html.soupparser as soupparser
>>> dom = soupparser.fromstring(html)
>>> etree.tostring(dom)
b'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'

I strongly recommend soupparser, because it copes with malformed HTML far better than etree.fromstring, which is a strict XML parser and rejects input that is not well-formed.
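The difference in strictness is easy to demonstrate. The sketch below is illustrative only: the `broken` string is made up for this example, and `etree.HTMLParser` (lxml's built-in lenient HTML parser) stands in for soupparser so that the code runs without BeautifulSoup installed. Where that package is available, `soupparser.fromstring(broken)` could be substituted for the second parse.

```python
from lxml import etree

# Hypothetical malformed input: <p> is never closed and </div> is missing.
broken = '<html><body><p>unclosed<div>text</body></html>'

# The default parser behind etree.fromstring is a strict XML parser,
# so mismatched tags raise XMLSyntaxError.
try:
    etree.fromstring(broken)
    strict_error = None
except etree.XMLSyntaxError as exc:
    strict_error = exc

# A lenient HTML parser repairs the markup instead of raising.
dom = etree.fromstring(broken, etree.HTMLParser())
tags = [el.tag for el in dom.iter()]

print('strict parser failed:', strict_error is not None)
print('recovered tags:', tags)
```

The exact shape of the repaired tree depends on the parser, so it is worth inspecting the result rather than assuming a particular nesting.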
       
2. Accessing Elements through the DOM

Number of child elements:

>>> len(dom)
1

Accessing a child element by index:

>>> dom[0].tag
'body'
       
Looping over the children:

>>> for child in dom:
...     print(child.tag)
...
body
       
Finding the index of a node:

>>> body = dom[0]
>>> dom.index(body)
0

Getting the parent node from a child node:

>>> body.getparent().tag
'html'
       
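Visiting every node in the tree (iter() yields the element itself first, then all of its descendants in document order):

>>> for ele in dom.iter():
...     print(ele.tag)
...
html
body
div
div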
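The navigation calls above can be combined. As a minimal sketch, the helper `depth` below (a hypothetical name, not part of lxml) walks up with getparent() to count ancestors, and iter() is used to print an indented outline of the tree.

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)


def depth(element):
    """Count how many ancestors an element has by following getparent()."""
    levels = 0
    parent = element.getparent()
    while parent is not None:
        levels += 1
        parent = parent.getparent()
    return levels


# iter() yields the root first, then every descendant in document order.
outline = [(depth(el), el.tag) for el in dom.iter()]
for level, tag in outline:
    print('  ' * level + tag)
```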
       
3. Accessing node attributes

>>> body.get('id')
'1'

The same value is available through the dictionary-like attrib property:

>>> attrs = body.attrib
>>> attrs.get('id')
'1'
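A few related attribute operations, sketched on a shortened version of the same document. The attribute name `class` and the value `main` are made up for illustration.

```python
from lxml import etree

dom = etree.fromstring('<html><body id="1"><div>123</div></body></html>')
body = dom[0]

# get() returns None for a missing attribute unless a default is given.
missing = body.get('class')
fallback = body.get('class', 'none')

# set() adds or overwrites an attribute on the element.
body.set('class', 'main')

# attrib behaves like a dict, so it can be copied into a plain one.
snapshot = dict(body.attrib)

print(missing, fallback, snapshot)
```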
       
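4. Accessing the content of an Element

>>> body.text
'abc'
>>> body.tail

text holds only the text from the start of the element up to its first child element. tail holds the text that follows the element's closing tag, up to the next sibling element or the end of the parent. Here body.tail is None, so the interpreter prints nothing; the strings 'def' and 'ghi' are the tails of the two div elements.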
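To see where each piece of text is actually stored, the following sketch lists text and tail for body and for each of its children, using the same sample document.

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)
body = dom[0]

# For each child: its tag, the text inside it, and the text right after it.
pieces = [(child.tag, child.text, child.tail) for child in body]

print('body.text =', repr(body.text))
print('body.tail =', repr(body.tail))
for tag, text, tail in pieces:
    print(tag, repr(text), repr(tail))
```

The strings 'def' and 'ghi' show up as the tail of the first and second div, while body itself has no tail because nothing follows its closing tag.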
       
All text nodes that are direct children of this node:

>>> body.xpath('text()')
['abc', 'def', 'ghi']
       
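All text in this node and its descendants:

>>> body.xpath('.//text()')
['abc', '123', 'def', '456', 'ghi']

Note the leading dot. The expression '//text()' is absolute: it starts from the document root and returns every text node in the document, no matter which element it is called on. In this example the two give the same list only because body contains all of the document's text.

body.text_content() returns all the text of this node and its descendants as a single string. The method exists only on lxml.html elements (for example those produced by soupparser.fromstring or lxml.html.fromstring), not on elements created by etree.fromstring.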
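The scope of a text query is easiest to see when it is run from a nested element. This sketch compares the absolute '//text()' with the relative './/text()' on the first div, and uses itertext(), which is available on plain etree elements, to join the text of body into one string.

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)
body = dom[0]
first_div = body[0]

# Absolute path: searches the whole document, even when called on a div.
absolute = first_div.xpath('//text()')

# Relative path: searches only inside first_div.
relative = first_div.xpath('.//text()')

# itertext() walks the text of an element and its descendants in order.
joined = ''.join(body.itertext())

print(absolute)
print(relative)
print(joined)
```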
       
5. XPath support

All div elements:

>>> for ele in dom.xpath('//div'):
...     print(ele.tag)
...
div
div
       
The element with id="1":

>>> dom.xpath('//*[@id="1"]')[0].tag
'body'

The first div under body (XPath positions start at 1):

>>> dom.xpath('body/div[1]')[0].tag
'div'
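xpath() does not only return elements. As a brief sketch on the same sample document, the queries below return text strings, a number, and a match selected through an XPath variable. The variable name `wanted` is arbitrary and chosen for this example.

```python
from lxml import etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)

# Text nodes are returned as strings.
div_texts = dom.xpath('//div/text()')

# XPath functions such as count() return a float.
div_count = dom.xpath('count(//div)')

# last() selects the final div among its siblings.
last_div = dom.xpath('//div[last()]')[0]

# Keyword arguments become XPath variables, which avoids building
# the expression by string concatenation.
by_id = dom.xpath('//*[@id=$wanted]', wanted='1')

print(div_texts, div_count, last_div.text, by_id[0].tag)
```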