Debugging XPath with lxml in Python

Date: 2019-11-08


 
   
    
      
      
    <li>
      <a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
      【物理】牛頓(力單位)
      <div class="satis" style="display:none">
        <span>您對本詞條的內容滿意嗎：</span>
        <font>
          <a href="###" tip-data="good" updateword="ньютон" satis="245057">滿意</a>
          <a href="###" tip-data="update" updateword="ньютон" satis="2">請改進</a>
        </font>
      </div>
    </li>
       
 
    
 
   
 

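I ran into this piece of markup and needed to extract data from it. After looking up some references, I solved it as follows: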

 
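    # Python 2 code
    from lxml import etree

    def readFile(filen, decoding):
        # Read the raw bytes and decode them explicitly.
        # On failure, fall back to an empty string.
        html = ''
        try:
            with open(filen, 'rb') as f:
                html = f.read().decode(decoding)
        except (IOError, UnicodeError, LookupError):
            pass
        return html

    def extract(filename, decoding, xpath):
        # Parse the decoded text with lxml's lenient HTML parser,
        # then evaluate the XPath expression against the tree.
        html = readFile(filename, decoding)
        tree = etree.HTML(html)
        return tree.xpath(xpath)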

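These two functions work around the encoding problems that come up when reading Chinese web pages: the file is decoded explicitly before it is handed to lxml.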
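On Python 3, `str` has no `.decode()` method, so the same idea has to be written against bytes. The following is only a rough sketch of how that might look, not the original author's code: the helper name `read_text` and the fallback list of encodings are assumptions chosen for illustration.

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    """Read a file as bytes and decode it with the first encoding that fits.

    Hypothetical helper: the name and the default encoding list are
    illustrative. Returns '' if the file cannot be read or no encoding works.
    """
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except OSError:
        return ""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    return ""
```

Trying UTF-8 first is a deliberate choice: bytes written in a GB-family encoding are usually not valid UTF-8, so a wrong guess tends to fail loudly instead of silently producing garbage.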

 
    # Python 2 code
    import urllib2

    def GetXpath1(url, xpath, saveFile):
        # Download the page and save the raw bytes to a temporary file.
        # NOTE: saveFile is unused; the page is always written to temp.txt.
        response = urllib2.urlopen(url)
        data = response.read()
        f = open("temp.txt", 'wb')
        f.write(data)
        f.close()
        # Decode the saved file as UTF-8 and run the XPath query.
        sections = extract('temp.txt', 'utf-8', xpath)
        print len(sections), type(sections)  # prints: 1 <type 'list'>
        print sections  # a list of elements: [<Element a at 0x26c8948>]
        print sections[0].tag, sections[0].attrib, sections[0].attrib.get("href")
        # prints: a {'href': u'/view/\u041d/\u041d\u043e\u0432\u0433\u043e\u0440\u043e\u0434', 'class': 'det'} /view/Н/Новгород
        print type(sections[0].attrib)  # <type 'lxml.etree._Attrib'>
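To try this without a network request, here is a minimal Python 3 sketch that runs an XPath query directly against a trimmed copy of the `<li>` snippet from the top of this post. The expression `//a[@class="det"]` is an assumption, since the post does not show which XPath was actually passed in, and the sketch needs the third-party `lxml` package.

```python
from lxml import etree

# Trimmed copy of the <li> snippet from the top of the post.
SNIPPET = '''
<li>
  <a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
  <div class="satis" style="display:none">
    <a href="###" tip-data="good" updateword="ньютон" satis="245057">滿意</a>
  </div>
</li>
'''

# etree.HTML uses the lenient HTML parser and returns the root element.
tree = etree.HTML(SNIPPET)

# Assumed expression: every <a> whose class attribute is exactly "det".
links = tree.xpath('//a[@class="det"]')

first = links[0]
print(len(links), first.tag)   # number of matches and the tag name
print(first.get("href"))       # attribute lookup on the element
print(first.text)              # text content of the link
print(dict(first.attrib))      # _Attrib converted to a plain dict
```

`first.attrib` is a dict-like `lxml.etree._Attrib` object, which is why `.attrib.get("href")` works in the function above; `first.get("href")` is a shorter way to reach the same value.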