Debugging XPath with lxml in Python

<li>
<a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
  【物理】牛頓(力單位)
<div class="satis" style="display:none">
<span>您對本詞條的內容滿意嗎:</span>
<font>
<a href="###" tip-data="good" updateword="ньютон" satis="245057">滿意</a>
<a href="###" tip-data="update" updateword="ньютон" satis="2">請改進</a>
</font>
</div>
</li>

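I ran into this fragment of markup (from a Russian-Chinese dictionary page) and needed to process it. After some reading, this is how I solved it: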

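# Python 2
from lxml import etree

def readFile(filen, decoding):
    # Read the raw bytes and decode them explicitly; return '' on failure.
    html = ''
    try:
        html = open(filen, 'rb').read().decode(decoding)
    except (IOError, UnicodeDecodeError):
        pass
    return html

def extract(filen, decoding, xpath):
    # Parse the decoded text as HTML and return the XPath matches.
    html = readFile(filen, decoding)
    tree = etree.HTML(html)
    return tree.xpath(xpath)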

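These two functions work around the encoding problems that show up when reading Chinese web pages: the file is decoded explicitly before it is handed to lxml.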
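For readers on Python 3, the same idea can be sketched as follows. This is only a sketch under assumptions: the names `read_file` and `extract` mirror the functions above, the early return on empty input is an addition that the original code does not have, and the temporary file merely stands in for a saved page.

```python
# Python 3 sketch (the article's own code targets Python 2).
import os
import tempfile
from lxml import etree


def read_file(path, encoding):
    """Return the decoded contents of path, or '' if it cannot be read or decoded."""
    try:
        with open(path, encoding=encoding) as fh:
            return fh.read()
    except (OSError, UnicodeDecodeError):
        return ''


def extract(path, encoding, xpath):
    """Parse the file as HTML and return the XPath matches (empty list on failure)."""
    html = read_file(path, encoding)
    if not html.strip():
        return []  # assumption: treat an unreadable or empty file as "no matches"
    return etree.HTML(html).xpath(xpath)


# Usage with a throwaway file; the markup is the fragment from the top of the article.
sample = '<li><a href="/Н">Н</a>:<a class="det" href="/view/Н/ньютон">ньютон</a></li>'
with tempfile.NamedTemporaryFile('w', suffix='.html', encoding='utf-8',
                                 delete=False) as tmp:
    tmp.write(sample)
try:
    links = extract(tmp.name, 'utf-8', '//a[@class="det"]')
    hrefs = [a.get('href') for a in links]
finally:
    os.remove(tmp.name)
print(hrefs)
```

Opening the file with an explicit `encoding` does the decoding in one step, so lxml receives a `str` and does not have to guess the character set.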

# Python 2
import urllib2

def GetXpath1(url, xpath, saveFile):
    # Note: saveFile is accepted but not used in this snippet.
    response = urllib2.urlopen(url)
    data = response.read()
    # Save the raw bytes so that extract() can decode them explicitly.
    f = open("temp.txt", 'wb')
    f.write(data)
    f.close()
    sections = extract('temp.txt', 'utf-8', xpath)
    print len(sections), type(sections)  # prints: 1 <type 'list'>
    print sections  # a list of elements: [<Element a at 0x26c8948>]
    print sections[0].tag, sections[0].attrib, sections[0].attrib.get("href")
    # prints: a {'href': u'/view/\u041d/\u041d\u043e\u0432\u0433\u043e\u0440\u043e\u0434', 'class': 'det'} /view/Н/Новгород
    print type(sections[0].attrib)  # <type 'lxml.etree._Attrib'>
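A possible Python 3 variant skips the temporary file and gives lxml the downloaded bytes together with an explicit encoding. This is a sketch, not the article's method: `fetch_xpath` and `xpath_from_bytes` are illustrative names, and the check at the bottom uses hand-written bytes in place of a real download.

```python
# Python 3 sketch; fetch_xpath and xpath_from_bytes are illustrative names.
from urllib.request import urlopen
from lxml import etree


def xpath_from_bytes(data, encoding, xpath):
    """Parse raw HTML bytes with an explicit encoding and evaluate xpath."""
    parser = etree.HTMLParser(encoding=encoding)
    tree = etree.HTML(data, parser)
    return tree.xpath(xpath)


def fetch_xpath(url, xpath, encoding='utf-8'):
    """Download url and evaluate xpath on the response body."""
    with urlopen(url) as response:
        data = response.read()
    return xpath_from_bytes(data, encoding, xpath)


# Offline check with bytes that stand in for a downloaded page.
page = '<li><a class="det" href="/view/Н/ньютон">ньютон</a></li>'.encode('utf-8')
sections = xpath_from_bytes(page, 'utf-8', '//a[@class="det"]')
print(len(sections), type(sections))
print(sections[0].tag, dict(sections[0].attrib), sections[0].get('href'))
```

Telling the parser the encoding up front plays the same role as the explicit `decode` in `readFile`: the bytes are interpreted as UTF-8 regardless of what the page declares.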

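This was the key part, and it took some time to work out. The main goal was to extract the Russian text from

<li><a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>

The things to pay attention to are the Element members tag and attrib, and the use of get("href") to read a single attribute.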
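As a small illustration of those members, the Python 3 sketch below parses the fragment from a string literal rather than from a downloaded page. The XPath expressions are one possible choice and are not taken from the article; `text` and `tail` are shown as well because they hold the Russian word and the definition that follows the link.

```python
from lxml import etree

fragment = (
    '<li><a href="/Н">Н</a>:'
    '<a class="det" href="/view/Н/ньютон">ньютон</a>'
    ' 【物理】牛頓(力單位)</li>'
)
tree = etree.HTML(fragment)
link = tree.xpath('//li/a[@class="det"]')[0]

print(link.tag)            # element name
print(dict(link.attrib))   # all attributes, copied into a plain dict
print(link.get('href'))    # a single attribute; None if it is missing
print(link.text)           # text inside the element (the Russian word)
print(link.tail.strip())   # text that follows the closing </a>

# XPath can also return the strings directly instead of elements.
words = tree.xpath('//a[@class="det"]/text()')
hrefs = tree.xpath('//a[@class="det"]/@href')
print(words, hrefs)
```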

With that, we have essentially everything we need.
