    <li><a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
    【物理】牛頓(力單位)
    <div class="satis" style="display:none">
    <span>您對本詞條的內容滿意嗎:</span>
    <font>
    <a href="###" tip-data="good" updateword="ньютон" satis="245057">滿意</a>
    <a href="###" tip-data="update" updateword="ньютон" satis="2">請改進</a>
    </font>
    </div>
    </li>
I ran into this piece of XML and needed to process it. After some research, here is how I solved it:
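    # Python 2 code; requires lxml.
    from lxml import etree

    def readFile(filen, decoding):
        # Read the file as bytes and decode it explicitly; return '' on failure.
        html = ''
        try:
            with open(filen, 'rb') as f:
                html = f.read().decode(decoding)
        except (IOError, UnicodeDecodeError):
            pass
        return html

    def extract(filen, decoding, xpath):
        # Parse the decoded HTML and return the nodes matching the XPath.
        html = readFile(filen, decoding)
        tree = etree.HTML(html)
        return tree.xpath(xpath)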
These two functions take care of the encoding problems that come up when reading Chinese web pages.
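For readers on Python 3, the same decoding step could look roughly like the sketch below. It is only an illustration, not the code used in this post: the names `decode_html` and `read_html` are hypothetical, and substituting undecodable bytes with `errors="replace"` is my own assumption about what is acceptable for a scraped page.

```python
def decode_html(raw, encoding="utf-8"):
    """Decode raw page bytes into text.

    Undecodable bytes become U+FFFD instead of raising, so one bad
    byte does not throw away the whole page (an assumption, see above).
    """
    return raw.decode(encoding, errors="replace")


def read_html(path, encoding="utf-8"):
    """Read a saved page as bytes and decode it explicitly."""
    with open(path, "rb") as f:
        return decode_html(f.read(), encoding)


if __name__ == "__main__":
    raw = "ньютон 牛頓".encode("utf-8")
    print(decode_html(raw))
```

Opening the file in binary mode and decoding by hand keeps the choice of encoding visible, which is the same idea as `readFile` above.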
    import urllib2

    def GetXpath1(url, xpath, saveFile):
        # Download the page and save the raw bytes to a temporary file.
        # Note: saveFile is not used in this function.
        response = urllib2.urlopen(url)
        data = response.read()
        f = open("temp.txt", 'wb')
        f.write(data)
        f.close()
        sections = extract('temp.txt', 'utf-8', xpath)
        print len(sections), type(sections)
        # prints: 1 <type 'list'>
        print sections
        # a list of elements: [<Element a at 0x26c8948>]
        print sections[0].tag, sections[0].attrib, sections[0].attrib.get("href")
        # prints: a {'href': u'/view/\u041d/\u041d\u043e\u0432\u0433\u043e\u0440\u043e\u0434', 'class': 'det'} /view/Н/Новгород
        print type(sections[0].attrib)
        # <type 'lxml.etree._Attrib'>
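This was the key part, and it took some time to work out. The goal was to extract the Russian text from
<li><a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
The things to pay attention to are the Element properties tag and attrib, and the use of get("").
With that, I basically had everything I needed.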
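To try the tag / attrib / get() access without downloading anything, here is a small Python 3 sketch. Note the swap: it uses the standard library's `xml.etree.ElementTree` rather than lxml. The two share the basic Element API, but this only works because the fragment happens to be well-formed XML; a real HTML page would still need `etree.HTML` from lxml as shown earlier. `FRAGMENT` is a shortened copy of the snippet at the top, and `find_det_links` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment from the top of the post.
FRAGMENT = (
    '<li><a href="/Н">Н</a>:'
    "<a class=\"det\" href='/view/Н/ньютон'>ньютон</a>"
    "【物理】牛頓(力單位)</li>"
)


def find_det_links(fragment):
    """Return (href, text) pairs for every <a class="det"> element."""
    root = ET.fromstring(fragment)
    # ElementTree supports a limited XPath subset, including
    # attribute predicates such as [@class='det'].
    links = root.findall(".//a[@class='det']")
    return [(a.get("href"), a.text) for a in links]


if __name__ == "__main__":
    for href, text in find_det_links(FRAGMENT):
        print(href, text)
```

Here `a.get("href")` is shorthand for `a.attrib.get("href")`, and `a.text` holds the Russian word inside the link.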