Python_Crawler_Foundation3_CSS_Xpath_Json_XML_RegExp

Date: 2019-11-15


Python Simple Crawler

Use xml.dom or xml.sax to parse XML files (https://www.tutorialspoint.com/python/python_xml_processing.htm).

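CSS (Cascading Style Sheets) controls how HTML elements are displayed. To control how an element is displayed, CSS first has to locate that element.

Why use CSS to style HTML elements instead of setting attributes on each element directly?

With attributes set directly, as in <p font-size="..." color="...">abcd</p>, you have to edit the elements by hand, one at a time.

CSS, by contrast, can quickly locate a large batch of elements by ID, class, or other means.

In a rule such as h1 {color: red; font-size: 14px;}, h1 is the selector, color and font-size are properties, and red and 14px are their values.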

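head, p, and h1 are all elements. h1 and h2 are heading tags, img is the image tag, and so on.
class is an attribute of an element.
.important selects every element that has this class. Given <p class="important"> and <h1 class="important">, both p and h1 are selected, because both carry the important class.
A class selector can be combined with an element selector: p.important selects only the p elements that have the important class.

This combined form is an element selector plus a class selector.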

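ID selector: every element can be given an ID. The ID selector is much like the class selector, but a class can be shared: different tags (head, body, p, and so on) can carry the same class, as with p.important and h1.important above, whereas an id is unique across the whole document.

The id selector and the class selector are both special cases of the attribute selector.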
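The three selector forms above can be sketched in Python. This is only an illustration under stated assumptions: the sample HTML, the SelectorMatcher class, and the select helper are hypothetical names made up for this note, and the matching is a tiny imitation of CSS built on the standard html.parser module, not a real CSS engine.

```python
from html.parser import HTMLParser

# Hypothetical sample page used only for this sketch.
HTML = """
<h1 class="important">Title</h1>
<p class="important note">First</p>
<p id="intro">Second</p>
<p>Third</p>
"""

class SelectorMatcher(HTMLParser):
    """Collect start tags that match an optional tag name, class, and id."""

    def __init__(self, tag=None, cls=None, elem_id=None):
        super().__init__()
        self.tag, self.cls, self.elem_id = tag, cls, elem_id
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.tag and tag != self.tag:
            return
        if self.cls and self.cls not in classes:
            return
        if self.elem_id and attrs.get("id") != self.elem_id:
            return
        self.matches.append(tag)

def select(html, **criteria):
    """Return the tag names of all elements matching the criteria."""
    matcher = SelectorMatcher(**criteria)
    matcher.feed(html)
    return matcher.matches

print(select(HTML, cls="important"))           # behaves like .important
print(select(HTML, tag="p", cls="important"))  # behaves like p.important
print(select(HTML, elem_id="intro"))           # behaves like #intro
```

The class selector matches both the h1 and the first p, the combined form narrows that to the p, and the id lookup matches a single element.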

In the CSS file here, only class selectors are used.

XPath

http://www.w3school.com.cn/xpath/index.asp

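The relationships between nodes resemble file paths in a file system: the document can be viewed as a directory tree, which makes nodes easier to traverse and look up. CSS cannot do this in both directions. CSS selectors only walk forward (from an element down to its descendants), whereas XPath can walk forward or backward (for example, from an element up to its parent).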
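A minimal sketch of the forward and backward steps, assuming Python 3.7 or later. It uses xml.etree.ElementTree from the standard library, which supports only a limited subset of XPath; the inline XML is a shortened copy of the bookstore sample, and the variable names are made up for this note.

```python
import xml.etree.ElementTree as ET

# Shortened bookstore sample, inlined so the sketch is self-contained.
XML = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)

# Forward (downward) step: titles of the books in the WEB category.
web_titles = [t.text for t in root.findall("./book[@category='WEB']/title")]
print(web_titles)

# Backward (upward) step: find a price first, then climb to its parent with "..".
parents = root.findall(".//price[.='30.00']/..")
print([book.get("category") for book in parents])
```

The second query starts from a leaf (the price) and ends on its parent book, which is the kind of upward move a plain descendant selector does not express.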

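HTML and XML look very similar, but HTML's syntax rules are much less strict: a browser does its best to parse whatever it is given. If you feed a piece of HTML to an XML parser, it may well fail.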
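The difference can be seen with a small sketch using only the standard library. The sample string and the TagCollector class are made up for illustration; the HTML has an unclosed <br>, which a lenient HTML parser accepts and a strict XML parser rejects.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SLOPPY_HTML = "<p>line one<br>line two</p>"  # <br> is never closed

# A strict XML parser refuses the mismatched tags.
try:
    ET.fromstring(SLOPPY_HTML)
    xml_ok = True
except ET.ParseError as err:
    xml_ok = False
    print("XML parser rejected it:", err)

class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(SLOPPY_HTML)
print("HTML parser saw:", collector.tags)
```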

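jQuery can also be used to parse XML. It locates elements with CSS-style selectors; XPath-style selectors were only supported in early versions.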

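#outcome:

Before Python 3.7, dictionaries did not guarantee key order, so data dumped with json could come out in a different order than it was written. From Python 3.7 on, dict preserves insertion order.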
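A short sketch of key order when dumping with json, assuming Python 3.7 or later (where dict keeps insertion order). The record below is made up from values in the bookstore sample. If a stable order matters regardless of how the dict was built, sort_keys=True is one option.

```python
import json

# Hypothetical record built from the bookstore sample values.
book = {"title": "Harry Potter", "price": "29.99", "category": "CHILDREN"}

as_inserted = json.dumps(book)               # keys in insertion order on Python 3.7+
as_sorted = json.dumps(book, sort_keys=True) # keys in alphabetical order
restored = json.loads(as_inserted)           # round trip back to a dict

print(as_inserted)
print(as_sorted)
print(restored == book)
```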

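Use DOM when the data is small and SAX when it is large: DOM loads the whole document into memory as a tree, while SAX streams through it and reports events as it goes.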

https://www.tutorialspoint.com/python/python_xml_processing.htm

Here is the easiest way to quickly load an XML document and to create a minidom object using the xml.dom module. The minidom object provides a simple parser method that quickly creates a DOM tree from the XML file.

The sample phrase calls the parse( file [,parser] ) function of the minidom object to parse the XML file designated by file into a DOM tree object.

books.xml

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore shelf="new arrives">

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>

XML_Dom.py

from xml.dom import minidom

# Open the XML document with the minidom parser.
# parse() returns a Document; documentElement is its root (xml.dom.minidom.Element).
doc = minidom.parse('books.xml')
root = doc.documentElement
print(type(root))
print(root.nodeName)

# Read an attribute value (hasAttribute, getAttribute).
if root.hasAttribute("shelf"):
    print("Root element : %s" % root.getAttribute("shelf"))
else:
    print('getting value failure')

# Print the details of each book.
# getElementsByTagName returns an xml.dom.minicompat.NodeList;
# childNodes[0].data is the text inside the element.
books = root.getElementsByTagName('book')
book_list = []
price_list = []
for book in books:
    if book.hasAttribute("category"):
        print("**** %s ****" % book.getAttribute("category"))
    titles = book.getElementsByTagName('title')
    print(type(titles))
    book_list.append(titles[0].childNodes[0].data)
    print("Book's Name: %s" % titles[0].childNodes[0].data)
    prices = book.getElementsByTagName('price')
    price_list.append(prices[0].childNodes[0].data)
    print("prices: %s" % prices[0].childNodes[0].data)
    print("\n")
# print(titles[0].childNodes[0].nodeValue, prices[0].childNodes[0].nodeValue)
print(book_list)
print(price_list)

#outcome

<class 'xml.dom.minidom.Element'>
bookstore
Root element : new arrives
**** COOKING ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: Everyday Italian
prices: 30.00


**** CHILDREN ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: Harry Potter
prices: 29.99


**** WEB ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: XQuery Kick Start
prices: 49.99


**** WEB ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: Learning XML
prices: 39.95


['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Learning XML']
['30.00', '29.99', '49.99', '39.95']
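Since these notes also cover JSON, here is a hedged sketch of turning the parsed books into JSON records instead of two parallel lists. The inline XML is a shortened copy of books.xml, and first_text and records are hypothetical names introduced for this example.

```python
import json
from xml.dom import minidom

# Shortened copy of books.xml, inlined so the sketch runs on its own.
XML = """<bookstore>
  <book category="COOKING"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title>Harry Potter</title><price>29.99</price></book>
</bookstore>"""

def first_text(node, tag):
    """Return the text of the first <tag> element found under node."""
    return node.getElementsByTagName(tag)[0].childNodes[0].data

doc = minidom.parseString(XML)
records = [
    {
        "category": book.getAttribute("category"),
        "title": first_text(book, "title"),
        "price": float(first_text(book, "price")),  # convert the price text to a number
    }
    for book in doc.getElementsByTagName("book")
]
print(json.dumps(records, indent=2))
```

Keeping each book's fields together in one dict avoids having to line up book_list and price_list by index later.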


book.xml

<?xml version="1.0"?>
<!-- bookstore is the root node (it plays the role of <root>) -->
<bookstore>
	<book>
		<title>Learn Python</title>
		<price>100</price>
	</book>
	<book>
		<title>Learn XML</title>
		<price>100</price>
	</book>
</bookstore>

XML_DOM.py

from xml.dom import minidom

doc = minidom.parse('book.xml')    # xml.dom.minidom.Document: the document tree, entered at the root node
root = doc.documentElement         # xml.dom.minidom.Element: the root node of doc
print(type(root))
# print(dir(root))                 # use dir() to list all attributes and methods of the object
print(root.nodeName)
books = root.getElementsByTagName('book')    # every <book> element, as a list-like xml.dom.minicompat.NodeList
for book in books:                           # books, titles and prices are all NodeList objects
    titles = book.getElementsByTagName('title')   # all <title> elements under the current book node
    prices = book.getElementsByTagName('price')
    print(titles[0].childNodes[0].nodeValue, prices[0].childNodes[0].nodeValue)

#Outcome:

<class 'xml.dom.minidom.Element'>
bookstore
Learn Python 100
Learn XML 100

XML_SAX.py: SAX-style processing is widely used with XML databases and is suited to large XML pages.

from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        # Remember the current element so char_data can report whose text it is.
        self.name = name
        print('element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('end element: %s' % name)

    def char_data(self, text):
        # Skip whitespace-only text between tags.
        if text.strip():
            print("%s's text is %s" % (self.name, text))

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element    # called at <book>
parser.EndElementHandler = handler.end_element        # called at </book>
parser.CharacterDataHandler = handler.char_data       # called for the text in <title>text</title>

with open('book.xml', 'r') as f:
    parser.Parse(f.read())

#Outcome:

element: bookstore, attrs: {}
element: book, attrs: {}
element: title, attrs: {}
title's text is Learn Python
end element: title
element: price, attrs: {}
price's text is 100
end element: price
end element: book
....
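The script above drives expat directly. As a rough alternative, the same event-driven idea can be written with the standard xml.sax module and a ContentHandler subclass. This is a sketch only: TitleHandler is a hypothetical name, and the XML is an inline copy of book.xml so the example runs without the file.

```python
import xml.sax

# Inline copy of book.xml so the sketch is self-contained.
XML = b"""<?xml version="1.0"?>
<bookstore>
  <book><title>Learn Python</title><price>100</price></book>
  <book><title>Learn XML</title><price>100</price></book>
</bookstore>"""

class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element while streaming."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.buffer))
            self.in_title = False

handler = TitleHandler()
xml.sax.parseString(XML, handler)
print(handler.titles)
```

Only the titles are kept in memory; everything else is discarded as the parser streams past it, which is why this style suits large documents.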

SAX Example: url = "http://www.ip138.com/post/"

Regular Expression

http://www.javashuo.com/article/p-mdtezkro-d.html

Reg_EXP02.py

import re

# 3 digits, a hyphen, then 3 to 8 digits
mr = re.match(r'\d{3}-\d{3,8}', '010-23232323')
print(mr.string)    # the string that was passed to match()

# Parentheses create groups
mr2 = re.match(r'(\d{3})-(\d{3,8})', '010-23232323')
print(mr2.groups())    # all groups as a tuple
print(mr2.group(0))    # the whole match
print(mr2.group(1))    # first group
print(mr2.group(2))    # second group

# Groups for hours, minutes and seconds
t = '2:15:45'
tm = re.match(r'(\d{0,2}):(\d{0,2}):(\d{0,2})', t)
print(tm.groups())
print(tm.group(0))

# Split a string on runs of digits
p = re.compile(r'\d+')
print(p.split('one1two22three333'))
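The time pattern above uses \d{0,2}, which also accepts empty or out-of-range fields such as ':99:'. A slightly stricter variant, offered only as a sketch, uses named groups and fullmatch. TIME_RE and parse_time are hypothetical names, and the ranges assume a 24-hour clock.

```python
import re

# Hours 0-23, minutes and seconds 0-59; a leading zero is optional.
TIME_RE = re.compile(
    r'(?P<hour>[01]?\d|2[0-3]):(?P<minute>[0-5]?\d):(?P<second>[0-5]?\d)'
)

def parse_time(text):
    """Return the time fields as ints, or None if text is not a valid time."""
    m = TIME_RE.fullmatch(text)
    return {k: int(v) for k, v in m.groupdict().items()} if m else None

print(parse_time('2:15:45'))
print(parse_time('25:00:00'))

# findall is the counterpart of split: it returns the digit runs themselves.
print(re.findall(r'\d+', 'one1two22three333'))
```

fullmatch requires the whole string to match, so trailing garbage is rejected rather than silently ignored as it would be with match.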

Selenium

https://selenium-python.readthedocs.io/index.html

To be revisited later.
