Python Simple Crawler
Use xml.dom or xml.sax to parse XML files (https://www.tutorialspoint.com/python/python_xml_processing.htm).
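CSS (Cascading Style Sheets) controls how HTML elements are displayed. To decide how an HTML element is displayed, you first have to locate (select) that element.
Why use CSS to style HTML elements instead of setting attributes directly on each element?
Setting attributes directly: <p font-size="" color="">abcd</p>. With this approach you have to edit the attributes by hand, one element at a time.
CSS, in contrast, can quickly select a whole group of elements by ID, class, or other means.
In a rule such as h1 {color: red; font-size: 14px;}, h1 is the selector, color and font-size are properties, and red and 14px are values.
Element selector + class selector
ID selector: each element can be given an ID. An ID selector looks much like a class selector, but a class can be shared: different tags (head, body, p, and so on) can carry the same class, for example p.important and h1.important, whereas an id must be unique within the document.
ID selectors and class selectors can both be seen as special cases of attribute selectors.
The CSS file here uses only class selectors.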
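The idea that id and class selectors are special cases of attribute selection can be sketched in Python. This is only an illustration: it uses the standard-library xml.etree.ElementTree and its limited XPath attribute predicates rather than a real CSS engine, and the page fragment, the class name "important" and the id "intro" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for this illustration.
PAGE = """
<html>
  <body>
    <h1 class="important">Title</h1>
    <p class="important">First paragraph</p>
    <p id="intro">Second paragraph</p>
    <p>Third paragraph</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Element selector: every <p>, like the CSS selector "p".
all_p = [p.text for p in root.iter("p")]

# Element + class: roughly "p.important" (exact match on the class attribute).
important_p = [p.text for p in root.findall(".//p[@class='important']")]

# Class only: roughly ".important", matching any tag.
important_any = [e.tag for e in root.findall(".//*[@class='important']")]

# ID: roughly "#intro"; an id should match at most one element.
intro = root.find(".//*[@id='intro']")

print(all_p)
print(important_p)
print(important_any)
print(intro.text)
```

One difference to keep in mind: a real CSS class selector matches any one of several space-separated class names, while the attribute test above compares the whole attribute value.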
XPath
http://www.w3school.com.cn/xpath/index.asp
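Relationships between nodes work like file paths in a file system: the document can be treated as a directory tree, which makes it easy to traverse and look up nodes. CSS selectors cannot do this as freely: they only move forward (down the tree), while XPath can move both forward and backward (for example, up to a parent node).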
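A minimal sketch of path-style navigation, assuming Python 3.7+ and the XPath subset supported by the standard-library xml.etree.ElementTree (not a full XPath engine). The small bookstore document below is a shortened, hypothetical sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a bookstore document, for illustration only.
XML = """
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)

# Downward, like a file path: bookstore/book/title.
titles = [t.text for t in root.findall("./book/title")]

# Downward with a filter on an attribute.
web_titles = [t.text for t in root.findall("./book[@category='WEB']/title")]

# Upward: start from each <title> and step back to its parent <book> with "..".
parent_categories = [b.get("category") for b in root.findall(".//title/..")]

# Upward from one specific title (the [.='text'] predicate needs Python 3.7+).
owner = root.find(".//title[.='Learning XML']/..")

print(titles)
print(web_titles)
print(parent_categories)
print(owner.get("category"))
```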
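HTML looks a lot like XML, but HTML syntax is much less strict, and a browser will do its best to parse whatever it is given. If you feed a piece of HTML to an XML parser, it may well fail.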
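The difference can be seen with a short sketch that uses only the standard library. The sloppy HTML string is made up for the example, and TagCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up sloppy HTML: <br> and the <p> elements are never closed.
SLOPPY_HTML = "<html><body><p>one<br><p>two</body></html>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The strict XML parser rejects the markup.
try:
    minidom.parseString(SLOPPY_HTML)
    xml_ok = True
except ExpatError:
    xml_ok = False

# The HTML parser keeps going and reports the tags it found.
collector = TagCollector()
collector.feed(SLOPPY_HTML)
collector.close()

print(xml_ok)
print(collector.tags)
```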
jQuery can also parse XML. It locates elements mainly with CSS-style selectors; early versions also accepted basic XPath expressions.
#outcome:
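Before Python 3.7, dictionaries did not guarantee key order, so data exported with json could come out in an arbitrary order. From Python 3.7 on, dicts keep insertion order.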
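A small sketch of how key order shows up in json output, assuming Python 3.7 or later. The book dictionary is sample data invented for the example; sort_keys is the standard json.dumps option for a stable, alphabetical order.

```python
import json

# Sample record, invented for this illustration.
book = {"title": "Learn Python", "price": 100}

# On Python 3.7+ the keys come out in insertion order.
as_inserted = json.dumps(book)

# sort_keys=True gives the same order no matter how the dict was built.
sorted_keys = json.dumps(book, sort_keys=True)

print(as_inserted)
print(sorted_keys)
```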
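Use DOM for small documents and SAX for large ones: DOM loads the whole tree into memory, while SAX streams through the document event by event.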
https://www.tutorialspoint.com/python/python_xml_processing.htm
Here is the easiest way to quickly load an XML document and to create a minidom object using the xml.dom module. The minidom object provides a simple parser method that quickly creates a DOM tree from the XML file.
The sample phrase calls the parse( file [,parser] ) function of the minidom object to parse the XML file designated by file into a DOM tree object.
books.xml
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore shelf="new arrives">

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>
XML_Dom.py
from xml.dom import minidom

# Open the XML document with the minidom parser
# (parse, documentElement, xml.dom.minidom.Element)
doc = minidom.parse('books.xml')
root = doc.documentElement
print(type(root))
print(root.nodeName)

# Get attribute values (hasAttribute, getAttribute)
if root.hasAttribute("shelf"):
    print("Root element : %s" % root.getAttribute("shelf"))
else:
    print('getting value failure')

# Print the details of each book
# (childNodes[0].data, xml.dom.minicompat.NodeList)
books = root.getElementsByTagName('book')
book_list = []
price_list = []
for book in books:
    if book.hasAttribute("category"):
        print("**** %s ****" % book.getAttribute("category"))
    titles = book.getElementsByTagName('title')
    print(type(titles))
    book_list.append(titles[0].childNodes[0].data)
    print("Book's Name: %s" % titles[0].childNodes[0].data)
    prices = book.getElementsByTagName('price')
    price_list.append(prices[0].childNodes[0].data)
    print("prices: %s" % prices[0].childNodes[0].data)
    print("\n")
    # Equivalent access through nodeValue:
    # print(titles[0].childNodes[0].nodeValue, prices[0].childNodes[0].nodeValue)
print(book_list)
print(price_list)
#outcome
<class 'xml.dom.minidom.Element'>
bookstore
Root element : new arrives

**** COOKING ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: Everyday Italian
prices: 30.00


**** CHILDREN ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: Harry Potter
prices: 29.99


**** WEB ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: XQuery Kick Start
prices: 49.99


**** WEB ****
<class 'xml.dom.minicompat.NodeList'>
Book's Name: Learning XML
prices: 39.95


['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Learning XML']
['30.00', '29.99', '49.99', '39.95']
book.xml
<?xml version="1.0"?>
<bookstore>
  <!-- <bookstore> plays the role of <root>: it is the root node. -->
  <book>
    <title>Learn Python</title>
    <price>100</price>
  </book>
  <book>
    <title>Learn XML</title>
    <price>100</price>
  </book>
</bookstore>
XML_DOM.py
from xml.dom import minidom

# doc is the document tree (<class 'xml.dom.minidom.Document'>);
# traversal starts from its root node.
doc = minidom.parse('book.xml')
# root is the root element of doc (<class 'xml.dom.minidom.Element'>).
root = doc.documentElement
print(type(root))
# print(dir(root))  # use dir() to list all attributes and methods of the object
print(root.nodeName)
# getElementsByTagName returns every matching element as an array-like
# <class 'xml.dom.minicompat.NodeList'>.
books = root.getElementsByTagName('book')
for book in books:  # books, titles and prices are all NodeList objects
    # Under the current book node, find all elements named 'title'.
    titles = book.getElementsByTagName('title')
    prices = book.getElementsByTagName('price')
    print(titles[0].childNodes[0].nodeValue, prices[0].childNodes[0].nodeValue)
#Outcome:
<class 'xml.dom.minidom.Element'>
bookstore
Learn Python 100
Learn XML 100
XML_SAX.py
SAX-style processing is used a lot in XML databases and suits large XML documents.
from xml.parsers.expat import ParserCreate


class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        # Remember the current element name so char_data can report it.
        self.name = name
        print('element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('end element: %s' % name)

    def char_data(self, text):
        # Skip whitespace-only text between elements.
        if text.strip():
            print("%s's text is %s" % (self.name, text))


handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element   # <book>
parser.EndElementHandler = handler.end_element       # </book>
parser.CharacterDataHandler = handler.char_data      # <title>character data</title>

with open('book.xml', 'r') as f:
    # True marks this as the final (and only) chunk of input.
    parser.Parse(f.read(), True)
#Outcome:
element: book, attrs: {}
element: title, attrs: {}
title's text is Learn Python
end element: title
element: price, attrs: {}
price's text is 100
end element: price
end element: book
....
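The same event-driven idea can also be written against the standard xml.sax module instead of expat directly. The following is a sketch under assumptions, not the original script: the XML is inlined as sample data, and BookHandler is a hypothetical handler name.

```python
import xml.sax

# Inline copy of a small bookstore document (assumed sample data).
XML = b"""<?xml version="1.0"?>
<bookstore>
  <book>
    <title>Learn Python</title>
    <price>100</price>
  </book>
  <book>
    <title>Learn XML</title>
    <price>100</price>
  </book>
</bookstore>"""


class BookHandler(xml.sax.ContentHandler):
    """Collects (title, price) pairs while the parser streams events."""

    def __init__(self):
        super().__init__()
        self.buffer = []   # text chunks of the element being read
        self.book = {}     # fields of the current <book>
        self.books = []    # finished (title, price) pairs

    def startElement(self, name, attrs):
        self.buffer = []
        if name == "book":
            self.book = {}

    def characters(self, content):
        # characters() may be called several times for one element.
        self.buffer.append(content)

    def endElement(self, name):
        if name in ("title", "price"):
            self.book[name] = "".join(self.buffer).strip()
        elif name == "book":
            self.books.append((self.book["title"], self.book["price"]))
        self.buffer = []


handler = BookHandler()
xml.sax.parseString(XML, handler)
print(handler.books)
```

Because the handler only keeps the current book in memory, the same pattern carries over to files that are too large to load as a DOM tree.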
SAX Example: url = "http://www.ip138.com/post/"
Regular Expression
http://www.javashuo.com/article/p-mdtezkro-d.html
Reg_EXP02.py
import re

# Three digits, a hyphen, then 3 to 8 digits
mr = re.match(r'\d{3}-\d{3,8}', '010-23232323')
print(mr.string)

# Parentheses create groups
mr2 = re.match(r'(\d{3})-(\d{3,8})', '010-23232323')
print(mr2.groups())
print(mr2.group(0))
print(mr2.group(1))
print(mr2.group(2))

# Grouping the parts of a time string
t = '2:15:45'
tm = re.match(r'(\d{0,2}):(\d{0,2}):(\d{0,2})', t)
print(tm.groups())
print(tm.group(0))

# Split a string on runs of digits
p = re.compile(r'\d+')
print(p.split('one1two22three333'))
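A few behaviours of the calls above are easy to miss, so here is a short sketch that makes them explicit. The input strings with extra text and the stricter time pattern are illustrative assumptions, not part of the original script.

```python
import re

# match.string is the whole input; group(0) is only the matched part.
m = re.match(r'(\d{3})-(\d{3,8})', '010-23232323 ext')
whole_input = m.string
matched = m.group(0)
parts = m.groups()

# \d{0,2} also accepts zero digits, so '::' still matches.
empty_ok = re.match(r'(\d{0,2}):(\d{0,2}):(\d{0,2})', '::').groups()

# A stricter variant (an assumption about the intended format h:mm:ss).
strict = re.match(r'(\d{1,2}):(\d{2}):(\d{2})$', '2:15:45').groups()

# Splitting on digits leaves a trailing empty string when the input
# ends with a match; filter it out if it is not wanted.
pieces = re.compile(r'\d+').split('one1two22three333')
words = [w for w in pieces if w]

print(whole_input, matched, parts)
print(empty_ok, strict)
print(pieces, words)
```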
Selenium
https://selenium-python.readthedocs.io/index.html
To revisit later.