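I learned Python, so the libraries used here are all Python libraries. If you have not learned Python yet, you can pick it up on your own; it is an easy language to learn. Python 3.x is assumed throughout.
When you start collecting data, you step into the world behind the web page and see HTML, CSS, and JavaScript. This can be intimidating at first: unless you are the developer, it is rare to fully understand the code behind a page, because there is simply too much of it and it is too messy for the human eye.
To extract a web page, you first need a network connection. How does Python make one? Create a new file named scrapetest.py with the following content: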
    from urllib.request import urlopen

    html = urlopen("http://pythonscraping.com/pages/page1.html")
    print(html.read())
Running the file above with Python returns:
    [clgo@localhost ps]$ python scrapetest.py
    b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
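This is the complete HTML of the page named in the code. More precisely, the output is the source of page1.html, located in the pages folder under the web application root on the server at http://pythonscraping.com. The leading b shows that read() returns bytes rather than text.
Reference: https://docs.python.org/3/library/urllib.html
urlopen, a function in the urllib.request module, opens a remote object over the network and reads it. It is very general-purpose.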
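Since read() hands back bytes, a small sketch of turning them into text may help. The helper names decode_html and fetch_html are hypothetical (they are not part of urllib), and falling back to UTF-8 when the server declares no charset is an assumption:

```python
from urllib.request import urlopen


def decode_html(raw, charset="utf-8"):
    """Turn the bytes returned by response.read() into a str."""
    # errors="replace" keeps going when a byte is invalid in the chosen charset
    return raw.decode(charset, errors="replace")


def fetch_html(url):
    """Hypothetical helper: download a page and return its source as text."""
    # The with-block closes the connection once the body has been read
    with urlopen(url) as response:
        # Use the charset declared in the Content-Type header, if there is one
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_html(response.read(), charset)
```

Assuming the page is reachable, print(fetch_html("http://pythonscraping.com/pages/page1.html")) should show the source with real line breaks instead of \n escapes.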
Installing beautifulsoup4 on CentOS:
pip install beautifulsoup4
After installing, test it by importing the library. If no error appears, the installation succeeded.
    [clgo@localhost ps]$ python
    Python 3.5.1 (default, May 6 2016, 21:20:38)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from bs4 import BeautifulSoup
    >>>
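The same check can also be scripted rather than typed into the interpreter. This is only a sketch: is_installed is a hypothetical helper name, and it relies on the standard library's importlib.util.find_spec, which returns None when a top-level module cannot be found:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# bs4 is the import name of the beautifulsoup4 package
print("bs4 installed:", is_installed("bs4"))
```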
Running BeautifulSoup: change the file above to the following and run it:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://pythonscraping.com/pages/page1.html")
    # Name the parser explicitly; otherwise bs4 picks one itself and prints a warning
    bsObj = BeautifulSoup(html.read(), "html.parser")
    print(bsObj)
    print(bsObj.h1)

The result is:

    <html>
    <head>
    <title>A Useful Page</title>
    </head>
    <body>
    <h1>An Interesting Title</h1>
    <div>
    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    </div>
    </body>
    </html>
    <h1>An Interesting Title</h1>

The last line, <h1>An Interesting Title</h1>, comes from the call print(bsObj.h1). This is one example of using BeautifulSoup to extract a node from HTML.
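As a rough illustration of node access, the sketch below parses an inlined, shortened copy of the page so that it runs without a network connection. The SAMPLE_HTML constant and the variable name soup are assumptions made for this example:

```python
from bs4 import BeautifulSoup

# A shortened copy of page1.html, inlined so the sketch runs offline
SAMPLE_HTML = """<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet.
</div>
</body>
</html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(soup.title)            # the whole <title> tag
print(soup.h1.get_text())    # only the text inside <h1>
print(soup.html.body.h1)     # the same <h1>, reached by its full path
print(soup.find("div").get_text().strip())  # text of the first <div>
```

Attribute access such as soup.h1 returns the first matching tag anywhere in the document, which is why the short form and the full path reach the same node here.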
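3. Reliable network connections
Network connections are complicated and can fail in many ways, so exception handling has to be part of a crawler's design.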
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup


    def getTitle(url):
        # Exception handling: the server may answer with an HTTP error such as 404 or 500
        try:
            html = urlopen(url)
        except HTTPError as e:
            return None
        # If the page has no <body>, bsObj.body is None and .h1 raises AttributeError
        try:
            bsObj = BeautifulSoup(html.read(), "html.parser")
            title = bsObj.body.h1
        except AttributeError as e:
            return None
        return title


    title = getTitle("http://www.pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found")
    else:
        print(title)
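HTTPError only covers the case where the server responds with an error status. When the server cannot be reached at all, urlopen raises URLError instead. The sketch below is one possible extension, not the original author's code: extract_title and get_title are hypothetical names, and separating parsing from downloading is a design choice made here so the parsing step can be tried offline.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def extract_title(raw_html):
    """Return the first <h1> inside <body>, or None if either is missing."""
    soup = BeautifulSoup(raw_html, "html.parser")
    if soup.body is None:
        return None
    # Attribute access yields None when the body contains no <h1>
    return soup.body.h1


def get_title(url):
    """Download a page and return its <h1> tag, or None on any failure."""
    try:
        with urlopen(url) as response:
            raw_html = response.read()
    except HTTPError:
        # The server answered, but with an error status
        return None
    except URLError:
        # The server could not be reached (bad host name, no network, ...)
        return None
    return extract_title(raw_html)
```

HTTPError is a subclass of URLError, so it is listed first to keep the two cases distinguishable if they ever need different handling.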