How to Write a Web Crawler in Python to Scrape Web Data

The first problem machine learning runs into is preparing data. Data generally comes from a few sources: data accumulated inside the company, purchased data, exchanged data, data published by government agencies and enterprises, and data scraped from the web with a crawler. This article shows how to write a crawler that scrapes publicly available data from the web.

Many languages can be used to write crawlers, but they differ in difficulty. Python, an interpreted "glue" language, is easy to pick up and easy to get started with; it has a complete standard library, a wealth of open-source libraries, and plenty of syntactic sugar that boosts development productivity. In short, "Life is short, you need Python!" It is widely used in web development, scientific computing, data mining and analysis, artificial intelligence, and many other fields.

Development environment: Python 3.5.2 and Scrapy 1.2.1. Install Scrapy with pip using the command pip3 install Scrapy; on a Mac this also installs Scrapy's dependencies automatically. If you hit network timeouts during installation, retry a few times.
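If the install succeeds, a quick sanity check like the following should report the installed version (a minimal sketch; the exact output depends on your environment):

pip3 install Scrapy
scrapy version
python3 -c "import scrapy; print(scrapy.__version__)"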

Creating the Project

First create a Scrapy project named kiwi with the command scrapy startproject kiwi. This generates a set of folders and file templates.
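For reference, the skeleton generated by Scrapy 1.2 looks roughly like this (exact contents may vary slightly between versions):

kiwi/
    scrapy.cfg            # deploy configuration file
    kiwi/                 # the project's Python package
        __init__.py
        items.py          # item definitions (data structures)
        pipelines.py      # item pipelines (not used in this article)
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py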

Defining the Data Structures

settings.py holds the project settings, and items.py is where the parsed data is stored; define the data structures in that file. Example code:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AuthorInfo(scrapy.Item):
    authorName = scrapy.Field()  # author nickname
    authorUrl = scrapy.Field()   # author URL

class ReplyItem(scrapy.Item):
    content = scrapy.Field()  # reply content
    time = scrapy.Field()     # reply time
    author = scrapy.Field()   # replier (AuthorInfo)

class TopicItem(scrapy.Item):
    title = scrapy.Field()       # post title
    url = scrapy.Field()         # post page URL
    content = scrapy.Field()     # post content
    time = scrapy.Field()        # post time
    author = scrapy.Field()      # poster (AuthorInfo)
    reply = scrapy.Field()       # list of replies (ReplyItem list)
    replyCount = scrapy.Field()  # number of replies

TopicItem above nests AuthorInfo and a list of ReplyItem, but every field must still be initialized as scrapy.Field(). Note that all three classes must inherit from scrapy.Item.
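As a minimal illustration of that rule (the values below are made up, not taken from a crawl):

author = AuthorInfo(authorName='someone', authorUrl='https://www.douban.com/people/someone/')
topic = TopicItem()
topic['author'] = dict(author)  # wrap the nested item with dict() before assigning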

Creating the Spider

The file kiwi_spider.py under the project's spiders directory holds the spider code; that is where the crawling logic goes. The example crawls the posts and replies of a Douban group.

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from kiwi.items import TopicItem, AuthorInfo, ReplyItem

class KiwiSpider(CrawlSpider):
    name = "kiwi"
    allowed_domains = ["douban.com"]

    anchorTitleXPath = 'a/text()'
    anchorHrefXPath = 'a/@href'

    start_urls = [
        "https://www.douban.com/group/topic/90895393/?start=0",
    ]
    rules = (
        Rule(
            LinkExtractor(allow=(r'/group/[^/]+/discussion\?start=\d+',)),  # topic list page
            callback='parse_topic_list',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/$',)),  # topic detail page
            callback='parse_topic_content',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/\?start=\d+',)),  # paginated topic detail page
            callback='parse_topic_content',
            follow=True
        ),
    )

    # topic detail page
    def parse_topic_content(self, response):
        # XPath for the title
        titleXPath = '//html/head/title/text()'
        # XPath for the post content
        contentXPath = '//div[@class="topic-content"]/p/text()'
        # XPath for the post time
        timeXPath = '//div[@class="topic-doc"]/h3/span[@class="color-green"]/text()'
        # XPath for the poster
        authorXPath = '//div[@class="topic-doc"]/h3/span[@class="from"]'

        item = TopicItem()
        # URL of the current page
        item['url'] = response.url
        # title
        titleFragment = Selector(response).xpath(titleXPath)
        item['title'] = str(titleFragment.extract()[0]).strip()

        # post content
        contentFragment = Selector(response).xpath(contentXPath)
        strs = [line.extract().strip() for line in contentFragment]
        item['content'] = '\n'.join(strs)
        # post time
        timeFragment = Selector(response).xpath(timeXPath)
        if timeFragment:
            item['time'] = timeFragment[0].extract()

        # poster info
        authorInfo = AuthorInfo()
        authorFragment = Selector(response).xpath(authorXPath)
        if authorFragment:
            authorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
            authorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

        item['author'] = dict(authorInfo)

        # XPath for the reply list
        replyRootXPath = r'//div[@class="reply-doc content"]'
        # XPath for the reply time
        replyTimeXPath = r'div[@class="bg-img-green"]/h4/span[@class="pubtime"]/text()'
        # XPath for the replier
        replyAuthorXPath = r'div[@class="bg-img-green"]/h4'

        replies = []
        itemsFragment = Selector(response).xpath(replyRootXPath)
        for replyItemXPath in itemsFragment:
            replyItem = ReplyItem()
            # reply content
            contents = replyItemXPath.xpath('p/text()')
            strs = [line.extract().strip() for line in contents]
            replyItem['content'] = '\n'.join(strs)
            # reply time
            timeFragment = replyItemXPath.xpath(replyTimeXPath)
            if timeFragment:
                replyItem['time'] = timeFragment[0].extract()
            # replier
            replyAuthorInfo = AuthorInfo()
            authorFragment = replyItemXPath.xpath(replyAuthorXPath)
            if authorFragment:
                replyAuthorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
                replyAuthorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

            replyItem['author'] = dict(replyAuthorInfo)
            # append to the reply list
            replies.append(dict(replyItem))

        item['reply'] = replies
        yield item

    # topic list page
    def parse_topic_list(self, response):
        # XPath for the topic rows (skip the header row)
        topicRootXPath = r'//table[@class="olt"]/tr[position()>1]'
        # XPath for a single topic entry
        titleXPath = r'td[@class="title"]'
        # XPath for the poster
        authorXPath = r'td[2]'
        # XPath for the reply count
        replyCountXPath = r'td[3]/text()'
        # XPath for the post time
        timeXPath = r'td[@class="time"]/text()'

        topicsPath = Selector(response).xpath(topicRootXPath)
        for topicItemPath in topicsPath:
            item = TopicItem()
            titlePath = topicItemPath.xpath(titleXPath)
            item['title'] = titlePath.xpath(self.anchorTitleXPath).extract()[0]
            item['url'] = titlePath.xpath(self.anchorHrefXPath).extract()[0]
            # post time
            timePath = topicItemPath.xpath(timeXPath)
            if timePath:
                item['time'] = timePath[0].extract()
                # poster
                authorPath = topicItemPath.xpath(authorXPath)
                authInfo = AuthorInfo()
                authInfo['authorName'] = authorPath[0].xpath(self.anchorTitleXPath).extract()[0]
                authInfo['authorUrl'] = authorPath[0].xpath(self.anchorHrefXPath).extract()[0]
                item['author'] = dict(authInfo)
                # reply count
                replyCountPath = topicItemPath.xpath(replyCountXPath)
                item['replyCount'] = replyCountPath[0].extract()

            item['content'] = ''
            yield item

    parse_start_url = parse_topic_content

 

Important Notes

1. KiwiSpider must be changed to inherit from CrawlSpider. The template-generated code inherits from Spider, which would not follow the pages defined in rules.

2. parse_start_url = parse_topic_content defines the entry callback (the code above points it at parse_topic_content because the start URL is a topic detail page). Looking at the CrawlSpider source, its parse method calls back into parse_start_url; a subclass can either override that method or, as in the code above, assign a different function to it.

3. start_urls holds the entry URLs; more than one can be listed.

4. rules defines which URLs found on crawled pages should be followed, as regular-expression rules paired with callbacks. In the example code the rules continue crawling the topic detail pages and their paginated pages.

5. Note the parts of the code wrapped with dict(). In items.py the author field conceptually holds an AuthorInfo, but the value must be wrapped with dict() when it is assigned; assigning item['author'] = authInfo directly will raise an error.

6. Use XPath to pull out the content you need. For background on XPath, see the XPath tutorial at http://www.w3school.com.cn/xpath/. During development, browser tools help you work out XPath expressions, for example the FireBug and FirePath add-ons for Firefox. On the page https://www.douban.com/group/python/discussion?start=0, the XPath rule '//td[@class="title"]' selects the list of post titles.

Such tools let you type in an XPath rule and immediately check whether it matches what you expect. Newer versions of Firefox can use the Try XPath add-on to inspect XPath, and Chrome users can install the XPath Helper extension. The scrapy shell sketch below is another quick way to test a rule.
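Scrapy's interactive shell lets you try out XPath rules against a live page before putting them into the spider. A rough sketch (the actual output depends on the page at the time you run it):

scrapy shell "https://www.douban.com/group/python/discussion?start=0"
>>> response.xpath('//td[@class="title"]/a/text()').extract()[:3]   # first few post titles
>>> response.xpath('//td[@class="title"]/a/@href').extract()[:3]    # and their URLs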

Using a Random User-Agent

To make the requests look more like ordinary browser traffic, you can write a middleware that supplies a random User-Agent. Add a file useragentmiddleware.py under the project's kiwi package (matching the module path registered in settings below). Example code:

# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a random User-Agent for each request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

 

Modify settings.py and add the following setting:

DOWNLOADER_MIDDLEWARES = {
   'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
}

Also disable cookies there: COOKIES_ENABLED = False.

Running the Crawler

Switch to the project root directory and run scrapy crawl kiwi; the scraped data is printed in the console. Alternatively, run scrapy crawl kiwi -o result.json -t json to save the results to a file.

 

How to Scrape Data That a Page Renders with JavaScript

The example above does not work for pages whose data is produced by executing JavaScript. Fortunately Python has plenty of tooling for this: install PhantomJS (download it from the official site and unpack it). The example below scrapes the fund net-value data from http://www.kjj.com/index_kfjj.html, a page whose data is emitted by JavaScript; the fund list only appears after the scripts have run. fund_spider.py code:

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from datetime import datetime
from selenium import webdriver
from fundequity import FundEquity

class PageSpider(object):
    def __init__(self):
        # path to the PhantomJS executable
        phantomjsPath = "/Library/Frameworks/Python.framework/Versions/3.5/phantomjs/bin/phantomjs"
        cap = webdriver.DesiredCapabilities.PHANTOMJS
        cap["phantomjs.page.settings.resourceTimeout"] = 1000
        cap["phantomjs.page.settings.loadImages"] = False
        cap["phantomjs.page.settings.disk-cache"] = False
        self.driver = webdriver.PhantomJS(executable_path=phantomjsPath, desired_capabilities=cap)

    def fetchPage(self, url):
        # load the page in PhantomJS and return the rendered HTML
        self.driver.get(url)
        html = self.driver.page_source
        return html

    def parse(self, html):
        # fund rows (skip the header row)
        fundListXPath = r'//div[@id="maininfo_all"]/table[@id="ilist"]/tbody/tr[position()>1]'
        itemsFragment = Selector(text=html).xpath(fundListXPath)
        for itemXPath in itemsFragment:
            # column 1: serial number
            attrXPath = itemXPath.xpath(r'td[1]/text()')
            text = attrXPath[0].extract().strip()
            if text != "-":
                fe = FundEquity()
                fe.serial = text

                # column 2: date
                attrXPath = itemXPath.xpath(r'td[2]/text()')
                text = attrXPath[0].extract().strip()
                fe.date = datetime.strptime(text, "%Y-%m-%d")

                # column 3: fund code
                attrXPath = itemXPath.xpath(r'td[3]/text()')
                text = attrXPath[0].extract().strip()
                fe.code = text

                # column 4: fund name
                attrXPath = itemXPath.xpath(r'td[4]/a/text()')
                text = attrXPath[0].extract().strip()
                fe.name = text

                # column 5: unit net value
                attrXPath = itemXPath.xpath(r'td[5]/text()')
                text = attrXPath[0].extract().strip()
                fe.equity = text

                # column 6: accumulated net value
                attrXPath = itemXPath.xpath(r'td[6]/text()')
                text = attrXPath[0].extract().strip()
                fe.accumulationEquity = text

                # column 7: growth amount
                attrXPath = itemXPath.xpath(r'td[7]/font/text()')
                text = attrXPath[0].extract().strip()
                fe.increment = text

                # column 8: growth rate (strip the trailing %)
                attrXPath = itemXPath.xpath(r'td[8]/font/text()')
                text = attrXPath[0].extract().strip().strip('%')
                fe.growthRate = text

                # column 9: whether the fund can be bought ("購買" means "buy")
                attrXPath = itemXPath.xpath(r'td[9]/a/text()')
                if len(attrXPath) > 0:
                    text = attrXPath[0].extract().strip()
                    if text == "購買":
                        fe.canBuy = True
                    else:
                        fe.canBuy = False

                # column 10: whether the fund can be redeemed ("贖回" means "redeem")
                attrXPath = itemXPath.xpath(r'td[10]/font/text()')
                if len(attrXPath) > 0:
                    text = attrXPath[0].extract().strip()
                    if text == "贖回":
                        fe.canRedeem = True
                    else:
                        fe.canRedeem = False

                yield fe

    def __del__(self):
        self.driver.quit()

def test():
    spider = PageSpider()
    html = spider.fetchPage("http://www.kjj.com/index_kfjj.html")
    for item in spider.parse(html):
        print(item)
    del spider

if __name__ == "__main__":
    test()

The FundEquity class is defined in fundequity.py:

# -*- coding: utf-8 -*-
from datetime import date

# fund net asset value record
class FundEquity(object):
    def __init__(self):
        # instance attributes
        self.__serial = 0                # serial number
        self.__date = None               # date
        self.__code = ""                 # fund code
        self.__name = ""                 # fund name
        self.__equity = 0.0              # unit net value
        self.__accumulationEquity = 0.0  # accumulated net value
        self.__increment = 0.0           # growth amount
        self.__growthRate = 0.0          # growth rate
        self.__canBuy = False            # whether the fund can be bought
        self.__canRedeem = True          # whether the fund can be redeemed

    @property
    def serial(self):
        return self.__serial

    @serial.setter
    def serial(self, value):
        self.__serial = value

    @property
    def date(self):
        return self.__date

    @date.setter
    def date(self, value):
        # validate the value
        if not isinstance(value, date):
            raise ValueError('date must be date type!')
        self.__date = value

    @property
    def code(self):
        return self.__code

    @code.setter
    def code(self, value):
        self.__code = value

    @property
    def name(self):
        return self.__name

    @name.setter
    def name(self, value):
        self.__name = value

    @property
    def equity(self):
        return self.__equity

    @equity.setter
    def equity(self, value):
        self.__equity = value

    @property
    def accumulationEquity(self):
        return self.__accumulationEquity

    @accumulationEquity.setter
    def accumulationEquity(self, value):
        self.__accumulationEquity = value

    @property
    def increment(self):
        return self.__increment

    @increment.setter
    def increment(self, value):
        self.__increment = value

    @property
    def growthRate(self):
        return self.__growthRate

    @growthRate.setter
    def growthRate(self, value):
        self.__growthRate = value

    @property
    def canBuy(self):
        return self.__canBuy

    @canBuy.setter
    def canBuy(self, value):
        self.__canBuy = value

    @property
    def canRedeem(self):
        return self.__canRedeem

    @canRedeem.setter
    def canRedeem(self, value):
        self.__canRedeem = value

    # similar to toString() in other languages
    def __str__(self):
        return '[serial:%s,date:%s,code:%s,name:%s,equity:%.4f,' \
               'accumulationEquity:%.4f,increment:%.4f,growthRate:%.4f%%,canBuy:%s,canRedeem:%s]' \
               % (self.serial, self.date.strftime("%Y-%m-%d"), self.code, self.name, float(self.equity),
                  float(self.accumulationEquity), float(self.increment),
                  float(self.growthRate), self.canBuy, self.canRedeem)

 

In the code above, the FundEquity attributes are defined through getter/setter functions (Python properties), which makes it possible to validate values as they are assigned. The __str__(self) method plays the role of toString() in other languages.
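A quick way to see the date check in action (hypothetical values, run in an interactive session):

from datetime import datetime
from fundequity import FundEquity

fe = FundEquity()
fe.date = datetime.strptime("2016-11-21", "%Y-%m-%d")  # accepted: datetime is a subclass of date
fe.date = "2016-11-21"  # raises ValueError: date must be date type!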

Run fund_spider.py from the command line and the net-value data is printed in the console.

 

Summary

As the examples show, a small amount of code is enough to scrape the posts and replies of a Douban group, parse the content, and store it, which speaks to how concise and efficient Python is.

The example code is fairly simple; the only time-consuming part is tuning the XPath rules, and browser add-ons help a lot with that.

例子中沒有說起Pipeline(管道)、Middleware(中間件) 這些複雜東西。沒有考慮爬蟲請求太頻繁致使站方封禁IP(能夠經過不斷更換HTTP Proxy 方式破解),沒有考慮須要登陸才能抓取數據的狀況(代碼模擬用戶登陸破解)。

In a real project, the parts that change often, such as the XPath rules and regular expressions used for extraction, should not be hard-coded, and page fetching, content parsing, and storage of the parsed results should run as independent components in a distributed architecture. In short, a crawler system running in production has many more concerns; there are also open-source crawler systems on GitHub worth referring to.
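As one example of keeping extraction rules out of the code, the XPath expressions could live in a small JSON file that the spider loads at start-up (a sketch; the file name and keys are made up):

import json

# xpath_rules.json might contain, for example:
# {"topic_title": "//html/head/title/text()",
#  "topic_content": "//div[@class=\"topic-content\"]/p/text()"}
with open('xpath_rules.json', encoding='utf-8') as f:
    XPATH_RULES = json.load(f)

titleXPath = XPATH_RULES['topic_title']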
