One of the first problems in machine learning is preparing data. Data generally comes from a few sources: data your company has accumulated, purchases, exchanges, data published by government agencies and enterprises, and data scraped from the web with a crawler. This article shows how to write a crawler that scrapes public data from the web.
Many languages can be used to write crawlers, but they differ in difficulty. Python, an interpreted glue language, is easy to pick up, ships a complete standard library plus a wealth of open-source libraries, and offers plenty of syntactic sugar that boosts productivity. In short, "Life is short, you need Python!" It is widely used in web development, scientific computing, data mining/analysis, artificial intelligence, and many other fields.
Development environment: Python 3.5.2 and Scrapy 1.2.1. Install Scrapy with pip: pip3 install Scrapy. On Mac this command also installs Scrapy's dependencies; if the network times out during installation, just retry a few times.
Creating the Project
First create a Scrapy project named kiwi with the command scrapy startproject kiwi. This generates a set of folders and file templates.
Defining the Data Structures
settings.py holds the project settings; items.py holds the data structures for the parsed results. Define the data structures in that file, for example:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AuthorInfo(scrapy.Item):
    authorName = scrapy.Field()  # author nickname
    authorUrl = scrapy.Field()   # author URL


class ReplyItem(scrapy.Item):
    content = scrapy.Field()  # reply content
    time = scrapy.Field()     # publish time
    author = scrapy.Field()   # replier (AuthorInfo)


class TopicItem(scrapy.Item):
    title = scrapy.Field()       # post title
    url = scrapy.Field()         # post page URL
    content = scrapy.Field()     # post content
    time = scrapy.Field()        # publish time
    author = scrapy.Field()      # poster (AuthorInfo)
    reply = scrapy.Field()       # reply list (list of ReplyItem)
    replyCount = scrapy.Field()  # number of replies
```
TopicItem above nests AuthorInfo and a list of ReplyItem, but every field must still be initialized as scrapy.Field(). Note that all three classes must inherit from scrapy.Item.
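To see why nested structures end up as plain dicts inside the final item, here is a minimal illustration. AuthorInfo below is a plain stand-in class (not a scrapy.Item, purely for demonstration): custom objects are not JSON-serializable, while dicts are, so converting nested objects to dicts keeps feed export (e.g. -o result.json) working.

```python
import json

# Stand-in for the AuthorInfo item; illustrative only.
class AuthorInfo:
    def __init__(self, name, url):
        self.authorName = name
        self.authorUrl = url

author = AuthorInfo("alice", "https://example.com/alice")
# json.dumps(author) would raise TypeError: AuthorInfo is not JSON serializable.
# Wrapping the object's fields in a plain dict serializes fine:
record = {"title": "hello", "author": vars(author)}
print(json.dumps(record, sort_keys=True))
```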
Creating the Spider
The file kiwi_spider.py under the project's spiders directory holds the spider code; write the crawler there. The example below crawls the posts and replies of a douban group.
```python
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from kiwi.items import TopicItem, AuthorInfo, ReplyItem


class KiwiSpider(CrawlSpider):
    name = "kiwi"
    allowed_domains = ["douban.com"]

    anchorTitleXPath = 'a/text()'
    anchorHrefXPath = 'a/@href'

    start_urls = [
        "https://www.douban.com/group/topic/90895393/?start=0",
    ]
    rules = (
        Rule(
            LinkExtractor(allow=(r'/group/[^/]+/discussion\?start=\d+',)),
            callback='parse_topic_list',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/$',)),  # topic detail page
            callback='parse_topic_content',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/\?start=\d+',)),  # topic detail page (paginated)
            callback='parse_topic_content',
            follow=True
        ),
    )

    # topic detail page
    def parse_topic_content(self, response):
        # XPath for the title
        titleXPath = '//html/head/title/text()'
        # XPath for the post content
        contentXPath = '//div[@class="topic-content"]/p/text()'
        # XPath for the publish time
        timeXPath = '//div[@class="topic-doc"]/h3/span[@class="color-green"]/text()'
        # XPath for the poster
        authorXPath = '//div[@class="topic-doc"]/h3/span[@class="from"]'

        item = TopicItem()
        # URL of the current page
        item['url'] = response.url
        # title
        titleFragment = Selector(response).xpath(titleXPath)
        item['title'] = str(titleFragment.extract()[0]).strip()

        # post content
        contentFragment = Selector(response).xpath(contentXPath)
        strs = [line.extract().strip() for line in contentFragment]
        item['content'] = '\n'.join(strs)
        # publish time
        timeFragment = Selector(response).xpath(timeXPath)
        if timeFragment:
            item['time'] = timeFragment[0].extract()

        # poster info
        authorInfo = AuthorInfo()
        authorFragment = Selector(response).xpath(authorXPath)
        if authorFragment:
            authorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
            authorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

        item['author'] = dict(authorInfo)

        # XPath for the reply list
        replyRootXPath = r'//div[@class="reply-doc content"]'
        # XPath for the reply time
        replyTimeXPath = r'div[@class="bg-img-green"]/h4/span[@class="pubtime"]/text()'
        # XPath for the replier
        replyAuthorXPath = r'div[@class="bg-img-green"]/h4'

        replies = []
        itemsFragment = Selector(response).xpath(replyRootXPath)
        for replyItemXPath in itemsFragment:
            replyItem = ReplyItem()
            # reply content
            contents = replyItemXPath.xpath('p/text()')
            strs = [line.extract().strip() for line in contents]
            replyItem['content'] = '\n'.join(strs)
            # reply time
            timeFragment = replyItemXPath.xpath(replyTimeXPath)
            if timeFragment:
                replyItem['time'] = timeFragment[0].extract()
            # replier
            replyAuthorInfo = AuthorInfo()
            authorFragment = replyItemXPath.xpath(replyAuthorXPath)
            if authorFragment:
                replyAuthorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
                replyAuthorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

            replyItem['author'] = dict(replyAuthorInfo)
            # append to the reply list
            replies.append(dict(replyItem))

        item['reply'] = replies
        yield item

    # topic list page
    def parse_topic_list(self, response):
        # XPath for the topic list (skip the header row)
        topicRootXPath = r'//table[@class="olt"]/tr[position()>1]'
        # XPath for a single topic entry
        titleXPath = r'td[@class="title"]'
        # XPath for the poster
        authorXPath = r'td[2]'
        # XPath for the reply count
        replyCountXPath = r'td[3]/text()'
        # XPath for the post time
        timeXPath = r'td[@class="time"]/text()'

        topicsPath = Selector(response).xpath(topicRootXPath)
        for topicItemPath in topicsPath:
            item = TopicItem()
            titlePath = topicItemPath.xpath(titleXPath)
            item['title'] = titlePath.xpath(self.anchorTitleXPath).extract()[0]
            item['url'] = titlePath.xpath(self.anchorHrefXPath).extract()[0]
            # post time
            timePath = topicItemPath.xpath(timeXPath)
            if timePath:
                item['time'] = timePath[0].extract()
            # poster
            authorPath = topicItemPath.xpath(authorXPath)
            authInfo = AuthorInfo()
            authInfo['authorName'] = authorPath[0].xpath(self.anchorTitleXPath).extract()[0]
            authInfo['authorUrl'] = authorPath[0].xpath(self.anchorHrefXPath).extract()[0]
            item['author'] = dict(authInfo)
            # reply count
            replyCountPath = topicItemPath.xpath(replyCountXPath)
            item['replyCount'] = replyCountPath[0].extract()

            item['content'] = ''
            yield item

    parse_start_url = parse_topic_content
```
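As a quick sanity check, the three allow patterns in rules can be tested against typical douban paths with the standard re module (the sample URLs below are illustrative, not taken from a live crawl):

```python
import re

# The three URL patterns from the rules above, paired with sample paths.
patterns = [
    r'/group/[^/]+/discussion\?start=\d+',  # topic list pages
    r'/group/topic/\d+/$',                  # topic detail, first page
    r'/group/topic/\d+/\?start=\d+',        # topic detail, paginated
]
samples = [
    "https://www.douban.com/group/python/discussion?start=25",
    "https://www.douban.com/group/topic/90895393/",
    "https://www.douban.com/group/topic/90895393/?start=100",
]
for pat, url in zip(patterns, samples):
    print(bool(re.search(pat, url)))  # → True for each pair
```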
Important Notes
1. KiwiSpider must be changed to inherit from CrawlSpider. The template-generated code inherits from Spider, which would not crawl the pages matched by rules.
2. parse_start_url = parse_topic_content defines the entry function. In the CrawlSpider source you can see that parse calls back into parse_start_url; a subclass can override that function, or, as in the code above, simply assign another function to it.
3. start_urls lists the entry URLs; you can add more than one.
4. rules defines which URLs found on crawled pages should be followed further, giving a pattern (written as a regular expression) and a callback for each. The example code above keeps crawling the topic detail page and its pagination.
5. Note the parts wrapped with dict(). Although the author field defined in items.py is meant to hold AuthorInfo data, the value must be wrapped in a dict on assignment; item['author'] = authInfo raises an error.
6. Use XPath to pull out the content you need. For XPath references, see the XPath tutorial at http://www.w3school.com.cn/xpath/. During development you can inspect XPath rules with browser tooling, such as the FireBug and FirePath extensions for Firefox. For the page https://www.douban.com/group/python/discussion?start=0, the XPath rule //td[@class="title"] selects the list of post titles, for example:
You can enter an XPath rule in the red-boxed field in the screenshot above to conveniently test whether the rule matches as expected. Newer versions of Firefox can use the Try XPath extension to inspect XPath; Chrome has the XPath Helper extension.
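If no browser extension is handy, an XPath rule can also be tried out in plain Python. The snippet below uses the standard library's xml.etree.ElementTree (which supports this simple predicate syntax) on a trimmed-down stand-in for douban's topic-list table; the HTML here is a simplified assumption about the page structure:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the topic-list table (structure assumed).
html = """
<table class="olt">
  <tr><th>title</th></tr>
  <tr><td class="title"><a href="/group/topic/1/">First post</a></td></tr>
  <tr><td class="title"><a href="/group/topic/2/">Second post</a></td></tr>
</table>
"""
root = ET.fromstring(html)
# Same predicate as the //td[@class="title"] rule, relative to the root.
titles = [td.find('a').text for td in root.findall(".//td[@class='title']")]
print(titles)  # → ['First post', 'Second post']
```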
Using a Random User-Agent
To make the requests look more like normal browser traffic, you can write a middleware that supplies a random User-Agent. Add a file useragentmiddleware.py to the project root, for example:
```python
# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
```

(The original listing was missing a comma after the first entry, which would have silently concatenated the first two strings; it is added here.)
Modify settings.py and add the following setting:
```python
DOWNLOADER_MIDDLEWARES = {
    'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
}
```
Also disable cookies: COOKIES_ENABLED = False.
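One refinement worth considering (an assumption about how you want the middlewares stacked, not something the setup above requires): Scrapy ships a built-in UserAgentMiddleware that also sets the User-Agent header, so you can disable it explicitly to be sure only the rotating version runs. A sketch of the combined settings.py fragment:

```python
# settings.py (sketch): register the custom middleware and disable the
# built-in UserAgentMiddleware so the two do not both set the header.
# The priority value 1 is arbitrary here.
DOWNLOADER_MIDDLEWARES = {
    'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
COOKIES_ENABLED = False
```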
Running the Spider
Switch to the project root and run scrapy crawl kiwi; the scraped data is printed in the console window. Alternatively, run "scrapy crawl kiwi -o result.json -t json" to save the results to a file.
Scraping Pages Rendered Dynamically by JavaScript
The example above does not work for pages whose data is produced by executing JS code. Fortunately Python's tooling is rich: install PhantomJS (download it from the official site and unpack it). The following example scrapes the fund net-value data from http://www.kjj.com/index_kfjj.html, a page whose data is output dynamically by JS; the net-value list appears only after the JS code runs. fund_spider.py:
```python
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from datetime import datetime
from selenium import webdriver
from fundequity import FundEquity


class PageSpider(object):
    def __init__(self):
        phantomjsPath = "/Library/Frameworks/Python.framework/Versions/3.5/phantomjs/bin/phantomjs"
        cap = webdriver.DesiredCapabilities.PHANTOMJS
        cap["phantomjs.page.settings.resourceTimeout"] = 1000
        cap["phantomjs.page.settings.loadImages"] = False
        cap["phantomjs.page.settings.disk-cache"] = False
        self.driver = webdriver.PhantomJS(executable_path=phantomjsPath, desired_capabilities=cap)

    def fetchPage(self, url):
        self.driver.get(url)
        html = self.driver.page_source
        return html

    def parse(self, html):
        fundListXPath = r'//div[@id="maininfo_all"]/table[@id="ilist"]/tbody/tr[position()>1]'
        itemsFragment = Selector(text=html).xpath(fundListXPath)
        for itemXPath in itemsFragment:
            attrXPath = itemXPath.xpath(r'td[1]/text()')
            text = attrXPath[0].extract().strip()
            if text != "-":
                fe = FundEquity()
                fe.serial = text

                attrXPath = itemXPath.xpath(r'td[2]/text()')
                text = attrXPath[0].extract().strip()
                fe.date = datetime.strptime(text, "%Y-%m-%d")

                attrXPath = itemXPath.xpath(r'td[3]/text()')
                text = attrXPath[0].extract().strip()
                fe.code = text

                attrXPath = itemXPath.xpath(r'td[4]/a/text()')
                text = attrXPath[0].extract().strip()
                fe.name = text

                attrXPath = itemXPath.xpath(r'td[5]/text()')
                text = attrXPath[0].extract().strip()
                fe.equity = text

                attrXPath = itemXPath.xpath(r'td[6]/text()')
                text = attrXPath[0].extract().strip()
                fe.accumulationEquity = text

                attrXPath = itemXPath.xpath(r'td[7]/font/text()')
                text = attrXPath[0].extract().strip()
                fe.increment = text

                attrXPath = itemXPath.xpath(r'td[8]/font/text()')
                text = attrXPath[0].extract().strip().strip('%')
                fe.growthRate = text

                attrXPath = itemXPath.xpath(r'td[9]/a/text()')
                if len(attrXPath) > 0:
                    text = attrXPath[0].extract().strip()
                    if text == "購買":
                        fe.canBuy = True
                    else:
                        fe.canBuy = False

                attrXPath = itemXPath.xpath(r'td[10]/font/text()')
                if len(attrXPath) > 0:
                    text = attrXPath[0].extract().strip()
                    if text == "贖回":
                        fe.canRedeem = True
                    else:
                        fe.canRedeem = False

                yield fe

    def __del__(self):
        self.driver.quit()


def test():
    spider = PageSpider()
    html = spider.fetchPage("http://www.kjj.com/index_kfjj.html")
    for item in spider.parse(html):
        print(item)
    del spider


if __name__ == "__main__":
    test()
```
fundequity.py:

```python
# -*- coding: utf-8 -*-
from datetime import date


# fund net-value record
class FundEquity(object):
    def __init__(self):
        # instance attributes
        self.__serial = 0                # serial number
        self.__date = None               # date
        self.__code = ""                 # fund code
        self.__name = ""                 # fund name
        self.__equity = 0.0              # unit net value
        self.__accumulationEquity = 0.0  # accumulated net value
        self.__increment = 0.0           # increment
        self.__growthRate = 0.0          # growth rate
        self.__canBuy = False            # whether it can be bought
        self.__canRedeem = True          # whether it can be redeemed

    @property
    def serial(self):
        return self.__serial

    @serial.setter
    def serial(self, value):
        self.__serial = value

    @property
    def date(self):
        return self.__date

    @date.setter
    def date(self, value):
        # validate the value
        if not isinstance(value, date):
            raise ValueError('date must be date type!')
        self.__date = value

    @property
    def code(self):
        return self.__code

    @code.setter
    def code(self, value):
        self.__code = value

    @property
    def name(self):
        return self.__name

    @name.setter
    def name(self, value):
        self.__name = value

    @property
    def equity(self):
        return self.__equity

    @equity.setter
    def equity(self, value):
        self.__equity = value

    @property
    def accumulationEquity(self):
        return self.__accumulationEquity

    @accumulationEquity.setter
    def accumulationEquity(self, value):
        self.__accumulationEquity = value

    @property
    def increment(self):
        return self.__increment

    @increment.setter
    def increment(self, value):
        self.__increment = value

    @property
    def growthRate(self):
        return self.__growthRate

    @growthRate.setter
    def growthRate(self, value):
        self.__growthRate = value

    @property
    def canBuy(self):
        return self.__canBuy

    @canBuy.setter
    def canBuy(self, value):
        self.__canBuy = value

    @property
    def canRedeem(self):
        return self.__canRedeem

    @canRedeem.setter
    def canRedeem(self, value):
        self.__canRedeem = value

    # like toString() in other languages
    def __str__(self):
        return '[serial:%s,date:%s,code:%s,name:%s,equity:%.4f,' \
               'accumulationEquity:%.4f,increment:%.4f,growthRate:%.4f%%,canBuy:%s,canRedeem:%s]' \
               % (self.serial, self.date.strftime("%Y-%m-%d"), self.code, self.name, float(self.equity),
                  float(self.accumulationEquity), float(self.increment),
                  float(self.growthRate), self.canBuy, self.canRedeem)
```
The FundEquity class above defines its attributes with getter/setter functions (properties), which allows the values to be checked. __str__(self) plays the role of toString() in other languages.
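A minimal sketch of the same pattern, using a hypothetical Holding class: the setter validates the value before storing it, so bad data fails at assignment time rather than surfacing later during formatting or export.

```python
from datetime import date

class Holding:
    def __init__(self):
        self._date = None

    @property
    def date(self):
        return self._date

    @date.setter
    def date(self, value):
        # reject anything that is not a date (datetime is a date subclass)
        if not isinstance(value, date):
            raise ValueError('date must be date type!')
        self._date = value

h = Holding()
h.date = date(2016, 11, 18)
print(h.date.isoformat())  # → 2016-11-18
try:
    h.date = "2016-11-18"  # a string is rejected by the setter
except ValueError as e:
    print("rejected:", e)
```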
Run fund_spider.py from the command line and the net-value data is printed in the console window.
Summary
As the examples show, a small amount of code is enough to scrape, parse, and store the posts and replies of a douban group, which speaks to Python's conciseness and efficiency.
The example code is fairly simple; the only time-consuming part is tuning the XPath rules, and browser helper extensions speed that up considerably.
The examples leave out more involved machinery such as Pipelines and Middlewares. They also do not handle sites that ban your IP when requests come too frequently (which can be worked around by continually rotating HTTP proxies), nor data that requires login to access (which can be handled by simulating the user login in code).
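A rotating-proxy workaround can be sketched along the same lines as the User-Agent middleware above. The class name and proxy addresses below are placeholders (TEST-NET addresses, not working proxies); Scrapy's built-in HttpProxyMiddleware honors the request.meta['proxy'] key that this sketch sets.

```python
import random

class RotateProxyMiddleware(object):
    # placeholder proxy addresses; substitute a real, maintained pool
    proxies = [
        'http://203.0.113.1:8080',
        'http://203.0.113.2:8080',
    ]

    def process_request(self, request, spider):
        # pick a proxy at random for each outgoing request
        request.meta['proxy'] = random.choice(self.proxies)
```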
In a real project, the parts that change frequently, such as XPath rules and regular expressions for content extraction, should not be hard-coded; and page fetching, content parsing, and result storage should run as independent components in a distributed architecture. In short, a crawler system running in a production environment has many more concerns; there are also open-source crawler systems on github worth consulting.