I used to write my crawlers in PHP, using Snoopy together with simple_html_dom. That combination worked well enough; at the very least it got the job done.
PHP has never had a decent multi-threading mechanism. You can fake concurrency with various tricks (leaning on an Apache or nginx server, forking a child process, or dynamically generating several PHP scripts and running them as separate processes), but none of these feel natural, either in code structure or in ease of use. I have also heard of pthreads, a PHP extension that provides real multi-threading; its GitHub page describes it like this: "Absolutely, this is not a hack, we don't use forking or any other such nonsense, what you create are honest to goodness posix threads that are completely compatible with PHP and safe ... this is true multi-threading :)"
But I digress; PHP is not the topic of this article. Since I had decided to try collecting data with Python, it was also the right time to learn about Python's threading. I had long heard various experts praise how pleasant Python is to work with, but without actually using it once I could never tell where its real strengths lie or which problems it is best suited for.
Enough preamble; on to the real content.
search_config.py
#!/usr/bin/env python
# coding=utf-8

class config:
    keyword = '青島'
    search_type = 'shop'
    url = 'http://s.taobao.com/search?q=' + keyword + '&commend=all&search_type=' + search_type + '&sourceId=tb.index&initiative_id=tbindexz_20131207&app=shopsearch&s='
single_scrapy.py
#!/usr/bin/env python
# coding=utf-8
import requests
from search_config import config

class Scrapy():
    def __init__(self, threadname, start_num):
        self.threadname = threadname
        self.start_num = start_num
        print threadname + 'start.....'

    def run(self):
        url = config.url + self.start_num
        response = requests.get(url)
        print self.threadname + 'end......'

def main():
    for i in range(0, 13, 6):
        scrapy = Scrapy('scrapy', str(i))
        scrapy.run()

if __name__ == '__main__':
    main()
This is the simplest and most conventional way to collect data: open the network connections one after another in a loop and fetch each page. As the screenshot shows, this approach is actually very inefficient; while one connection is doing network I/O, the others have to wait for it to finish before they can start. In other words, the earlier connections block the later ones.
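To put a number on that without the screenshot, a simple wrapper around main() can measure the elapsed wall time. This is only my own measurement sketch, assuming the listing above is saved as single_scrapy.py; it is not part of the original script.

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch: measure the wall-clock time of the sequential crawler above.
# Assumes the previous listing is saved as single_scrapy.py next to this file.
import time
import single_scrapy

start = time.time()
single_scrapy.main()
print 'sequential run took %.2f seconds' % (time.time() - start)

The multi-threaded version of the crawler follows.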
#!/usr/bin/env python
# coding=utf-8
import requests
from search_config import config
import threading

class Scrapy(threading.Thread):
    def __init__(self, threadname, start_num):
        threading.Thread.__init__(self, name = threadname)
        self.threadname = threadname
        self.start_num = start_num
        print threadname + 'start.....'

    # override the run method of the Thread class
    def run(self):
        url = config.url + self.start_num
        response = requests.get(url)
        print self.threadname + 'end......'

def main():
    for i in range(0, 13, 6):
        scrapy = Scrapy('scrapy', str(i))
        scrapy.start()

if __name__ == '__main__':
    main()
The screenshot shows that, for the same number of pages, the elapsed time drops considerably once multiple threads are used, at the cost of higher CPU utilization.
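One subtlety: the threaded main() above only starts the threads and returns immediately, so if you want to time the whole crawl you need to keep references to the threads and join() them. The following is a minimal sketch of that measurement, not part of the original code, and the Scrapy class is simplified here.

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch: time the threaded crawl by join()ing every worker thread.
# Reuses search_config.config from above; printing is omitted for brevity.
import time
import threading
import requests
from search_config import config

class Scrapy(threading.Thread):
    def __init__(self, threadname, start_num):
        threading.Thread.__init__(self, name=threadname)
        self.start_num = start_num

    def run(self):
        requests.get(config.url + self.start_num)

start = time.time()
threads = [Scrapy('scrapy', str(i)) for i in range(0, 13, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every download has finished
print 'threaded run took %.2f seconds' % (time.time() - start)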
Once we have the HTML of a page, we need to parse it and pull out the pieces we care about: for each shop, its title, its rank, and its link.
With the BeautifulSoup library we can select elements directly by HTML attributes such as class or id, which is far more readable and far less work than writing regular expressions by hand; the trade-off, of course, is that execution speed takes a big hit.
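For example, picking a shop link out of the markup by its attributes is a one-liner with BeautifulSoup, while the equivalent regex is both harder to read and more fragile. The snippet below is only an illustration; the HTML fragment is invented and not taken from the real Taobao page.

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch: attribute-based extraction with BeautifulSoup 3 vs. a regex.
import re
from BeautifulSoup import BeautifulSoup

html = '<ul id="list-container"><li class="list-item">' \
       '<a trace="shop" title="some shop" href="http://example.com"></a></li></ul>'

# Attribute lookup: readable and tolerant of attribute order or extra markup.
soup = BeautifulSoup(html)
link = soup.find('a', {'trace': 'shop'})
print link['title'], link['href']

# Equivalent regex: faster, but brittle if the markup changes even slightly.
match = re.search(r'<a trace="shop" title="([^"]+)" href="([^"]+)"', html)
print match.group(1), match.group(2)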
Here the Queue module is used to implement a producer-consumer pattern, as sketched below.
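Before the full listing, here is the producer-consumer idea in its smallest form (my own sketch, not from the original post): one thread put()s items into a Queue, another get()s them, and because get() blocks until an item is available no extra locking is needed.

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch of the producer-consumer pattern used by the crawler below.
from Queue import Queue
import threading

q = Queue()

def producer():
    for i in range(3):
        q.put('page-%d' % i)       # hand work to the consumer

def consumer():
    for i in range(3):
        print 'parsing', q.get()   # blocks until a page arrives

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

The full crawler-plus-parser version looks like this: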
#!/usr/bin/env python
# coding=utf-8
import requests
from BeautifulSoup import BeautifulSoup
from search_config import config
from Queue import Queue
import threading

class Scrapy(threading.Thread):
    def __init__(self, threadname, queue, out_queue):
        threading.Thread.__init__(self, name = threadname)
        self.sharedata = queue
        self.out_queue = out_queue
        self.threadname = threadname
        print threadname + 'start.....'

    def run(self):
        url = config.url + self.sharedata.get()
        response = requests.get(url)
        self.out_queue.put(response)
        print self.threadname + 'end......'

class Parse(threading.Thread):
    def __init__(self, threadname, queue, out_queue):
        threading.Thread.__init__(self, name = threadname)
        self.sharedata = queue
        self.out_queue = out_queue
        self.threadname = threadname
        print threadname + 'start.....'

    def run(self):
        response = self.sharedata.get()
        body = response.content.decode('gbk').encode('utf-8')
        soup = BeautifulSoup(body)
        ul_html = soup.find('ul', {'id': 'list-container'})
        lists = ul_html.findAll('li', {'class': 'list-item'})
        stores = []
        for list in lists:
            store = {}
            try:
                infos = list.findAll('a', {'trace': 'shop'})
                for info in infos:
                    attrs = dict(info.attrs)
                    if attrs.has_key('class'):
                        if 'rank' in attrs['class']:
                            rank_string = attrs['class']
                            rank_num = rank_string[-2:]
                            if (rank_num[0] == '-'):
                                store['rank'] = rank_num[-1]
                            else:
                                store['rank'] = rank_num
                    if attrs.has_key('title'):
                        store['title'] = attrs['title']
                        store['href'] = attrs['href']
            except AttributeError:
                pass
            if store:
                stores.append(store)
        for store in stores:
            print store['title'] + ' ' + store['rank']
        print self.threadname + 'end......'

def main():
    queue = Queue()
    targets = Queue()
    stores = Queue()
    scrapy = []
    for i in range(0, 13, 6):
        # queue: the original requests
        # targets: content waiting to be parsed
        # stores: parsed results; to keep things simple the results are printed
        #         directly inside the thread, so this queue is not actually used
        queue.put(str(i))
        scrapy = Scrapy('scrapy', queue, targets)
        scrapy.start()
        parse = Parse('parse', targets, stores)
        parse.start()

if __name__ == '__main__':
    main()
Looking at the output of this run, the scrapy stage finishes very quickly and parse also starts early, yet the program then sits on parse for a long time before any results come out; each parsed result takes 3 to 5 seconds to appear. Granted, I am running this on a beat-up old IBM laptop, but with multiple threads it really should not be this slow.
Now let's look at the same data handled by a single-threaded parser. For convenience I added redis to the multi_scrapy above and used it to store the raw crawled pages, so that single_parse.py can work on them on its own, which is a bit more convenient.
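The exact redis change is not shown in the post, but it presumably amounts to pushing each raw response body into a redis list instead of (or in addition to) the in-process queue. A hedged sketch, with the list name 'targets' taken from the parser below:

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch of the change described above: push the raw page body into a
# redis list named 'targets' so a separate script can parse it later.
import redis
import requests
from search_config import config

r = redis.StrictRedis(host='localhost', port=6379)
response = requests.get(config.url + '0')
r.rpush('targets', response.content)   # single_parse.py reads this with lpop('targets')

single_parse.py then pops those pages back out and parses them: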
#!/usr/bin/env python
# coding=utf-8
from BeautifulSoup import BeautifulSoup
import redis

class Parse():
    def __init__(self, threadname, content):
        self.threadname = threadname
        self.content = content
        print threadname + 'start.....'

    def run(self):
        response = self.content
        if response:
            body = response.decode('gbk').encode('utf-8')
            soup = BeautifulSoup(body)
            ul_html = soup.find('ul', {'id': 'list-container'})
            lists = ul_html.findAll('li', {'class': 'list-item'})
            stores = []
            for list in lists:
                store = {}
                try:
                    infos = list.findAll('a', {'trace': 'shop'})
                    for info in infos:
                        attrs = dict(info.attrs)
                        if attrs.has_key('class'):
                            if 'rank' in attrs['class']:
                                rank_string = attrs['class']
                                rank_num = rank_string[-2:]
                                if (rank_num[0] == '-'):
                                    store['rank'] = rank_num[-1]
                                else:
                                    store['rank'] = rank_num
                        if attrs.has_key('title'):
                            store['title'] = attrs['title']
                            store['href'] = attrs['href']
                except AttributeError:
                    pass
                if store:
                    stores.append(store)
            for store in stores:
                try:
                    print store['title'] + ' ' + store['rank']
                except KeyError:
                    pass
            print self.threadname + 'end......'
        else:
            pass

def main():
    r = redis.StrictRedis(host='localhost', port=6379)
    while True:
        content = r.lpop('targets')
        if (content):
            parse = Parse('parse', content)
            parse.run()
        else:
            break

if __name__ == '__main__':
    main()
The result shows that the single-threaded version takes about the same time as the multi-threaded one. True, the multi-threaded run above also includes the time spent fetching the pages, but as we already saw in the first example, with threads the three pages can easily be fetched within one second. In other words, using multiple threads for parsing brought no real efficiency gain.
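The culprit is CPython's GIL: BeautifulSoup parsing is pure-Python CPU work, and only one thread can execute Python bytecode at a time. A small self-contained experiment (my own illustration, not from the original post) makes the effect visible:

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch: CPU-bound work in two threads takes about as long as running
# it sequentially, because the GIL lets only one thread execute bytecode.
import time
import threading

def burn():
    # pure Python CPU work, no I/O that would release the GIL
    total = 0
    for i in xrange(5000000):
        total += i

start = time.time()
burn(); burn()
print 'sequential: %.2fs' % (time.time() - start)

start = time.time()
threads = [threading.Thread(target=burn) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print 'two threads: %.2fs' % (time.time() - start)

On a typical CPython 2 interpreter the two timings come out roughly the same, and the threaded run is sometimes even a little slower because of lock contention.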
Since the GIL belongs to a single Python interpreter process, the problem goes away if the work is executed by multiple interpreter processes; likewise, separate interpreters will not all end up running on the same core. So the best solution is to combine multi-threading with multi-processing: use threads for the I/O-bound parts and an appropriate number of processes for the CPU-bound operations, and the CPU can finally be kept fully busy :)
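As a rough sketch of that combination (my own code, not from the post; parse_page() is a hypothetical stand-in for the BeautifulSoup work above): threads fetch the pages concurrently, then a multiprocessing.Pool spreads the parsing across CPU cores.

#!/usr/bin/env python
# coding=utf-8
# Hedged sketch: threads for I/O-bound fetching, processes for CPU-bound parsing.
import threading
import multiprocessing
import requests
from search_config import config

pages = []
lock = threading.Lock()

def fetch(start_num):
    body = requests.get(config.url + start_num).content
    with lock:                 # guard the shared list while threads append to it
        pages.append(body)

def parse_page(body):
    # hypothetical placeholder for the CPU-heavy BeautifulSoup parsing
    return len(body)

if __name__ == '__main__':
    threads = [threading.Thread(target=fetch, args=(str(i),)) for i in range(0, 13, 6)]
    for t in threads: t.start()
    for t in threads: t.join()

    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    results = pool.map(parse_page, pages)   # parsing runs in parallel across cores
    pool.close()
    pool.join()
    print results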