Scrapy Study Notes

1. What is Scrapy

Scrapy is a crawler framework built on Twisted; by customizing just a few modules, you can implement a full crawler.

2. Advantages of Scrapy

Without Scrapy, writing a crawler by hand means using urllib or Requests to send requests and then building everything yourself: classes for HTTP header handling, multithreading, proxy handling, de-duplication, data storage, and exception detection.

3. Scrapy Architecture

Scrapy Engine: the engine of Scrapy. It handles the signals, messages, and data passed between the Scheduler, Pipeline, Spiders, and Downloader.

Scheduler: Scrapy's scheduler. Put simply, it is a queue: it accepts Requests sent by the Scrapy Engine and queues them; when the Engine needs work to do, the Scheduler hands requests from the queue back to the Engine.

Downloader: Scrapy's downloader. It accepts Requests from the Scrapy Engine, fetches them to produce Responses, and returns those to the Engine, which then hands the Responses to the Spiders.

Spiders: Scrapy's crawlers. This is where the crawling logic is written, e.g. regular expressions, BeautifulSoup, or XPath. If a Response contains a follow-up request, such as a "next page" link, the Spider hands that URL to the Scrapy Engine, which passes it to the Scheduler for queueing (a minimal spider sketch follows this overview).

Pipeline: Scrapy's item pipeline. This is where de-duplication and storage classes live; it is responsible for post-processing the scraped data, such as filtering and persisting it.

Downloader Middlewares: downloader middleware. Custom extension components; this is where we plug in proxies and custom HTTP headers.

Spider Middlewares: spider middleware. It can process the Requests sent out by the Spiders and the Responses they receive.
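
To make this data flow concrete, here is a minimal, generic spider (not part of the Douban project; quotes.toscrape.com is just a public practice site). The parse callback extracts data from each Response and, when a "next page" link exists, yields a new Request that the Engine hands back to the Scheduler for queueing.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # parsing logic lives in the Spider (XPath here)
        for quote in response.xpath('//div[@class="quote"]'):
            yield {"text": quote.xpath('./span[@class="text"]/text()').get()}
        # a "next page" link goes back to the Engine as a new Request,
        # which the Engine passes to the Scheduler
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)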

4. A Scrapy Example

4.1 Crawling the Douban Movie Top 250

There are plenty of tutorials online on setting up a Scrapy project; search for one if you need the details.
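
For reference, the basic commands to create the project skeleton are only a few lines; the spider name douban_movie below is an assumption chosen to match the MongoDB collection used later:

scrapy startproject ScrapyTest
cd ScrapyTest
scrapy genspider douban_movie movie.douban.com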

Define a custom proxy middleware. Local proxy IPs are used here; for large volumes of crawl requests you would need a third-party proxy service. The crawler's source IP can be disguised as one of the following proxies:

# middlewares.py
import random

class specified_proxy(object):
    def process_request(self, request, spider):
        # pick a random proxy IP
        PROXIES = [
            'http://183.207.95.27:80', 'http://111.6.100.99:80', 'http://122.72.99.103:80',
            'http://106.46.132.2:80', 'http://112.16.4.99:81', 'http://123.58.166.113:9000',
            'http://118.178.124.33:3128', 'http://116.62.11.138:3128', 'http://121.42.176.133:3128',
            'http://111.13.2.131:80', 'http://111.13.7.117:80', 'http://121.248.112.20:3128',
            'http://112.5.56.108:3128', 'http://42.51.26.79:3128', 'http://183.232.65.201:3128',
            'http://118.190.14.150:3128', 'http://123.57.221.41:3128', 'http://183.232.65.203:3128',
            'http://166.111.77.32:3128', 'http://42.202.130.246:3128', 'http://122.228.25.97:8101',
            'http://61.136.163.245:3128', 'http://121.40.23.227:3128', 'http://123.96.6.216:808',
            'http://59.61.72.202:8080', 'http://114.141.166.242:80', 'http://61.136.163.246:3128',
            'http://60.31.239.166:3128', 'http://114.55.31.115:3128', 'http://202.85.213.220:3128'
        ]
        # random.sample() returns a list; choice() gives a single proxy string
        random_proxy = random.choice(PROXIES)
        request.meta['proxy'] = random_proxy

Define a custom User-Agent middleware so that, to the target server, the requests look like they come from a real operating system and browser rather than from a bot:

import random

class specified_useragent(object):
    def process_request(self, request, spider):
        # pick a random User-Agent
        USER_AGENT_LIST = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
            "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
            "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        ]
        agent = random.choice(USER_AGENT_LIST)
        # use the standard 'User-Agent' header (the original 'USER_AGNET' was a typo)
        request.headers['User-Agent'] = agent

After writing the custom middlewares, register them in settings.py:

# the smaller the number, the higher the priority
DOWNLOADER_MIDDLEWARES = {
    'ScrapyTest.middlewares.specified_proxy': 543,
    'ScrapyTest.middlewares.specified_useragent': 544,
}

Define the data fields in items.py:

import scrapy

class ScrapytestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()   # movie ranking number
    movie_name = scrapy.Field()      # movie title
    introduce = scrapy.Field()       # movie introduction
    star = scrapy.Field()            # rating
    evaluate = scrapy.Field()        # number of reviews
    describe = scrapy.Field()        # movie description
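
The spider that fills these fields is not reproduced in these notes (it is in the GitHub repository linked at the end). A minimal sketch of what it might look like follows; the file name, spider name, and XPath selectors here are illustrative assumptions rather than copies from the project:

# spiders/douban_movie.py (hypothetical sketch)
import scrapy
from ScrapyTest.items import ScrapytestItem

class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for movie in response.xpath('//div[@class="item"]'):
            item = ScrapytestItem()
            item['serial_number'] = movie.xpath('.//em/text()').get()
            item['movie_name'] = movie.xpath('.//span[@class="title"]/text()').get()
            item['introduce'] = movie.xpath('.//div[@class="bd"]/p/text()').get()
            item['star'] = movie.xpath('.//span[@class="rating_num"]/text()').get()
            item['evaluate'] = movie.xpath('.//div[@class="star"]/span[4]/text()').get()
            item['describe'] = movie.xpath('.//span[@class="inq"]/text()').get()
            yield item  # handed to the Engine, then to the Pipeline
        # follow the "next page" link, if any
        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)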

Configure data storage in the pipeline (pipelines.py) and connect to MongoDB:

import pymongo

# values defined in settings.py (shown below); this import path assumes the
# project package is named ScrapyTest, as in the middleware paths above
from ScrapyTest.settings import monodb_host, monodb_port, monodb_db_name, monodb_tb_name

class ScrapytestPipeline(object):
    def __init__(self):
        host = monodb_host
        port = monodb_port
        dbname = monodb_db_name
        sheetname = monodb_tb_name
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert_one(data)  # insert() was removed in newer pymongo versions
        return item

Database settings in settings.py:

monodb_host = "127.0.0.1"
monodb_port = 27017
monodb_db_name = "scrapy_test"
monodb_tb_name = "douban_movie"
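
Note that the pipeline itself also has to be enabled in settings.py, otherwise Scrapy never calls it. The project's exact priority value is not shown in these notes, so the 300 below is just the conventional default:

ITEM_PIPELINES = {
    'ScrapyTest.pipelines.ScrapytestPipeline': 300,
}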

Run the crawl via the project's main script to see the results.
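
The main script itself is not reproduced here; a typical entry point for launching the crawl from an IDE is sketched below, assuming the spider is named douban_movie:

# main.py (hypothetical sketch)
from scrapy.cmdline import execute

# equivalent to running "scrapy crawl douban_movie" on the command line
execute("scrapy crawl douban_movie".split())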

The inserted data can then be inspected in the MongoDB shell:

use scrapy_test;
show collections;
db.douban_movie.find().pretty()

4.2 Getting the Source Code

https://github.com/cjy513203427/ScrapyTest
