Scrapy is a large, batteries-included crawling framework.
Installation:

- Windows:
    Download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    pip3 install wheel
    pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
    pip3 install pywin32
    pip3 install scrapy
- Linux:
    pip3 install scrapy
Usage:

    # create a project
    scrapy startproject xdb
    cd xdb
    # create spiders
    scrapy genspider chouti chouti.com
    scrapy genspider cnblogs cnblogs.com
    # run a spider
    scrapy crawl chouti
Workflow:

1. Create a project
    scrapy startproject project_name

    project_name/
        scrapy.cfg          # config file (deployment)
        project_name/
            spiders/        # spider files
                chouti.py
                cnblogs.py
                ...
            items.py        # persistence (item definitions)
            pipelines.py    # persistence (item pipelines)
            middlewares.py  # middlewares
            settings.py     # config file (crawler)

2. Create spiders
    cd project_name
    scrapy genspider chouti chouti.com
    scrapy genspider cnblogs cnblogs.com

3. Run a spider
    scrapy crawl chouti
    scrapy crawl chouti --nolog    # suppress log output
A concrete example: crawling chouti
# -*- coding: utf-8 -*-
import io
import sys

import scrapy
from scrapy.http import Request

# re-wrap stdout so Chinese text prints correctly in a Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']     # focused crawl: only this site
    start_urls = ['http://chouti.com/']  # initial URL

    def parse(self, response):  # callback
        # extract each news item on the page
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        with open('news.log', mode='a+') as f:
            for item in item_list:
                text = item.xpath('.//a/text()').extract_first()
                href = item.xpath('.//a/@href').extract_first()
                if href:
                    print(href.strip())
                    f.write(href + '\n')
                if text:
                    print(text.strip())

        # follow pagination links, e.g. https://dig.chouti.com/all/hot/recent/2
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)
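The spider builds absolute pagination URLs by string concatenation, which breaks if the site ever emits relative (non root-relative) hrefs. The standard library's `urllib.parse.urljoin` resolves either form against the current page URL; a minimal stdlib sketch:

```python
from urllib.parse import urljoin

# hypothetical page URL matching the chouti pagination pattern
base = "https://dig.chouti.com/all/hot/recent/1"

# a root-relative href resolves against the site root
print(urljoin(base, "/all/hot/recent/2"))  # https://dig.chouti.com/all/hot/recent/2

# a relative href resolves against the current path
print(urljoin(base, "3"))                  # https://dig.chouti.com/all/hot/recent/3
```

Inside the spider you could write `page = urljoin(response.url, page)` in place of the manual concatenation; recent Scrapy versions also offer `response.urljoin(page)` for the same purpose.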