A plain install of Scrapy often errors out, for two main reasons: pip is outdated, and several binary dependencies are missing.

First upgrade pip:

```
python -m pip install -U pip
```

Then manually install wheel, lxml, Twisted, and pywin32 before installing Scrapy itself:

```
pip3 install wheel
pip3 install lxml
pip3 install Twisted
pip3 install pywin32
pip3 install scrapy
```
```
mkdir Scrapy
scrapy startproject myfirstpjt
cd myfirstpjt
```
Scrapy commands come in two kinds: global commands and project commands. Global commands can be run directly without a Scrapy project; project commands must be run inside one.

Running `scrapy -h` outside any Scrapy project directory lists all global commands:
```
C:\Users\LENOVO>scrapy -h
Scrapy 2.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
```
The fetch command is mainly used to display the crawl/download process for a URL.

Scrapy supports both XPath and CSS selectors.

XPath selectors also have a `.re()` method for extracting data with regular expressions. Unlike `.xpath()` or `.css()`, `.re()` returns a list of unicode strings, so nested `.re()` calls cannot be constructed.
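Under the hood `.re()` behaves much like running `re.findall` over the selected text: the result is a list of plain strings, not selector objects. A stdlib-only illustration (the HTML snippet below is made up):

```python
import re

html = '<a href="/movie/1">Up (2009)</a><a href="/movie/2">Coco (2017)</a>'

# Like Selector.re(r'\((\d{4})\)'): the result is a list of plain strings,
# so there is nothing to chain another .re()/.xpath() call onto.
years = re.findall(r'\((\d{4})\)', html)
print(years)  # ['2009', '2017']
```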
Writing small scripts or projects from scratch is like writing an essay on a blank page; a framework bundles the commonly needed pieces and turns the essay question into a fill-in-the-blank exercise, greatly reducing the workload.
Create a project (`scrapy startproject <project-name>`, here todayMovie), then inspect its layout:

```
scrapy startproject todayMovie
tree todayMovie
```
```
D:\pycharm\code\Scrapy>scrapy startproject todayMovie
New Scrapy project 'todayMovie', using template directory 'c:\python3.7\lib\site-packages\scrapy\templates\project', created in:
    D:\pycharm\code\Scrapy\todayMovie

You can start your first spider with:
    cd todayMovie
    scrapy genspider example example.com

D:\pycharm\code\Scrapy>tree todayMovie
Folder PATH listing
Volume serial number is 6858-7249
D:\PYCHARM\CODE\SCRAPY\TODAYMOVIE
└─todayMovie
    └─spiders

D:\pycharm\code\Scrapy>
```
genspider
Use genspider to scaffold a basic spider from a template. The following creates a spider script named wuHanMovieSpider whose crawl domain is mtime.com:

```
scrapy genspider wuHanMovieSpider mtime.com
```
scrapy.cfg
Mainly declares that the default settings module is the settings file (settings.py) under the todayMovie package, and defines the project name as todayMovie.
The items.py file defines which fields the crawl should ultimately produce.

The pipelines.py file handles the aftermath: once the spider has scraped page content, pipelines.py determines how that content is processed.
Only four files need to be modified or filled in: items.py, settings.py, pipelines.py, and wuHanMovieSpider.py.

items.py decides which fields to crawl, wuHanMovieSpider.py decides how to crawl, settings.py decides who processes the crawled content, and pipelines.py decides how the crawled content is processed.
```python
# Use double quotes inside the single-quoted XPath string to avoid a syntax error
selector = response.xpath('/html/body/div[@id="homeContentRegion"]//text()')[0].extract()
```
extract() returns the selected content as unicode strings.
Symbol | Purpose |
---|---|
/ | selects from the document root, usually html |
// | selects all descendant nodes from the current position |
./ | extracts relative to the current node; used when extracting a second time |
. | selects the current node (relative path) |
.. | selects the parent of the current node (relative path) |
ELEMENT | selects all ELEMENT element nodes among the children |
//ELEMENT | selects all ELEMENT element nodes among the descendants |
* | selects all child element nodes |
text() | selects all text child nodes |
@ATTR | selects the attribute node named ATTR |
@* | selects all attribute nodes |
/@ATTR | gets the value of an attribute |
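Python's standard library implements a usable subset of these XPath forms (child paths, `.`, and `.//`, though not `text()` or bare `@ATTR` steps), which is enough to try the table out without Scrapy. A minimal sketch with a made-up document:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<html><body><div id="main"><p>hello</p></div></body></html>')

# .//ELEMENT — any descendant (ElementTree requires the leading '.')
p = doc.find('.//p')

# ELEMENT/ELEMENT — a child path from the current node
div = doc.find('body/div')

# Attribute values are read with .get() rather than an /@ATTR step
print(p.text, div.get('id'))  # hello main
```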
Method | Purpose |
---|---|
contains | a[contains(@href,"test")] finds a tags whose href attribute contains the string "test" |
starts-with | a[starts-with(@href,"http")] finds a tags whose href attribute starts with "http" |
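These two XPath functions map directly onto Python's `in` operator and `str.startswith`; the analogy below uses a made-up list of href values:

```python
hrefs = ['/test/page', 'http://example.com', 'mailto:someone@example.com']

# contains(@href, "test")  ≈  'test' in href
with_test = [h for h in hrefs if 'test' in h]

# starts-with(@href, "http")  ≈  href.startswith('http')
with_http = [h for h in hrefs if h.startswith('http')]

print(with_test, with_http)
```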
Examples

```python
response.xpath('//a/text()')    # select the text of all a elements
response.xpath('//div/*/img')   # select all img grandchildren of div
response.xpath('//p[contains(@class,"song")]')  # p elements whose class contains "song"
response.xpath('//a[contains(@data-pan,"M18_Index_review_short_movieName")]/text()')
response.xpath('//div/a | //div/p')  # "or": the page may contain a or p
selector = response.xpath('//a[contains(@href,"http://movie.mtime.com")]/text()').extract()
```
References
http://www.javashuo.com/article/p-yqgmlgqj-dy.html
http://www.javashuo.com/article/p-wmhvcwio-no.html
```
cd Scrapy\code
scrapy startproject weather
scrapy genspider beiJingSpider www.weather.com.cn/weather/101010100.shtml
```
```python
class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()     # city and date
    week = scrapy.Field()         # day of the week
    temperature = scrapy.Field()  # temperature
    weather = scrapy.Field()      # weather
    wind = scrapy.Field()         # wind
```
First use the scrapy shell command to test and work out the selectors, and mainly to check whether the site has any anti-crawling mechanism:

```
scrapy shell https://www.tianqi.com/beijing/
```
For example, a 403 means access is forbidden, not that the page does not exist.

A simple bypass is to add a User-Agent header and throttle the request rate.
Prepare a pool of User-Agent strings in resource.py, then use random to pick one of them each time.

step1: place the prepared resource.py in the same directory as settings.py
resource.py
```python
#-*- coding:utf-8 -*-
UserAgents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0",
]
```
step2: modify middlewares.py

Import random, UserAgents, and UserAgentMiddleware.

At the bottom, add a new class that inherits from the UserAgentMiddleware class.

Roughly, the class supplies a randomly chosen User-Agent header for requests.
```python
class CustomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent='Scrapy'):
        # ua = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/76.0"
        ua = random.choice(UserAgents)
        self.user_agent = ua
```
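Note that the class above draws one random UA in `__init__`, so the whole crawl runs with a single header. To rotate on every request you can implement `process_request`, which Scrapy's downloader-middleware chain calls for each outgoing request. A minimal, framework-free sketch (the short `UserAgents` pool and the dummy request class are simplified stand-ins, not Scrapy objects):

```python
import random

# Stand-in pool; in the project this would be `from weather.resource import UserAgents`
UserAgents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/76.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) Chrome/19.0.1036.7 Safari/535.20",
]

class RandomUserAgentMiddleware:
    """Choosing the UA inside process_request (not __init__) means a
    fresh random header for every request, not one per crawl."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(UserAgents)
        return None  # returning None lets the request continue through the chain

# Quick check with a dummy object standing in for scrapy.Request:
class DummyRequest:
    def __init__(self):
        self.headers = {}

req = DummyRequest()
RandomUserAgentMiddleware().process_request(req, spider=None)
print(req.headers['User-Agent'])
```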
step3: modify settings.py

Replace UserAgentMiddleware with CustomUserAgentMiddleware.

Find the DOWNLOADER_MIDDLEWARES option in settings.py and change it as follows:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    #'weather.middlewares.WeatherDownloaderMiddleware': 543,
    'weather.middlewares.CustomUserAgentMiddleware': 542,
}
```
step4: adjust the interval between requests

The time between two consecutive Scrapy requests is controlled by DOWNLOAD_DELAY. If anti-crawling were not a concern, smaller would naturally be better; a value of 30 means one request to the site every 30 seconds.

ps: for most sites, adding a User-Agent header alone is enough to bypass.
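In settings.py this is just a constant; the fragment below also shows RANDOMIZE_DOWNLOAD_DELAY, a related Scrapy setting (an addition here, not mentioned above) that jitters the delay to look less mechanical:

```python
# settings.py
DOWNLOAD_DELAY = 2               # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the actual delay between 0.5x and 1.5x
```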
The content we want lives under the div with class="day7"; anchor the subsequent selectors inside that div.

```
# note: the url must end with a trailing / or nothing is returned
scrapy shell https://tianqi.com/beijing/
selector = response.xpath('//div[@class="day7"]')
selector1 = selector.xpath('ul[@class="week"]/li')
```
beiJingSpider.py
```python
import scrapy
from weather.items import WeatherItem

class BeijingspiderSpider(scrapy.Spider):
    name = 'beiJingSpider'
    allowed_domains = ['tianqi.com']  # domain only, not a full URL
    start_urls = ['https://www.tianqi.com/beijing/']

    def parse(self, response):
        items = []
        city = response.xpath('//dd[@class="name"]/h2/text()').extract()
        Selector = response.xpath('//div[@class="day7"]')
        date = Selector.xpath('ul[@class="week"]/li/b/text()').extract()
        week = Selector.xpath('ul[@class="week"]/li/span/text()').extract()
        wind = Selector.xpath('ul[@class="txt"]/li/text()').extract()
        weather = Selector.xpath('ul[@class="txt txt2"]/li/text()').extract()
        temperature1 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/span/text()').extract()  # .extract() was missing
        temperature2 = Selector.xpath('div[@class="zxt_shuju"]/ul/li/b/text()').extract()
        for i in range(7):
            item = WeatherItem()
            try:
                item['cityDate'] = city[0] + date[i]  # city is a list; take its first element
                item['week'] = week[i]
                item['temperature'] = temperature1[i] + ',' + temperature2[i]
                item['weather'] = weather[i]
                item['wind'] = wind[i]
            except IndexError:
                break  # ran out of data; stop instead of killing the process with exit()
            items.append(item)
        return items
```
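The `parse()` loop above pairs several parallel lists by a shared index and bails out on IndexError; `zip()` expresses the same pairing and stops at the shortest list automatically. A small sketch with made-up data (these values are placeholders, not real scrape output):

```python
# Hypothetical parallel lists, shaped like the XPath .extract() results above
dates = ['07-01', '07-02', '07-03']
weeks = ['Wednesday', 'Thursday', 'Friday']
weathers = ['sunny', 'cloudy']  # deliberately shorter than the others

items = [
    {'cityDate': 'Beijing ' + d, 'week': w, 'weather': wt}
    for d, w, wt in zip(dates, weeks, weathers)
]
print(len(items))  # zip stopped at the shortest list: 2
```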
```python
import time
import codecs

class WeatherPipeline:
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())  # fixed typo: was `timw`
        fileName = today + '.txt'
        with codecs.open(fileName, 'a', 'utf-8') as fp:
            # five %s placeholders to match the five fields (the original had four)
            fp.write("%s \t %s \t %s \t %s \t %s \r\n"
                     % (item['cityDate'], item['week'], item['temperature'],
                        item['weather'], item['wind']))
        return item
```
In settings.py, find ITEM_PIPELINES and remove the comment marker in front of it.
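After uncommenting, the fragment should look roughly like this (300 is the priority the project template generates by default):

```python
# settings.py
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
}
```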
Go back to the weather project directory and run the crawl:

```
scrapy crawl beiJingSpider
```