Environment: Windows 7 x64, Python 3.7.1, PyCharm
1.1 On a Linux system, install with: pip install scrapy
1.2 On a Windows system, pip install scrapy works as well; on this setup (Python 3.7 on Windows 7) you may first need to install a prebuilt Twisted wheel if the install fails while compiling Twisted, then rerun the command.
1. Create a new project; a plain Python project is fine. The project I created here is named demo, and right after creation it is empty.
2. Click Terminal at the bottom of PyCharm, as shown in the figure below:
In the terminal, run the command scrapy startproject demo to create a Scrapy project. When it succeeds, a directory structure like the following appears:
The role of each file is roughly as follows:
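For reference, a freshly generated project typically looks roughly like this (the exact files can vary slightly between Scrapy versions):

```
demo/
├── scrapy.cfg          # deployment / project configuration entry point
└── demo/
    ├── __init__.py
    ├── items.py        # Item definitions (containers for the scraped data)
    ├── middlewares.py  # downloader / spider middlewares
    ├── pipelines.py    # item pipelines (post-processing, storage)
    ├── settings.py     # project settings
    └── spiders/        # the spiders themselves go in here
        └── __init__.py
```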
3.1 In the terminal, run: cd demo (I type demo here because my project is named demo).
3.2 In the terminal, run: scrapy genspider books books.toscrape.com (the pattern is scrapy genspider <spider name> <start URL of the site to crawl>).
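This generates demo/spiders/books.py. The generated file is only a skeleton; depending on your Scrapy version it looks roughly like this:

```python
# -*- coding: utf-8 -*-
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
```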
5.1 Analyze the http://books.toscrape.com/ page.
From the figure above we can see that every book sits in an li tag under div/ol/. Here we only print the book titles, so we can extract the data as shown below.
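If you want to check the XPath before putting it into the spider, Scrapy's interactive shell is convenient. Run scrapy shell http://books.toscrape.com/ in the terminal and try the selector there, roughly like this:

```python
# inside the scrapy shell session
book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
len(book_list)                                               # 20 books per page
book_list[0].xpath('./article/div[1]/a/img/@alt').extract()  # ['A Light in the Attic']
```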
5.2 Part of the code in books.py looks like this:
```python
def parse(self, response):
    '''
    Parse the response and extract the data.
    :param response: the response object that was crawled
    :return:
    '''
    book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
    for book in book_list:
        print(book.xpath('./article/div[1]/a/img/@alt').extract())
```
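A small side note: .extract() returns a list of strings, which is why every title below prints wrapped in brackets. If you only want the plain string, .extract_first() (or .get() on newer Scrapy versions) can be used instead, for example in the same loop:

```python
for book in book_list:
    # extract_first() returns the first match as a plain string (or None if nothing matches)
    print(book.xpath('./article/div[1]/a/img/@alt').extract_first())
```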
5.3 Configure settings.py as follows:
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0'  # UA header

# If True, the spider obeys robots.txt, which blocks most of what we want to crawl, so set it to False.
ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'  # log level
```
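As an aside, if you would rather not edit the project-wide settings.py, Scrapy also lets each spider carry its own overrides through the custom_settings class attribute; a minimal sketch:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    # Per-spider overrides: these take precedence over settings.py for this spider only.
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
    }
```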
5.4 Run the crawl command in the terminal: scrapy crawl books
```
# The printed output looks like this:
['A Light in the Attic']
['Tipping the Velvet']
['Soumission']
['Sharp Objects']
['Sapiens: A Brief History of Humankind']
['The Requiem Red']
['The Dirty Little Secrets of Getting Your Dream Job']
['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull']
['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics']
['The Black Maria']
['Starving Hearts (Triangular Trade Trilogy, #1)']
["Shakespeare's Sonnets"]
['Set Me Free']
["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"]
['Rip it Up and Start Again']
['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991']
['Olio']
['Mesaerion: The Best Science Fiction Stories 1800-1849']
['Libertarianism for Beginners']
["It's Only the Himalayas"]
```
As you can see, this only crawls one page; next we crawl the titles of all the books.
The final books.py ends up looking like this:
```python
# -*- coding: utf-8 -*-
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'  # unique identifier of the spider
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']  # starting point(s) of the crawl; there can be more than one
    url = 'http://books.toscrape.com/catalogue/page-%d.html'  # URL template used to build the next page's URL
    page_num = 2

    def parse(self, response):
        '''
        Parse the response and extract the data.
        :param response: the response object that was crawled
        :return:
        '''
        print(f'Current page: {self.page_num}')  # note: page_num holds the number of the next page to request
        book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
        for book in book_list:
            print(book.xpath('./article/div[1]/a/img/@alt').extract())

        if self.page_num <= 50:  # the site has 50 pages in total (<= so the last page is not skipped)
            new_url = self.url % self.page_num  # build the next URL from the template
            self.page_num += 1  # move to the next page
            yield scrapy.Request(url=new_url, callback=self.parse)  # send the request manually
```
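Hard-coding 50 pages works for this site, but a slightly more robust pattern is to follow the pager's "next" link until there is none. A sketch of parse written that way, assuming the page keeps its current markup (the pager link sits inside an li with class next):

```python
def parse(self, response):
    book_list = response.xpath('/html/body/div/div/div/div/section/div[2]/ol/li')
    for book in book_list:
        print(book.xpath('./article/div[1]/a/img/@alt').extract())

    # Follow the "next" button if it exists; the recursion stops by itself on the last page.
    next_href = response.css('li.next a::attr(href)').extract_first()
    if next_href:
        yield response.follow(next_href, callback=self.parse)
```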
Run the command in the terminal to fetch the book titles: scrapy crawl books
If everything goes well, the final part of the printed output will look like this:
In summary, the parse method does two things here:
1. Extract data from the page (using XPath or CSS selectors).
2. Extract links from the page and generate download requests for the linked pages.
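Finally, if you want to keep the titles rather than just print them, the usual pattern is to yield the data from parse and let Scrapy's feed export write it to a file with scrapy crawl books -o books.csv (or books.json). A minimal sketch of the change inside the loop:

```python
for book in book_list:
    # Yield a dict instead of printing, so the feed exporter (-o books.csv) can save it.
    yield {'title': book.xpath('./article/div[1]/a/img/@alt').extract_first()}
```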