A web crawler (also called a spider) is a type of bot that automatically browses the World Wide Web, usually for the purpose of building a web index.
In the web-scraping field, Python is close to being the dominant language: everything on the network is treated as a data source, and automated programs collect and process that data in a targeted way. Anyone working in this area should study crawling strategies, high-performance asynchronous IO, and distributed crawling, and should dig into the Scrapy framework's source code to understand how it works and, eventually, implement a custom crawler framework.
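To give the asynchronous IO mentioned above a concrete shape, here is a minimal sketch, independent of Scrapy, that fetches a few pages concurrently with asyncio and aiohttp; the URLs are placeholders and the aiohttp dependency is an assumption made only for this illustration.

# Minimal async-IO fetch sketch (illustrative; assumes `pip install aiohttp`).
import asyncio
import aiohttp

URLS = [
    "https://example.com/page1",  # placeholder URLs, not from this article
    "https://example.com/page2",
]

async def fetch(session, url):
    # Issue one GET request and return the response body as text.
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    # One shared session; the requests run concurrently instead of one by one.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))
        for url, page in zip(URLS, pages):
            print(url, len(page))

if __name__ == "__main__":
    # run_until_complete keeps this compatible with Python 3.6.
    asyncio.get_event_loop().run_until_complete(main())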
Scrapy is an application framework written in Python for crawling web sites and extracting structured data. It is widely used in programs for data mining, information processing, and archiving historical data.
Scrapy Engine: responsible for the communication, signalling, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: receives the Requests sent by the engine, arranges and enqueues them in a defined order, and returns them to the engine when the engine asks for them.
Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which hands them to the Spider for processing.
Spider: processes all Responses, parses and extracts data from them to fill the fields of an Item, and submits any URLs that need to be followed back to the engine so that they re-enter the Scheduler.
Item Pipeline: the place where Items obtained from the Spider are post-processed (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: components you can think of as hooks for customizing and extending the download behaviour (a minimal sketch follows this list).
Spider Middlewares: components for customizing and extending the communication between the engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).
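As an illustration of the downloader middleware hook, here is a minimal sketch that rotates the User-Agent header on every request. The class name, the user-agent list, and the priority value are assumptions for this example, and the middleware would still need to be registered in DOWNLOADER_MIDDLEWARES in settings.py.

# middlewares.py -- illustrative sketch of a custom downloader middleware.
import random

class RandomUserAgentMiddleware(object):
    # Hypothetical list of user agents; replace with your own.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # Called for every Request passing through the downloader;
        # returning None lets Scrapy continue processing the request normally.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None

It would then be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'mySpider.middlewares.RandomUserAgentMiddleware': 400}, where the priority 400 is an assumed value.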
# Create a virtual environment for the crawler
$ conda create --name scrapy python=3.6
# Activate the virtual environment
$ activate scrapy
# Install Scrapy
$ conda install scrapy
# Use the scrapy CLI to create a crawler project
$ scrapy startproject myScrapy
# Start the crawler
$ scrapy crawl scrapyName
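The spider module itself can be written by hand, as is done further down, or optionally generated with the scrapy genspider command; the spider name and domain below simply mirror the Douban example used later in this article.

$ scrapy genspider douban movie.douban.com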
scrapy.cfg: the project's configuration file.
mySpider/: the project's Python module; the code is imported from here.
mySpider/items.py: the file where the crawl targets (Items) are defined.
mySpider/pipelines.py: the project's pipeline file.
mySpider/settings.py: the project's settings file.
mySpider/spiders/: the directory where spider code is stored (the full layout is sketched below).
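Putting the listing above together, the generated layout looks roughly like this (assuming the project is named mySpider, as in the listing above):

mySpider/
├── scrapy.cfg
└── mySpider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py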
import scrapy


class DoubanItem(scrapy.Item):
    name = scrapy.Field()
    director = scrapy.Field()
    detail = scrapy.Field()
    star = scrapy.Field()
    synopsis = scrapy.Field()
    comment = scrapy.Field()
# coding:utf-8
import scrapy
from scrapy import Request
from douban.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath("//div[@class='article']/ol/li")
        if movie_list and len(movie_list) > 0:
            for movie in movie_list:
                item = DoubanItem()
                item['name'] = movie.xpath("./div/div[2]/div[1]/a/span[1]/text()").extract()[0]
                item['director'] = movie.xpath("normalize-space(./div/div[2]/div[2]/p/text())").extract_first()
                item['detail'] = movie.xpath("normalize-space(./div/div[2]/div[2]/p[1]/text())").extract()[0]
                item['star'] = movie.xpath("./div/div[2]/div[2]/div/span[2]/text()").extract()[0]
                item['synopsis'] = movie.xpath("normalize-space(./div/div[2]/div[2]/p[2]/span/text())").extract()[0]
                item['comment'] = movie.xpath("./div/div[2]/div[2]/div/span[4]/text()").extract()[0]
                yield item
        next_link = response.xpath("//span[@class='next']/a/@href").extract()
        if next_link:
            yield Request("https://movie.douban.com/top250" + next_link[0], callback=self.parse, dont_filter=True)
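Two small notes on the spider above. extract()[0] raises an IndexError whenever an XPath matches nothing, so extract_first() (already used for the director field) is the safer choice throughout. For the pagination, newer Scrapy versions (1.4+) also offer response.follow, which resolves relative URLs for you; a sketch of what the last few lines could look like with it:

        next_link = response.xpath("//span[@class='next']/a/@href").extract_first()
        if next_link:
            # response.follow resolves the relative href against the current page URL.
            yield response.follow(next_link, callback=self.parse, dont_filter=True)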
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from database_handler import DatabaseHandler


class DoubanPipeline(object):
    def __init__(self):
        self.db = DatabaseHandler(host="xxx", username="xxx", password="xxx", database="xxx")

    def close_spider(self, spider):
        self.db.close()

    # Save each Item instance to the database
    def process_item(self, item, spider):
        sql = "insert into t_douban(name,director,detail,star,synopsis,comment) values('%s', '%s', '%s', '%s', '%s', '%s')" % (
            item['name'], item['director'], item['detail'], item['star'], item['synopsis'], item['comment'])
        self.db.insert(sql)
        return item
# coding:utf-8
import pymysql
from pymysql.err import MySQLError, ProgrammingError


class DatabaseHandler(object):
    def __init__(self, host, username, password, database, port=3306):
        """Initialize the database connection."""
        self.host = host
        self.username = username
        self.password = password
        self.port = port
        self.database = database
        self.db = pymysql.connect(host=self.host, user=self.username, password=self.password,
                                  database=self.database, port=self.port, charset='utf8')
        self.cursor = None

    def execute(self, sql):
        """Execute a SQL statement."""
        try:
            self.cursor = self.db.cursor()
            self.cursor.execute(sql)
            self.db.commit()
        except (MySQLError, ProgrammingError) as e:
            print(e)
            self.db.rollback()
        else:
            print("rowCount: %s rowNumber: %s" % (self.cursor.rowcount, self.cursor.rownumber))
        finally:
            self.cursor.close()

    def update(self, sql):
        """Update rows."""
        self.execute(sql)

    def insert(self, sql):
        """Insert a row and return its auto-generated id."""
        self.execute(sql)
        return self.cursor.lastrowid

    def insert_batch(self, sql, rows):
        """Insert multiple rows in one batch."""
        try:
            self.cursor = self.db.cursor()
            self.cursor.executemany(sql, rows)
            self.db.commit()
        except (MySQLError, ProgrammingError) as e:
            print(e)
            self.db.rollback()
        else:
            print("rowCount: %s rowNumber: %s" % (self.cursor.rowcount, self.cursor.rownumber))
        finally:
            self.cursor.close()

    def delete(self, sql):
        """Delete rows."""
        self.execute(sql)

    def select(self, sql):
        """Query rows and return them as a list of dicts."""
        self.cursor = self.db.cursor(cursor=pymysql.cursors.DictCursor)
        result = []
        try:
            self.cursor.execute(sql)
            data = self.cursor.fetchall()
            for row in data:
                result.append(row)
        except MySQLError as e:
            print(e)
        else:
            print(f"rowCount: {self.cursor.rowcount} rowNumber: {self.cursor.rownumber}")
            return result
        finally:
            self.cursor.close()

    def call_proc(self, name):
        """Call a stored procedure and return its first result row."""
        self.cursor = self.db.cursor()
        self.cursor.callproc(name)
        return self.cursor.fetchone()

    def close(self):
        """Close the connection."""
        self.db.close()


if __name__ == "__main__":
    pass
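One caveat about the pipeline above: building the INSERT statement with % interpolation breaks as soon as a field contains a quote character and leaves the code open to SQL injection. A safer sketch, assuming DatabaseHandler were extended with a hypothetical execute_params method that hands the values to pymysql's placeholder mechanism:

    # Hypothetical addition to DatabaseHandler: parameterized execution.
    def execute_params(self, sql, params):
        """Execute a SQL statement with %s placeholders, letting pymysql do the escaping."""
        try:
            self.cursor = self.db.cursor()
            self.cursor.execute(sql, params)
            self.db.commit()
        except (MySQLError, ProgrammingError) as e:
            print(e)
            self.db.rollback()
        finally:
            self.cursor.close()

The call in DoubanPipeline.process_item would then become:

        sql = ("insert into t_douban(name,director,detail,star,synopsis,comment) "
               "values(%s, %s, %s, %s, %s, %s)")
        self.db.execute_params(sql, (item['name'], item['director'], item['detail'],
                                     item['star'], item['synopsis'], item['comment']))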
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'

ITEM_PIPELINES = {
    'mySpider.pipelines.DoubanPipeline': 300,
}
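Depending on how aggressively you want to crawl, a few other settings are worth considering in settings.py; the values below are illustrative assumptions, not part of the original configuration:

# Illustrative politeness/throttling settings (values are assumptions).
ROBOTSTXT_OBEY = True                # honour robots.txt
DOWNLOAD_DELAY = 1                   # wait one second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap parallel requests per domain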
scrapy crawl douban
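If you just want to inspect the scraped data without going through the MySQL pipeline, Scrapy's built-in feed export can write the items straight to a file, for example:

$ scrapy crawl douban -o douban.json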