In the previous note I already set up the database (see 《scrapy電影天堂實戰(一)建立數據庫》, i.e. "Scrapy Movie Heaven in practice (1): setting up the database"). In this note we build the Scrapy project itself, but first let's get familiar with the XPath knowledge we will need.
reference: https://germey.gitbooks.io/python3webspider/content/4.1-XPath%E7%9A%84%E4%BD%BF%E7%94%A8.html
nodename    selects all child nodes of the named node
/           selects a direct child of the current node
//          selects descendants of the current node
.           selects the current node
..          selects the parent of the current node
@           selects an attribute
//title[@lang='eng']
This is an XPath rule: it selects every node named title whose lang attribute has the value eng.
from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
Here the li node in the HTML text has two values in its class attribute, li and li-first, so the exact attribute match used above no longer finds it. When an attribute has multiple values we need the contains() function instead.
result = html.xpath('//li[contains(@class, "li")]/a/text()')
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
Here the li node in the HTML text has gained an extra name attribute, so we need to select by both class and name at the same time. The and operator joins the two conditions, and both conditions sit inside the same pair of square brackets.
result = html.xpath('//li[position()<3]/a/text()')
result = html.xpath('//li[last()-2]/a/text()')
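These two rules use the positional predicates position() and last(): the first selects the link text of the first two li nodes, the second the link text of the third-from-last li node. A minimal sketch, reusing the lxml setup from above with a made-up list:

from lxml import etree

text = '''
<ul>
  <li><a href="link1.html">first item</a></li>
  <li><a href="link2.html">second item</a></li>
  <li><a href="link3.html">third item</a></li>
  <li><a href="link4.html">fourth item</a></li>
</ul>
'''
html = etree.HTML(text)
print(html.xpath('//li[position()<3]/a/text()'))  # ['first item', 'second item']
print(html.xpath('//li[last()-2]/a/text()'))      # ['second item'] (last() is 4, so last()-2 is the 2nd li)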
You can build the image yourself with this Dockerfile.
FROM ubuntu:latest
MAINTAINER vickeywu <vickeywu557@gmail.com>

RUN apt-get update
RUN apt-get install -y python3.6 python3-pip python3-dev && \
    ln -snf /usr/bin/python3.6 /usr/bin/python
RUN apt-get clean && \
    rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN pip3 install --upgrade pip && \
    ln -snf /usr/local/bin/pip3.6 /usr/bin/pip && \
    pip install --upgrade scrapy && \
    pip install --upgrade pymysql && \
    pip install --upgrade redis && \
    pip install --upgrade bitarray && \
    pip install --upgrade mmh3

WORKDIR /home/scrapy_project

CMD touch /var/log/scrapy.log && tail -f /var/log/scrapy.log
A custom setting can be added to settings.py, for example a page encoding:

PAGE_ENCODING = 'utf8'
It can then be read back anywhere in the project through get_project_settings():

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
PAGE_ENCODING = settings.get('PAGE_ENCODING')
If a page's declared encoding is unreliable, the response body can be decoded explicitly:

import sys
sys.setdefaultencoding('utf8')  # Python 2 only; this function does not exist in Python 3
body = (response.body).decode('utf8', 'ignore')
body = str((response.body).decode('utf16', 'ignore')).encode('utf8')
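Putting those pieces together, here is a hedged sketch (not from the original project; decode_body is a hypothetical helper) of how a callback might use the custom PAGE_ENCODING setting to decode a response body:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
PAGE_ENCODING = settings.get('PAGE_ENCODING', 'utf8')

def decode_body(response):
    # decode the raw bytes with the configured encoding, dropping undecodable characters
    try:
        return response.body.decode(PAGE_ENCODING, 'ignore')
    except (LookupError, UnicodeDecodeError):
        # fall back to Scrapy's own guess of the page encoding
        return response.text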
Now let's create the Scrapy project for real.
root@ubuntu:/home/vickey# docker pull vickeywu/scrapy-python3
root@ubuntu:/home/vickey# mkdir scrapy_project  # create a folder to hold the scrapy project
root@ubuntu:/home/vickey# cd scrapy_project/
root@ubuntu:/home/vickey/scrapy_project# docker run -itd --name scrapy_movie -v /home/vickey/scrapy_project/:/home/scrapy_project/ vickeywu/scrapy-python3  # create a container from the pre-built image
84ae2ee9f02268c68e59cabaf3040d8a8d67c1b2d1442a66e16d4e3e4563d8b8
root@ubuntu:/home/vickey/scrapy_project# docker ps
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                               NAMES
84ae2ee9f022        vickeywu/scrapy-python3   "scrapy shell --nolog"   3 seconds ago       Up 2 seconds                                            scrapy_movie
d8afb121afc6        mysql                     "docker-entrypoint.s…"   4 days ago          Up 3 hours          33060/tcp, 0.0.0.0:8886->3306/tcp   scrapy_mysql
root@ubuntu:/home/vickey/scrapy_project# docker exec -it scrapy_movie /bin/bash
root@84ae2ee9f022:/home/scrapy_project# ls  # the mounted directory is empty for now; once the project is created its files will also appear on the host, which makes editing easier
root@84ae2ee9f022:/home/scrapy_project# scrapy --help  # show the help
(omitted)
root@84ae2ee9f022:/home/scrapy_project# scrapy startproject movie_heaven_bar  # create a project named movie_heaven_bar
New Scrapy project 'movie_heaven_bar', using template directory '/usr/local/lib/python3.6/dist-packages/scrapy/templates/project', created in:
    /home/scrapy_project/movie_heaven_bar

You can start your first spider with:
    cd movie_heaven_bar
    scrapy genspider example example.com
root@84ae2ee9f022:/home/scrapy_project# ls
movie_heaven_bar
root@84ae2ee9f022:/home/scrapy_project# cd movie_heaven_bar/  # enter the project before creating the spider
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# ls
movie_heaven_bar  scrapy.cfg
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# scrapy genspider movie_heaven_bar www.dytt8.net  # creating a spider named movie_heaven_bar fails: it cannot share the project's name, so pick another name
Cannot create a spider with the same name as your project
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# scrapy genspider newest_movie www.dytt8.net  # create a spider named newest_movie
Created spider 'newest_movie' using template 'basic' in module:
  movie_heaven_bar.spiders.newest_movie
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# cd movie_heaven_bar/
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar# ls
__init__.py  __pycache__  items.py  middlewares.py  pipelines.py  settings.py  spiders
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar# cd spiders/
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar/spiders# ls  # generated spider files live in the project's spiders folder
__init__.py  __pycache__  newest_movie.py
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar/spiders# exit  # leave the container
exit
root@ubuntu:/home/vickey/scrapy_project# ls  # back on the host, the project files created in the container are mounted locally, so the code can be written from the host
movie_heaven_bar
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class MovieHeavenBarItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    movie_link = Field()
    movie_name = Field()
    movie_director = Field()
    movie_actors = Field()
    movie_publish_date = Field()
    movie_score = Field()
    movie_download_link = Field()
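Items behave like dictionaries, so the fields declared above are set and read like dict keys. A quick illustration (not part of the project files):

from movie_heaven_bar.items import MovieHeavenBarItem

item = MovieHeavenBarItem()
item['movie_name'] = 'some movie'
print(item.get('movie_name'))   # 'some movie'
print(dict(item))               # {'movie_name': 'some movie'}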
Database settings, download delay, enabling the pipeline, and logging are all we need for now.
BOT_NAME = 'movie_heaven_bar'

SPIDER_MODULES = ['movie_heaven_bar.spiders']
NEWSPIDER_MODULE = 'movie_heaven_bar.spiders'

# db settings
DB_SETTINGS = {
    'DB_HOST': '192.168.229.128',
    'DB_PORT': 8886,
    'DB_DB': 'movie_heaven_bar',
    'DB_USER': 'movie',
    'DB_PASSWD': '123123',
}

# obey robots.txt; set to False if this causes errors
ROBOTSTXT_OBEY = True

# delay 3 seconds between requests
DOWNLOAD_DELAY = 3

# enable the pipeline
ITEM_PIPELINES = {
    'movie_heaven_bar.pipelines.MovieHeavenBarPipeline': 300,
}

# log settings
LOG_LEVEL = 'INFO'
LOG_FILE = '/var/log/scrapy.log'
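To check that the DB_SETTINGS above actually reach the MySQL container from part one (port 8886 is mapped to the container's 3306), a quick connectivity test can be run outside Scrapy. A hedged sketch using the credentials defined above:

import pymysql

conn = pymysql.connect(host='192.168.229.128', port=8886, db='movie_heaven_bar',
                       user='movie', passwd='123123', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('SELECT VERSION()')
    print(cursor.fetchone())   # prints the MySQL server version if the settings are correct
conn.close()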
reference: https://docs.scrapy.org/en/latest/topics/item-pipeline.html?highlight=filter#item-pipeline
After the project's spider (the spider file generated by the scrapy genspider spidername command) has scraped the data, it sends it to the item pipelines (the classes defined in the project's pipelines.py file). The pipelines process the data in the priority order (0~1000, from low to high) defined by ITEM_PIPELINES in settings.py.
Typical uses: 1. cleaning data; 2. validating data (checking that an item contains certain fields); 3. checking for duplicates (and dropping them); 4. storing the data in a database. A small sketch of points 2 and 3 follows below.
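A hedged sketch of validation and de-duplication (not part of this project's code; the class name is made up, the field names are this project's item fields):

from scrapy.exceptions import DropItem


class ValidateAndDedupPipeline(object):
    """Drop items that lack a download link or that have been seen before."""

    def __init__(self):
        self.seen_links = set()

    def process_item(self, item, spider):
        if not item.get('movie_download_link'):
            raise DropItem('missing movie_download_link: %s' % item.get('movie_name'))
        if item['movie_link'] in self.seen_links:
            raise DropItem('duplicate movie: %s' % item['movie_link'])
        self.seen_links.add(item['movie_link'])
        return item

To enable such a pipeline it would be added to ITEM_PIPELINES with a priority below 300, so that it runs before the database pipeline.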
reference: http://scrapingauthority.com/scrapy-database-pipeline/
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from scrapy.exceptions import NotConfigured


class MovieHeavenBarPipeline(object):
    def __init__(self, host, port, db, user, passwd):
        self.host = host
        self.port = port
        self.db = db
        self.user = user
        self.passwd = passwd

    # reference: doc.scrapy.org/en/latest/topics/item-pipeline.html#from_crawler
    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict('DB_SETTINGS')
        if not db_settings:
            raise NotConfigured
        host = db_settings['DB_HOST']
        port = db_settings['DB_PORT']
        db = db_settings['DB_DB']
        user = db_settings['DB_USER']
        passwd = db_settings['DB_PASSWD']
        return cls(host, port, db, user, passwd)

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            db=self.db,
            user=self.user,
            passwd=self.passwd,
            charset='utf8',
            use_unicode=True,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = ('INSERT INTO newest_movie(movie_link, movie_name, movie_director, movie_actors, '
               'movie_publish_date, movie_score, movie_download_link) '
               'VALUES (%s, %s, %s, %s, %s, %s, %s)')
        self.cursor.execute(sql, (
            item.get('movie_link'),
            item.get('movie_name'),
            item.get('movie_director'),
            item.get('movie_actors'),
            item.get('movie_publish_date'),
            item.get('movie_score'),
            item.get('movie_download_link'),
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
# -*- coding: utf-8 -*-
import scrapy
import time
import logging
from scrapy.http import Request
from movie_heaven_bar.items import MovieHeavenBarItem


class NewestMovieSpider(scrapy.Spider):
    name = 'newest_movie'
    allowed_domains = ['www.dytt8.net']
    #start_urls = ['http://www.dytt8.net/']
    # crawling starts from this list of URLs
    start_urls = ['http://www.dytt8.net/html/gndy/dyzz/']

    def parse(self, response):
        item = MovieHeavenBarItem()
        domain = "https://www.dytt8.net"
        urls = response.xpath('//b/a/@href').extract()  # list type
        #print('urls', urls)
        for url in urls:
            url = domain + url
            yield Request(url=url, callback=self.parse_single_page, meta={'item': item}, dont_filter=False)

        # crawl the next page
        last_page_num = response.xpath('//select[@name="sldd"]//option[last()]/text()').extract()[0]
        last_page_url = 'list_23_' + last_page_num + '.html'
        next_page_url = response.xpath('//div[@class="x"]//a[last() - 1]/@href').extract()[0]
        if next_page_url != last_page_url:
            url = 'https://www.dytt8.net/html/gndy/dyzz/' + next_page_url
            logging.log(logging.INFO, '***************** page num ***************** ')
            logging.log(logging.INFO, 'crawling page: ' + next_page_url)
            yield Request(url=url, callback=self.parse, meta={'item': item}, dont_filter=False)

    def parse_single_page(self, response):
        item = response.meta['item']
        item['movie_link'] = response.url
        detail_row = response.xpath('//*[@id="Zoom"]//p/text()').extract()  # list of str
        # join the extracted strings into one long string, then split on the ◎ marker
        # so each field's content can be picked out precisely
        detail_list = ''.join(detail_row).split('◎')
        logging.log(logging.INFO, '******************log movie detail*******************')
        item['movie_name'] = detail_list[1][5:].replace(6*u'\u3000', u', ')
        logging.log(logging.INFO, 'movie_link: ' + item['movie_link'])
        logging.log(logging.INFO, 'movie_name: ' + item['movie_name'])
        # find the fields that contain specific markers
        for field in detail_list:
            if '主\u3000\u3000演' in field:
                # strip the noise from the field with [5:].replace(6*u'\u3000', u', ')
                item['movie_actors'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_actors: ' + item['movie_actors'])
            if '導\u3000\u3000演' in field:
                item['movie_director'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_directors: ' + item['movie_director'])
            if '上映日期' in field:
                item['movie_publish_date'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_publish_date: ' + item['movie_publish_date'])
            if '豆瓣評分' in field:
                item['movie_score'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_score: ' + item['movie_score'])
        # this grabs the Thunder (Xunlei) magnet link; with Thunder installed, pasting the link
        # into the browser address bar opens the download automatically; a few pages have a
        # different structure and no link can be extracted
        try:
            item['movie_download_link'] = ''.join(response.xpath('//p/a/@href').extract())
            logging.log(logging.INFO, 'movie_download_link: ' + item['movie_download_link'])
        except Exception as e:
            item['movie_download_link'] = response.url
            logging.log(logging.WARNING, e)
        yield item
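Before running the full crawl, the XPaths used above can be sanity-checked offline with Scrapy's Selector. A quick sketch, where the HTML fragment is made up purely for illustration:

from scrapy import Selector

# a made-up fragment imitating a dytt8.net list page
html = '''
<b><a href="/html/gndy/dyzz/20190701/12345.html">2019 Some Movie HD</a></b>
<div class="x"><a href="list_23_1.html">1</a><a href="list_23_2.html">2</a></div>
'''
sel = Selector(text=html)
print(sel.xpath('//b/a/@href').extract())                              # ['/html/gndy/dyzz/20190701/12345.html']
print(sel.xpath('//div[@class="x"]//a[last() - 1]/@href').extract())   # ['list_23_1.html']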
root@ubuntu:/home/vickey/scrapy_project/movie_heaven_bar# docker exec -it scrapy_movie /bin/bash
root@1040aa3b7363:/home/scrapy_project# ls
movie_heaven_bar
root@1040aa3b7363:/home/scrapy_project# cd movie_heaven_bar/
root@1040aa3b7363:/home/scrapy_project/movie_heaven_bar# ls
movie_heaven_bar  run.sh  scrapy.cfg
root@1040aa3b7363:/home/scrapy_project/movie_heaven_bar# sh run.sh &  # run the script in the background; the log output goes to /var/log/scrapy.log
root@1040aa3b7363:/home/scrapy_project/movie_heaven_bar# exit
exit
root@ubuntu:/home/vickey/scrapy_project/movie_heaven_bar# ls
movie_heaven_bar  README.md  run.sh  scrapy.cfg
root@ubuntu:/home/vickey/scrapy_project/movie_heaven_bar# docker logs -f scrapy_movie  # docker logs -f --tail 20 scrapy_movie also shows scrapy's log output
Done. Now whenever I want to watch a movie I just copy its movie_download_link into the browser; Thunder (Xunlei) opens the download link automatically (provided Thunder is installed), and I can watch while it downloads. Lovely.
One annoyance, though: if I stop the crawl halfway, it has to start again from the beginning, which produces duplicate data. The next note covers Scrapy's de-duplication methods, so there will be no duplicate data and the crawl will also take less time.
The code has been uploaded to GitHub: https://github.com/Vickey-Wu/movie_heaven_bar