Scrapy Movie Heaven in Practice (Part 2): Creating the Spider Project

Database Setup

I already created the database in the previous note; see "Scrapy Movie Heaven in Practice (Part 1): Creating the Database" for details. In this note we build the Scrapy project itself. First, a quick review of the XPath features we'll need.

XPath Basics Used Here

Reference: https://germey.gitbooks.io/python3webspider/content/4.1-XPath%E7%9A%84%E4%BD%BF%E7%94%A8.html

nodename    Selects all child nodes of the named node
/           Selects direct children of the current node
//          Selects all descendants of the current node
.           Selects the current node
..          Selects the parent of the current node
@           Selects attributes

//title[@lang='eng']
This is an XPath rule: it selects every node named title whose lang attribute equals eng.
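
As a quick sanity check, the rule can be run with lxml; the HTML snippet here is invented for illustration:

from lxml import etree

text = '''
<html><head>
<title lang="eng">first title</title>
<title lang="zh">second title</title>
</head></html>
'''
html = etree.HTML(text)
# select the text of every title node whose lang attribute equals "eng"
result = html.xpath('//title[@lang="eng"]/text()')
print(result)   # ['first title']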

  • Multi-valued attribute matching
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)

Here the li node's class attribute has two values, li and li-first, so the exact-match predicate used above no longer matches it. When an attribute has multiple values, use the contains() function:

result = html.xpath('//li[contains(@class, "li")]/a/text()')
  • Multi-attribute matching
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Here the li node has an additional name attribute, so we select on both class and name at once by joining the two conditions with the and operator inside the predicate brackets.

  • Selecting by position
result = html.xpath('//li[position()<3]/a/text()')
result = html.xpath('//li[last()-2]/a/text()')
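
Both selectors in one runnable sketch, using an invented three-item list:

from lxml import etree

text = '''
<ul>
<li><a href="1.html">first item</a></li>
<li><a href="2.html">second item</a></li>
<li><a href="3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)
# position() < 3 keeps the first two li nodes
print(html.xpath('//li[position()<3]/a/text()'))    # ['first item', 'second item']
# with three li nodes, last()-2 evaluates to 1, i.e. the first li
print(html.xpath('//li[last()-2]/a/text()'))        # ['first item']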

Dockerfile for scrapy-python3 (optional)

You can build the image yourself from this Dockerfile:

FROM ubuntu:latest
MAINTAINER vickeywu <vickeywu557@gmail.com>

RUN apt-get update

RUN apt-get install -y python3.6 python3-pip python3-dev && \
     ln -snf /usr/bin/python3.6 /usr/bin/python

RUN apt-get clean && \
    rm -rf /var/cache/apt/archives/* /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN pip3 install --upgrade pip && \
        ln -snf /usr/local/bin/pip3.6 /usr/bin/pip && \
        pip install --upgrade scrapy && \
        pip install --upgrade pymysql && \
        pip install --upgrade redis && \
        pip install --upgrade bitarray && \
        pip install --upgrade mmh3

WORKDIR /home/scrapy_project

CMD touch /var/log/scrapy.log && tail -f /var/log/scrapy.log
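
If you would rather build the image yourself than pull mine from Docker Hub, something along these lines should work (the tag is only chosen to match the image name used in the transcript below):

docker build -t vickeywu/scrapy-python3 .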

Setting UTF-8 Encoding Under Python 2 (skip if you're on Python 3)

  • set var in settings.py
PAGE_ENCODING = 'utf8'
  • Read it back in another file:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
PAGE_ENCODING = settings.get('PAGE_ENCODING')
  • Set UTF-8 directly:
import sys
reload(sys)                     # Python 2 only: setdefaultencoding is removed from sys after startup
sys.setdefaultencoding('utf8')
body = (response.body).decode('utf8', 'ignore')
body = str((response.body).decode('utf16', 'ignore')).encode('utf8')

Creating the Spider

Now let's create the Scrapy project for real:

root@ubuntu:/home/vickey# docker pull vickeywu/scrapy-python3
root@ubuntu:/home/vickey# mkdir scrapy_project      # create a folder to hold scrapy projects
root@ubuntu:/home/vickey# cd scrapy_project/
root@ubuntu:/home/vickey/scrapy_project# docker run -itd --name scrapy_movie -v /home/vickey/scrapy_project/:/home/scrapy_project/ vickeywu/scrapy-python3     # create a container from the prebuilt image
84ae2ee9f02268c68e59cabaf3040d8a8d67c1b2d1442a66e16d4e3e4563d8b8
root@ubuntu:/home/vickey/scrapy_project# docker ps
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                               NAMES
84ae2ee9f022        vickeywu/scrapy-python3   "scrapy shell --nolog"   3 seconds ago       Up 2 seconds                                            scrapy_movie
d8afb121afc6        mysql                     "docker-entrypoint.s…"   4 days ago          Up 3 hours          33060/tcp, 0.0.0.0:8886->3306/tcp   scrapy_mysql
root@ubuntu:/home/vickey/scrapy_project# docker exec -it scrapy_movie /bin/bash
root@84ae2ee9f022:/home/scrapy_project# ls      # the mounted directory is empty for now; once the project is created, its files show up on the host for easy editing
root@84ae2ee9f022:/home/scrapy_project# scrapy --help       # show the available commands
(output omitted)
root@84ae2ee9f022:/home/scrapy_project# scrapy startproject movie_heaven_bar        # create a project named movie_heaven_bar
New Scrapy project 'movie_heaven_bar', using template directory '/usr/local/lib/python3.6/dist-packages/scrapy/templates/project', created in:
    /home/scrapy_project/movie_heaven_bar

You can start your first spider with:
    cd movie_heaven_bar
    scrapy genspider example example.com
root@84ae2ee9f022:/home/scrapy_project# ls
movie_heaven_bar
root@84ae2ee9f022:/home/scrapy_project# cd movie_heaven_bar/       # enter the project before creating the spider
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# ls
movie_heaven_bar  scrapy.cfg
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# scrapy genspider movie_heaven_bar www.dytt8.net        # creating a spider named movie_heaven_bar fails: a spider can't share the project's name, so pick another
Cannot create a spider with the same name as your project
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# scrapy genspider newest_movie www.dytt8.net     # create a spider named newest_movie
Created spider 'newest_movie' using template 'basic' in module:
  movie_heaven_bar.spiders.newest_movie
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar# cd movie_heaven_bar/
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar# ls
__init__.py  __pycache__  items.py  middlewares.py  pipelines.py  settings.py  spiders
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar# cd spiders/
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar/spiders# ls       # the generated spider file lives under the project's spiders folder
__init__.py  __pycache__  newest_movie.py
root@84ae2ee9f022:/home/scrapy_project/movie_heaven_bar/movie_heaven_bar/spiders# exit     # leave the container
exit
root@ubuntu:/home/vickey/scrapy_project# ls     # back on the host, the project files are mounted locally, so we can write the code right here
movie_heaven_bar

Writing the Code

  • items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field


class MovieHeavenBarItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #pass

    movie_link = Field()
    movie_name = Field()
    movie_director = Field()
    movie_actors = Field()
    movie_publish_date = Field()
    movie_score = Field()
    movie_download_link = Field()
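
A Scrapy Item behaves much like a dict, which is how the spider below fills it in; a quick sketch (the value here is made up):

from movie_heaven_bar.items import MovieHeavenBarItem

item = MovieHeavenBarItem()
item['movie_name'] = 'some movie'     # hypothetical value, for illustration only
print(item['movie_name'])             # 'some movie'
print(dict(item))                     # {'movie_name': 'some movie'}

Assigning a key that isn't declared as a Field on the class raises a KeyError, which catches typos early.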
  • settings.py

Database settings, download delay, enabling the pipeline, and log settings; that's all we need for now:

BOT_NAME = 'movie_heaven_bar'

SPIDER_MODULES = ['movie_heaven_bar.spiders']
NEWSPIDER_MODULE = 'movie_heaven_bar.spiders'

# db settings
DB_SETTINGS = {
    'DB_HOST': '192.168.229.128',
    'DB_PORT': 8886,
    'DB_DB': 'movie_heaven_bar',
    'DB_USER': 'movie',
    'DB_PASSWD': '123123',
}

# obey robots.txt; set to False if this raises errors
ROBOTSTXT_OBEY = True

# delay 3 seconds
DOWNLOAD_DELAY = 3

# enable pipeline
ITEM_PIPELINES = {
    'movie_heaven_bar.pipelines.MovieHeavenBarPipeline': 300,
}

# log settings
LOG_LEVEL = 'INFO'
LOG_FILE = '/var/log/scrapy.log'
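
To sanity-check that the DB_SETTINGS block is picked up, the get_project_settings pattern shown earlier can read it back (a quick sketch, run from the project directory):

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
db_settings = settings.getdict('DB_SETTINGS')    # same accessor the pipeline uses via crawler.settings
print(db_settings['DB_HOST'], db_settings['DB_PORT'])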
  • pipelines.py

Reference: https://docs.scrapy.org/en/latest/topics/item-pipeline.html?highlight=filter#item-pipeline

After the project's spider (the file generated by the scrapy genspider spidername command) scrapes data, the items are sent to the project's item pipelines (the classes defined in the project's pipelines.py). The pipelines process items in the priority order set by ITEM_PIPELINES in settings.py (0-1000, lowest runs first).

Their typical jobs: 1. cleaning data; 2. validating data (checking that items contain certain fields); 3. checking for duplicates (and dropping them); 4. storing data in the database. A sketch of a duplicates filter is shown after the pipeline code below.

reference: http://scrapingauthority.com/scrapy-database-pipeline/

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from scrapy.exceptions import NotConfigured


class MovieHeavenBarPipeline(object):
    def __init__(self, host, port, db, user, passwd):
        self.host = host
        self.port = port
        self.db = db
        self.user = user
        self.passwd = passwd

    # reference: doc.scrapy.org/en/latest/topics/item-pipeline.html#from_crawler
    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict('DB_SETTINGS')
        if not db_settings:
            raise NotConfigured
        host = db_settings['DB_HOST']
        port = db_settings['DB_PORT']
        db = db_settings['DB_DB']
        user = db_settings['DB_USER']
        passwd = db_settings['DB_PASSWD']
        return cls(host, port, db, user, passwd)

    def open_spider(self, spider):
        self.conn = pymysql.connect(
                                       host=self.host,
                                       port=self.port,
                                       db=self.db,
                                       user=self.user,
                                       passwd=self.passwd,
                                       charset='utf8',
                                       use_unicode=True,
                                   )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = 'INSERT INTO newest_movie(movie_link, movie_name, movie_director, movie_actors, movie_publish_date, movie_score, movie_download_link) VALUES (%s, %s, %s, %s, %s, %s, %s)'
        self.cursor.execute(sql, (item.get('movie_link'), item.get('movie_name'), item.get('movie_director'), item.get('movie_actors'), item.get('movie_publish_date'), item.get('movie_score'), item.get('movie_download_link')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
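
As promised above, here is a minimal sketch of job 3 (duplicate checking), modeled on the duplicates-filter example in the Scrapy docs; it is not part of this project's code. To enable it, it would be registered in ITEM_PIPELINES with a priority below 300 so it runs before the MySQL pipeline:

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop any item whose movie_link was already seen during this crawl (in-memory only)."""

    def __init__(self):
        self.links_seen = set()

    def process_item(self, item, spider):
        link = item.get('movie_link')
        if link in self.links_seen:
            raise DropItem('duplicate item found: %s' % link)
        self.links_seen.add(link)
        return item

Note this only deduplicates within a single run; deduplicating across runs is the subject of the next note.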
  • spiders/newest_movie.py
# -*- coding: utf-8 -*-
import scrapy
import logging
from scrapy.http import Request
from movie_heaven_bar.items import MovieHeavenBarItem


class NewestMovieSpider(scrapy.Spider):
    name = 'newest_movie'
    allowed_domains = ['www.dytt8.net']
    #start_urls = ['http://www.dytt8.net/']
    # crawling starts from this list of URLs
    start_urls = ['http://www.dytt8.net/html/gndy/dyzz/']

    def parse(self, response):
        domain = "https://www.dytt8.net"
        urls = response.xpath('//b/a/@href').extract()     # list of relative detail-page links
        #print('urls', urls)
        for url in urls:
            # a fresh item per detail page, so concurrent callbacks don't overwrite each other's fields
            item = MovieHeavenBarItem()
            yield Request(url=domain + url, callback=self.parse_single_page, meta={'item': item}, dont_filter=False)

        # crawl the next page
        last_page_num = response.xpath('//select[@name="sldd"]//option[last()]/text()').extract()[0]
        last_page_url = 'list_23_' + last_page_num + '.html'
        next_page_url = response.xpath('//div[@class="x"]//a[last() - 1]/@href').extract()[0]
        if next_page_url != last_page_url:
            url = 'https://www.dytt8.net/html/gndy/dyzz/' + next_page_url
            logging.log(logging.INFO, '***************** page num ***************** ')
            logging.log(logging.INFO, 'crawling page: ' + next_page_url)
            yield Request(url=url, callback=self.parse, dont_filter=False)

    def parse_single_page(self, response):
        item = response.meta['item']
        item['movie_link'] = response.url
        detail_row = response.xpath('//*[@id="Zoom"]//p/text()').extract()      # list of strings
        # join the extracted strings into one long string, then split on the ◎ marker that
        # precedes each field (主演, 導演, 上映日期, ...), so each field can be pulled out precisely
        detail_list = ''.join(detail_row).split('◎')

        logging.log(logging.INFO, '******************log movie detail*******************')
        item['movie_name'] = detail_list[1][5:].replace(6*u'\u3000', u', ')
        logging.log(logging.INFO, 'movie_link: ' + item['movie_link'])
        logging.log(logging.INFO, 'movie_name: ' + item['movie_name'])
        # find the fields that contain specific marker strings
        for field in detail_list:
            if '主\u3000\u3000演' in field:
                # strip the label and padding: drop the first five characters and replace
                # runs of six ideographic spaces (U+3000) with ', '
                item['movie_actors'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_actors: ' + item['movie_actors'])
            if '導\u3000\u3000演' in field:
                item['movie_director'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_directors: ' + item['movie_director'])
            if '上映日期' in field:
                item['movie_publish_date'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_publish_date: ' + item['movie_publish_date'])
            if '豆瓣評分' in field:
                item['movie_score'] = field[5:].replace(6*u'\u3000', u', ')
                logging.log(logging.INFO, 'movie_score: ' + item['movie_score'])

        # this extracts the Thunder (Xunlei) download link; with Thunder installed, pasting the
        # link into the browser's address bar opens the download automatically. a few pages have
        # a different structure, so no link is found there
        try:
            item['movie_download_link'] = ''.join(response.xpath('//p/a/@href').extract())
            logging.log(logging.INFO, 'movie_download_link: ' + item['movie_download_link'])
        except Exception as e:
            item['movie_download_link'] = response.url
            logging.log(logging.WARNING, e)
        yield item

Launching the Spider
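
The run.sh referenced in the transcript below never appears in the post; a minimal version, assuming it just launches the spider created earlier, might look like:

#!/bin/sh
# assumption: launch the newest_movie spider; log output lands in
# /var/log/scrapy.log via the LOG_FILE setting in settings.py
scrapy crawl newest_movie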

root@ubuntu:/home/vickey/scrapy_project/movie_heaven_bar# docker exec -it scrapy_movie /bin/bash
root@1040aa3b7363:/home/scrapy_project# ls
movie_heaven_bar
root@1040aa3b7363:/home/scrapy_project# cd movie_heaven_bar/
root@1040aa3b7363:/home/scrapy_project/movie_heaven_bar# ls
movie_heaven_bar  run.sh  scrapy.cfg
root@1040aa3b7363:/home/scrapy_project/movie_heaven_bar# sh run.sh &       # run the script in the background; log output goes to /var/log/scrapy.log
root@1040aa3b7363:/home/scrapy_project/movie_heaven_bar# exit
exit
root@ubuntu:/home/vickey/scrapy_project/movie_heaven_bar# ls
movie_heaven_bar  README.md  run.sh  scrapy.cfg
root@ubuntu:/home/vickey/scrapy_project/movie_heaven_bar# docker logs -f scrapy_movie        # docker logs -f --tail 20 scrapy_movie also shows scrapy's log output
  • Screenshot of the scrapy crawl log

[screenshot: scrapy-log]

  • Screenshot of the database

[screenshot: scrapy-db]

Wrapping Up

And that's it. Now whenever I want to watch a movie, I just copy its movie_download_link into the browser; Thunder opens the download automatically (provided it's installed), and I can watch while it downloads. Lovely.

One annoyance remains: if I stop a crawl partway through, it has to start over from the beginning, which produces duplicate rows. The next note covers scrapy's deduplication options, which avoid the duplicates and also save crawl time.

The code has been pushed to GitHub: https://github.com/Vickey-Wu/movie_heaven_bar
