Crawler basics and the Scrapy framework: usage and core principles

Crawlers

1. Asynchronous IO

Thread: the thread is the smallest unit of work the computer schedules.

For IO requests (IO-bound work) multithreading is the better fit; for CPU-bound work multiprocessing gives the best concurrency, because IO requests barely touch the CPU.

Custom thread pools (a minimal hand-rolled version is sketched below)
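
A hand-rolled thread pool is essentially a task queue plus a fixed set of worker threads pulling work off it. The sketch below is only an illustration of that idea; the SimpleThreadPool name and its methods are made up for this example.

import threading
import queue


class SimpleThreadPool:
    """A minimal custom thread pool: workers pull callables off a shared queue."""

    def __init__(self, num_workers=5):
        self.tasks = queue.Queue()
        for _ in range(num_workers):
            t = threading.Thread(target=self._worker, daemon=True)
            t.start()

    def _worker(self):
        while True:
            func, args = self.tasks.get()
            try:
                func(*args)
            finally:
                self.tasks.task_done()

    def submit(self, func, *args):
        self.tasks.put((func, args))

    def join(self):
        self.tasks.join()  # block until every submitted task has finished


# usage: pool = SimpleThreadPool(5); pool.submit(print, 'hello'); pool.join()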

Process: a process has a main thread by default, can contain multiple threads at once, and those threads share the process's resources.

Custom processes

Coroutine: a single thread inside a process is used to service multiple tasks; also called a micro-thread (pseudo-thread).

GIL: specific to CPython; it locks the threads inside a process so that only one thread can be scheduled onto the CPU at any given moment.
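
A quick way to see the GIL's effect is to time the same CPU-bound function under a thread pool and under a process pool; the sketch below is only an experiment outline, and the loop size and pool width of 4 are arbitrary illustration values.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def count(n=10_000_000):
    # pure CPU work: the GIL prevents threads from running this in parallel
    while n > 0:
        n -= 1


def timed(executor_cls):
    start = time.time()
    with executor_cls(4) as ex:          # 4 workers, arbitrary choice
        for _ in range(4):
            ex.submit(count)
    # leaving the with-block waits for all submitted work to finish
    return time.time() - start


if __name__ == '__main__':  # the guard is required for process pools on Windows
    print('threads:  ', timed(ThreadPoolExecutor))   # roughly serial because of the GIL
    print('processes:', timed(ProcessPoolExecutor))  # can use several CPU cores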

# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
import requests
# thread pool (for multithreading)
from concurrent.futures import ThreadPoolExecutor
# process pool (for multiprocessing)
from concurrent.futures import ProcessPoolExecutor


def async_url(url):
    try:
        response = requests.get(url)
    except Exception as e:
        # the request failed, so there is no response object to inspect
        print('request failed:', url, e)
        return
    print('result:', response.url, response.content)


url_list = [
    'http://www.baidu.com',
    'http://www.chouti.com',
    'http://www.bing.com',
    'http://www.google.com',
]
# thread pool: five worker threads; threads suit IO-bound requests best
# the GIL only serializes CPU execution; it is released while a thread waits on IO
pool = ThreadPoolExecutor(5)
# process pool: five worker processes; heavier than threads and wasteful for IO-bound work
pools = ProcessPoolExecutor(5)

for url in url_list:
    print('requesting:', url)
    pool.submit(async_url, url)

pool.shutdown(wait=True)

# callback: register a function with .add_done_callback(callback_function)
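
A small self-contained sketch of that callback hook; the fetch and callback_result names are just for illustration.

import requests
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    return requests.get(url)


def callback_result(future):
    # invoked once fetch() has finished (normally in the worker thread)
    response = future.result()
    print('callback got:', response.url, len(response.content))


pool = ThreadPoolExecutor(5)
future = pool.submit(fetch, 'http://www.baidu.com')
future.add_done_callback(callback_result)
pool.shutdown(wait=True)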

Asynchronous IO modules (as an alternative to multithreading)

import asyncio — limitation: it only provides TCP support and sleep; it does not ship an HTTP client for concurrent requests.

Event loop: get_event_loop()

@asyncio.coroutine and yield from must be used together; that is the fixed legacy idiom (on Python 3.5+ the async def / await syntax replaces this pair; see the example after the legacy demo below).

Asynchronous IO options:

  • asyncio + aiohttp, or asyncio + requests (with requests run in an executor)
  • gevent + requests: combining the two gave rise to the grequests library (see the sketch after the gevent example below)
  • twisted
  • tornado: asynchronous, non-blocking IO
# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
# asyncio demo (legacy coroutine style)
import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]
loop = asyncio.get_event_loop()  # event loop
loop.run_until_complete(asyncio.gather(*tasks))  # pass the tasks in as a list
loop.close()
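
For comparison, the same demo can be written with async def and await on Python 3.5+ (and asyncio.run on 3.7+) instead of the legacy decorator:

import asyncio


async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


async def main():
    await asyncio.gather(func1(), func1())


asyncio.run(main())  # Python 3.7+: creates and closes the event loop for you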

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, str(text, encoding='utf-8'))
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/eric/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# making HTTP requests with aiohttp + asyncio
# note: the original yield-from form targeted the old aiohttp 0.x API;
# current aiohttp exposes aiohttp.request() as an async context manager
import aiohttp
import asyncio


async def fetch_async(url):
    print(url)
    async with aiohttp.request('GET', url) as response:
        data = await response.read()
        print(url, response.status, len(data))


tasks = [
    fetch_async('http://www.cnblogs.com/eric/'),
    fetch_async('http://www.baidu.com'),
]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()


# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# asyncio can also drive requests for HTTP by running it in a thread-pool executor
import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
    print(args)
    # event loop
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/eric/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import gevent
from gevent import monkey
monkey.patch_all()

import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])
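
The grequests library mentioned earlier bundles exactly this gevent + requests pattern; assuming it is installed (pip install grequests), usage looks roughly like this:

import grequests

urls = ['https://www.python.org/', 'https://github.com/']
pending = (grequests.get(u) for u in urls)
# map() sends the requests concurrently and returns the responses (None on failure)
for response in grequests.map(pending):
    if response is not None:
        print(response.url, len(response.content))
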
# pip3 install twisted
# pip3 install wheel
#       b. download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
#       c. cd into the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl


from twisted.web.client import getPage  # note: getPage is deprecated in newer Twisted releases
from twisted.internet import reactor

REV_COUNTER = 0
REQ_COUNTER = 0

def callback(contents):
    print(contents,)

    global REV_COUNTER
    REV_COUNTER += 1
    if REV_COUNTER == REQ_COUNTER:
        reactor.stop()


url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
REQ_COUNTER = len(url_list)
for url in url_list:
    print(url)
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
reactor.run()

import socket: it provides the standard BSD sockets API and gives access to the full set of low-level operating-system socket operations.

How the tornado framework works

Custom asynchronous IO:
    built on sockets with setblocking(False)
IO multiplexing (which is still synchronous IO):
    while True:
        r, w, e = select.select([], [], [], 1)
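
Putting those two pieces together, a bare-bones sketch of the idea looks like the code below: non-blocking sockets polled with select. Error handling and response parsing are omitted, and the host names are just the ones used in the earlier examples.

import select
import socket


def async_fetch(hosts):
    writable_pending = {}   # sockets still waiting for the connection to complete
    readable_pending = {}   # sockets waiting for the response
    for host in hosts:
        sk = socket.socket()
        sk.setblocking(False)
        try:
            sk.connect((host, 80))
        except BlockingIOError:
            pass  # expected: the connect finishes in the background
        writable_pending[sk] = host

    while writable_pending or readable_pending:
        r, w, e = select.select(list(readable_pending), list(writable_pending), [], 1)
        for sk in w:  # connection established: send the request
            host = writable_pending.pop(sk)
            sk.send(b'GET / HTTP/1.0\r\nHost: ' + host.encode() + b'\r\n\r\n')
            readable_pending[sk] = host
        for sk in r:  # (part of) the response has arrived
            host = readable_pending.pop(sk)
            print(host, sk.recv(8192)[:60])
            sk.close()


async_fetch(['www.baidu.com', 'www.cnblogs.com'])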

A more detailed post on IO (the event-driven IO model): http://www.javashuo.com/article/p-evfexhvh-nt.html

2. The Scrapy framework

Installing Scrapy

Linux
    pip3 install scrapy
Windows
    1.
    pip3 install wheel
    Install Twisted (the file name below only shows the format; pick the build that matches your Python version and architecture):
    a. go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download, e.g., Twisted-19.1.0-cp37-cp37m-win_amd64.whl
    b. cd into the directory containing the file
    c. pip3 install Twisted-19.1.0-cp37-cp37m-win_amd64.whl

    2.
    pip3 install scrapy (this can conflict with an already-installed urllib3 module; if that module is present, uninstall it first)
    3.
    On Windows, Scrapy also depends on pywin32: https://sourceforge.net/projects/pywin32/files/

Creating and running a project

  1. How to use Scrapy
  2. Create a new project: scrapy startproject scy (run this in the directory where the project should live; scy is the project name)
  3. Create a spider: scrapy genspider example example.com (cd into the project directory first; example is the spider file name, example.com is the site to crawl)
  4. Run a spider: scrapy crawl chouti (chouti is the name of the spider file to run)
  5. Suppress the log output: scrapy crawl chouti --nolog (runs the chouti spider without printing the crawl log)
  6. List the spider templates: scrapy genspider --list (shows the four templates: basic, crawl, csvfeed, xmlfeed)
  7. To keep the spider from being blocked by robots.txt rules, set ROBOTSTXT_OBEY = False in the project's settings file (see the settings sketch after this list)
  8. project_name/
    • scrapy.cfg  the project's main configuration file
    • project_name/
      • __init__.py
      • items.py  data-storage templates for structured data, similar to Django's Model
      • pipelines.py  data-processing behaviour, e.g. persisting the structured data (a minimal items/pipeline sketch follows the spider code below)
      • settings.py  the real configuration file: recursion depth, concurrency, download delay, etc.
      • spiders/  the spider directory: create spider files here and write the crawl rules
        • __init__.py
        • spider1.py
        • spider2.py
  9. Note: spiders must still be created from the command line, and both the project and individual spider files are run from the command line
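
As item 7 notes, these switches live in the project's settings.py; a minimal sketch of the relevant lines (the values shown are only illustrative defaults to tweak):

# settings.py (fragment)
ROBOTSTXT_OBEY = False      # do not let robots.txt rules block the spider
DEPTH_LIMIT = 2             # how many levels of links to follow recursively
CONCURRENT_REQUESTS = 16    # number of requests Scrapy issues in parallel
DOWNLOAD_DELAY = 1          # seconds to wait between downloads from the same site
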
# partial project code: crawling images from the Umei gallery (umei.cc)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from bs4 import BeautifulSoup


class UmeiSpider(scrapy.Spider):
    name = 'umei'
    allowed_domains = ['umei.cc']
    start_urls = ['https://www.umei.cc/meinvtupian/meinvxiezhen/1.htm']
    visited_set = set()

    def parse(self, response):
        self.visited_set.add(response.url)  # pages that have already been crawled
        # 1. pull down every gallery image link on the current page
        # find the <a> tags whose class attribute is TypeBigPics
        main_page = BeautifulSoup(response.text, "html.parser")
        item_list = main_page.find_all("a", attrs={'class': 'TypeBigPics'})
        for item in item_list:
            item = item.find_all("img",)
            print(item)

        # 2. collect the pagination links: https://www.umei.cc/meinvtupian/meinvxiezhen/(\d+).htm
        page_list = main_page.find_all("div", attrs={'class': 'NewPages'})
        a_urls = 'https://www.umei.cc/meinvtupian/meinvxiezhen/'
        a_list = page_list[0].find_all("a")
        a_href = set()
        for a in a_list:
            a = a.get('href')
            if a:
                a_href.add(a_urls+a)
            else:
                pass
        for i in a_href:
            if i in self.visited_set:
                pass
            else:
                obj = Request(url=i, method='GET', callback=self.parse)
                yield obj
                print("obj:", obj)