Crawler basics and the Scrapy framework: usage and core principles

Crawlers

1. Asynchronous IO

Thread: the thread is the smallest unit of work the computer schedules.

For IO requests (IO-bound work) multithreading is the better fit; for CPU-bound work multiprocessing gives the best concurrency, because IO requests barely touch the CPU.

Custom thread pools (a minimal hand-rolled version is sketched below)
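
A hand-rolled thread pool is essentially a task queue plus a fixed set of worker threads pulling work off it. The sketch below is only an illustration of that idea; the SimpleThreadPool name and its methods are made up for this example.

import threading
import queue


class SimpleThreadPool:
    """A minimal custom thread pool: workers pull callables off a shared queue."""

    def __init__(self, num_workers=5):
        self.tasks = queue.Queue()
        for _ in range(num_workers):
            t = threading.Thread(target=self._worker, daemon=True)
            t.start()

    def _worker(self):
        while True:
            func, args = self.tasks.get()
            try:
                func(*args)
            finally:
                self.tasks.task_done()

    def submit(self, func, *args):
        self.tasks.put((func, args))

    def join(self):
        self.tasks.join()  # block until every submitted task has finished


# usage: pool = SimpleThreadPool(5); pool.submit(print, 'hello'); pool.join()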

Process: a process has a main thread by default, can contain multiple threads at once, and those threads share the process's resources.

Custom processes

Coroutine: a single thread inside a process is used to service multiple tasks; also called a micro-thread (pseudo-thread).

GIL: specific to CPython; it locks the threads inside a process so that only one thread can be scheduled onto the CPU at any given moment.
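
A quick way to see the GIL's effect is to time the same CPU-bound function under a thread pool and under a process pool; the sketch below is only an experiment outline, and the loop size and pool width of 4 are arbitrary illustration values.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def count(n=10_000_000):
    # pure CPU work: the GIL prevents threads from running this in parallel
    while n > 0:
        n -= 1


def timed(executor_cls):
    start = time.time()
    with executor_cls(4) as ex:          # 4 workers, arbitrary choice
        for _ in range(4):
            ex.submit(count)
    # leaving the with-block waits for all submitted work to finish
    return time.time() - start


if __name__ == '__main__':  # the guard is required for process pools on Windows
    print('threads:  ', timed(ThreadPoolExecutor))   # roughly serial because of the GIL
    print('processes:', timed(ProcessPoolExecutor))  # can use several CPU cores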

# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
import requests
# thread pool (for multithreading)
from concurrent.futures import ThreadPoolExecutor
# process pool (for multiprocessing)
from concurrent.futures import ProcessPoolExecutor


def async_url(url):
    try:
        response = requests.get(url)
    except Exception as e:
        # the request failed, so there is no response object to inspect
        print('request failed:', url, e)
        return
    print('result:', response.url, response.content)


url_list = [
    'http://www.baidu.com',
    'http://www.chouti.com',
    'http://www.bing.com',
    'http://www.google.com',
]
# thread pool: five worker threads; threads suit IO-bound requests best
# the GIL only serializes CPU execution; it is released while a thread waits on IO
pool = ThreadPoolExecutor(5)
# process pool: five worker processes; heavier than threads and wasteful for IO-bound work
pools = ProcessPoolExecutor(5)

for url in url_list:
    print('requesting:', url)
    pool.submit(async_url, url)

pool.shutdown(wait=True)

# callback: register a function with .add_done_callback(callback_function)
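
A small self-contained sketch of that callback hook; the fetch and callback_result names are just for illustration.

import requests
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    return requests.get(url)


def callback_result(future):
    # invoked once fetch() has finished (normally in the worker thread)
    response = future.result()
    print('callback got:', response.url, len(response.content))


pool = ThreadPoolExecutor(5)
future = pool.submit(fetch, 'http://www.baidu.com')
future.add_done_callback(callback_result)
pool.shutdown(wait=True)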

Asynchronous IO modules (as an alternative to multithreading)

import asyncio — limitation: it only provides TCP support and sleep; it does not ship an HTTP client for concurrent requests.

Event loop: get_event_loop()

@asyncio.coroutine and yield from must be used together; that is the fixed legacy idiom (on Python 3.5+ the async def / await syntax replaces this pair; see the example after the legacy demo below).

Asynchronous IO options:

  • asyncio + aiohttp, or asyncio + requests (with requests run in an executor)
  • gevent + requests: combining the two gave rise to the grequests library (see the sketch after the gevent example below)
  • twisted
  • tornado: asynchronous, non-blocking IO
# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
# asyncio demo (legacy coroutine style)
import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]
loop = asyncio.get_event_loop()  # event loop
loop.run_until_complete(asyncio.gather(*tasks))  # pass the tasks in as a list
loop.close()
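
For comparison, the same demo can be written with async def and await on Python 3.5+ (and asyncio.run on 3.7+) instead of the legacy decorator:

import asyncio


async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


async def main():
    await asyncio.gather(func1(), func1())


asyncio.run(main())  # Python 3.7+: creates and closes the event loop for you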

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, str(text, encoding='utf-8'))
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/eric/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# making HTTP requests with aiohttp + asyncio
# note: the original yield-from form targeted the old aiohttp 0.x API;
# current aiohttp exposes aiohttp.request() as an async context manager
import aiohttp
import asyncio


async def fetch_async(url):
    print(url)
    async with aiohttp.request('GET', url) as response:
        data = await response.read()
        print(url, response.status, len(data))


tasks = [
    fetch_async('http://www.cnblogs.com/eric/'),
    fetch_async('http://www.baidu.com'),
]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()


# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# asyncio can also drive requests for HTTP by running it in a thread-pool executor
import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
    print(args)
    # event loop
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/eric/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import gevent
from gevent import monkey
monkey.patch_all()

import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])
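
The grequests library mentioned earlier bundles exactly this gevent + requests pattern; assuming it is installed (pip install grequests), usage looks roughly like this:

import grequests

urls = ['https://www.python.org/', 'https://github.com/']
pending = (grequests.get(u) for u in urls)
# map() sends the requests concurrently and returns the responses (None on failure)
for response in grequests.map(pending):
    if response is not None:
        print(response.url, len(response.content))
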
# pip3 install twisted
# pip3 install wheel
#       b. download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
#       c. cd into the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl


from twisted.web.client import getPage  # note: getPage is deprecated in newer Twisted releases
from twisted.internet import reactor

REV_COUNTER = 0
REQ_COUNTER = 0

def callback(contents):
    print(contents,)

    global REV_COUNTER
    REV_COUNTER += 1
    if REV_COUNTER == REQ_COUNTER:
        reactor.stop()


url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
REQ_COUNTER = len(url_list)
for url in url_list:
    print(url)
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
reactor.run()

import socket: it provides the standard BSD sockets API and gives access to the full set of low-level operating-system socket operations.

How the tornado framework works

Custom asynchronous IO:
    built on sockets with setblocking(False)
IO multiplexing (which is still synchronous IO):
    while True:
        r, w, e = select.select([], [], [], 1)
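
Putting those two pieces together, a bare-bones sketch of the idea looks like the code below: non-blocking sockets polled with select. Error handling and response parsing are omitted, and the host names are just the ones used in the earlier examples.

import select
import socket


def async_fetch(hosts):
    writable_pending = {}   # sockets still waiting for the connection to complete
    readable_pending = {}   # sockets waiting for the response
    for host in hosts:
        sk = socket.socket()
        sk.setblocking(False)
        try:
            sk.connect((host, 80))
        except BlockingIOError:
            pass  # expected: the connect finishes in the background
        writable_pending[sk] = host

    while writable_pending or readable_pending:
        r, w, e = select.select(list(readable_pending), list(writable_pending), [], 1)
        for sk in w:  # connection established: send the request
            host = writable_pending.pop(sk)
            sk.send(b'GET / HTTP/1.0\r\nHost: ' + host.encode() + b'\r\n\r\n')
            readable_pending[sk] = host
        for sk in r:  # (part of) the response has arrived
            host = readable_pending.pop(sk)
            print(host, sk.recv(8192)[:60])
            sk.close()


async_fetch(['www.baidu.com', 'www.cnblogs.com'])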

A more detailed post on IO (the event-driven IO model): http://www.javashuo.com/article/p-evfexhvh-nt.html

2. The Scrapy framework

Installing Scrapy

Linux
    pip3 install scrapy
Windows
    1.
    pip3 install wheel
    Install Twisted (the file name below only shows the format; pick the build that matches your Python version and architecture):
    a. go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download, e.g., Twisted-19.1.0-cp37-cp37m-win_amd64.whl
    b. cd into the directory containing the file
    c. pip3 install Twisted-19.1.0-cp37-cp37m-win_amd64.whl

    2.
    pip3 install scrapy (this can conflict with an already-installed urllib3 module; if that module is present, uninstall it first)
    3.
    On Windows, Scrapy also depends on pywin32: https://sourceforge.net/projects/pywin32/files/

Creating and running a project

  1. How to use Scrapy
  2. Create a new project: scrapy startproject scy (run this in the directory where the project should live; scy is the project name)
  3. Create a spider: scrapy genspider example example.com (cd into the project directory first; example is the spider file name, example.com is the site to crawl)
  4. Run a spider: scrapy crawl chouti (chouti is the name of the spider file to run)
  5. Suppress the log output: scrapy crawl chouti --nolog (runs the chouti spider without printing the crawl log)
  6. List the spider templates: scrapy genspider --list (shows the four templates: basic, crawl, csvfeed, xmlfeed)
  7. To keep the spider from being blocked by robots.txt rules, set ROBOTSTXT_OBEY = False in the project's settings file (see the settings sketch after this list)
  8. project_name/
    • scrapy.cfg  the project's main configuration file
    • project_name/
      • __init__.py
      • items.py  data-storage templates for structured data, similar to Django's Model
      • pipelines.py  data-processing behaviour, e.g. persisting the structured data (a minimal items/pipeline sketch follows the spider code below)
      • settings.py  the real configuration file: recursion depth, concurrency, download delay, etc.
      • spiders/  the spider directory: create spider files here and write the crawl rules
        • __init__.py
        • spider1.py
        • spider2.py
  9. Note: spiders must still be created from the command line, and both the project and individual spider files are run from the command line
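
As item 7 notes, these switches live in the project's settings.py; a minimal sketch of the relevant lines (the values shown are only illustrative defaults to tweak):

# settings.py (fragment)
ROBOTSTXT_OBEY = False      # do not let robots.txt rules block the spider
DEPTH_LIMIT = 2             # how many levels of links to follow recursively
CONCURRENT_REQUESTS = 16    # number of requests Scrapy issues in parallel
DOWNLOAD_DELAY = 1          # seconds to wait between downloads from the same site
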
# partial project code: crawling images from the Umei gallery (umei.cc)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from bs4 import BeautifulSoup


class UmeiSpider(scrapy.Spider):
    name = 'umei'
    allowed_domains = ['umei.cc']
    start_urls = ['https://www.umei.cc/meinvtupian/meinvxiezhen/1.htm']
    visited_set = set()

    def parse(self, response):
        self.visited_set.add(response.url)  # pages that have already been crawled
        # 1. pull down every gallery image link on the current page
        # find the <a> tags whose class attribute is TypeBigPics
        main_page = BeautifulSoup(response.text, "html.parser")
        item_list = main_page.find_all("a", attrs={'class': 'TypeBigPics'})
        for item in item_list:
            item = item.find_all("img",)
            print(item)

        # 2. collect the pagination links: https://www.umei.cc/meinvtupian/meinvxiezhen/(\d+).htm
        page_list = main_page.find_all("div", attrs={'class': 'NewPages'})
        a_urls = 'https://www.umei.cc/meinvtupian/meinvxiezhen/'
        a_list = page_list[0].find_all("a")
        a_href = set()
        for a in a_list:
            a = a.get('href')
            if a:
                a_href.add(a_urls+a)
            else:
                pass
        for i in a_href:
            if i in self.visited_set:
                pass
            else:
                obj = Request(url=i, method='GET', callback=self.parse)
                yield obj
                print("obj:", obj)