Thread: the smallest unit of work the computer schedules.
For IO-bound workloads (IO requests), multiple threads work best; for CPU-bound workloads, multiple processes work best. IO requests do not occupy the CPU.
Custom thread pool.
Process: a process has a main thread by default, can host multiple threads, and those threads share the process's resources.
Custom process.
Coroutine: one thread within a process is used to complete multiple tasks; also called a micro-thread (pseudo-thread).
GIL: specific to CPython; it locks the threads within a process so that only one thread can be scheduled on the CPU at any moment.
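A minimal sketch (names are my own) of a point that follows from the note above: the GIL serializes bytecode execution, but it does not make a compound operation like `counter += 1` atomic, so a `threading.Lock` is still needed for a shared counter to stay correct:

```python
import threading

counter = 0
lock = threading.Lock()


def add(n):
    global counter
    for _ in range(n):
        # the GIL alone does not make this read-modify-write atomic
        with lock:
            counter += 1


threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000
```

Without the lock, the final count can come out below 400000 because two threads may interleave between the read and the write.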
```python
# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
import requests
# create a thread pool
from concurrent.futures import ThreadPoolExecutor
# create a process pool
from concurrent.futures import ProcessPoolExecutor


def async_url(url):
    try:
        response = requests.get(url)
        print('result:', response.url, response.content)
    except Exception as e:
        print('exception:', url, e)


url_list = [
    'http://www.baidu.com',
    'http://www.chouti.com',
    'http://www.bing.com',
    'http://www.google.com',
]

# thread pool: five threads; better suited to IO-bound requests
# (the GIL only gates CPU scheduling; it does not block IO waits)
pool = ThreadPoolExecutor(5)
# process pool: five processes; heavier on resources
pools = ProcessPoolExecutor(5)

for url in url_list:
    print('requesting:', url)
    pool.submit(async_url, url)

pool.shutdown(wait=True)

# callback: .add_done_callback(callback_function)
```
Asynchronous IO module:
import asyncio — drawbacks: it only provides TCP (plus sleep); it does not provide HTTP.
Event loop: get_event_loop()
@asyncio.coroutine and yield from must be used together; this pairing is the fixed idiom.
Asynchronous IO:

```python
# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
# asynchronous IO module
import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]

loop = asyncio.get_event_loop()  # event loop
loop.run_until_complete(asyncio.gather(*tasks))  # pass the tasks in as a list
loop.close()
```

```python
# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, str(text, encoding='utf-8'))
    writer.close()


tasks = [
    fetch_async('www.cnblogs.com', '/eric/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
```

```python
# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# HTTP requests via aiohttp together with asyncio (aiohttp does the IO)
import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
    print(url)
    response = yield from aiohttp.request('GET', url)
    # data = yield from response.read()
    # print(url, data)
    print(url, response)
    response.close()
```

```python
# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# asyncio combined with requests can also support HTTP
# (requests runs in the loop's thread-pool executor)
import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
    print(args)
    # event loop
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/eric/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
```
```python
# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import gevent
from gevent import monkey
monkey.patch_all()

import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])
```
```python
# Install Twisted first:
#   pip3 install wheel
#   b. download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
#   c. cd into the download directory, then run:
#      pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
from twisted.web.client import getPage
from twisted.internet import reactor

REV_COUNTER = 0
REQ_COUNTER = 0


def callback(contents):
    print(contents)
    global REV_COUNTER
    REV_COUNTER += 1
    if REV_COUNTER == REQ_COUNTER:
        reactor.stop()  # stop the reactor once every request has returned


url_list = ['http://www.bing.com', 'http://www.baidu.com']
REQ_COUNTER = len(url_list)

for url in url_list:
    print(url)
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)

reactor.run()
```
import socket: provides the standard BSD Sockets API, giving access to the full set of low-level operating-system socket methods.
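A minimal sketch of that API (my own example): `socket.socketpair()` returns two already-connected sockets, so `sendall`/`recv`/`close` can be exercised without touching the network:

```python
import socket

# two connected sockets, no network needed
a, b = socket.socketpair()

a.sendall(b"GET / HTTP/1.0\r\n\r\n")
data = b.recv(1024)  # the bytes arrive unchanged on the peer socket
print(data)

b.sendall(b"HTTP/1.0 200 OK\r\n\r\n")
reply = a.recv(1024)
print(reply)

a.close()
b.close()
```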
How the Tornado framework works
Custom asynchronous IO:
Built on sockets with setblocking(False).
IO multiplexing (which is still synchronous IO):

while True:
    r, w, e = select.select([], [], [], 1)
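The loop above can be sketched end-to-end (my own example, using a socket pair in place of a real connection): select.select waits until a non-blocking socket becomes readable, with a 1-second timeout:

```python
import select
import socket

# a connected socket pair stands in for a real network connection
a, b = socket.socketpair()
a.setblocking(False)  # the non-blocking mode the notes describe
b.setblocking(False)

a.sendall(b"ping")

msg = None
while msg is None:
    # r: readable sockets, w: writable, e: errored; timeout of 1 second
    r, w, e = select.select([b], [], [], 1)
    if b in r:
        msg = b.recv(1024)

print(msg)  # b'ping'

a.close()
b.close()
```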
A detailed blog post on IO (event-driven IO models): http://www.javashuo.com/article/p-evfexhvh-nt.html
Linux
    pip3 install scrapy
Windows
    1. pip3 install wheel
       Install Twisted (the version below only shows the filename format; it is not necessarily the right version for your machine):
       a. Download the matching wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, e.g. Twisted-19.1.0-cp37-cp37m-win_amd64.whl
       b. cd into the directory containing the file
       c. pip3 install Twisted-19.1.0-cp37-cp37m-win_amd64.whl
    2. pip3 install scrapy (this version conflicts with the urllib3 module; if urllib3 is installed, uninstall it first)
    3. On Windows, scrapy also depends on pywin32: https://sourceforge.net/projects/pywin32/files/
# 部分項目代碼展現,爬取優美圖庫圖片 # -*- coding: utf-8 -*- import scrapy from scrapy.http import Request from bs4 import BeautifulSoup class UmeiSpider(scrapy.Spider): name = 'umei' allowed_domains = ['umei.cc'] start_urls = ['https://www.umei.cc/meinvtupian/meinvxiezhen/1.htm'] visited_set = set() def parse(self, response): self.visited_set.add(response.url) # 已經爬取的網頁 # 1.將當前頁全部的meizi圖片爬下來 # 獲取a標籤而且屬性爲 class = TypeBigPics main_page = BeautifulSoup(response.text, "html.parser") item_list = main_page.find_all("a", attrs={'class': 'TypeBigPics'}) for item in item_list: item = item.find_all("img",) print(item) # 2.獲取:https://www.umei.cc/meinvtupian/meinvxiezhen/(\d+).htm page_list = main_page.find_all("div", attrs={'class': 'NewPages'}) a_urls = 'https://www.umei.cc/meinvtupian/meinvxiezhen/' a_list = page_list[0].find_all("a") a_href = set() for a in a_list: a = a.get('href') if a: a_href.add(a_urls+a) else: pass for i in a_href: if i in self.visited_set: pass else: obj = Request(url=i, method='GET', callback=self.parse) yield obj print("obj:", obj)