A detailed introduction to the requests library and the parameters of its functions
A detailed introduction to using httpclient in tornado
Prerequisite: basic Python syntax (2.7)
Here we go!
Step 2: install the requests library with pip (pip install requests)
Step 3: run the following command:
    python -c 'import requests; print(requests.get("http://www.baidu.com").content)'
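However, as you know, the Python interpreter uses one big lock, the GIL, to guarantee that only one thread is scheduled at any moment while the interpreter (the Python VM) executes code. Multithreading in CPython is therefore pseudo-multithreading: for Python code it is essentially still single-threaded, although threads can overlap while they are blocked on I/O, where the GIL is released. Note: the GIL exists in CPython; Jython does not have this restriction (http://www.jython.org/jythonbook/en/1.0/Concurrency.html)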
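Coroutines are already supported in many programming languages. Python implements them with the yield keyword. Today I will introduce tornado, an asynchronous, non-blocking framework built on coroutines. Using it for network requests avoids the cost of creating and switching threads, so it is generally more efficient than running requests in multiple threads.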
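In procedural programming we wrap a block of code in a function, and such a function has one entry point and one exit: when we call it, the caller waits until it finishes before running any further code. A coroutine, by contrast, can be entered and exited multiple times within a single thread. When we call a coroutine function, it can return temporarily at a suspension point so that other coroutines get to run. (This is a little like multithreading, where the CPU schedules other threads when one thread blocks.)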
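To make the "enter many times, return many times" idea concrete, here is a minimal sketch using only plain generators and the standard library. The names worker and run_round_robin are made up for this illustration and are not part of tornado; the scheduler is a toy stand-in for what an event loop does.

```python
from collections import deque


def worker(name, steps, log):
    """A generator-based coroutine: every `yield` is a suspension point."""
    for step in range(1, steps + 1):
        log.append("{} step {}".format(name, step))
        yield  # hand control back to the scheduler
    log.append("{} done".format(name))


def run_round_robin(coroutines):
    """Resume each coroutine in turn until all of them have finished."""
    queue = deque(coroutines)
    while queue:
        coroutine = queue.popleft()
        try:
            next(coroutine)  # re-enter the function at its last yield
        except StopIteration:
            continue  # this coroutine has finished; do not reschedule it
        queue.append(coroutine)  # still alive, so schedule it again


log = []
run_round_robin([worker("A", 2, log), worker("B", 1, log)])
print("\n".join(log))
```

The two workers interleave within one thread: each one runs until its next yield, and the scheduler then picks the next one in the queue.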
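The code below shows this in action. The logic is simple: we add show_my_sleep to the IOLoop four times, each time with a different argument. show_my_sleep prints a message, sleeps, and prints another message. From the output we can see that show_my_sleep goes to sleep at the yield statement, temporarily giving up control, and resumes from the statement after the yield once the sleep is over.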
    import random

    from tornado import gen
    from tornado.ioloop import IOLoop


    @gen.coroutine
    def show_my_sleep(idx):
        interval = random.uniform(5, 20)
        print("[{}] is going to sleep {} seconds!".format(idx, interval))
        yield gen.sleep(interval)  # suspension point: hands control back to the IOLoop
        print("[{}] wake up!!".format(idx))


    def main():
        io_loop = IOLoop.current()
        io_loop.spawn_callback(show_my_sleep, 1)  # schedule this function on the next loop iteration
        io_loop.spawn_callback(show_my_sleep, 2)
        io_loop.spawn_callback(show_my_sleep, 3)
        io_loop.spawn_callback(show_my_sleep, 4)
        io_loop.start()


    if __name__ == "__main__":
        main()
Output:
    [1] is going to sleep 5.19272014406 seconds!
    [2] is going to sleep 9.42334286914 seconds!
    [3] is going to sleep 5.11032311172 seconds!
    [4] is going to sleep 13.0816614451 seconds!
    [3] wake up!!
    [1] wake up!!
    [2] wake up!!
    [4] wake up!!
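Tornado is an asynchronous networking framework implemented in Python. Because it uses non-blocking I/O, it can handle many thousands of concurrent connections, which makes it a good fit for long polling, WebSockets, and other applications that need long-lived connections. Tornado consists of four main parts:
- A web framework, including RequestHandler (which is used to build web applications) and various supporting classes.
- Client-side and server-side implementations of HTTP (HTTPServer and AsyncHTTPClient).
- An asynchronous networking library containing IOLoop and IOStream, which serve as the building blocks of the HTTP components and can also be used to implement other protocols.
- A coroutine library (tornado.gen), which lets asynchronous code be written more directly than with chained callbacks.
Together, the Tornado web framework and HTTP server offer a full-stack alternative to WSGI. You can use the Tornado web framework inside a WSGI container, or use the HTTP server as a container for other WSGI frameworks, but either combination has drawbacks. To take full advantage of tornado, you need to use tornado's web framework and HTTP server together.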
Here we mainly borrow tornado's httpclient and coroutine library to issue concurrent network requests from a single thread.
Here, let me show you the code!
    import traceback

    from tornado import gen
    from tornado.curl_httpclient import CurlAsyncHTTPClient
    from tornado.httpclient import HTTPRequest
    from tornado.ioloop import IOLoop


    @gen.coroutine
    def fetch_url(url):
        """Fetch a URL."""
        try:
            c = CurlAsyncHTTPClient()  # create an HTTP client (requires pycurl)
            req = HTTPRequest(url=url)  # build the request
            response = yield c.fetch(req)  # send the request
            print(response.body)
        except Exception:
            print(traceback.format_exc())
        finally:
            IOLoop.current().stop()  # stop the IOLoop, even if the fetch failed


    def main():
        io_loop = IOLoop.current()
        io_loop.spawn_callback(fetch_url, "http://www.baidu.com")  # add the coroutine to the IOLoop
        io_loop.start()


    if __name__ == "__main__":
        main()
    def main():
        io_loop = IOLoop.current()
        io_loop.spawn_callback(fetch_url, "http://www.baidu.com")  # schedule this function on the next loop iteration
        '''
        io_loop.spawn_callback(fetch_url, url1)
        ... ...
        io_loop.spawn_callback(fetch_url, urln)
        '''
        io_loop.start()


    if __name__ == "__main__":
        main()
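One caveat when queuing several URLs this way: fetch_url stops the IOLoop as soon as a fetch finishes, so the loop may end before the other responses arrive. A possible alternative, sketched below under some assumptions, is to start all the fetches and wait for them together. fetch_all and run are hypothetical helper names introduced here, fetch_url is rewritten to return a value instead of stopping the loop, and AsyncHTTPClient is swapped in for CurlAsyncHTTPClient so that pycurl is not required.

```python
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop


@gen.coroutine
def fetch_url(url):
    """Fetch one URL and return (url, status code or error text)."""
    client = AsyncHTTPClient()
    try:
        response = yield client.fetch(url)
    except Exception as exc:
        result = (url, "error: {}".format(exc))
    else:
        result = (url, response.code)
    raise gen.Return(result)  # gen.Return also works on Python 2.7


@gen.coroutine
def fetch_all(urls, fetch=fetch_url):
    """Start every fetch, then wait until all of them have finished."""
    # Calling a coroutine returns a Future right away, so all requests are
    # in flight before this coroutine suspends on the list of Futures.
    results = yield [fetch(url) for url in urls]
    raise gen.Return(results)


def run(urls):
    """Run fetch_all to completion; run_sync starts and stops the loop."""
    return IOLoop.current().run_sync(lambda: fetch_all(urls))
```

Calling run(["http://www.baidu.com"]) would return a list with one (url, status) tuple per URL, in the same order as the input. The fetch parameter exists only so that the network call can be replaced, for example in a test.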
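When we build crawlers with requests, we mainly use the get and post methods. In addition, to cope with anti-crawler measures, we add some custom HTTP headers. From that practical angle, let's look at the two key functions of requests: get and post.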
    def get(url, params=None, **kwargs):
        """Sends a GET request.

        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
        :param \*\*kwargs: Optional arguments that ``request`` takes.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response
        """

        kwargs.setdefault('allow_redirects', True)
        return request('get', url, params=params, **kwargs)
    def post(url, data=None, json=None, **kwargs):
        """Sends a POST request.

        :param url: URL for the new :class:`Request` object.
        :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
        :param json: (optional) json data to send in the body of the :class:`Request`.
        :param \*\*kwargs: Optional arguments that ``request`` takes.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response
        """

        return request('post', url, data=data, json=json, **kwargs)
As we can see, both get and post in requests call the request function, which in turn hands the work to Session.request. That method is defined as follows:
    def request(self, method, url,
                params=None, data=None, headers=None, cookies=None, files=None,
                auth=None, timeout=None, allow_redirects=True, proxies=None,
                hooks=None, stream=None, verify=None, cert=None, json=None):
        """Constructs a :class:`Request <Request>`, prepares it and sends it.
        Returns :class:`Response <Response>` object.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query
            string for the :class:`Request`.
        :param data: (optional) Dictionary, bytes, or file-like object to send
            in the body of the :class:`Request`.
        :param json: (optional) json to send in the body of the
            :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the
            :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the
            :class:`Request`.
        :param files: (optional) Dictionary of ``'filename': file-like-objects``
            for multipart encoding upload.
        :param auth: (optional) Auth tuple or callable to enable
            Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send
            data before giving up, as a float, or a :ref:`(connect timeout,
            read timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Set to True by default.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol or protocol and
            hostname to the URL of the proxy.
        :param stream: (optional) whether to immediately download the response
            content. Defaults to ``False``.
        :param verify: (optional) whether the SSL cert will be verified.
            A CA_BUNDLE path can also be provided. Defaults to ``True``.
        :param cert: (optional) if String, path to ssl client cert file (.pem).
            If Tuple, ('cert', 'key') pair.
        :rtype: requests.Response
        """
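To see what a few of these parameters do without touching the network, one option is to build the request objects and inspect them before sending. The sketch below uses requests.Request and its prepare() method; the header values and the query and form fields are made up for this illustration.

```python
import requests

# Hypothetical header values, chosen only for illustration.
CUSTOM_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) demo-crawler",
    "Referer": "http://www.baidu.com",
}

# Build the requests without sending them, so the effect of each
# parameter can be inspected offline.
get_req = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "python"},  # encoded into the query string
    headers=CUSTOM_HEADERS,
).prepare()

post_req = requests.Request(
    "POST",
    "http://www.baidu.com",
    data={"name": "tornado"},  # form-encoded into the request body
    headers=CUSTOM_HEADERS,
).prepare()

print(get_req.url)
print(get_req.headers["User-Agent"])
print(post_req.headers["Content-Type"])
```

In a real crawler the same keyword arguments go straight to the helpers, for example requests.get(url, params=..., headers=..., timeout=5). Passing a timeout is usually worthwhile, since the default of None means the call can wait indefinitely.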
