An Introduction to Python Toolkits for Crawler Development (1)

Date: 2019-11-09


This article comes from the NetEase Cloud Community.

Author: Wang Tao

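Outline of this article:

A brief introduction to the two Python libraries for crawler development covered today
A detailed look at the requests library and the parameters of its functions
A detailed look at using httpclient in tornado
Summary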

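Goal: get to know the toolkits commonly used in Python to develop crawlers quickly.

Prerequisite: basic Python syntax (2.7).

Here we go!

Simple crawlers: I call one-off code a simple crawler. Such crawlers are custom-built and not reusable, unlike a crawler framework, where a new scraping requirement can be met through configuration alone. For newcomers, this article should cover most of what you need. If you are interested in frameworks, understanding tornado as described here will also help you understand the pyspider framework (pyspider is built on tornado).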

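I. A brief introduction to requests and tornado

With the growth of big data, artificial intelligence, and machine learning, Python's standing as a programming language keeps rising. I'll leave its other uses aside (because I don't know them well) and talk about how Python does at crawling.

1. requests basics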

Anyone who wants to get started with crawling quickly will probably choose Python, and with it the requests library. So how many steps does it take to fetch the source of Baidu's home page?
Answer: three.
Step 1: download and install Python.
Step 2: install the requests library with pip.
Step 3: run python -c 'import requests; print(requests.get("http://www.baidu.com").content)'

Output:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div <div <div <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span ><input id=kw name=wd value maxlength=255 autocomplete=off autofocus></span><span ><input type=submit id=su value=百度一下 ></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews <a href=http://www.hao123.com name=tj_trhao123 <a href=http://map.baidu.com name=tj_trmap <a href=http://v.baidu.com name=tj_trvideo <a href=http://tieba.baidu.com name=tj_trtieba <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" >登陸</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon style="display: block;">更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必讀</a>  <a href=http://jianyi.baidu.com/ <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

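2. Efficient fetching with requests

To fetch efficiently, we turn serial requests into concurrent ones. When concurrency comes up, most people think of multithreading or multiprocessing.

As you know, though, the CPython interpreter holds one big lock, the GIL, which ensures that only one thread executes Python bytecode at any moment. For CPU-bound code, CPython threads are therefore effectively serialized. The GIL is released while a thread waits on blocking I/O, however, so threads still help with network-bound work such as crawling. Note: the GIL exists in CPython; Jython does not have this restriction (http://www.jython.org/jythonbook/en/1.0/Concurrency.html).

To keep the program simple, we just run multiple threads. Many individual operations on Python's built-in data structures (list, dict, and so on) are atomic in CPython, and tuples are immutable, so for simple cases such as appending results to a shared list you can avoid much of the complexity that race conditions bring. Compound read-modify-write sequences still need a lock.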
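As a rough illustration of the threaded approach, here is a minimal sketch. The fetch function is a hypothetical stand-in that sleeps instead of calling requests.get, so the example runs without network access; crawl and the example.com URLs are likewise made up for the demonstration.

```python
import threading
import time


def fetch(url):
    """Stand-in for requests.get(url).content; sleeps instead of using the network."""
    time.sleep(0.1)
    return "<html>fake body for {}</html>".format(url)


def crawl(urls):
    """Fetch every URL in its own thread and return a {url: body} dict."""
    results = []  # list.append is atomic in CPython, so no lock is used here

    def worker(url):
        results.append((url, fetch(url)))

    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return dict(results)


if __name__ == "__main__":
    urls = ["http://example.com/page/{}".format(i) for i in range(5)]
    start = time.time()
    pages = crawl(urls)
    print("fetched {} pages in {:.2f}s".format(len(pages), time.time() - start))
```

Because the five simulated requests wait at the same time, the total run time should be close to one request's delay rather than five. In a real crawler, fetch would call requests.get and the number of threads would normally be capped.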

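Coroutines are now supported in many programming languages. In Python they are implemented with the yield keyword. Today I'd also like to introduce tornado, an asynchronous, non-blocking framework built on coroutines. Using it for network requests can be more efficient than multithreaded requests, since it avoids per-thread overhead.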

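3. Introducing tornado

Before introducing tornado, let's briefly cover the concept of a coroutine.

3.1 Coroutines

Assuming a single thread:
In procedural programming we wrap a block of code in a function, which has one entry and one exit. When we call a function, we must wait for it to finish before the following code can run. A coroutine, still within a single thread, can be entered and left multiple times: when we call a coroutine function, it can return temporarily at a suspension point so that other coroutine functions can run. (This is a bit like multithreading, where the CPU schedules other threads when one thread blocks.)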
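The enter-and-leave behaviour described above can be sketched with plain generators, without any framework. This is only an illustration of the idea, not how tornado is implemented; task and run_round_robin are hypothetical names chosen for the example.

```python
from collections import deque


def task(name, steps, log):
    """A generator-based coroutine: each bare `yield` is a suspension point."""
    for i in range(steps):
        log.append("{} step {}".format(name, i))
        yield  # hand control back to the scheduler


def run_round_robin(coroutines):
    """Resume each coroutine in turn until all of them have finished."""
    queue = deque(coroutines)
    while queue:
        co = queue.popleft()
        try:
            next(co)  # re-enter the coroutine at its last suspension point
            queue.append(co)  # not finished yet, so schedule it again
        except StopIteration:
            pass  # this coroutine has run to completion


if __name__ == "__main__":
    log = []
    run_round_robin([task("A", 2, log), task("B", 3, log)])
    print("\n".join(log))
```

The two tasks interleave their steps even though everything runs in one thread, which is the same effect an event loop achieves when coroutines suspend on I/O.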

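The following tornado code shows the effect. The logic is simple: we add show_my_sleep to the IOLoop four times, each time with a different argument. show_my_sleep prints a message, sleeps, and prints another message. From the result we can see that show_my_sleep goes to sleep at the yield statement and temporarily gives up control; once the sleep ends, execution resumes from after the yield statement.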

import random

from tornado import gen
from tornado.ioloop import IOLoop


@gen.coroutine
def show_my_sleep(idx):
    interval = random.uniform(5, 20)
    print("[{}] is going to sleep {} seconds!".format(idx, interval))
    yield gen.sleep(interval)  # suspension point: control goes back to the IOLoop
    print("[{}] wake up!!".format(idx))


def main():
    io_loop = IOLoop.current()
    io_loop.spawn_callback(show_my_sleep, 1)  # scheduled on the next loop iteration
    io_loop.spawn_callback(show_my_sleep, 2)
    io_loop.spawn_callback(show_my_sleep, 3)
    io_loop.spawn_callback(show_my_sleep, 4)
    io_loop.start()  # nothing stops the loop, so it runs until interrupted


if __name__ == "__main__":
    main()

Result:

[1] is going to sleep 5.19272014406 seconds!
[2] is going to sleep 9.42334286914 seconds!
[3] is going to sleep 5.11032311172 seconds!
[4] is going to sleep 13.0816614451 seconds!
[3] wake up!!
[1] wake up!!
[2] wake up!!
[4] wake up!!

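3.2 About Tornado

[Translated from: https://www.tornadoweb.org/en/stable/guide/intro.html]

Tornado is an asynchronous networking framework implemented in Python. It uses non-blocking I/O and can support tens of thousands of concurrent connections, which makes it well suited to long polling, WebSockets, and other applications that need long-lived connections. Tornado consists of four main parts:
- A web framework, including RequestHandler (which is used to create web applications) and various supporting classes.
- Client-side and server-side implementations of HTTP (HTTPServer and AsyncHTTPClient).
- An asynchronous networking library, including IOLoop and IOStream, which serve as the building blocks for the HTTP components and can also be used to implement other protocols.
- A coroutine library (tornado.gen), which makes asynchronous code more straightforward to write than chained callbacks.

The Tornado web framework and HTTP server together can serve as a full-stack alternative to WSGI.
It is possible to use the Tornado web framework inside a WSGI container, or to use the Tornado HTTP server as a container for other WSGI frameworks, but each of these combinations has limitations.
To take full advantage of Tornado, you need to use Tornado's web framework and HTTP server together.

Here we mainly borrow tornado's httpclient and coroutine library to issue concurrent network requests in a single thread.
Here, let me show you the code!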

import traceback

from tornado import gen
from tornado.curl_httpclient import CurlAsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado.ioloop import IOLoop


@gen.coroutine
def fetch_url(url):
    """Fetch a URL."""
    try:
        c = CurlAsyncHTTPClient()  # create an HTTP client (requires pycurl)
        req = HTTPRequest(url=url)  # build the request
        response = yield c.fetch(req)  # send the request
        print(response.body)
    except Exception:
        print(traceback.format_exc())
    finally:
        IOLoop.current().stop()  # stop the IOLoop, even if the request failed


def main():
    io_loop = IOLoop.current()
    io_loop.spawn_callback(fetch_url, "http://www.baidu.com")  # add the coroutine to the IOLoop
    io_loop.start()


if __name__ == "__main__":
    main()

4. Concurrency with tornado

Put simply, we add multiple callbacks to the IOLoop so that they run concurrently.

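def main():
    io_loop = IOLoop.current()
    io_loop.spawn_callback(fetch_url, "http://www.baidu.com")  # scheduled on the next loop iteration
    # Note: fetch_url stops the IOLoop when it finishes, so with several URLs
    # the stop call has to move to wherever the last fetch completes.
    '''
    io_loop.spawn_callback(fetch_url, url1)
    ... ...
    io_loop.spawn_callback(fetch_url, urln)
    '''
    io_loop.start()


if __name__ == "__main__":
    main()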
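One way to run several fetches concurrently and stop only after all of them have finished is to yield a list of coroutine futures and let run_sync manage the loop. The sketch below assumes tornado is installed; fake_fetch is a hypothetical stand-in that uses gen.sleep in place of a real HTTP request, and fetch_all and the example.com URLs are illustrative names only.

```python
import time

from tornado import gen
from tornado.ioloop import IOLoop


@gen.coroutine
def fake_fetch(url, delay):
    """Stand-in for an async HTTP fetch: waits without blocking the loop."""
    yield gen.sleep(delay)
    raise gen.Return("body of {}".format(url))


@gen.coroutine
def fetch_all(urls, delay=0.2):
    """Start one fetch per URL and wait for all of them together."""
    # Yielding a list of futures resumes once every one of them is done.
    bodies = yield [fake_fetch(u, delay) for u in urls]
    raise gen.Return(dict(zip(urls, bodies)))


if __name__ == "__main__":
    urls = ["http://example.com/{}".format(i) for i in range(3)]
    start = time.time()
    # run_sync starts the loop, runs the coroutine, then stops the loop.
    pages = IOLoop.current().run_sync(lambda: fetch_all(urls))
    print("{} pages in {:.2f}s".format(len(pages), time.time() - start))
```

Since the three simulated fetches overlap, the elapsed time should be close to a single delay. Replacing fake_fetch with a coroutine that yields an HTTP client's fetch call would give the same structure for real requests.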

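Having briefly introduced the two packages, let's look at their key functions and parameters in detail.

II. Key requests functions and parameters

When developing a crawler with requests, we mainly use the get and post methods. In addition, to cope with anti-crawler measures, we add some custom HTTP headers. From this practical angle, let's look at the two key requests functions, get and post.
Function definitions: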

def get(url, params=None, **kwargs):
    """Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

def post(url, data=None, json=None, **kwargs):
    """Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)

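As we can see, both get and post in requests call the request function, which in turn calls the request method of Session. That method is defined as follows:

    def request(self, method, url,
        params=None,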
        data=None,
        headers=None,
        cookies=None,
        files=None,
        auth=None,
        timeout=None,
        allow_redirects=True,
        proxies=None,
        hooks=None,
        stream=None,
        verify=None,
        cert=None,
        json=None):
        """Constructs a :class:`Request <Request>`, prepares it and sends it.
        Returns :class:`Response <Response>` object.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query
            string for the :class:`Request`.
        :param data: (optional) Dictionary, bytes, or file-like object to send
            in the body of the :class:`Request`.
        :param json: (optional) json to send in the body of the
            :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the
            :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the
            :class:`Request`.
        :param files: (optional) Dictionary of ``'filename': file-like-objects``
            for multipart encoding upload.
        :param auth: (optional) Auth tuple or callable to enable
            Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send
            data before giving up, as a float, or a :ref:`(connect timeout,
            read timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Set to True by default.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol or protocol and
            hostname to the URL of the proxy.
        :param stream: (optional) whether to immediately download the response
            content. Defaults to ``False``.
        :param verify: (optional) whether the SSL cert will be verified.
            A CA_BUNDLE path can also be provided. Defaults to ``True``.
        :param cert: (optional) if String, path to ssl client cert file (.pem).
            If Tuple, ('cert', 'key') pair.
        :rtype: requests.Response
        """
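To see how params, data, and headers end up in the outgoing request, the following sketch prepares requests without sending them, so it needs no network access. The URL, the User-Agent string, and the helper names build_get and build_post are assumptions made up for the illustration.

```python
import requests

# Hypothetical values for illustration only.
URL = "http://example.com/search"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}


def build_get(url, params, headers):
    """Prepare (but do not send) a GET request so its URL and headers can be inspected."""
    return requests.Request("GET", url, params=params, headers=headers).prepare()


def build_post(url, form, headers):
    """Prepare (but do not send) a POST request with a form-encoded body."""
    return requests.Request("POST", url, data=form, headers=headers).prepare()


if __name__ == "__main__":
    get_req = build_get(URL, {"q": "python"}, HEADERS)
    print(get_req.url)  # params are encoded into the query string
    post_req = build_post(URL, {"name": "crawler"}, HEADERS)
    print(post_req.body)  # data is form-encoded into the body
    print(post_req.headers["Content-Type"])
    # Sending would look like this (not executed here):
    # with requests.Session() as s:
    #     resp = s.send(get_req, timeout=(3.05, 10), allow_redirects=True)
```

In everyday crawler code the same arguments are usually passed straight to requests.get or requests.post; preparing the request first is just a convenient way to check what will go over the wire. Setting timeout explicitly is worth considering, since its default of None means waiting indefinitely.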
