A Summary of requests-html, a Newer Scraping Library

Date: 2019-11-07

requests-html is a relatively new scraping library, written by the same author as requests.

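I. Installing dependencies

pip install requests-html

During installation you can see that it pulls in lxml, requests, bs4 and so on: the parsing and fetching libraries we commonly use are all bundled with it.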

II. Making a request

from requests_html import HTMLSession
session = HTMLSession()

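# Usage is identical to an object created from requests.Session; it also automatically keeps what responses return (cookies, for example)
# Compared with requests, the response gains one extra attribute: response.html

Caveats: by default the browser it uses is headless, and calling render drives a real browser engine.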

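1. Working around the headless browser (only relevant against anti-scraping measures; if the site has none, it does not matter)

Modify the source code

Ctrl+left-click on HTMLSession to jump to its definition
You can see that it inherits from BaseSession

Ctrl+left-click on BaseSession

Original source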

class BaseSession(requests.Session):
    def __init__(self, mock_browser : bool = True, verify : bool = True,
                 browser_args : list = ['--no-sandbox']):
        super().__init__()
        if mock_browser:
            self.headers['User-Agent'] = user_agent()

        self.hooks['response'].append(self.response_hook)
        self.verify = verify

        self.__browser_args = browser_args

    # (unrelated code in between is omitted here; it is not deleted from the source)
    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            # headless is hard-coded to True in the original source
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)

        return self._browser

Modified source

class BaseSession(requests.Session):
    """ A consumable session, for cookie persistence and connection pooling,
    amongst other things.
    """

    def __init__(self, mock_browser : bool = True, verify : bool = True,
                 browser_args : list = ['--no-sandbox'], headless=False):  # True means headless: no browser window pops up when render runs
        super().__init__()

        # Mock a web browser's user agent.
        if mock_browser:
            self.headers['User-Agent'] = user_agent()

        self.hooks['response'].append(self.response_hook)
        self.verify = verify

        self.__browser_args = browser_args
        self.__headless = headless
    # (unrelated code in between is omitted here; it is not deleted from the source)
    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=self.__headless, args=self.__browser_args)

        return self._browser

All I really did was make it possible to pass a headless argument in.

Reconfiguring the session

from requests_html import HTMLSession
session = HTMLSession(
    browser_args=['--no-sandbox',
                  '--user-agent=xxxxx'
                 ]
)
# This way you decide directly which browser the request identifies itself as

2. Working around browser-engine detection (only relevant against anti-scraping measures; if the site has none, it does not matter)

# Inject JavaScript through the library
from requests_html import HTMLSession

session = HTMLSession(.....)
response = session.get('https://www.baidu.com')
script = '''
() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''
print(response.html.render(script=script))

III. Attributes of response.html

The response object here is

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.baidu.com')
# To keep things easy to follow, this is the response used below

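1. absolute_links

Every link on the page, converted to an absolute URL before being returned

2. links

The links returned exactly as they appear in the page

3. base_url

The path in the page's <base> tag; if there is no <base> tag, it is derived from the current URL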

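4. html

Returns a string that still contains the tags

5. text

Returns a string with the tags stripped out. Extremely handy for scraping novels, news and the like!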

6. encoding

The encoding used for decoding. Note that this is the encoding of response.html; if you only set response.encoding, it has no effect on this one.

7. raw_html

Equivalent to r.content; returns bytes

8. pq

Returns a PyQuery object. I rarely use that library myself, so I will not draw any conclusions about it.

IV. Methods of response.html

From here on I abbreviate the response object as r

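1. find

Finds objects with a CSS selector

Get all matches

Syntax: r.html.find('css selector')

Return value: [Element object 1, ...], a list

Get only the first match

Syntax: r.html.find('css selector', first=True)

Return value: an Element object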

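2. xpath

Finds objects with an XPath expression

Get all matches

Syntax: r.html.xpath('xpath expression')

Return value: [Element object 1, ...], a list

Get only the first match

Syntax: r.html.xpath('xpath expression', first=True)

Return value: an Element object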

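3. search (returns only the first match)

Works much like regex matching: each (.*?) in the regex simply becomes {}

Syntax: r.html.search('template')

Template form 1: 'xx{}xxx{}'

Access: the first captured value is r.html.search('template')[0], and so on for the rest

Template form 2: 'xxx{name}yyy{pwd}'

Access: the value captured as name is r.html.search('template')['name'], and so on for the rest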

4. search_all (returns every match)

Used the same way as search

Return value: [Result object, Result object, ...]

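5. render (there is a lot to cover here, so I will write it up separately later)

It is essentially a wrapper around pyppeteer. If you are not familiar with pyppeteer, think of Selenium: it simulates a browser visiting the page.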

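V. Element object methods and attributes

absolute_links: links as absolute URLs
links: links as written in the page (may be relative)
text: text only
html: markup included
attrs: the element's attributes
find('css selector')
xpath('xpath expression')
search('template')
search_all('template')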
