Posting data with Python: a simple example

Date: 2019-12-14


I recently picked up a little about web scraping with Python, so here is a brief write-up. It mainly uses the requests and BeautifulSoup libraries.

Requests is an elegant and simple HTTP library for Python, built for human beings.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

The two blurbs above are the libraries' own introductions, taken from their documentation.

1. Example page

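Here I use the login page of the Northeastern University (東北大學) library to build our scraper. (PS: yes, I am a student at Northeastern University, which is why I have an account and password.) If you don't have an account, that's fine, since the principle is much the same elsewhere. I picked this page because it has no CAPTCHA, which keeps things simple, and because school pages like this tend to be simple and easy to work with.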

2. Brief analysis

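First, I logged in to the Northeastern University library with my account and password. I use the Chrome browser; with developer tools open, we can see exactly what information gets submitted.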

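After logging in, press F12 to open developer tools. Under the Network tab we find the request whose method is POST, which should be the one we are looking for. Scrolling to the bottom shows Form Data: the part boxed in red in the screenshot is what we submit at login, five fields in total, and the fields underlined in red are the account and password. Once the POST payload is clear, we can write code to submit it automatically.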
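To make the form submission concrete, here is a minimal sketch of how those five fields end up in the body of a POST request. It uses only the standard library. The field names are the ones from the login form; `YOUR_ID` and `YOUR_PASSWORD` are placeholders I made up, not real values.

```python
from urllib.parse import urlencode, parse_qs

# The five form fields seen under "Form Data".
# The credentials below are placeholders, not real values.
form_data = {
    "func": "login-session",
    "login_source": "bor-info",
    "bor_id": "YOUR_ID",
    "bor_verification": "YOUR_PASSWORD",
    "bor_library": "NEU50",
}

# A browser form POST sends the fields URL-encoded in the request body.
body = urlencode(form_data)
print(body)

# Decoding the body gives the fields back (each value arrives as a list).
decoded = parse_qs(body)
print(decoded["bor_library"])
```

This is roughly what an HTTP library does for you when you hand it a dictionary of form data.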

With the login part figured out, the next step is to work out what to scrape. I want to scrape my:

Loans (外借)
Loan history list (借閱歷史列表)
Hold requests (預定請求)

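As the screenshot shows, I currently have 1 book on loan, have borrowed 65 books in the past, and have 0 hold requests. The goal is to scrape these three numbers, so we press F12 to view the page source and work out which part to extract.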

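Drilling down step by step as in the screenshot, I found that the data all sits under the tag with id=history. So we can find that tag first, then the tr tag, and from there reach the data in the td tags.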
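The id, then tr, then td lookup can be tried on a small piece of markup first. The sketch below is only an illustration: `SAMPLE_HTML` is markup I invented to imitate the structure described here (the real page's HTML is not reproduced), and the numbers are the ones mentioned earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the described structure:
# label cells alternate with value cells inside one row.
SAMPLE_HTML = """
<table id="history">
  <tr>
    <td>Loans</td><td> 1 </td>
    <td>Loan history list</td><td> 65 </td>
    <td>Hold requests</td><td> 0 </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
history = soup.find(id="history")   # the tag that holds all three numbers
first_row = history.find("tr")      # the first table row inside it
cells = first_row.find_all("td")    # every cell in that row

# In this sample the values sit in cells 1, 3 and 5.
values = [cells[i].get_text(strip=True) for i in (1, 3, 5)]
print(values)
```

Testing against a saved or hand-written snippet like this avoids logging in every time you tweak the parsing.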

3. Features implemented

Automatic login
Scrape some information from the page and print it to the console

4. The code

4.1 Posting the data
First, the code for this part:

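import requests

def getHTMLText(url):
    try:
        # Pose as a browser; some sites reject obvious scripts
        kv = {'user-agent': 'Mozilla/5.0'}
        # Form fields submitted at login; *** stands for the real credentials
        mydata = {'func':'login-session', 'login_source':'bor-info', 'bor_id': '***', 'bor_verification': '***','bor_library':'NEU50'}
        re = requests.post(url, data=mydata, headers=kv)
        re.raise_for_status()               # raise on 4xx/5xx responses
        re.encoding = re.apparent_encoding  # use the encoding detected from the body
        return re.text
    except requests.RequestException:
        print("Request failed")
        return ""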

With the code above in view, let's walk through the POST.

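kv is a dictionary defined to imitate a browser. Some sites refuse access when they detect a scraper, so we change the headers to look like a browser.
mydata holds the information to POST; I replaced the account and password with ***.
requests.post() submits data to the given url. There is plenty of material about requests online, so I won't go into detail.
re.raise_for_status() raises an exception if the request failed, that is, if the response has a 4xx or 5xx status code.
re.encoding = re.apparent_encoding sets the encoding to the one detected from the content, so that Chinese text is decoded correctly.
The try/except structure is there for robustness, so the program does not crash on an error.
Finally, we return the text of the new page.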
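To see what raise_for_status() does without touching the network, here is a small sketch that builds Response objects by hand. Constructing a Response manually is just a trick for illustration, not how the login code works, and `check` is a helper name I made up.

```python
import requests

def check(status_code):
    """Build a bare Response by hand and report how raise_for_status() reacts."""
    resp = requests.Response()   # no network involved; we fill it in ourselves
    resp.status_code = status_code
    try:
        resp.raise_for_status()
        return "ok"
    except requests.HTTPError:
        return "HTTPError"

results = {code: check(code) for code in (200, 404, 500)}
print(results)

# HTTPError is a subclass of RequestException, so catching the latter
# also covers connection errors and timeouts.
print(issubclass(requests.HTTPError, requests.RequestException))
```

A 200 passes silently, while 404 and 500 raise, which is why the call sits inside the try block.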

4.2 Scraping the data
First, the code:

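import bs4
from bs4 import BeautifulSoup

def fillBookList(booklist, html):
    soup = BeautifulSoup(html, "html.parser")
    # Walk every descendant of the tag with id='history'
    for tr in soup.find(id='history').descendants:
        # Skip text nodes; we only want Tag objects
        if isinstance(tr, bs4.element.Tag):
            temp = tr.find_all('td')
            if len(temp) > 0:
                # Cells 1, 3 and 5 hold the three values
                booklist.append(temp[1].string.strip())
                booklist.append(temp[3].string.strip())
                booklist.append(temp[5].string.strip())
                break  # the first match is all we need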

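The parameters are the list to fill and the HTML of the target page.
Create a BeautifulSoup object.
Find the tag with id=history in the whole page, then iterate over all of its descendants.
During the iteration some descendants may be strings rather than tags, and we need to filter those out, hence isinstance(tr, bs4.element.Tag).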

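How isinstance works:
Syntax:
isinstance(object, classinfo)
Here object is a variable and classinfo is a type (tuple, dict, int, float, list, bool, etc.) or a class. If object is an instance of classinfo, or an instance of a subclass of classinfo, it returns True. If object is not an object of the given type, the result is always False. If classinfo is not a type or a tuple of types, a TypeError is raised.
Look for all td tags inside the tag. From the page source, the first list of td tags found is the one we want, so as soon as we find it we stop searching and store the information in booklist.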
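The isinstance rules above are easy to check in a few lines. The `Animal` and `Dog` classes are made up purely for this illustration.

```python
class Animal:
    pass

class Dog(Animal):
    pass

rex = Dog()

checks = [
    isinstance(3, int),             # exact type
    isinstance(rex, Animal),        # instance of a subclass
    isinstance("3", (int, float)),  # matches nothing in the tuple
    isinstance(True, int),          # bool is a subclass of int
]
print(checks)

try:
    isinstance(3, "int")            # classinfo must be a type or a tuple of types
except TypeError as exc:
    print("TypeError:", exc)
```

The second check is the case that matters for the scraper: any object whose class derives from bs4.element.Tag passes the filter.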

4.3 Printing the results
The code:

def printUnivList(booklist):
    # {:^10} centers the value in a field 10 characters wide
    print("{:^10}\t{:^6}\t{:^10}".format("Loans","Loan history list","Hold requests"))
    print("{:^10}\t{:^6}\t{:^10}".format(booklist[0],booklist[1],booklist[2]))

This part is simple, so I'll skip the explanation.
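For anyone who has not met the {:^N} format spec, here is a quick sketch of what it does. The values are the sample numbers from earlier, and the variable names are only for illustration.

```python
# "^" centers the value; the number after it is the minimum field width.
loans = "{:^10}".format("1")
history = "{:^6}".format("65")
holds = "{:^10}".format("0")

# repr() makes the padding spaces visible.
print(repr(loans), repr(history), repr(holds))

# The width is a minimum, not a limit: longer text is never truncated.
long_label = "{:^6}".format("Loan history list")
print(repr(long_label))
```

Because long labels are not cut off, a header wider than its field simply pushes the columns out of alignment instead of losing text.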

4.4 Main function

The code:

def main():
    html = getHTMLText("http://202.118.8.7:8991/F/-?func=bor-info")
    booklist = []
    fillBookList(booklist, html)
    printUnivList(booklist)
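One thing to watch in this flow: if the request fails, the HTML is an empty string, and looking up id=history on it gives None, so the parsing step would crash. Below is a sketch of a more defensive parser. It is a simplified variant, not the code from this post; `extract_counts` is a name I made up, and it assumes the first six td cells under id=history are the ones we want.

```python
from bs4 import BeautifulSoup

def extract_counts(html):
    """Return the three counts as strings, or an empty list if they are not found."""
    if not html:                     # the request step returned ""
        return []
    soup = BeautifulSoup(html, "html.parser")
    history = soup.find(id="history")
    if history is None:              # e.g. login failed and another page came back
        return []
    cells = history.find_all("td")
    if len(cells) < 6:               # not the layout we expected
        return []
    return [cells[i].get_text(strip=True) for i in (1, 3, 5)]

# Hypothetical input for a quick check.
sample = ('<table id="history"><tr><td>a</td><td>1</td><td>b</td>'
          '<td>65</td><td>c</td><td>0</td></tr></table>')
print(extract_counts(sample))
print(extract_counts(""))
```

The caller can then test for an empty list before printing, rather than indexing into a list that was never filled.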

5. Testing

The information we wanted is printed to the console successfully!

6. Complete code

import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Log in by POSTing the form and return the HTML of the resulting page."""
    try:
        kv = {'user-agent': 'Mozilla/5.0'}
        mydata = {'func':'login-session', 'login_source':'bor-info', 'bor_id': '***', 'bor_verification': '***','bor_library':'NEU50'}
        re = requests.post(url, data=mydata, headers=kv)
        re.raise_for_status()
        re.encoding = re.apparent_encoding
        return re.text
    except requests.RequestException:
        print("Request failed")
        return ""


def fillBookList(booklist, html):
    """Append the three values found under id='history' to booklist."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find(id='history').descendants:
        if isinstance(tr, bs4.element.Tag):
            temp = tr.find_all('td')
            if len(temp) > 0:
                booklist.append(temp[1].string.strip())
                booklist.append(temp[3].string.strip())
                booklist.append(temp[5].string.strip())
                break


def printUnivList(booklist):
    """Print a header row and the three values, centered in tab-separated columns."""
    print("{:^10}\t{:^6}\t{:^10}".format("Loans","Loan history list","Hold requests"))
    print("{:^10}\t{:^6}\t{:^10}".format(booklist[0],booklist[1],booklist[2]))


def main():
    html = getHTMLText("http://202.118.8.7:8991/F/-?func=bor-info")
    booklist = []
    fillBookList(booklist, html)
    printUnivList(booklist)


if __name__ == "__main__":
    main()