Python Web Scraping (Part 1)

Date: 2019-11-17

Tags: python, web scraping


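1、The HTTP protocol

1. Basic concepts

HTTP stands for Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol built on a request-response model.

HTTP uses URLs to identify and locate network resources. The URL format is:

　　http://host[:port][path]

host: a valid Internet hostname or IP address

port: the port number; defaults to 80

path: the path of the requested resource

Example HTTP URLs: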

　　http://www.bit.edu.cn

　　http://220.181.111.188/duty

Understanding HTTP URLs:

　　A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.
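
As a minimal sketch of the URL format above, Python's standard `urllib.parse` module can split a URL into host, port, and path. The helper name `describe_url` is hypothetical and not from the original article; the fallback to port 80 mirrors the default described above and only makes sense for the `http` scheme.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split an HTTP URL into (host, port, path). Hypothetical helper for illustration."""
    parts = urlsplit(url)
    host = parts.hostname
    # urlsplit reports the port as None when the URL omits it; fall back to HTTP's default of 80.
    port = parts.port if parts.port is not None else 80
    # An empty path means the root resource.
    path = parts.path or "/"
    return host, port, path

if __name__ == "__main__":
    print(describe_url("http://www.bit.edu.cn"))
    print(describe_url("http://220.181.111.188/duty"))
```

Both example URLs from this section omit the port, so the sketch reports 80 for each.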

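2. HTTP operations on resources

GET     Request the resource at the URL
HEAD    Request the response headers for the resource at the URL, without the body
POST    Submit new data to be appended to the resource at the URL
PUT     Store a resource at the URL, overwriting whatever is already there
PATCH   Partially update the resource at the URL, changing only part of its content
DELETE  Delete the resource stored at the URL

GET and HEAD are mainly used to retrieve data; PUT, POST, PATCH, and DELETE are mainly used to submit data or changes to the server.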

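3. The difference between PATCH and PUT

Suppose the URL holds a record, UserInfo, with 20 fields including UserID and UserName.

Requirement: the user changes UserName and nothing else.

With PATCH, only a partial update containing UserName is submitted to the URL.
With PUT, all 20 fields must be submitted to the URL together; any field left out is deleted.

The main advantage of PATCH is that it saves network bandwidth.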
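
The contrast can be sketched in memory, without any server. The functions `apply_put`, `apply_patch`, and `make_user_info` below are hypothetical illustrations, and the 18 placeholder field names are assumptions; a real server may implement PATCH differently from this simple merge.

```python
def apply_put(stored, payload):
    """PUT semantics: the payload replaces the stored resource entirely."""
    return dict(payload)

def apply_patch(stored, payload):
    """PATCH semantics (simple merge): only the submitted fields change."""
    updated = dict(stored)
    updated.update(payload)
    return updated

def make_user_info():
    # 20 illustrative fields: UserID, UserName, and 18 placeholder fields.
    record = {"UserID": 1, "UserName": "old_name"}
    record.update({f"Field{i}": i for i in range(3, 21)})
    return record

if __name__ == "__main__":
    stored = make_user_info()
    change = {"UserName": "new_name"}
    print(len(apply_patch(stored, change)))  # all 20 fields survive
    print(len(apply_put(stored, change)))    # only the submitted field remains
```

Sending the one-field payload with PUT semantics drops the other 19 fields, which is why a PUT client has to resend the whole record.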

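2、Using the requests library

The 7 main methods of the requests library:

1. The GET method

(1) requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch

params: extra parameters to add to the URL, as a dict or bytes; optional

**kwargs: 12 parameters that control the request

(2) The response object

Attributes of the response object

r.status_code        HTTP status code of the response; 200 means success, while codes such as 404 indicate failure
r.text               The response body as a string
r.encoding           The encoding of the response body, guessed from the HTTP headers
r.apparent_encoding  The encoding of the response body, inferred from the content itself
r.content            The response body as bytes

(3) Response encoding

r.encoding: if the Content-Type header of a text response contains no charset, the encoding is assumed to be ISO-8859-1

r.text decodes the page content using r.encoding

r.apparent_encoding: the encoding inferred from the page content; it can serve as a fallback for r.encoding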

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Mr Wu

import requests

url = "https://www.baidu.com"
r = requests.get(url)
print(r.encoding)           # ISO-8859-1
print(r.text[1000:2000])    # Chinese characters are garbled
'''
pan class="bg s_btn_wr"><input type=submit id=su value=ç¾åº¦ä¸ä¸ class="bg s_btn" autofocus></span>
 </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ°é»</a> \
 <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap 
 class=mnav>å°å¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§é¢</a> <a href=http://tieba.baidu.com
  name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;
  u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç»å½</a> </noscript> 
  script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ 
  encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+
  '" name="tj_login" class="lb">ç»å½</a>');
 </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ´å¤äº§å<
'''
r.encoding = r.apparent_encoding  # switch to the encoding inferred from the content
print(r.encoding)           # utf-8
print(r.text[1000:2000])    # now displayed correctly
'''
n_wr"><input type=submit id=su value=百度一下 class="bg s_btn" autofocus></span> </form> </div> </div> 
<div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=https://www.hao123.com 
name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> 
<a href=http://v.baidu.com name=tj_trvideo class=mnav>視頻</a> <a href=http://tieba.baidu.com name=tj_trtieba 
class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;
u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登陸</a> </noscript> <script>
document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent
(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login"
 class="lb">登陸</a>');
 </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">
 更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p
'''
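
The garbling in the first output above can be reproduced offline with the standard library alone. The sketch below is an illustration rather than what requests does internally: it decodes UTF-8 bytes as ISO-8859-1 to produce unreadable text, then reverses the mistake. The sample string is an assumption chosen for the demonstration.

```python
# UTF-8 bytes for a short Chinese string, as a server might send them.
raw = "百度一下".encode("utf-8")

# Decoding with the wrong charset never fails for ISO-8859-1, because every
# byte maps to some character, but the result is unreadable.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes with the right charset restores the text.
correct = raw.decode("utf-8")

# A garbled string can be repaired by undoing the wrong decode first.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

if __name__ == "__main__":
    print(len(raw), len(garbled), len(correct))
    print(repaired == correct)
```

Setting `r.encoding` to the right charset before reading `r.text` avoids the wrong decode in the first place, which is what the example above does.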


2. The HEAD method

r = requests.head(url, **kwargs)

url: the URL of the page to fetch

**kwargs: 12 parameters that control the request

Example:

import requests

url = "http://httpbin.org/get"
r = requests.head(url)    # request only the response headers
print(r.headers)
'''
{'Access-Control-Allow-Credentials': 'true',
'Access-Control-Allow-Origin': '*',
'Content-Encoding': 'gzip', 
'Content-Type': 'application/json',
'Date': 'Sun, 28 Jul 2019 09:01:44 GMT',
'Referrer-Policy': 'no-referrer-when-downgrade',
'Server': 'nginx',
'X-Content-Type-Options': 'nosniff',
'X-Frame-Options': 'DENY',
'X-XSS-Protection': '1; mode=block',
'Connection': 'keep-alive'}
'''


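3. The POST method

r = requests.post(url, data=None, json=None, **kwargs)

url: the URL of the page to update

data: a dict, bytes, or file object sent as the request body. A dict posted to the URL is automatically encoded as a form; a string posted to the URL is sent as raw data.

json: JSON-serializable data sent as the request body

**kwargs: 12 parameters that control the request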
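
How `data` and `json` are encoded can be inspected without sending anything, by building a prepared request. This is a minimal sketch using `requests.Request(...).prepare()`; the helper `prepare_post` is a hypothetical name introduced here, and the URL is only a label because no network call is made.

```python
import requests

URL = "http://httpbin.org/post"  # same test endpoint as the article's examples

def prepare_post(**body_kwargs):
    """Build a POST request without sending it, so the encoded body can be inspected."""
    return requests.Request("POST", URL, **body_kwargs).prepare()

# A dict passed as data is form-encoded.
form_req = prepare_post(data={"key1": "value1", "key2": "value2"})
# A string passed as data is sent unchanged.
raw_req = prepare_post(data="ABC")
# A dict passed as json is serialized to JSON.
json_req = prepare_post(json={"key1": "value1"})

if __name__ == "__main__":
    for req in (form_req, raw_req, json_req):
        print(req.headers.get("Content-Type"), req.body)
```

The Content-Type header differs in each case, which is how the server knows whether to parse the body as a form, as raw data, or as JSON.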

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Mr Wu

import requests

url = "http://httpbin.org/post"
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=kv)    # POST a dict as the request body
print(r.text)
'''
{..........
  "form": {
    "key1": "value1",          # a dict posted to the URL is automatically encoded as a form
    "key2": "value2"
  }, 
............
}
'''


#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Mr Wu

import requests

url = "http://httpbin.org/post"

r = requests.post(url, data='ABC')    # POST a string as the request body
print(r.text)
'''
{..........
  "data": "ABC",          # a string posted to the URL is sent as raw data
  "files": {},
  "form": {}, 
............
}
'''


4. The PUT method

r = requests.put(url, data=None, **kwargs)

url: the URL of the page to update

data: a dict, bytes, or file object sent as the request body

**kwargs: 12 parameters that control the request

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Mr Wu

import requests

url = "http://httpbin.org/put"
kv = {"k1": "v1", "k2": "v2"}
r = requests.put(url, data=kv)
print(r.text)
'''
{..........
  "form": {
    "k1": "v1",         # a dict sent with PUT is automatically encoded as a form
    "k2": "v2"
  }, 
............
}
'''


5. The PATCH method

r = requests.patch(url, data=None, **kwargs)

url: the URL of the page to update

data: a dict, bytes, or file object sent as the request body

**kwargs: 12 parameters that control the request

6. The DELETE method

r = requests.delete(url, **kwargs)

url: the URL of the resource to delete

**kwargs: 12 parameters that control the request

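7. The request method

The request method underlies the six methods above; each of them is implemented by calling request.

requests.request(method, url, **kwargs)

method: the request method, one of 7 such as GET, PUT, and POST
url: the URL of the page to fetch
params: dict or bytes, added to the URL as query parameters
data: dict, bytes, or file object, sent as the request body
json: JSON-serializable data, sent as the request body
headers: dict of custom HTTP headers, e.g. headers={'User-Agent': 'Chrome/10'}
auth: tuple, enables HTTP authentication
files: dict, used to upload files
timeout: timeout in seconds
proxies: dict of proxy servers, which may include login credentials, e.g. proxies={"https": "https://user:pass@10.10.10.1:1234", "http": "http://10.10.10.1:4321"}
allow_redirects: True/False, default True; whether to follow redirects
stream: True/False, default False; when False, the response body is downloaded immediately
verify: True/False, default True; whether to verify the SSL certificate
cert: path to a local SSL certificate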
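
To make the `params` entry concrete: requests URL-encodes the dict and appends it to the URL as a query string. The sketch below approximates that step with the standard library's `urllib.parse.urlencode`; the helper `add_params` is a hypothetical stand-in for illustration, not part of requests, and it assumes the URL has no query string yet.

```python
from urllib.parse import urlencode

def add_params(url, params):
    """Append URL-encoded query parameters to a URL that has no query string yet."""
    if not params:
        return url
    return url + "?" + urlencode(params)

if __name__ == "__main__":
    # Non-ASCII values are percent-encoded as UTF-8 bytes.
    print(add_params("https://www.baidu.com/s", {"wd": "原神"}))
    # → https://www.baidu.com/s?wd=%E5%8E%9F%E7%A5%9E
```

The result matches the `r.request.url` value shown in the Baidu search example later in this article.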

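8. Exceptions in the requests library

ConnectionError: network connection error, such as a DNS lookup failure or a refused connection
HTTPError: HTTP error
URLRequired: a URL is missing
TooManyRedirects: the maximum number of redirects was exceeded
ConnectTimeout: timed out while connecting to the remote server
Timeout: the request to the URL timed out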
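
Because some of these exception classes inherit from others, the order in which they are checked matters. The sketch below is one way to sort them from most to least specific; `classify` and its labels are hypothetical names introduced here, and the exceptions are constructed by hand so that nothing is sent over the network.

```python
import requests

def classify(exc):
    """Map a requests exception to a short label, most specific class first."""
    # ConnectTimeout inherits from both ConnectionError and Timeout,
    # so it has to be tested before either parent.
    if isinstance(exc, requests.exceptions.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.exceptions.Timeout):
        return "timeout"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.exceptions.HTTPError):
        return "http error"
    # RequestException is the common base class of the exceptions listed above.
    if isinstance(exc, requests.exceptions.RequestException):
        return "other requests error"
    return "not a requests error"

if __name__ == "__main__":
    print(classify(requests.exceptions.ConnectTimeout()))
    print(classify(requests.exceptions.TooManyRedirects()))
```

The same ordering applies to a chain of `except` clauses: a clause for `Timeout` placed before `ConnectTimeout` would catch both.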

9. Response exceptions

response.raise_for_status()　　# raises requests.HTTPError if the status code is a 4xx or 5xx error

response.raise_for_status() checks response.status_code internally, so no extra if statement is needed.
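
The behavior can be tried offline by filling in a `requests.Response` object by hand, as in the minimal sketch below. The helper `status_is_error` is a hypothetical name; in normal use the response comes from `requests.get` or a similar call rather than being constructed directly.

```python
import requests

def status_is_error(status_code):
    """Return True if raise_for_status() raises HTTPError for this status code."""
    # A bare Response is built by hand here purely to avoid a network call.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return True
    return False

if __name__ == "__main__":
    for code in (200, 204, 404, 500):
        print(code, status_is_error(code))
```

Note that a successful code other than 200, such as 204, does not raise.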

10. A general-purpose scraper framework

import requests

def getHTMLText(url):
    try:
        # Replace the default User-Agent header so the request identifies as a browser
        r = requests.get(url, timeout=30, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()                # raises HTTPError for a 4xx/5xx status code
        r.encoding = r.apparent_encoding    # use the encoding inferred from the content
        return r.text
    except Exception as e:
        print(e)

if __name__ == '__main__':
    url = input("url: ")
    text = getHTMLText(url)


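3、Problems caused by web crawlers

Performance harassment: web servers are provisioned for human visitors by default; depending on how well and for what purpose it was written, a crawler can impose a heavy resource load on a web server
Legal risk: the data on a server has an owner, and profiting from data obtained by a crawler carries legal risk
Privacy leaks: a crawler may be able to get past simple access controls, obtain protected data, and so leak personal information

4、Restrictions on web crawlers

Source review: restrict by User-Agent, checking the User-Agent field of incoming HTTP request headers and responding only to browsers and friendly crawlers
Public notice: the Robots protocol tells all crawlers the site's crawling policy and asks them to comply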

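5、The Robots protocol (Robots Exclusion Standard)

Purpose: a website tells web crawlers which pages may be crawled and which may not

Form: a robots.txt file in the site's root directory

Example: Bilibili's robots.txt file:

User-agent: *
Disallow: /include/
Disallow: /mylist/
Disallow: /member/
Disallow: /images/
Disallow: /ass/
Disallow: /getapi
Disallow: /search
Disallow: /account
Disallow: /badlist.html
Disallow: /m/

Using the Robots protocol:

Web crawlers: read robots.txt automatically or manually, then crawl accordingly

Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk

If the scraped data is used for commercial profit, the protocol should be followed voluntarily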
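
The rules in the robots.txt listing above can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a subset of those rules from a string rather than downloading them; the user agent name `MyCrawler`, the helper `build_parser`, and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# A subset of the robots.txt rules quoted in this section.
RULES = """\
User-agent: *
Disallow: /include/
Disallow: /member/
Disallow: /search
Disallow: /m/
"""

def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

if __name__ == "__main__":
    parser = build_parser(RULES)
    # "MyCrawler" is a made-up user agent name for this example.
    for path in ("/search", "/member/1", "/video/1"):
        url = "https://www.bilibili.com" + path
        print(path, parser.can_fetch("MyCrawler", url))
```

A crawler that wants to comply would call `can_fetch` before each request and skip any URL for which it returns False.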


6、Web crawler examples

1. Measure the time taken to fetch one page 100 times

# url: "https://www.bilibili.com/"
# Total time spent fetching the page: 71.3538544178009s
import time

import requests

def getHTML(url):
    try:
        r = requests.request('get', url)
        r.raise_for_status()                # raises an exception for a 4xx/5xx status code
        r.encoding = r.apparent_encoding    # use the encoding inferred from the content
        return r.text
    except Exception as e:
        print(e)

if __name__ == '__main__':
    url = "https://www.bilibili.com/"
    start_time = time.time()
    for i in range(100):
        getHTML(url)
    end_time = time.time()
    total_spend_time = end_time - start_time
    print("Total time spent fetching the page: %ss" % total_spend_time)


2. Scrape the page of an Amazon product

# Fetch product information from the relevant page
import requests

url_dict = {
    'jd': 'https://item.jd.com/100004538426.html',
    'ymx': 'https://www.amazon.cn/dp/B07BXKFXKH/ref=sr_1_1?brr=1&qid=1564219979&rd=1&s=digital-text&sr=1-1',
}

def getHTMLText(url):
    try:
        # Replace the default User-Agent header so the request identifies as a browser
        r = requests.get(url, timeout=30, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()                # raises an exception for a 4xx/5xx status code
        r.encoding = r.apparent_encoding    # use the encoding inferred from the content
        print(r.request.headers)            # {'User-Agent': 'Mozilla/5.0', .......}
        return r.text[0:1000]
    except Exception as e:
        print(e)

if __name__ == '__main__':
    url = url_dict['ymx']
    print(getHTMLText(url))


3. Scrape Baidu search results

# Use the get method to fetch search engine results
import requests

url_dict = {
    'bd': 'https://www.baidu.com/s'
}

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30, params={'wd': '原神'})    # params adds the query field
        r.raise_for_status()                # raises an exception for a 4xx/5xx status code
        r.encoding = r.apparent_encoding    # use the encoding inferred from the content
        print(r.request.url)                # https://www.baidu.com/s?wd=%E5%8E%9F%E7%A5%9E
        return r.text                       # the content is not parsed yet; to be done
    except Exception as e:
        print(e)

if __name__ == '__main__':
    url = url_dict['bd']
    print(getHTMLText(url))


4. Look up IP information on a website and scrape the result

# Look up the location and other details of an IP address
import requests

def getIpMessage(url):
    try:
        ip = input("Enter your IP address: ")
        r = requests.get(url, params={'ip': ip}, headers={'User-Agent': 'Mozilla/5.0'})
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text[7000:]
    except Exception as e:
        print(e)

if __name__ == '__main__':
    url = 'http://www.ip138.com/ips138.asp'
    print(getIpMessage(url))


5. Scrape an image from National Geography

# Download an image
import os

import requests

url = 'http://img0.dili360.com/ga/M01/47/53/wKgBzFkP2yaAE3wtADLAEHwR25w840.tub.jpg@!rw9'
root = 'D://python學習//projects//python爬蟲//day1'
# File name: the last URL segment with the "@..." suffix stripped
path = '//'.join([root, url.split('/')[-1].split('@')[0]])

def getHTMLJpj(url):
    try:
        if not os.path.exists(root):
            print("Save directory does not exist!")
        else:
            if not os.path.exists(path):
                r = requests.get(url)
                r.raise_for_status()
                # The with block closes the file, so no explicit close() is needed
                with open(path, "wb") as f:
                    f.write(r.content)
                print("Image saved")
            else:
                print("File already exists!")
    except Exception as e:
        print(e)

if __name__ == '__main__':
    getHTMLJpj(url)
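
The file-name step in the example above can also be written with `urllib.parse` and `os.path`, which is a little more portable than joining with `//`. This is a sketch under assumptions: `local_path_for` is a hypothetical helper and `downloads` is a placeholder directory name, neither taken from the original article.

```python
import os
from urllib.parse import urlsplit

def local_path_for(url, root):
    """Build a local file path from the last segment of the URL path."""
    # urlsplit separates the path from any query string or fragment.
    name = os.path.basename(urlsplit(url).path)
    # Drop a trailing "@..." transform suffix like the one in the article's URL.
    name = name.split("@")[0]
    return os.path.join(root, name)

if __name__ == "__main__":
    url = "http://img0.dili360.com/ga/M01/47/53/wKgBzFkP2yaAE3wtADLAEHwR25w840.tub.jpg@!rw9"
    print(local_path_for(url, "downloads"))
```

`os.path.join` picks the separator for the current operating system, so the same code works on Windows and on Unix-like systems.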
