Web Scraping Basics

Date: 2019-12-08



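What is a web crawler?
- A program that imitates a browser's requests and collects data from the internet on your behalf
Types of crawlers
- General-purpose crawler: fetches the source of an entire page
- Focused crawler: extracts only specified parts of a page
- Incremental crawler: monitors a site for updates and fetches only the newly published data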
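The incremental idea can be sketched as keeping a fingerprint of every record already collected and skipping anything seen before. This is only an illustration: the `SeenFilter` class and the `select_new` helper are hypothetical names, and a real crawler would normally persist the fingerprints between runs (for example in a database) rather than keep them in memory.

```python
import hashlib


class SeenFilter:
    """Remembers fingerprints of records that were already collected."""

    def __init__(self):
        self._seen = set()

    def is_new(self, record: str) -> bool:
        # Hash the record so that only a fixed-size fingerprint is stored.
        fingerprint = hashlib.sha256(record.encode('utf-8')).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


def select_new(records, seen_filter):
    """Return only the records that have not been collected before."""
    return [record for record in records if seen_filter.is_new(record)]
```

On a second crawl of the same site, passing the freshly fetched records through `select_new` with the same filter leaves only the items that were published since the previous run.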
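requests
- Purpose
  - Sends HTTP requests the way a browser would
- get/post parameters:
  - url
  - params (query string, used with get) / data (form body, used with post)
  - headers
- Coding workflow:
  - Determine whether the target data is loaded dynamically
  - Specify the url
  - Send the request
  - Get the response data
  - Persist the data
- get/post return value: a response object
  - text: the response body as a string
  - json(): the response body parsed from JSON into Python objects (dict, list, and so on)
  - content: the response body as bytes
  - encoding: the encoding used to decode content into text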
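How content, encoding and text relate can be shown without a network call: text is essentially content decoded with encoding. The sketch below uses only the standard library; `decode_body` is a hypothetical helper, not part of requests, and the sample bytes merely stand in for a real response body.

```python
def decode_body(content: bytes, encoding: str) -> str:
    """Decode raw response bytes, roughly how `text` is derived from `content`."""
    # errors='replace' keeps the sketch from raising on undecodable bytes.
    return content.decode(encoding, errors='replace')


raw = '搜狗搜索'.encode('utf-8')        # stands in for response.content

garbled = decode_body(raw, 'latin-1')   # wrong encoding: unreadable characters
correct = decode_body(raw, 'utf-8')     # right encoding: the original text
```

This is why setting `response.encoding` before reading `response.text` can fix garbled Chinese output: the bytes are the same, only the decoding changes.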
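Anti-scraping mechanisms
- Strategies and techniques a website uses to keep crawlers from collecting its data
Counter-measures against anti-scraping
- Strategies and techniques a crawler uses to get past a site's anti-scraping measures and reach the data
The first anti-scraping mechanisms
- The robots.txt protocol: a plain-text file in which a site states which paths crawlers may fetch; compliance is voluntary
  - User-Agent: the identity of the client (the "request carrier") that sends an HTTP request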
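A crawler that wants to respect robots.txt can check a url against the rules before requesting it. The sketch below uses `urllib.robotparser` from the Python standard library and parses an in-memory rule set, so it runs offline. The rules, the `my-crawler` agent name and the example.com urls are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Create a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# A path that no rule mentions is allowed.
allowed = parser.can_fetch('my-crawler', 'https://example.com/index.html')
# A path under /private/ is disallowed for every user agent.
blocked = parser.can_fetch('my-crawler', 'https://example.com/private/data.html')
```

For a live site, `set_url()` followed by `read()` would download the real file instead of parsing a string.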
Coding workflow:
- Specify the url
- Send the request
- Get the response data
- Persist the data

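Jupyter shortcuts

Launch: jupyter notebook
Insert a cell: a (above), b (below)
Delete a cell: x or dd
Switch cell mode: m (Markdown), y (code)
Run a cell: shift+enter
tab: code completion
Open the help docs: shift+tab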

Example: scraping the Sogou homepage

import requests

# 1. Specify the url
url = 'https://www.sogou.com'
# 2. Send the request
response = requests.get(url=url)
# 3. Get the response data
page_text = response.text
# 4. Persist the data
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

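Scraping Sogou search results with a dynamic query (a simple page collector)

import requests

# User-Agent: the identity of the client that sends the request
# UA detection: the server checks the UA of every request and rejects requests whose UA looks like a crawler
# UA spoofing: send a browser's User-Agent string in the request headers
# A simple page collector
wd = input('enter a word:')
url = 'https://www.sogou.com/web'
# Make the request parameter dynamic
param = {
    "query": wd
}
# UA spoofing
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
response = requests.get(url=url, params=param, headers=headers)
# Set the response encoding manually to avoid garbled Chinese text
response.encoding = 'utf-8'
# text returns the response data as a string
page_text = response.text
filename = wd + '.html'
with open(filename, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(filename, 'downloaded successfully')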
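What the params argument does can be sketched with the standard library: the dictionary is URL-encoded and appended to the url as a query string. `build_search_url` is a hypothetical helper written for this illustration, not part of requests, and requests may differ in encoding details.

```python
from urllib.parse import urlencode


def build_search_url(word: str) -> str:
    """Append the search word to the Sogou search url as a query string."""
    base_url = 'https://www.sogou.com/web'
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes.
    query_string = urlencode({'query': word})
    return f'{base_url}?{query_string}'


print(build_search_url('python'))
print(build_search_url('爬虫'))
```

Printing the url for a Chinese keyword shows the percent-encoded form that appears in the browser's address bar when the same search is typed by hand.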

Scraping KFC restaurant locations

import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
city = input('enter a city name:')
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
# Open the output file once instead of reopening it for every page
with open('./kfc.txt', 'a+', encoding='utf-8') as fp:
    for pageIndex in range(1, 9):
        data = {
            "cname": "",
            "pid": "",
            "keyword": city,
            "pageIndex": pageIndex,
            "pageSize": "10",
        }
        response = requests.post(url=url, data=data, headers=headers)
        # json() parses the JSON response body into Python objects
        page_text = response.json()
        # print(page_text)
        kfc_dic = {}
        for dic in page_text['Table1']:
            # '餐廳' means 'restaurant'; it is appended to the store name
            kfc_dic[dic['storeName'] + '餐廳'] = dic['addressDetail']
        # One page of results per line
        fp.write(str(kfc_dic) + '\n')
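Writing `str(kfc_dic)` stores Python's dict representation, which is not valid JSON. If the file needs to be read back by a program later, one option is to write one JSON object per line. The sketch below assumes the response shape used in the code above (a 'Table1' list whose items carry 'storeName' and 'addressDetail'); the sample payload and the helper names are made up for illustration.

```python
import json


def extract_stores(payload: dict) -> dict:
    """Map each store name to its address; an empty payload gives an empty dict."""
    return {
        item['storeName']: item['addressDetail']
        for item in payload.get('Table1', [])
    }


def to_json_line(stores: dict) -> str:
    # ensure_ascii=False keeps Chinese text readable in the output file.
    return json.dumps(stores, ensure_ascii=False) + '\n'


# Hypothetical payload standing in for response.json().
sample_payload = {
    'Table1': [
        {'storeName': 'Store A', 'addressDetail': 'Address A'},
        {'storeName': 'Store B', 'addressDetail': 'Address B'},
    ]
}
line = to_json_line(extract_stores(sample_payload))
```

Each line written this way can be loaded again with `json.loads`.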

Scraping movie info from the Douban chart

import requests

url = 'https://movie.douban.com/j/chart/top_list'
s = 1
limit = 100
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
param = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": s,
    "limit": limit,
}
response = requests.get(url=url, headers=headers, params=param)
# json() parses the JSON response body into Python objects
page_text = response.json()
print(page_text)
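The request above fetches a single window of results. To walk through a longer chart, the start and limit parameters can be stepped page by page. The sketch below only generates the parameter dictionaries and sends nothing. It assumes that start is a record offset and limit a page size, which this post does not confirm, and `page_params` is a hypothetical helper.

```python
def page_params(total: int, limit: int, first: int = 0):
    """Yield one params dict per page until `total` records are covered."""
    for start in range(first, first + total, limit):
        yield {
            'type': '5',
            'interval_id': '100:90',
            'action': '',
            'start': start,
            # The last page may need fewer than `limit` records.
            'limit': min(limit, first + total - start),
        }


pages = list(page_params(total=250, limit=100))
```

Each dictionary in `pages` could then be passed as `params` to `requests.get`, one request per page.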

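Dynamically loaded page data
- All of the company data on the homepage turns out to be loaded dynamically
- Use a packet-capture tool to find the request and response that carry the dynamically loaded data
- Extract each company's information (its ID) from that response
- Comparing the detail-page urls shows that they all share the same domain and differ only in the id
- So each detail-page url can be built by combining the fixed domain with a company's id
- The company information on the detail page is also loaded dynamically, so the packet-capture tool is needed again to find the request behind it
- The url for a company's detail data combines one fixed url with a request parameter that differs per company
- Sending a request to that url returns the company's detail data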
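The steps above can be sketched as two small functions: one pulls the IDs out of a list response, the other pairs each ID with the fixed detail url. The key names ('list', 'ID', 'id') and the url follow the drug administration example in this post; the sample payload is made up, and no request is actually sent.

```python
DETAIL_URL = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'


def extract_ids(list_payload: dict) -> list:
    """Collect the company IDs from one page of the list response."""
    return [item['ID'] for item in list_payload.get('list', [])]


def build_detail_requests(ids: list) -> list:
    """Pair the fixed detail url with one form payload per ID."""
    return [(DETAIL_URL, {'id': company_id}) for company_id in ids]


# Hypothetical payload standing in for the list response.
sample_list_payload = {'list': [{'ID': 'abc123'}, {'ID': 'def456'}]}
detail_requests = build_detail_requests(extract_ids(sample_list_payload))
# Each (url, data) pair could then be passed to requests.post(url=..., data=...).
```

Separating ID extraction from request building keeps the parsing logic testable without touching the network.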

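Scraping all company records from the drug administration site

import requests

# List endpoint: returns one page of companies, each with an ID
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
# Detail endpoint: returns the full record for a single ID
detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

def func_list(f):
    # Walk through the list pages, then fetch the detail record of every company on each page
    for i in range(1, 330):
        data = {
            "on": "true",
            "page": i,
            "pageSize": "15",
            "productName": " ",
            "conditionType": "1",
            "applyname": " ",
            "applysn": " "
        }
        try:
            page_text = requests.post(url=url, data=data, headers=headers).json()
        except Exception as e:
            # Skip a list page that fails instead of stopping the whole run
            print(e)
            continue
        for dic in page_text['list']:
            _id = dic['ID']
            detail_data = {'id': _id}
            try:
                detail_text = requests.post(url=detail_url, data=detail_data, headers=headers).json()
            except Exception as e:
                # Skip a detail record that fails as well
                print(e)
                continue
            print(str(detail_text))
            # One record per line
            f.write(str(detail_text) + '\n')

with open('./particulars.txt', 'w', encoding='utf-8') as f:
    func_list(f)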
