What is a web crawler
Categories of crawlers
requests
Anti-crawling mechanisms
Anti-anti-crawling strategies
A crawler program applies appropriate strategies and techniques to defeat a portal site's anti-crawling measures and thereby scrape the target data.
The first anti-crawling mechanism
Coding workflow: specify the URL → send the request → get the response data → persist the data
Jupyter shortcuts
- Launch: jupyter notebook
- Insert a cell: a (above) / b (below)
- Delete a cell: x or dd
- Cell mode switch: m (markdown) / y (code)
- Run a cell: shift+enter
- Code completion: tab
- Open help docs: shift+tab
Example: scraping the Sogou homepage
import requests

# 1. Specify the URL
url = 'https://www.sogou.com'
# 2. Send the request
response = requests.get(url=url)
# 3. Get the response data
page_text = response.text
# 4. Persist it
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
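The four steps above can be wrapped in a single helper with a timeout and status check, so a failed request raises instead of silently saving an error page. A minimal sketch; `fetch_and_save` and its `getter` parameter are hypothetical, not part of the original code:

```python
import requests

def fetch_and_save(url, path, getter=requests.get):
    """Hypothetical helper: specify URL -> send request -> get response -> persist.

    The getter parameter only exists so the network call can be swapped out
    for testing; requests.get is the default.
    """
    response = getter(url, timeout=10)   # fail instead of hanging forever
    response.raise_for_status()          # raise on 4xx/5xx status codes
    response.encoding = 'utf-8'          # avoid garbled Chinese text
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(response.text)
    return path

# Real usage would be: fetch_and_save('https://www.sogou.com', './sogou.html')
```

`raise_for_status()` and the `timeout` keyword are standard `requests` features; without a timeout, `requests` waits indefinitely.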
Dynamically scraping the Sogou search page (a simple web collector)
import requests

# User-Agent: the identity string of the request carrier
# UA detection: the portal's server inspects each request's UA; if it detects a crawler UA, the request fails
# UA spoofing: disguise the crawler's UA as a regular browser's
# Simple web page collector
wd = input('enter a word:')
url = 'https://www.sogou.com/web'
# Make the request parameter dynamic
param = {
    'query': wd
}
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
response = requests.get(url=url, params=param, headers=headers)
# Manually set the response encoding to fix garbled Chinese text
response.encoding = 'utf-8'
# text returns the response data as a string
page_text = response.text
filename = wd + '.html'
with open(filename, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(filename, 'downloaded successfully')
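`requests` URL-encodes the `params` dict itself, which can be verified offline with a prepared (unsent) request. A minimal sketch; the query word and the shortened UA string here are just sample data:

```python
from requests import Request

# Build (but do not send) the same kind of GET request the collector makes
req = Request(
    'GET',
    'https://www.sogou.com/web',
    params={'query': '爬蟲'},
    headers={'User-Agent': 'Mozilla/5.0'},
).prepare()

# The Chinese query word is percent-encoded into the final URL,
# and the spoofed UA header is attached to the prepared request.
print(req.url)
print(req.headers['User-Agent'])
```

This is why the collector can pass `wd` straight into `params` without encoding it by hand.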
Scraping KFC restaurant location data
import requests

# Scrape KFC restaurant location data
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
city = input('enter a city name:')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
kfc_dic = {}
for pageIndex in range(1, 9):
    data = {
        'cname': '',
        'pid': '',
        'keyword': city,
        'pageIndex': pageIndex,
        'pageSize': '10',
    }
    response = requests.post(url=url, data=data, headers=headers)
    # json() returns the deserialized JSON object
    page_text = response.json()
    for dic in page_text['Table1']:
        kfc_dic[dic['storeName'] + '餐廳'] = dic['addressDetail']
# Open the file once, after all pages are collected
with open('./kfc.txt', 'w', encoding='utf-8') as fp:
    fp.write(str(kfc_dic))
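Writing `str(kfc_dic)` produces a file that is awkward to load back into Python; `json.dump` keeps the Chinese text readable and round-trippable. A minimal sketch, using a made-up sample of the `{store name: address}` mapping built above:

```python
import json

# Made-up sample data in the same shape as kfc_dic above
kfc_dic = {'前門餐廳': '北京市東城區前門大街', '王府井餐廳': '北京市東城區王府井大街'}

# ensure_ascii=False keeps Chinese characters readable in the file
with open('./kfc.json', 'w', encoding='utf-8') as fp:
    json.dump(kfc_dic, fp, ensure_ascii=False, indent=2)

# Unlike a str()-dumped file, this can be loaded straight back into a dict
with open('./kfc.json', encoding='utf-8') as fp:
    restored = json.load(fp)
```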
Scraping movie info from the Douban chart
import requests

url = 'https://movie.douban.com/j/chart/top_list'
s = 1
limit = 100
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
param = {
    'type': '5',
    'interval_id': '100:90',
    'action': '',
    'start': s,
    'limit': limit,
}
response = requests.get(url=url, headers=headers, params=param)
page_text = response.json()
print(page_text)
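The `start`/`limit` pair is how this endpoint pages through the chart: each request returns at most `limit` movies beginning at offset `start`. A small hypothetical helper (not in the original) shows how successive page parameters could be generated:

```python
def page_params(total, limit=20):
    """Yield start/limit dicts covering `total` movies, `limit` per request.

    Hypothetical helper: the Douban endpoint itself only ever sees one
    start/limit pair per request.
    """
    for start in range(0, total, limit):
        yield {'start': start, 'limit': min(limit, total - start)}

# Each dict would be merged into `param` before calling requests.get
for page in page_params(50, 20):
    print(page)
```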
Dynamically loaded page data
Scraping all company info from the drug administration site
import requests

url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

def func_list(f):
    for i in range(1, 330):
        data = {
            'on': 'true',
            'page': i,
            'pageSize': '15',
            'productName': ' ',
            'conditionType': '1',
            'applyname': ' ',
            'applysn': ' '
        }
        try:
            page_text = requests.post(url=url, data=data, headers=headers).json()
        except Exception as e:
            print(e)
            continue
        for dic in page_text['list']:
            _id = dic['ID']
            detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
            data = {'id': _id}
            detail_text = requests.post(url=detail_url, data=data, headers=headers).json()
            print(str(detail_text))
            f.write(str(detail_text))

with open('./particulars.txt', 'w', encoding='utf-8') as f:
    func_list(f)
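With 329 list pages plus one detail request per company, transient network failures are likely, and the bare `try/except` above simply skips a whole page. A hedged sketch of a retry wrapper; `post_with_retry` and its `post_json` injection point are hypothetical, not a `requests` API:

```python
import time

def post_with_retry(post_json, url, data, retries=3, delay=1):
    """Call post_json(url, data) up to `retries` times before giving up.

    post_json is any callable returning decoded JSON, e.g.
    lambda u, d: requests.post(u, data=d, headers=headers, timeout=10).json()
    """
    for attempt in range(retries):
        try:
            return post_json(url, data)
        except Exception:
            if attempt == retries - 1:
                raise            # out of attempts: propagate the error
            time.sleep(delay)    # brief pause before retrying
```

Wrapping both `requests.post(...).json()` calls this way means a single dropped connection costs one retry instead of fifteen companies' worth of detail records.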