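In recent years, as web applications have expanded and deepened, obtaining data from the web efficiently has become a goal for countless companies and individuals. In the era of big data, whoever holds more data stands to gain more, and web crawlers are among the most commonly used means of collecting data from the web.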
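A web crawler, or Web Spider, is a vivid name. If the internet is likened to a spider's web, then the Spider is the spider crawling back and forth across it. A web spider finds pages by following their link addresses. Starting from one page of a site (usually the home page), it reads the page content, finds the other link addresses in that page, and then uses those links to find the next pages. This loop continues until every page of the site has been fetched. If the whole internet is treated as a single website, a web spider can use the same principle to fetch every page on the internet.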
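The link-following loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the `SITE` dictionary is a hypothetical in-memory stand-in for a real website, and `crawl` and `LinkParser` are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML body.
SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">home</a>',
    "/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch):
    """Breadth-first crawl from `start`; `fetch(url)` returns HTML or None."""
    seen = {start}          # URLs already queued, so no page is fetched twice
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", SITE.get)
print(visited)
```

In a real crawler, `fetch` would download the page over HTTP. The `seen` set is what keeps the loop from running forever when pages link back to each other.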
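The most valuable thing on the internet is data: product information on Tmall, rental listings on Lianjia, securities and investment information on Xueqiu, and so on. This data represents real money in each industry, and whoever holds first-hand data in an industry gains a decisive edge in it. If all the data on the internet is likened to a treasure trove, this crawler course teaches you how to mine it efficiently. Once you have crawling skills, you become, in effect, the boss behind every internet information company: they are all providing you with valuable data for free.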
The HTTP protocol
Request methods supported by the requests module:
import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
1 Basic request
import requests

response = requests.get('https://www.jd.com/')

# Save the raw response bytes to a local file
with open("jd.html", "wb") as f:
    f.write(response.content)
2 Request with parameters
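import requests

# Query string written directly into the URL
response = requests.get('https://s.taobao.com/search?q=手機')

# Query string passed through the params argument
response = requests.get('https://s.taobao.com/search', params={"q": "美女"})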
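To make the effect of `params` more concrete, the sketch below builds the same kind of query string by hand with the standard library. It only illustrates the percent-encoding that ends up in the URL. It is not how requests is implemented internally, and the variable names are chosen for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "https://s.taobao.com/search"

# urlencode percent-encodes non-ASCII values as UTF-8 bytes
query = urlencode({"q": "手機"})
url = f"{base}?{query}"
print(url)

# Decoding the query string gives back the original value
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Passing a dictionary instead of concatenating strings avoids having to escape special characters manually.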
3 Request with headers
import requests

response = requests.get(
    'https://dig.chouti.com/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    },
)
4 Request with cookies
import uuid
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))  # a random cookie value

res = requests.get(url, cookies=cookies)
print(res.json())
5 requests.session()
import requests

# res = requests.get("https://www.zhihu.com/explore")
# print(res.cookies.get_dict())

# A session keeps the cookies it receives and sends them on later requests
session = requests.session()
res1 = session.get("https://www.zhihu.com/explore")
print(session.cookies.get_dict())

res2 = session.get("https://www.zhihu.com/question/30565354/answer/463324517", cookies={"abs": "123"})
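Conceptually, a session remembers the cookies a server sets and sends them back on later requests, while cookies passed to a single call apply to that call only. The sketch below imitates that behaviour with the standard library so it can run offline. `TinySession` is a hypothetical class written for this illustration, not part of requests, and it ignores details such as domains, paths and expiry.

```python
from http.cookies import SimpleCookie


class TinySession:
    """Hypothetical stand-in for a session: remembers cookies between requests."""

    def __init__(self):
        self.cookies = {}

    def receive(self, set_cookie_header):
        """Store the cookies carried by a Set-Cookie response header."""
        jar = SimpleCookie()
        jar.load(set_cookie_header)
        for name, morsel in jar.items():
            self.cookies[name] = morsel.value

    def cookie_header(self, extra=None):
        """Build the Cookie request header; `extra` is sent once, not stored."""
        merged = {**self.cookies, **(extra or {})}
        return "; ".join(f"{k}={v}" for k, v in merged.items())


session = TinySession()
session.receive("sid=abc123; Path=/; HttpOnly")   # first response sets a cookie

first = session.cookie_header()                       # stored cookie is sent back
second = session.cookie_header(extra={"abs": "123"})  # per-request cookie added
third = session.cookie_header()                       # per-request cookie is gone
print(first, second, third, sep="\n")
```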
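1 The data parameter
requests.post() is used in exactly the same way as requests.get(). The difference is that requests.post() takes an additional data parameter, which holds the request body.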
import requests

# params goes into the URL query string, data goes into the request body
response = requests.post("http://httpbin.org/post", params={"a": "10"}, data={"name": "yuan"})
2 Sending JSON data
import requests

# No Content-Type specified; with data= the default is application/x-www-form-urlencoded
res1 = requests.post(url='http://httpbin.org/post', data={'name': 'yuan'})
print(res1.json())

# With json= the default Content-Type is application/json
res2 = requests.post(url='http://httpbin.org/post', json={'age': "22"})
print(res2.json())
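The difference between `data=` and `json=` comes down to how the request body is serialized. The following sketch produces both forms with the standard library, purely to show what each body looks like on the wire; the `payload` dictionary is made up for this example.

```python
import json
from urllib.parse import urlencode

payload = {"name": "yuan", "age": "22"}

# Form encoding: key=value pairs joined with "&"
form_body = urlencode(payload)
form_type = "application/x-www-form-urlencoded"

# JSON encoding: the dictionary serialized as a JSON document
json_body = json.dumps(payload)
json_type = "application/json"

print(form_type, form_body)
print(json_type, json_body)
```

Which form to send depends on what the server expects. An endpoint that reads form fields will generally not see values sent as JSON, and vice versa.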
import requests

response = requests.get('https://sh.lianjia.com/ershoufang/')

# Attributes of the response object
print(response.text)                # body decoded to str
print(response.content)             # body as raw bytes
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)
print(response.encoding)
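import requests

response = requests.get('http://www.autohome.com/news')

# The Autohome page is encoded as gb2312, while requests defaults to ISO-8859-1.
# Without setting the encoding to gbk, the Chinese text comes out garbled.
# response.encoding = 'gbk'

with open("res.html", "w", encoding="utf-8") as f:
    f.write(response.text)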
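The garbling described in the comment can be reproduced without any network access. This small sketch, using a sample string chosen for illustration, decodes the same bytes with the wrong and the right codec.

```python
# Bytes as a GB2312-encoded page would deliver them
raw = "汽车之家".encode("gb2312")

# Decoding with ISO-8859-1 never fails, but yields one wrong character per byte
wrong = raw.decode("iso-8859-1")

# GBK is a superset of GB2312, so it decodes these bytes correctly
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Setting `response.encoding` before reading `response.text` plays the role of choosing the codec here.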
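import requests

# stream=True defers downloading the body until it is iterated
response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg', stream=True)

with open("res.jpg", "wb") as f:
    # f.write(response.content)
    # When downloading, say, a 100 GB video, reading response.content and
    # writing it to the file in one go is not reasonable, so write it in chunks.
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)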
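The idea behind writing in chunks is independent of requests: read a bounded amount, write it, and repeat, so memory use stays flat regardless of file size. Below is a minimal offline sketch in which an `io.BytesIO` object stands in for the network stream; `save_in_chunks` is a helper name invented for this example.

```python
import io
import os
import tempfile


def save_in_chunks(source, path, chunk_size=8192):
    """Copy a binary file-like `source` to `path` one chunk at a time."""
    written = 0
    with open(path, "wb") as f:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:          # empty bytes means the stream is exhausted
                break
            f.write(chunk)
            written += len(chunk)
    return written


fake_body = io.BytesIO(b"x" * 20000)                  # stand-in for a response body
target = os.path.join(tempfile.mkdtemp(), "res.bin")
total = save_in_chunks(fake_body, target)
print(total, os.path.getsize(target))
```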
import requests
import json

response = requests.get('http://httpbin.org/get')

res1 = json.loads(response.text)  # more cumbersome
res2 = response.json()            # parses the JSON body directly

print(res1 == res2)
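By default, Requests automatically handles all redirects except for HEAD requests. You can use the history attribute of the response object to trace redirects. Response.history is a list of the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent request.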
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
In addition, redirect handling can be disabled with the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
import requests
import re

# Request 1: fetch the login page
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # initial cookies (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # CSRF token taken from the page
print("authenticity_token", authenticity_token)

# Request 2: POST to the login endpoint with the initial cookies, the token,
# and the account name and password
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': 'yuanchenqi0316@163.com',
    'password': 'yuanchenqi0316'
}

r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie,
                   # allow_redirects=False
                   )

print(r2.status_code)      # 200
print(r2.url)              # the page after the redirect: https://github.com/
print(r2.history)          # the response before the redirect: [<Response [302]>]
print(r2.history[0].text)  # the body of the response before the redirect

with open("result.html", "wb") as f:
    f.write(r2.content)
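The token extraction step can be tried in isolation. The sketch below runs the same regular expression against a small, made-up HTML fragment; the markup of the real login page may differ, and `sample_html` and its token value are assumptions for illustration only.

```python
import re

# Hypothetical fragment of a login form
sample_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN-123" />
  <input type="text" name="login" />
</form>
'''

# Non-greedy match: skip from the name attribute to the nearest value="..."
pattern = re.compile(r'name="authenticity_token".*?value="(.*?)"')

match = pattern.search(sample_html)
token = match.group(1) if match else None   # None when the field is absent
print(token)
```

Checking for a missing match before indexing avoids the IndexError that `re.findall(...)[0]` raises when the page layout changes.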
import requests
import re
import json
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(50)


def getPage(url):
    """Download a page and return its text."""
    response = requests.get(url)
    return response.text


def parsePage(res):
    """Return an iterator of regex matches, one per movie entry."""
    # "评价" is matched as it appears on the Simplified Chinese page
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',
        re.S)
    iter_result = com.finditer(res)
    return iter_result


def gen_movie_info(iter_result):
    """Turn each match into a dictionary of named fields."""
    for i in iter_result:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def stored(gen):
    """Append each record to the output file as one JSON line."""
    with open("move_info.txt", "a", encoding="utf8") as f:
        for line in gen:
            data = json.dumps(line, ensure_ascii=False)
            f.write(data + "\n")


def spider_movie_info(url):
    res = getPage(url)
    iter_result = parsePage(res)
    gen = gen_movie_info(iter_result)
    stored(gen)


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    pool.submit(spider_movie_info, url)
    # spider_movie_info(url)


if __name__ == '__main__':
    before = time.time()
    count = 0
    for i in range(10):
        main(count)
        count += 25
    pool.shutdown(wait=True)  # wait for all submitted tasks before stopping the clock
    after = time.time()
    print("Total time taken:", after - before)
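When timing work submitted to a thread pool, the measurement is only meaningful if the main thread waits for the tasks to finish. The sketch below shows one way to do that with `ThreadPoolExecutor.map` inside a `with` block. `fake_fetch` is a hypothetical stand-in that sleeps instead of downloading, so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(start):
    """Hypothetical stand-in for downloading one page of results."""
    time.sleep(0.05)
    return f"page starting at {start}"


starts = range(0, 250, 25)   # 0, 25, ..., 225: ten page offsets

before = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map returns results in input order; leaving the with block
    # waits for every task to complete
    results = list(pool.map(fake_fetch, starts))
elapsed = time.time() - before

print(len(results), "pages in", round(elapsed, 2), "seconds")
```

With ten workers the ten simulated downloads overlap, so the elapsed time is close to a single sleep rather than ten of them.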