A user-agent pool collects a number of different user-agent strings into one pool, from which one is chosen at random for each request.
Effect: every visit appears to come from a different browser.
import urllib.request
import re
import random

# Pool of user-agent strings to rotate through
uapools = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
]

def ua(uapools):
    # Pick a random user agent and install a global opener that sends it
    thisua = random.choice(uapools)
    print(thisua)
    headers = ("User-Agent", thisua)
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)

for i in range(10):  # pages 1-10
    ua(uapools)
    thisurl = "https://www.qiushibaike.com/text/page/" + str(i + 1) + "/"
    data = urllib.request.urlopen(thisurl).read().decode("utf-8", "ignore")
    pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
    res = re.compile(pat, re.S).findall(data)
    for j in range(len(res)):
        print(res[j])
        print('---------------------')
Search for Xici (西刺) or Daxiang (大象) proxy IPs.
Try to pick IPs located abroad.
import urllib.request

ip = "219.131.240.35"  # proxy server address
proxy = urllib.request.ProxyHandler({"http": ip})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)

url = "https://www.baidu.com/"
data = urllib.request.urlopen(url).read()

# Save the fetched page so we can check that the proxy worked
fp = open("ip_baidu.html", "wb")
fp.write(data)
fp.close()
import random
import urllib.request

# Pool of proxy IPs to rotate through
ippools = [
    "163.125.70.22",
    "111.231.90.122",
    "121.69.37.6",
]

def ip(ippools):
    # Pick a random proxy and install a global opener that routes through it
    thisip = random.choice(ippools)
    print(thisip)
    proxy = urllib.request.ProxyHandler({"http": thisip})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)

for i in range(5):
    try:
        ip(ippools)
        url = "https://www.baidu.com/"
        data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        print(len(data))
        fp = open("ip_res/ip_baidu_" + str(i + 1) + ".html", "w", encoding="utf-8")
        fp.write(data)
        fp.close()
    except Exception as err:
        print(err)
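Free proxies go stale quickly, so it can pay to weed out dead ones before crawling. Below is a minimal sketch of my own (not part of the original code) that keeps only proxies able to fetch a small test page within a timeout; note that real proxies normally need an ip:port form:

import urllib.request

# Same pool as above
ippools = ["163.125.70.22", "111.231.90.122", "121.69.37.6"]

def alive(thisip, timeout=3):
    # Return True if the proxy can fetch a small test page within the timeout
    proxy = urllib.request.ProxyHandler({"http": thisip})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    try:
        opener.open("http://www.baidu.com/", timeout=timeout)
        return True
    except Exception:
        return False

# Keep only the proxies that still respond
ippools = [p for p in ippools if alive(p)]
print(ippools)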
Paid proxy services are on hold for now for budget reasons.
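For reference, if the budget ever allows, paid providers generally hand out fresh IPs through an HTTP extraction API. A hedged sketch (the API URL is a placeholder; the real link and its response format come from the provider's docs):

import urllib.request

def fetch_ippools(api_url):
    # Assumes the API returns one proxy per line in plain text
    data = urllib.request.urlopen(api_url).read().decode("utf-8", "ignore")
    return [line.strip() for line in data.splitlines() if line.strip()]

# Placeholder URL -- substitute the extraction link your provider gives you
# ippools = fetch_ippools("http://api.example-proxy.com/get?num=10")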
Taobao's current anti-crawling measures mean the code below no longer works, but it is still useful as an exercise.
import urllib.request
import re
import random

keyname = "python"
key = urllib.request.quote(keyname)  # URLs cannot contain Chinese, so percent-encode the keyword

uapools = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
]

def ua(uapools):
    thisua = random.choice(uapools)
    print(thisua)
    headers = ("User-Agent", thisua)
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)

for i in range(1, 11):  # pages 1 to 10; each page advances s by 44 items
    ua(uapools)
    url = "https://s.taobao.com/search?q=" + key + "&s=" + str((i - 1) * 44)
    data = urllib.request.urlopen(url).read().decode("UTF-8", "ignore")
    pat = 'pic_url":"//(.*?)"'
    imglist = re.compile(pat).findall(data)
    print(len(imglist))
    for j in range(len(imglist)):
        thisimg = imglist[j]
        thisimgurl = "https://" + thisimg
        localfile = "淘寶圖片/" + str(i) + str(j) + ".jpg"  # save into a local folder
        urllib.request.urlretrieve(thisimgurl, localfile)
Wrapped into a function:
import random
import urllib.request

uapools = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
]
ippools = [
    "163.125.70.22",
    "111.231.90.122",
    "121.69.37.6",
]

def ua_ip(myurl):
    # Install a random proxy + user agent, then fetch myurl (up to 5 attempts)
    def ip(ippools, uapools):
        thisip = random.choice(ippools)
        print(thisip)
        thisua = random.choice(uapools)
        print(thisua)
        headers = ("User-Agent", thisua)
        proxy = urllib.request.ProxyHandler({"http": thisip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        opener.addheaders = [headers]
        urllib.request.install_opener(opener)

    data = None  # stays None if every attempt fails
    for i in range(5):
        try:
            ip(ippools, uapools)
            data = urllib.request.urlopen(myurl).read().decode("utf-8", "ignore")
            print(len(data))
            break
        except Exception as err:
            print(err)
    return data

data = ua_ip("https://www.baidu.com/")
fp = open("uaip.html", "w", encoding="utf-8")
fp.write(data)
fp.close()
Packaged as a module:
Copy the module file into your Python directory so it can be imported from anywhere.
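The module file is essentially the code above saved as uaip.py; a condensed sketch (pools elided for brevity):

# uaip.py -- user-agent + proxy pool helper (same logic as the function above)
import random
import urllib.request

uapools = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0',
    # ... remaining user agents from the pool above ...
]
ippools = [
    "163.125.70.22",
    # ... remaining proxy IPs from the pool above ...
]

def ua_ip(myurl):
    # Install a random proxy + user agent, fetch myurl, retry up to 5 times
    data = None  # stays None if every attempt fails
    for i in range(5):
        try:
            proxy = urllib.request.ProxyHandler({"http": random.choice(ippools)})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            opener.addheaders = [("User-Agent", random.choice(uapools))]
            urllib.request.install_opener(opener)
            data = urllib.request.urlopen(myurl).read().decode("utf-8", "ignore")
            break
        except Exception as err:
            print(err)
    return data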
Usage:
from uaip import *

data = ua_ip("https://www.baidu.com/")
fp = open("baidu.html", "w", encoding="utf-8")
fp.write(data)
fp.close()
The Fiddler tool acts as a proxy server: both requests and responses pass through Fiddler.
Use Firefox and configure its network (proxy) settings:
To handle the HTTPS protocol: open Tools > Options in Fiddler and tick the HTTPS-decryption checkboxes.
Then click Actions and export the root certificate to the desktop.
Then go back to Firefox's settings and import that certificate from the desktop.
A frequently used command is clear, which clears the session list.
Take Weibo, for example: data loads only as you scroll toward the bottom, not synchronously with the page. "Click to load more" buttons work the same way. All of this is asynchronous loading, so you have to capture and analyze the network traffic.
Let's walk through the example below.
Open Tencent Video in Firefox, for example https://v.qq.com/x/cover/j6cgzhtkuonf6te.html
Click "view more reviews"; Fiddler then captures a js file:
Its contents are the comments.
Find one comment and decode it:
Then press Ctrl+F in Firefox to confirm that this comment appears on the page.
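The comments sit in the response as \uXXXX escape sequences. A quick way to decode one such fragment in Python (a standalone sketch; the crawlers below do the same job with eval):

# A \uXXXX-escaped fragment as it appears in the captured response
s = "\\u8bc4\\u8bba"  # i.e. the twelve literal characters \u8bc4\u8bba

# unicode_escape turns the escapes back into real characters
# (assumes the escaped fragment itself is plain ASCII)
print(s.encode("ascii").decode("unicode_escape"))  # -> 评论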
Copy the URL of the js file.
Click "view more comments" to trigger another JSON response, and copy its URL too.
Comparing the two URLs,
we can see that j6cg... is the video id, reqnum is the number of comments fetched per request, and commentid is the comment id; the pattern is:
https://video.coral.qq.com/filmreviewr/c/upcomment/【vid】?reqnum=【num】&commentid=【cid】
import urllib.request
import re
from uaip import *

vid = "j6cgzhtkuonf6te"      # video id
cid = "6227734628246412645"  # comment id to start from
num = "3"                    # fetch 3 comments per request

url = "https://video.coral.qq.com/filmreviewr/c/upcomment/" + vid + "?reqnum=" + num + "&commentid=" + cid
data = ua_ip(url)

titlepat = '"title":"(.*?)","abstract":"'
commentpat = '"content":"(.*?)",'
titleall = re.compile(titlepat, re.S).findall(data)
commentall = re.compile(commentpat, re.S).findall(data)
# print(len(commentall))

for i in range(len(titleall)):
    try:
        # Titles/contents are \uXXXX-escaped; eval them into readable strings
        print("Comment title: " + eval("u'" + titleall[i] + "'"))
        print("Comment content: " + eval("u'" + commentall[i] + "'"))
        print('---------------')
    except Exception as err:
        print(err)
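Regex plus eval works here, but eval on remote data is risky and the patterns break easily. If the endpoint returns plain JSON (check the body in Fiddler), parsing it with the json module is a safer variant; the key path below is an assumption to adjust after inspecting the real response:

import json

doc = json.loads(data)  # data fetched by ua_ip above
# Assumed layout -- adjust the path once you've looked at the actual JSON
for c in doc.get("data", {}).get("comment", []):
    print("Comment title: " + c.get("title", ""))
    print("Comment content: " + c.get("content", ""))
    print('---------------')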
Paginated comment crawler: looking at the response body, you can see that the value after "last": is the id of the next page.
import urllib.request
import re
from uaip import *

vid = "j6cgzhtkuonf6te"
cid = "6227734628246412645"  # id of the first page of comments
num = "3"

for j in range(10):  # crawl pages 1-10
    print("Page " + str(j + 1))
    url = "https://video.coral.qq.com/filmreviewr/c/upcomment/" + vid + "?reqnum=" + num + "&commentid=" + cid
    data = ua_ip(url)
    titlepat = '"title":"(.*?)","abstract":"'
    commentpat = '"content":"(.*?)",'
    titleall = re.compile(titlepat, re.S).findall(data)
    commentall = re.compile(commentpat, re.S).findall(data)
    lastpat = '"last":"(.*?)"'
    cid = re.compile(lastpat, re.S).findall(data)[0]  # "last" holds the id of the next page
    for i in range(len(titleall)):
        try:
            print("Comment title: " + eval("u'" + titleall[i] + "'"))
            print("Comment content: " + eval("u'" + commentall[i] + "'"))
            print('---------------')
        except Exception as err:
            print(err)
Short (ordinary) comments work much the same way, so I won't go through it again; see the short-comment crawler below:
import urllib.request
import re
from uaip import *

vid = "1743283224"
cid = "6442954225602101929"
num = "5"

for j in range(10):  # crawl pages 1-10
    print("Page " + str(j + 1))
    url = "https://video.coral.qq.com/varticle/" + vid + "/comment/v2?orinum=" + num + "&oriorder=o&pageflag=1&cursor=" + cid
    data = ua_ip(url)
    commentpat = '"content":"(.*?)"'
    commentall = re.compile(commentpat, re.S).findall(data)
    lastpat = '"last":"(.*?)"'
    cid = re.compile(lastpat, re.S).findall(data)[0]  # cursor for the next page
    for i in range(len(commentall)):
        try:
            print("Comment content: " + eval("u'" + commentall[i] + "'"))
            print('---------------')
        except Exception as err:
            print(err)