Crawler types: general-purpose crawlers, focused crawlers, and incremental crawlers.
A note on capturing packets with Fiddler: because Fiddler installs its own certificate, a project's requests to HTTPS pages are asked for a valid certificate during the SSL handshake and may be rejected.
You can either turn off certificate verification when sending the requests call, or temporarily disable the Fiddler proxy. This pitfall comes up again at the end...
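As a minimal sketch of the first option, turning off certificate verification for a single request while Fiddler is running (the URL and headers here are just placeholders):

import requests

url = 'https://example.com/'            # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder User-Agent
# verify=False skips SSL certificate verification, so Fiddler's
# self-signed certificate no longer causes the request to be rejected.
page_text = requests.get(url=url, headers=headers, verify=False).text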
Parsing data out of HTML tags with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'https://www.yangguiweihuo.com/16/16089/'
ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url=url, headers=ua).text  # HTML of the full chapter-list page

soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.listmain > dl > dd > a')  # parse out the <a> tag of every chapter

with open("秦吏.txt", 'w', encoding='utf-8') as f:
    for a in a_list:
        title = a.string  # text of the <a> tag, used as the chapter title
        detail_url = 'https://www.yangguiweihuo.com' + a['href']  # build the chapter-detail URL
        detail_page = requests.get(url=detail_url, headers=ua).text
        dsp = BeautifulSoup(detail_page, 'lxml')  # chapter-detail page
        content = dsp.find('div', id='content').text  # chapter body text
        f.write(title + '\n' + content)  # persist the data
        print(title + ": download finished")
print('The end')
On using XPath:
div[@class="song"] div中class爲song的標籤元素測試
div[@class="song"]/li/a/@href 取出其中的url地址網站
div[@class="song"]/li/a/text() 取出其中的文本ui
div[contains(@class,'ng')] 是指在div中查找class屬性名含有ng的標籤元素url
div[starts-with(@class,'ta')] 是指div中查找class屬性以ta開頭的標籤元素spa
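As a quick sanity check of these expressions, a throwaway sketch against a small hand-written HTML snippet (the snippet and URLs are made up, and the starts-with prefix is adjusted to 'so' so it matches the snippet's class):

from lxml import etree

html = '''
<div class="song">
  <li><a href="http://example.com/1">first</a></li>
  <li><a href="http://example.com/2">second</a></li>
</div>
'''
tree = etree.HTML(html)
print(tree.xpath('//div[@class="song"]/li/a/@href'))   # ['http://example.com/1', 'http://example.com/2']
print(tree.xpath('//div[@class="song"]/li/a/text()'))  # ['first', 'second']
print(tree.xpath("//div[contains(@class,'ng')]"))      # divs whose class contains 'ng' ("song" matches)
print(tree.xpath("//div[starts-with(@class,'so')]"))   # divs whose class starts with 'so'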
A small XPath example:
import requests
from lxml import etree

url = "https://gz.58.com/ershoufang/?PGTID=0d100000-0000-335c-5dda-1cebcdf9ae5f&ClickID=2"
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url, headers=user_agent).text
tree = etree.HTML(page_text)  # the whole page parsed into an element tree
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')  # xpath returns the matches as a list

fp = open("58.csv", 'w', encoding='utf-8')
for li in li_list:
    title = li.xpath("./div[@class='list-info']/h2/a/text()")[0]
    price = li.xpath("./div[@class='price']//text()")
    sum_price = ''.join(price)
    fp.write("home:" + title + " price:" + sum_price + '\n')
fp.close()
print("Data fetch complete!")
Fixing garbled (mojibake) text from a website:
import requests, os
from lxml import etree

url = 'http://pic.netbian.com/4kmeinv/'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url, headers=user_agent).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')

def getpic(title, photo):
    if not os.path.exists('./photo'):  # create the folder if it does not exist yet
        os.mkdir('./photo')
    fp = open('photo/' + title, 'wb')
    fp.write(photo)
    fp.close()
    return "Current resource downloaded"

for li in li_list:
    title = li.xpath('./a/b/text()')[0] + ".jpg"
    title = title.encode('iso-8859-1').decode('gbk')  # mojibake: recover the raw bytes, then decode as GBK
    print(title)
    p_url = li.xpath('./a/img/@src')[0]
    picture_url = 'http://pic.netbian.com' + p_url
    photo = requests.get(url=picture_url, headers=user_agent).content
    ret = getpic(title, photo)
    print(ret)
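Why the encode('iso-8859-1').decode('gbk') round trip works: the site is GBK-encoded but requests guessed ISO-8859-1, and since ISO-8859-1 maps every byte, re-encoding with it recovers the original bytes, which can then be decoded correctly. A standalone sketch with an arbitrary example string:

raw = "古典美女".encode('gbk')                      # the bytes the server actually sent
wrong = raw.decode('iso-8859-1')                    # what requests produces when it guesses ISO-8859-1
fixed = wrong.encode('iso-8859-1').decode('gbk')    # undo the wrong decode, then decode as GBK
print(wrong)   # mojibake
print(fixed)   # 古典美女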
Batch-downloading resume templates, and handling the garbled-character problem:
import requests, random, os, time
from lxml import etree

url = 'http://sc.chinaz.com/jianli/free.html'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
response = requests.get(url, headers=user_agent)  # resume list page
response.encoding = 'utf-8'  # fix the garbled text before reading .text
page_text = response.text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="container"]/div')
if not os.path.exists('./jl'):
    os.mkdir('./jl')
for div in div_list:
    title = div.xpath('./a/img/@alt')[0]  # resume name
    link = div.xpath('./a/@href')[0]      # resume detail-page URL
    detail_page = requests.get(url=link, headers=user_agent).text  # resume detail page
    dpage = etree.HTML(detail_page)
    down_list = dpage.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
    down_url = random.choice(down_list)  # pick a download mirror at random
    word = requests.get(url=down_url, headers=user_agent).content
    print("Starting download >> " + title)
    with open('./jl/' + title + '.zip', 'wb') as fp:
        fp.write(word)
    time.sleep(1)
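If the page's charset is not known in advance, requests can also guess it from the body instead of hard-coding 'utf-8'. A sketch of that variant (same list-page URL, placeholder User-Agent):

import requests

response = requests.get('http://sc.chinaz.com/jianli/free.html',
                        headers={"User-Agent": "Mozilla/5.0"})
# apparent_encoding is detected from the response body, which is usually more
# reliable than the ISO-8859-1 default requests falls back to when the
# Content-Type header carries no charset.
response.encoding = response.apparent_encoding
page_text = response.text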
Testing a proxy IP: if the page data gets written out, the proxy IP is usable.
import requests

url = 'https://www.baidu.com/s?wd=ip'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
proxy = {"https": '112.85.170.79:9999'}
page = requests.get(url, headers=user_agent, proxies=proxy).text
with open('./ip.html', 'w', encoding='utf-8') as f:
    f.write(page)
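A more direct check than eyeballing the saved Baidu page is to ask an IP-echo service which address it sees. A sketch (httpbin.org is just one such service, and the proxy address is the same throwaway one as above):

import requests

proxy = {"https": "112.85.170.79:9999"}
# httpbin echoes back the client IP it sees; if the proxy actually carries the
# request, this prints the proxy's address instead of your own. A dead proxy
# simply raises a connection error here.
print(requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10).json())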
The pitfall that ate my whole day: when the User-Agent is specified in headers, the server redirects to the https address, and that raises an SSL verification error. To keep the redirect from breaking on certificate verification, just turn verification off: page_text=requests.get(url=url,headers=user_agent,verify=False).text
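With verify=False, requests also emits an InsecureRequestWarning on every call. If the noise bothers you, it can be silenced; a sketch with a placeholder URL (note this only hides the warning, it does not make the connection any safer):

import requests
import urllib3

# silence the InsecureRequestWarning that verify=False triggers on every call
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

page_text = requests.get('https://example.com/', verify=False).text  # placeholder URL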