I've been into web scraping for a while now, so it's time to try something a bit bigger, heh.
Note: this article assumes some basic Python, plus familiarity with Requests and XPath syntax, and with regular expressions.
Requests
Requests is an HTTP library written in Python on top of urllib, released under the Apache2 License.
If you have read articles on using the urllib library, you will have noticed that urllib is actually rather inconvenient; Requests is far more convenient and saves us a great deal of work. (Once you have used requests, you will hardly want to go back to urllib.) In one sentence: requests is the simplest, most usable HTTP library implemented in Python, and it is the one I recommend for scraping.
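To make the difference concrete, here is a minimal side-by-side sketch of the same GET request in both libraries. The URL and the 'gbk' encoding are taken from the scraper later in this post; everything else is illustrative.

# urllib: build a Request object, open it, read raw bytes, decode by hand
from urllib import request

req = request.Request('https://www.biquyun.com/15_15566/',
                      headers={'User-Agent': 'Mozilla/5.0'})
with request.urlopen(req) as resp:
    page = resp.read().decode('gbk')

# requests: one call; decoding is handled by setting .encoding
import requests

resp = requests.get('https://www.biquyun.com/15_15566/',
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.encoding = 'gbk'
page = resp.text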
XPath
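The two scripts below lean on just two XPath idioms: @href pulls an attribute value out of a tag, and text() pulls the text node. A minimal sketch on a hypothetical, simplified version of the biquyun chapter list (the chapter filename is made up):

from lxml import etree

snippet = '<dl><dd><a href="/15_15566/0000001.html">第一章</a></dd></dl>'
tree = etree.HTML(snippet)
print(tree.xpath('//dd/a/@href'))   # ['/15_15566/0000001.html']
print(tree.xpath('//dd/a/text()'))  # ['第一章']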
# regex + requests + xpath
from lxml import etree
import requests
import re
import warnings
import time

warnings.filterwarnings("ignore")  # requests is called with verify=False, so silence the InsecureRequestWarning it triggers

headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

def get_urls(URL):
    # fetch the table of contents and collect every chapter link
    Html = requests.get(URL, headers=headers, verify=False)
    Html.encoding = 'gbk'
    HTML = etree.HTML(Html.text)
    results = HTML.xpath('//dd/a/@href')
    return results

def get_items(result):
    # fetch one chapter page and pull title and body out with a regex
    url = 'https://www.biquyun.com' + str(result)
    html = requests.get(url, headers=headers, verify=False)
    html.encoding = 'gbk'
    pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)
    item = re.findall(pattern, html.text)[0]
    items = '\n' * 2 + str(item[0]) + '\n' * 2 + str(item[1])
    items = items.replace('&nbsp;', '').replace('<br />', '')  # strip the indentation entities and line-break tags
    return items

def save_to_file(items):
    with open("xiaoshuo1.txt", 'a', encoding='utf-8') as file:
        file.write(items)

def main(URL):
    results = get_urls(URL)
    ii = 1
    for result in results:
        items = get_items(result)
        save_to_file(items)
        print(str(ii) + ' in 1028')  # 1028 = total number of chapters
        ii = ii + 1
        # time.sleep(1)

if __name__ == '__main__':
    start_1 = time.time()
    URL = 'https://www.biquyun.com/15_15566/'
    main(URL)
    print('Done!')
    end_1 = time.time()
    print('Scrape time 1:', end_1 - start_1)
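The key step above is the non-greedy pattern compiled with re.S, which makes . match newlines so one expression can span the whole page. A self-contained sketch of how it captures title and body, using a hypothetical, heavily simplified chapter page:

import re

page = ('<div class="bookname"><h1>第一章</h1></div>'
        '<div id="content">&nbsp;&nbsp;第一段<br /><br />&nbsp;&nbsp;第二段</div>')
pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?<div.*?content">(.*?)</div>', re.S)
title, body = re.findall(pattern, page)[0]
print(title)                                              # 第一章
print(body.replace('&nbsp;', '').replace('<br />', ''))   # 第一段第二段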
Run result:
# requests + xpath
from lxml import etree
import requests
import warnings
import time

warnings.filterwarnings("ignore")  # requests is called with verify=False, so silence the InsecureRequestWarning it triggers

headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

def get_urls(URL):
    # fetch the table of contents and collect every chapter link
    Html = requests.get(URL, headers=headers, verify=False)
    Html.encoding = 'gbk'
    HTML = etree.HTML(Html.text)
    results = HTML.xpath('//dd/a/@href')
    return results

def get_items(result):
    # fetch one chapter page and extract title and body with XPath only
    url = 'https://www.biquyun.com' + str(result)
    html = requests.get(url, headers=headers, verify=False)
    html.encoding = 'gbk'
    html = etree.HTML(html.text)
    resultstitle = html.xpath('//*[@class="bookname"]/h1/text()')
    resultsbody = html.xpath('//*[@id="content"]/text()')
    # stringify the list of text nodes, then strip the repr quoting,
    # the \xa0 indentation and the \r\n pairs
    items = (str(resultstitle[0]) + '\n' * 2
             + str(resultsbody).replace("', '", '')
                               .replace('\\xa0\\xa0\\xa0\\xa0', '')
                               .replace('\\r\\n\\r\\n', '\n\n')
                               .replace("['", '')
                               .replace("']", '')
             + '\n' * 2)
    return items

def save_to_file(items):
    with open("xiaoshuo2.txt", 'a', encoding='utf-8') as file:
        file.write(items)

def main(URL):
    results = get_urls(URL)
    ii = 1
    for result in results:
        items = get_items(result)
        save_to_file(items)
        print(str(ii) + ' in 1028')  # 1028 = total number of chapters
        ii = ii + 1
        # time.sleep(1)

if __name__ == '__main__':
    start_2 = time.time()
    URL = 'https://www.biquyun.com/15_15566/'
    main(URL)
    print('Done!')
    end_2 = time.time()
    print('Scrape time 2:', end_2 - start_2)
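One thing worth noting in get_items: str(resultsbody) converts the list to its repr and then strips the quoting piece by piece, which is brittle. A hedged alternative (same output on this hypothetical input) is to join the text nodes directly and clean the real characters instead of their escape sequences:

resultstitle = ['第一章']                      # hypothetical XPath results
resultsbody = ['\xa0\xa0\xa0\xa0第一段\r\n\r\n',
               '\xa0\xa0\xa0\xa0第二段\r\n\r\n']

body = ''.join(resultsbody).replace('\xa0', '').replace('\r\n\r\n', '\n\n')
items = resultstitle[0] + '\n\n' + body
print(items)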
Run result:
PS: the actual scraping speed depends on your machine and network connection. Also, regex matching can sometimes take a very long time, so XPath is the recommended approach.
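If you want to check that claim on your own machine, here is a quick micro-benchmark sketch over a repeated hypothetical snippet; the absolute numbers will vary, but it shows the shape of the comparison:

import re
import time
from lxml import etree

sample = ('<div class="bookname"><h1>第一章</h1></div>'
          '<div id="content">正文</div>') * 2000

start = time.time()
pattern = re.compile('<div.*?<h1>(.*?)</h1>.*?content">(.*?)</div>', re.S)
re.findall(pattern, sample)
print('re:', time.time() - start)

start = time.time()
tree = etree.HTML(sample)
tree.xpath('//h1/text()')
tree.xpath('//*[@id="content"]/text()')
print('xpath:', time.time() - start)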
Pitfalls when writing scrapers:
1. Garbled Chinese text in the scraped page
Solution:
print(response.encoding)  # the encoding requests guessed from the response headers
print(requests.utils.get_encodings_from_content(response.text)[0])  # the encoding declared inside the page itself
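In practice the second value is the one to trust: assign it back to response.encoding before reading response.text again. A minimal sketch using the URL from this post (the guard covers pages that declare no charset):

import requests

response = requests.get('https://www.biquyun.com/15_15566/', verify=False)
declared = requests.utils.get_encodings_from_content(response.text)
if declared:
    response.encoding = declared[0]  # switch to the charset the page itself declares
print(response.text[:100])           # the Chinese text now decodes correctly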
Reference links:
http://cn.python-requests.org/zh_CN/latest/
https://www.liaoxuefeng.com/wiki/1016959663602400/1183249464292448
http://www.w3school.com.cn/xpath/index.asp
http://www.javashuo.com/article/p-vzrpzoiu-gd.html
https://blog.csdn.net/ahua_c/article/details/80942726
https://www.bilibili.com/video/av19057145
https://www.crifan.com/python_re_search_vs_re_findall/
https://www.jianshu.com/p/4c076da1b7f7
https://blog.csdn.net/u014109807/article/details/79735400
http://www.javashuo.com/article/p-gjeuvzyo-gh.html
http://www.javashuo.com/article/p-awceltwg-gm.html
http://www.javashuo.com/article/p-gdwkmjqb-ce.html
https://blog.51cto.com/13603552/2308728
https://blog.csdn.net/sinat_35360663/article/details/78455991