Target URL: http://weixin.sogou.com/weixin?

This is Sogou's WeChat article search, which can find WeChat articles; our target is the content of those articles.

Testing this URL shows that without logging in to WeChat we can only see 10 pages of results; after logging in we can view 100 pages.

Also, paging through the results repeatedly triggers an IP-detection anti-crawling mechanism: the server returns a 302 redirect to a captcha page, and we can only keep browsing after entering the captcha.

So we use a proxy pool to get around this anti-crawling measure.

First, build the main crawler skeleton. Since this is a search-style URL normally requested with GET, we use urlencode to build the query string. My query here is query=python&type=2&page=1, where type=1 searches official accounts and type=2 searches WeChat articles. If a ConnectionError occurs, we simply retry the request. That completes the skeleton.
```python
from urllib.parse import urlencode

import requests
from requests.exceptions import ConnectionError

base_url = 'http://weixin.sogou.com/weixin?'
KEYWORD = 'python'


def get_html(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except ConnectionError:
        # retry on connection errors
        return get_html(url)


def get_index(keyword, page):
    data = {
        'query': keyword,
        'type': 2,
        'page': page
    }
    queries = urlencode(data)
    url = base_url + queries
    html = get_html(url)
    print(html)
    # return the page so later code can parse it
    return html


if __name__ == '__main__':
    get_index(KEYWORD, 1)
```
Next: since we only fetched one page of search results here, no 302 has shown up yet. Now we set up the proxy pool and then use cookies to crawl all 100 pages. The proxy pool I use is https://github.com/Python3WebSpider/ProxyPool, free but unstable (it will have to do). After downloading it, remember to change the following line in its requirements before installing:
redis==2.10.6
Otherwise some dirty data will show up and break the proxy pool.
With the proxy pool running, we can fetch a proxy IP from http://localhost:5555/random, so we no longer need to worry about IP bans.
增長headers的cookies信息以及獲取代理。這裏User-Agent最好設置成Chrome 67版本如下,不然會一直卡在302中ide
```python
from requests.exceptions import ConnectionError

proxy = None
PROXY_POOL_URL = 'http://localhost:5555/random'

headers = {
    'Cookie': 'SUV=00BC42EFDA11E2615BD9501783FF7490; CXID=62F139BEE160D023DCA77FFE46DF91D4; SUID=61E211DA4D238B0A5BDAB0B900055D85; ad=Yd1L5yllll2tbusclllllVeEkmUlllllT1Xywkllll9llllllZtll5@@@@@@@@@@; SNUID=A60850E83832BB84FAA2B6F438762A9E; IPLOC=CN4400; ld=Nlllllllll2tPpd8lllllVh9bTGlllllTLk@6yllll9llllljklll5@@@@@@@@@@; ABTEST=0|1552183166|v1; weixinIndexVisited=1; sct=1; ppinf=5|1552189565|1553399165|dHJ1c3Q6MToxfGNsaWVudGlkOjQ6MjAxN3x1bmlxbmFtZTo4OnRyaWFuZ2xlfGNydDoxMDoxNTUyMTg5NTY1fHJlZm5pY2s6ODp0cmlhbmdsZXx1c2VyaWQ6NDQ6bzl0Mmx1UHBWaElMOWYtYjBhNTNmWEEyY0RRWUB3ZWl4aW4uc29odS5jb218; pprdig=eKbU5eBV3EJe0dTpD9TJ9zQaC2Sq7rMxdIk7_8L7Auw0WcJRpE-AepJO7YGSnxk9K6iItnJuxRuhmAFJChGU84zYiQDMr08dIbTParlp32kHMtVFYV55MNF1rGsvFdPUP9wU-eLjl5bAr77Sahi6mDDozvBYjxOp1kfwkIVfRWA; sgid=12-39650667-AVyEiaH25LM0Xc0oS7saTeFQ; ppmdig=15522139360000003552a8b2e2dcbc238f5f9cc3bc460fd0; JSESSIONID=aaak4O9nDyOCAgPVQKZKw',
    'Host': 'weixin.sogou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.2987.133 Safari/537.36'
}


def get_proxy():
    try:
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        return None
```
Then modify the get_html method. Here allow_redirects=False disables automatic redirects; without it, the GET request would follow the redirect straight to the captcha page. We distinguish the cases with and without a proxy, because we start out using our own IP and only switch to a proxy once a 302 appears. We also add a retry counter that returns None after too many attempts, to avoid wasting resources (in practice the request usually succeeds by Count=2 at most).
```python
MAX_COUNT = 5


def get_html(url, count=1):
    print('Crawling', url)
    print('Trying Count', count)
    global proxy
    if count >= MAX_COUNT:
        print('Tried Too Many Counts')
        return None
    try:
        if proxy:
            proxies = {
                'http': 'http://' + proxy
            }
            response = requests.get(url, allow_redirects=False, headers=headers, proxies=proxies)
        else:
            response = requests.get(url, allow_redirects=False, headers=headers)
        if response.status_code == 200:
            return response.text
        if response.status_code == 302:
            # Need Proxy
            print('302')
            proxy = get_proxy()
            if proxy:
                print('Using Proxy', proxy)
                # keep counting retries so a run of bad proxies cannot recurse forever
                return get_html(url, count + 1)
            else:
                print('Get Proxy Failed')
                return None
    except ConnectionError as e:
        print('Error Occurred', e.args)
        proxy = get_proxy()
        count += 1
        return get_html(url, count)
```
Good, now we can get all 100 pages of search results. Next we still need to follow each article link and crawl the article content itself.

The target is this <a> tag. I use pyquery to extract it and yield the href links, and also define get_detail to fetch the article page. Since the article pages live on https://mp.weixin.qq.com, no proxy is needed there.
```python
from pyquery import PyQuery as pq


def parse_index(html):
    doc = pq(html)
    items = doc('.news-box .news-list li .txt-box h3 a').items()
    for item in items:
        yield item.attr('href')


def get_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        return None
```
Then analyze the article page itself. We want the article title, author, official account, content, and publish time. Later I found that some articles have no author, only an official account, so I replaced the author with the official account's WeChat ID.

So I wrote the parse_detail method. XMLSyntaxError is an error pyquery often raises when special characters prevent parsing, so catch it up front.
```python
from lxml.etree import XMLSyntaxError


def parse_detail(html):
    try:
        doc = pq(html)
        title = doc('.rich_media_title').text()
        content = doc('.rich_media_content').text()
        date = doc('#publish_time').text()
        nickname = doc('#js_profile_qrcode > div > strong').text()
        wechat = doc('#js_profile_qrcode > div > p:nth-child(3) > span').text()
        return {
            'title': title,
            'content': content,
            'date': date,
            'nickname': nickname,
            'wechat': wechat
        }
    except XMLSyntaxError:
        return None
```
Now we have the title, content, official account, WeChat ID, and publish time of WeChat articles about python. The data still needs to be stored, and saving it to MongoDB is quick and simple.
```python
import pymongo

MONGO_URI = 'localhost'
MONGO_DB = 'weixin'

client = pymongo.MongoClient(MONGO_URI)
db = client[MONGO_DB]


def save_to_mongo(data):
    # upsert on title so re-crawled articles overwrite instead of duplicating
    if db['articles'].update_one({'title': data['title']}, {'$set': data}, upsert=True):
        print('Saved to Mongo', data['title'])
    else:
        print('Saved to Mongo Failed', data['title'])
```
Finally, rewrite the main method:
```python
def main():
    for page in range(1, 101):
        html = get_index(KEYWORD, page)
        if html:
            article_urls = parse_index(html)
            for article_url in article_urls:
                article_html = get_detail(article_url)
                if article_html:
                    article_data = parse_detail(article_html)
                    print(article_data)
                    if article_data:
                        save_to_mongo(article_data)
```
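To run the whole crawl end to end, the `if __name__ == '__main__':` guard from the first snippet can simply call main() instead of get_index; a minimal sketch:

```python
if __name__ == '__main__':
    # crawl all 100 pages and store the parsed articles in MongoDB
    main()
```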
All done. You can now check the crawled content in MongoDB.
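For a quick sanity check from Python, you can query the collection directly with pymongo; a minimal sketch, assuming the `weixin` database, the `articles` collection, and the field names used in save_to_mongo and parse_detail above:

```python
import pymongo

# count the stored articles and print a few titles to verify the crawl worked
client = pymongo.MongoClient('localhost')
db = client['weixin']

print(db['articles'].count_documents({}))
for doc in db['articles'].find({}, {'title': 1, 'nickname': 1, 'date': 1}).limit(5):
    print(doc.get('title'), '|', doc.get('nickname'), '|', doc.get('date'))
```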
A classmate reported that publish_time came back empty, so I took a look.

The time field was indeed empty. Scrolling further down the page source, I found that the value is injected via JavaScript.

The value is still present in the HTML that requests returns, so we can extract it with re (BeautifulSoup and pyquery can only match HTML or XML tags).
That does the trick; prefer re.search over re.match whenever you can.
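A tiny demonstration of why: re.match only tries the pattern at the beginning of the string, while re.search scans the whole document. The `html` string below is a made-up stand-in for the script block that the real article page embeds:

```python
import re

# hypothetical stand-in for the script block embedded in the article page
html = 'var ct = "1552212600";var publish_time = "2019-03-10" || "";'

pattern = r'var\spublish_time\s=\s\"(.*?)\"\s\|\|'
print(re.match(pattern, html))            # None: match only tries the start of the string
print(re.search(pattern, html).group(1))  # 2019-03-10: search scans the whole string
```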
Update our parse_detail method and we can get the time:
```python
from lxml.etree import XMLSyntaxError
import re


def parse_detail(html):
    try:
        doc = pq(html)
        title = doc('.rich_media_title').text()
        content = doc('.rich_media_content').text()
        # publish_time is injected by a script, so pull it out of the raw HTML with a regex
        date_match = re.search(r'var\spublish_time\s=\s\"(.*?)\"\s\|\|', html)
        date = date_match.group(1) if date_match else ''
        nickname = doc('#js_profile_qrcode > div > strong').text()
        wechat = doc('#js_profile_qrcode > div > p:nth-child(3) > span').text()
        return {
            'title': title,
            'content': content,
            'date': date,
            'nickname': nickname,
            'wechat': wechat
        }
    except XMLSyntaxError:
        return None
```