Scraping the Xuexi Qiangguo (學習強國) website

Requirements

Scrape the news data shown on the page https://www.xuexi.cn/f997e76a890b0e5a053c57b19f468436/018d244441062d8916dd472a4c6a0a0b.html

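Project analysis

1 First, request the page URL and check whether the data shown in the browser is present in the HTML of the response.

2 The HTML in the response does not contain the page data, so the news data on Xuexi Qiangguo is loaded dynamically. A packet capture tool also shows that it is not an ajax request (no ajax data packets were captured), so the data can only be generated by js.

3 Use the capture tool built into Google Chrome to find out which js request carries the data and what format it is in. Open the developer tools and refresh the page.

4 Inspect the details of the response data.

5 The URLs of the detail pages can be obtained in the same way.

6 URL analysis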

Detail page: https://www.xuexi.cn/e5577906b82bc00b102d2c8d3b723312/e43e220633a65f9b6d8b53712cba9caa.html

Detail page data: https://www.xuexi.cn/e5577906b82bc00b102d2c8d3b723312/datae43e220633a65f9b6d8b53712cba9caa.js
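
The mapping between the two URLs above can be written as a small helper. This is a minimal sketch, assuming every detail page follows the pattern of this one example (the file name gets a `data` prefix and `.html` becomes `.js`). The function name `to_data_url` is hypothetical and not part of the original code.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def to_data_url(page_url: str) -> str:
    """Map a detail page URL to the URL of the js file that holds its data.

    Assumption: the data file sits in the same directory, is named
    'data' + <page name>, and ends in '.js' instead of '.html'.
    """
    parts = urlsplit(page_url)
    directory, filename = posixpath.split(parts.path)
    stem, ext = posixpath.splitext(filename)
    if ext != '.html':
        raise ValueError(f'not an .html page: {page_url}')
    new_path = posixpath.join(directory, f'data{stem}.js')
    return urlunsplit(parts._replace(path=new_path))


page = ('https://www.xuexi.cn/e5577906b82bc00b102d2c8d3b723312/'
        'e43e220633a65f9b6d8b53712cba9caa.html')
print(to_data_url(page))
```

Only the last path segment is touched. A plain `str.replace('/e', '/datae')` would also rewrite the directory segment of this URL, because that segment starts with `e` too.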

Full crawl implementation

import json
import re

import requests
from lxml import etree


class Spider:
    def __init__(self, headers, url, fp=None):
        self.headers = headers
        self.url = url
        self.fp = fp

    def open_file(self):
        self.fp = open('學習強國2.txt', 'w', encoding='utf8')

    def get_data(self):
        # Fetch whatever self.url currently points at and return the body as text
        return requests.get(url=self.url, headers=self.headers).text

    def parse_home_data(self):
        # Collect every detail page URL from the list data
        ex = '"static_page_url":"(.*?)"'
        home_data = self.get_data()
        return re.findall(ex, home_data)

    def parse_detail_data(self):
        detail_url = self.parse_home_data()
        print(detail_url)
        i = 0
        for url in detail_url:
            i += 1
            # Gotcha: some pages return '<title>系統維護中</title>' (system under
            # maintenance), so any parsing error is printed and the page skipped
            try:
                # Prefix only the file name with 'data' and change .html to .js
                self.url = re.sub(r'/([^/]+)\.html$', r'/data\1.js', url)
                detail_data = self.get_data()
                # Strip the 'globalCache = ' prefix and the trailing character
                detail_data = detail_data.replace('globalCache = ', '')[:-1]
                dic_data = json.loads(detail_data)
                # Get the key of the first key-value pair in the dict
                first = list(dic_data.keys())[0]
                title = dic_data[first]['detail']['frst_name']
                content_html = dic_data[first]['detail']['content_list'][0]['content']
                tree = etree.HTML(content_html)
                content_list = tree.xpath('.//p/text()')
            except Exception as e:
                print(e)
                continue
            # '第{i}章' means 'Chapter {i}'
            self.fp.write(f'第{i}章' + title + '\n' + ''.join(content_list) + '\n\n')

    def close_file(self):
        self.fp.close()

    def run(self):
        self.open_file()
        self.parse_detail_data()
        self.close_file()


if __name__ == '__main__':
    headers = {
        'Host': 'www.xuexi.cn',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    }
    url = 'https://www.xuexi.cn/f997e76a890b0e5a053c57b19f468436/data018d244441062d8916dd472a4c6a0a0b.js'
    spider = Spider(url=url, headers=headers)
    spider.run()
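
The parsing step inside the loop can be tried offline on a small sample. The sketch below assumes the js file has the form `globalCache = {...};`, as the code above implies. The sample payload, its key `sample_key`, and the names `ParagraphText` and `parse_detail` are made up for illustration. The standard library `html.parser` stands in for lxml only to keep the example self-contained.

```python
import json
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <p> tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def parse_detail(js_text):
    """Return (title, body text) from a 'globalCache = {...};' payload."""
    body = js_text.strip()
    prefix = 'globalCache = '
    if body.startswith(prefix):
        body = body[len(prefix):]
    body = body.rstrip(';')
    data = json.loads(body)
    first = next(iter(data))            # first key of the top-level dict
    detail = data[first]['detail']
    parser = ParagraphText()
    parser.feed(detail['content_list'][0]['content'])
    parser.close()
    return detail['frst_name'], ''.join(parser.chunks)


# Made-up payload that only mimics the structure the spider reads.
sample = ('globalCache = {"sample_key": {"detail": {"frst_name": "Sample title", '
          '"content_list": [{"content": "<p>First.</p><p>Second.</p>"}]}}};')
title, text = parse_detail(sample)
print(title, text)
```

Unlike the xpath `.//p/text()`, this parser also keeps text nested in tags inside a `<p>`, so results on real pages may differ slightly.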

Result:

64 articles in total.