Inspecting the page structure and network requests with the browser's developer tools, it is easy to see that each headline item has the structure shown in the figure.
So we can locate each item with the CSS selector dd.tracking-ad > span > a. Meanwhile, analyzing the requests in the Network panel shows that the first "load more" request is:
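To see the selector in isolation, here is a minimal sketch; the sample HTML below is made up to mirror the structure described above, not copied from the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking one headline entry on the page
html = '''
<dl>
  <dd class="tracking-ad"><span><a href="http://example.com/news/1">Headline one</a></span></dd>
  <dd class="tracking-ad"><span><a href="http://example.com/news/2">Headline two</a></span></dd>
</dl>
'''

soup = BeautifulSoup(html, 'html.parser')
# The same selector used later in the article: dd -> span -> a
for a in soup.select('dd.tracking-ad > span > a'):
    print(a['href'], a.string)
```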
http://geek.csdn.net/service/news/get_news_list?from=-&size=20&type=HackCount
and the second is:
http://geek.csdn.net/service/news/get_news_list?from=6:245113&size=20&type=HackCount
The requests above have been simplified; some parameters of the original requests were removed.
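If you prefer not to paste query strings by hand, the same URL can be built from a parameter dict with the standard library; the parameter names and values here are the ones shown in the requests above:

```python
from urllib.parse import urlencode

# Parameters from the first "load more" request
params = {'from': '-', 'size': 20, 'type': 'HackCount'}
url = 'http://geek.csdn.net/service/news/get_news_list?' + urlencode(params)
print(url)
# http://geek.csdn.net/service/news/get_news_list?from=-&size=20&type=HackCount
```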
In other words, on the first "load more" request the from parameter is -, and every subsequent request passes the from value returned by the previous response. With that, we can scrape the data in Python:
```python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import requests
import time


class CS:
    def __init__(self):
        # self.username = username
        pass

    def geek(self, _from=None, type='HackCount', size=20):
        """
        url: http://geek.csdn.net/,
        more: http://geek.csdn.net/service/news/get_news_list?from=-&size=20&type=HackCount
        :param _from: marker for loading the next page of data
        :param type: category of the geek headlines
        :param size: number of items per page
        :return:
        """
        start = '-'
        if _from:
            timestamp = int(time.time())
            url = 'http://geek.csdn.net/service/news/get_news_list?' \
                  'from=%s&size=%d&type=%s&_=%d' % (_from, size, type, timestamp)
            req = requests.get(url)
            js = req.json()
            start = js['from']
            soup = BeautifulSoup(js['html'], 'lxml')
        else:
            url = 'http://geek.csdn.net/'
            req = requests.get(url)
            soup = BeautifulSoup(req.content, 'lxml')
        results = soup.select('dd.tracking-ad > span > a')
        items = []
        for result in results:
            item = {
                'href': result['href'],
                'title': result.string
            }
            items.append(item)
        return {
            'from': start,
            'items': items
        }


cs = CS()
items = []
_from = ''
i = 0
# This controls how many pages to fetch
while i < 10:
    result = cs.geek(_from=_from)
    items.extend(result['items'])
    _from = result['from']
    i = i + 1

print(items)
```
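The from-chaining in the loop above can be isolated into a small sketch with no network access; the fake_get_news_list function and its page data are made up for illustration and stand in for the real HTTP request:

```python
# Hypothetical stand-in for the real get_news_list endpoint: each response
# carries a 'from' token that keys the next page, mirroring the real API.
def fake_get_news_list(_from):
    pages = {
        '-': {'from': '6:245113', 'items': ['a', 'b']},
        '6:245113': {'from': '6:245000', 'items': ['c']},
    }
    # An unknown token means no more pages
    return pages.get(_from, {'from': None, 'items': []})

items = []
_from = '-'          # initial request uses from=-
while _from:
    resp = fake_get_news_list(_from)
    items.extend(resp['items'])
    _from = resp['from']   # chain the token returned by the previous response

print(items)  # ['a', 'b', 'c']
```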
Project: simulating JD.com login
Feedback QQ group: 173318043