PYTHON Crawler Notes 9: Scraping Toutiao Street-Photo Galleries with Ajax + Regular Expressions + BeautifulSoup (Hands-on Project 2)


  • Target site analysis

    On a site like Toutiao, everything from the data shown to the CSS styling is driven by data returned from backend interfaces, so scraping it is not quite like scraping an ordinary page: instead of parsing the rendered HTML, we need to grab the JSON data that the backend sends down.

 

  First, take a look at the structure of Toutiao's page source. Let's try grabbing an article's title and the image links on its detail page:

As you can see from the source above, scraping it directly gets us nothing useful, so let's look at the backend data instead:

 

 All of the data is delivered in the backend JSON, so we need to scrape it through that interface.
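As a sketch of that idea, the Ajax endpoint can be queried directly by reproducing the query string the browser sends. The parameter names below mirror what the page used at the time of writing and may well have changed since:

```python
from urllib.parse import urlencode

# Parameters mirroring the browser's Ajax request; the field names and
# the endpoint itself are assumptions that may no longer match the site.
params = {
    'offset': 0,
    'format': 'json',
    'keyword': '街拍',
    'autoload': 'true',
    'count': 20,
}
url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
print(url)

# The response at this URL is JSON, so it can be decoded with
# requests.get(url).json() instead of being parsed as HTML.
```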

Extracting the JSON data from the page

The result of running the function; if you want to scrape in bulk, remember to enable multiprocessing and write the results to a database:
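Fanning the work out across processes is straightforward with `multiprocessing.Pool`. A minimal sketch, using a stand-in task instead of the real `main(offset)`:

```python
from multiprocessing import Pool

def crawl(offset):
    # Stand-in for the real main(offset): each worker process would
    # fetch and parse one index page here; we just echo the offset
    # to show how the pages are fanned out.
    return offset

if __name__ == '__main__':
    # One offset per index page, 20 results each, with bounds in the
    # style of the GROUP_START/GROUP_END config constants.
    offsets = [x * 20 for x in range(1, 3)]
    with Pool() as pool:
        print(pool.map(crawl, offsets))  # [20, 40]
```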

Here is the result:

 

 To sum up: many Toutiao scraping examples online first scrape a given index page to get the article URLs, and then scrape each detail page. But the current Toutiao site works differently: the index page's interface response already carries the detail-page data, which is passed along to the detail-page template when you click through. Developing it this way is convenient, saves quite a bit of time, and reduces the amount of code.

  •  Workflow outline

  • Hands-on crawler

  1. The spider (index and detail pages)

    import json
    import os
    import re
    from hashlib import md5
    from json import JSONDecodeError
    from multiprocessing import Pool
    from urllib.parse import urlencode

    import pymongo
    import requests
    from bs4 import BeautifulSoup

    from config import *

    client = pymongo.MongoClient(MONOGO_URL, connect=False)
    db = client[MONOGO_DB]


    def get_page_index(offset, keyword):
        # Request the index (search) page
        data = {
            'offset': offset,
            'format': 'json',
            'keyword': keyword,
            'autoload': 'true',
            'count': 20,
            'cur_tab': 3,
            'from': 'search_tab'
        }
        url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.text
            return None
        except Exception:
            print('Error requesting the index page')
            return None


    def parse_page_index(html):
        # Parse the index page's JSON and yield the article URLs
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                yield item.get('article_url')


    def get_page_detail(url):
        # Request a detail page
        headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.text
            return None
        except Exception:
            print('Error requesting a detail page', url)
            return None


    def parse_page_detail(html, url):
        # Parse the content of each detail page
        soup = BeautifulSoup(html, 'lxml')
        title = soup.select('title')[0].get_text()
        image_pattern = re.compile(r'gallery: JSON.parse\("(.*)"\)', re.S)
        result = re.search(image_pattern, html)
        if result:
            try:
                data = json.loads(result.group(1).replace('\\', ''))
                if data and 'sub_images' in data.keys():
                    sub_images = data.get('sub_images')
                    images = [item.get('url') for item in sub_images]
                    for image in images:
                        download_image(image)
                    return {
                        'title': title,
                        'url': url,
                        'images': images
                    }
            except JSONDecodeError:
                pass


    def save_to_mongo(result):
        # Store the record in MongoDB
        try:
            if db[MONOGO_TABLE].insert(result):
                print('Saved successfully', result)
                return True
            return False
        except TypeError:
            pass


    def download_image(url):
        # Check that the image link can actually be fetched
        headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
        print('Downloading:', url)
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                save_image(response.content)
            return None
        except requests.ConnectionError:
            print('Error requesting the image', url)
            return None


    def save_image(content):
        # Save the image to a given location, named by the MD5 of its content
        # file_path = '{}/{}.{}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
        file_path = '{}/{}.{}'.format('/Users/darwin/Desktop/aaa', md5(content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(content)


    def main(offset):
        html = get_page_index(offset, KEYWORD)
        for url in parse_page_index(html):
            html = get_page_detail(url)
            if html:
                result = parse_page_detail(html, url)
                if result:
                    save_to_mongo(result)


    if __name__ == '__main__':
        groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
        pool = Pool()
        pool.map(main, groups)
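The trickiest step above is pulling the gallery JSON out of the page's inline `JSON.parse(...)` call. Here is a self-contained illustration against a made-up page fragment (the snippet and URL are hypothetical; real pages embed much larger payloads in the same shape):

```python
import json
import re

# A made-up fragment of a gallery page; real Toutiao gallery pages embed
# their image data in exactly this kind of inline JSON.parse(...) call.
html = 'gallery: JSON.parse("{\\"sub_images\\": [{\\"url\\": \\"http://p.example.com/1.jpg\\"}]}")'

image_pattern = re.compile(r'gallery: JSON.parse\("(.*)"\)', re.S)
result = re.search(image_pattern, html)
if result:
    # Strip the backslash escaping before decoding, as parse_page_detail does
    data = json.loads(result.group(1).replace('\\', ''))
    images = [item.get('url') for item in data.get('sub_images', [])]
    print(images)  # ['http://p.example.com/1.jpg']
```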
  2. The config file

    MONOGO_URL = 'localhost'
    MONOGO_DB = 'toutiao'
    MONOGO_TABLE = 'toutiao'
    GROUP_START = 1
    GROUP_END = 2
    KEYWORD = '街拍'