The most common problem in crawling is that we need to scrape content from many different sites: the extracted data has the same structure and the same downstream processing, but every site presents the content we want differently, so we have to analyze each site one by one and build the adapter it needs.
To match multiple download targets more generically, we usually use adapters: the main program calls the same method, yet can download from different sites.
There are several ways to implement adapters. Since we are not using classes here, we can simply use the import_module function from importlib, which gives us everything we need.
The simplest implementation:
First create a folder named extractors, and inside it create three files: __init__.py, baisi.py, qiubai.py
__init__.py needs no code.
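For reference, this is roughly the layout we end up with (utils/common.py, which provides the GetPage and GetMedia helpers used later, is assumed to exist already from the earlier parts of this project):

```
.
├── main.py
├── utils/
│   └── common.py        # GetPage / GetMedia helpers
└── extractors/
    ├── __init__.py      # left empty
    ├── baisi.py
    └── qiubai.py
```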
baisi.py:
```python
def TransferPageToData_baisi(url):
    print('baisi', url)


transfer = TransferPageToData_baisi
```
qiubai.py:
```python
def TransferPageToData_qiubai(url):
    print('qiubai', url)


transfer = TransferPageToData_qiubai
```
Then change our main.py to the following:
```python
# coding: utf-8
from pyquery import PyQuery as pq
from utils.common import GetMedia, GetPage
from importlib import import_module

SITES = {
    'qiubai': 'http://www.qiushibaike.com/',
    'baisi': 'http://www.budejie.com/'
}


def AnyTransfer(key, sites=SITES):
    m = import_module('.'.join(['extractors', key]))
    url = sites[key]
    m.transfer(url)


def main():
    for (k, v) in SITES.items():
        AnyTransfer(k)


if __name__ == '__main__':
    main()
```
Then we run
$ python3 main.py
and get the result:
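On Python 3.7+ (where dicts keep insertion order) the output should look like the following; on older versions the two lines may appear in either order:

```
qiubai http://www.qiushibaike.com/
baisi http://www.budejie.com/
```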
The key points of this implementation are:

1. Every adapter in extractors returns the same data structure (above we only print; later each one returns a list of JSON-style dicts, see the sketch right after this list).
2. Each module ends with transfer = XXXX, so they all expose the same function name.
3. In the main program, m = import_module('.'.join(['extractors', key])) loads the module directly, and we then call its transfer function.
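For illustration, here is a minimal sketch of the common record shape that the adapters below return (field names are taken from the extractor code; the values are placeholders):

```python
# One entry in the list returned by every adapter's transfer().
# Values are placeholders; 'mediapath' only appears for 'pic' / 'video' items.
record = {
    'id': '123456',      # id of the post on the source site
    'type': 'pic',       # 'text', 'pic' or 'video'
    'mediapath': '...',  # local path returned by GetMedia
    'content': '...',    # text of the post
}
```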
Here we simply use the dict key to pick the matching module. A more robust approach would be to decide which adapter to use by matching the URL against a regular expression; we won't expand on that here, but interested readers can look at how you-get implements it. A rough sketch of that idea follows.
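A minimal sketch of URL-based dispatch, assuming we keep the same extractors package; URL_PATTERNS and MatchTransfer are made up here for illustration and are not part of the project:

```python
import re
from importlib import import_module

# Hypothetical mapping from URL patterns to extractor module names.
URL_PATTERNS = {
    r'qiushibaike\.com': 'qiubai',
    r'budejie\.com': 'baisi',
}


def MatchTransfer(url):
    # Pick the adapter whose pattern matches the URL, then call its transfer().
    for pattern, name in URL_PATTERNS.items():
        if re.search(pattern, url):
            m = import_module('.'.join(['extractors', name]))
            return m.transfer(url)
    raise ValueError('no extractor matches ' + url)
```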
Finally we add the real parsing for each site, with each adapter returning a list:
baisi.py
```python
from pyquery import PyQuery as pq
from utils.common import (GetMedia, GetPage)


def TransferPageToData_baisi(url):
    page = GetPage(url)
    results = []
    d = pq(page)
    contents = d("div .g-mn .j-r-c .j-r-list ul li .j-r-list-c")
    for item in contents:
        i = pq(item)
        content = i("div .j-r-list-c-desc").text().strip()
        video_id = i("div .j-video-c").attr('data-id')
        pic_id = i("div .j-r-list-c-img img").attr('data-original')
        if video_id:
            print('video: ' + video_id)
            video_des = i("div .j-video").attr('data-mp4')
            video_path = GetMedia(video_des, media_type='video')
            dct = {
                "content": content,
                "id": video_id,
                "type": "video",
                "mediapath": video_path
            }
            results.append(dct)
        elif pic_id:
            print('pic: ' + pic_id)
            pic_path = GetMedia(pic_id)
            dct = {
                "content": content,
                "id": pic_id,
                "type": "pic",
                "mediapath": pic_path
            }
            results.append(dct)
    return results


transfer = TransferPageToData_baisi
```
qiubai.py
```python
from pyquery import PyQuery as pq
from utils.common import (GetMedia, GetPage)


def TransferPageToData_qiubai(url):
    page = GetPage(url)
    results = []
    d = pq(page)
    contents = d("div .article")
    for item in contents:
        i = pq(item)
        pic_url = i("div .thumb img").attr.src
        content = i("div .content").text()
        qiubai_id = i.attr.id
        print("qiubai:", qiubai_id)
        if pic_url:
            pic_path = GetMedia(pic_url)
            dct = {
                'id': qiubai_id,
                'type': 'pic',
                'mediapath': pic_path,
                'content': content
            }
        else:
            dct = {
                'id': qiubai_id,
                'type': 'text',
                # 'mediapath': '',
                'content': content
            }
        results.append(dct)
    return results


transfer = TransferPageToData_qiubai
```
main.py
```python
# coding: utf-8
from importlib import import_module

__author__ = 'BONFY CHEN <foreverbonfy@163.com>'

SITES = {
    'qiubai': 'http://www.qiushibaike.com/',
    'baisi': 'http://www.budejie.com/'
}


def AnyTransfer(key, sites=SITES):
    m = import_module('.'.join(['extractors', key]))
    url = sites[key]
    results = m.transfer(url)
    return results


def PrepareData():
    results = []
    for (k, v) in SITES.items():
        results = results + AnyTransfer(k)
    return results


def main():
    print(PrepareData())


if __name__ == '__main__':
    main()
```
Run it and we get the result: a single list containing the records scraped from both sites.
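With this structure, supporting a new site only means adding one module under extractors that exposes a transfer returning the same list of dicts, plus one entry in SITES. A hypothetical example (the module name is made up and not part of the project):

```python
# extractors/example.py -- hypothetical new adapter, for illustration only
def TransferPageToData_example(url):
    # fetch and parse the page here, building the same list-of-dict structure
    results = []
    return results


transfer = TransferPageToData_example
```

Then register it in main.py with an entry such as 'example': 'http://example.com/' in SITES.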
For full details, see https://github.com/bonfy/xiaolinBot
Stay tuned for the next installment: storing the data in MongoDB.
若是你們以爲有用,也請不要吝嗇Star 和 Like哦~