urllib, re: built-in libraries
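As a rough sketch of what the two built-in modules cover, the snippet below pulls links out of an HTML string with re and resolves them with urllib.parse. The sample HTML, the base URL, and the helper name extract_links are made up for illustration. In a real crawl the HTML would come from something like urllib.request.urlopen(url).read().

```python
import re
from urllib.parse import urljoin

# Hypothetical sample page; a real crawler would download this with
# urllib.request.urlopen(url).read().decode()
HTML = '<a href="/news">News</a> <a href="http://example.com/about">About</a>'

# Simple pattern for double-quoted href attributes (not a full HTML parser)
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_links(html, base_url):
    """Return an absolute URL for every href found in html."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


links = extract_links(HTML, "http://www.baidu.com")
print(links)
```

A regular expression is enough for quick extraction like this, but it is fragile on messy markup, which is where the parsing libraries listed further down come in.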
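requests
selenium
Drives a real browser to obtain JS-rendered pages that requests cannot fetch
Requires chromedriver to be installed in a directory on the PATH environment variable
Drawback: a browser window has to be opened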
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
driver.page_source  # the rendered page source
phantomjs
Headless browser (no UI); processes pages in the background
from selenium import webdriver
driver = webdriver.PhantomJS()  # only in older Selenium releases; removed in Selenium 4
driver.get("http://www.baidu.com")
driver.page_source
lxml
beautifulsoup4
Relies on the lxml library (used as the parser)
from bs4 import BeautifulSoup
soup = BeautifulSoup('<html></html>', 'lxml')
# argument 1: HTML source
# argument 2: parser to use
pyquery
Syntax mirrors jQuery, which makes it convenient
pip install pyquery
from pyquery import PyQuery as pq
doc = pq("<html>Hello</html>")
r = doc('html').text()
pymysql
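The notes give no snippet for pymysql, so here is a hedged sketch of the usual store-what-you-scraped pattern. To keep it runnable without a MySQL server, it uses the built-in sqlite3 module as a stand-in. Both follow the Python DB-API, but with pymysql the connection would come from pymysql.connect(host=..., user=..., password=..., database=...) and the placeholders would be %s instead of ?. The table name pages and the sample rows are assumptions.

```python
import sqlite3

# sqlite3 stands in for pymysql here so the sketch runs without a server
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")

# Hypothetical scraped rows
rows = [("http://www.baidu.com", "Baidu"), ("http://example.com", "Example")]

# Parameterized insert; pymysql would use %s placeholders instead of ?
cur.executemany("INSERT INTO pages (url, title) VALUES (?, ?)", rows)
conn.commit()

cur.execute("SELECT title FROM pages WHERE url = ?", ("http://www.baidu.com",))
title = cur.fetchone()[0]
print(title)
conn.close()
```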
pymongo
NoSQL, key-value style storage
import pymongo
client = pymongo.MongoClient("localhost")
db = client["testdb"]
db['collection'].insert_one({"name": "jack"})  # insert() was removed in PyMongo 4
redis
Used by distributed crawlers to maintain the crawl queue
Fast
import redis
r = redis.Redis('localhost', 6379)
r.set('name', 'Bob')
r.get('name')
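To illustrate the crawl-queue idea, here is a minimal sketch of a deduplicating FIFO queue built on three Redis commands: sadd, lpush, and rpop. The class names CrawlQueue and FakeRedis and the key names are hypothetical. FakeRedis is an in-memory stand-in so the sketch runs without a server. With a real server you would pass redis.Redis('localhost', 6379) instead.

```python
class FakeRedis:
    """In-memory stand-in exposing the few redis-py style methods used below."""

    def __init__(self):
        self._lists = {}
        self._sets = {}

    def sadd(self, name, value):
        # Like Redis SADD: returns 1 if the member is new, 0 otherwise
        members = self._sets.setdefault(name, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def lpush(self, name, value):
        # Like Redis LPUSH: insert at the head, return the new length
        self._lists.setdefault(name, []).insert(0, value)
        return len(self._lists[name])

    def rpop(self, name):
        # Like Redis RPOP: remove and return the tail, or None if empty
        items = self._lists.get(name)
        return items.pop() if items else None


class CrawlQueue:
    """FIFO URL queue that skips URLs it has already seen."""

    def __init__(self, client, key="crawl:queue", seen_key="crawl:seen"):
        self.client = client
        self.key = key
        self.seen_key = seen_key

    def push(self, url):
        # sadd returns 1 only the first time a URL is added
        if self.client.sadd(self.seen_key, url):
            self.client.lpush(self.key, url)
            return True
        return False

    def pop(self):
        return self.client.rpop(self.key)


queue = CrawlQueue(FakeRedis())
queue.push("http://www.baidu.com")
queue.push("http://www.baidu.com/news")
queue.push("http://www.baidu.com")  # duplicate, ignored
first = queue.pop()
print(first)
```

Because the queue and the seen-set live in Redis rather than in one process, several crawler workers can share them. Note that a real redis.Redis client returns bytes from rpop unless it is created with decode_responses=True.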
flask
django
bottle
jupyter