Python Web Crawlers (Part 1)

Date: 2019-11-19

Tags: python, crawler

Environment: Windows 10 + Python 3.4

First, install requests:

python3 -m pip install requests

1. First, try it out on the Youth Digest (青年文摘) website

import requests
html = requests.get("http://www.qnwz.cn/index.html")
print(html.content)
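
`html.content` is raw bytes, so printing it shows escaped byte sequences rather than readable Chinese text. Below is a minimal sketch of a helper that returns decoded text instead. The name `fetch_text`, the 10-second timeout, and the fallback to `apparent_encoding` are choices made for this illustration, not part of the original post.

```python
import requests


def fetch_text(url, timeout=10):
    """Fetch a page and return its body as a decoded string."""
    # A timeout keeps a dead server from hanging the crawler forever
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    # requests falls back to ISO-8859-1 when a text/* response declares no
    # charset; in that case, let it guess the encoding from the body instead.
    if resp.encoding is None or resp.encoding.lower() == "iso-8859-1":
        resp.encoding = resp.apparent_encoding
    return resp.text


# Usage (performs a real HTTP request):
# print(fetch_text("http://www.qnwz.cn/index.html"))
```

The encoding guess is a heuristic, so for a site whose charset is known it may be simpler to set `resp.encoding` explicitly.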

If we want to save the fetched page to a file, we can do it like this:

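filename = 'index.html'  # output file; any path works
with open(filename, 'wb') as fd:
    # Write in chunks; the default chunk_size of 1 would write one byte at a time
    for c in html.iter_content(chunk_size=8192):
        fd.write(c)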
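
For large files it can help to stream the download so the whole body is never held in memory at once. A sketch of that idea follows; the helper name `download`, the timeout, and the 8192-byte chunk size are picked for illustration only.

```python
import requests


def download(url, path, chunk_size=8192):
    """Stream the body at `url` into the file at `path`; return bytes written."""
    written = 0
    # stream=True defers downloading the body until iter_content() is consumed
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fd:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fd.write(chunk)
                written += len(chunk)
    return written


# Usage (performs a real HTTP request):
# download("http://www.qnwz.cn/index.html", "index.html")
```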

2. Sometimes a page cannot be crawled directly, and you may need to submit a form first

import requests
# 'xxx' stands in for the form's real field names and values
params = {'xxx': 'xxx', 'xxx': 'xxx'}
r = requests.post("http://who_am_i.com/form.php", data=params)  # note: the form is submitted with the POST method
print(r.text)
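
To see what a form submission actually sends, the request can be built and inspected without touching the network. This is only a sketch: the field names `username` and `password`, their values, and the URL `http://example.com/form.php` are placeholders invented for the example.

```python
import requests

# Hypothetical form fields; use the names found in the real form's <input> tags
params = {"username": "alice", "password": "secret"}

# Build the request without sending it
req = requests.Request("POST", "http://example.com/form.php", data=params)
prepared = req.prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # username=alice&password=secret
```

Comparing this output with what the browser's developer tools show for a real submission is a quick way to spot a missing field.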

You may also be asked to upload a file or an image:

import requests
# Open the file in a with-block so it is closed once the upload is done
with open('1.png', 'rb') as f:
    r = requests.post("http://who_am_i.com/xixi.php", files={'f': f})
print(r.text)  # doesn't look too complicated either

You may also need to handle login and cookies:

import requests
session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://who_am_i.com/haha.php", data=params)
print(s.cookies.get_dict())
# The session sends the login cookies along with later requests
s = session.get("http://who_am_i.com/a.php")
print(s.text)
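
The point of a `Session` is that cookies received in one response are sent back on later requests. That can be checked offline by placing a cookie in the session's jar and preparing a follow-up request. A sketch, in which the cookie name `sessionid`, its value, and the `example.com` URL are all made up:

```python
import requests

session = requests.Session()

# Simulate the cookie a login response would have set (hypothetical name and value)
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Prepare, but do not send, a follow-up request through the same session
req = requests.Request("GET", "http://example.com/a.php")
prepared = session.prepare_request(req)

print(prepared.headers.get("Cookie"))  # sessionid=abc123
```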

Sometimes the site pops up a login dialog (HTTP basic authentication). requests handles this gracefully too:

import requests
from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth('username', 'password')
r = requests.post(url="http://who_am_i.com//login.php", auth=auth)
print(r.text)
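
Under the hood, HTTP basic authentication is just an `Authorization` header holding the Base64 encoding of `username:password`. The sketch below prepares a request without sending it in order to show that header; the `example.com` URL is a placeholder.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth("username", "password")
prepared = requests.Request("GET", "http://example.com/login.php", auth=auth).prepare()

header = prepared.headers["Authorization"]
print(header)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=

# The same value computed by hand
expected = "Basic " + base64.b64encode(b"username:password").decode("ascii")
print(header == expected)  # True
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected when the connection itself uses HTTPS.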

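Logins that require a CAPTCHA are harder to handle. The usual approach is to write code that fetches the CAPTCHA image, then either type the answer in by hand or run the image through a recognition tool and submit the result automatically. For example, Python's pytesseract, a wrapper around the Tesseract OCR engine, can recognize simple CAPTCHAs and is worth a try.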
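
The automatic route might be sketched as below. It assumes the Pillow and pytesseract packages and the Tesseract OCR engine are installed. The function name `read_captcha` and the `ocr` parameter (a hook for swapping in another recognizer) are inventions of this example, and plain OCR tends to work only on simple, low-noise images.

```python
import io


def read_captcha(image_bytes, ocr=None):
    """Return the letters and digits recognized in a CAPTCHA image (raw bytes)."""
    if ocr is None:
        # Imported lazily so the OCR stack is only required when actually used
        from PIL import Image
        import pytesseract

        def ocr(data):
            return pytesseract.image_to_string(Image.open(io.BytesIO(data)))

    text = ocr(image_bytes)
    # OCR output usually carries whitespace and stray punctuation;
    # keep only alphanumeric characters
    return "".join(ch for ch in text if ch.isalnum())


# Usage, assuming `session` is a requests.Session and the URL (hypothetical)
# serves the site's CAPTCHA image:
# code = read_captcha(session.get("http://example.com/captcha.png").content)
```

Fetching the image through the same session as the login request matters, because the server usually ties the expected answer to the session cookie.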

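3. Some pages are rendered with JavaScript, so the page fetched directly with requests does not match what the browser shows. In that case, you can download PhantomJS to render the JavaScript. To use it from Python you need to install selenium, a package that can drive a browser programmatically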

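from selenium import webdriver
import time
# Fill in the path to your PhantomJS executable, e.g. r'D:\p\bin\PhantomJs'.
# The raw string keeps '\b' from being read as a backspace escape.
# webdriver.PhantomJS is available in selenium 3.x; it was removed in selenium 4.
driver = webdriver.PhantomJS(executable_path=r'D:\p\bin\PhantomJs')
driver.get("http://weixin.sogou.com/weixin?type=1&query=dp")
time.sleep(3)  # crude wait for the JavaScript to finish rendering
print(driver.find_element_by_id("content").text)
driver.quit()  # quit() closes the window and also ends the PhantomJS process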
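
A fixed `time.sleep(3)` either waits too long or not long enough. One alternative is to poll until the element shows up. Below is a hand-rolled, library-independent sketch of that idea (selenium also ships its own `WebDriverWait` for this purpose). The helper name `wait_until` and its default values are assumptions of this example.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Usage with a live selenium 3.x driver (find_elements_by_id returns an empty
# list until the element exists):
# elements = wait_until(lambda: driver.find_elements_by_id("content"))
# print(elements[0].text)
```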

Those are the basic steps. For more detail, see the selenium documentation.
