Some things used in Python web scraping

Date: 2019-11-18

Plain requests

>>> import requests
>>> response = requests.get('http://www.baidu.com')
>>> response.text  # the page source as text
>>> response.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 26 Nov 2018 00:21:32 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
>>> response.status_code
200
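
The headers above report Content-Type: text/html with no charset. In that case requests falls back to ISO-8859-1 when decoding response.text, which can garble Chinese pages. A minimal sketch of one workaround, assuming the page is really UTF-8; readable_text and fallback_encoding are hypothetical names for illustration, not part of requests:

```python
import requests


def readable_text(response, fallback_encoding="utf-8"):
    """Return the body as text, using fallback_encoding when no charset was declared."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower():
        # Assumption: the page is really encoded as fallback_encoding.
        response.encoding = fallback_encoding
    return response.text


# Usage (needs network access):
# html = readable_text(requests.get("http://www.baidu.com"))
```

When the server does declare a charset, the helper leaves the encoding chosen by requests untouched.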

>>> headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
>>> response = requests.get('http://www.baidu.com', headers=headers)  # send a custom User-Agent header
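
Outside the interactive prompt, the same request could be wrapped in a small function. This is only a sketch: fetch_html, DEFAULT_HEADERS, and the 10-second timeout are assumptions for illustration, not something the original specifies.

```python
import requests

# User-Agent string copied from the example above.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    )
}


def fetch_html(url, session=None, timeout=10):
    """Fetch url with a browser-like User-Agent and return the body text."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (needs network access):
# html = fetch_html("http://www.baidu.com")
```

Passing a shared requests.Session lets several calls reuse cookies and connections; the timeout keeps a stalled server from hanging the script.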

Printing binary content and handling image files

>>> response = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1543204467171&di=19de509bd81641d74f3ac61472898d8e&imgtype=0&src=http%3A%2F%2Fimage.biaobaiju.com%2Fuploads%2F20180803%2F20%2F1533299921-zRLwijpYoE.jpg')
>>> response.content  # the response body as raw bytes
>>> with open('./1.jpg', 'wb') as f:
...     f.write(response.content)
...
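
For larger files, reading response.content loads the whole body into memory first. A hedged sketch of a streaming alternative, using the stream=True option and iter_content from requests; download_file and the 8192-byte chunk size are hypothetical choices for illustration:

```python
import requests


def download_file(url, path, session=None, chunk_size=8192, timeout=10):
    """Stream the body of url into path and return the number of bytes written."""
    if session is None:
        session = requests.Session()
    written = 0
    # stream=True defers the body download until iter_content is consumed.
    with session.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                written += f.write(chunk)
    return written


# Usage (needs network access; image_url is whatever image link you want to save):
# download_file(image_url, "./1.jpg")
```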

Using selenium to simulate browser operations

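>>> from selenium import webdriver
>>> driver = webdriver.Chrome()  # assumption: the original omits creating the driver; Chrome is one option
>>> driver.get('http://m.weibo.cn')  # open Weibo
>>> driver.get('http://www.zhihu.com')  # open Zhihu
>>> driver.get('http://www.taobao.com')  # open Taobao
>>> driver.page_source  # get the page source
>>> driver.quit()  # close the browser when finished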
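
The visit-then-read pattern above can be sketched as a reusable function. collect_page_sources is a hypothetical helper, not part of selenium; it only relies on the driver's get, page_source, and quit members, and it assumes the caller has already created the driver.

```python
def collect_page_sources(driver, urls):
    """Visit each URL with an existing WebDriver and return {url: page_source}.

    The driver is always quit at the end, even if a page fails to load.
    """
    sources = {}
    try:
        for url in urls:
            driver.get(url)
            sources[url] = driver.page_source
    finally:
        driver.quit()
    return sources


# Usage (assumes Chrome and a matching driver are installed):
# from selenium import webdriver
# pages = collect_page_sources(
#     webdriver.Chrome(),
#     ['http://m.weibo.cn', 'http://www.zhihu.com', 'http://www.taobao.com'],
# )
```

Quitting in a finally block avoids leaving stray browser processes behind when a page load raises an exception.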
