JD.com product page: https://item.jd.com/100005603...
Try the following code first:
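import requests

url = 'https://item.jd.com/100005603522.html'
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding guessed from the content
    print(r.text[:1000])
except requests.RequestException:
    print('Fetch failed')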
Amazon product page:
First, try the same approach that worked for JD.com:
>>> r = requests.get('https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1')
>>> r.status_code
503
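The returned status code is 503 (Service Unavailable), which tells us the server refused our request.
Let's find out what went wrong. First, change the encoding of the returned data so that the text is readable.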
>>> r.encoding
'ISO-8859-1'
>>> r.encoding = r.apparent_encoding
>>> r.text
<p class="a-last">抱歉,咱們只是想確認一下當前訪問者並不是自動程序。爲了達到最佳效果,請確保您瀏覽器上的 Cookie 已啓用。</p>
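The returned page says, in effect: "Sorry, we just want to confirm that the current visitor is not an automated program. For best results, please make sure cookies are enabled in your browser." So the server recognised that we were accessing it from a program and refused the request.
The Response object also carries the full details of the request we sent (in r.request), so let's look at the headers of that request.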
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
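Our program faithfully told the server that the request came from Python's requests library, which is why it was refused. Next, we build a new set of headers that imitates a browser and send the request again.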
>>> kv = {'User-Agent': 'Mozilla/5.0'}
>>> r = requests.get('https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1',headers=kv)
>>> r.status_code
200
>>> r.request.headers
{'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.text[:1000]
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n \n\n\n\n\n\n\n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n <!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">\n <head>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>\n<script type="text/javascript">\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\nvar ue_hob=+new Date();\nvar ue_id=\'FXYNFB591SCNSB56G63B\',\nue_csm = window,\nue_err_chan = \'jserr-rw\',\nue = {};\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);\n\nue.stub(ue,"log");ue.stub(ue,"onunload");ue.stub(ue,'
The complete code for fetching the Amazon product information is given below:
import requests

url = 'https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1'
try:
    kv = {'User-Agent': 'Mozilla/5.0'}  # imitate a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding
    print(r.text[2000:3000])
except requests.RequestException:
    print('Fetch failed')
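To see what a custom header changes without sending anything over the network, the same idea can be sketched with the standard library. This is only an illustration under assumptions: it uses urllib.request instead of requests, the product URL is shortened to its /dp/ part, and the helper name build_request is made up for this example.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request whose User-Agent imitates a browser."""
    # Hypothetical helper for illustration; no network traffic happens here.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Shortened form of the product URL used above (an assumption for brevity).
req = build_request("https://www.amazon.cn/dp/B076SRZY65")

# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Inspecting the request object before sending it is a cheap way to confirm that the server will see the browser-like value rather than the library's default.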
360 Search keyword submission:
360's keyword interface: http://www.so.com/s?q=keyword
So all we need to do is construct the URL.
>>> kv = {'q':'python'}
>>> r = requests.get('http://www.so.com/s',params=kv)
>>> r.status_code
200
>>> len(r.text)
346592
>>> r.url
'https://www.so.com/s?q=python'
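The query string can also be built by hand, which makes it easier to see what ends up in the URL. The sketch below uses the standard library's urlencode; requests performs a comparable encoding step when given params, though its internals are not shown here. The helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(keyword, base="http://www.so.com/s"):
    """Return the search URL for a keyword, percent-encoding it as needed."""
    # Hypothetical helper; `base` is the 360 interface quoted in the text.
    return base + "?" + urlencode({"q": keyword})


print(build_search_url("python"))
# Non-ASCII keywords are percent-encoded as UTF-8 bytes.
print(build_search_url("红楼梦"))
```

The second call produces the same %E7%BA%A2%E6%A5%BC%E6%A2%A6 sequence that appears in the keywords parameter of the Amazon URL earlier in the text.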
Fetching and saving an image from the web:
Image URL: http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg
>>> path = "/Users/liuneng/Pictures/abc.jpg"
>>> url = "http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg"
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path,'wb') as f:
...     f.write(r.content)
...
389618
That is all it takes. The complete code is given below:
import os
import requests

url = "http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg"
root = "/Users/liuneng/Pictures/"
path = root + url.split('/')[-1]  # use everything after the last forward slash as the file name
try:
    if not os.path.exists(root):  # create the root directory if it does not exist
        os.mkdir(root)
    if not os.path.exists(path):  # download only if the file is not already there
        r = requests.get(url)
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print('File saved successfully')
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print('Fetch failed')
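Splitting on the last slash works for the image URL above, but it would keep any query string as part of the file name. A slightly more defensive variant might look like the sketch below; the helper name filename_from_url and the fallback name download.bin are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    # urlsplit separates the path from "?query" and "#fragment" parts.
    path = urlsplit(url).path
    name = path.rsplit("/", 1)[-1]
    # Fall back to a placeholder name when the URL ends with a slash.
    return name or default


url = "http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg"
print(filename_from_url(url))
```

For the URL used in the text this returns the same name as url.split('/')[-1], so it could replace that expression without changing the saved path.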
IP address location lookup
The link is: http://www.ip138.com/iplookup...
The IP address to look up is carried in the URL in plaintext.
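Since the address is simply part of the URL, a lookup URL can be assembled by string concatenation. The sketch below is illustrative only: the full ip138 address is truncated in the text and is left that way, so the base used here is a placeholder on example.com, and the helper name build_lookup_url is hypothetical. It validates the address with the standard ipaddress module before appending it.

```python
import ipaddress


def build_lookup_url(base, ip):
    """Append a validated IP address to a lookup URL prefix."""
    # ip_address() raises ValueError for anything that is not a valid IP.
    validated = ipaddress.ip_address(ip)
    return base + str(validated)


# Placeholder prefix, NOT the real ip138 interface (which is elided above).
PLACEHOLDER_BASE = "http://example.com/lookup?ip="

print(build_lookup_url(PLACEHOLDER_BASE, "192.0.2.1"))
```

Validating first means a malformed address fails locally instead of producing a request the server would reject.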