Find a phone on JD.com and copy its URL.
2.1) Scraping code
    import requests

    url = "https://item.jd.com/100003534811.html"
    r = requests.get(url)
    print(r.status_code)   # 200 means the request succeeded
    print(r.text[:1000])   # print only the part we need
2.2) Response
<!DOCTYPE HTML> <html lang="zh-CN"> <head> <!-- shouji --> <meta http-equiv="Content-Type" content="text/html; charset=gbk" /> <title>【小米Redmi K20 Pro】小米 Redmi K20Pro 4800萬超廣角三攝 8GB+128GB 冰川藍 驍龍855 全網通4G 雙卡雙待 全面屏拍照智能遊戲手機【行情 報價 價格 評測】-京東</title> <meta name="keywords" content="MIRedmi K20 Pro,小米Redmi K20 Pro,小米Redmi K20 Pro報價,MIRedmi K20 Pro報價"/> <meta name="description" content="【小米Redmi K20 Pro】京東JD.COM提供小米Redmi K20 Pro正品行貨,幷包括MIRedmi K20 Pro網購指南,以及小米Redmi K20 Pro圖片、Redmi K20 Pro參數、Redmi K20 Pro評論、Redmi K20 Pro心得、Redmi K20 Pro技巧等信息,網購小米Redmi K20 Pro上京東,放心又輕鬆" /> <meta name="format-detection" content="telephone=no"> <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/100003534811.html"> <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/100003534811.html"> <meta http-equiv="X-UA-Compatible" content="IE=Edge"> <link rel="canonical" href="//item.jd.com/100003534811.html"/> <link
    import requests

    url = "https://item.jd.com/100003534811.html"
    try:
        r = requests.get(url)
        # No exception is raised when the status code is 200
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException:
        print("Scraping failed")
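The JD.com page above declares charset=gbk in a meta tag, which is why the complete version sets r.encoding from r.apparent_encoding before printing. As a rough offline illustration (the sample string is an assumption, and this uses plain bytes/str methods rather than Requests itself), decoding GBK bytes as ISO-8859-1, the charset Requests assumes for text responses whose Content-Type header names none, produces unreadable text instead of an error:

```python
# Bytes as a GBK-encoded page would send them (the sample text is an assumption).
raw = "京東".encode("gbk")

# Decoding with the wrong charset silently yields mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the page declares recovers the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

The failure is silent, so setting the encoding explicitly is cheap insurance when a page's text looks garbled.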
Find a book on Amazon and copy its URL.
2.1) Scraping code
    import requests

    url = "https://www.amazon.cn/dp/B01H36S9MO/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%99%BD%E8%AF%B4&qid=1565584830&s=gateway&sr=8-1"
    r = requests.get(url)
    print(r.status_code)
2.2) Examining the status code
The status code is 503 rather than 200, so the request failed.
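For reference, r.raise_for_status() raises an exception for 4xx and 5xx status codes and does nothing otherwise. A small sketch of that classification using only the standard library; describe_status is a hypothetical helper name, not part of Requests:

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard phrase, and whether it counts as an error."""
    phrase = HTTPStatus(code).phrase
    # 4xx and 5xx are the ranges for which raise_for_status() raises.
    outcome = "error" if 400 <= code < 600 else "no error"
    return f"{code} {phrase}: {outcome}"


print(describe_status(200))
print(describe_status(503))
```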
2.3) Printing the response text
<!DOCTYPE html> <!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]--> <!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]--> <!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]--> <!--[if gt IE 8]><!--> <html class="a-no-js" lang="zh-CN"><!--<![endif]--><head> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> <title dir="ltr">Amazon CAPTCHA</title> <meta name="viewport" content="width=device-width"> <link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css"> <script> if (true === true) { var ue_t0 = (+ new Date()), ue_csm = window, ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } }, ue_furl = "fls-cn.amazon.cn", ue_mid = "AAHKV2X7AFYLW", ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1], ue_sn = "opfcaptcha.amazon.cn", ue_id = '7M7370PKHPW590MJV57S'; } </script> </head> <body> <!-- To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases. 
--> <!-- Correios.DoNotSend --> <div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important"> <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto"> <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div> <div class="a-box a-alert a-alert-info a-spacing-base"> <div class="a-box-inner"> <i class="a-icon a-icon-alert"></i> <h4>請輸入您在下方看到的字符</h4> <p class="a-last">抱歉,咱們只是想確認一下當前訪問者並不是自動程序。爲了達到最佳效果,請確保您瀏覽器上的 Cookie 已啓用。</p> </div> </div> <div class="a-section"> <div class="a-box a-color-offset-background"> <div class="a-box-inner a-padding-extra-large"> <form method="get" action="/errors/validateCaptcha" name=""> <input type=hidden name="amzn" value="3vXJDVQq+SKJ44y9xdfMeA==" /><input type=hidden name="amzn-r" value="/dp/B01H36S9MO/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%99%BD%E8%AF%B4&qid=1565584830&s=gateway&sr=8-1" /> <div class="a-row a-spacing-large"> <div class="a-box"> <div class="a-box-inner"> <h4>請輸入您在這個圖片中看到的字符:</h4> <div class="a-row a-text-center"> <img src="https://images-na.ssl-images-amazon.com/captcha/xzqdsmvh/Captcha_ngaflmibnn.jpg"> </div> <div class="a-row a-spacing-base"> <div class="a-row"> <div class="a-column a-span6"> <label for="captchacharacters">輸入字符</label> </div> <div class="a-column a-span6 a-span-last a-text-right"> <a onclick="window.location.reload()">換一張圖</a> </div> </div> <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text"> </div> </div> </div> </div> <div class="a-section a-spacing-extra-large"> <div class="a-row"> <span class="a-button a-button-primary a-span12"> <span class="a-button-inner"> <button type="submit" class="a-button-text">繼續購物</button> </span> </span> </div> </div> </form> </div> </div> </div> </div> <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div> 
<div class="a-text-center a-spacing-small a-size-mini"> <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使用條件</a> <span class="a-letter-space"></span> <span class="a-letter-space"></span> <span class="a-letter-space"></span> <span class="a-letter-space"></span> <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">隱私聲明</a> </div> <div class="a-text-center a-size-mini a-color-secondary"> © 1996-2015, Amazon.com, Inc. or its affiliates <script> if (true === true) { document.write('<img src="https://fls-cn.amaz'+'on.cn/'+'1/oc-csi/1/OP/requestId=7M7370PKHPW590MJV57S&js=1" />'); }; </script> <noscript> <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=7M7370PKHPW590MJV57S&js=0" /> </noscript> </div> </div> <script> if (true === true) { var elem = document.createElement("script"); elem.src = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js"; document.getElementsByTagName('head')[0].appendChild(elem); } </script> </body></html>
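The printed text is an Amazon CAPTCHA page, and a comment in it points automated clients to the Marketplace APIs, so the failure comes from how Amazon handles automated access. In fact, since we were still able to get a page back from the server, this is not a network error.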
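One way to notice this kind of response in code is to look at the page title. The sketch below is only an illustration: page_title and looks_like_captcha are hypothetical helpers, the sample string is cut down from the output above, and a regular expression is a rough stand-in for a real HTML parser.

```python
import re


def page_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def looks_like_captcha(html):
    """Heuristic: treat a page whose title mentions 'captcha' as a block page."""
    title = page_title(html)
    return title is not None and "captcha" in title.lower()


sample = '<head> <meta charset="utf-8"> <title dir="ltr">Amazon CAPTCHA</title> </head>'
print(page_title(sample))
print(looks_like_captcha(sample))
```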
2.4) Printing the request headers
    # Print the headers sent to the Amazon site
    print(r.request.headers)

    # Header contents
    {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
The headers show that our crawler faithfully told the server the request came from the python-requests library. If the Amazon server checks where requests come from, requests like this one are rejected.
2.5) Modifying the request headers
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    print(r.status_code)
    print(r.request.headers)
Output:
    200
    {'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
2.6) Difference from the JD.com case
Modify the headers field so the request looks like it comes from a browser when asking the Amazon server for the page.
    import requests

    url = "https://www.amazon.cn/dp/B01H36S9MO/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%99%BD%E8%AF%B4&qid=1565584830&s=gateway&sr=8-1"
    try:
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except requests.RequestException:
        print("Scraping failed")
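The same idea, overriding the User-Agent before anything is sent, can be checked offline. The sketch below uses the standard library's urllib.request.Request rather than Requests, and the URL is a placeholder (an assumption); building the object does not touch the network.

```python
from urllib.request import Request

# Placeholder URL (an assumption); constructing a Request sends nothing.
req = Request("http://www.example.com/", headers={"User-Agent": "Mozilla/5.0"})

# urllib stores header names capitalized, hence "User-agent".
user_agent = req.get_header("User-agent")
print(user_agent)
```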
Baidu's keyword interface: http://www.baidu.com/s?wd=keyword
360's keyword interface: https://www.so.com/s?q=keyword
2.1) Submitting a keyword to Baidu
    import requests

    url = "http://www.baidu.com/s"
    kv = {'wd': 'Python'}
    r = requests.get(url, params=kv)
    print(r.status_code)    # print the status code
    print(r.request.url)    # print the requested URL
    print(r.text)           # print the response text
    print(len(r.text))      # print the length of the text
2.2) Output of the Baidu keyword submission
    # Status code
    200
    # Requested URL
    http://www.baidu.com/s?wd=Python
    # Response text
    <!DOCTYPE html><html><body style="display:none"><script>function ii(a,t){var r=Math.floor(Math.random()*100);t=t||"baidu";for(var i in a){if(a.hasOwnProperty(i)){if(a[i]>r){t=i;break}else{r-=a[i]}}};return t;}var D=document,N=navigator||{},U=N.userAgent,L=location||{},H=L.hash||'#',S=L.search||'?',M=0,W='',I='',E='',P='',R='',X=RegExp;!function(){function d(a,n){var e="; expires=Mon,01-Jan-1973 00:00:01 GMT",c=a.length,b=a[c-1];var f=e+"; path=/";if(n){D.cookie=n+"="+e;D.cookie=n+"="+f;for(var i=c-2;i>=0;i--){b=a[i]+"."+b;D.cookie=n+"=; domain="+b+e;D.cookie=n+"=; domain="+b+f;D.cookie=n+"=; domain=."+b+e;D.cookie=n+"=; domain=."+b+f}}}(function(){var a=D.cookie.split("; ");for(var i=0;i<a.length;i++){d(location.hostname.split("."),a[i].split("=")[0])}})()}();if(/AppleWebKit.*Mobile/i.test(U)||(/MIDP|SymbianOS|NOKIA|SAMSUNG|LG|NEC|TCL|Alcatel|BIRD|DBTEL|Dopod|PHILIPS|HAIER|LENOVO|MOT-|Nokia|SonyEricsson|SIE-|Amoi|ZTE|Android/.test(U)))M=1;if(/iPad/i.test(U))M=0;var a=S.split("?"),b=a[1].split("&"),c,i,p,q;for(i=0;i<b.length;i++){c=b[i].split("=");p=c[0];q=c[1];if(/^(w|wd|word)=/.test(b[i]))W=q;if(b[i]=='ms=1'||b[i]=='mobile_se=1')M=1;if(p=="ie")E=q;if(p=="pn")P=q;if(p=="rn")R=q;}if(/[^a-zA-Z0-9]wd=([^&]+)/.test(H))W=X.$1;var t,u="https://www.baidu.com/";if(M){u="https://m.baidu.com/";t=ii({"1015467z":100});if(W)u=u+"from="+t+"/s?word="+W;else u=u+"?from="+t;}else{u=u+'s?wd='+W+'&tn='+ii({"90757376_hao_pg":100});}if(E)u=u+'&ie='+E;if(P)u=u+'&pn='+P;if(R)u=u+'&rn='+R;j=!1;if(/firefox/i.test(U)){j=!0;D.write('<meta http-equiv="Refresh" target="_top" Content="0; Url='+u+'" >')}if(/msie 9|msie 10|rv:11/i.test(U)){j=!0;try{top.navigate(u)}catch(_){j=!1}}if(/applewebkit/i.test(U)){j=!0;var h=D.createElement('a');h.rel='noreferrer';h.href=u;h.target='_top';D.body.appendChild(h);var e=D.createEvent('MouseEvents');e.initEvent('click',true,true);h.dispatchEvent(e);}</script><iframe 
    src='javascript:"<html><head><script>function init(){D=document;A=D.getElementById(\"aa\");u=parent.u||\"https://www.baidu.com/?tn=dsp\";A.href=u;j=parent.j||!1;if(!j)try{A.click();}catch(_){setTimeout(function(){top.location.replace(u)},100)}}</script></head><body onload=\"init()\"><a id=\"aa\" rel=\"noreferer\" target=\"_top\"></a></body></html>"'></iframe></body></html>
    # Length of the response text
    2279
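For how to use the params argument, see the section on the main methods of the Requests library in "Web Crawlers: Getting Started with the Requests Library".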
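To see what params does to the URL without sending a request, the query string can be built by hand. This is a minimal sketch using the standard library's urllib.parse.urlencode; build_search_url is a hypothetical helper, and it only approximates what Requests does for a base URL that has no query string yet.

```python
from urllib.parse import urlencode


def build_search_url(base, params):
    """Append params to base as a percent-encoded query string."""
    return base + "?" + urlencode(params)


baidu_url = build_search_url("http://www.baidu.com/s", {"wd": "Python"})
so_url = build_search_url("https://www.so.com/s", {"q": "Python"})
print(baidu_url)
print(so_url)
```

The first result matches the requested URL printed in the Baidu output above.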
2.3) Complete code for the Baidu keyword submission
    import requests

    url = "http://www.baidu.com/s"
    keyword = "Python"
    try:
        kv = {'wd': keyword}
        r = requests.get(url, params=kv)
        print(r.request.url)
        r.raise_for_status()
        print(len(r.text))
    except requests.RequestException:
        print("Scraping failed")
3.1) Submitting a keyword to 360
    import requests

    url = "http://www.so.com/s"
    kv = {'q': 'Python'}
    r = requests.get(url, params=kv)
    print(r.status_code)    # print the status code
    print(r.request.url)    # print the requested URL
    print(r.text)           # print the response text
    print(len(r.text))      # print the length of the text
3.2) Complete code for the 360 keyword submission
    import requests

    url = "http://www.so.com/s"
    keyword = "Python"
    try:
        kv = {'q': keyword}
        r = requests.get(url, params=kv)
        print(r.request.url)
        r.raise_for_status()
        print(len(r.text))
    except requests.RequestException:
        print("Scraping failed")
Format of a web image link: http://www.example.com/picture.jpg
Image URL:
http://image.ngchina.com.cn/userpic/103315/2019/08121402011033159713.jpeg
    import requests

    path = "D://abc.jpg"   # where the image is saved
    url = "http://image.ngchina.com.cn/userpic/103315/2019/08121402011033159713.jpeg"
    r = requests.get(url)
    print(r.status_code)   # print the status code
    with open(path, 'wb') as f:
        f.write(r.content)   # content holds the response body as bytes

    import os

    import requests

    url = "http://image.ngchina.com.cn/userpic/103315/2019/08121402011033159713.jpeg"
    root = "F://picture//"
    path = root + url.split('/')[-1]   # the part after the last /
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            r.raise_for_status()   # do not save an error page as an image
            with open(path, 'wb') as f:
                f.write(r.content)
            print("File saved")
        else:
            print("File already exists")
    except (requests.RequestException, OSError):
        print("Scraping failed")
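Taking url.split('/')[-1] works for the image above, but it would keep a query string or return an empty name for a URL that ends in a slash. A slightly more defensive sketch, using only the standard library; filename_from_url and its default value are hypothetical choices, not something the original code uses:

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring any query string."""
    path = urlsplit(url).path
    name = posixpath.basename(path)
    # Fall back to a default when the path ends in "/" or is empty.
    return name or default


print(filename_from_url("http://image.ngchina.com.cn/userpic/103315/2019/08121402011033159713.jpeg"))
print(filename_from_url("http://www.example.com/picture.jpg?size=large"))
print(filename_from_url("http://www.example.com/"))
```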
Use the lookup feature of the ip138 site.
    import requests

    url = "http://m.ip138.com/ip.asp?ip="
    ip_address = '202.204.80.112'
    try:
        r = requests.get(url + ip_address)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[-500:])
    except requests.RequestException:
        print("Scraping failed")
Scraped output:
value="查詢" class="form-btn" /> </form> </div> <div class="query-hd">ip138.com IP查詢(搜索IP地址的地理位置)</div> <h1 class="query">您查詢的IP:202.204.80.112</h1><p class="result">本站主數據:北京市海淀區 北京理工大學 教育網</p><p class="result">參考數據一:北京市 北京理工大學</p> </div> </div> <div class="footer"> <a href="http://www.miitbeian.gov.cn/" rel="nofollow" target="_blank">滬ICP備10013467號-1</a> </div> </div> <script type="text/javascript" src="/script/common.js"></script></body> </html>
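The lookup results sit in the p elements with class "result". As a rough sketch of pulling them out, the snippet below runs a regular expression over a fragment copied from the output above; extract_results is a hypothetical helper, and a regex like this is brittle compared with a proper HTML parser.

```python
import re

# Fragment copied from the scraped output above.
sample = (
    '<h1 class="query">您查詢的IP:202.204.80.112</h1>'
    '<p class="result">本站主數據:北京市海淀區 北京理工大學 教育網</p>'
    '<p class="result">參考數據一:北京市 北京理工大學</p>'
)


def extract_results(html):
    """Return the text of every <p class="result"> element."""
    return re.findall(r'<p class="result">(.*?)</p>', html)


results = extract_results(sample)
for line in results:
    print(line)
```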
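Look at web content from a crawler's point of view. (Web content here means whatever a URL points to.)
1 Which of the following is not a method provided by the Python Requests library?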
.head()
.get()
.post()
.push()
In the Requests library, which of the following is the status attribute used to check whether a Response object was returned successfully?
.status
.raise_for_status
.headers
.status_code
In the Requests library, which attribute gives the encoding recommended by the HTTP headers returned from the server?
.apparent_encoding
.encoding
.headers
.text
In the Requests library, which attribute gives the encoding guessed from the content of the HTTP response returned from the server?
.text
.headers
requests.URLRequired
requests.Timeout
Which of the following is not a valid HTTP URL?
https://210.14.148.99/
https://dwz.cn/hMvN8
In the get() method of the Requests library, which parameter customizes the HTTP request headers sent to the server?
cookies
json
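In the get() method of the Requests library, the timeout parameter sets the request timeout. What is the unit of this parameter?
Microseconds
Minutes
Which of the following is not a negative problem brought about by web crawlers?
Commercial benefit
Legal risk
Performance harassment
Which of the following statements is incorrect?
The Robots protocol is an international standard on the Internet and must be strictly obeyed.
The Robots protocol is a convention.
If a site has no robots.txt file in its root directory, which of the following statements is incorrect?
A web crawler may scrape the site's content without restraint.
A web crawler should scrape content in a way that does not put a performance burden on the server.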
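12 Baidu's keyword query interface is shown below, where keyword stands for the search term:
https://www.baidu.com/s?wd=keyword
Which method of the Requests library should be used to submit the search keyword?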
.post()
.patch()
To fetch a binary resource such as an image or video at some URL, which attribute of the Response class should be used?
.content
.status_code
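The get() method is the most commonly used method in the Requests library. Which of the following statements is correct?
GET is the most widely used method in the HTTP protocol, so get() is the most commonly used.
get() is the basis of the other methods, so it is the most commonly used.
Which of the following can a web crawler not do?
Scrape data and files from someone's personal computer.
Scrape publicly available user information from the web, then compile and sell it.
In the general web crawler code framework above, fill in the method name in the blank.
In the HTTP protocol, which method can partially update the resource at a URL?
What is the output of the code above?
A web crawler is named NoSpider. Write a Robots protocol text that forbids this crawler from crawling all .html files under the root directory but does not restrict other files. Fill in the blanks in robots.txt: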
20 Fill in the blank in the statement below so that it prints the URL submitted to the server.
>>>print(r.____________)