Python in Practice: How to Hide Your Crawler's Identity

<div class="htmledit_views">

<p>When a crawler visits a website, it should hide its identity as far as possible so that the server does not block it. In practice there are two ways to achieve this: delayed access and dynamic proxies. Both are explained below.</p>
<p><span style="font-size:14px;"><strong>1. Delayed access</strong></span></p>
<p>As the name suggests, delayed access means setting an access interval and visiting the site only once every few seconds, which makes the traffic look more like a person browsing.</p>
<pre><code class="language-python hljs">import time
import urllib.parse
import urllib.request

cnt = 0
# First strategy for hiding the crawler's identity: set an access interval
# so that the program looks more like a human visitor
while True:  # visit Baidu once every 5 seconds
    url = "https://www.baidu.com"  # target URL
    param = {}  # request parameters, as a dict
    # Encode the parameters as UTF-8 bytes; pass None when there are none,
    # otherwise urllib sends a POST with an empty body instead of a GET
    data = urllib.parse.urlencode(param).encode("utf-8") if param else None

    req = urllib.request.Request(url, data)
    # Set the User-Agent header so the request appears to come from Firefox,
    # i.e. from a person rather than from a program
    req.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0")
    response = urllib.request.urlopen(req)  # send the request

    html = response.read()  # read the response body
    result = html.decode("utf-8")  # decode it as UTF-8
    if result != "":
        cnt += 1
        print("Request #%s to Baidu" % cnt)
    time.sleep(5)  # sleep for 5 seconds
</code></pre>
<p>Result:</p>
<p>Baidu is visited once every 5 seconds.</p>
<p><img src="https://img-blog.csdn.net/20170615225313927?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcXpjNzA5MTk3MDA=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt=""></p>
<p><strong><span style="font-size:14px;">2. Dynamic proxies</span></strong></p>
<p>Visiting the site through proxy servers is a very powerful approach: the requests appear to come from different servers. It is also the most recommended of the two methods. Free proxy server IPs can be found by searching on Baidu.</p>
<pre><code class="language-python hljs">import random
import urllib.request

ipList = ['119.6.144.73:81', '183.203.208.166:8118', '111.1.32.28:81']  # several proxy addresses; free ones can be found online
cnt = 0
# Second strategy for hiding the crawler's identity: use proxies,
# so that the requests appear to come from several servers
while True:  # keep requesting Baidu through a proxy server
    proxy = random.choice(ipList)  # pick a random proxy
    # Map both schemes to the proxy: an 'http' entry alone is ignored for an https:// URL
    proxy_support = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})

    opener = urllib.request.build_opener(proxy_support)
    # The attribute is addheaders; assigning to any other name has no effect
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0")]
    urllib.request.install_opener(opener)

    response = urllib.request.urlopen("https://www.baidu.com")  # send the request

    html = response.read()  # read the response body
    result = html.decode("utf-8")  # decode it as UTF-8
    if result != "":
        cnt += 1
        print("Request #%s to Baidu" % cnt)
</code></pre>
<p>Result:</p>
<p>Baidu is requested continuously, with no pause between requests.</p>
<p><img src="https://img-blog.csdn.net/20170615225529086?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcXpjNzA5MTk3MDA=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt=""></p>
</div>
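A fixed interval is itself a very regular pattern, so one possible refinement of delayed access is to randomize the pause between requests. The following is only a minimal sketch of that idea, not part of the original article: the names `jittered_delay` and `crawl_with_delay`, the `base` and `jitter` values, and the `fetch` callback are all hypothetical and chosen for illustration.

```python
import random
import time


def jittered_delay(base=5.0, jitter=2.0):
    """Return a delay in seconds drawn uniformly from [base - jitter, base + jitter]."""
    low = max(0.0, base - jitter)  # never return a negative delay
    return random.uniform(low, base + jitter)


def crawl_with_delay(urls, fetch, base=5.0, jitter=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomized interval between requests.

    `fetch` is a caller-supplied function (hypothetical here); `sleep` is
    injectable so the pacing logic can be exercised without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(jittered_delay(base, jitter))
        results.append(fetch(url))
    return results
```

With the defaults, each pause falls somewhere between 3 and 7 seconds, so the average rate stays close to the 5-second interval used in the article while the exact timing varies from request to request.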

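Free proxies found online are often slow or already offline, and a loop that picks one at random will simply crash when the chosen proxy does not answer. The sketch below shows one way the dynamic proxy approach could tolerate that, assuming a timeout of 10 seconds is acceptable: it tries the proxies in random order and returns the first response it gets. The helper names `build_proxy_opener` and `fetch_via_proxies` and the `make_opener` parameter are hypothetical, introduced only for this example.

```python
import http.client
import random
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) "
              "Gecko/20100101 Firefox/53.0")


def build_proxy_opener(proxy):
    """Build an opener that routes both http and https traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def fetch_via_proxies(url, proxies, timeout=10, make_opener=build_proxy_opener):
    """Try the proxies in random order; return (proxy, body) for the first that works."""
    candidates = list(proxies)
    random.shuffle(candidates)
    for proxy in candidates:
        try:
            with make_opener(proxy).open(url, timeout=timeout) as response:
                return proxy, response.read()
        except (OSError, http.client.HTTPException):
            # URLError and socket timeouts are OSError subclasses;
            # HTTPException covers malformed replies from a broken proxy.
            continue
    raise RuntimeError("no working proxy for " + url)
```

A call such as `fetch_via_proxies("https://www.baidu.com", ipList)` would then replace the body of the loop. Using `opener.open` directly, rather than `install_opener`, keeps each proxy choice local to one request instead of changing the process-wide default.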