Python -- Crawler Basics (7): A First Look at the urllib Library and a Discussion of Chinese Encoding Issues

This Python series is based on the Python 3.4 environment throughout.

---------@_@? --------------------------------------------------------------------

  • Problem: how to fetch the source of a web page in a simple way
  • Solution: use the urllib library to fetch the page's source code

------------------------------------------------------------------------------------

  • Code example
#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read())
  • Output
b'\n<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8"/>\n    <title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>    \n    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>\n    <meta content="\xe6\x8a\x80\xe6\x9c\xaf\xe6\x90\x9c\xe7\xb4\xa2,IT\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe6\x90\x9c\xe7\xb4\xa2,\xe4\xbb\xa3\xe7\xa0\x81\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e" name="keywords" />\n    <meta content="\xe9\x9d\xa2\xe5\x90\x91\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe7\x9a\x84\xe4\xb8\x93\xe4\xb8\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x82\xe9\x81\x87\xe5\x88\xb0\xe6\x8a\x80\xe6\x9c\xaf\xe9\x97\xae\xe9\xa2\x98\xe6\x80\x8e\xe4\xb9\x88\xe5\x8a\x9e\xef\xbc\x8c\xe5\x88\xb0\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b..." name="description" />\n    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />\n    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>\n    <script src="/Scripts/Common.js" type="text/javascript"></script>\n    <script src="/Scripts/Home.js" type="text/javascript"></script>\n</head>\n<body>\n    <div class="top">\n        \n        <div class="top_tabs">\n            <a href="http://www.cnblogs.com">\xc2\xab \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe9\xa6\x96\xe9\xa1\xb5 </a>\n        </div>\n        <div id="span_userinfo" class="top_links">\n        </div>\n    </div>\n    <div style="clear: both">\n    </div>\n    <center>\n        <div id="main">\n            <div class="logo_index">\n                <a href="http://zzk.cnblogs.com">\n                    <img alt="\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8blogo" src="/images/logo.gif" /></a>\n            </div>\n            <div class="index_sozone">\n                <div class="index_tab">\n                    <a href="/n" onclick="return  
channelSwitch(&#39;n&#39;);">\xe6\x96\xb0\xe9\x97\xbb</a>\n<a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">\xe5\x8d\x9a\xe5\xae\xa2</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">\xe7\x9f\xa5\xe8\xaf\x86\xe5\xba\x93</a>\n                    <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">\xe5\x8d\x9a\xe9\x97\xae</a>\n                </div>\n                <div class="search_block">\n                    <div class="index_btn">\n                        <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;\xe6\x89\xbe\xe4\xb8\x80\xe4\xb8\x8b&nbsp;" />\n                        <span class="help_link"><a target="_blank" href="/help">\xe5\xb8\xae\xe5\x8a\xa9</a></span>\n                    </div>\n                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />\n                </div>\n            </div>\n        </div>\n        <div class="footer">\n            &copy;2004-2016 <a href="http://www.cnblogs.com">\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</a>\n        </div>\n    </center>\n</body>\n</html>\n'
  • For reference, the Python 2.7 implementation:
#python2.7
import urllib2
 
response = urllib2.urlopen("http://zzk.cnblogs.com/b")
print response.read()
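  • As shown, the Python 3.4 and Python 2.7 versions differ: Python 3.4 uses urllib.request and the print() function, while Python 2.7 uses urllib2 and the print statement.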
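
If one script has to run under both interpreters, the import difference can be hidden behind a try/except. The following is only a sketch; the helper name fetch_bytes is made up for illustration and does not appear in the original post.

```python
# Sketch: import urlopen from whichever module the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch_bytes(url, timeout=10):
    """Return the raw response body as bytes (hypothetical helper)."""
    response = urlopen(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


# Usage (requires network access), with the same URL as the examples above:
#   body = fetch_bytes("http://zzk.cnblogs.com/b")
```

The result is still undecoded bytes, exactly as in the two examples above.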

 

----------@_@? A problem appears! ----------------------------------------------------------------------

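  • Observation: in the output above, the Chinese text is not displayed properly. response.read() returns bytes, so the Chinese characters show up as escape sequences such as \xe6\x89\xbe.
  • Fix: handle the Chinese encoding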

--------------------------------------------------------------------------------------------------

 

  • Handling the Chinese text in the page source
  • Modify the code as follows:
#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read().decode('UTF-8'))
  • Run it; the output is:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>找找看 - 博客園</title>    
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <meta content="技術搜索,IT搜索,程序搜索,代碼搜索,程序員搜索引擎" name="keywords" />
    <meta content="面向程序員的專業搜索引擎。遇到技術問題怎麼辦,到博客園找找看..." name="description" />
    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Home.js" type="text/javascript"></script>
</head>
<body>
    <div class="top">
        
        <div class="top_tabs">
            <a href="http://www.cnblogs.com">« 博客園首頁 </a>
        </div>
        <div id="span_userinfo" class="top_links">
        </div>
    </div>
    <div style="clear: both">
    </div>
    <center>
        <div id="main">
            <div class="logo_index">
                <a href="http://zzk.cnblogs.com">
                    <img alt="找找看logo" src="/images/logo.gif" /></a>
            </div>
            <div class="index_sozone">
                <div class="index_tab">
                    <a href="/n" onclick="return  channelSwitch(&#39;n&#39;);">新聞</a>
<a class="tab_selected" href="/b" onclick="return  channelSwitch(&#39;b&#39;);">博客</a>                    <a href="/k" onclick="return  channelSwitch(&#39;k&#39;);">知識庫</a>
                    <a href="/q" onclick="return  channelSwitch(&#39;q&#39;);">博問</a>
                </div>
                <div class="search_block">
                    <div class="index_btn">
                        <input type="button" class="btn_so_index" onclick="Search();" value="&nbsp;找一下&nbsp;" />
                        <span class="help_link"><a target="_blank" href="/help">幫助</a></span>
                    </div>
                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />
                </div>
            </div>
        </div>
        <div class="footer">
            &copy;2004-2016 <a href="http://www.cnblogs.com">博客園</a>
        </div>
    </center>
</body>
</html>


Process finished with exit code 0
  • Result: once the bytes are decoded, the Chinese text in the page source is displayed correctly.
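
Hard-coding 'UTF-8' works here because the page declares <meta charset="utf-8"/>. As a more general sketch, the charset can be read from the Content-Type response header, with UTF-8 used only as a fallback. The helper names decode_body and fetch_text are assumptions made for this illustration.

```python
import urllib.request


def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode response bytes with the declared charset, else the fallback."""
    charset = declared_charset or fallback
    # errors="replace" substitutes a marker for bytes that do not fit the charset.
    return raw.decode(charset, errors="replace")


def fetch_text(url):
    """Fetch a page and decode it (hypothetical helper; needs network access)."""
    response = urllib.request.urlopen(url)
    try:
        # get_content_charset() returns None when the header names no charset.
        declared = response.headers.get_content_charset()
        return decode_body(response.read(), declared)
    finally:
        response.close()


# Offline check: these are the first bytes of the <title> in the raw output above.
title_bytes = b"\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b"
title = decode_body(title_bytes)  # the three-character site name in the decoded <title>
```

A page that sends no charset in its headers and is not UTF-8 would still decode incorrectly with this sketch; in that case the charset has to come from the page's own meta tag.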

 

 

-----------@_@! A new Chinese encoding question ----------------------------------------------------------

   Question: if Chinese characters appear in the URL, how should that be handled?

   Example: url = "http://zzk.cnblogs.com/s?w=python爬蟲&t=b"

  

-----------------------------------------------------------------------------------------------------

 

  • Next, let's solve the problem of Chinese characters in the URL.

(1) Test 1: keep the URL as it is and request it directly, with no processing

  • Code example:
#python3.4
import urllib.request

url="http://zzk.cnblogs.com/s?w=python爬蟲&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('UTF-8'))
  • Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 9, in <module>
    response = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 463, in open
    response = self._open(req, data)
  File "C:\Python34\lib\urllib\request.py", line 481, in _open
    '_open', req)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python34\lib\http\client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "C:\Python34\lib\http\client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python34\lib\http\client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)

Process finished with exit code 1

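  Sure enough, it fails: the Chinese characters in the URL cannot be encoded as ASCII when the request is sent.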
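
The traceback ends in request.encode('ascii'): http.client encodes the request line as ASCII before sending it. The sketch below reproduces that step offline. The request line is rebuilt by hand from the URL, so treat its exact form as an assumption.

```python
# Hand-built approximation of the HTTP request line for the URL in test 1.
request_line = "GET /s?w=python爬蟲&t=b HTTP/1.1"

try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; "exc" is cleared when the except block ends
    # exc.start is the index of the first offending character,
    # exc.end is one past the last one.
    print(exc.start, exc.end)  # → 15 17
```

Indexes 15 and 16 are the two Chinese characters, which matches "position 15-16" in the error message above.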

 

(2) Test 2: handle the Chinese part separately
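
Handling the Chinese part separately means percent-encoding it before it goes into the URL. As an alternative sketch to joining the pieces by hand, urllib.parse.urlencode can build the whole query string and percent-encodes non-ASCII values as UTF-8 by default. The helper name build_search_url is hypothetical; the parameter names w and t are taken from the example URL.

```python
import urllib.parse


def build_search_url(keyword, channel="b"):
    """Build the search URL with a percent-encoded query (hypothetical helper)."""
    # Assumption: "w" carries the search keyword and "t" selects the channel,
    # as the example URL in this post suggests.
    query = urllib.parse.urlencode([("w", keyword), ("t", channel)])
    return "http://zzk.cnblogs.com/s?" + query


url = build_search_url("python爬蟲")
url.encode("ascii")  # no longer raises: the URL is pure ASCII now

# Parsing the query string gives back the original keyword.
params = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
```

A list of pairs is passed rather than a dict so that the parameter order is fixed on Python 3.4 as well.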

  • Code example:
import urllib.request
import urllib.parse

url = "http://zzk.cnblogs.com/s?w=python"+ urllib.parse.quote("爬蟲")+"&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
  • Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <title>python爬蟲-博客園找找看</title>
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <link href="/Content/so.css?id=20140908" rel="stylesheet" type="text/css" />
    <link href="/Content/jquery-ui-1.8.21.custom.css" rel="stylesheet" type="text/css" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/jquery-ui-1.8.11.min.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Search.js" type="text/javascript"></script>
    <script src="/Scripts/jquery.ui.datepicker-zh-CN.js" type="text/javascript"></script>
</head>
<body>
    <div class="top_bar">
        <div class="top_tabs">
            <a href="http://www.cnblogs.com">« 博客園首頁 </a>
        </div>
        <div id="span_userinfo">
        </div>
    </div>
    <div id="header">
        
<div id="headerMain">
    <a id="logo" href="/"></a>
    <div id="searchBox">
        <div id="searchRangeList">
            <ul>
                <li><a href="/s?t=n" onclick="return  channelSwitch(&#39;n&#39;);">新聞</a></li>
                    <li><a class="tab_selected" href="/s?t=b" onclick="return  channelSwitch(&#39;b&#39;);">博客</a></li>
                
                <li><a href="/s?t=k" onclick="return  channelSwitch(&#39;k&#39;);">知識庫</a></li>
                <li><a href="/s?t=q" onclick="return  channelSwitch(&#39;q&#39;);">博問</a></li>
            </ul>
        </div>
        <!--end: searchRangeList -->
        <div class="seachInput">
            <input type="text" onchange="ShowtFilter(this, false);" onkeypress="return searchEnter(event);"
                   value="python爬蟲" name="w" id="w" maxlength="2048" title="博客園 找找看" class="txtSeach" />
            <input type="button" value="找一下" class="btnSearch" onclick="Search();" />&nbsp;&nbsp;&nbsp;
            <span class="help_link"><a target="_blank" href="/help">幫助</a></span>
            <br />
        </div>
        <!--end: seachInput -->
    </div>
    <!--end: searchBox -->
    
</div>

        
        <div style="clear: both">
        </div>
        <!--end: headerMain -->
        <div id="searchInfo">
            <span style="float: left; margin-left: 15px;"></span>博客園找找看,找到相關內容<b id="CountOfResults">1491</b>篇,用時132毫秒
        </div>
        <!--end: searchInfo -->
    </div>
    <!--end: header -->
    <div id="main">
        <div id="searchResult">
            <div style="clear: both">
            </div>
            <div class="forflow">
                
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/hearzeus/p/5238867.html"><strong>Python 爬蟲</strong>入門——小項目實戰(自動私信博客園某篇博客下的評論人,隨機發送一條笑話,完整代碼在博文最後)</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    <strong>python, 爬蟲</strong>,  以前寫的都是針對<strong>爬蟲</strong>過程當中遇到問題...55561   <strong>python</strong>代碼以下: def getCo...經過關鍵特徵告訴<strong>爬蟲</strong>,已經遍歷結束了。我用的特徵代碼以下: ...定時器     <strong>python</strong>定時器,代碼示例: impor
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/hearzeus/" target="_blank">不剃頭的一休哥</a>
    </span><span class="searchItemInfo-publishDate">2016-03-03</span>
        <span class="searchItemInfo-good">推薦(12)</span>
            <span class="searchItemInfo-comments">評論(55)</span>
            <span class="searchItemInfo-views">瀏覽(1582)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/hearzeus/p/5238867.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/hearzeus/p/5151449.html"><strong>Python 爬蟲</strong>入門(一)</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    <strong>python, 爬蟲</strong>,  畢設是作<strong>爬蟲</strong>相關的,原本想的是用j...太滿意。以前據說<strong>Python</strong>這方面比較強,就想用<strong>Python</strong>...至此,一個簡單的<strong>爬蟲</strong>就完成了。以後是針對反<strong>爬蟲</strong>的一些策略,比...a寫,也寫了幾個<strong>爬蟲</strong>,其中一個是爬網易雲音樂的用戶信息,爬了
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/hearzeus/" target="_blank">不剃頭的一休哥</a>
    </span><span class="searchItemInfo-publishDate">2016-01-22</span>
        <span class="searchItemInfo-good">推薦(1)</span>
            <span class="searchItemInfo-comments">評論(13)</span>
            <span class="searchItemInfo-views">瀏覽(1493)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/hearzeus/p/5151449.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/xueweihan/p/4592212.html">[<strong>Python</strong>]新手寫<strong>爬蟲</strong>全過程(已完成)</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    hool.cc/<strong>python</strong>/<strong>python</strong>-files-io...<strong>python, 爬蟲</strong>,今天早上起來,第一件事情就是理一理今天...任務,寫一個只用<strong>python</strong>字符串內建函數的<strong>爬蟲</strong>,定義爲v1...實主要的不是學習<strong>爬蟲</strong>,而是依照這個需求鍛鍊下本身的編程能力,
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/xueweihan/" target="_blank">削微寒</a>
    </span><span class="searchItemInfo-publishDate">2015-06-21</span>
        <span class="searchItemInfo-good">推薦(13)</span>
            <span class="searchItemInfo-comments">評論(11)</span>
            <span class="searchItemInfo-views">瀏覽(2405)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/xueweihan/p/4592212.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/hearzeus/p/5157016.html"><strong>Python 爬蟲</strong>入門(二)—— IP代理使用</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    的代理。   在<strong>爬蟲</strong>中,有些網站可能爲了防止<strong>爬蟲</strong>或者DDOS...<strong>python, 爬蟲</strong>,  上一節,大概講述了Python 爬...因此,咱們能夠用<strong>爬蟲</strong>爬那麼IP。用上一節的代碼,徹底能夠作到...(;;)這樣的。<strong>python</strong>中的for循環,in 表示X的取
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/hearzeus/" target="_blank">不剃頭的一休哥</a>
    </span><span class="searchItemInfo-publishDate">2016-01-25</span>
        <span class="searchItemInfo-good">推薦(3)</span>
            <span class="searchItemInfo-comments">評論(21)</span>
            <span class="searchItemInfo-views">瀏覽(1893)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/hearzeus/p/5157016.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/ruthon/p/4638262.html">《零基礎寫<strong>Python爬蟲</strong>》系列技術文章整理收藏</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    <strong>Python</strong>,《零基礎寫<strong>Python爬蟲</strong>》系列技術文章整理收... 1零基礎寫<strong>python爬蟲</strong>之<strong>爬蟲</strong>的定義及URL構成ht...ml 8零基礎寫<strong>python爬蟲</strong>之<strong>爬蟲</strong>編寫全記錄http:/...ml 9零基礎寫<strong>python爬蟲</strong>之<strong>爬蟲</strong>框架Scrapy安裝配
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/ruthon/" target="_blank">豆芽ruthon</a>
    </span><span class="searchItemInfo-publishDate">2015-07-11</span>
          </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/ruthon/p/4638262.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/wenjianmuran/p/5049966.html"><strong>Python爬蟲</strong>入門案例:獲取百詞斬已學單詞列表</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    記不住。咱們來用<strong>Python</strong>來爬取這些信息,同時學習<strong>Python爬蟲</strong>基礎。 首先...<strong>Python</strong>, 案例, 百詞斬是一款很不錯的單詞記憶APP,在學習過程當中,它會記錄你所學的每...n) 若是要在<strong>Python</strong>中解析json,咱們須要json庫。咱們打印下前兩頁
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/wenjianmuran/" target="_blank">文劍木然</a>
    </span><span class="searchItemInfo-publishDate">2015-12-16</span>
        <span class="searchItemInfo-good">推薦(12)</span>
            <span class="searchItemInfo-comments">評論(4)</span>
            <span class="searchItemInfo-views">瀏覽(1235)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/wenjianmuran/p/5049966.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/cs-player1/p/5169307.html"><strong>python爬蟲</strong>之初體驗</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    <strong>python, 爬蟲</strong>,上網簡單看了幾篇博客本身試了試簡單的<strong>爬蟲</strong>哎呦喂頗有感受蠻好玩的 以前寫博客 有點感受是在寫教程啊什麼的寫的很彆扭 各類複製粘貼寫得很不舒服 之後仍是怎麼舒服怎麼寫把天天的練習所得寫上來就行了原本就是個菜鳥不斷學習 不斷debug就好 直接
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/cs-player1/" target="_blank">cs-player1</a>
    </span><span class="searchItemInfo-publishDate">2016-01-29</span>
        <span class="searchItemInfo-good">推薦(1)</span>
            <span class="searchItemInfo-comments">評論(14)</span>
            <span class="searchItemInfo-views">瀏覽(798)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/cs-player1/p/5169307.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/hearzeus/p/5226546.html"><strong>Python 爬蟲</strong>入門(四)—— 驗證碼下篇(破解簡單的驗證碼)</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    <strong>python, 爬蟲</strong>,  年前寫了驗證碼上篇,原本很早前就想寫下篇來着,只是過年比較忙,還有就是驗證碼破解比較繁雜,方法不一樣,正確率也會有差...碼(這裏我用的是<strong>python</strong>的"PIL"圖像處理庫)    a.)轉爲灰度圖     PIL 在這方面也提供了極完備的支
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/hearzeus/" target="_blank">不剃頭的一休哥</a>
    </span><span class="searchItemInfo-publishDate">2016-02-29</span>
        <span class="searchItemInfo-good">推薦(7)</span>
            <span class="searchItemInfo-comments">評論(17)</span>
            <span class="searchItemInfo-views">瀏覽(888)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/hearzeus/p/5226546.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/xin-xin/p/4297852.html">《<strong>Python爬蟲</strong>學習系列教程》學習筆記</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    家的交流。 1、<strong>Python</strong>入門 1. <strong>Python爬蟲</strong>入門...一之綜述 2. <strong>Python爬蟲</strong>入門二之<strong>爬蟲</strong>基礎瞭解 3. ... <strong>Python爬蟲</strong>入門七之正則表達式 2、<strong>Python</strong>實戰 ...on進階 1. <strong>Python爬蟲</strong>進階一之<strong>爬蟲</strong>框架Scrapy
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/xin-xin/" target="_blank">心_心</a>
    </span><span class="searchItemInfo-publishDate">2015-02-23</span>
        <span class="searchItemInfo-good">推薦(3)</span>
            <span class="searchItemInfo-comments">評論(2)</span>
            <span class="searchItemInfo-views">瀏覽(34430)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/xin-xin/p/4297852.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/nishuihan/p/4754622.html">PHP, <strong>Python</strong>, Node.js 哪一個比較適合寫<strong>爬蟲</strong>?</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    子,作一個簡單的<strong>爬蟲</strong>容易,但要作一個完備的<strong>爬蟲</strong>挺難的。像我搭...path的類庫/<strong>爬蟲</strong>庫後,就會發現此種方式雖然入門門檻低,但...薦採用一些現成的<strong>爬蟲</strong>庫,諸如xpath、多線程支持仍是必須考...以考慮。3、若是<strong>爬蟲</strong>是涉及大規模網站爬取,效率、擴展性、可維
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/nishuihan/" target="_blank">技術宅小牛牛</a>
    </span><span class="searchItemInfo-publishDate">2015-08-24</span>
          </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/nishuihan/p/4754622.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/nishuihan/p/4815930.html">PHP, <strong>Python</strong>, Node.js 哪一個比較適合寫<strong>爬蟲</strong>?</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    子,作一個簡單的<strong>爬蟲</strong>容易,但要作一個完備的<strong>爬蟲</strong>挺難的。像我搭...主要看你定義的「<strong>爬蟲</strong>」幹什麼用。1、若是是定向爬取幾個頁面,...path的類庫/<strong>爬蟲</strong>庫後,就會發現此種方式雖然入門門檻低,但...薦採用一些現成的<strong>爬蟲</strong>庫,諸如xpath、多線程支持仍是必須考
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/nishuihan/" target="_blank">技術宅小牛牛</a>
    </span><span class="searchItemInfo-publishDate">2015-09-17</span>
          </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/nishuihan/p/4815930.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/rwxwsblog/p/4557123.html">安裝<strong>python爬蟲</strong>scrapy踩過的那些坑和編程外的思考</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    了一下開源的<strong>爬蟲</strong>資料,看了許多對於開源<strong>爬蟲</strong>的比較發現開源<strong>爬蟲</strong>...沒辦法,只能升級<strong>python</strong>的版本了。 1、升級<strong>python</strong>...s://www.<strong>python</strong>.org/ftp/<strong>python</strong>/...n 檢查<strong>python</strong>版本 <strong>python</strong> --ve
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/rwxwsblog/" target="_blank">秋楓</a>
    </span><span class="searchItemInfo-publishDate">2015-06-06</span>
        <span class="searchItemInfo-good">推薦(2)</span>
            <span class="searchItemInfo-comments">評論(1)</span>
            <span class="searchItemInfo-views">瀏覽(4607)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/rwxwsblog/p/4557123.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/maybe2030/p/4555382.html">[<strong>Python</strong>] 網絡<strong>爬蟲</strong>和正則表達式學習總結</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    有的網站爲了防止<strong>爬蟲</strong>,可能會拒絕<strong>爬蟲</strong>的請求,這就須要咱們來修...,正則表達式不是<strong>Python</strong>的語法,並不屬於<strong>Python</strong>,其...\d" 2.2 <strong>Python</strong>的re模塊   <strong>Python</strong>經過... 實例描述 <strong>python</strong> 匹配 "<strong>python</strong>". 
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/maybe2030/" target="_blank">poll的筆記</a>
    </span><span class="searchItemInfo-publishDate">2015-06-05</span>
        <span class="searchItemInfo-good">推薦(2)</span>
            <span class="searchItemInfo-comments">評論(5)</span>
            <span class="searchItemInfo-views">瀏覽(1089)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/maybe2030/p/4555382.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/mr-zys/p/5059451.html">一個簡單的多線程<strong>Python爬蟲</strong>(一)</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    一個簡單的多線程<strong>Python爬蟲</strong> 最近想要抓取[拉勾網](h...本身寫一個簡單的<strong>Python爬蟲</strong>的想法。 本文中的部分連接...0525185/<strong>python</strong>-threading-how-d...0525185/<strong>python</strong>-threading-how-do-i-lock-a-thread) ## 一個<strong>爬蟲</strong>
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/mr-zys/" target="_blank">mr_zys</a>
    </span><span class="searchItemInfo-publishDate">2015-12-19</span>
        <span class="searchItemInfo-good">推薦(3)</span>
            <span class="searchItemInfo-comments">評論(4)</span>
            <span class="searchItemInfo-views">瀏覽(696)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/mr-zys/p/5059451.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/jixin/p/5145813.html">自學<strong>Python</strong>十一 <strong>Python爬蟲</strong>總結</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    Demo   <strong>爬蟲</strong>就靠一段落吧,更深刻的<strong>爬蟲</strong>框架以及htm...學習與嘗試逐漸對<strong>python爬蟲</strong>有了一些小小的心得,咱們漸漸...嘗試着去總結一下<strong>爬蟲</strong>的共性,試着去寫個helper類以免重...。   參考:用<strong>python爬蟲</strong>抓站的一些技巧總結 zz  
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/jixin/" target="_blank">個人代碼會飛</a>
    </span><span class="searchItemInfo-publishDate">2016-01-20</span>
        <span class="searchItemInfo-good">推薦(3)</span>
            <span class="searchItemInfo-comments">評論(1)</span>
            <span class="searchItemInfo-views">瀏覽(696)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/jixin/p/5145813.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/hearzeus/p/5162691.html"><strong>Python 爬蟲</strong>入門(三)—— 尋找合適的爬取策略</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    <strong>python, 爬蟲</strong>,  寫<strong>爬蟲</strong>以前,首先要明確爬取的數據。...怎麼尋找一個好的<strong>爬蟲</strong>策略。(代碼僅供學習交流,切勿用做商業或...(這個也是咱們用<strong>爬蟲</strong>發請求的結果),如圖所示      很慶...).順便說一句,<strong>python</strong>有json解析模塊,能夠用。   下面附上蟬遊記的<strong>爬蟲</strong>
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/hearzeus/" target="_blank">不剃頭的一休哥</a>
    </span><span class="searchItemInfo-publishDate">2016-01-27</span>
        <span class="searchItemInfo-good">推薦(5)</span>
            <span class="searchItemInfo-comments">評論(3)</span>
            <span class="searchItemInfo-views">瀏覽(799)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/hearzeus/p/5162691.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/ybjourney/p/5304501.html"><strong>python</strong>簡單<strong>爬蟲</strong></a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
      <strong>爬蟲</strong>真是一件有意思的事兒啊,以前寫過<strong>爬蟲</strong>,用的是urll...Soup實現簡單<strong>爬蟲</strong>,scrapy也有實現過。最近想更好的學...習<strong>爬蟲</strong>,那麼就儘量的作記錄吧。這篇博客就我今天的一個學習過...的語法規則,我在<strong>爬蟲</strong>中經常使用的有: . 匹配任意字符(換
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/ybjourney/" target="_blank">oyabea</a>
    </span><span class="searchItemInfo-publishDate">2016-03-22</span>
        <span class="searchItemInfo-good">推薦(4)</span>
            <span class="searchItemInfo-comments">評論(1)</span>
            <span class="searchItemInfo-views">瀏覽(477)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/ybjourney/p/5304501.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/hippieZhou/p/4967075.html"><strong>Python</strong>帶你輕鬆進行網頁<strong>爬蟲</strong></a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    ,因此就打算自學<strong>Python</strong>。在尚未學它的時候就據說用它來進行網頁<strong>爬蟲</strong>...3.0此次的網絡<strong>爬蟲</strong>需求背景我打算延續DotNet開源大本營...例。2.實戰網頁<strong>爬蟲</strong>:2.1.獲取城市列表:首先,咱們須要獲...行速度,那麼可能<strong>Python</strong>仍是挺適合的,畢竟能夠經過它寫更
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/hippiezhou/" target="_blank">hippiezhou</a>
    </span><span class="searchItemInfo-publishDate">2015-11-22</span>
        <span class="searchItemInfo-good">推薦(2)</span>
            <span class="searchItemInfo-comments">評論(2)</span>
            <span class="searchItemInfo-views">瀏覽(1563)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/hippieZhou/p/4967075.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/mfryf/p/3695844.html">開發記錄_自學<strong>Python</strong>寫<strong>爬蟲</strong>程序爬取csdn我的博客信息</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    .3_開工 聽說<strong>Python</strong>並不難,看過了<strong>python</strong>的代碼...lecd這 個半<strong>爬蟲</strong>半網站的項目, 累積很多<strong>爬蟲</strong>抓站的經驗,... 某些網站反感<strong>爬蟲</strong>的到訪,因而對<strong>爬蟲</strong>一概拒絕請求 ...模仿了一個本身的<strong>Python爬蟲</strong>。 [<strong>python</strong>]
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/mfryf/" target="_blank">知識天地</a>
    </span><span class="searchItemInfo-publishDate">2014-04-28</span>
        <span class="searchItemInfo-good">推薦(1)</span>
            <span class="searchItemInfo-comments">評論(1)</span>
            <span class="searchItemInfo-views">瀏覽(4481)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/mfryf/p/3695844.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div class="searchItem">
  <h3 class="searchItemTitle">
    <a target="_blank" href="http://www.cnblogs.com/coltfoal/archive/2012/10/06/2713348.html"><strong>Python</strong>天氣預報採集器(網頁<strong>爬蟲</strong>)</a>
  </h3>
  <!--end: searchItemTitle -->
  <span class="searchCon">
    的。   補充上<strong>爬蟲</strong>結果的截圖:      <strong>python</strong>的使...編程, <strong>Python</strong>,  python是一門很強大的語言,在...以就算了。   <strong>爬蟲</strong>簡單說來包括兩個步驟:得到網頁文本、過濾...ml文本。   <strong>python</strong>在獲取html方面十分方便,寥寥
  </span>
  <!--end: searchCon -->
  <div class="searchItemInfo">
    <span class="searchItemInfo-userName">
      <a href="http://www.cnblogs.com/coltfoal/" target="_blank">coltfoal</a>
    </span><span class="searchItemInfo-publishDate">2012-10-06</span>
        <span class="searchItemInfo-good">推薦(5)</span>
            <span class="searchItemInfo-comments">評論(16)</span>
            <span class="searchItemInfo-views">瀏覽(5412)</span>
  </div>
  <div class="searchItemInfo">
    <span class="searchURL">www.cnblogs.com/coltfoal/archive/2012/10/06/2713348.html</span>
  </div>
  <!--end: searchURL -->
</div>
<div id="paging_block"><div class="pager"><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=1" class="p_1 current" onclick="Return true;;buildPaging(1);return false;">1</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=2" class="p_2" onclick="Return true;;buildPaging(2);return false;">2</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=3" class="p_3" onclick="Return true;;buildPaging(3);return false;">3</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=4" class="p_4" onclick="Return true;;buildPaging(4);return false;">4</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=5" class="p_5" onclick="Return true;;buildPaging(5);return false;">5</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=6" class="p_6" onclick="Return true;;buildPaging(6);return false;">6</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=7" class="p_7" onclick="Return true;;buildPaging(7);return false;">7</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=8" class="p_8" onclick="Return true;;buildPaging(8);return false;">8</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=9" class="p_9" onclick="Return true;;buildPaging(9);return false;">9</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=10" class="p_10" onclick="Return true;;buildPaging(10);return false;">10</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=11" class="p_11" onclick="Return true;;buildPaging(11);return false;">11</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=12" class="p_12" onclick="Return true;;buildPaging(12);return false;">12</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=13" class="p_13" onclick="Return true;;buildPaging(13);return false;">13</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=14" class="p_14" onclick="Return true;;buildPaging(14);return false;">14</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=15" class="p_15" onclick="Return true;;buildPaging(15);return false;">15</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=16" class="p_16" onclick="Return true;;buildPaging(16);return false;">16</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=17" 
class="p_17" onclick="Return true;;buildPaging(17);return false;">17</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=18" class="p_18" onclick="Return true;;buildPaging(18);return false;">18</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=19" class="p_19" onclick="Return true;;buildPaging(19);return false;">19</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=20" class="p_20" onclick="Return true;;buildPaging(20);return false;">20</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=21" class="p_21" onclick="Return true;;buildPaging(21);return false;">21</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=22" class="p_22" onclick="Return true;;buildPaging(22);return false;">22</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=23" class="p_23" onclick="Return true;;buildPaging(23);return false;">23</a>···<a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=75" class="p_75" onclick="Return true;;buildPaging(75);return false;">75</a><a href="/s?w=python%e7%88%ac%e8%99%ab&t=b&p=2" onclick="Return true;;buildPaging(2);return false;">Next &gt;</a></div></div><script type="text/javascript">var pagingBuider={"OnlyLinkText":false,"TotalCount":1491,"PageIndex":1,"PageSize":20,"ShowPageCount":11,"SkipCount":0,"UrlFormat":"/s?w=python%e7%88%ac%e8%99%ab&t=b&p={0}","OnlickJsFunc":"Return true;","FirstPageLink":"/s?w=python%e7%88%ac%e8%99%ab&t=b&p=1","AjaxUrl":"/","AjaxCallbak":null,"TopPagerId":"pager_top","IsRenderScript":true};function buildPaging(pageIndex){pagingBuider.PageIndex=pageIndex;$.ajax({url:pagingBuider.AjaxUrl,data:JSON.stringify(pagingBuider),type:'post',dataType:'text',contentType:'application/json; charset=utf-8',success:function (data) { $('#paging_block').html(data); var pagerTop=$('#pager_top');if(pageIndex>1){$(pagerTop).html(data).show();}else{$(pagerTop).hide();}}});}</script>


            </div>
        </div>
        <div class="forflow" id="sidebar">
            <div class="s_google"><a href="javascript:void(0);" title="Google站內搜索" onclick="return google_search()">Google</a> 找一下<br/>
            </div>
            
            <div style="clear: both;">
            </div>
            
            <div style="clear: both;">
            </div>
            <div class="sideRightWidget">
    <b>按瀏覽數篩選</b><br />

    <ol id="viewsRange">
        <li                                 class="ui-selected"
><a href="javascript:void(0);" onclick="Views(0);redirect();">所有</a></li>
        <li ><a href="javascript:void(0);" onclick="Views(200);redirect();">200以上</a></li>
        <li ><a href="javascript:void(0);" onclick="Views(500);redirect();">500以上</a></li>
        <li ><a href="javascript:void(0);" onclick="Views(1000);redirect();">1000以上</a></li>
    </ol>
</div>

            <div style="clear: both;">
            </div>
            
            <div class="sideRightWidget">
    <b>按時間篩選</b><br />
    <ol id="dateRange">
        <li                 class="ui-selected"
><a href="javascript:void(0);" onclick="clearDate();dateRange(null);redirect();">所有</a></li>
        <li ><a href="javascript:void(0);" onclick="dateRange('One-Week');redirect();">
                  一週內</a></li>
        <li ><a href="javascript:void(0);" onclick="dateRange('One-Month');redirect();">
                  一月內</a></li>
        <li ><a href="javascript:void(0);" onclick="dateRange('Three-Month');redirect();">
                  三月內</a></li>
        <li ><a href="javascript:void(0);" onclick="dateRange('One-Year');redirect();">
                  一年內</a></li>
    </ol>
    <p id="datepicker">
        自定義:  <input type="text" id="dateMin" 
        class="datepicker"/>-<input type="text" id="dateMax" class="datepicker"
        />
    </p>
</div>

            <div style="clear: both;">
            </div>
            <div class="sideRightWidget">
                » 去「<a title="博問是博客園提供的問答系統" href="http://q.cnblogs.com/">博問</a>」問一下?
                    <br />
                » 搜索「<a href="http://job.cnblogs.com/search/">招聘職位</a><br />
                » 我有<a href="http://space.cnblogs.com/forum/public">反饋或建議</a>
            </div>
            <div id="siderigt_ad">
                <script type='text/javascript'>
                var googletag = googletag || {};
                googletag.cmd = googletag.cmd || [];
                (function () {
                    var gads = document.createElement('script');
                    gads.async = true;
                    gads.type = 'text/javascript';
                    var useSSL = 'https:' == document.location.protocol;
                    gads.src = (useSSL ? 'https:' : 'http:') +
                    '//www.googletagservices.com/tag/js/gpt.js';
                    var node = document.getElementsByTagName('script')[0];
                    node.parentNode.insertBefore(gads, node);
                })();
                </script>
                <script type='text/javascript'>
                    googletag.cmd.push(function () {
                        googletag.defineSlot('/1090369/cnblogs_zzk_Z1', [300, 250], 'div-gpt-ad-1410172170550-0').addService(googletag.pubads());
                        googletag.pubads().enableSingleRequest();
                        googletag.enableServices();
                    });
                </script>
                <!-- cnblogs_zzk_Z1 -->
                <div id='div-gpt-ad-1410172170550-0' style='width:300px; height:250px;'>
                    <script type='text/javascript'>
                        googletag.cmd.push(function () { googletag.display('div-gpt-ad-1410172170550-0'); });
                    </script>
                </div>
            </div>
        </div>
    </div>
    <div style="clear: both;">
    </div>
    
<div id="footer">
    &copy; 2004-2016 <a title="開發者的網上家園" href="http://www.cnblogs.com">博客園</a>
</div>
<script type="text/javascript">

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-476124-10']);
    _gaq.push(['_trackPageview']);

    (function () {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
    })();

</script>

    <!--end: footer -->
</body>
</html>


Process finished with exit code 0
  • Result: with only the Chinese part of the URL percent-encoded, the page behind the URL is fetched normally.
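When a query string has several parameters, one possible alternative to quoting the keyword by hand is urllib.parse.urlencode, which percent-encodes each value. Below is a minimal sketch, assuming the same search URL as in the tests. The name build_search_url is made up for illustration. The keyword is written in simplified characters (爬虫) because that is the form whose UTF-8 bytes show up as %e7%88%ac%e8%99%ab in the paging links of the output above.

```python
import urllib.parse

def build_search_url(keyword, tab="b"):
    """Hypothetical helper: build the zzk search URL with an encoded query."""
    # A list of pairs keeps the parameter order fixed on every Python 3 version.
    query = urllib.parse.urlencode([("w", keyword), ("t", tab)])
    return "http://zzk.cnblogs.com/s?" + query

url = build_search_url("python爬虫")
print(url)
```

The resulting string can be passed to urllib.request.urlopen() the same way as in test 2.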

 

------@_@! Yet another new question-----------------------------------------------------------

  • Question: what if the whole URL, Chinese and ASCII parts together, is passed through quote()? Can the page still be fetched?

----------------------------------------------------------------------------------------

(3) Hence test 3: quote the whole URL, Chinese and ASCII parts together.

  • Code example:
#python3.4
import urllib.request
import urllib.parse

url = urllib.parse.quote("http://zzk.cnblogs.com/s?w=python爬蟲&t=b")
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
  • Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 21, in <module>
    resp = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 448, in open
    req = Request(fullurl, data)
  File "C:\Python34\lib\urllib\request.py", line 266, in __init__
    self.full_url = url
  File "C:\Python34\lib\urllib\request.py", line 292, in full_url
    self._parse()
  File "C:\Python34\lib\urllib\request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db'

Process finished with exit code 1
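  • Result: ValueError! The page cannot be fetched. As the error message shows, quote() also escaped the ':' after http (as %3A), along with '?', '=' and '&', so urlopen() no longer recognizes the URL scheme.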
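The failure can be reproduced without touching the network. The sketch below is an illustration, not part of the original tests. It quotes the whole URL with the default arguments, shows that no scheme is left for the parser to find, and triggers the same ValueError by constructing a urllib.request.Request, which is the step where the traceback above fails. The keyword is written as simplified 爬虫 so that the encoded bytes match the %E7%88%AC%E8%99%AB in the traceback.

```python
import urllib.parse
import urllib.request

raw = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"
quoted = urllib.parse.quote(raw)  # default safe="/" escapes ':', '?', '=' and '&'
print(quoted)

# With the ':' escaped, there is no "http:" prefix left to parse as a scheme.
scheme = urllib.parse.urlparse(quoted).scheme
print(repr(scheme))

try:
    urllib.request.Request(quoted)  # builds the request object only; nothing is sent
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```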

 

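  • Putting tests 1, 2, and 3 together gives the following conclusions:

(1) In Python 3.4, if a URL contains Chinese, the Chinese part can be percent-encoded with urllib.parse.quote("爬蟲").

(2) The Chinese in the URL has to be quoted on its own; calling quote() with its default arguments on the whole URL, Chinese and ASCII together, does not work.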
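Conclusion (2) holds for quote() with its default arguments. quote() also takes a safe parameter listing characters that must not be escaped (it defaults to '/'), so one possible way to process the whole URL in a single call is to mark the URL delimiters as safe. This is a sketch under that assumption. The safe string ':/?&=' covers only the delimiters that occur in this particular URL, and the keyword is written as simplified 爬虫 to match the encoded form seen in the traceback above.

```python
import urllib.parse

raw = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"
# Keep the URL delimiters as they are; only the non-ASCII keyword gets escaped.
url = urllib.parse.quote(raw, safe=":/?&=")
print(url)

# Decoding restores the original text, so nothing was lost.
restored = urllib.parse.unquote(url)
print(restored == raw)
```

A URL whose parameter values themselves contain '&' or '=' would still need those values quoted separately, which is the approach from test 2.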

 

  • Tip: to see which parameters a function accepts, call help() on it:
#python3.4
import urllib.request
help(urllib.request.urlopen)
  • Running the code above prints the following to the console:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x00A50490>, *, cafile=None, capath=None, cadefault=False, context=None)

Process finished with exit code 0
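help() prints to the console. When the parameter list is needed inside a program, the standard inspect module offers a similar view. A small sketch, offered as an alternative rather than as what the post itself used:

```python
import inspect
import urllib.request

sig = inspect.signature(urllib.request.urlopen)
print(sig)

# Parameter names in declaration order; the exact list varies by Python version.
names = list(sig.parameters)
print(names)
```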

 

  @_@)Y, that's all for this post~ To be continued~
