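A Beginner's Web Scraper in 20 Lines of Python Using String Operations, with a Worked Example (Scraping One Chapter of the Novel The Three-Body Problem, 《三體》)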

Date: 2019-12-14



The simple methods you will need

1. Import the required package

import urllib.request

2. try...except

try:
    ...  # statement 1
except Exception as e:
    ...  # statement 2
Python tries to run statement 1; if statement 1 raises an exception, statement 2 runs instead.
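
As a small illustration of the pattern above, the sketch below wraps urlopen in a helper. The name fetch_or_none and the 10-second timeout are assumptions made for this example, not part of the original article. It catches only the errors a failed request is expected to raise rather than a blanket Exception, so unrelated bugs still surface.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=10):
    """Return the raw bytes at url, or None if the request fails.

    fetch_or_none is a hypothetical helper used only for illustration.
    """
    try:
        # Statement 1: may raise if the URL is malformed or unreachable.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, ValueError) as e:
        # Statement 2: runs only when statement 1 raised one of these errors.
        print('Fetch failed:', e)
        return None


# A string without a scheme is rejected before any network access happens.
result = fetch_or_none('not a url')
```

Returning None lets the caller decide what to do next, instead of carrying on with a variable that was never assigned.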

3. Fetch content with urlopen

response = urllib.request.urlopen(webList)
# fetch the page whose URL is stored in webList

4. Read with read()

response.read()
# read the fetched content as bytes

5. Decode with decode()

response.read().decode('UTF-8')
# decode the bytes as UTF-8 to get a string
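
Not every page is served as UTF-8, and decode('UTF-8') raises UnicodeDecodeError when the bytes do not fit. The sketch below shows one way to cope, under the assumption that the page is either UTF-8 or GBK. The helper name decode_body and the list of candidate encodings are hypothetical choices for this example, and the byte strings stand in for what response.read() would return.

```python
def decode_body(raw, encodings=('utf-8', 'gbk')):
    """Decode raw bytes by trying each candidate encoding in order.

    decode_body and its default encodings are illustrative assumptions.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding did not fit; try the next one
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors='replace')


utf8_bytes = '三體'.encode('utf-8')  # stands in for response.read()
gbk_bytes = '三體'.encode('gbk')     # the same text in a different encoding

print(decode_body(utf8_bytes))
print(decode_body(gbk_bytes))
```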

6. Replacement methods

html = html.expandtabs()
# expand every tab character in html into spaces (tab stops every 8 columns by default)

html = html.replace(' ', '')
# remove every space
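
To make the effect of these two calls concrete, here is a minimal sketch on a short sample string (the sample text is made up for illustration). It also shows an alternative, str.split() followed by join, which removes all whitespace in one step. Note that this alternative removes newlines as well, which the two calls above do not.

```python
sample = 'a\tb c'

# The tab becomes spaces up to the next tab stop (column 8 by default).
expanded = sample.expandtabs()
# Then every space, including the ones that came from the tab, is removed.
stripped = expanded.replace(' ', '')

# Alternative: split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining the pieces drops all of it at once.
compact = ''.join('a\tb c\nd'.split())

print(repr(expanded))
print(repr(stripped))
print(repr(compact))
```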

7. Get the length

lenth = len(html)
# get the length of the document

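8. Search with find()

index = html.find('neirong">', 0, lenth)
# return the index of the first occurrence of the substring between positions 0 and lenth, or -1 if it is not found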

9. String slicing

html[0:index2]
# slice the string from position 0 up to, but not including, index2
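
The way find() and slicing work together can be sketched on a tiny, made-up HTML fragment (the fragment is an assumption for illustration; only the marker 'neirong">' comes from the case study). The point to notice is that find() returns where the marker starts, so the marker's own length has to be added to skip past it, which is where the +9 in the case study comes from.

```python
html = '<div class="neirong"><p>text</p></div>'
marker = 'neirong">'

start = html.find(marker)               # index where the marker begins
missing = html.find('no-such-marker')   # find() returns -1 when absent

# Skip past the marker itself; len(marker) is 9, hence the "+9" idiom.
body = html[start + len(marker):]
# A slice excludes its end index, so this stops just before '</div>'.
inner = body[0:body.find('</div>')]

print(start, missing)
print(body)
print(inner)
```

Because -1 is also a valid index in a slice, it is worth checking the result of find() before slicing with it; otherwise a missing marker silently produces the wrong text.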

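10. Write to a file with open() and write()

with open('三體.txt', 'w', encoding='utf-8') as writeFile:
    writeFile.write(htm)
# write the text to a file; the with block closes the file afterwards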
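
The following sketch shows the write step on its own and reads the file back to check it. The file name chapter.txt, the sample text, and the use of a temporary directory are assumptions made so the example leaves nothing behind; they are not from the original article.

```python
import os
import tempfile

text = '    第一段\n    第二段\n'  # made-up sample text

# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'chapter.txt')  # hypothetical file name
    # "with" closes the file even if write() raises; the encoding is given
    # explicitly so the result does not depend on the platform default.
    with open(path, 'w', encoding='utf-8') as out_file:
        out_file.write(text)
    # Read the file back to confirm that the text survived the round trip.
    with open(path, encoding='utf-8') as in_file:
        round_trip = in_file.read()

print(round_trip == text)
```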

Case study: scraping one page of the novel The Three-Body Problem (《三體》).

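# Import the required packages
import sys
import urllib.request

# The page to fetch
webList = 'http://www.51shucheng.net/kehuan/santi/santi1/174.html'
# Try to connect
try:
    response = urllib.request.urlopen(webList)
    # If the connection succeeds, response is the page we fetched
except Exception as e:
    # Otherwise report the failure and stop, because response was never assigned
    print('Fetch failed:', e)
    sys.exit(1)
html = response.read().decode('UTF-8')
# Read the fetched content and decode it as UTF-8
html = html.expandtabs()
# Expand every tab character into spaces
html = html.replace(' ', '')
# Remove every space
print(html)
# Print a preview to make it easier to locate the markers
lenth = len(html)
# Get the length of the document
html = html[html.find('neirong">', 0, lenth) + 9:]
index = html.find('跟鞋。</p>', 0) + 3
index2 = html.find('眷戀着天空。</p>')
index3 = html.find('<p>「紅色聯合」的戰士們歡呼起來')
# Find a few key positions and keep their indexes for the slicing below
htm = html[0:index2] + html[index3:index]
# Slice the string down to the chapter text
htm = htm.replace('<p>', '    ')
htm = htm.replace('</p>', '\n')
# Replace the <p> and </p> tags in the text
with open('三體.txt', 'w', encoding='utf-8') as writeFile:
    writeFile.write(htm)
# Write the result to a file
print('Done writing')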
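
One possible refinement, offered only as a sketch: the string-processing part of the script can be pulled into a function so that it can be tried on a small in-memory sample without touching the network. The function name extract_text, its parameters, and the sample page below are all hypothetical; the whitespace-removal step from the script is left out to keep the example short.

```python
def extract_text(html, start_marker, end_marker):
    """Return the text between two markers, with <p> tags turned into lines.

    extract_text is a hypothetical helper, not part of the original script.
    """
    start = html.find(start_marker)
    if start == -1:
        raise ValueError('start marker not found: ' + start_marker)
    start += len(start_marker)  # skip past the marker itself

    end = html.find(end_marker, start)  # search only after the start marker
    if end == -1:
        raise ValueError('end marker not found: ' + end_marker)

    body = html[start:end]
    # Same replacements as in the script: indent each paragraph, then break.
    return body.replace('<p>', '    ').replace('</p>', '\n')


# A made-up page that mimics the structure the script relies on.
sample = ('<html><div class="neirong">'
          '<p>First paragraph.</p><p>Second paragraph.</p>'
          '</div></html>')

text = extract_text(sample, 'neirong">', '</div>')
print(text)
```

Raising ValueError when a marker is missing makes a changed page layout fail loudly, whereas slicing with the -1 that find() returns would quietly write the wrong text to the file.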