[Python Crawler Notes][Getting Started with a Random Blog (Part 1)]

Date: 2019-12-11


Tags (space-separated): Python crawler 2016-summer-vacation

Source blog: 掙脫不足與矇昧

1. Fetching the HTML of a specific URL

import urllib.request
url = "http://120.27.101.158/"
response = urllib.request.urlopen(url)
html = response.read()
html = html.decode('utf-8')
print(html)

urllib.request.urlopen()
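- Works much like open() for files; the response object it returns also behaves like a file object.
- Passing a URL string is equivalent to building a Request object first and passing that: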
```
req = urllib.request.Request("http://placekitten.com/500/600")
response = urllib.request.urlopen(req)
```
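To see what the explicit form gives you, here is a small sketch that builds a `Request` and inspects it without sending anything. The URL is the one from the snippet above; no network access is involved.

```python
import urllib.request

# Build the Request object that urlopen() would otherwise create from a plain URL string.
req = urllib.request.Request("http://placekitten.com/500/600")

# Inspect it without touching the network.
print(req.full_url)      # the URL the request targets
print(req.get_method())  # GET, because no data was attached
print(req.host)          # host part parsed from the URL
```

Holding on to the `Request` object is useful later, when headers or form data need to be attached before the request is sent.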
response.read()
- Reads the response body, much like read() on a file object.
- The response object also has the following commonly used methods:
```
response.geturl()   # the URL that was actually retrieved
response.info()     # the headers sent by the remote server
response.getcode()  # the HTTP status code
```
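The sketch below exercises these methods against a throwaway local server, so it runs without internet access. `HelloHandler` and the page it serves are made up for this illustration; only the `urlopen` calls mirror the code in this section.

```python
import http.server
import threading
import urllib.request


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a small UTF-8 page."""

    def do_GET(self):
        body = "<html><body>你好, crawler</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Ignore any system proxy so the request really goes to the local server.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({})))

url = "http://127.0.0.1:%d/" % server.server_address[1]
with urllib.request.urlopen(url) as response:
    final_url = response.geturl()                       # URL actually retrieved
    status = response.getcode()                         # HTTP status code
    content_type = response.info().get("Content-Type")  # one response header
    html = response.read().decode("utf-8")              # body as str

server.shutdown()
server.server_close()

print(final_url, status, content_type)
print(html)
```

Using `with` closes the response when the block ends, the same way it would close a file.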
html.decode()
- decode() converts the bytes returned by read() into a str, using the codec named by its encoding argument.
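A minimal sketch of the bytes/str round trip follows. The sample text is arbitrary, and the failing `ascii` decode is only there to show what happens when the codec does not match the data.

```python
text = "爬蟲 crawler"
raw = text.encode("utf-8")      # str -> bytes, the form in which a server sends it
print(type(raw).__name__, raw)

decoded = raw.decode("utf-8")   # bytes -> str, using the same encoding
print(decoded == text)

# Decoding with a codec that cannot represent the data raises an error.
wrong_codec_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    wrong_codec_error = exc
    print("ascii failed:", exc.reason)
```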

2. A simple translation program (scraping Youdao Dictionary)

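Whenever we register an account or fill in personal details, a form is involved: the browser sends the form to the server in a POST request. HTML forms follow a specific format. As an example, open Youdao online translation, bring up the browser's developer tools, enter the text "Hello,Python", and click auto-translate.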

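In the Network tab of the developer tools we can see some common information,
such as the exact URL that was requested and the HTTP status (200).
The Params tab (Firefox) shows the submitted form fields as key-value pairs.
The Response tab shows that the data returned is in JSON format.

We build the form as a dictionary, URL-encode it, submit it, and parse the JSON in the response. The code is as follows.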

import urllib.request
# urllib.parse handles URL parsing and encoding
import urllib.parse 
import json

url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null/'
content = input("What would you like to translate? ")

data = {}
data['type']='AUTO'
data['i'] = content
data['doctype'] = 'json'
data['xmlVersion'] = '1.8'
data['keyfrom'] = 'fanyi.web'
data['ue'] = 'UTF-8'
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen(url, data)
html=response.read().decode('utf-8')

target =json.loads(html)

print("Translation result: %s" % target['translateResult'][0][0]['tgt'])

Result

> print (target)

{'translateResult': [[{'src': '測試程序', 'tgt': 'The test program'}]], 'elapsedTime': 0, 'errorCode': 0, 'smartResult': {'entries': ['', '[計] test program'], 'type': 1}, 'type': 'ZH_CN2EN'}

We can see that the translated text is in translateResult[0][0]['tgt'].
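Indexing straight into the result raises an exception if the server answers with a different shape. Here is a hedged sketch of a more defensive lookup: the JSON text is a trimmed copy of the sample output above, while `first_translation` and the error payload in the last line are hypothetical.

```python
import json

# Trimmed copy of the sample response above, written out as JSON text.
html = ('{"translateResult": [[{"src": "測試程序", "tgt": "The test program"}]], '
        '"elapsedTime": 0, "errorCode": 0, "type": "ZH_CN2EN"}')

target = json.loads(html)


def first_translation(result):
    """Hypothetical helper: return the first 'tgt' string, or None if the shape is unexpected."""
    try:
        return result["translateResult"][0][0]["tgt"]
    except (KeyError, IndexError, TypeError):
        return None


print(first_translation(target))
print(first_translation({"errorCode": 1}))  # made-up payload without translateResult
```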

data = urllib.parse.urlencode(data).encode('utf-8')

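urlencode converts the dictionary into a query string that can be sent with POST or GET. Non-ASCII characters such as Chinese are percent-encoded, using UTF-8 by default.

encode then converts that string into a byte sequence. (As the session below shows, the choice of 'utf-8' here makes no practical difference: the URL-encoded string contains only ASCII characters, so encoding it with gbk yields the same bytes.)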

>data
>{'type': 'AUTO', 'ue': 'UTF-8', 'typoResult': 'true', 'i': '程序测试', 'xmlVersion': '1.8', 'keyfrom': 'fanyi.web', 'doctype': 'json'}
>data = urllib.parse.urlencode(data)  # dict to str
>'type=AUTO&ue=UTF-8&typoResult=true&i=%E7%A8%8B%E5%BA%8F%E6%B5%8B%E8%AF%95&xmlVersion=1.8&keyfrom=fanyi.web&doctype=json'
>data = data.encode('utf-8')  # str to bytes
>b'type=AUTO&ue=UTF-8&typoResult=true&i=%E7%A8%8B%E5%BA%8F%E6%B5%8B%E8%AF%95&xmlVersion=1.8&keyfrom=fanyi.web&doctype=json'
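The claim that utf-8 and gbk give the same bytes can be checked directly. This sketch uses only the `i` field from the session above; the `encoding="gbk"` call is added for contrast and is not part of the original program.

```python
import urllib.parse

query = urllib.parse.urlencode({"i": "程序测试"})
print(query)             # percent-encoded text
print(query.isascii())   # only ASCII characters remain

# ASCII text maps to the same bytes under UTF-8 and GBK.
same_bytes = query.encode("utf-8") == query.encode("gbk")
print(same_bytes)

# The encoding that does matter is the one urlencode() uses for percent-encoding.
query_gbk = urllib.parse.urlencode({"i": "程序测试"}, encoding="gbk")
print(query_gbk)
print(query != query_gbk)
```

So the codec passed to `encode` is interchangeable here, but the one passed to `urlencode` has to match what the server expects.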

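response = urllib.request.urlopen(url, data)
- The data argument must be a bytes object; supplying it makes the request a POST.
html=response.read().decode('utf-8')
- Decodes the UTF-8 bytes received into a str.
target =json.loads(html)
- The response body should be JSON; this parses it into a dictionary.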
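To make the role of the data argument concrete, the sketch below builds two requests without sending them. The `example.com` address is a placeholder, and the form holds only two of the fields used above.

```python
import urllib.parse
import urllib.request

form = {"i": "Hello,Python", "doctype": "json"}      # subset of the form fields used above
body = urllib.parse.urlencode(form).encode("utf-8")  # dict -> str -> bytes

# example.com is a placeholder host; nothing is sent here.
get_req = urllib.request.Request("http://example.com/translate")
post_req = urllib.request.Request("http://example.com/translate", data=body)

print(get_req.get_method())   # no data attached
print(post_req.get_method())  # data attached
print(body)
```

`urlopen(url, data)` wraps the URL and data in the same kind of `Request` internally, so attaching data is what turns the call into a POST.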

3. A small imitation: scraping Google Translate

import re
import urllib.parse
import urllib.request

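# Rough idea: imitate a browser, send the data to Google Translate, then scrape the translation from the returned page
def Gtranslate(text):
    Gtext = text     # text: the string to translate
    # hl: browser/operating-system language, zh-CN by default
    # ie: input encoding, UTF-8 by default
    # text: the string to translate
    # langpair: language pair, e.g. 'en|zh-CN' means English to Simplified Chinese
    values = {'hl': 'zh-CN', 'ie': 'UTF-8', 'text': Gtext, 'langpair': "auto"}
    url = 'http://translate.google.cn/'     # address of Google Translate
    data = urllib.parse.urlencode(values).encode("utf-8")    # URL-encode values with urllib.parse.urlencode, then convert to bytes
    req = urllib.request.Request(url, data)     # build a request from the URL and the data
    browser = 'Mozilla/4.0 (Windows; U;MSIE 6.0; Windows NT 6.1; SV1; .NET CLR 2.0.50727)'     # pose as an IE 6.0 browser; without this, Google returns a 403 error
    req.add_header('User-Agent', browser)
    response = urllib.request.urlopen(req)     # send the request to Google Translate
    html = response.read()     # read the returned page; the translated string is cut out of this HTML
    html = html.decode('utf-8')
    # The lookbehind (?<=TRANSLATED_TEXT=) anchors the match; the translated text is whatever follows the equals sign
    p = re.compile(r"(?<=TRANSLATED_TEXT=).*(?=';INPUT_TOOL_PATH='//www.google.com')")
    m = p.search(html)
    if m is None:     # the page did not contain the expected marker
        raise ValueError('TRANSLATED_TEXT not found in the response')
    chineseText = m.group(0).strip(';')
    return chineseText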

if __name__ == "__main__":
    # Gtext is the string to translate
    Gtext = '我是上帝'
    print('The input text: %s' % Gtext)
    chineseText = Gtranslate(Gtext).strip("'")
    print('Translation finished. The output text: %s' % chineseText)
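The extraction step can be tried without contacting Google. In this sketch the regular expression is the one from `Gtranslate`, while `sample_html` is a made-up fragment shaped like the script text the pattern expects and `extract_translation` is a hypothetical helper.

```python
import re

# Made-up fragment shaped like the script text the pattern in Gtranslate() expects.
sample_html = ("var a=1;TRANSLATED_TEXT='I am God';"
               "INPUT_TOOL_PATH='//www.google.com';var b=2;")

pattern = re.compile(r"(?<=TRANSLATED_TEXT=).*(?=';INPUT_TOOL_PATH='//www.google.com')")


def extract_translation(html):
    """Hypothetical helper: return the translated text, or None when the marker is missing."""
    match = pattern.search(html)
    if match is None:
        return None
    # The match still carries the opening quote, so strip it as the main program does.
    return match.group(0).strip(";").strip("'")


print(extract_translation(sample_html))
print(extract_translation("<html>no marker here</html>"))
```

The lookbehind and lookahead only anchor the match and are not part of it, which is why the result contains neither `TRANSLATED_TEXT=` nor the `INPUT_TOOL_PATH` tail.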