Getting Started with Python Web Scraping (1): the urllib Module

Purpose: reading data from the network (i.e. from a server).
 
Basic method: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
  • url: the URL to open
  • data: data to submit via POST (a GET request is sent when data is None)
  • timeout: timeout for the request, in seconds
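The effect of the data parameter can be checked offline by building a Request object instead of actually opening a connection: with data=None the request method is GET, and supplying url-encoded bytes switches it to POST. A minimal sketch (http://example.com is just a placeholder URL):

```python
import urllib.request
import urllib.parse

# with data=None the request is a GET
req = urllib.request.Request("http://example.com")
print(req.get_method())  # GET

# supplying url-encoded bytes as data turns it into a POST
payload = urllib.parse.urlencode({'q': 'hello'}).encode('utf-8')
req = urllib.request.Request("http://example.com", data=payload)
print(req.get_method())  # POST
```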
 
Example 1: fetching a page
    import urllib.request
    response = urllib.request.urlopen("http://www.fishc.com")  # an HTTP response object
    html = response.read()  # read the response body; this is of type bytes
    # print(type(html), html)  # prints <class 'bytes'> followed by the raw bytes
    html = html.decode('utf-8')  # decode the bytes into a str
    print(html)

Example 2: grabbing a kitten image
import urllib.request
response = urllib.request.urlopen("http://placekitten.com/g/400/400")
cat_img = response.read()
with open('cat_400_400.jpg', 'wb') as f:
    f.write(cat_img)

Example 3: a translator
In the browser, right-click and choose Inspect (or Inspect Element), open the Network tab, find the POST request by its Name, and copy its Request URL.
Under Headers, find Form Data and copy the form fields.

import urllib.request
import urllib.parse
import json
import time
while True:
    content = input("Enter the text to translate (type q! to quit): ")
    if content == 'q!':
        break
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=http://www.youdao.com/"    # the Request URL copied from the Network tab
    data = {}
    # fields copied from Form Data; unneeded entries can be dropped
    data['i'] = content
    data['smartresult'] = 'dict'
    data['client'] = 'fanyideskweb'
    data['doctype'] = 'json'
    data['version'] = '2.1'
    data['keyfrom'] = 'fanyi.web'
    data['action'] = 'FY_BY_CLICKBUTTON'
    data['typoResult'] = 'true'
    data = urllib.parse.urlencode(data).encode('utf-8')
    # open the URL and submit the form data
    response = urllib.request.urlopen(url, data)
    html = response.read().decode('utf-8')
    target = json.loads(html)
    print("Translation result: %s" % target['translateResult'][0][0]['tgt'])
    time.sleep(2)
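The urlencode step above flattens the dict of form fields into an application/x-www-form-urlencoded string, which must then be encoded to bytes before being passed as the POST body. A small offline check (using only two of the fields for brevity):

```python
import urllib.parse

# the dict of form fields is flattened into key=value pairs joined by '&'
data = {'i': 'hello', 'doctype': 'json'}
encoded = urllib.parse.urlencode(data)
print(encoded)  # i=hello&doctype=json

# urlopen expects bytes for the POST body, hence the .encode('utf-8')
body = encoded.encode('utf-8')
print(body)     # b'i=hello&doctype=json'
```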

Hiding and proxies
Hiding (disguising the scraper as a browser): 1. modify the headers argument passed to Request
           2. modify headers via the Request.add_header() method
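Both hiding techniques can be verified offline by inspecting the Request object; no connection is opened. A sketch with a shortened, illustrative User-Agent string and a placeholder URL:

```python
import urllib.request

ua = 'Mozilla/5.0 (Windows NT 10.0; WOW64)'  # shortened User-Agent, for illustration

# method 1: pass a headers dict to the Request constructor
req1 = urllib.request.Request('http://example.com', headers={'User-Agent': ua})

# method 2: call add_header() after constructing the Request
req2 = urllib.request.Request('http://example.com')
req2.add_header('User-Agent', ua)

# both Requests now carry the custom User-Agent
# (note: urllib stores header names capitalized, hence 'User-agent')
print(req1.get_header('User-agent'))
print(req2.get_header('User-agent'))
```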
Proxy: 1. proxy_support = urllib.request.ProxyHandler({})  # the argument is a dict {'type': 'proxy IP:port'}
           2. opener = urllib.request.build_opener(proxy_support)  # build a custom opener
           3. urllib.request.install_opener(opener)  # install the opener globally
              opener.open(url)  # or call the opener directly
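The wiring of steps 1 and 2 can be inspected offline: build_opener attaches the supplied ProxyHandler to the opener (replacing the default one), and the handler records the proxy dict. A sketch with a made-up proxy address:

```python
import urllib.request

# the argument is a dict {'type': 'proxy IP:port'}; this address is made up
proxy_support = urllib.request.ProxyHandler({'http': '127.0.0.1:8080'})
opener = urllib.request.build_opener(proxy_support)

# the opener's handler list now contains our ProxyHandler with our proxy dict
proxies = [h.proxies for h in opener.handlers
           if isinstance(h, urllib.request.ProxyHandler)]
print(proxies)  # [{'http': '127.0.0.1:8080'}]
```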
Example 5: using a proxy
    import urllib.request
    import random
    url = 'http://www.whatismyip.com.tw/'
    iplist = ['61.191.41.130:80', '115.46.97.122:8123']

    # the argument is a dict {'type': 'proxy IP:port'}
    proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
    # build a custom opener
    opener = urllib.request.build_opener(proxy_support)
    # modify the User-Agent via addheaders
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36')]
    # install the opener
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')
    print(html)

Example 6: a simple scraper for Tieba images
    import urllib.request
    import re

    def open_url(url):
        # open the URL with a modified header and return its content
        req = urllib.request.Request(url)
        # modify the User-Agent via add_header
        req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36')
        page = urllib.request.urlopen(req)
        html = page.read().decode('utf-8')
        return html

    def get_img(html):
        p = r'<img class="BDE_Image" src="([^"]+\.jpg)'
        imglist = re.findall(p, html)  # find the image links
        for each in imglist:
            filename = each.split("/")[-1]
            urllib.request.urlretrieve(each, filename, None)  # save the image
    
    if __name__ == '__main__':
        url = "https://tieba.baidu.com/p/5090206152"
        get_img(open_url(url))
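The regex extraction step can be checked offline against a small HTML snippet shaped like a Tieba page (the snippet and its URLs are fabricated for the test; the pattern is the one used above):

```python
import re

# a tiny fabricated snippet in the shape the regex expects
html = ('<div><img class="BDE_Image" src="https://imgsa.example.com/abc.jpg" width="560">'
        '<img class="BDE_Image" src="https://imgsa.example.com/def.jpg"></div>')

p = r'<img class="BDE_Image" src="([^"]+\.jpg)'
imglist = re.findall(p, html)
print(imglist)
# ['https://imgsa.example.com/abc.jpg', 'https://imgsa.example.com/def.jpg']

# the filename is the last path component, as in get_img above
print([each.split("/")[-1] for each in imglist])  # ['abc.jpg', 'def.jpg']
```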
