2017.07.23 Python Web Crawlers: Commonly Used Modules

Date: 2019-12-17


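1. For anything network-related, the one indispensable module is urllib2. As the name suggests, it is mainly responsible for opening URLs over HTTP and similar protocols. There is also a module called urllib, but neither is an upgraded version of the other.

2. Requesting a page with urllib2

(1) The simplest use of urllib2 is the urllib2.urlopen function:

urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])

According to the official documentation, urllib2.urlopen can open URLs that use the HTTP, HTTPS, and FTP protocols; it is mostly used for HTTP.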

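(2) Parameters:

The parameters whose names start with ca all relate to identity (certificate) verification and are rarely used.

The data parameter is used when submitting to the URL with POST.

The ones used most often are just url and timeout. url is the network address to request, written in full: the protocol name at the front and, when it is not the protocol default, the port at the end, for example http://192.168.1.1:80. timeout sets the timeout in seconds.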
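To make the url, data, and timeout parameters concrete, here is a minimal sketch. It is written for Python 3, where the urllib2 functionality lives in urllib.request (the article itself targets Python 2, so this is an assumption about the reader's interpreter). The helper name build_request and the form field are made up for illustration. Nothing is sent over the network; the sketch only shows that supplying data turns the request from a GET into a POST.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, form=None):
    """Return a Request; passing form data makes it a POST (hypothetical helper)."""
    data = None
    if form is not None:
        # data must be bytes, so the URL-encoded form string is encoded first.
        data = urlencode(form).encode("utf-8")
    return Request(url, data=data)


get_req = build_request("http://192.168.1.1:80/")
post_req = build_request("http://192.168.1.1:80/login", {"user": "demo"})

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'user=demo'

# Sending it would look like this (timeout is in seconds):
# from urllib.request import urlopen
# response = urlopen(post_req, timeout=3)
```

The address and path here are placeholders; point them at a server you control before uncommenting the urlopen call.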

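(3) Methods of the returned object:

geturl() returns the URL of the response; it is often used to detect URL redirects.

info() returns the basic information about the response (its headers).

getcode() returns the status code of the response. The most common codes are 200 (the server returned the page successfully), 404 (the requested page does not exist), and 503 (the server is temporarily unavailable).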
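The three methods above can be tried without touching the internet. The sketch below is for Python 3 (urllib.request instead of urllib2, which is an assumption on my part) and starts a throwaway HTTP server on 127.0.0.1. DemoHandler, describe, and run_demo are hypothetical names invented for this example. Recent Python 3 releases document geturl(), info(), and getcode() as deprecated in favour of the url, headers, and status attributes, but the methods are used here to match the article.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server: /old redirects to /new, anything else is a 404."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/new":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep the demo output quiet


def describe(url, timeout=3):
    """Return (final_url, status_code, content_type) for url."""
    opener = build_opener(ProxyHandler({}))  # ignore any system proxy settings
    try:
        response = opener.open(url, timeout=timeout)
    except HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still available.
        return url, err.code, None
    with response:
        return (response.geturl(), response.getcode(),
                response.info().get("Content-Type"))


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        return base, describe(base + "/old"), describe(base + "/missing")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    base, redirected, missing = run_demo()
    print(redirected)  # the final URL ends with /new and the code is 200
    print(missing)     # the code is 404
```

Requesting /old shows why geturl() matters: the URL you get back is the one the redirect landed on, not the one you asked for.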

(4) Write testUrllib2.py to test it:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
import time
import platform
import os
import sys
import gzip
import StringIO

def clear():
    """Clear the screen."""
    print(u"Lots of output; clearing the screen in 3 seconds")
    time.sleep(3)
    OS=platform.system()
    if (OS==u'Windows'):
        os.system('cls')
    else:
        os.system('clear')

def linkBaidu():
    url='http://www.baidu.com'

    try:
        # Fetch once and reuse the response for both the body and the metadata
        res=urllib2.urlopen(url,timeout=3)
        raw=res.read()
    except urllib2.URLError:
        print(u"Bad network address")
        sys.exit(1)

    # Decompress only when the server says the body is gzip-encoded
    if res.info().getheader('Content-Encoding') == 'gzip':
        html = gzip.GzipFile(fileobj=StringIO.StringIO(raw)).read()
    else:
        html = raw

    # Binary mode so the bytes are written exactly as received
    with open('./baidu.txt','wb') as fp:
        fp.write(html)

    print(u"URL, response.geturl()\n: %s" %res.geturl())
    print(u"Status code, response.getcode()\n: %s" %res.getcode())
    print(u"Response info, response.info()\n: %s" %res.info())
    print(u"The page content has been saved to baidu.txt in the current directory")

if __name__ == '__main__':
    linkBaidu()

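While running it, I hit a problem: the text written to baidu.txt was garbled.
The fix is to decompress with gzip before writing (Baidu really does gzip its pages). In an HTTP request, if the request headers include "Accept-Encoding": "gzip, deflate"
and the web server supports it, the returned data is compressed. The benefit is less network traffic; the client then decompresses the body according to the response headers and decodes it.
The urllib2 module hands back the raw HTTP response data without decompressing it, and that is the root cause of the garbled text.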
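Here is a small offline sketch of that decompress-if-needed step, written for Python 3 (where bytes handling differs from the Python 2 script, so treat it as an adaptation rather than the original method). decode_body is a hypothetical helper; the content_encoding argument stands in for the value of the Content-Encoding response header.

```python
import gzip
import io


def decode_body(raw, content_encoding=None):
    """Decompress raw only when the server labelled it as gzip (hypothetical helper)."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
            return gz.read()
    return raw


page = "<html>百度一下</html>".encode("utf-8")
compressed = gzip.compress(page)  # simulate what a gzip-enabled server sends

print(decode_body(compressed, "gzip") == page)  # True
print(decode_body(page, None) == page)          # True
print(compressed[:2])                           # b'\x1f\x8b'
```

Writing the compressed bytes straight to a file is what produces the "garbled" output: a gzip stream starts with the two magic bytes shown on the last line, not with readable HTML.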

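Output:

The generated baidu.txt file:

If baidu.txt already exists in the current directory, the program may fail; in that case, delete the file and run it again.

For ordinary sites, the following code is enough: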

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
import time
import platform
import os

def clear():
    """Clear the screen."""
    print(u"Lots of output; clearing the screen in 3 seconds")
    time.sleep(3)
    OS=platform.system()
    if (OS==u'Windows'):
        os.system('cls')
    else:
        os.system('clear')

def linkBaidu():
    url='http://www.hqu.edu.cn'

    try:
        response=urllib2.urlopen(url,timeout=3)
    except urllib2.URLError:
        print(u"Bad network address")
        exit()

    # The body is written as-is; no gzip handling here
    with open('./baidu.txt','w+') as fp:
        fp.write(response.read())

    print(u"URL, response.geturl()\n: %s" %response.geturl())
    print(u"Status code, response.getcode()\n: %s" %response.getcode())
    print(u"Response info, response.info()\n: %s" %response.info())
    print(u"The page content has been saved to baidu.txt in the current directory")

if __name__ == '__main__':
    linkBaidu()

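3. Accessing pages through a proxy with urllib2: when running a crawler, some sites refuse direct access from certain IPs, and then you have no choice but to go through a proxy. Free proxies are easy to find online; pick one that you have confirmed works.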
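As a quick illustration of what "going through a proxy" means in code, here is a minimal Python 3 sketch (urllib.request rather than urllib2, which is my assumption about the interpreter). proxy_mapping and opener_for are hypothetical helpers, and 1.2.3.4:5 is the same placeholder address the usage text of the script uses, not a real proxy. The dictionary passed to ProxyHandler maps a request scheme to the proxy that should handle requests of that scheme.

```python
from urllib.parse import urlsplit
from urllib.request import ProxyHandler, build_opener


def proxy_mapping(proxy_url):
    """Turn 'http://1.2.3.4:5' into {'http': '1.2.3.4:5'} (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    if (parts.scheme not in ("http", "https")
            or not parts.hostname or parts.port is None):
        raise ValueError("expected something like http://1.2.3.4:5")
    return {parts.scheme: parts.netloc}


def opener_for(proxy_url):
    """Build an opener that routes requests of the matching scheme via the proxy."""
    return build_opener(ProxyHandler(proxy_mapping(proxy_url)))


print(proxy_mapping("http://1.2.3.4:5"))  # {'http': '1.2.3.4:5'}
opener = opener_for("http://1.2.3.4:5")
# opener.open("http://www.baidu.com", timeout=5) would now go through the proxy.
```

Returning an opener instead of calling install_opener keeps the proxy local to the code that asks for it; installing it globally, as the script does, affects every later urlopen call.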

Write testUrllib2WithProxy.py to test access through a proxy:

#!/usr/bin/env python
#-*- coding: utf-8 -*-
__author__ = 'hstking hstking@hotmail.com'

import urllib2
import sys
import re

def testArgument():
   '''Check the command-line arguments; exactly one is required '''
   if len(sys.argv) != 2:
      print(u'Exactly one argument is required')
      tipUse()
      exit()
   else:
      TP = TestProxy(sys.argv[1])

def tipUse():
   '''Print the usage message '''
   print(u'This program takes exactly one argument, which must be a usable proxy')
   print(u'usage: python testUrllib2WithProxy.py http://1.2.3.4:5')
   print(u'usage: python testUrllib2WithProxy.py https://1.2.3.4:5')

class TestProxy(object):
   '''This class tests whether a proxy works '''
   def __init__(self,proxy):
      self.proxy = proxy
      self.checkProxyFormat(self.proxy)
      self.url = 'http://www.baidu.com'
      self.timeout = 5
      self.flagWord = '百度' # keyword to look for in the returned page
      self.useProxy(self.proxy)

   def checkProxyFormat(self,proxy):
      '''Check the format of the proxy string '''
      try:
         proxyMatch = re.compile(r'http[s]?://[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}:[\d]{1,5}$')
         re.search(proxyMatch,proxy).group()
      except AttributeError:
         # re.search returned None, so the string does not look like a proxy
         tipUse()
         exit()
      flag = 1
      proxy = proxy.replace('//','')
      try:
         protocol = proxy.split(':')[0]
         ip = proxy.split(':')[1]
         port = proxy.split(':')[2]
      except IndexError:
         print(u'Index out of range')
         tipUse()
         exit()
      # Check each piece: three colon-separated parts, four octets in range
      flag = flag and len(proxy.split(':')) == 3 and len(ip.split('.')) == 4
      flag = ip.split('.')[0] in map(str,xrange(1,256)) and flag
      flag = ip.split('.')[1] in map(str,xrange(256)) and flag
      flag = ip.split('.')[2] in map(str,xrange(256)) and flag
      flag = ip.split('.')[3] in map(str,xrange(1,255)) and flag
      flag = protocol in [u'http',u'https'] and flag
      # Valid ports are 1 to 65535 inclusive
      flag = port in map(str,xrange(1,65536)) and flag
      if flag:
         print(u'The HTTP proxy address is well formed')
      else:
         tipUse()
         exit()

   def useProxy(self,proxy):
      '''Access Baidu through the proxy and look for the keyword '''
      protocol = proxy.split('//')[0].replace(':','')
      ip = proxy.split('//')[1]
      opener = urllib2.build_opener(urllib2.ProxyHandler({protocol:ip}))
      urllib2.install_opener(opener)
      try:
         response = urllib2.urlopen(self.url,timeout = self.timeout)
      except Exception:
         print(u'Connection error, exiting')
         exit()
      # Named content rather than str, to avoid shadowing the built-in
      content = response.read()
      if re.search(self.flagWord,content):
         print(u'Keyword found; the proxy works')
      else:
         print(u'The proxy does not work')


if __name__ == '__main__':
   testArgument()

The result of running it:

Detailed explanation of the code:

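(1) sys.argv:

Put plainly, sys.argv is a bridge for getting arguments from outside the program. The word "outside" is the key, which is why explanations that try to describe it purely from the code never made sense to me. Because more than one argument can be passed in from outside, what you get is a list; sys.argv can be treated as a list, which is why [] can be used to pull out its elements. Its first element is the program itself (the script name), and the arguments supplied from outside follow in order.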
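A tiny sketch of that idea, runnable on Python 3: the argument check is pulled into a function that takes the list explicitly, so it can be exercised with a simulated command line. parse_proxy_arg and fake_argv are hypothetical names for this illustration; the script itself reads sys.argv directly.

```python
import sys


def parse_proxy_arg(argv):
    """Return the single expected argument, or None if the count is wrong."""
    # argv[0] is the script name; the real arguments start at index 1.
    if len(argv) != 2:
        return None
    return argv[1]


# Simulated command line: python testUrllib2WithProxy.py http://1.2.3.4:5
fake_argv = ["testUrllib2WithProxy.py", "http://1.2.3.4:5"]

print(parse_proxy_arg(fake_argv))                    # http://1.2.3.4:5
print(parse_proxy_arg(["testUrllib2WithProxy.py"]))  # None
print(type(sys.argv).__name__)                       # list
```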

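(2) re.compile: compile(pattern[, flags]) creates a pattern object from a string that contains a regular expression.

help shows the description of the compile method, which returns a pattern object, but it does not describe the second parameter, flags. flags is the matching mode. Several flags can be combined with bitwise OR '|' so that they all take effect, or they can be specified inside the regular expression string itself. A Pattern object cannot be instantiated directly; it can only be obtained through compile.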
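Both ways of setting flags can be seen in a short sketch (Python 3). PROXY_RE and INLINE_RE are names chosen for this example; the pattern is a tidied variant of the proxy check from the script, not the author's exact expression, and re.VERBOSE is used only so the pattern can carry comments.

```python
import re

PROXY_RE = re.compile(
    r"""
    (https?)://                      # group 1: protocol
    (\d{1,3}(?:\.\d{1,3}){3})        # group 2: dotted IPv4 address
    :(\d{1,5})$                      # group 3: port
    """,
    re.IGNORECASE | re.VERBOSE,      # two flags combined with bitwise OR
)

# The same case-insensitivity, set inline in the pattern string instead.
INLINE_RE = re.compile(r"(?i)https?://")

match = PROXY_RE.search("HTTP://1.2.3.4:5")
print(match.group(1), match.group(2), match.group(3))  # HTTP 1.2.3.4 5
print(bool(INLINE_RE.match("HTTPS://example")))        # True
print(type(PROXY_RE) is type(INLINE_RE))               # True
```

Without re.IGNORECASE the upper-case "HTTP" would not match, which is an easy way to see that the flag is actually in effect.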

(3) re.search(proxyMatch,proxy).group(): in a regular expression, () defines a group, and group() returns the string captured by a group.

import re
a = "123abc456"
# print() with one argument behaves the same on Python 2 and Python 3
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(0))   # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(1))   # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(2))   # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(3))   # 456

1. The three pairs of parentheses in the regular expression split the match into three groups.