Web Scraping: Commonly Used Python Crawler Libraries

1. Commonly Used Libraries

1. requests: used for making HTTP requests, e.g.

requests.get("url")

2. selenium: used for browser automation.

3. lxml

4. beautifulsoup

5. pyquery: a web-page parsing library; said to be easier to use than BeautifulSoup, with syntax very similar to jQuery.

6. pymysql: a storage library, for working with MySQL data.

7. pymongo: for working with the MongoDB database.

8. redis: a non-relational (NoSQL) database.

9. jupyter: an online notebook.

2. What Is Urllib?

Python's built-in HTTP request library:

urllib.request        request module (simulates a browser)

urllib.error          exception-handling module

urllib.parse          URL parsing module (utility functions such as splitting and joining URLs)

urllib.robotparser    robots.txt parsing module
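As a quick, network-free sketch of the splitting and joining utilities in urllib.parse mentioned above (the example URL is arbitrary):

```python
from urllib import parse

# Split a URL into its named components...
parts = parse.urlparse('http://www.baidu.com/index.html?id=5#comment')
print(parts.netloc)   # www.baidu.com
print(parts.query)    # id=5

# ...and join them back into the original string.
print(parse.urlunparse(parts))  # http://www.baidu.com/index.html?id=5#comment
```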

 

Differences between Python 2 and Python 3:

Python 2

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

 

Python 3

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

Usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Examples:

Example 1:

import urllib.request

response=urllib.request.urlopen('http://www.baidu.com')

print(response.read().decode('utf-8'))

 

Example 2:

import urllib.request

import urllib.parse

data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')

response=urllib.request.urlopen('http://httpbin.org/post',data=data)

print(response.read())

Note: passing data sends the request as a POST; omitting it sends a GET.
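The data-implies-POST rule can be checked without any network traffic, since a Request object reports the method urlopen would use (the httpbin.org URLs here are just placeholders; nothing is opened):

```python
from urllib import parse, request

# urlopen infers the HTTP method from whether `data` is supplied.
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
with_data = request.Request('http://httpbin.org/post', data=data)
without_data = request.Request('http://httpbin.org/get')

print(with_data.get_method())     # POST
print(without_data.get_method())  # GET
```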

 

Example 3: a timeout test

import urllib.request

response =urllib.request.urlopen('http://httpbin.org/get',timeout=1)

print(response.read())

----- the first request completes normally

import socket

import urllib.request

import urllib.error

try:

    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)

except urllib.error.URLError as e:

    if isinstance(e.reason,socket.timeout):

        print('TIME OUT')

This second one times out and prints TIME OUT.

 

Response

Response type:

import urllib.request

response=urllib.request.urlopen('https://www.python.org')

print(type(response))

Output: <class 'http.client.HTTPResponse'>

 

     

Status code and response headers:

import urllib.request

response = urllib.request.urlopen('http://www.python.org')

print(response.status)  # 200 when the request succeeds

print(response.getheaders())  # all response headers

print(response.getheader('Server'))  # a single response header

 

3. Request: lets you attach headers to a request

import urllib.request

request=urllib.request.Request('https://python.org')

response=urllib.request.urlopen(request)

print(response.read().decode('utf-8'))

 

 

Example:

from urllib import request,parse

url='http://httpbin.org/post'

headers={

    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',

    'Host':'httpbin.org'

}

dict={

    'name':'Germey'

}

data=bytes(parse.urlencode(dict),encoding='utf8')

req= request.Request(url=url,data=data,headers=headers,method='POST')

response = request.urlopen(req)

print(response.read().decode('utf-8'))
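A request built this way can be inspected before it is ever sent, which is a handy, network-free way to confirm what urllib will transmit (the User-Agent value below is a made-up placeholder):

```python
from urllib import parse, request

# Build the POST request but do not open it; just inspect what is stored.
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = request.Request(
    url='http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'my-crawler/0.1'},
    method='POST',
)

print(req.get_method())              # POST
print(req.get_header('User-agent'))  # my-crawler/0.1
print(req.data)                      # b'name=Germey'
```

Note that Request stores header names via str.capitalize(), which is why the stored key is 'User-agent'.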

 

 

4. Proxies

import urllib.request

proxy_handler =urllib.request.ProxyHandler({

    'http':'http://127.0.0.1:9743',

    'https':'http://127.0.0.1:9743',

})

opener =urllib.request.build_opener(proxy_handler)

response= opener.open('http://httpbin.org/get')

print(response.read())

 

 

5. Cookies

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()

handler=urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

for item in cookie:

    print(item.name+"="+item.value)

 

The first way to save cookies (Mozilla format):

import http.cookiejar,urllib.request

filename = 'cookie.txt'

cookie =http.cookiejar.MozillaCookieJar(filename)

handler= urllib.request.HTTPCookieProcessor(cookie)

opener=urllib.request.build_opener(handler)

response= opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True,ignore_expires=True)

 

The second way to save cookies (LWP format):

import http.cookiejar,urllib.request

filename = 'cookie.txt'

cookie =http.cookiejar.LWPCookieJar(filename)

handler=urllib.request.HTTPCookieProcessor(cookie)

opener=urllib.request.build_opener(handler)

response=opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True,ignore_expires=True)

Reading cookies back (use the class matching the saved format):

import http.cookiejar,urllib.request

cookie=http.cookiejar.LWPCookieJar()

cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)

handler=urllib.request.HTTPCookieProcessor(cookie)

opener=urllib.request.build_opener(handler)

response=opener.open('http://www.baidu.com')

print(response.read().decode('utf-8'))
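The save/load round trip above needs a live site to populate the jar, but the same mechanics can be sketched offline with a hand-built cookie (every field value below is made up for illustration):

```python
import http.cookiejar

# Build a cookie by hand instead of receiving one from a server.
cookie = http.cookiejar.Cookie(
    version=0, name='session', value='abc123', port=None, port_specified=False,
    domain='www.baidu.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

# Save it in LWP format, then load it back with the same class.
jar = http.cookiejar.LWPCookieJar('cookie.txt')
jar.set_cookie(cookie)
jar.save(ignore_discard=True, ignore_expires=True)

loaded = http.cookiejar.LWPCookieJar()
loaded.load('cookie.txt', ignore_discard=True, ignore_expires=True)
print([c.name for c in loaded])  # ['session']
```

MozillaCookieJar works the same way but writes the Netscape cookies.txt format, which is why a file must be loaded with the same class that saved it.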

 

 

6. Exception Handling

Example 1:

from urllib import request,error

try:

    response =request.urlopen('http://cuiqingcai.com/index.htm')

except error.URLError as e:

    print(e.reason)  # catch the URL error

Example 2:

from urllib import request,error

try:

    response =request.urlopen('http://cuiqingcai.com/index.htm')

except error.HTTPError as e:

    print(e.reason,e.code,e.headers,sep='\n')  # catch the more specific HTTPError first

except error.URLError as e:

    print(e.reason)

else:

    print('Request Successfully')
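Example 2 lists HTTPError before URLError because of the class hierarchy; this is easy to verify without any network access:

```python
import urllib.error

# HTTPError is a subclass of URLError, so an `except URLError` clause placed
# first would also swallow HTTP errors; always catch the subclass first.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
```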

 

 

7. URL Parsing

urlparse splits a URL into components:

urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

Example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(type(result),result)

Result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 2 (URL without a scheme; the scheme argument supplies the default):

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')

print(result)

Result: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

  

 

Example 3 (a scheme in the URL itself takes precedence over the scheme argument):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')

print(result)

Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

   

 

Example 4 (allow_fragments=False folds the fragment into the query):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)

print(result)

Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

   

 

Example 5 (with no query present, the fragment folds into the path):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)

print(result)

Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
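Since ParseResult is a named tuple, the fields shown in the examples above can also be read by attribute, and the query string can be decoded with parse_qs:

```python
from urllib.parse import urlparse, parse_qs

# Read the parsed pieces by attribute instead of printing the whole tuple.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme)           # http
print(result.netloc)           # www.baidu.com
print(result.query)            # id=5
print(parse_qs(result.query))  # {'id': ['5']}
```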

   

 

 

8. Joining URLs

urlunparse (the reverse of urlparse)

Example:

from urllib.parse import urlunparse

data=['http','www.baidu.com','index.html','user','a=6','comment']

print(urlunparse(data))

Result: http://www.baidu.com/index.html;user?a=6#comment
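Because urlunparse is the inverse of urlparse, a URL survives a round trip through the two functions:

```python
from urllib.parse import urlparse, urlunparse

# Parse a URL and rebuild it; the result matches the original string.
url = 'http://www.baidu.com/index.html;user?a=6#comment'
parts = urlparse(url)
print(urlunparse(parts) == url)  # True
```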

   

 

urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com','FAQ.html'))

Result: http://www.baidu.com/FAQ.html

Fields present in the second URL override those of the first; the second URL is resolved against the first.
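A few more cases, sketched here to illustrate the "second URL overrides the first" rule (the URLs are arbitrary examples):

```python
from urllib.parse import urljoin

# An absolute second URL wins entirely:
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
# https://cuiqingcai.com/FAQ.html

# A relative path replaces the last path segment of the base:
print(urljoin('http://www.baidu.com/about.html', 'FAQ.html'))
# http://www.baidu.com/FAQ.html

# A bare query string replaces the base's query:
print(urljoin('http://www.baidu.com?wd=abc', '?category=2'))
# http://www.baidu.com?category=2
```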

 

urlencode converts a dict into a query string:

from urllib.parse import urlencode

params={

    'name':'gemey',

    'age':22

}

base_url='http://www.baidu.com?'

url = base_url+urlencode(params)

print(url)

Result: http://www.baidu.com?name=gemey&age=22
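urlencode also percent-encodes values that are not URL-safe; here is a short sketch with a non-ASCII query value (made up for illustration), using parse_qs to reverse the encoding:

```python
from urllib.parse import urlencode, parse_qs, quote, unquote

# Non-ASCII values are UTF-8 percent-encoded automatically.
params = {'wd': '爬蟲', 'page': 1}
query = urlencode(params)
print(query)                   # wd=%E7%88%AC%E8%9F%B2&page=1

# parse_qs decodes the whole query string back into a dict of lists.
print(parse_qs(query))         # {'wd': ['爬蟲'], 'page': ['1']}

# quote/unquote do the same for a single string.
print(unquote(quote('爬蟲')))  # 爬蟲
```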

 

 

Example: a second, condensed walkthrough of urllib, excerpted from the source credited at the end.

urllib ships with Python as part of the standard library; it needs no installation and can be used directly.
It provides the following functionality:

  • making web requests
  • retrieving responses
  • proxy and cookie configuration
  • exception handling
  • URL parsing

Almost every feature a crawler needs can be found in urllib; studying this standard library also builds a deeper understanding of the more convenient requests library later on.

The urllib library

urlopen syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# url: the address to request
# data: extra data, such as form data (supplying it turns the request into a POST)

Usage

# request: GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# request: POST
# HTTP test service: http://httpbin.org/
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

# timeout setting
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get',timeout=1)
print(response.read())

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')

Response

# response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

# status code, response headers
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Request

Declare a Request object, which can carry headers and other information, then open it with urlopen.

# simple example
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

# adding headers up front
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host':'httpbin.org'
}
# build the POST form
dict = {
    'name':'Germey'
}
data = bytes(parse.urlencode(dict),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# or add the header afterwards
from urllib import request, parse
url = 'http://httpbin.org/post'
dict = {
    'name':'Germey'
}
data = bytes(parse.urlencode(dict),encoding='utf8')
req = request.Request(url=url,data=data,method='POST')
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handler: for handling more complex pages

See the official documentation.
Proxies:

import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())

Cookies: stored on the client to record user identity and keep login state.

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)

# save cookies as text
import http.cookiejar, urllib.request
filename = "cookie.txt"
# several formats are available
## format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
## format 2
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# read them back with the class matching the saved format
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")

Exception handling

Catch exceptions so the program keeps running reliably.

# visit a page that does not exist
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# catch the subclass error first
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

# inspect the underlying reason
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')

URL parsing

Mainly a utility module, useful for example to prepare URLs for a crawler.

urlparse: split a URL

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# scheme: protocol type (used as the default when the URL has none)
# allow_fragments: whether to treat the '#' part as a fragment

For example:

from urllib.parse import urlparse
result = urlparse("https://edu.hellobi.com/course/157/play/lesson/2580")
result
## ParseResult(scheme='https', netloc='edu.hellobi.com', path='/course/157/play/lesson/2580', params='', query='', fragment='')

urlunparse: build a URL, the reverse operation of urlparse

from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

urljoin: join two URLs

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))

urlencode: convert a dict into a GET query string

from urllib.parse import urlencode
params = {
    'name':'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Finally, there is also urllib.robotparser, which parses a site's robots.txt to determine which parts may be crawled.
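As a minimal, network-free sketch, RobotFileParser can be fed robots.txt lines directly instead of fetching them (the rules below are invented for illustration):

```python
import urllib.robotparser

# Parse in-memory robots.txt rules, then query whether a URL may be fetched.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /',
])
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/x'))   # False
```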

Author: hoptop. Link: https://www.jianshu.com/p/cfbdacbeac6e. Source: Jianshu (簡書). The copyright belongs to the author; contact the author for authorization before commercial reuse, and credit the source for non-commercial reuse.