Python web crawlers

Date: 2019-11-10


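1. Web crawlers
   1. Definition: also called web spiders or web robots; programs that fetch data from the web
   2. In short: a Python program that imitates a person visiting a website; the closer the imitation, the better
   3. Purpose: collect enough useful data to analyze market trends and support company decisions
2. How companies obtain data
   1. Data the company already owns
   2. Purchased from third-party data platforms
       1. Datatang (數據堂), Guiyang Big Data Exchange (貴陽大數據交易所)
   3. Collected by a crawler
       Used when the data is not on the market or costs too much
3. Why Python for crawlers
   1. Python: rich, mature request and parsing modules
   2. PHP: weaker support for multithreading and asynchronous I/O
   3. Java: verbose, large amount of code
   4. C/C++: fast at run time, but slow to develop
4. Types of crawlers
   1. General-purpose crawlers (used by search engines; must obey the robots protocol)

       1. How a search engine learns the URL of a new site
           1. The site submits itself to the search engine (Baidu Webmaster Platform)
           2. The engine cooperates with DNS providers (e.g. Wanwang) to index new sites quickly
   2. Focused crawlers (fetch only what is needed)
       Crawlers you write yourself: topic-oriented or requirement-oriented
5. Steps for crawling data
   1. Decide which URLs to crawl
   2. Fetch the HTML page over HTTP/HTTPS
   3. Extract the useful data from the HTML page
       1. Save the data you need
       2. For other URLs found in the page, repeat step 2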
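
The crawl steps above can be sketched without touching the network. In this minimal sketch the pages live in a hypothetical in-memory dictionary (`FAKE_PAGES`) and `fetch()` stands in for the real HTTP request; both names, and the example.com URLs, are assumptions made for illustration only.

```python
from collections import deque
import re

# Hypothetical in-memory "site": URL -> HTML. A real crawler would fetch these over HTTP.
FAKE_PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<p>data-a</p> <a href="http://example.com/b">B</a>',
    "http://example.com/b": '<p>data-b</p> <a href="http://example.com/">home</a>',
}

def fetch(url):
    """Stand-in for step 2 (the HTTP request); returns '' for unknown URLs."""
    return FAKE_PAGES.get(url, "")

def crawl(start_url):
    """Breadth-first crawl; returns (visited URLs in order, extracted data)."""
    queue = deque([start_url])          # step 1: URLs still to crawl
    visited, data = [], []
    while queue:
        url = queue.popleft()
        if url in visited:              # never fetch the same page twice
            continue
        visited.append(url)
        html = fetch(url)                                  # step 2: get the page
        data.extend(re.findall(r"<p>(.*?)</p>", html))     # step 3.1: keep the wanted data
        queue.extend(re.findall(r'href="(.*?)"', html))    # step 3.2: queue the other URLs
    return visited, data

visited, data = crawl("http://example.com/")
print(visited)
print(data)
```

The `visited` check is what keeps the loop from running forever when pages link back to each other.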
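6. Chrome extensions
   1. Installation
       1. Top-right menu -> More tools -> Extensions

       2. Turn on Developer mode
       3. Drag the extension file onto the browser window
   2. Extensions used here
       1. Proxy SwitchyOmega: switches between proxies
       2. XPath Helper: tests XPath expressions against a page
       3. JSON View: pretty-prints JSON data
7. Fiddler packet-capture tool
   1. Capture setup
       1. Configure Fiddler

       2. Configure the browser proxy
           Proxy SwitchyOmega -> Options -> New profile -> HTTP 127.0.0.1 8888

   2. Commonly used Fiddler panels
       1. Inspectors: shows the full content of a captured packet
       2. Common tabs
           1. Headers: the headers the client sent to the server, including client information and cookies
           2. WebForms: shows the data of a POST request
           3. Raw: shows the whole request as plain text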

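8. Anaconda and Spyder
   1. Anaconda: an open-source Python distribution
   2. Spyder: an integrated development environment
       Common Spyder shortcuts
           1. Comment/uncomment: Ctrl+1
           2. Save: Ctrl+S
           3. Run: F5
9. Web basics
   1. HTTP and HTTPS
       1. HTTP: default port 80
       2. HTTPS: default port 443; HTTP carried over TLS/SSL
   2. GET and POST
       1. GET: query parameters are visible in the URL
       2. POST: query parameters and submitted data travel in the form data of the request body, not in the URL
       3. URL
           http://    item.jd.com     :80              /2660656.html    #detail
           scheme     domain or IP    port (default)   resource path    anchor (optional)
       4. User-Agent
           Identifies the user's browser, operating system, etc., so that the server can return a suitable HTML page
           Mozilla: Firefox (Gecko engine)
           IE: Trident (its own engine)
           Linux: KHTML (like Gecko)
           Apple: WebKit (like KHTML)
           Google: Chrome (like WebKit)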
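
The URL anatomy shown in this section can be checked with the standard library. A minimal sketch, using the same jd.com URL written without the spaces; the variable name `parts` is arbitrary.

```python
from urllib.parse import urlsplit

# Split the example URL into the components named in the notes
parts = urlsplit("http://item.jd.com:80/2660656.html#detail")
print(parts.scheme)     # protocol
print(parts.hostname)   # domain name or IP address
print(parts.port)       # port (80 is the HTTP default)
print(parts.path)       # resource path
print(parts.fragment)   # anchor
```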
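10. Request modules for crawlers
   1. urllib.request
       1. Versions
           1. Python 2: urllib and urllib2
           2. Python 3: the two were merged into urllib.request
       2. Common functions
           1. urllib.request.urlopen('URL')
               Purpose: sends a request to a site and returns the response
               urlopen() returns a response object; response.read() returns bytes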

import urllib.request
url = 'http://www.baidu.com/'
# Send the request and get the response object
response = urllib.request.urlopen(url)
# read() returns the body as bytes; decode() turns the bytes into a str
html = response.read().decode('utf-8')
print(html)

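             2. urllib.request.Request(url,headers={})
               1. Lets you replace the User-Agent: the first step in the contest between crawlers and anti-crawler measures
               2. Usage
                   1. Build the request object: Request()
                   2. Get the response object: urlopen(request)
                   3. Read the body: response.read().decode('utf-8')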

# -*- coding: utf-8 -*-
import urllib.request
url = 'http://www.baidu.com/'
# The header value must not repeat the header name
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
# 1. Build the request object
request = urllib.request.Request(url,headers=headers)
# 2. Get the response object
response = urllib.request.urlopen(request)
# 3. Read the content of the response
html = response.read().decode('utf-8')
print(html)

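         3. Request object methods
           1. add_header()
               Purpose: adds or replaces a header (such as User-Agent)
           2. get_header('User-agent'): only the U is upper-case
               Purpose: returns the value of a header already set on the request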

import urllib.request

url = 'http://www.baidu.com/'
# Header value only; the header name is passed separately to add_header()
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
request = urllib.request.Request(url)
# Request method add_header()
request.add_header("User-Agent",user_agent)
# Get the response object
response = urllib.request.urlopen(request)
# get_header() reads a request header back.
# Note the spelling 'User-agent': urllib stores the name with only the first letter capitalized
print(request.get_header('User-agent'))
# HTTP status code
print(response.getcode())
# Response headers, returned as a dict-like http.client.HTTPMessage object
print(response.info())
html = response.read().decode('utf-8')
print(html)
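
The header-name capitalization can be verified without sending a request. A minimal sketch: the short User-Agent string is a made-up value, and `header_items()` is used only to show the name as urllib stores it.

```python
import urllib.request

# Building a Request does not open a connection, so this runs offline
req = urllib.request.Request("http://www.baidu.com/")
req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")

# urllib normalizes the name with str.capitalize(): "User-Agent" -> "User-agent"
stored_names = [name for name, _ in req.header_items()]
print(stored_names)

print(req.get_header("User-agent"))   # the stored value
print(req.get_header("User-Agent"))   # None: the lookup is case-sensitive
```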

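        4. Response object methods
           1. read(): reads the body returned by the server
           2. getcode():
               Purpose: returns the HTTP status code
                   200: success
                   4XX: client-side error, such as a missing page or a refused request (the server was reached)
                   5XX: server-side error (the server was reached but failed to handle the request)
           3. info():
               Purpose: returns the response headers sent by the server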
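
To make the status-code classes concrete, here is a small sketch built on the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical, and the helper assumes a registered status code (`HTTPStatus` raises `ValueError` for unknown ones).

```python
from http import HTTPStatus

def describe_status(code):
    """Return a coarse class for an HTTP status code plus its standard phrase."""
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return kind, HTTPStatus(code).phrase

for code in (200, 404, 503):
    print(code, describe_status(code))
```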
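   2. urllib.parse
       1. quote('string'): percent-encodes a string, for example Chinese text
       2. urlencode(dict): encodes a dict as a key=value&key=value query string
       3. unquote('encoded string'): decodes a percent-encoded string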

import urllib.request
import urllib.parse

url = 'http://www.baidu.com/s?wd='
key = input('Enter the search term: ')
# Percent-encode the term and append it to the URL
key = urllib.parse.quote(key)
fullurl = url+key
print(fullurl)#http://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
headers = {'User-Agent':"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"}
request = urllib.request.Request(fullurl,headers = headers)
resp = urllib.request.urlopen(request)
html = resp.read().decode('utf-8')
print(html)

import urllib.request
import urllib.parse

baseurl = "http://www.baidu.com/s?"
headers = {'User-Agent':"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"}
key = input("Enter the search term: ")
# urlencode() takes a dict of parameters
d = {"wd":key}
d = urllib.parse.urlencode(d)
url = baseurl + d
req = urllib.request.Request(url,headers = headers)
resp = urllib.request.urlopen(req)
html = resp.read().decode('utf-8')
print(html)
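
The three urllib.parse functions can also be tried offline. A minimal sketch; the search term matches the encoded example in the listing above, and the Tieba query reuses the parameter names `kw` and `pn` from the exercise that follows.

```python
from urllib.parse import quote, unquote, urlencode

# quote(): percent-encode a single string (UTF-8 by default)
encoded = quote("美女")
print(encoded)

# unquote(): reverse the encoding
decoded = unquote(encoded)
print(decoded)

# urlencode(): build a whole query string from a dict; values are converted with str()
query = urlencode({"kw": "河南大学", "pn": 50})
print("http://tieba.baidu.com/f?" + query)
```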

Exercise: crawl Baidu Tieba

1. Simple version

# -*- coding: utf-8 -*-
"""
Baidu Tieba page scraper
Requirements:
    1. Read the name of the tieba (forum)
    2. Read the first and the last page to fetch
    3. Save each page locally: 第1頁.html, 第2頁.html, ...
    http://tieba.baidu.com/f?kw=%E6%B2%B3%E5%8D%97%E5%A4%A7%E5%AD%A6&ie=utf-8&pn=0
"""
import urllib.request
import urllib.parse

baseurl = "http://tieba.baidu.com/f?"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
title = input("Enter the tieba name: ")
begin_page = int(input("Enter the first page: "))
end_page = int(input("Enter the last page: "))
# Encode the query parameter
kw = {"kw":title}
kw = urllib.parse.urlencode(kw)
# For each page: build the URL, send the request, save the response locally
for page in range(begin_page,end_page+1):
    # Each page holds 50 posts, so page n starts at offset (n-1)*50
    pn = (page-1)*50
    # Build the URL; the offset parameter is named pn
    url = baseurl + kw + "&pn=" + str(pn)
    # Send the request and read the response
    req = urllib.request.Request(url,headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")
    # Save the page to a local file
    filename = "第" + str(page) +"頁.html"
    with open(filename,'w',encoding='utf-8') as f:
        print("Downloading page %d"%page)
        f.write(html)
        print("Page %d saved"%page)

2. Function version

import urllib.request
import urllib.parse

# Send the request, read the response, return the HTML
def getPage(url):
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    req = urllib.request.Request(url,headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")
    return html

# Save the HTML to a local file
def writePage(filename,html):
    with open(filename,'w',encoding="utf-8") as f:
        f.write(html)

# Main routine
def workOn():
    name = input("Enter the tieba name: ")
    begin = int(input("Enter the first page: "))
    end = int(input("Enter the last page: "))
    baseurl = "http://tieba.baidu.com/f?"
    kw = {"kw":name}
    kw = urllib.parse.urlencode(kw)
    for page in range(begin,end+1):
        pn = (page-1) *50
        url = baseurl + kw + "&pn=" + str(pn)
        html = getPage(url)
        filename = "第"+ str(page) + "頁.html"
        writePage(filename,html)
if __name__ == "__main__":
    workOn()

3. Class version

import urllib.request
import urllib.parse

class BaiduSpider:
    def __init__(self):
        self.baseurl = "http://tieba.baidu.com/f?"
        self.headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self,url):
        '''Send the request, read the response, return the HTML'''
        req = urllib.request.Request(url,headers = self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode("utf-8")
        return html

    def writePage(self,filename,html):
        '''Save the HTML to a local file'''
        with open(filename,'w',encoding="utf-8") as f:
            f.write(html)

    def workOn(self):
        '''Main routine'''
        name = input("Enter the tieba name: ")
        begin = int(input("Enter the first page: "))
        end = int(input("Enter the last page: "))
        kw = {"kw":name}
        kw = urllib.parse.urlencode(kw)
        for page in range(begin,end+1):
            pn = (page-1) *50
            url = self.baseurl + kw + "&pn=" + str(pn)
            html = self.getPage(url)
            filename = "第"+ str(page) + "頁.html"
            # writePage is a method, so it must be called through self
            self.writePage(filename,html)

if __name__ == "__main__":
    # Create the spider object
    baiduSpider = BaiduSpider()
    # Call its main method
    baiduSpider.workOn()

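1. Parsing
   1. Kinds of data
       1. Structured data
           Characteristic: has a fixed format: HTML, XML, JSON, etc.
       2. Unstructured data
           Examples: images, audio, video; usually stored as binary
   2. Regular expressions (the re module)
       1. Workflow
           1. Compile a pattern: p = re.compile(r"\d")
           2. Match it against a string: result = p.match('123ABC')
           3. Read the matched text: print(result.group())
       2. Common methods
           1. match(s): matches only at the start of the string; returns a match object, or None if nothing matches
           2. search(s): scans forward and returns the first match as a match object, or None
           3. group(): reads the matched text from the object returned by match() or search()
           4. findall(s): returns all matches as a list
       3. Pattern syntax
           .: any character (except \n, unless re.S is given)
           [...]: any one of the characters inside the brackets
           \d: a digit
           \w: a letter, digit, or underscore
           \s: a whitespace character
           \S: a non-whitespace character

           *: the preceding item 0 or more times
           ?: 0 or 1 time
           +: 1 or more times
           {m}: the preceding item exactly m times

           Greedy matching: matches as much as possible while the whole pattern still succeeds
           Non-greedy matching: matches as little as possible while the whole pattern still succeeds

       4. Example: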

import re
s = """<div><p>仰天大笑出門去，我輩豈是蓬蒿人</p></div>
        <div><p>天生我材必有用，千金散盡還復來</p></div>
     """
# Greedy: .* runs to the last </div>, so findall() returns ONE string
# that spans both <div> elements, including the newline between them
p =re.compile("<div>.*</div>",re.S)
result = p.findall(s)
print(result)
# Non-greedy: .*? stops at the first </div>, so each <div> is matched separately
p1 = re.compile("<div>.*?</div>",re.S)
result1 = p1.findall(s)
print(result1)
#['<div><p>仰天大笑出門去，我輩豈是蓬蒿人</p></div>', '<div><p>天生我材必有用，千金散盡還復來</p></div>']
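
The difference between match(), search(), and findall() listed earlier is easy to see on short strings. A minimal sketch; the sample strings are invented for illustration.

```python
import re

p = re.compile(r"\d+")

at_start = p.match("123ABC")      # match() only tries position 0
not_at_start = p.match("ABC123")  # no digits at position 0, so this is None
first = p.search("ABC123")        # search() scans forward for the first match
all_runs = p.findall("a1b22c333") # findall() returns every match as a list

print(at_start.group())
print(not_at_start)
print(first.group())
print(all_runs)
```

Because match() and search() return None when nothing matches, check the result before calling group() on real data.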

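5. Groups in findall()
Explanation: the pattern is first matched as a whole; findall() then returns only the content of the () groups. With two or more groups, each match is returned as a tuple.

import re
s = 'A B C D'
p1 = re.compile(r"\w+\s+\w+")
print(p1.findall(s))#['A B', 'C D']

# 1. The whole pattern matches 'A B' and 'C D'
# 2. Only the content of the group is returned: ['A', 'C']
p2 = re.compile(r"(\w+)\s+\w+")
print(p2.findall(s))#['A', 'C']
# 1. The whole pattern matches 'A B' and 'C D'
# 2. With two or more groups, each match becomes a tuple of the group contents
p3 = re.compile(r"(\w+)\s+(\w+)")
print(p3.findall(s))
#[('A', 'B'), ('C', 'D')]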

6. Exercise: Maoyan movie board Top 100

# -*- coding: utf-8 -*-
"""
1. Crawl the Maoyan Top 100 movie board
    1. When the program starts, crawl the first page right away
    2. Ask whether to continue (y/n)
        y: crawl the next page
        n: stop and print a goodbye message
    3. Save the content of each page locally
        Page 1: http://maoyan.com/board/4?offset=0
        Page 2: http://maoyan.com/board/4?offset=10
    4. Parse: movie title, cast, release date
"""
import urllib.request
import re
class MaoyanSpider:
    '''Crawls the Maoyan Top 100 movie board'''
    def __init__(self):
        self.baseurl = "http://maoyan.com/board/4?offset="
        self.headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self,url):
        '''Fetch an HTML page'''
        # Build the request object
        req = urllib.request.Request(url,headers= self.headers)
        # Send the request
        res = urllib.request.urlopen(req)
        # Decode the response body (the codec name is "utf-8")
        html = res.read().decode("utf-8")
        return html
    
    def writePage(self,filename,html):
        '''Parse the page and append one line per movie to a local file'''
        content_list = self.match_contents(html)
        # Open the file once per page instead of once per movie
        with open(filename,'a',encoding='utf-8') as f:
            for content_tuple in content_list:
                movie_title = content_tuple[0].strip()
                # Drop the leading "主演：" label (3 characters)
                movie_actors = content_tuple[1].strip()[3:]
                # Drop the leading "上映時間：" label (5 characters) and keep the 10-character date
                releasetime = content_tuple[2].strip()[5:15]
                f.write(movie_title+"|" + movie_actors+"|" + releasetime+'\n')

    def match_contents(self,html):
        '''Match the movie title, the cast, and the release date'''
        # Sample of the HTML that the regular expression is written against:
#        <div class="movie-item-info">
#        <p class="name"><a href="/films/1203" title="霸王別姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王別姬</a></p>
#        <p class="star">
#                主演：張國榮,張豐毅,鞏俐
#        </p>
#        <p class="releasetime">上映時間：1993-01-01(中國香港)</p>    </div>
        regex = r'<div class="movie-item-info">.*?<a.*? title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>.*?</div>'
        p = re.compile(regex,re.S)
        content_list = p.findall(html)
        return content_list

    def workOn(self):
        '''Main routine'''
        # Note: the output directory 貓眼 must exist before this runs
        for page in range(0,10):
            # Build the URL
            url = self.baseurl + str(page*10)
            filename = '貓眼/第' + str(page+1) + "頁.txt"
            print("Crawling page %s"%(page+1))
            html = self.getPage(url)
            self.writePage(filename,html)
            # Keep asking until the answer is y or n
            while True:
                msg = input("Continue crawling? (y/n) ")
                if msg == "y":
                    break
                elif msg == "n":
                    print("Crawling finished, thank you")
                    return None
                else:
                    print("Invalid command")
        print("All pages crawled")


if __name__ == "__main__":
    spider = MaoyanSpider()
    spider.workOn()
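
The regular expression and the slicing used in this exercise can be tested offline against the sample markup quoted in the comment inside match_contents(). This is a sketch that assumes the live page still looks like that sample; the variable names `sample_html` and `rows` are made up.

```python
import re

# Sample markup copied from the comment in match_contents(); assumed to be representative
sample_html = '''
<div class="movie-item-info">
<p class="name"><a href="/films/1203" title="霸王別姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王別姬</a></p>
<p class="star">
        主演：張國榮,張豐毅,鞏俐
</p>
<p class="releasetime">上映時間：1993-01-01(中國香港)</p>    </div>
'''

regex = r'<div class="movie-item-info">.*?<a.*? title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>.*?</div>'

rows = []
for title, star, release in re.findall(regex, sample_html, re.S):
    rows.append((
        title.strip(),
        star.strip()[3:],       # drop the 3-character label "主演："
        release.strip()[5:15],  # drop the 5-character label, keep YYYY-MM-DD
    ))
print(rows)
```

Fixed slice offsets are fragile: if the site changes a label, the slices silently cut the wrong characters, so a check like this is worth rerunning whenever the output looks odd.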


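   3. XPath
   4. BeautifulSoup
2. Request methods and approaches
   1. GET (all query parameters are shown in the URL)
   2. POST
       1. Characteristic: the parameters travel in the form data of the request body
       2. Usage:
           req = urllib.request.Request(url,data=data,headers=headers)
           res = urllib.request.urlopen(req)
           data: the form data must be submitted as bytes, not as a dict
       3. Example: Youdao Translate
           1. Use Fiddler to capture the form data shown in the WebForms tab
           2. Convert the POST data to bytes
           3. Send the request and read the response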

from urllib import request,parse
import json
# 1. Prepare the form data
# Put the form fields into a dict, then encode it
# Note: salt and sign are values captured from one session; the server may reject them later
word = input('Enter the text to translate: ')
data = {"i":word,
        "from":"AUTO",
        "to":"AUTO",
        "smartresult":"dict",
        "client":"fanyideskweb",
        "salt":"1536648367302",
        "sign":"f7f6b53876957660bf69994389fd0014",
        "doctype":"json",
        "version":"2.1",
        "keyfrom":"fanyi.web",
        "action":"FY_BY_REALTIME",
        "typoResult":"false"}
# 2. Convert data to bytes
data = parse.urlencode(data).encode('utf-8')
# 3. Send the request and read the response
# The URL here is the POST URL seen in the packet capture
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
req = request.Request(url,data=data,headers=headers)
res = request.urlopen(req)
result = res.read().decode('utf-8')
print(type(result))#<class 'str'>
print(result)# result is a JSON-formatted string
'''{"type":"ZH_CN2EN",
    "errorCode":0,
    "elapsedTime":1,
    "translateResult":[
                        [{"src":"你好",
                        "tgt":"hello"
                        }]
                        ]
}'''
# Convert the JSON-formatted string into a Python dict
dic = json.loads(result)
print(dic["translateResult"][0][0]["tgt"])

         4. The json module
           json.loads('JSON-formatted string')
               Purpose: converts a JSON-formatted string into Python objects (a JSON object becomes a dict)
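
Both halves of the POST workflow, encoding the form to bytes and decoding the JSON reply, can be tried without contacting the server. A minimal sketch: the two-field `form` dict is a cut-down stand-in for the full form, and `sample` reproduces the response shape quoted in the listing above.

```python
import json
from urllib.parse import urlencode

# Step 1: dict -> query string -> bytes, the type urllib expects for POST data
form = {"i": "你好", "doctype": "json"}
body = urlencode(form).encode("utf-8")
print(type(body), body)

# Step 2: JSON text -> Python objects
sample = '{"type":"ZH_CN2EN","errorCode":0,"elapsedTime":1,"translateResult":[[{"src":"你好","tgt":"hello"}]]}'
dic = json.loads(sample)
translation = dic["translateResult"][0][0]["tgt"]
print(translation)
```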
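   3. Simulating a login with cookies
       1. Cookie and Session
           cookie: identifies the user through information stored on the client
           session: identifies the user through information stored on the server
       2. Example: logging in to Renren with a cookie
           1. Capture the cookie of a logged-in session (log in once while capturing packets)
           2. Send the request with that cookie and read the response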

from urllib import request
url = "http://www.renren.com/967982493/profile"
headers = {
        'Host': 'www.renren.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        #Accept-Encoding: gzip, deflate
        'Referer': 'http://www.renren.com/SysHome.do',
        'Cookie': 'anonymid=jlxfkyrx-jh2vcz; depovince=SC; _r01_=1; jebe_key=6aac48eb-05fb-4569-8b0d-5d71a4a7a3e4%7C911ac4448a97a17c4d3447cbdae800e4%7C1536714317279%7C1%7C1536714319337; jebecookies=a70e405c-c17a-4877-8164-00823b5e092c|||||; JSESSIONID=abcq8TskVWDMEgvjGslxw; ick_login=d1b4c959-7554-421e-8a7f-b97edd577b3a; ick=c6c7cac9-d9ac-49e5-9e74-9ac481136db1; XNESSESSIONID=e94666d4bdb8; wp_fold=0; BAIDU_SSP_lcr=https://www.baidu.com/link?url=n0NWyopmrKuQ6xUulfbYUud3nr02sIODSKI8sfzvS2G&wd=&eqid=e7cd8eed0003aeaa000000055b9864da; _de=5EE7F4A4EC35EE3510B8477EDD9F1F27; p=dc67b283c53b57a3c9f20e04cb9ca2d43; first_login_flag=1; ln_uact=13333759329; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; t=cb96dfe9e344a2d817027a2c8f7f0c4c3; societyguester=cb96dfe9e344a2d817027a2c8f7f0c4c3; id=967982493; xnsid=34a50049; loginfrom=syshome',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
}
req = request.Request(url,headers = headers)
res = request.urlopen(req)
html = res.read().decode('utf-8')
print(html)

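3. The requests module
   1. Installation (Conda prompt)
       1. (base) -> conda install requests
   2. Common methods
       1. get(): sends a request to a site and returns a response object
           1. Usage: response = requests.get(url,headers = headers)
           2. Attributes of response
               1. response.text: the body as a string
                   Note: the text is often decoded as ISO-8859-1 by default; set the encoding by hand with response.encoding='utf-8'
               2. response.content: the body as bytes
                   1. Use case: images, audio, and other unstructured data
                   2. Example: downloading an image
               3. response.status_code: the HTTP status code returned by the server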

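import requests

url = "http://www.baidu.com/"
headers = {"User-Agent":"Mozilla/5.0"}
# Send the request; headers must be passed by keyword,
# because the second positional argument of get() is params
response = requests.get(url,headers=headers)
# Set the encoding used to decode response.text
response.encoding = 'utf-8'
# text returns the body as a string
print(response.text)
# content returns the body as bytes
print(response.content)
print(response.status_code)#200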

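            3. get() with query parameters: params (a dict)
               1. Without query parameters
                   res = requests.get(url,headers=headers)
               2. With query parameters
                   params= {"wd":"python"}
                   res = requests.get(url,params=params,headers=headers)
       2. post(): the form data goes in the data parameter
           1. data={} # data is a dict; it does not need to be converted to bytes
           2. Example: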

import requests
import json
# 1. Prepare the form data
word = input('Enter the text to translate: ')
data = {"i":word,
        "from":"AUTO",
        "to":"AUTO",
        "smartresult":"dict",
        "client":"fanyideskweb",
        "salt":"1536648367302",
        "sign":"f7f6b53876957660bf69994389fd0014",
        "doctype":"json",
        "version":"2.1",
        "keyfrom":"fanyi.web",
        "action":"FY_BY_REALTIME",
        "typoResult":"false"}

# The URL here is the POST URL seen in the packet capture
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
response = requests.post(url,data=data,headers=headers)
response.encoding = 'utf-8'
result = response.text
print(type(result))#<class 'str'>
print(result)# result is a JSON-formatted string
'''{"type":"ZH_CN2EN",
    "errorCode":0,
    "elapsedTime":1,
    "translateResult":[
                        [{"src":"你好",
                        "tgt":"hello"
                        }]
                        ]
}'''
# Convert the JSON-formatted string into a Python dict
dic = json.loads(result)
print(dic["translateResult"][0][0]["tgt"])

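        3. Proxies: the proxies parameter
           1. The second step in the contest between crawlers and anti-crawler measures
               Sites that list proxy IPs
                   1. Xici Proxy (西刺代理)
                   2. Kuaidaili (快代理)
                   3. Quanguo Proxy (全國代理)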

2. Plain proxy: proxies={"scheme":"scheme://IP:port"}
proxies = {'http':"http://123.161.237.114:45327"}

import requests

url = "http://www.taobao.com"
# The key is the lower-case scheme of the target URL ('http' or 'https')
proxies = {"http":"http://123.161.237.114:45327"}
headers = {"User-Agent":"Mozilla/5.0"}
response = requests.get(url,proxies=proxies,headers=headers)
response.encoding = 'utf-8'
print(response.text)

3. Authenticated proxy: proxies={"scheme":"http://username:password@IP:port"}
proxies={'http':'http://309435365:szayclhp@114.67.228.126:16819'}

import requests

url = "http://www.taobao.com/"
# The key is the lower-case scheme of the target URL ('http' or 'https')
proxies = {'http':'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent":"Mozilla/5.0"}
response = requests.get(url,proxies=proxies,headers=headers)
response.encoding = 'utf-8'
print(response.text)

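4. Example: crawling Lianjia second-hand housing listings

      1. Storing into a MySQL database

import pymysql
# Keyword arguments are used because recent PyMySQL releases do not accept positional ones
db = pymysql.connect(host="localhost",user="root",password="123456",charset='utf8')
cursor = db.cursor()
cursor.execute("create database if not exists testspider;")
cursor.execute("use testspider;")
cursor.execute("create table if not exists t1(id int);")
cursor.execute("insert into t1 values(100);")
db.commit()
cursor.close()
db.close()

      2. Storing into a MongoDB database

import pymongo
# Connect to MongoDB
conn = pymongo.MongoClient('localhost',27017)
# Get the database object (the database is created on first write)
db = conn.testpymongo
# Get the collection object (the collection is created on first write)
myset = db.t1
# Insert one document; insert_one() replaces insert(), which PyMongo 4 removed
myset.insert_one({"name":"Tom"})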

"""
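Crawl Lianjia second-hand housing listings (through an authenticated proxy)
Goal: extract the residential-complex name and the total price
Steps:
    1. Build the URLs
        https://cd.lianjia.com/ershoufang/pg1/
        https://cd.lianjia.com/ershoufang/pg2/
    2. Match with a regular expression
    3. Write to a local file
"""
import requests
import re
import multiprocessing as mp
BASE_URL = "https://cd.lianjia.com/ershoufang/pg"
# Lower-case keys; the target URL is https, so an 'https' entry is needed as well
PROXY = 'http://309435365:szayclhp@114.67.228.126:16819'
proxies = {'http':PROXY,'https':PROXY}
headers = {"User-Agent":"Mozilla/5.0"}
regex = r'<div class="houseInfo">.*?<a.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>'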

def getText(BASE_URL,proxies,headers,page):
    url = BASE_URL+str(page)
    res = requests.get(url,proxies=proxies,headers=headers)
    res.encoding = 'utf-8'
    html = res.text
    return html

def saveFile(page,regex=regex):
    html = getText(BASE_URL,proxies,headers,page)
    p = re.compile(regex,re.S)
    content_list = p.findall(html)
    for content_tuple in content_list:
        cell = content_tuple[0].strip()
        price = content_tuple[1].strip()
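        # An explicit encoding keeps Chinese text intact where the platform default is not UTF-8
        with open('鏈家.txt','a',encoding='utf-8') as f:
            f.write(cell+"  "+price+"\n")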
                
if __name__ == "__main__":
    pool = mp.Pool(processes = 10)
    pool.map(saveFile,[page for page in range(1,101)])

(Listing above: Lianjia second-hand listings, saved to a text file)

import requests
import re
import multiprocessing as mp
import pymysql
import warnings
BASE_URL = "https://cd.lianjia.com/ershoufang/pg"
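# Lower-case keys; the target URL is https, so an 'https' entry is needed as well
proxies = {'http':'http://309435365:szayclhp@114.67.228.126:16819','https':'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent":"Mozilla/5.0"}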
regex = '<div class="houseInfo">.*?<a.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>'                                      
c_db = "create database if not exists spider;"
u_db = "use spider;"
c_tab = "create table if not exists lianjia(id int primary key auto_increment,\
        name varchar(30),\
        price decimal(20,2))charset=utf8;"
db = pymysql.connect(host="localhost",user="root",password='123456',charset="utf8")
cursor = db.cursor()
warnings.filterwarnings("error")
try:
    cursor.execute(c_db)
except Warning:
    pass
cursor.execute(u_db)
try:
   cursor.execute(c_tab)
except Warning:
    pass
def getText(BASE_URL,proxies,headers,page):
    url = BASE_URL+str(page)
    res = requests.get(url,proxies=proxies,headers=headers)
    res.encoding = 'utf-8'
    html = res.text
    return html
def writeToMySQL(page,regex=regex):
    html = getText(BASE_URL,proxies,headers,page)
    p = re.compile(regex,re.S)
    content_list = p.findall(html)
    for content_tuple in content_list:
        cell = content_tuple[0].strip()
        price = float(content_tuple[1].strip())*10000
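        # Pass the values as parameters so the driver escapes them,
        # instead of formatting them into the SQL string
        s_insert = "insert into lianjia(name,price) values(%s,%s);"
        cursor.execute(s_insert,(cell,price))
        db.commit()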
if __name__ == "__main__":
    pool = mp.Pool(processes = 20)
    pool.map(writeToMySQL,[page for page in range(1,101)])

(Listing above: Lianjia second-hand listings, stored in MySQL)

import requests
import re
import multiprocessing as mp
import pymongo
BASE_URL = "https://cd.lianjia.com/ershoufang/pg"
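# Lower-case keys; the target URL is https, so an 'https' entry is needed as well
proxies = {'http':'http://309435365:szayclhp@114.67.228.126:16819','https':'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent":"Mozilla/5.0"}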
regex = '<div class="houseInfo">.*?<a.*?data-el="region">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>'                                      
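# Connect to MongoDB
conn = pymongo.MongoClient('localhost',27017)
# Get the database object (the database is created on first write)
db = conn.spider
# Get the collection object (the collection is created on first write)
myset = db.lianjia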

def getText(BASE_URL,proxies,headers,page):
    url = BASE_URL+str(page)
    res = requests.get(url,proxies=proxies,headers=headers)
    res.encoding = 'utf-8'
    html = res.text
    return html
def writeToMongoDB(page,regex=regex):
    html = getText(BASE_URL,proxies,headers,page)
    p = re.compile(regex,re.S)
    content_list = p.findall(html)
    for content_tuple in content_list:
        cell = content_tuple[0].strip()
        price = float(content_tuple[1].strip())*10000
        d = {"houseName":cell,"housePrice":price} 
        # Insert one document; insert_one() replaces insert(), which PyMongo 4 removed
        myset.insert_one(d)
if __name__ == "__main__":
    pool = mp.Pool(processes = 20)
    pool.map(writeToMongoDB,[page for page in range(1,101)])

(Listing above: Lianjia second-hand listings, stored in MongoDB)

4. Web client authentication (some sites require a login before they can be accessed): auth
1. auth = ("username","password"), a tuple

import requests
import re
regex = r'<a.*?>(.*?)</a>'
class NoteSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0"}
        # auth is a (username, password) tuple
        self.auth = ("tarenacode","code_2013")
        self.url = "http://code.tarena.com.cn/"
    
    def getParsePage(self):
        res = requests.get(self.url,auth=self.auth, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        p = re.compile(regex,re.S)
        r_list = p.findall(html)
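        # Hand the matches to writePage()
        self.writePage(r_list)
    def writePage(self,r_list):
        print("Writing started")
        # Open the file once and write one match per line
        with open('筆記.txt','a',encoding='utf-8') as f:
            for r_str in r_list:
                f.write(r_str + "\n")
        print("Writing finished")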
            
if __name__=="__main__":
    obj = NoteSpider()
    obj.getParsePage()

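        5. SSL certificate verification: the verify parameter
           1. verify=True: the default; verifies the server's SSL certificate
           2. verify=False: skips certificate verification

import requests
# verify only has an effect on https URLs
url = "https://www.12306.cn/mormhweb/"
headers = {"User-Agent":"Mozilla/5.0"}
res = requests.get(url,verify=False,headers=headers)
res.encoding = "utf-8"
print(res.text)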

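4. Handler processors (urllib.request; for background only)
   1. Definition
       An opener is a customizable counterpart of urlopen(); urlopen() itself uses a default opener
   2. Common functions
       1. build_opener(handler object)
       2. opener.open(url): equivalent to calling urlopen()
   3. Workflow
       1. Create the handler object
           http_handler = urllib.request.HTTPHandler()
       2. Create a custom opener
           opener = urllib.request.build_opener(http_handler)
       3. Send the request with the opener's open() method
   4. Kinds of handlers
       1. HTTPHandler()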

import urllib.request
url = "http://www.baidu.com/"
# 1. Create the HTTPHandler object
http_handler = urllib.request.HTTPHandler()
# 2. Create a custom opener
opener = urllib.request.build_opener(http_handler)
# 3. Send the request with the opener's open() method
req = urllib.request.Request(url)
res = opener.open(req)
print(res.read().decode("utf-8"))

       2. ProxyHandler(proxy IP): plain proxy

import urllib.request

url = "http://www.baidu.com"

# 1. Create the handler; the key is the lower-case scheme name
proxy_handler = urllib.request.ProxyHandler({"http":"123.161.237.114:45327"})
# 2. Create a custom opener
opener = urllib.request.build_opener(proxy_handler)
# 3. Send the request with the opener's open() method
req = urllib.request.Request(url)
res = opener.open(req)
print(res.read().decode("utf-8"))

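3. ProxyBasicAuthHandler(password manager object): authenticated proxy

            1. Using the password manager
               1. Create the password manager object
                   pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm()
               2. Add the proxy's user name, password, IP address, and port
                   pwd.add_password(None,"IP:port","username","password")
           2. urllib.request.ProxyBasicAuthHandler(password manager object)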
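
The pieces of the authenticated-proxy workflow can be wired together and inspected without sending any request. A sketch under stated assumptions: the proxy address `127.0.0.1:8888` and the credentials `user` / `secret` are placeholders, not values from the notes.

```python
import urllib.request

PROXY = "127.0.0.1:8888"   # hypothetical proxy address

# 1. Password manager holding the proxy credentials (realm None = default realm)
pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwd.add_password(None, PROXY, "user", "secret")

# 2. Handlers: one routes http traffic through the proxy, one answers its auth challenge
proxy_handler = urllib.request.ProxyHandler({"http": "http://" + PROXY})
auth_handler = urllib.request.ProxyBasicAuthHandler(pwd)

# 3. Custom opener built from both handlers; opener.open(url) would send the request
opener = urllib.request.build_opener(proxy_handler, auth_handler)

handler_types = [type(h).__name__ for h in opener.handlers]
print(handler_types)
print(pwd.find_user_password(None, PROXY))
```

Nothing is sent until `opener.open()` is called, so the sketch only shows how the objects fit together.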

1. Using the csv module
   1. Open the CSV file:
       with open('test.csv','a',newline='',encoding='utf-8') as f:
           pass
   2. Create the writer object with the writer() function:
       writer = csv.writer(f)
   3. Write a row with the writerow() method:
       writer.writerow(["霸王別姬",1993])
   4. Example:

import csv
# Open the CSV file; without newline='' a blank line can appear between rows
with open("test.csv",'a',newline='') as f:
    # Create the writer object
    writer = csv.writer(f)
    # Write the rows
    writer.writerow(['id','name','age'])
    writer.writerow([1,'Lucy',20])
    writer.writerow([2,'Tom',25])
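
What csv.writer actually emits is easier to see when the output goes to an in-memory buffer instead of a file. A minimal sketch; `io.StringIO` is a stand-in for the file object, and the third row is taken from the Maoyan exercise.

```python
import csv
import io

buf = io.StringIO(newline="")   # same role as open(..., newline='')
writer = csv.writer(buf)
writer.writerow(["id", "name", "age"])
writer.writerow([1, "Lucy", 20])
first_two = buf.getvalue()
print(repr(first_two))          # rows end with \r\n by default

# A field that contains the delimiter is quoted automatically
writer.writerow(["霸王別姬", "張國榮,張豐毅,鞏俐", "1993-01-01"])

# Reading the buffer back recovers the fields; every value comes back as a string
buf.seek(0)
rows = list(csv.reader(buf))
print(rows)
```

The automatic quoting is why the cast list with commas in it survives a round trip as a single field.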

import csv
with open("貓眼/第一頁.csv",'w',newline="") as f:
    writer = csv.writer(f)
    writer.writerow(['電影名','主演','上映時間'])
    '''
    Reading with encoding="utf-8" can give ['\ufeff霸王別姬', '張國榮,張豐毅,鞏俐', '1993-01-01'],
    while encoding="utf-8-sig" gives ['霸王別姬', '張國榮,張豐毅,鞏俐', '1993-01-01'].
    The difference:
    UTF-8 uses single bytes as code units, so its byte order is the same on every system
    and it does not actually need a BOM (byte order mark).
    A "UTF-8 with BOM" file starts with one anyway; the utf-8-sig codec strips it on read.
    '''
    with open("貓眼/第1頁.txt",'r',encoding="utf-8-sig") as file:
        for line in file:
            line = line.strip()
            # Skip blank lines so that no empty row is written
            if not line:
                continue
            data_list = line.split("|")
            print(data_list)
            writer.writerow(data_list)

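2. XPath (parsing HTML)
   1. XPath
       A language for locating information in XML documents; it works for HTML documents as well
   2. XPath helper tools
       1. Chrome extension: XPath Helper
           Open/close: Ctrl + Shift + X
       2. Firefox extension: XPath Checker
       3. XPath expression editor: XML Quire
   3. XPath matching rules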

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>

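        1. Matching examples
           1. Select the bookstore root node: /bookstore
           2. Select all book nodes: //book
           3. Select the title nodes under book whose lang attribute is 'en': //book/title[@lang='en']
       2. Selecting nodes
           /: selects from the root node, e.g. /bookstore; between two steps it means a direct child of the node before the "/"
           //: searches the whole document, e.g. //price; between two steps it means any descendant of the node before the "//"
           @: selects by attribute, e.g. //title[@lang="en"]
       3. Using @
           1. Select the nodes with a given attribute value: //title[@lang='en']
           2. Select every node that has the attribute: //title[@lang]
           3. Select the attribute values themselves: //title/@lang
       4. Matching several paths
           1. Operator: |
           2. Example:
               Get the title and price nodes under every book node
               //book/title|//book/price
       5. Functions
           contains(): matches nodes whose attribute value contains a given substring
               //title[contains(@lang,'e')]

       6. A matched element can itself call xpath() to keep searching below it

           Syntax: element.xpath("./div/span")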

"""
Qiushibaike https://www.qiushibaike.com/8hr/page/1/
Fields to extract
    1. User nickname, div/div/a/h2.text
    2. Content, div/a/div/span.text
    3. Number of likes, div/div/span/i.text
    4. Number of comments, div/div/span/a/i.text
"""
import requests
from lxml import etree
url = "https://www.qiushibaike.com/8hr/page/1/"
headers = {'User-Agent':"Mozilla/5.0"}
res = requests.get(url,headers=headers)
res.encoding = "utf-8"
html = res.text

# First get the list of <div> elements, one per post
parseHtml = etree.HTML(html)
div_list = parseHtml.xpath("//div[contains(@id,'qiushi_tag_')]")
print(len(div_list))
# Walk the list; each xpath() call below is relative to the current div
for div in div_list:
    # User nickname
    username = div.xpath('./div/a/h2')[0].text
    print(username)
    # Content
    content = div.xpath('.//div[@class="content"]/span')[0].text
    print(content)
    # Number of likes
    laughNum = div.xpath('./div/span/i')[0].text
    print(laughNum)
    # Number of comments
    pingNum = div.xpath('./div/span/a/i')[0].text
    print(pingNum)
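
The XPath rules from this section can be tried on the bookstore document without installing anything. Note the swap: this sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, not lxml. Attribute selection (`/@lang`), the `|` operator, and `contains()` are not available in it, so attribute values are read with `.get()` instead.

```python
import xml.etree.ElementTree as ET

xml_text = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chs">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(xml_text)   # root is the <bookstore> element

# Counterpart of //book: every book element below the root
books = root.findall(".//book")

# Counterpart of //book/title[@lang='en']
en_titles = [t.text for t in root.findall(".//book/title[@lang='en']")]

# Counterpart of //title/@lang: select the nodes, then read the attribute
langs = [t.get("lang") for t in root.findall(".//title[@lang]")]

# Relative search from each matched element, like element.xpath("./price") in lxml
prices = [book.find("./price").text for book in books]

print(len(books), en_titles, langs, prices)
```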

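3. Parsing HTML source
   1. The lxml library: an HTML/XML parsing library
       1. Installation
           conda install lxml
           pip install lxml
   2. Workflow
       1. Build a parse object with the etree module of lxml
       2. Call xpath() on the parse object to locate nodes
   3. Usage
       1. Import the module: from lxml import etree
       2. Create the parse object: parseHtml = etree.HTML(html)
       3. Parse with xpath(): r_list = parseHtml.xpath("//title[@lang='en']")
       Note: a path expression passed to xpath() always returns a list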

from lxml import etree
html = """<div class="wrapper">
    <i class="iconfont icon-back" id="back"></i>
    <a href="/" id="channel">新浪社會</a>
    <ul id="nav">
        <li><a href="http://domestic.firefox.sina.com/" title="國內">國內</a></li>
        <li><a href="http://world.firefox.sina.com/" title="國際">國際</a></li>
        <li><a href="http://mil.firefox.sina.com/" title="軍事">軍事</a></li>
        <li><a href="http://photo.firefox.sina.com/" title="圖片">圖片</a></li>
        <li><a href="http://society.firefox.sina.com/" title="社會">社會</a></li>
        <li><a href="http://ent.firefox.sina.com/" title="娛樂">娛樂</a></li>
        <li><a href="http://tech.firefox.sina.com/" title="科技">科技</a></li>
        <li><a href="http://sports.firefox.sina.com/" title="體育">體育</a></li>
        <li><a href="http://finance.firefox.sina.com/" title="財經">財經</a></li>
        <li><a href="http://auto.firefox.sina.com/" title="汽車">汽車</a></li>
    </ul>
    <i class="iconfont icon-liebiao" id="menu"></i>
</div>"""

# 1. Create the parse object
parseHtml = etree.HTML(html)
# 2. Call xpath() on the parse object
# href values of all <a> tags
s1 = "//a/@href"
# Only the single "/" href
s2 = "//a[@id='channel']/@href"
# href values of the <a> tags inside the list
s3 = "//li/a/@href"
s3 = "//ul[@id='nav']/li/a/@href"# more precise
# Text of all <a> tags: first select the elements, then loop over the list and read each element's .text
s4 = "//a"
# Selects the 新浪社會 link
s5 = "//a[@id='channel']"
# Selects 國內, 國際, ...
s6 = "//ul[@id='nav']//a"
r_list = parseHtml.xpath(s6)
print(r_list)

for i in r_list:
    print(i.text)

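   4. Case study: scraping images from Baidu Tieba posts
       1. Goal: download the images in a tieba's posts
       2. Approach
           1. Get the URL of the tieba's main page (河南大學 here) and work out the pattern of the next-page URL
           2. Get the URL of every post in the 河南大學 tieba
           3. Send a request for each post and collect the URLs of all images in it
           4. Send a request for each image URL and write the response to a local file in 'wb' mode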
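
The next-page pattern from step 1 can be sketched with the standard library alone. The helper name build_page_url and the constant POSTS_PER_PAGE are hypothetical; the base URL and the step of 50 are taken from the pn=0 / pn=50 example URLs in this section.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f"
POSTS_PER_PAGE = 50  # assumption: inferred from the pn=0, pn=50 pattern


def build_page_url(kw, page):
    """Return the list-page URL for a 1-based page number (hypothetical helper)."""
    params = {"kw": kw, "pn": (page - 1) * POSTS_PER_PAGE}
    # urlencode percent-encodes non-ASCII keywords such as 河南大學
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for page in range(1, 4):
        print(build_page_url("河南大學", page))
```

requests does the same encoding internally when the dictionary is passed as params=, which is what the spider in this section relies on.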

"""
Steps
    1. Get the URL of the tieba's main page
        http://tieba.baidu.com/f?kw=河南大學&pn=0
        http://tieba.baidu.com/f?kw=河南大學&pn=50
    2. Get the URL of every post: //div[@class='t_con cleafix']/div/div/div/a/@href
        https://tieba.baidu.com/p/5878699216
    3. Open each post and find the image URLs: //img[@class='BDE_Image']/@src
        http://imgsrc.baidu.com/forum/w%3D580/sign=da37aaca6fd9f2d3201124e799ed8a53/27985266d01609240adb3730d90735fae7cd3480.jpg
    4. Save to a local file
    
"""
import requests
from lxml import etree
class TiebaPicture:
    def __init__(self):
        self.baseurl = "http://tieba.baidu.com"
        self.pageurl = "http://tieba.baidu.com/f"
        self.headers = {'User-Agent':"Mozilla5.0/"}
      
    
    def getPageUrl(self,url,params):
        '''Get the URL of every post on a list page'''
        res = requests.get(url,params=params,headers = self.headers)
        res.encoding = 'utf-8'
        html = res.text

        # Extract the URL of every post from the HTML page
        parseHtml = etree.HTML(html)
        t_list = parseHtml.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
        print(t_list)
        for t in t_list:
            t_url = self.baseurl + t
            self.getImgUrl(t_url)

    def getImgUrl(self,t_url):
        '''Get the URLs of all images in a post'''
        res = requests.get(t_url,headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        parseHtml = etree.HTML(html)
        img_url_list = parseHtml.xpath("//img[@class='BDE_Image']/@src")
        for img_url in img_url_list:
            self.writeImg(img_url)

    def writeImg(self,img_url):
        '''Save an image to a file'''
        res = requests.get(img_url,headers=self.headers)
        html = res.content
        # Save locally, using the last 10 characters of the image URL as the filename
        filename = img_url[-10:]
        with open(filename,'wb') as f:
            print("Downloading %s"%filename)
            f.write(html)
            print("Finished %s"%filename)

    def workOn(self):
        '''Main function'''
        kw = input("Enter the name of the tieba to crawl: ")
        begin = int(input("Enter the first page: "))
        end = int(input("Enter the last page: "))
        for page in range(begin,end+1):
            pn = (page-1)*50
            # Build the query parameters for this page of the tieba
            params = {"kw":kw,"pn":pn}
            self.getPageUrl(self.pageurl,params=params)
            
if __name__ == "__main__":
    spider = TiebaPicture()
    spider.workOn()

Scraping images from Baidu Tieba

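1. Scraping dynamic websites - Ajax
   1. Ajax dynamic loading
       1. Characteristic: content loads dynamically (as the mouse wheel scrolls)
       2. Packet capture tool: the query parameters are under WebForms -> QueryString
   2. Case study: the Douban movie top 100 chart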

import requests
import json
import csv
url = "https://movie.douban.com/j/chart/top_list"
headers = {'User-Agent':"Mozilla5.0/"}

params = {"type":"11",
          "interval_id":"100:90",
          "action":"",
          "start":"0",
          "limit":"100"}
res = requests.get(url,params=params,headers=headers)
res.encoding="utf-8"
# The response body is a JSON array []
html = res.text
# Convert the JSON array into a Python list
ls = json.loads(html)

with open("豆瓣100.csv",'a',newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name","score"])
    for dic in ls:
        name = dic['title']
        score = dic['rating'][1]
        writer.writerow([name,score])

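2. The json module
   1. Purpose: convert between JSON-formatted data and Python data types
   2. Common methods
       1. json.loads(): JSON format --> Python data type
               json      python
               object    dict
               array     list
       2. json.dumps(): Python data type --> JSON format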
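
A short round trip makes the two directions concrete. The sample data below is made up for illustration and is not taken from any real response.

```python
import json

# Made-up sample data shaped like a JSON array of objects
json_text = '[{"title": "Movie A", "score": "9.6"}, {"title": "Movie B", "score": "9.5"}]'

# json.loads(): JSON array -> Python list, JSON object -> Python dict
movies = json.loads(json_text)
print(type(movies).__name__, type(movies[0]).__name__)

# json.dumps(): Python object -> JSON string
# ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX escapes
dumped = json.dumps({"name": "豆瓣"}, ensure_ascii=False)
print(dumped)
```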
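3. selenium + phantomjs: a powerful web crawling combination
   1. selenium
       1. Definition: a web automation testing tool, used for automated testing of web applications
       2. Characteristics:
           1. Runs on top of a browser and drives it by command, so the browser loads pages automatically
           2. It is only a tool with no browser functionality of its own; it must be paired with a third-party browser
       3. Installation
           conda install selenium
           pip install selenium
   2. phantomjs
       1. Windows
           1. Definition: a browser with no interface (headless browser)
           2. Characteristics:
               1. Loads the site into memory and renders the page there
               2. Runs efficiently
           3. Installation
               1. Copy the installation package into the Script... directory under the Python installation path
       2. Ubuntu
           1. Download the phantomjs package and put it in a directory
           2. In the user's home directory: vi .bashrc
               export PHANTOM_JS=/home/.../phantomjs-...
               export PATH=$PHANTOM_JS/bin:$PATH
           3. source .bashrc
           4. In a terminal: phantomjs
   3. Example code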

# Import webdriver from the selenium library
from selenium import webdriver
# Create the object that launches phantomjs
driver = webdriver.PhantomJS()
# Visit Baidu
driver.get("http://www.baidu.com/")
# Save a screenshot of the page
driver.save_screenshot("百度.png")

   4. Common methods
       1. driver.get(url)
       2. driver.page_source.find("content"):
           Purpose: search the HTML source for a string; returns the index of the first match on success and -1 on failure

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://www.baidu.com/")
r1 = driver.page_source.find("kw")
r2 = driver.page_source.find("aaaa")
print(r1,r2)#1053 -1

       3. driver.find_element_by_id("id value").text
       4. driver.find_element_by_name("attribute value")
       5. driver.find_element_by_class_name("attribute value")
       6. element.send_keys("content")
       7. element.click()
       8. driver.quit()
   5. Case study: logging in to the Douban website
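4. BeautifulSoup
   1. Definition: an HTML/XML parsing library; the 'lxml' and 'xml' parsers below require the lxml library
   2. Installation and import
       Installation:
           pip install beautifulsoup4
           conda install beautifulsoup4
       Import: from bs4 import BeautifulSoup as bs
   3. Example
   4. Parsers supported by BeautifulSoup
       1. lxml HTML parser, 'lxml': fast, tolerant of malformed documents
       2. Python standard library, 'html.parser': moderate speed
       3. lxml XML parser, 'xml': fast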

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time

driver = webdriver.PhantomJS()
driver.get("https://www.douyu.com/directory/all")
while True:
    html = driver.page_source
    # Create the parse object
    soup = bs(html,'lxml')
    # Call find_all directly to look up elements
    # Element objects for every streamer
    names = soup.find_all("span",{"class":"dy-name ellipsis fl"})
    numbers = soup.find_all("span",{"class":"dy-num fr"})
    # name and number are element objects with a get_text() method
    for name , number in zip(names,numbers):
        print("Viewers:",number.get_text(),"Streamer:",name.get_text())
    if html.find("shark-pager-disable-next") ==-1:
        driver.find_element_by_class_name("shark-pager-next").click()
        time.sleep(4)
    else:
        break

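Recognizing CAPTCHAs with pytesseract

    1. Installation: sudo pip3 install pytesseract

    2. Steps:

        1. Open the CAPTCHA image: Image.open('path to the CAPTCHA image')

        2. Recognize it with the image_to_string() function of the pytesseract module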

from PIL import Image
from pytesseract import image_to_string
# 1. Load the image
image = Image.open('t1.png')
# 2. Run recognition
text = image_to_string(image)
print(text)

Generating CAPTCHAs with the captcha module

    1. Installation: sudo pip3 install captcha

import random
from PIL import Image
import numpy as np
from captcha.image import ImageCaptcha

digit = ['0','1','2','3','4','5','6','7','8','9']
alphabet = [chr(i) for i in range(97,123)]+[chr(i) for i in range(65,91)]
char_set = digit + alphabet
#print(char_set)
def random_captcha_text(char_set=char_set,captcha_size=4):
    '''Return a list of random characters (four by default)'''
    captcha_text = []
    for i in range(captcha_size):
        ele = random.choice(char_set)
        captcha_text.append(ele)
    return captcha_text
def gen_captcha_text_and_inage():
    '''Generate a four-character CAPTCHA image, save it, and return its text'''
    image = ImageCaptcha()
    captcha_text = random_captcha_text()
    # Join the list into a string
    captcha_text = ''.join(captcha_text)
    captchaInfo = image.generate(captcha_text)
    # Build the CAPTCHA image and save it
    captcha_imge = Image.open(captchaInfo)
    captcha_imge = np.array(captcha_imge)
    im = Image.fromarray(captcha_imge)
    im.save('captcha.png')
    return captcha_text
if __name__ == '__main__':
    gen_captcha_text_and_inage()

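Deduplication

    1. Deduplication takes two steps and two queues (lists)

        1. One queue holds URLs that have already been crawled. Before a URL is stored, check whether it is already in this crawled queue; that check is what removes duplicates

        2. The other queue holds URLs waiting to be crawled. A URL goes into this pending queue only if it is not already in the crawled queue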
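
The two-queue idea can be sketched without any network access. This is a minimal illustration, not the code used later in this section: it swaps the two lists for a deque and a set (set membership checks are constant time), and the LINKS graph and the crawl_bfs name are hypothetical stand-ins for real pages.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages
LINKS = {
    "/p0": ["/p1", "/p2"],
    "/p1": ["/p2", "/p0"],
    "/p2": ["/p3", "/p1"],
    "/p3": [],
}


def crawl_bfs(seed, get_links):
    """Visit every reachable URL exactly once, in breadth-first order."""
    pending = deque([seed])  # URLs waiting to be crawled
    seen = {seed}            # URLs already crawled or already queued
    order = []
    while pending:
        url = pending.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # the deduplication check
                seen.add(link)
                pending.append(link)
    return order


if __name__ == "__main__":
    print(crawl_bfs("/p0", lambda url: LINKS.get(url, [])))
```

Marking a URL as seen when it is queued, rather than when it is crawled, keeps the same URL from entering the pending queue twice.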

    Crawling Douban with deduplication and breadth-first traversal

import re
from bs4 import BeautifulSoup
import basicspider
import hashlibHelper

def get_html(url):
    """
    Download the HTML source of one page
    """
    headers = [("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")]
    html = basicspider.downloadHtml(url, headers=headers)
    return html

def get_movie_all(html):
    """
    Get the list of all movie entries on the current page
    """
    soup = BeautifulSoup(html, "html.parser")
    movie_list = soup.find_all('div', class_='bd doulist-subject')
    #print(movie_list)
    return movie_list

def get_movie_one(movie):
    """
    Extract the details of one movie and join them into one long string
    """
    result = ""
    soup = BeautifulSoup(str(movie),"html.parser")
    title = soup.find_all('div', class_="title")
    soup_title = BeautifulSoup(str(title[0]), "html.parser")
    for line in soup_title.stripped_strings:
        result += line
    try:
        score = soup.find_all('span', class_='rating_nums')
        score_ = BeautifulSoup(str(score[0]), "html.parser")
        for line in score_.stripped_strings:
            result += "|| Rating: "
            result += line
    except IndexError:
        # No rating element on this entry; fall back to a default score
        result += "|| Rating: 5.0"
    abstract = soup.find_all('div', class_='abstract')
    abstract_info = BeautifulSoup(str(abstract[0]), "html.parser")
    for line in abstract_info.stripped_strings:
        result += "|| "
        result += line    
    
    result += '\n'
    print(result)
    return result

def save_file(movieInfo):
    """
    Write to the file; append mode is used here
    """
    with open("doubanMovie.txt","ab") as f:
        #lock.acquire()
        f.write(movieInfo.encode("utf-8"))
        #lock.release()
    
crawl_queue = []  # queue of URLs waiting to be crawled
crawled_queue = []  # queue of URLs already crawled
    
def crawlMovieInfo(url):
    '''Crawl one page of data'''
    # e.g. https://www.douban.com/doulist/3516235/
    global crawl_queue
    global crawled_queue
    html = get_html(url)
    regex = r'https://www\.douban\.com/doulist/3516235/\?start=\d+&amp;sort=seq&amp;playable=0&amp;sub_type='
    p = re.compile(regex,re.S)
    itemUrls = p.findall(html)
    # Two-step deduplication
    for item in itemUrls:
        # Hash the item, then check whether it is already in the crawled queue
        hash_item = hashlibHelper.hashStr(item)
        if hash_item not in crawled_queue:  # dedupe against the crawled queue
            crawl_queue.append(item)
    crawl_queue = list(set(crawl_queue))  # dedupe the pending queue itself
    # Process the current page
    movie_list = get_movie_all(html)
    for movie in movie_list:
        save_file(get_movie_one(movie))
    # Hash the url and store it in the crawled queue
    hash_url = hashlibHelper.hashStr(url)
    crawled_queue.append(hash_url)
    
if __name__ == "__main__":
    # Breadth-first traversal
    seed_url = 'https://www.douban.com/doulist/3516235/?start=0&amp;sort=seq&amp;playable=0&amp;sub_type='
    crawl_queue.append(seed_url)
    while crawl_queue:
        url = crawl_queue.pop(0)
        crawlMovieInfo(url)
    print(crawled_queue)
    print(len(crawled_queue))

import hashlib 

def hashStr(strInfo):
    '''Hash a string'''
    hashObj = hashlib.sha256()
    hashObj.update(strInfo.encode('utf-8'))
    return hashObj.hexdigest()

def hashFile(fileName):
    '''Hash a file'''
    hashObj = hashlib.md5()
    with open(fileName,'rb') as f:
        while True:
            # Read in chunks: a large file read all at once may not fit in memory
            data = f.read(2048)
            if not data:
                break
            hashObj.update(data)
    return hashObj.hexdigest()
        
    
if __name__ == "__main__":
    print(hashStr("hello"))
    print(hashFile('貓眼電影.txt'))

hashlibHelper.py

from urllib import request
from urllib import parse
from urllib import error
import random
import time

def downloadHtml(url,headers=[],proxy={},timeout=None,decodeInfo='utf-8',num_tries=10,useProxyRatio=11):
    '''
    Concerns this function handles:
    user-agent and other HTTP request headers
    proxy support
    timeouts
    encoding: what to do when the page is not UTF-8
    server errors (5XX)
    client errors (4XX)
    delays between requests
    '''
    time.sleep(random.randint(1,2))  # throttle requests; do not go too fast
    # useProxyRatio controls how often the proxy is used
    if random.randint(1,10) >useProxyRatio:
        proxy = None 
    # Create the ProxyHandler
    proxy_support = request.ProxyHandler(proxy)
    # Create the opener
    opener = request.build_opener(proxy_support)
    # Set the user-agent
    opener.addheaders = headers
    # Install the opener
    request.install_opener(opener)
    html = None
    try:
        # Many exceptions can occur here:
        # decoding errors,
        # download errors: client errors such as 404 and 403,
        #                  server errors (5XX)
        res = request.urlopen(url,timeout=timeout)
        html = res.read().decode(decodeInfo)
    except UnicodeDecodeError:
        print("UnicodeDecodeError")
    except error.URLError as e:  # HTTPError is a subclass of URLError
        # Client errors 404, 403 (possibly blocked by anti-crawling measures)
        if hasattr(e,'code') and 400 <= e.code < 500:
            print("Client Error "+str(e.code))
        elif hasattr(e,'code') and 500 <= e.code < 600:
            if num_tries > 0:
                time.sleep(random.randint(1,3))  # wait before retrying
                # Keep the result of the retry instead of discarding it
                html = downloadHtml(url,headers,proxy,timeout,decodeInfo,num_tries-1,useProxyRatio)
    return html

if __name__ == "__main__":
    url = "http://maoyan.com/board/4?offset=0"
    headers = [("User-Agent","Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")]
    print(downloadHtml(url,headers=headers))

basicspider.py
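
The retry-on-5XX behaviour can also be written as a loop instead of recursion. The sketch below is an illustration under stated assumptions, not part of basicspider.py: fetch_with_retry, FakeHTTPError, and make_flaky_fetch are hypothetical names, and the fetch argument is any callable that returns page text or raises an exception carrying an integer .code attribute, as urllib.error.HTTPError does.

```python
import time


class FakeHTTPError(Exception):
    """Hypothetical stand-in for urllib.error.HTTPError, used only for the demo."""
    def __init__(self, code):
        super().__init__("HTTP %d" % code)
        self.code = code


def fetch_with_retry(fetch, url, num_tries=3, base_delay=0.0):
    """Call fetch(url) until it succeeds or the attempts run out."""
    for attempt in range(num_tries):
        try:
            return fetch(url)
        except Exception as e:
            code = getattr(e, "code", None)
            if code is None:
                raise  # not an HTTP status error; let the caller see it
            if not 500 <= code < 600:
                return None  # 4XX: retrying will not help
            time.sleep(base_delay * (attempt + 1))  # wait longer each time
    return None


def make_flaky_fetch(failures):
    """Build a fake fetch that raises 503 for the first `failures` calls."""
    state = {"calls": 0}
    def fetch(url):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise FakeHTTPError(503)
        return "<html>%s</html>" % url
    return fetch


if __name__ == "__main__":
    print(fetch_with_retry(make_flaky_fetch(2), "http://example.com/"))
```

Passing the fetch function in as a parameter keeps the retry policy separate from the download code, so it can be exercised without touching the network.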

The Scrapy framework
   Type scrapy in a terminal to list the available commands
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
   Usage steps:
       1. Create a project: scrapy startproject <project name>
           scrapy startproject tencentSpider
       2. Enter the project and create a spider
           cd tencentSpider
           scrapy genspider tencent hr.tencent.com  # tencent is the spider's name; hr.tencent.com is the entry domain, and the data to crawl must be under this domain
       3. Edit the program logic
           1. settings.py
               1. Set the user agent
               2. Disable the robots protocol
               3. Disable cookies
               4. Enable ITEM_PIPELINES

# -*- coding: utf-8 -*-

# Scrapy settings for tencentSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tencentSpider'

SPIDER_MODULES = ['tencentSpider.spiders']
NEWSPIDER_MODULE = 'tencentSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencentSpider (+http://www.yourdomain.com)'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # whether to obey the robots protocol

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencentSpider.middlewares.TencentspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencentSpider.middlewares.TencentspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tencentSpider.pipelines.TencentspiderPipeline': 300,  # the value is the priority; lower values run first
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py

2. items.py: field definitions, similar to an ORM model

import scrapy

class TencentspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Fields to scrape: position name, link, and position type
    positionName = scrapy.Field()
    positionLink = scrapy.Field()
    positionType = scrapy.Field()

3. pipelines.py: the logic for saving data

import json

class TencentspiderPipeline(object):
    def process_item(self, item, spider):
        with open('tencent.json','ab') as f:
            text = json.dumps(dict(item),ensure_ascii=False)+'\n'
            f.write(text.encode('utf-8'))
        return item

4. spiders/tencent.py: the main crawling logic

import scrapy
from tencentSpider.items import TencentspiderItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    #start_urls = ['http://hr.tencent.com/']
#    start_urls = []
#    for i in range(0,530,10):
#        url = "https://hr.tencent.com/position.php?keywords=python&start="
#        url += str(i)+"#a"
#        start_urls.append(url)
    url = "https://hr.tencent.com/position.php?keywords=python&start="
    offset = 0
    start_urls = [url + str(offset)+"#a"]

    def parse(self, response):
        for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
            item = TencentspiderItem()  # item starts out as an empty dict-like object
            item['positionName'] = each.xpath('./td[1]/a/text()').extract()[0]
            item['positionLink'] = "https://hr.tencent.com/"+each.xpath('./td[1]/a/@href').extract()[0]
            item['positionType'] = each.xpath('./td[2]/text()').extract()[0]
            yield item

        # Build the URL of the next page
        if self.offset < 530:
            self.offset += 10
            nextPageUrl = self.url+str(self.offset)+"#a"
        else:
            return
        # Request the next page
        yield scrapy.Request(nextPageUrl,callback = self.parse)

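4. Run the spider
scrapy crawl tencent

5. Run the spider and save the data to a given file

    scrapy crawl tencent -o <file name>

Setting up proxy servers in the Scrapy framework
    1. Configure the proxy in the process_request() method of the DownloaderMiddleware class in middlewares.py
    2. Define the proxy pool in settings.py, e.g. proxyList = [.....]
    3. Inside process_request(), pick a proxy at random with random.choice(proxyList)
    Notes:
    1. If the proxy is private and needs a username and password, the credentials must be Base64-encoded (an encoding step, not real encryption)
    2. scrapy genspider tencent hr.tencent.com generates a basic spider. To generate the more advanced CrawlSpider, use:
           scrapy genspider -t crawl tencent2 hr.tencent.com
       CrawlSpider extracts URLs and other information more flexibly; using it requires understanding URL and LinkExtractor

Building a distributed crawler with Scrapy-Redis
    Redis is an in-memory database (it also provides an interface for saving data to an on-disk database).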
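
The proxy steps above can be sketched as a downloader-middleware-style class. This is a minimal illustration under stated assumptions: PROXY_LIST, its addresses and credentials, RandomProxyMiddleware, and FakeRequest are all hypothetical, and FakeRequest only stands in for scrapy.Request so the sketch runs without Scrapy installed. Setting request.meta['proxy'] is the convention Scrapy's built-in HttpProxyMiddleware reads.

```python
import base64
import random

# Hypothetical proxy pool; in a real project this list would live in settings.py
PROXY_LIST = [
    {"url": "http://127.0.0.1:8888", "user_pass": None},
    {"url": "http://127.0.0.1:8889", "user_pass": "user:password"},
]


class RandomProxyMiddleware:
    """Downloader-middleware-style class that picks a proxy per request."""

    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        # Scrapy's HttpProxyMiddleware reads the proxy from request.meta['proxy']
        request.meta["proxy"] = proxy["url"]
        if proxy["user_pass"]:
            # Base64 is the encoding used by the Basic auth header, not encryption
            token = base64.b64encode(proxy["user_pass"].encode("utf-8")).decode("ascii")
            request.headers["Proxy-Authorization"] = "Basic " + token
        return None  # returning None lets processing of the request continue


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""
    def __init__(self):
        self.meta = {}
        self.headers = {}


if __name__ == "__main__":
    req = FakeRequest()
    RandomProxyMiddleware().process_request(req, spider=None)
    print(req.meta["proxy"], req.headers.get("Proxy-Authorization"))
```

In a real project the class would also have to be registered under DOWNLOADER_MIDDLEWARES in settings.py before Scrapy calls it.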
