Building a Proxy Pool in Python to Scrape Lagou Job Listings

Date: 2019-11-06


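Let's start with a diagram that gives an overview of how a crawler works.

Features

Multi-threaded scraping of Lagou job listings
Maintaining a proxy IP pool
Setting up a Node server
Data analysis with ECharts in Taro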

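Part 1: Multi-threaded scraping of Lagou job listings

Tip: background knowledge

1. Python 3 basic syntax (Runoob tutorial)
2. The requests module (quickstart)
3. MongoDB (quick installation)
4. Using pymongo (quickstart)
5. Thread pools with concurrent.futures (quickstart)

First, what is a crawler? Here is the Baidu Baike definition:

A web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

Put simply, a crawler fetches content according to a set of rules.

What do we want to fetch?

Our goal is to fetch job listings from Lagou, specifically the Python listings for Wuhan.

Now that we know what data we want, the next step is to find where it comes from.

Clicking "next page" while watching the browser's developer tools shows a new request on every click, and that request turns out to be the source of the listing data. So we only need to call this endpoint directly to get the data. A quick first attempt: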

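import requests
# Form parameters
data = {
    'first': False,  # this parameter can always be sent as False
    'pn': 2,         # pn is the page number
    'kd': 'Python'   # kd is the search keyword
}
# Send the POST request (city is Wuhan, percent-encoded)
response = requests.post(
    'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E6%AD%A6%E6%B1%89&needAddtionalResult=false', data=data)
# Set the response encoding
response.encoding = 'utf-8'
# Parse the JSON body
res = response.json()
print(res)

Running it gives the following result (the msg text is translated from Chinese):

{ "status": False, "msg": "You are making requests too frequently, please try again later", "clientIp": "59.xxx.xxx.170", "state": 2408 }

Why is the result of our request different from what the web page receives?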

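Looking at the request again in the developer tools, we can see that it carries cookies. Fine, we will add the cookies. But where do they come from? We can't just hard-code them.

Clear the browser's cookies first (developer tools > Application > Cookies > Clear), then refresh the page, and the source of the cookies becomes visible.

So we fetch the cookies first and then call the endpoint.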

import requests
response = requests.get(
    'https://www.lagou.com/jobs/list_Python?px=default&city=%E6%AD%A6%E6%B1%89')
response.encoding = 'utf-8'
print(response.text)
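
The response contains this line (message translated from Chinese):

<div class="tip">The current request shows malicious behaviour and has been blocked by the system; all of your actions will be logged!</div>

What? Blocked already?

Think back to the diagram at the start: perhaps this site checks the User-Agent.

Never mind, let's add a User-Agent and try again.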

import requests
# Added a User-Agent request header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"
}
response = requests.get(
    'https://www.lagou.com/jobs/list_Python?px=default&city=%E6%AD%A6%E6%B1%89', headers=headers)
response.encoding = 'utf-8'
print(response.text)

Surprisingly, that works: the page now comes back normally.

Now that this works, we can fetch the cookies and then call the endpoint.

import requests
UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"

def getCookie():
    ''' @method fetch the cookies set by the listing page '''
    response = requests.get('https://www.lagou.com/jobs/list_Python?px=default&city=%E6%AD%A6%E6%B1%89', headers={
        "User-Agent": UserAgent
    })
    # The cookies come back as a dict
    cookies = response.cookies.get_dict()
    # The Cookie request header must be a string, so convert the dict
    COOKIE = ''
    for key, val in cookies.items():
        COOKIE += (key + '=' + val + '; ')
    return COOKIE


# Request headers
headers = {
    "Cookie": getCookie()
}
print(headers)
# Form data
data = {
    'first': False,  # this parameter can always be sent as False
    'pn': 2,         # pn is the page number
    'kd': 'Python'   # kd is the search keyword
}
response = requests.post(
    'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E6%AD%A6%E6%B1%89&needAddtionalResult=false', data=data, headers=headers)
# Set the response encoding
response.encoding = 'utf-8'
# Parse the JSON body
res = response.json()
print(res)


Surely this time we get the data... and yet the response is still the same "requests too frequent, please try again later" message.
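
The string-building loop in getCookie can be written more compactly. A minimal sketch, assuming only that the cookies arrive as a plain dict (which is what `response.cookies.get_dict()` returns); `cookie_header` is a hypothetical helper name, not part of the original script:

```python
def cookie_header(cookies):
    """Join a cookie dict into the "k1=v1; k2=v2" form used by the Cookie header."""
    return '; '.join(f'{key}={val}' for key, val in cookies.items())


if __name__ == '__main__':
    # Made-up cookie names and values, for illustration only
    print(cookie_header({'JSESSIONID': 'abc', 'user_trace_token': 'xyz'}))
```

Unlike the `+=` loop, `join` leaves no trailing `'; '`. requests can also do this bookkeeping itself: a `requests.Session` keeps the cookies set by earlier responses and sends them on later requests.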

Stay calm and look at the request headers again.

Let's try adding all the other request headers as well.

# Change headers to this
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Connection": "keep-alive",
    "Host": "www.lagou.com",
    "Referer": 'https://www.lagou.com/jobs/list_Python?px=default&city=%E6%AD%A6%E6%B1%89',
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": UserAgent,
    "Cookie": getCookie()
}

With these headers the request finally returns the data.
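
To tell a real result from the throttling reply in code, one option is to check the shape of the parsed response. A minimal sketch, assuming the two shapes seen in this article: the rejected reply carries a non-empty msg, while a successful reply has msg set to None and nests the postings under content > positionResult > result (the path the complete crawler in this article reads). `extract_positions` is a hypothetical helper name:

```python
def extract_positions(res):
    """Return the list of postings from a parsed reply, or None if the reply was a rejection."""
    if res.get('msg') is not None:
        return None  # e.g. the "requests too frequent" reply
    content = res.get('content') or {}
    result = (content.get('positionResult') or {}).get('result')
    return result or []


if __name__ == '__main__':
    blocked = {'status': False, 'msg': 'too frequent', 'state': 2408}
    ok = {'msg': None, 'content': {'positionResult': {'result': [{'positionName': 'Python'}]}}}
    print(extract_positions(blocked))
    print(extract_positions(ok))
```

Returning None for a rejection and an empty list for "no more results" lets the caller distinguish "retry with new cookies" from "stop paging".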

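At this point we have successfully fetched one page of data; next we want to fetch many pages.

Since there is a lot of data to fetch, multiple threads can be used to speed things up, and the data should be stored in a database (MongoDB here; the same idea applies to any other database).

The complete crawler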

import requests
from pymongo import MongoClient
from time import sleep
# Connect to MongoDB
client = MongoClient('127.0.0.1', 27017)
db = client.db  # use the database named "db"; created automatically if it does not exist
# Cookie and User-Agent for the request headers
COOKIE = ''
UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"
# Wuhan, percent-encoded
city = '%E6%AD%A6%E6%B1%89'

# Fetch the cookies set by the listing page and cache them in COOKIE
def getCookie(key_world):
    global COOKIE
    data = requests.get('https://www.lagou.com/jobs/list_' + key_world + '?px=default&city=' + city, headers={
        "User-Agent": UserAgent
    })
    cookies = data.cookies.get_dict()
    COOKIE = ''
    for key, val in cookies.items():
        COOKIE += (key + '=' + val + '; ')

# Call the data endpoint for one page
def getList(page, key_world):
    data = {
        "first": "false",
        "pn": page,  # page numbers are 1-based, matching range(1, 100) in getData
        # the front-end keyword is passed around percent-encoded; send the plain word as kd
        "kd": '前端' if key_world == '%E5%89%8D%E7%AB%AF' else key_world
    }
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Connection": "keep-alive",
        "Host": "www.lagou.com",
        "Referer": 'https://www.lagou.com/jobs/list_' + key_world + '?px=default&city=' + city,
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": UserAgent,
        "Cookie": COOKIE
    }
    response = requests.post(
        'https://www.lagou.com/jobs/positionAjax.json?px=default&city=' + city + '&needAddtionalResult=false', data=data, headers=headers)
    response.encoding = 'utf-8'
    res = response.json()
    return res

# Scrape every page for one keyword
def getData(key_world):
    print('Start scraping ' + key_world)

    # The front-end keyword is stored in the "web" collection
    if key_world == '%E5%89%8D%E7%AB%AF':
        table = db.web
    else:
        table = db[key_world]  # created automatically if it does not exist

    # The endpoint requires cookies, so fetch them first
    getCookie(key_world)

    for page in range(1, 100):
        # Request one page
        res = getList(page, key_world)
        if res.get('msg') is None:
            print('Success')
        else:
            print('Failed')
            # Log the rejected response
            with open('./log.txt', 'a', encoding='utf-8') as f:
                f.write(str(res))
                f.write('\n')
            # Refresh the cookies and retry the same page once
            getCookie(key_world)
            res = getList(page, key_world)
        # Job postings; empty if the retry was rejected as well
        content = res.get('content') or {}
        position = (content.get('positionResult') or {}).get('result') or []
        # Keep only the fields we need
        one_data = []
        for item in position:
            one_data.append({
                'positionName': item['positionName'],
                'workYear': item['workYear'],
                'salary': item['salary'],
                'education': item['education'],
                'companySize': item['companySize'],
                'companyFullName': item['companyFullName'],
                'formatCreateTime': item['formatCreateTime'],
                'positionId': item['positionId']
            })
        # No more data
        if len(one_data) == 0:
            break
        # Store this page
        table.insert_many(one_data)
    print(key_world + ' stored')
    # Even with fresh cookies the requests still get throttled, so sleep 60 seconds
    # for now; the proxy IP pool built later is the proper fix
    sleep(60)

# Search keywords to scrape; the earlier examples used Python only, here we scrape several
key_worlds = ['Python', 'Java', 'PHP', 'C', 'C++', 'C#']
# Start scraping
for key_world in key_worlds:
    getData(key_world)

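
Two problems remain; they will be solved once the proxy IP pool is in place.

1. The crawler does not use multiple threads yet
2. The IP can still get banned, so proxies are needed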
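
One more detail is worth noting about the script above: keywords such as C# and C++ are concatenated into the listing URL as-is, and in a URL everything after `#` is treated as a fragment and is not sent to the server. A minimal sketch of percent-encoding the keyword and city with the standard library; `list_url` is a hypothetical helper, and whether the site accepts the encoded path in exactly this form is an assumption:

```python
from urllib.parse import quote


def list_url(keyword, city):
    """Build the listing-page URL with keyword and city percent-encoded."""
    return ('https://www.lagou.com/jobs/list_' + quote(keyword, safe='')
            + '?px=default&city=' + quote(city, safe=''))


if __name__ == '__main__':
    for kw in ['Python', 'C++', 'C#', '前端']:
        print(list_url(kw, '武汉'))
```

With this approach the keyword list can hold plain strings such as '前端' instead of a hand-encoded '%E5%89%8D%E7%AB%AF', and the same plain string can be sent as the kd form field.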

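Part 2: Maintaining a proxy IP pool

Tip: background knowledge

1. Everything covered so far
2. The lxml module for XPath parsing (quickstart)

Maintaining an IP pool breaks down into two steps:

1. Scrape free proxies from the web and store them in the database
2. Filter the database down to the proxies that actually work

By now you should have a good idea of how a crawler works. Maintaining your own IP pool is really just a scheduled crawler that keeps scraping free proxies from the web.

Step one: scrape free proxies and store them in the database.

Here we scrape the Xila free proxy site (xiladaili.com).

The usual routine: download the page first, then extract what we need.

Download the page first: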

import requests
response = requests.get('http://www.xiladaili.com/gaoni/1/')
response.encoding = 'utf-8'
print(response.text)
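
Running this shows that the whole page has been downloaded; evidently this site has no anti-crawler protection.

Next, extract the data. (See the linked guide for how to obtain an XPath.)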

import requests
from lxml import etree  # XPath parsing module
response = requests.get('http://www.xiladaili.com/gaoni/1/')
response.encoding = 'utf-8'
# print(response.text)
s = etree.HTML(response.text)
# All the IPs.
# XPath of the first row:  /html/body/div[1]/div[3]/div[2]/table/tbody/tr[1]/td[1]
# XPath of the second row: /html/body/div[1]/div[3]/div[2]/table/tbody/tr[2]/td[1]
# Dropping the index on tr selects every row: /html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[1]
ips = s.xpath('/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[1]/text()')
# The proxy protocol of every row
types = s.xpath('/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[2]/text()')
print(ips)
print(types)


That gives us the content we need; now store it in the database.

import requests
from lxml import etree  # XPath parsing module
from pymongo import MongoClient

# Database connection
client = MongoClient('127.0.0.1', 27017)
db = client.ip  # use the "ip" database; created automatically if it does not exist
table = db.table  # use the "table" collection; created automatically if it does not exist

response = requests.get('http://www.xiladaili.com/gaoni/1/')
response.encoding = 'utf-8'
s = etree.HTML(response.text)
# All the IPs (each entry has the form host:port)
ips = s.xpath('/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[1]/text()')
# The proxy protocol of every row
types = s.xpath('/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[2]/text()')

# Store in the database
for index, ip in enumerate(ips):
    host, port = ip.split(':')
    table.insert_one({"ip": host, "port": port, "type": types[index]})


So far we have scraped only one page; finally, switch to multiple threads to scrape several pages.

import requests
from lxml import etree  # XPath parsing module
from pymongo import MongoClient
from concurrent.futures import ThreadPoolExecutor  # thread pool

# Database connection
client = MongoClient('127.0.0.1', 27017)
db = client.ip  # use the "ip" database; created automatically if it does not exist
table = db.table  # use the "table" collection; created automatically if it does not exist

spider_poll_max = 5  # maximum number of crawler threads
spider_poll = ThreadPoolExecutor(
    max_workers=spider_poll_max)  # crawler thread pool

# Scrape a single page
def getIp(page):
    response = requests.get('http://www.xiladaili.com/gaoni/' + str(page + 1)+'/')
    response.encoding = 'utf-8'
    s = etree.HTML(response.text)
    # All the IPs (each entry has the form host:port)
    ips = s.xpath('/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[1]/text()')
    # The proxy protocol of every row
    types = s.xpath('/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[2]/text()')
    print(ips)
    print(types)
    # Store in the database
    for index, ip in enumerate(ips):
        host, port = ip.split(':')
        table.insert_one({"ip": host, "port": port, "type": types[index]})


# Scrape 10 pages
for page in range(0, 10):
    # Queue one task per page
    spider_poll.submit(getIp, page)


Remaining issue

1. The IP can still get banned, so proxies are needed
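
A note on the thread pool used above: `submit()` returns a future, and an exception raised inside the task is stored in that future and only surfaces when `result()` is called, so a failing task can otherwise go unnoticed. A minimal sketch of collecting results and failures; `fetch_page` is a hypothetical stand-in for getIp that performs no network access and returns made-up addresses:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(page):
    """Stand-in for getIp: pretend page 3 fails, every other page yields two entries."""
    if page == 3:
        raise ValueError('simulated failure on page 3')
    return [f'10.0.{page}.1:8080', f'10.0.{page}.2:8080']


def crawl(pages, max_workers=5):
    """Run fetch_page over pages in a thread pool; return (results, failed_pages)."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_page, page): page for page in pages}
        for future in as_completed(futures):
            page = futures[future]
            try:
                results[page] = future.result()  # re-raises any exception from the worker
            except Exception as exc:
                print(f'page {page} failed: {exc}')
                failed.append(page)
    return results, failed


if __name__ == '__main__':
    results, failed = crawl(range(1, 6))
    print(sorted(results), failed)
```

Using the executor as a context manager also waits for all queued tasks to finish before the script moves on.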

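Scraping the IPs is done; now it is time to validate them.

Step two: filter the database down to the proxies that work. Earlier we created a collection called table that stores every scraped IP (not yet checked for validity). Here we add a dedicated collection called ip to hold the working ones.

Validation itself also has two steps: first, remove proxies that have stopped working from the ip collection; second, copy the working proxies from the table collection into the ip collection.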

import requests
from pymongo import MongoClient
from concurrent.futures import ThreadPoolExecutor  # thread pool

REQ_TIMEOUT = 3  # timeout for each proxy check, in seconds

# Database connection
client = MongoClient('127.0.0.1', 27017)
db = client.ip  # use the "ip" database; created automatically if it does not exist
table = db.table  # all scraped IPs
ip_table = db.ip  # working IPs

# Thread pool
spider_poll_max = 20  # maximum number of threads
# Create the thread pool
proving_poll = ThreadPoolExecutor(max_workers=spider_poll_max)


def proving(ip):
    ''' @method find the working proxies among all scraped IPs '''
    host = ip['ip']
    port = ip['port']
    types = ip['type']

    proxies = {
        'http': 'http://' + host + ':' + port,
        'https': 'http://' + host + ':' + port
    }

    try:
        # A proxy works if the IP seen through it differs from our own
        OriginalIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT).content
        MaskedIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT, proxies=proxies).content
        if OriginalIP == MaskedIP:
            # Same IP: the proxy masks nothing, so remove it
            table.delete_one({"ip": host, "port": port, "type": types})
        else:
            print('New working proxy', host+':'+port)
            # Store it in the ip collection; the upsert avoids duplicates on repeated runs
            doc = {"ip": host, "port": port, "type": types}
            ip_table.update_one(doc, {"$set": doc}, upsert=True)
    except Exception:
        # Timed out or unreachable: remove the proxy
        table.delete_one({"ip": host, "port": port, "type": types})


def proving_ip(ip):
    ''' @method remove proxies that no longer work from the working-IP collection '''
    host = ip['ip']
    port = ip['port']
    types = ip['type']

    # Proxy settings
    proxies = {
        'http': 'http://' + host + ':' + port,
        'https': 'http://' + host + ':' + port
    }

    # The try/except catches proxies that time out
    try:
        # A proxy works if the IP seen through it differs from our own
        OriginalIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT).content
        MaskedIP = requests.get("http://icanhazip.com", timeout=REQ_TIMEOUT, proxies=proxies).content
        if OriginalIP == MaskedIP:
            # Same IP: the proxy is not working, remove it
            ip_table.delete_one({"ip": host, "port": port, "type": types})
        else:
            print('Working proxy', host+':'+port)

    except Exception:
        # Timed out or unreachable: remove the proxy
        ip_table.delete_one({"ip": host, "port": port, "type": types})


# Step one: remove dead proxies from the working-IP collection
proving_ips = ip_table.find({})
print('Cleaning up dead proxies...')
for data in proving_ips:
    # Queue one check per proxy
    proving_poll.submit(proving_ip, data)

# Step two: pick out the working proxies among all scraped IPs
ips = table.find({})
print('Validating scraped proxies...')
for data in ips:
    # Queue one check per proxy
    proving_poll.submit(proving, data)


We now have a collection dedicated to working proxies (the ip collection); from here on we can simply take a proxy from it and use it.
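
The validation rule used above (a proxy is good if the request through it succeeds and the IP reported differs from our own) can be isolated into a small function. A minimal sketch, assuming the caller supplies `fetch_ip`, a function that returns the IP an echo service sees through the given proxies mapping; `build_proxies` and `is_working` are hypothetical names, and the addresses are documentation-range placeholders:

```python
def build_proxies(host, port):
    """Build the proxies mapping for requests, with an explicit scheme."""
    address = f'http://{host}:{port}'
    return {'http': address, 'https': address}


def is_working(host, port, real_ip, fetch_ip):
    """True if the request through the proxy succeeds and hides our real IP."""
    try:
        seen = fetch_ip(build_proxies(host, port))
    except Exception:
        return False  # timeout, refused connection, and so on
    return seen.strip() != real_ip.strip()


if __name__ == '__main__':
    def fake_fetch(proxies):
        # Stand-in for requests.get('http://icanhazip.com', proxies=proxies).text
        if proxies['http'].endswith(':9999'):
            raise TimeoutError('simulated timeout')
        return '203.0.113.7\n'

    print(is_working('198.51.100.2', '8080', '192.0.2.1', fake_fetch))
    print(is_working('198.51.100.3', '9999', '192.0.2.1', fake_fetch))
```

Passing `real_ip` in means our own address only has to be looked up once per run rather than once per proxy.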

Now let's resolve the issues left over from earlier.

1. Scraping the job listings with multiple threads and proxies

import requests
import random
from pymongo import MongoClient
from time import sleep
from concurrent.futures import ThreadPoolExecutor  # thread pool

# Connect to MongoDB
client = MongoClient('127.0.0.1', 27017)
db = client.db  # job data lives in the "db" database; created automatically if it does not exist
ip_client = client.ip  # proxies live in the "ip" database
ip_table = ip_client.ip  # working-IP collection

# Thread pool
spider_poll_max = 7  # maximum number of crawler threads
# Crawler thread pool
spider_poll = ThreadPoolExecutor(max_workers=spider_poll_max)

# User-Agent for the request headers
UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"
# Wuhan, percent-encoded
city = '%E6%AD%A6%E6%B1%89'


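def getRandomIp():
    ''' @method pick a random proxy from the working-IP collection '''
    # Total number of proxies
    count = ip_table.count_documents({})
    if count == 0:
        return None  # the pool is empty
    # randrange(count) yields 0..count-1; randint(0, count) could also yield count,
    # which skips past the last document and returns nothing
    index = random.randrange(count)
    print(count, index)
    data = ip_table.find().skip(index).limit(1)

    for item in data:
        print({'ip': item['ip'], 'port': item['port']})
        return {'ip': item['ip'], 'port': item['port']}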


def getCookie(key_world):
    ''' @method fetch the cookies through a random proxy '''
    # Pick a random proxy to avoid getting our own IP banned
    row = getRandomIp()
    print('cookie proxy:', row)
    if row is None:
        raise RuntimeError('proxy pool is empty')
    try:
        proxies = {
            'http': 'http://' + row['ip'] + ':' + row['port'],
            'https': 'http://' + row['ip'] + ':' + row['port']
        }

        data = requests.get('https://www.lagou.com/jobs/list_' + key_world + '?px=default&city=' + city, timeout=10, proxies=proxies, headers={
            "User-Agent": UserAgent
        })
        cookies = data.cookies.get_dict()
        COOKIE = ''
        for key, val in cookies.items():
            COOKIE += (key + '=' + val + '; ')
        return COOKIE
    except requests.RequestException:
        print('Failed to fetch cookies, dead proxy', row['ip'] + ':' + row['port'])
        # Remove the dead proxy
        ip_table.delete_one({"ip": row['ip'], "port": row['port']})
        # Try again with another proxy
        return getCookie(key_world)


def getData(obj):
    ''' @method scrape one page of results '''
    key_world = obj['key_world']  # search keyword
    page = obj['page']  # page number

    print('Start scraping')

    # Pick the collection; the front-end keyword is stored in "web"
    if key_world == '%E5%89%8D%E7%AB%AF':
        table = db.web
    else:
        table = db[key_world]  # created automatically if it does not exist

    # Pick a random proxy to avoid getting our own IP banned
    row = getRandomIp()
    print('data proxy:', row)
    if row is None:
        print('No proxy available, giving up on ' + key_world + ' page ' + str(page))
        return
    proxies = {
        'http': 'http://' + row['ip'] + ':' + row['port'],
        'https': 'http://' + row['ip'] + ':' + row['port']
    }

    try:
        # The endpoint requires cookies, so fetch them first
        cookie = getCookie(key_world)

        # Request one page
        data = {
            "first": "false",
            "pn": page,  # page numbers are 1-based, matching range(1, 100) below
            "kd": '前端' if key_world == '%E5%89%8D%E7%AB%AF' else key_world
        }
        headers = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Connection": "keep-alive",
            "Host": "www.lagou.com",
            "Referer": 'https://www.lagou.com/jobs/list_' + key_world + '?px=default&city=' + city,
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "User-Agent": UserAgent,
            "Cookie": cookie
        }
        response = requests.post('https://www.lagou.com/jobs/positionAjax.json?px=default&city=' + city + '&needAddtionalResult=false',
                                 data=data, timeout=10, proxies=proxies, headers=headers)
        response.encoding = 'utf-8'
        res = response.json()
        print(res)

        # On success, store the page in the database
        if res.get('msg') is None:
            # Job postings
            position = res['content']['positionResult']['result']
            # Keep only the fields we need
            one_data = []
            for item in position:
                one_data.append({
                    'positionName': item['positionName'],
                    'workYear': item['workYear'],
                    'salary': item['salary'],
                    'education': item['education'],
                    'companySize': item['companySize'],
                    'companyFullName': item['companyFullName'],
                    'formatCreateTime': item['formatCreateTime'],
                    'positionId': item['positionId']
                })
            # No more data (page is an int, so convert it before concatenating)
            if len(one_data) == 0:
                print(key_world + ' page ' + str(page) + ' is empty')
                return

            # Store this page
            table.insert_many(one_data)
            print(key_world + ' page ' + str(page) + ' stored')
        else:
            print(key_world + ' page ' + str(page) + ' failed')
            # Log the rejected response
            with open('./log.txt', 'a', encoding='utf-8') as f:
                f.write('key_world:' + key_world + ',page:' + str(page) + '\n')
                f.write(str(res))
                f.write('\n')
            # Remove the proxy that got rejected
            ip_table.delete_one({"ip": row['ip'], "port": row['port']})

            # Put the task back in the queue
            spider_poll.submit(
                getData, {'key_world': key_world, 'page': page})
    except Exception:
        print('Proxy timed out or failed', row['ip'] + ':' + row['port'])
        # Remove the dead proxy
        ip_table.delete_one({"ip": row['ip'], "port": row['port']})
        # Put the task back in the queue
        spider_poll.submit(getData, {'key_world': key_world, 'page': page})


# Search keywords; the first one is "前端" (front end), percent-encoded
key_worlds = ['%E5%89%8D%E7%AB%AF', 'Python', 'Java', 'PHP', 'C', 'C++', 'C#']
# Queue the tasks
for key_world in key_worlds:
    # Scrape pages 1 to 99 for each keyword
    for page in range(1, 100):
        # Queue one task per page
        spider_poll.submit(getData, {'key_world': key_world, 'page': page})

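
In the script above, getCookie retries by calling itself and getData retries by resubmitting itself, with no upper bound on the number of attempts. A minimal sketch of a bounded retry, as one possible alternative; `with_retries` is a hypothetical helper, and the failing task in the demo is simulated:

```python
def with_retries(task, attempts=3):
    """Call task() up to `attempts` times; return its result or re-raise the last error."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            last_error = exc
            print(f'attempt {attempt} failed: {exc}')
    raise last_error


if __name__ == '__main__':
    calls = []

    def flaky():
        # Simulated task: fails twice, then succeeds
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionError('simulated proxy failure')
        return 'cookie-string'

    print(with_retries(flaky))
```

A caller could wrap a single "pick a proxy, fetch the cookies" step in `with_retries` and log the page as failed once the attempts are used up.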

2. Scraping the proxy site through proxies

import requests
import random
from pymongo import MongoClient
from lxml import etree  # XPath parsing module
from concurrent.futures import ThreadPoolExecutor  # thread pool


# Database connection
client = MongoClient('127.0.0.1', 27017)
db = client.ip  # use the "ip" database; created automatically if it does not exist
table = db.table  # every scraped IP goes into the table collection
ip_table = db.ip  # working-IP collection

# Thread pool
spider_poll_max = 50  # maximum number of crawler threads
# Crawler thread pool
spider_poll = ThreadPoolExecutor(max_workers=spider_poll_max)


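def getRandomIp():
    ''' @method pick a random proxy from the working-IP collection '''
    # Total number of proxies
    count = ip_table.count_documents({})
    if count == 0:
        return None  # the pool is empty
    # randrange(count) yields 0..count-1; randint(0, count) could also yield count,
    # which skips past the last document and returns nothing
    index = random.randrange(count)
    data = ip_table.find().skip(index).limit(1)

    for item in data:
        return {'ip': item['ip'], 'port': item['port']}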


def getIp(page):
    ''' @method scrape one page of the proxy list '''
    # Pick a random proxy to avoid getting our own IP banned
    row = getRandomIp()
    if row is None:
        print('No proxy available')
        return
    proxies = {
        'http': 'http://' + row['ip'] + ':' + row['port'],
        'https': 'http://' + row['ip'] + ':' + row['port']
    }

    try:
        # Download the proxy list page
        response = requests.get(
            'http://www.xiladaili.com/gaoni/' + str(page + 1)+'/', timeout=10, proxies=proxies)
        # Set the encoding
        response.encoding = 'utf-8'
        # Parse
        s = etree.HTML(response.text)
        # Extract the IPs and protocols
        ips = s.xpath(
            '/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[1]/text()')
        types = s.xpath(
            '/html/body/div[1]/div[3]/div[2]/table/tbody/tr/td[2]/text()')
        if len(ips) == 0:
            print('Scraped page is empty')
            # Log the page that yielded nothing
            with open('./log.txt', 'a', encoding='utf-8') as f:
                f.write(response.text)
                f.write('--------------------------------------------------')
                f.write('\n\n\n\n')
            # Remove the proxy that returned an unusable page
            ip_table.delete_one({"ip": row['ip'], "port": row['port']})
        else:
            print('Page scraped, storing in the database...')
            # Store the IPs
            for index, ip in enumerate(ips):
                host, port = ip.split(':')
                table.insert_one(
                    {"ip": host, "port": port, "type": types[index]})
    except Exception:
        print('Timed out')
        # Remove the dead proxy
        ip_table.delete_one({"ip": row['ip'], "port": row['port']})


# Number of pages to scrape
for page in range(0, 100):
    # Queue one task per page
    spider_poll.submit(getIp, page)

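
Picking a random document by index needs care at the boundaries: `random.randint(a, b)` includes both endpoints, so `randint(0, count)` can return `count` itself, one past the last valid index. A minimal sketch of a safe pick, using a plain list as a stand-in for the collection; `pick_random` is a hypothetical helper and the addresses are placeholders:

```python
import random


def pick_random(documents):
    """Return one random element, or None when the sequence is empty."""
    count = len(documents)
    if count == 0:
        return None
    # randrange(count) yields 0 .. count-1, so the index is always valid
    return documents[random.randrange(count)]


if __name__ == '__main__':
    pool = [{'ip': '198.51.100.2', 'port': '8080'}, {'ip': '198.51.100.3', 'port': '3128'}]
    print(pick_random(pool))
    print(pick_random([]))
```

MongoDB can also do the random pick server-side with the `$sample` aggregation stage, for example `ip_table.aggregate([{'$sample': {'size': 1}}])`, which avoids the separate count query.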

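Part 3: Setting up a Node server

Tip: background knowledge

1. JavaScript basic syntax (Runoob tutorial)
2. The http module (quickstart)
3. The mongoose module (quick installation)

server.js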

const http = require('http');
const url = require('url');
const qs = require('qs');
const { get_education } = require('./api/education.js');
const { get_workYear } = require('./api/workYear.js');
const { get_salary } = require('./api/salary.js');

// Create the server with Node's http module; the handler receives the request and the response
http.createServer(function(req, res) {
    // Allow requests from any origin (CORS)
    res.setHeader('Access-Control-Allow-Origin', '*');
    res.writeHead(200, { 'Content-Type': 'application/json;charset=utf-8' }); // set the response content type and encoding

    try {
        // Query-string part of the URL
        var query = url.parse(req.url).query;
        // Parse the query string into an object with qs
        var queryObj = qs.parse(query);
        // The "type" parameter sent by the front end via GET
        var type = queryObj.type;

        /* /get_education: education distribution
           /get_workYear: work-experience distribution
           /get_salary: salary distribution */
        if (req.url.indexOf('/get_education?type=') > -1) {
            get_education(type, function(err, data) {
                // res.end only accepts a string or Buffer, so serialise the error and stop here
                if (err) return res.end(JSON.stringify({ errmsg: String(err) }));
                console.log('[ok] /get_education');
                res.end(JSON.stringify(data));
            });
        } else if (req.url.indexOf('/get_workYear?type=') > -1) {
            get_workYear(type, function(err, data) {
                if (err) return res.end(JSON.stringify({ errmsg: String(err) }));
                console.log('[ok] /get_workYear');
                res.end(JSON.stringify(data));
            });
        } else if (req.url.indexOf('/get_salary?type=') > -1) {
            get_salary(type, function(err, data) {
                if (err) return res.end(JSON.stringify({ errmsg: String(err) }));
                console.log('[ok] /get_salary');
                res.end(JSON.stringify(data));
            });
        } else {
            console.log(req.url);
            res.end('404');
        }
    } catch (err) {
        res.end(JSON.stringify({ errmsg: String(err) }));
    }
}).listen(8989, function() {
    console.log('Server started, listening on 8989...');
});

education.js (the other API files are similar)

const { model } = require('./db.js');

// Get the distribution of education levels
exports.get_education = function(type, callback) {
    // Fetch only the education field of every document in the collection
    model[type].find({}, { education: 1 }, function(err, res) {
        if (err) return callback(err);
        let result = [],
            labels = []; // education levels seen so far (renamed so it no longer shadows the type parameter)
        // Count the documents for each education level
        res.forEach(item => {
            if (labels.includes(item.education)) {
                result[labels.indexOf(item.education)].count++;
            } else {
                labels.push(item.education);
                result.push({
                    label: item.education,
                    count: 1
                });
            }
        });
        callback(null, result);
    });
};
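
For comparison with the crawler code, here is the same counting logic sketched in Python. It assumes each document is a dict with an education field, as stored by the crawler; `count_by_label` is a hypothetical helper and the sample values are made up:

```python
from collections import Counter


def count_by_label(documents, field='education'):
    """Count how many documents carry each value of `field`, in first-seen order."""
    counts = Counter(doc[field] for doc in documents)
    return [{'label': label, 'count': count} for label, count in counts.items()]


if __name__ == '__main__':
    docs = [{'education': 'Bachelor'}, {'education': 'Associate'}, {'education': 'Bachelor'}]
    print(count_by_label(docs))
```

The result has the same shape as the array returned by get_education: one `{label, count}` entry per distinct value.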

db.js

const mongoose = require('mongoose');
const DB_URL = 'mongodb://localhost:27017/db';
// Connect to the database
mongoose.connect(DB_URL, { useNewUrlParser: true });

const Schema = mongoose.Schema;

// All collections
let collections = ['web', 'Python', 'PHP', 'Java', 'C++', 'C#', 'C'];
let model = {};

// Create one model per collection
collections.forEach(collection => {
    let UserSchema = new Schema(
        {
            positionName: { type: String }, // job title
            workYear: { type: String }, // years of experience
            salary: { type: String }, // salary
            education: { type: String }, // education level
            companySize: { type: String }, // company size
            companyFullName: { type: String }, // company name
            formatCreateTime: { type: String }, // posting time
            positionId: { type: Number } // id
        },
        {
            collection: collection
        }
    );
    let web_model = mongoose.model(collection, UserSchema);
    model[collection] = web_model;
});

exports.model = model;

Then run server.js: in cmd, enter node server.js

Once it is running, open localhost:8989/get_education?type=Python in a browser to see the data.

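Part 4: Data analysis with ECharts in Taro

Tip: background knowledge

1. WeChat Mini Programs (official documentation)
2. React syntax (official documentation)
3. Using Taro (official documentation)
4. Using ECharts (quickstart, official examples)
5. Using ECharts in a WeChat Mini Program (quickstart)

The Mini Program source code is on GitHub.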

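Tip:

1. First, scrape proxies into the table collection
2. Second, validate the proxies so that the ip collection has data
3. Finally, run the crawler to scrape the job data
4. Write a scheduled task that repeats the first three steps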
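
The scheduled task in step 4 can be sketched as a simple loop. This is a minimal illustration, not the author's implementation: the three steps are passed in as callables (placeholders here), and `run_cycle`, `run_forever`, and `max_cycles` are hypothetical names:

```python
import time


def run_cycle(steps):
    """Run the steps in order; stop the cycle early if one of them raises."""
    done = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            print(f'{name} failed: {exc}')
            break
        done.append(name)
    return done


def run_forever(steps, interval_seconds, max_cycles=None):
    """Repeat run_cycle every interval_seconds; max_cycles=None means no limit."""
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        run_cycle(steps)
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            time.sleep(interval_seconds)


if __name__ == '__main__':
    steps = [
        ('scrape proxies', lambda: print('step 1')),
        ('validate proxies', lambda: print('step 2')),
        ('scrape jobs', lambda: print('step 3')),
    ]
    run_forever(steps, interval_seconds=0, max_cycles=2)
```

Stopping a cycle at the first failing step keeps the job crawler from running against an empty or unvalidated proxy pool.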

The project source code is on GitHub; stars are welcome.
