Building a Telegram Crawler Bot with asyncio

Date: 2019-12-05


Preface

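aiotg lets you build a bot by calling the Telegram API asynchronously. Because I decided to build a bot with crawling features, blocking network requests would be a serious performance bottleneck, and asyncio's non-blocking I/O addresses exactly that problem. This article records how to develop a Telegram bot with aiotg and also explains some aiohttp usage along the way. The project source code is available here. If you find it useful, please give it a star.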

https://t.me/fpicturebot Click the link to try out the bot.

If you have not worked with Telegram bots before, see the Telegram Bots introduction written for developers.

A short aiotg tutorial

1. A minimal bot

You may first want to learn how to create a new bot.

from aiotg import Bot, Chat

config = {
    "api_token": "***********",
    "proxy": "http://127.0.0.1:8118"
}

bot = Bot(**config)

@bot.command(r"/echo (.+)")
def echo(chat: Chat, match):
    return chat.reply(match.group(1))

bot.run()

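The code above is a simple echo bot: when you send /echo followed by some text, the bot replies with that text. One caveat: at the time I did not install aiotg with pipenv (pip), because the package published there was built from the master branch. aiotg calls the Telegram Bot API through aiohttp, and that version offered no proxy option for aiohttp when creating a bot, so local development failed with connection errors caused by unstable network access from mainland China. I therefore used aiotg's proxy branch, downloaded directly from GitHub. Passing a proxy option when creating the Bot object lets you develop and debug through a local proxy.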

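Later I suggested in the aiotg Telegram group that the author merge the proxy branch into master. The author said he had been thinking the same, and remarked that the Russian network is also subject to a lot of censorship and restrictions. The proxy branch no longer exists: aiotg 0.9.9 provides the proxy option, so you can go back to installing aiotg with pipenv.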

2. aiotg's asynchronous features

The whole reason for choosing aiotg is its asynchronous design, so here is a simple example.

import aiohttp
import json
from aiotg import Bot, Chat

with open('token.conf') as f:
    token = json.loads(f.read())

bot = Bot(**token)

@bot.command("/fetch")
async def async_fetch(chat: Chat, match):
    url = "http://www.gamersky.com/ent/111/"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            info = ' version: {}\n status :{}\n method: {}\n url: {}\n'.format(
                resp.version, resp.status, resp.method, resp.url)
            await chat.send_text(info)

bot.run(debug=True)
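
To see why non-blocking requests matter for a crawler bot, here is a minimal sketch that runs several simulated requests concurrently. The names fake_fetch, fetch_all and timed_fetch are hypothetical, and asyncio.sleep stands in for a real aiohttp call, so the sketch runs without a bot token or network access.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for an aiohttp request: wait, then return a label."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return "fetched " + url


async def fetch_all(urls):
    """Start every request at once and collect the results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_fetch(urls):
    """Run fetch_all and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = timed_fetch(["/ent/111/", "/ent/147/", "/ent/198/"])
    print(pages, round(elapsed, 2))
```

Because the waits overlap, the total time stays close to one delay rather than the sum of all delays, which is the property the bot relies on when many users trigger fetches at once.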

3. Custom keyboards

See the official documentation for a full explanation of custom keyboards; what follows is a brief summary.

category.json

[
    {
        "name": "dynamic",
        "title": "動態圖",
        "url": "http://www.gamersky.com/ent/111/"
    },
    {
        "name": "oops",
        "title": "囧圖",
        "url": "http://www.gamersky.com/ent/147/"
    },
    {
        "name": "beauty",
        "title": "福利圖",
        "url": "http://tag.gamersky.com/news/66.html"
    },
    {
        "name": "easy-moment",
        "title": "輕鬆一刻",
        "url": "http://www.gamersky.com/ent/20503/"
    },
    {
        "name": "trivia",
        "title": "冷知識",
        "url": "http://www.gamersky.com/ent/198/"
    },
    {
        "name": "cold-tucao",
        "title": "冷吐槽",
        "url": "http://www.gamersky.com/ent/20108/"
    }
]

main.py

import aiohttp
import json
from aiotg import Bot, Chat

with open('token.json') as t, open('category.json') as c:
    token = json.loads(t.read())
    category = json.loads(c.read())

bot = Bot(**token)

@bot.command("/reply")
async def reply(chat: Chat, match):
    kb = [[item['title']] for item in category]
    keyboard = {
        "keyboard": kb,
        "resize_keyboard": True
    }
    await chat.send_text(text="看看你的鍵盤", reply_markup=json.dumps(keyboard))

bot.run(debug=True)

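To customize the user's keyboard, just pass the reply_markup parameter to the chat's send-message function. reply_markup takes a JSON-serialized object that the official API defines as the ReplyKeyboardMarkup type; its keyboard field is an array of button rows, each row being an array of KeyboardButton.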

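Each member of keyboard represents one row of the keyboard, so you can arrange the layout by changing the number of KeyboardButton items in each row. For example, to show two buttons per row: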

@bot.command("/reply")
async def reply(chat: Chat, match):
    # kb = [[item['title']] for item in category]
    kb, row = [], -1
    for idx, item in enumerate(category):
        if idx % 2 == 0:
            kb.append([])
            row += 1
        kb[row].append(item['title'])
    keyboard = {
        "keyboard": kb,
        "resize_keyboard": True
    }
    await chat.send_text(text="看看你的鍵盤", reply_markup=json.dumps(keyboard))
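
The row-building loop can also be written as a slice-based helper. This is only a sketch: chunk_rows and build_reply_keyboard are hypothetical names, not part of aiotg, and the dictionary simply mirrors the ReplyKeyboardMarkup shape used in this section.

```python
import json


def chunk_rows(titles, per_row=2):
    """Split a flat list of button labels into rows of at most per_row items."""
    return [titles[i:i + per_row] for i in range(0, len(titles), per_row)]


def build_reply_keyboard(titles, per_row=2):
    """Return the JSON string to pass as reply_markup (ReplyKeyboardMarkup shape)."""
    return json.dumps({
        "keyboard": chunk_rows(titles, per_row),
        "resize_keyboard": True,
    })


if __name__ == "__main__":
    print(chunk_rows(["a", "b", "c", "d", "e"], per_row=2))
```

With such a helper, changing the layout is a matter of changing per_row, and an odd number of titles simply leaves a shorter last row.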

4. Inline keyboards and message updates

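An inline keyboard is a keyboard attached to a message. It is made up of inline buttons, each carrying callback data; every click on a button is handled by the corresponding callback function.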

@bot.command("/inline")
async def inlinekeyboard(chat: Chat, match):

    inlinekeyboardmarkup = {
            'type': 'InlineKeyboardMarkup',
            'inline_keyboard': [
                [{'type': 'InlineKeyboardButton',
                  'text': '上一頁',
                  'callback_data': 'page-pre'},
                 {'type': 'InlineKeyboardButton',
                  'text': '下一頁',
                  'callback_data': 'page-next'}]
                ]
            }

    await chat.send_text('請翻頁', reply_markup=json.dumps(inlinekeyboardmarkup))

@bot.callback(r'page-(\w+)')
async def buttonclick(chat, cq, match):
    await chat.send_text('You clicked {}'.format(match.group(1)))

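Sometimes we want to change the content of a message that has already been sent, for example when the user clicks an inline keyboard whose job is to turn pages and update the message. For this we can use editMessageText. With the callback below, clicking the 上一頁 (previous page) button on the inline keyboard above changes the message content.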

@bot.callback(r'page-(\w+)')
async def buttonclick(chat, cq, match):
    await chat.edit_text(message_id=chat.message['message_id'], text="消息被修改了")

5. Inline query mode

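Inline mode feels best suited to group chats: typing @botname followed by a query in a chat shows the query results above the input box. You can return articles, photos, or other result types to the user. The official site has more details.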

@bot.inline
async def inlinemode(iq):
    results = [{
            'type': 'article',
            'id': str(index),
            'title': article['title'],
            'input_message_content': { 'message_text': article['title']},
            'description': f"這裏是{article['title']}"
        } for index, article in enumerate(category)]
    await iq.answer(results)

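Here, when the user types an inline query, the bot returns the available image categories as results of type article. The official API also provides result types such as voice, photo, GIF, video, audio, and sticker; choose whichever fits your needs.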

Implementing the crawler bot

The bot I wrote with aiotg scrapes images from Gamersky (遊民星空).

1. Crawler

The crawler sends web requests with aiohttp and parses the HTML with BeautifulSoup. The core code is as follows:

import re
import aiohttp
from bs4 import BeautifulSoup


def aioget(url):
    return aiohttp.request('GET', url)


def filter_img(tag):
    if tag.name != 'p':
        return False
    try:
        if tag.attrs['align'] == 'center':
            for child in tag.contents:
                if child.name == 'img' or child.name == 'a':
                    return True
        return False
    except KeyError:
        if 'style' in tag.attrs:
            return True
        else:
            return False
            
            
async def fetch_img(url):
    async with aioget(url) as resp:
        resp_text = await resp.text()
        content = BeautifulSoup(resp_text, "lxml")
        imgs = content.find(class_="Mid2L_con").findAll(filter_img)
        results = []
        for img in imgs:
            try:
                results.append({
                    'src':  img.find('img').attrs['src'],
                    'desc': '\n'.join(list(img.stripped_strings))
                })
            except AttributeError:
                continue
        return results

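I wrapped aiohttp's GET request in a small helper to keep things concise. Extracting elements from the HTML is not covered in detail here; it comes down to finding the patterns in the page markup.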
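
As a rough illustration of "finding the patterns", the sketch below pulls image URLs out of centered paragraphs. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra packages; CenteredImageParser, extract_images and the sample markup are hypothetical and only loosely mirror the align="center" rule in filter_img.

```python
from html.parser import HTMLParser


class CenteredImageParser(HTMLParser):
    """Collect src attributes of <img> tags found inside <p align="center">."""

    def __init__(self):
        super().__init__()
        self.in_centered = False  # True while inside a centered paragraph
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.in_centered = attrs.get("align") == "center"
        elif tag == "img" and self.in_centered and "src" in attrs:
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_centered = False


def extract_images(html):
    """Return the image URLs found in centered paragraphs, in document order."""
    parser = CenteredImageParser()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    sample = '<p align="center"><a href="#"><img src="a.gif"></a></p><p><img src="skip.png"></p>'
    print(extract_images(sample))
```

A real page needs the more forgiving matching that BeautifulSoup provides, which is why the project code uses it; the point here is only that the selection rule is a simple structural pattern.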

2. Commands

Commands are registered with aiotg's bot.command decorator. The implementation of /start is shown below.

@bot.command(r"/start")
async def list_category(chat: Chat, match):
    kb, row = [], -1
    for idx, item in enumerate(category["name"]):
        if idx % 2 == 0:
            kb.append([])
            row += 1
        kb[row].append(item)
    keyboard = {
        "keyboard": kb,
        "resize_keyboard": True
    }
    await chat.send_text(text="請選擇你喜歡的圖片類型", reply_markup=json.dumps(keyboard))

Custom keyboards were covered above, so readers can work through this part themselves.

3. Callbacks

The message carries page-switching buttons, each with its own callback data. Clicking a button invokes the matching callback function; here, clicking 下一頁 (next page) turns the page.

The page is then updated; how to update a message was covered above.

@bot.callback(r"page-(?P<name>\w+)-(?P<page>\d+)")
async def change_lists(chat: Chat, cq, match):
    req_name = match.group('name')
    page = match.group('page')
    url = category[req_name]
    text, markup = await format_message(req_name, url, page)
    await chat.edit_text(message_id=chat.message['message_id'], text=text, markup=markup)

Again, a decorator registers the callback function, and chat.edit_text updates the message.

Callbacks are also used to update images: clicking the next-page button replaces the image.
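
The following sketch shows one way page buttons and the page-name-number callback pattern could fit together. The helpers page_markup and parse_page are hypothetical and not taken from the project (format_message is not shown in the article), and this version matches with fullmatch and converts the page to an integer as a design choice of the sketch.

```python
import json
import re

# Same pattern as the callback registered in this section.
PAGE_PATTERN = re.compile(r"page-(?P<name>\w+)-(?P<page>\d+)")


def page_markup(name, page):
    """Build inline-keyboard JSON with previous/next buttons for a page."""
    buttons = []
    if page > 1:  # no "previous" button on the first page
        buttons.append({"text": "上一頁",
                        "callback_data": f"page-{name}-{page - 1}"})
    buttons.append({"text": "下一頁",
                    "callback_data": f"page-{name}-{page + 1}"})
    return json.dumps({"inline_keyboard": [buttons]})


def parse_page(callback_data):
    """Return (name, page) parsed from callback data, or None if it does not match."""
    m = PAGE_PATTERN.fullmatch(callback_data)
    if m is None:
        return None
    return m.group("name"), int(m.group("page"))


if __name__ == "__main__":
    print(page_markup("dynamic", 2))
    print(parse_page("page-dynamic-3"))
```

One thing to watch with this pattern: \w does not match a hyphen, so a name such as easy-moment would not be parsed; either avoid hyphens in names used for callback data or widen the name group to [\w-]+.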

4. Inline query mode

When the user types @botusername followed by a query in the input box, the results are shown above it.

With no query, the bot lists the image categories that can be viewed.

When the user types the first few characters of a category's Chinese title, the bot matches the category and lists its images.

@bot.inline(r"([\u4e00-\u9fa5]+)")
async def inline_name(iq, match):
    req = match.group(1)
    req_name = match_category(req.strip(), category['name'])
    if req_name is None:
        return
    ptype = 'G' if req_name == "dynamic" else 'P'
    results, _ = await fetch_lists(category[req_name])
    c_results = [{
            'type': 'article',
            'id': str(index),
            'title': item['title'],
            'input_message_content': {
                'message_text': '/' + ptype + item['date'] + '_' + item['key']
            },
            'description': item['desc']
        } for index, item in enumerate(results)]
    await iq.answer(c_results)
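
The match_category helper is not shown in the article. A minimal sketch of prefix matching might look like this, assuming the second argument maps each display title to its internal name; the project's real data structure and matching rules may differ.

```python
def match_category(query, titles_to_names):
    """Return the internal name of the first category whose title starts with query.

    titles_to_names is assumed to map a display title to an internal name.
    Returns None when the query is empty or nothing matches.
    """
    if not query:
        return None
    for title, name in titles_to_names.items():
        if title.startswith(query):
            return name
    return None


if __name__ == "__main__":
    names = {"動態圖": "dynamic", "囧圖": "oops", "冷知識": "trivia", "冷吐槽": "cold-tucao"}
    print(match_category("冷吐", names))
```

When several titles share a prefix, this version returns the first one in insertion order, so a longer query is needed to reach the later entries.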

Redis caching

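When a photo is sent to a user, Telegram returns a file_id for that photo. To send the same photo again, just pass the file_id as the photo argument of send_photo. So each time the crawler fetches an image, its file_id is stored in Redis; when a user requests an image that has been fetched before, the bot reads the file_id from Redis and calls send_photo with it.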
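
The caching flow can be sketched as follows. To stay runnable without a Redis server or a bot token, a plain dictionary stands in for Redis and send_photo is a hypothetical coroutine that returns the file_id directly; with the real Bot API the file_id has to be read from the photo sizes in the returned message.

```python
import asyncio


class FileIdCache:
    """In-memory stand-in for Redis: maps an image URL to a Telegram file_id."""

    def __init__(self):
        self._store = {}

    def get(self, url):
        return self._store.get(url)

    def set(self, url, file_id):
        self._store[url] = file_id


async def send_cached_photo(send_photo, cache, url):
    """Send the image at url, reusing a cached file_id when one exists.

    send_photo is a hypothetical coroutine that accepts a URL or a file_id
    and returns the file_id assigned to the photo.
    """
    cached = cache.get(url)
    if cached is not None:
        await send_photo(cached)  # cache hit: Telegram reuses the stored file
        return cached
    file_id = await send_photo(url)  # cache miss: send by URL, remember the id
    cache.set(url, file_id)
    return file_id


async def demo():
    """Send the same image twice and record what was passed to send_photo."""
    sent = []

    async def fake_send_photo(photo):
        sent.append(photo)
        return "FILE123"

    cache = FileIdCache()
    await send_cached_photo(fake_send_photo, cache, "http://example.com/a.jpg")
    await send_cached_photo(fake_send_photo, cache, "http://example.com/a.jpg")
    return sent
```

In the demo the first call passes the URL and the second passes the cached file_id, which is the behavior the Redis layer is meant to provide.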
