Advanced Python Crawler Tutorial: A Quiz Assistant for Million Heroes (百萬英雄)

Date: 2020-04-04

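1. Preface

Most of the tutorials I have seen online rely on OCR. The strength of that approach is generality: it works across different quiz games. The drawbacks are just as clear: it is slow, and third-party OCR services cap the number of calls. This tutorial uses the game's data interface instead. Data is easy to obtain and fast to fetch, but the interface changes over time, so the URL has to be kept up to date.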

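2. Walkthrough

(1) Background

Million Heroes is a quiz app that has become very popular recently. Players who answer all 12 questions correctly split the prize pool. The prizes are decent. I have played a few times, but only ever won small amounts, a few yuan at most. For easier questions, the kind whose answers can be found directly on Baidu, it would be nice to have a tool that suggests an answer in real time. So let's build one!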

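(2) Sneak Peek

Here is how the deployed system works: a server back end does the processing and a web front end displays the result. In my own tests the delay is about 3 seconds.

Why build it this way? Because other people can then use it from a browser too. Sharing the fun beats enjoying it alone!

GitHub repository: https://github.com/Jack-Cherish/python-spider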

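(3) Capturing Traffic from the Xigua Video App

I assume you already know how to capture packets. I covered it in detail in my mobile app packet capture tutorial; if you are not familiar with it, please read that first: http://blog.csdn.net/c406495762/article/details/76850843

During a live quiz, capturing packets reveals the interface that delivers the questions.

The captured URL carries several parameters. The value after heartbeat is a number that increases with each quiz session. Others, such as iid and device_id, are per-user information. At the end of the URL there is an rticket parameter, a timestamp that can be simulated with time.time().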

Update 2018-1-17: according to a friend's report, the only parameters in the URL that matter are heartbeat and rticket; the user information can be left out.
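As a rough sketch of the point above, the two values could be attached to a request URL as follows. The base URL is a made-up placeholder, and passing heartbeat as a query parameter is an assumption; the real layout must come from your own packet capture. Only the millisecond timestamp for rticket follows the article directly.

```python
import time
from urllib.parse import urlencode

# Hypothetical placeholder; the real endpoint must come from your own capture.
BASE_URL = "https://example.invalid/api"


def build_url(heartbeat, now=None):
    """Return BASE_URL with a heartbeat value and a millisecond rticket."""
    if now is None:
        now = time.time()
    # rticket is the current time in milliseconds, as simulated with time.time()
    params = {"heartbeat": heartbeat, "rticket": int(now * 1000)}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_url(1))
```

Generating the timestamp at request time, rather than once at start-up, keeps rticket current for every poll.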

Note: the interface only returns data while a live quiz is in progress. When there is no live show, nothing usable comes back, only garbled text.
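Because the response is only meaningful during a live quiz, it helps to guard the parsing step. The sketch below follows the markers the script in this article relies on (the question starts two characters after a `*` and ends at a question mark). The function name and the sample payload are hypothetical, made up purely for illustration.

```python
def extract_question(raw):
    """Return the question text, or None when no question is present."""
    if "*" not in raw:
        return None  # no live quiz, or no question in this response
    start = raw.index("*")
    # Accept either the full-width or the ASCII question mark.
    ends = [raw.find(mark, start) for mark in ("？", "?")]
    ends = [pos for pos in ends if pos != -1]
    if not ends:
        return None
    # Skip the marker itself and the single character that follows it.
    return raw[start + 2:min(ends)]


if __name__ == "__main__":
    # Invented sample payload, not real interface output.
    sample = "\x08\x01*\x1e中国的首都是哪里？B\x06北京"
    print(extract_question(sample))
```

Returning None instead of raising lets the polling loop simply sleep and try again.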

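Fetch the data through this interface, parse it, then search Baidu Zhidao for the question: simple and efficient. With that idea in place, we can start writing code.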
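The search step boils down to building one URL. A minimal sketch, using the Baidu Zhidao query parameters that appear in the script in this article; the helper name zhidao_url is introduced here for illustration only.

```python
from urllib.parse import urlencode

ZHIDAO_SEARCH = "http://zhidao.baidu.com/search?"


def zhidao_url(question):
    """Build a Baidu Zhidao search URL with a GB2312-encoded query."""
    fixed = "lm=0&rn=10&pn=0&fr=search&ie=gbk&"
    # The page expects GBK-family encoding, so the query is percent-encoded
    # from GB2312 bytes rather than UTF-8.
    return ZHIDAO_SEARCH + fixed + urlencode({"word": question}, encoding="gb2312")


if __name__ == "__main__":
    print(zhidao_url("中国"))
```

A question containing characters outside GB2312 would raise UnicodeEncodeError here; passing errors="ignore" to urlencode is one way to tolerate that.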

# -*- coding: utf-8 -*-
import os
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

"""
The code was written in a hurry. I meant to refactor it and tidy up the
comments before publishing, but I was busy and let it go, so please polish
it yourself. The style is far from standard; bear with me.
Author: Jack Cui
Website: https://cuijiahua.com
Note: this software is for learning and exchange only. Do not use it for any commercial purpose!
"""


class BaiWan():
    def __init__(self):
        # Baidu Zhidao search endpoint
        self.baidu = 'http://zhidao.baidu.com/search?'
        # Million Heroes endpoint. Everyone's URL is different and contains
        # phone information, so it is not published here; capture your own.
        # Questions are welcome at https://cuijiahua.com/liuyan.html
        self.api = 'https://api-spe-ttl.ixigua.com/xxxxxxx={}'.format(int(time.time()*1000))

    # Fetch the current question, parse it and trigger the search
    def get_question(self):
        # Seed question.txt so the first comparison has something to read
        if not os.path.exists('question.txt'):
            with open('question.txt', 'w', encoding='utf-8') as fw:
                fw.write('Million Heroes has not posted a question yet, please wait!')
        while True:
            req = requests.get(self.api, verify=False)
            req.encoding = 'utf-8'
            html = req.text
            print(html)
            # No '*' marker means no question is available yet: poll again
            if '*' not in html:
                time.sleep(1)
                continue
            question_start = html.index('*')
            # The question ends at a full-width or an ASCII question mark
            question_end = html.find('？')
            if question_end == -1:
                question_end = html.find('?')
            if question_end == -1:
                time.sleep(1)
                continue
            question = html[question_start:question_end][2:]
            # Skip the question if it is the one handled last time
            with open('question.txt', 'r', encoding='utf-8') as fr:
                text = fr.readline()
            if text == question:
                time.sleep(1)
                continue
            print(question)
            with open('question.txt', 'w', encoding='utf-8') as f:
                f.write(question)
            # Keep only CJK characters, letters, digits and + - * /
            temp = re.findall(r'[\u4e00-\u9fa5a-zA-Z0-9\+\-\*/]', html[question_end+1:])
            print(temp)
            # The options are delimited by 'B' markers and a trailing 'P'
            b_index = []
            for index, each in enumerate(temp):
                if each == 'B':
                    b_index.append(index)
                elif each == 'P' and (len(temp) - index) <= 3:
                    b_index.append(index)
                    break
            if len(b_index) == 4:
                a = ''.join(temp[b_index[0] + 1:b_index[1]])
                b = ''.join(temp[b_index[1] + 1:b_index[2]])
                c = ''.join(temp[b_index[2] + 1:b_index[3]])
                alternative_answers = [a, b, c]
                # For "which of the following" questions, search the options
                # together with the question. The keywords are Simplified
                # Chinese, matching the language of the quiz text.
                for word in ('下列', '以下'):
                    if word in question:
                        question = a + ' ' + b + ' ' + c + ' ' + question.replace(word, '')
                        break
            else:
                alternative_answers = []
            # Search for an answer using the question and the options
            self.search(question, alternative_answers)
            time.sleep(1)

    def search(self, question, alternative_answers):
        print(question)
        print(alternative_answers)
        infos = {"word": question}
        # Query Baidu Zhidao
        url = self.baidu + 'lm=0&rn=10&pn=0&fr=search&ie=gbk&' + urllib.parse.urlencode(infos, encoding='GB2312')
        print(url)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36',
        }
        sess = requests.Session()
        req = sess.get(url=url, headers=headers, verify=False)
        req.encoding = 'gbk'
        # The 'lxml' parser requires the lxml package to be installed
        bf = BeautifulSoup(req.text, 'lxml')
        answers = bf.find_all('dd', class_='dd answer')
        for answer in answers:
            print(answer.text)
        # Recommended answer
        recommend = ''
        if alternative_answers:
            # Each search result votes for the first option it mentions
            best = []
            for answer in answers:
                for each_answer in alternative_answers:
                    if each_answer in answer.text:
                        best.append(each_answer)
                        print(each_answer)
                        break
            statistics = {}
            for each in best:
                statistics[each] = statistics.get(each, 0) + 1
            # Negation keywords (Simplified Chinese, as in the quiz text)
            errors = ['没有', '不是', '不对', '不正确', '错误', '不包括', '不包含', '不在', '错']
            error_list = [x in question for x in errors]
            print(error_list)
            if any(error_list):
                # Negated question: pick the first option no result mentioned
                for each_answer in alternative_answers:
                    if each_answer not in statistics:
                        recommend = each_answer
                        print('Recommended answer:', recommend)
                        break
            elif statistics:
                recommend = sorted(statistics.items(), key=lambda e: e[1], reverse=True)[0][0]
                print('Recommended answer:', recommend)
        # Write the result to file.txt, which the web back end reads
        with open('file.txt', 'w', encoding='utf-8') as f:
            f.write('Question: ' + question)
            f.write('\n')
            f.write('*' * 50)
            f.write('\n')
            if alternative_answers:
                f.write('Options: ')
                for each_answer in alternative_answers:
                    f.write(each_answer)
                    f.write(' ')
            f.write('\n')
            f.write('*' * 50)
            f.write('\n')
            f.write('Reference answers:\n')
            for answer in answers:
                f.write(answer.text)
                f.write('\n')
            f.write('*' * 50)
            f.write('\n')
            if recommend != '':
                f.write('Please use your own judgment for the final answer!\t')
                f.write('Recommended answer: ' + recommend)


if __name__ == '__main__':
    bw = BaiWan()
    bw.get_question()

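That is all there is to fetching the data and looking up answers. It is simple, and the code is rather messy; experienced developers can rework it along the same lines.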
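One possible way to rework the answer-picking step is to isolate it as a pure function, which makes it easy to test without any network access. The sketch below is not the author's code: the function name recommend, the NEGATIONS tuple and the sample data are introduced here for illustration, with a negation keyword list like the one in the script.

```python
from collections import Counter

# Negation keywords in Simplified Chinese, assumed to match the quiz text.
NEGATIONS = ("没有", "不是", "不对", "不正确", "错误", "不包括", "不包含", "不在", "错")


def recommend(question, options, snippets):
    """Pick an option by counting which one each answer snippet mentions first."""
    votes = Counter()
    for snippet in snippets:
        for option in options:
            if option in snippet:
                votes[option] += 1
                break  # one vote per snippet
    if any(word in question for word in NEGATIONS):
        # Negated question: prefer the first option that no snippet mentioned.
        for option in options:
            if option not in votes:
                return option
        return ""
    return votes.most_common(1)[0][0] if votes else ""


if __name__ == "__main__":
    print(recommend("中国的首都是哪里", ["上海", "北京", "广州"],
                    ["首都是北京", "北京是首都", "上海很大"]))
```

Substring voting is crude: an option that is a substring of another option, or a snippet that mentions an option only to rule it out, will skew the count, so the result should be treated as a hint only.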

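(4) Website Deployment

I had never done back-end or front-end work before. I spent a day learning as I went, and the JavaScript was also picked up on the spot: mostly code found through Baidu and then debugged. There are probably plenty of clumsy or incorrect usages, so please bear with me. Anyone interested is welcome to improve it.

These are some of the articles I read at the time:

Node.js and Socket.IO communication basics: "Learning Node.js as a beginner: Socket.IO instant messaging"

Reading a txt file line by line in Node.js: Line-Reader

Scheduled tasks in Node.js: Node-Schedule

Back end, app.js: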

var http = require('http');
var fs = require('fs');
var schedule = require("node-schedule");
var lineReader = require('line-reader');

// Latest content of file.txt, keyed as line1..line25
var message = {};

// Serve index.html for every request
var server = http.createServer(function (req, res) {
    fs.readFile('./index.html', function (error, data) {
        if (error) {
            res.writeHead(500, {'Content-Type': 'text/plain'});
            res.end('Cannot read index.html');
            return;
        }
        res.writeHead(200, {'Content-Type': 'text/html'});
        res.end(data, 'utf-8');
    });
}).listen(80);
console.log('Server running!');

// Re-read file.txt; slots after the last line are filled with 'f'
function messageGet() {
    var fresh = {};
    var count = 0;
    lineReader.eachLine('file.txt', function (line, last) {
        count++;
        var name = 'line' + count;
        console.log(name);
        console.log(line);
        fresh[name] = line;
        // eachLine is asynchronous, so pad and publish only on the last line
        if (last) {
            for (var i = count + 1; i <= 25; i++) {
                fresh['line' + i] = 'f';
            }
            message = fresh;
        }
    });
}

// listen(server) is the API of the Socket.IO version used at the time
var io = require('socket.io').listen(server);

// Run messageGet every second (valid values for second are 0 to 59)
var rule = new schedule.RecurrenceRule();
var times = [];
for (var i = 0; i < 60; i++) {
    times.push(i);
}
rule.second = times;
schedule.scheduleJob(rule, function () {
    messageGet();
});

io.sockets.on('connection', function (socket) {
    // console.log('User connected' + count + 'user(s) present');
    socket.emit('users', message);
    socket.broadcast.emit('users', message);
    socket.on('disconnect', function () {
        console.log('User disconnected');
        //socket.broadcast.emit('users',message);
    });
});

Front end, index.html:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Jack Cui Quiz Assistant</title>
</head>
<body>
<h1>Million Heroes Quiz Assistant</h1>
<!-- The script below looks up elements with ids line1 to line25. They are
     missing from this copy of the article, so the <p> tags here are a
     reconstruction; the original tag type is unknown. -->
<p id="line1"></p><p id="line2"></p><p id="line3"></p><p id="line4"></p><p id="line5"></p>
<p id="line6"></p><p id="line7"></p><p id="line8"></p><p id="line9"></p><p id="line10"></p>
<p id="line11"></p><p id="line12"></p><p id="line13"></p><p id="line14"></p><p id="line15"></p>
<p id="line16"></p><p id="line17"></p><p id="line18"></p><p id="line19"></p><p id="line20"></p>
<p id="line21"></p><p id="line22"></p><p id="line23"></p><p id="line24"></p><p id="line25"></p>
<!-- Socket.IO client script, served by the Socket.IO server in app.js
     (also reconstructed; this is the standard path) -->
<script src="/socket.io/socket.io.js"></script>
<script>
var socket = io.connect('http://YOUR_IP:PORT');
socket.on('users', function (data) {
    // Slots holding the filler value 'f' (or nothing) are shown as empty
    for (var i = 1; i <= 25; i++) {
        var value = data['line' + i];
        var line = document.getElementById('line' + i);
        // textContent keeps search-result text from being parsed as HTML
        line.textContent = (value === undefined || value == 'f') ? '' : value;
    }
});
</script>
</body>
</html>
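The back end and the front end agree on a simple payload: a flat object with keys line1 to line25, where unused slots carry the filler value 'f'. As an illustration of that contract, here is a minimal Python sketch; build_payload is a hypothetical helper, not part of the author's code.

```python
def build_payload(lines, slots=25, filler="f"):
    """Map text lines to line1..lineN keys, padding unused slots with filler."""
    payload = {}
    for i in range(1, slots + 1):
        # Lines beyond the number of slots are dropped, as the page has no place for them.
        payload["line{}".format(i)] = lines[i - 1] if i <= len(lines) else filler
    return payload


if __name__ == "__main__":
    print(build_payload(["Question: ...", "Options: ..."], slots=3))
```

A fixed number of slots keeps the page logic trivial, at the cost of silently truncating results longer than 25 lines.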

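Deploy these files to the server.

Once deployed, start the Node.js service with:

node app.js

Then run the Python 3 script:

python3 baiwan.py

If everything is set up correctly, the Million Heroes quiz assistant is now up and running!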

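3. Summary

This software is for learning and exchange only; do not use it for any commercial purpose.
You can also make a small change to the code so that Python just prints the information for local viewing, with no need to write a txt file or deploy to a server.
The code is messy and unoptimized, and still needs refactoring.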
