Scraping Google Translate and Baidu Translate to Build a Similar-Sentence Dataset

Project goal

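While researching question-answering systems, I wanted to use deep learning to capture sentence semantics and compute the similarity between two questions. That requires a dataset of similar questions, but such datasets are generally hard to obtain, especially high-quality ones. So I decided to build the data with today's most advanced translation systems.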

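In addition, machine translation is currently one of the more successful applications of NLP. Deep learning works well for translation thanks to the huge, high-quality translation corpora available.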

1. Scraping Google Translate

Notes

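Google Translate does not return a result for a plain requests call. The request needs a tk value, which is generated from the query sentence plus the tkk value.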

  • Getting tkk:

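Fetch the home page with requests; a single re regular expression is then enough to extract the tkk value from the page (previously this had to be done with a JS script).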
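
As an illustration of the regex step, here is a minimal sketch that pulls the tkk value out of page text. The `extract_tkk` helper and the sample page fragment are made up for this example; the pattern is the one used in goole_trans.py, and it assumes the page still embeds the value as TKK='...';.

```python
import re

# Same pattern as in goole_trans.py: captures whatever sits between TKK=' and ';
TKK_PATTERN = re.compile(r"TKK='(.+?)';")


def extract_tkk(page_text):
    """Return the first tkk value found in page_text, or raise ValueError."""
    match = TKK_PATTERN.search(page_text)
    if match is None:
        raise ValueError("tkk not found in page")
    return match.group(1)


# Hypothetical page fragment, for illustration only.
sample_page = "<script>var a=1;TKK='429175.1243284773';var b=2;</script>"
print(extract_tkk(sample_page))
```

Raising an exception when the pattern does not match makes a page layout change fail loudly instead of silently producing a bad tk.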

  • Getting tk

Implemented with a JS script: gettk.js

var b = function (a, b) {
	for (var d = 0; d < b.length - 2; d += 3) {
		var c = b.charAt(d + 2),
			c = "a" <= c ? c.charCodeAt(0) - 87 : Number(c),
			c = "+" == b.charAt(d + 1) ? a >>> c : a << c;
		a = "+" == b.charAt(d) ? a + c & 4294967295 : a ^ c
	}
	return a
}

var tk =  function (a,TKK) {
	//console.log(a,TKK);
	for (var e = TKK.split("."), h = Number(e[0]) || 0, g = [], d = 0, f = 0; f < a.length; f++) {
		var c = a.charCodeAt(f);
		128 > c ? g[d++] = c : (2048 > c ? g[d++] = c >> 6 | 192 : (55296 == (c & 64512) && f + 1 < a.length && 56320 == (a.charCodeAt(f + 1) & 64512) ? (c = 65536 + ((c & 1023) << 10) + (a.charCodeAt(++f) & 1023), g[d++] = c >> 18 | 240, g[d++] = c >> 12 & 63 | 128) : g[d++] = c >> 12 | 224, g[d++] = c >> 6 & 63 | 128), g[d++] = c & 63 | 128)
	}
	a = h;
	for (d = 0; d < g.length; d++) a += g[d], a = b(a, "+-a^+6");
	a = b(a, "+-3^+b+-f");
	a ^= Number(e[1]) || 0;
	0 > a && (a = (a & 2147483647) + 2147483648);
	a %= 1E6;
	return a.toString() + "." + (a ^ h)
}
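
For readers who would rather avoid a JS runtime, the arithmetic in gettk.js can be sketched in Python. This is a hand port, not part of the original project: the names `compute_tk`, `_mix`, and `_to_int32` are invented here, tkk is assumed to have the form "<int>.<int>", and the input is assumed to contain no unpaired surrogates so that Python's UTF-8 encoding matches the byte loop in the JS. Treat it as a sketch and compare its output against gettk.js before relying on it.

```python
def _to_int32(x):
    """Wrap a Python int to a signed 32-bit value, like JS bitwise operators do."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def _mix(a, ops):
    """Port of function b in gettk.js: apply 3-character op codes to a."""
    for i in range(0, len(ops) - 2, 3):
        ch = ops[i + 2]
        shift = ord(ch) - 87 if ch >= "a" else int(ch)
        if ops[i + 1] == "+":
            c = (a & 0xFFFFFFFF) >> shift   # JS: a >>> shift
        else:
            c = _to_int32(a << shift)       # JS: a << shift
        if ops[i] == "+":
            a = _to_int32(a + c)            # JS: a + c & 4294967295
        else:
            a = _to_int32(a ^ c)            # JS: a ^ c
    return a


def compute_tk(text, tkk):
    """Port of function tk in gettk.js."""
    parts = tkk.split(".")
    h = int(parts[0])
    k = int(parts[1]) if len(parts) > 1 else 0
    a = h
    # The JS loop builds the UTF-8 bytes of the text by hand.
    for byte in text.encode("utf-8"):
        a += byte
        a = _mix(a, "+-a^+6")
    a = _mix(a, "+-3^+b+-f")
    a = _to_int32(a ^ k)
    if a < 0:
        a = (a & 0x7FFFFFFF) + 0x80000000
    a %= 1000000
    return "{}.{}".format(a, _to_int32(a ^ h))


print(compute_tk("how are you", "429175.1243284773"))
```

The tkk string in the last line is a made-up placeholder; a real value has to come from the page as described above.
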
  • Final URL

Substitute the tk value and the sentence to translate into the following format

Chinese to English:

https://translate.google.cn/translate_a/single?client=t&sl=zh-CN&tl=en&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=xxxxxx&q=xxxxxxx

English to Chinese:

https://translate.google.cn/translate_a/single?client=t&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=xxxxxx&q=xxxxxxx

The program adds a simple check for whether the input is Chinese or English.
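
The URL format and the language check can be sketched together as follows. The helper names `contains_chinese` and `build_url` are made up for this example, and using `urlencode` is a choice made here rather than what goole_trans.py does (it concatenates strings); the parameter values are copied from the URLs above.

```python
from urllib.parse import quote, urlencode

BASE = "https://translate.google.cn/translate_a/single"
DT_FLAGS = ["at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t"]


def contains_chinese(text):
    """True if any character falls in the CJK range checked by goole_trans.py."""
    return any("\u4e00" <= ch <= "\u9fa5" for ch in text)


def build_url(query, tk):
    """Build the request URL, choosing the direction from the query text."""
    sl, tl = ("zh-CN", "en") if contains_chinese(query) else ("en", "zh-CN")
    params = [("client", "t"), ("sl", sl), ("tl", tl), ("hl", "zh-CN")]
    params += [("dt", flag) for flag in DT_FLAGS]
    params += [
        ("ie", "UTF-8"), ("oe", "UTF-8"), ("source", "bh"),
        ("ssel", "0"), ("tsel", "0"), ("kc", "1"),
        ("tk", tk), ("q", query),
    ]
    # quote_via=quote encodes spaces as %20, matching urllib.parse.quote.
    return BASE + "?" + urlencode(params, quote_via=quote)


# The tk value here is a placeholder, not a valid token.
print(build_url("how are you", "123.456"))
```

Passing the parameters as a list of pairs keeps the repeated `dt` keys, which a dict would collapse into one.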

  • Fetching the result with requests

The response is a three-level nested list; list[0][0][0] is the translation we need.
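
A small sketch of the extraction step, run on a hand-written payload rather than a real response. The payload shape beyond `[0][0][0]` is an assumption, as is the idea that longer inputs come back as several segments in `list[0]`; `first_segment` and `all_segments` are hypothetical helper names.

```python
import json

# Hypothetical response body, written by hand for illustration only.
sample_body = '[[["你好吗","how are you",null,null,1],[null,null,"Ni hao ma"]],null,"en"]'


def first_segment(body):
    """Return list[0][0][0], the value used in goole_trans.py."""
    data = json.loads(body)
    return data[0][0][0]


def all_segments(body):
    """Join the first element of every segment, skipping empty entries.

    Assumption: multi-sentence input is split into several segments.
    """
    data = json.loads(body)
    return "".join(seg[0] for seg in data[0] if seg and seg[0])


print(first_segment(sample_body))
print(all_segments(sample_body))
```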

Program: goole_trans.py

import requests
import urllib.parse
import re
import json
import execjs

class Goole_translate():
    def __init__(self):
        self.url_base = 'https://translate.google.cn'
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'}
        self.get_tkk()

    def get_tkk(self):
        page = requests.get(self.url_base, headers= self.headers )
        tkks = re.findall(r"TKK='(.+?)';", page.text)
        if tkks:
            self.tkk = tkks[0]
            return self.tkk
        else:
            raise ValueError('tkk not found')

    def translate(self, query_string):
        last_url = self.get_last_url(query_string)
        response = requests.get(last_url, headers=self.headers)
        if response.status_code != 200:
            self.get_tkk()
            last_url = self.get_last_url(query_string)
            response = requests.get(last_url, headers=self.headers)

        content = response.content  # bytes
        text = content.decode()  # str; these two steps can be replaced with text = response.text
        dict_text = json.loads(text)  # the response body is JSON
        result = dict_text[0][0][0]
        return result

    def get_tk(self, query_string):
        with open("gettk.js", encoding="utf-8") as f:  # close the file after reading
            tem = execjs.compile(f.read())
        tk = tem.call('tk', query_string, self.tkk)
        return tk

    def query_string(self, query_string):
        '''Percent-encode the string as UTF-8 for use in the URL; a string that is already URL-safe needs no conversion'''
        query_url_trans = urllib.parse.quote(query_string)  # URL-encode Chinese characters as UTF-8
        return query_url_trans

    def get_last_url(self, query_string):
        url_parm = 'sl=en&tl=zh-CN'
        for uchar in query_string:
            if uchar >= u'\u4e00' and uchar <= u'\u9fa5':
                url_parm = 'sl=zh-CN&tl=en'
                break

        url_param_part = self.url_base + "/translate_a/single?"
        url_param = url_param_part + "client=t&"+ url_parm+"&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=btn&ssel=3&tsel=3&kc=0&"
        url_get = url_param + "tk=" + str(self.get_tk(query_string)) + "&q=" + str(self.query_string(query_string))
        return url_get

if __name__=="__main__":
    query_string = 'how are you'
    gt = Goole_translate()
    en = gt.translate(query_string)
    print(en)

Note

Frequent requests may get you blocked (this has not been tested); consider adding delays or using IP proxies.
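
One way to add the delay is a small wrapper around any call, sketched below. `call_with_delay` and its parameters are invented for this example, and the retry with a growing pause is an assumption about what helps, not something the original program does.

```python
import random
import time


def call_with_delay(func, *args, min_delay=1.0, max_delay=3.0,
                    retries=3, sleep=time.sleep, **kwargs):
    """Pause a random time, then call func; retry with a longer pause on error.

    retries must be at least 1. The sleep argument exists so tests can
    replace time.sleep with a stub.
    """
    last_error = None
    for attempt in range(retries):
        # The pause grows with each attempt: 1x, 2x, 3x the random delay.
        sleep(random.uniform(min_delay, max_delay) * (attempt + 1))
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            last_error = exc
    raise last_error
```

With the class above this would be used as `call_with_delay(gt.translate, 'how are you')`. For proxies, requests accepts a `proxies` dict, for example `requests.get(url, headers=headers, proxies={"https": "http://127.0.0.1:8080"})`, where the address is a placeholder.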

References

http://www.cnblogs.com/by-dream/p/6554340.html

https://blog.csdn.net/boyheroes/article/details/78681357

2. Scraping Baidu Translate

  • Auto-detect whether the input is Chinese or English

  • Get the Baidu Translate result

Program: baidu_trans.py

# coding=utf-8
import requests
import json

class Baidu_translate():
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}
        self.lang_detect_url = "https://fanyi.baidu.com/langdetect"
        self.trans_url = "https://fanyi.baidu.com/basetrans"

    def get_lang(self,query_string):
        '''Auto-detect the language'''
        data = {'query':query_string}
        response = requests.post(self.lang_detect_url, data=data, headers=self.headers)
        return json.loads(response.text)['lan']

    def translate(self,query_string):
        '''Translate'''
        lang = self.get_lang(query_string)
        data = {"query":query_string,"from":"zh","to":"en"} if lang== "zh" else {"query":query_string,"from":"en","to":"zh"}
        response = requests.post(self.trans_url, data=data, headers=self.headers)
        result = json.loads(response.text)["trans"][0]["dst"]
        return result


if __name__ == '__main__':
    query_string = 'how are you'
    bt = Baidu_translate()
    en = bt.translate(query_string)
    print(en)

References

https://blog.csdn.net/blues_f/article/details/79319461

3. Chinese-English-Chinese

Translating Chinese to English and back to Chinese produces a corpus of similar question pairs.

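  • q_zh: the Chinese question
  • g_en: Google's Chinese-to-English translation of the question
  • b_zh: Baidu's English-to-Chinese translation of Google's output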

Program: zh_en_zh_translate.py

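'''
Generate paraphrases of Chinese questions through translation to build a dataset of similar question pairs. The data can be used to train semantic similarity models.

Google Translate: Chinese to English
Baidu Translate: English to Chinese

Note: this program does not use IP proxies, so frequent requests may get you blocked. (Only a simple random delay is applied.)
'''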
import time
import random
from goole_trans import Goole_translate
from baidu_trans import Baidu_translate

gt = Goole_translate()
bt = Baidu_translate()

r_file = 'data/zh.txt'
w_file = 'data/zh_en_zh.txt'
fw = open(w_file,'w',encoding='utf8')
with open(r_file,'r',encoding='utf8') as f:
    for line in f:
        r = random.random()*10
        time.sleep(r)
        ls = line.strip().split('\t')
        query_string = ls[0]
        g_en = gt.translate(query_string)
        b_zh = bt.translate(g_en)
        fw.write(query_string+'\t'+g_en+'\t'+b_zh+'\n')

        print('q_zh:',query_string)
        print('g_en:',g_en)
        print('b_zh:',b_zh)
        print('\n')
fw.close()

Output

q_zh: 下週有什麼好產品?
g_en: What are the good products next week?
b_zh: 下週的好產品是什麼?

------------

q_zh: 第一次使用,額度多少?
g_en: What is the amount of the first use?
b_zh: 第一次使用的數量是多少?

------------

q_zh: 我何時能夠經過微粒貸借錢
g_en: When can I borrow money from micro-credit?
b_zh: 我何時能夠從小額信貸中借錢?

------------

q_zh: 借款後多長時間給打電話
g_en: How long does it take to make a call after borrowing?
b_zh: 借錢後打電話須要多長時間?

------------

q_zh: 沒看到微粒貸
g_en: Didn't see the micro-credit
b_zh: 沒有看到小額信貸

------------

q_zh: 原來的手機號不用了,怎麼換
g_en: The original mobile phone number is not used, how to change
b_zh: 原來的手機號碼沒有用,怎麼改
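
To use these results for training, the three-column file written by zh_en_zh_translate.py can be read back as (q_zh, b_zh) pairs. The sketch below assumes the tab-separated layout from that program; the `load_pairs` name and the rule of dropping rows whose back-translation is identical to the original question are choices made here, and the second sample row is made up to show that rule.

```python
def load_pairs(lines):
    """Turn 'q_zh<TAB>g_en<TAB>b_zh' lines into (q_zh, b_zh) pairs."""
    pairs = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 3:
            continue  # skip malformed rows
        q_zh, _g_en, b_zh = cols
        # An identical back-translation adds no new phrasing, so drop it.
        if q_zh and b_zh and q_zh != b_zh:
            pairs.append((q_zh, b_zh))
    return pairs


sample_lines = [
    "下週有什麼好產品?\tWhat are the good products next week?\t下週的好產品是什麼?\n",
    "沒看到微粒貸\tDidn't see it\t沒看到微粒貸\n",  # hypothetical identical round trip
]
print(load_pairs(sample_lines))
```

For the real file, pass the open file object: `with open('data/zh_en_zh.txt', encoding='utf8') as f: pairs = load_pairs(f)`.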

Note

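None of the programs use IP proxies, so frequent requests may get you blocked (only a simple random delay is applied). Any later updates will be posted on GitHub.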
