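While researching question-answering systems, I wanted to use deep learning to capture sentence semantics and compute the similarity between two questions. That requires a dataset of similar questions, but such datasets are generally hard to obtain, especially high-quality ones. So I thought of using the current state-of-the-art translation systems to build the data.
Machine translation is also one of the more successful NLP applications today. Deep learning works well for translation thanks to large, high-quality translation corpora.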
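Google Translate does not return a result for a plain request. It needs a tk value, and the tk value is generated from the query sentence plus a tkk value.
Fetch the home page with requests; a single regular expression is enough to extract the tkk value from it (previously this also had to be done with a JS script; see the references).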
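As a minimal sketch of the regex step, assuming the home page embeds the value as `TKK='<digits>.<digits>';` (the pattern the code in this post relies on; the live page may have changed since), the extraction can be tried offline on a saved snippet. The function name `extract_tkk` and the sample HTML are made up for illustration, and the pattern here is slightly stricter than the one used in the full program.

```python
import re

# Hypothetical snippet standing in for the downloaded home page.
SAMPLE_PAGE = "<script>var a=1;TKK='426151.3141811846';var b=2;</script>"


def extract_tkk(page_text):
    """Return the first TKK value found in page_text, or None if absent."""
    match = re.search(r"TKK='(\d+\.\d+)';", page_text)
    return match.group(1) if match else None


print(extract_tkk(SAMPLE_PAGE))
```

Returning `None` instead of raising lets the caller decide whether a missing token means "retry later" or "the page layout changed".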
The tk value is generated with a JS script, gettk.js:
```javascript
var b = function (a, b) {
    for (var d = 0; d < b.length - 2; d += 3) {
        var c = b.charAt(d + 2),
            c = "a" <= c ? c.charCodeAt(0) - 87 : Number(c),
            c = "+" == b.charAt(d + 1) ? a >>> c : a << c;
        a = "+" == b.charAt(d) ? a + c & 4294967295 : a ^ c
    }
    return a
}

var tk = function (a, TKK) {
    //console.log(a,TKK);
    for (var e = TKK.split("."), h = Number(e[0]) || 0, g = [], d = 0, f = 0; f < a.length; f++) {
        var c = a.charCodeAt(f);
        128 > c ? g[d++] = c : (2048 > c ? g[d++] = c >> 6 | 192 : (55296 == (c & 64512) && f + 1 < a.length && 56320 == (a.charCodeAt(f + 1) & 64512) ? (c = 65536 + ((c & 1023) << 10) + (a.charCodeAt(++f) & 1023), g[d++] = c >> 18 | 240, g[d++] = c >> 12 & 63 | 128) : g[d++] = c >> 12 | 224, g[d++] = c >> 6 & 63 | 128), g[d++] = c & 63 | 128)
    }
    a = h;
    for (d = 0; d < g.length; d++) a += g[d], a = b(a, "+-a^+6");
    a = b(a, "+-3^+b+-f");
    a ^= Number(e[1]) || 0;
    0 > a && (a = (a & 2147483647) + 2147483648);
    a %= 1E6;
    return a.toString() + "." + (a ^ h)
}
```
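To make the JavaScript above easier to follow, here is a hedged Python port of the same arithmetic. It is my own reading of the script, not something taken from Google or from the original program, and it has not been checked against the live service. The names `_int32`, `_mix` and `compute_tk` are illustrative, and the sketch assumes the tkk has the form `"<int>.<int>"`. The JS code first expands the query into UTF-8 bytes, then folds each byte into a running 32-bit value using shift/add/xor instructions encoded as three-character groups.

```python
def _int32(x):
    """Wrap x to a signed 32-bit integer, like JavaScript's ToInt32."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def _mix(a, ops):
    """Port of the JS helper `b`: run 3-character shift/add/xor instructions."""
    for d in range(0, len(ops) - 2, 3):
        ch = ops[d + 2]
        # Shift amount: letters 'a'.. mean 10.., digits mean themselves.
        shift = ord(ch) - 87 if ch >= "a" else int(ch)
        if ops[d + 1] == "+":
            c = (a & 0xFFFFFFFF) >> shift   # JS: a >>> shift (unsigned)
        else:
            c = _int32(a << shift)          # JS: a << shift (signed 32-bit)
        # '+' adds with 32-bit wraparound, anything else xors.
        a = _int32(a + c) if ops[d] == "+" else _int32(a ^ c)
    return a


def compute_tk(text, tkk):
    """Port of the JS function `tk`; assumes tkk looks like '426151.3141811846'."""
    parts = tkk.split(".")
    h = int(parts[0])
    k = int(parts[1]) if len(parts) > 1 else 0
    a = h
    # The JS loop builds the UTF-8 bytes by hand; encode() yields the same bytes.
    for byte in text.encode("utf-8", "surrogatepass"):
        a = _mix(a + byte, "+-a^+6")
    a = _mix(a, "+-3^+b+-f")
    a = _int32(a ^ k)
    if a < 0:
        a = (a & 0x7FFFFFFF) + 0x80000000   # reinterpret as unsigned
    a %= 1000000
    return "{}.{}".format(a, _int32(a ^ h))


print(compute_tk("how are you", "426151.3141811846"))
```

A port like this would remove the dependency on a JavaScript runtime, at the cost of having to be re-checked whenever the script changes.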
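Substitute the tk value and the sentence to translate into the request URL. The two directions differ only in the language parameters:
Chinese to English (sl=zh-CN&tl=en in the code below):
English to Chinese (sl=en&tl=zh-CN in the code below):
The program includes a simple Chinese/English check.
The result comes back as a three-level nested list; list[0][0][0] is the translation we want.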
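Indexing `[0][0][0]` only picks the first segment. As a hedged sketch, assuming the response keeps the shape described here (a JSON array whose first element is a list of segments, each starting with the translated text), and assuming the service may split a longer input into several segments, the pieces could be joined as follows. The sample payload is invented for illustration and is not a captured response.

```python
import json

# Invented payload that mimics the nested-list shape described above.
SAMPLE_RESPONSE = (
    '[[["Hello. ","你好。",null,null,1],'
    '["How are you?","你好嗎?",null,null,1]],null,"zh-CN"]'
)


def first_segment(payload):
    """Equivalent of list[0][0][0]: the first translated segment only."""
    return json.loads(payload)[0][0][0]


def all_segments(payload):
    """Join the translated text of every segment, skipping entries without text."""
    segments = json.loads(payload)[0]
    return "".join(seg[0] for seg in segments if seg and isinstance(seg[0], str))


print(first_segment(SAMPLE_RESPONSE))
print(all_segments(SAMPLE_RESPONSE))
```

For the short single-sentence questions used in this post the two functions should agree, so `[0][0][0]` is sufficient there.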
```python
import json
import re
import urllib.parse

import execjs
import requests


class Goole_translate():
    def __init__(self):
        self.url_base = 'https://translate.google.cn'
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'}
        self.get_tkk()

    def get_tkk(self):
        # Fetch the home page and extract the tkk value with a regex.
        page = requests.get(self.url_base, headers=self.headers)
        tkks = re.findall(r"TKK='(.+?)';", page.text)
        if tkks:
            self.tkk = tkks[0]
            return self.tkk
        else:
            raise ValueError('tkk not found')

    def translate(self, query_string):
        last_url = self.get_last_url(query_string)
        response = requests.get(last_url, headers=self.headers)
        if response.status_code != 200:
            # Refresh the tkk value and retry once.
            self.get_tkk()
            last_url = self.get_last_url(query_string)
            response = requests.get(last_url, headers=self.headers)
        content = response.content  # bytes
        text = content.decode()  # str; these two steps can be replaced by text = response.text
        dict_text = json.loads(text)  # the data is in JSON format
        result = dict_text[0][0][0]
        return result

    def get_tk(self, query_string):
        # Run the tk() function from gettk.js through PyExecJS.
        with open("gettk.js") as f:
            tem = execjs.compile(f.read())
        tk = tem.call('tk', query_string, self.tkk)
        return tk

    def query_string(self, query_string):
        '''Convert the string to a UTF-8 string; not needed for strings already defined as UTF-8.'''
        query_url_trans = urllib.parse.quote(query_string)  # URL-encode Chinese characters as UTF-8
        return query_url_trans

    def get_last_url(self, query_string):
        # Default to English -> Chinese; switch direction if any CJK character is present.
        url_parm = 'sl=en&tl=zh-CN'
        for uchar in query_string:
            if uchar >= u'\u4e00' and uchar <= u'\u9fa5':
                url_parm = 'sl=zh-CN&tl=en'
                break
        url_param_part = self.url_base + "/translate_a/single?"
        url_param = url_param_part + "client=t&" + url_parm + "&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=btn&ssel=3&tsel=3&kc=0&"
        url_get = url_param + "tk=" + str(self.get_tk(query_string)) + "&q=" + str(self.query_string(query_string))
        return url_get


if __name__ == "__main__":
    query_string = 'how are you'
    gt = Goole_translate()
    en = gt.translate(query_string)
    print(en)
```
Frequent requests may get you blocked. I have not tested this; you can add a delay or use an IP proxy.
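One possible way to add the delay is a small retry wrapper, sketched here under my own assumptions: the helper name `fetch_with_retry`, its parameters, and the backoff schedule are illustrative and not part of the original program. The fetch function is passed in so the wrapper can be tried without touching the network.

```python
import random
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times.

    After each failure except the last, wait base_delay * 2**attempt seconds
    plus up to one second of random jitter, then try again.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as err:  # broad on purpose: this is only a sketch
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt) + random.random())
    raise last_error
```

With requests, `fetch` could be something like `lambda u: requests.get(u, headers=headers, proxies={"https": "http://127.0.0.1:8080"}, timeout=10)`, where the proxy address is only a placeholder; `proxies` and `timeout` are standard `requests.get` keyword arguments.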
http://www.cnblogs.com/by-dream/p/6554340.html
https://blog.csdn.net/boyheroes/article/details/78681357
Automatically detect whether the input is Chinese or English
Fetch the Baidu Translate result
```python
# coding=utf-8
import json

import requests


class Baidu_translate():
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}
        self.lang_detect_url = "https://fanyi.baidu.com/langdetect"
        self.trans_url = "https://fanyi.baidu.com/basetrans"

    def get_lang(self, query_string):
        '''Detect the language automatically.'''
        data = {'query': query_string}
        response = requests.post(self.lang_detect_url, data=data, headers=self.headers)
        return json.loads(response.text)['lan']

    def translate(self, query_string):
        '''Translate.'''
        lang = self.get_lang(query_string)
        # Chinese input goes to English; everything else is treated as English to Chinese.
        data = {"query": query_string, "from": "zh", "to": "en"} if lang == "zh" else {"query": query_string, "from": "en", "to": "zh"}
        response = requests.post(self.trans_url, data=data, headers=self.headers)
        result = json.loads(response.text)["trans"][0]["dst"]
        return result


if __name__ == '__main__':
    query_string = 'how are you'
    bt = Baidu_translate()
    en = bt.translate(query_string)
    print(en)
```
https://blog.csdn.net/blues_f/article/details/79319461
A Chinese to English to Chinese round trip can produce a corpus of similar question pairs.
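```python
'''
Generate alternative phrasings of Chinese questions through translation,
to build a dataset of similar question pairs. Usable for training semantic similarity models.
Google Translate: Chinese to English
Baidu Translate: English to Chinese
Note: this program does not configure an IP proxy; beware of being blocked when making
frequent requests. (Only a simple random delay is implemented.)
'''
import random
import time

from goole_trans import Goole_translate
from baidu_trans import Baidu_translate

gt = Goole_translate()
bt = Baidu_translate()

r_file = 'data/zh.txt'
w_file = 'data/zh_en_zh.txt'

with open(r_file, 'r', encoding='utf8') as f, open(w_file, 'w', encoding='utf8') as fw:
    for line in f:
        # Sleep a random 0-10 seconds between requests.
        r = random.random() * 10
        time.sleep(r)
        ls = line.strip().split('\t')
        query_string = ls[0]
        g_en = gt.translate(query_string)  # Chinese -> English via Google
        b_zh = bt.translate(g_en)          # English -> Chinese via Baidu
        fw.write(query_string + '\t' + g_en + '\t' + b_zh + '\n')
        print('q_zh:', query_string)
        print('g_en:', g_en)
        print('b_zh:', b_zh)
        print('\n')
```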
```
q_zh: 下週有什麼好產品?
g_en: What are the good products next week?
b_zh: 下週的好產品是什麼?
------------
q_zh: 第一次使用,額度多少?
g_en: What is the amount of the first use?
b_zh: 第一次使用的數量是多少?
------------
q_zh: 我何時能夠經過微粒貸借錢
g_en: When can I borrow money from micro-credit?
b_zh: 我何時能夠從小額信貸中借錢?
------------
q_zh: 借款後多長時間給打電話
g_en: How long does it take to make a call after borrowing?
b_zh: 借錢後打電話須要多長時間?
------------
q_zh: 沒看到微粒貸
g_en: Didn't see the micro-credit
b_zh: 沒有看到小額信貸
------------
q_zh: 原來的手機號不用了,怎麼換
g_en: The original mobile phone number is not used, how to change
b_zh: 原來的手機號碼沒有用,怎麼改
```
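Some round trips in the sample drift in meaning (額度, "credit limit", comes back as 數量, "quantity"), and a round trip can also come back identical to the input, which adds nothing as a training pair. Below is a hedged sketch of a post-processing pass over the output file, assuming the tab-separated q_zh / g_en / b_zh layout written by the script. The function names and the punctuation list are illustrative choices of mine. It only drops malformed, duplicate, and trivially identical pairs; semantic drift still needs manual review or a separate check.

```python
import string

# Characters ignored when comparing; the full-width marks are an illustrative choice.
_IGNORED = set(string.punctuation + string.whitespace + ",。?!、;:")


def normalize(text):
    """Strip punctuation and whitespace so trivial differences do not count."""
    return "".join(ch for ch in text if ch not in _IGNORED)


def paraphrase_pairs(lines):
    """Return (original, back_translation) pairs worth keeping.

    Skips lines that do not have exactly three tab-separated fields, pairs
    whose two sides are identical after normalization, and repeated pairs.
    """
    pairs = []
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        original, _english, back = fields
        key = (normalize(original), normalize(back))
        if not key[1] or key[0] == key[1] or key in seen:
            continue
        seen.add(key)
        pairs.append((original, back))
    return pairs
```

It could be applied with `paraphrase_pairs(open('data/zh_en_zh.txt', encoding='utf8'))`.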
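None of these programs configure an IP proxy, so beware of being blocked when making frequent requests (only a simple random delay is implemented). Any later updates will be posted on GitHub.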