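Characteristics of the film and TV subtitle chat corpus: it lists film and TV dialogue one sentence per line, more than thirty million lines of Chinese, and the line that follows any given line is very likely a good reply to it. A question can have many possible replies; ranking all of them by relevance and by the chat history to find the best one is a search-and-ranking process.
lucene + ik. Lucene is a free, open-source search engine library written in Java. ik (IKAnalyzer) is an open-source Chinese word segmentation tool. The pipeline: segment the corpus and build an index, run a text-relevance search against it, take the line following each hit as the candidate answer set, then rank the answers and analyze the question.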
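The idea that each line answers the one before it can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `adjacent_pairs` and the toy corpus are made up here, and the real indexing is done by the Java Indexer.

```python
def adjacent_pairs(lines):
    """Yield (question, answer) pairs from consecutive non-empty lines."""
    last = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are skipped
        if last:
            yield last, line  # the previous line is the question, this one the answer
        last = line


# Hypothetical toy corpus, one sentence per line
corpus = ["How are you?", "", "Fine, and you?", "Not bad."]
pairs = list(adjacent_pairs(corpus))
for question, answer in pairs:
    print(question, "->", answer)
```

Every line except the first and the last appears twice, once as an answer and once as a question, which is why a corpus of N non-empty lines yields N - 1 indexed pairs.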
Building the index. Create a Maven project in Eclipse. Maven generates a pom.xml file that holds the package dependency configuration; add these dependencies inside the dependencies tag:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.10.4</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.10.4</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.10.4</version>
</dependency>
<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-all</artifactId>
    <version>5.0.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.1.41</version>
</dependency>
Add the following configuration inside the project tag so that the dependency jars are copied to the lib directory automatically:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>prepare-package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        <overWriteReleases>false</overWriteReleases>
                        <overWriteSnapshots>false</overWriteSnapshots>
                        <overWriteIfNewer>true</overWriteIfNewer>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <mainClass>theMainClass</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
Download the ik source code from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/ik-analyzer/IK%20Analyzer%202012FF_hf1_source.rar , copy its src/org directory into src/main/java of the chatbotv1 project, and refresh the Maven project.
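Maven generates App.java in the com.shareditor.chatbotv1 package; rename it to Indexer.java and change it to the following (the corpus path and the index path are taken from the two command-line arguments):
package com.shareditor.chatbotv1;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class Indexer {
    public static void main(String[] args) throws Exception {
        String corpusPath = args[0];
        String indexPath = args[1];

        // IKAnalyzer(true) turns on ik's smart segmentation mode
        Analyzer analyzer = new IKAnalyzer(true);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        iwc.setUseCompoundFile(true);
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File(indexPath)), iwc);

        BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(corpusPath), "UTF-8"));
        String line = "";
        String last = "";
        long lineNum = 0;
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (0 == line.length()) {
                continue;
            }
            if (!last.equals("")) {
                // The previous line is the searchable question, the current line its stored answer
                Document doc = new Document();
                doc.add(new TextField("question", last, Store.YES));
                doc.add(new StoredField("answer", line));
                indexWriter.addDocument(doc);
            }
            last = line;
            lineNum++;
            if (lineNum % 100000 == 0) {
                System.out.println("add doc " + lineNum);
            }
        }
        br.close();

        // Merge everything into a single segment before closing
        indexWriter.forceMerge(1);
        indexWriter.close();
    }
}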
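After compiling, copy all files under src/main/resources to the target directory, then run the following from the target directory (the lib/* wildcard puts every jar in lib on the classpath):
java -cp "$CLASSPATH:./lib/*:./chatbotv1-0.0.1-SNAPSHOT.jar" com.shareditor.chatbotv1.Indexer ../../subtitle/raw_subtitles/subtitle.corpus ./index
The generated index directory, index, can be inspected with lukeall-4.9.0.jar.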
Search service. An HTTP server is built with netty; the code is in the chatbotv1 directory of https://github.com/warmheartli/ChatBotCourse . The core of the search handler is:
Analyzer analyzer = new IKAnalyzer(true);
QueryParser qp = new QueryParser(Version.LUCENE_4_9, "question", analyzer);
// Excerpt: q, query, collector, topDocs, indexSearcher and ret are set up elsewhere in the handler

// First try a strict query in which every term must match
if (topDocs.totalHits == 0) {
    qp.setDefaultOperator(Operator.AND);
    query = qp.parse(q);
    System.out.println(query.toString());
    indexSearcher.search(query, collector);
    topDocs = collector.topDocs();
}

// If nothing matched, fall back to a query in which any term may match
if (topDocs.totalHits == 0) {
    qp.setDefaultOperator(Operator.OR);
    query = qp.parse(q);
    System.out.println(query.toString());
    indexSearcher.search(query, collector);
    topDocs = collector.topDocs();
}

ret.put("total", topDocs.totalHits);
ret.put("q", q);

// Return every hit with its question, answer, score and document id
JSONArray result = new JSONArray();
for (ScoreDoc d : topDocs.scoreDocs) {
    Document doc = indexSearcher.doc(d.doc);
    String question = doc.get("question");
    String answer = doc.get("answer");
    JSONObject item = new JSONObject();
    item.put("question", question);
    item.put("answer", answer);
    item.put("score", d.score);
    item.put("doc", d.doc);
    result.add(item);
}
ret.put("result", result);
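The AND-then-OR fallback in the excerpt above can be illustrated without Lucene. The sketch below is a toy stand-in, not the server's implementation: `search`, `search_with_fallback` and the in-memory list of questions are hypothetical, and matching is plain whitespace-token membership instead of Lucene scoring.

```python
def search(index, terms, require_all):
    """Return the questions that contain all terms (AND) or any term (OR)."""
    match = all if require_all else any
    return [q for q in index if match(t in q.split() for t in terms)]


def search_with_fallback(index, terms):
    hits = search(index, terms, require_all=True)  # strict AND query first
    if not hits:
        hits = search(index, terms, require_all=False)  # relax to OR if nothing matched
    return hits


# Hypothetical toy "index" of already segmented questions
index = ["where are you going", "are you hungry", "good morning"]
strict = search_with_fallback(index, ["are", "you"])
relaxed = search_with_fallback(index, ["hungry", "morning"])
print(strict)
print(relaxed)
```

Trying AND first keeps precise matches when they exist; the OR pass only runs when the strict query returns nothing, so a long or unusual input still gets some candidates.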
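Querying the index: the query text is segmented and assembled into a Lucene query, the question field of the index is searched, and the answer field of each match is returned as the candidate set, from which one entry is picked as the reply. The server is accessed over HTTP, for example http://127.0.0.1:8765/?q=hello . Chinese text must be URL-encoded before it is sent, and the Java side decodes it as URL-encoded text. To start the server:
java -cp "$CLASSPATH:./lib/*:./chatbotv1-0.0.1-SNAPSHOT.jar" com.shareditor.chatbotv1.Searcher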
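A minimal client-side sketch of the two steps described above, URL-encoding the query and picking one answer out of the returned candidates. The helper names `build_query_url` and `best_answer` are made up for illustration; the JSON fields (total, q, result, answer) follow the server excerpt, and the sample response is hand-written rather than real server output.

```python
import json
from urllib.parse import quote


def build_query_url(q, host="127.0.0.1", port=8765):
    # Percent-encode the UTF-8 bytes of the query so Chinese text survives the URL
    return "http://%s:%d/?q=%s" % (host, port, quote(q, safe=""))


def best_answer(response_text):
    """Pick the first candidate's answer from the JSON the server returns."""
    res = json.loads(response_text)
    if res.get("total", 0) > 0 and res.get("result"):
        return res["result"][0]["answer"]
    return ""


url = build_query_url("你好")
print(url)

# Hand-written sample in the shape the server excerpt produces
sample = json.dumps({
    "total": 1,
    "q": "hello",
    "result": [{"question": "hello", "answer": "hi there", "score": 1.0, "doc": 0}],
})
print(best_answer(sample))
```

Taking the first candidate is the simplest possible choice; a fuller ranking step could reorder the candidates before picking one.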
Chat interface. It needs a box that displays the conversation, for which ckeditor is used because it can render HTML content, plus an input field and a send button. The HTML:
<div class="col-sm-4 col-xs-10">
    <div class="row">
        <textarea id="chatarea">
            <div style='color: blue; text-align: left; padding: 5px;'>Robot: Hey there, you finally decided to chat with me. Come on, let's talk, I'm up for anything!</div>
            <div style='color: blue; text-align: left; padding: 5px;'>Robot: What? You ask how I got so smart at chatting? Because I just ate a pile of film and TV subtitles!</div>
        </textarea>
    </div>
    <br />
    <div class="row">
        <div class="input-group">
            <input type="text" id="input" class="form-control" autofocus="autofocus" onkeydown="submitByEnter()" />
            <span class="input-group-btn">
                <button class="btn btn-default" type="button" onclick="submit()">Send</button>
            </span>
        </div>
    </div>
</div>
<script type="text/javascript">
    CKEDITOR.replace('chatarea', {
        readOnly: true,
        toolbar: ['Source'],
        height: 500,
        removePlugins: 'elementspath',
        resize_enabled: false,
        allowedContent: true
    });
</script>
Calling the chat server requires a controller that sends the request and fetches the result:
public function queryAction(Request $request)
{
    $q = $request->get('input');
    $opts = array(
        'http' => array(
            'method' => "GET",
            'timeout' => 60,
        )
    );
    $context = stream_context_create($opts);
    $clientIp = $request->getClientIp();
    // Forward the URL-encoded query to the chat server
    $response = file_get_contents('http://127.0.0.1:8765/?q=' . urlencode($q) . '&clientIp=' . $clientIp, false, $context);
    $res = json_decode($response, true);
    $total = $res['total'];
    $result = '';
    if ($total > 0) {
        // Use the first candidate as the reply
        $result = $res['result'][0]['answer'];
    }
    return new Response($result);
}
Route configuration for the controller:
chatbot_query:
    path: /chatbot/query
    defaults: { _controller: AppBundle:ChatBot:query }
The chat server can take a while to respond, so to keep the web page from freezing, submit sends the request and receives the result asynchronously:
var xmlHttp;

function submit() {
    if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    }
    var input = $("#input").val().trim();
    if (input == '') {
        jQuery('#input').val('');
        return;
    }
    addText(input, false);
    jQuery('#input').val('');
    // encodeURIComponent also escapes &, = and +, which encodeURI would leave in the form value
    var datastr = "input=" + encodeURIComponent(input);
    var url = "/chatbot/query";
    xmlHttp.open("POST", url, true);
    xmlHttp.onreadystatechange = callback;
    xmlHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    xmlHttp.send(datastr);
}

function callback() {
    // readyState 4 means the response is complete
    if (xmlHttp.readyState == 4 && xmlHttp.status == 200) {
        var responseText = xmlHttp.responseText;
        addText(responseText, true);
    }
}
addText appends a piece of text to the ckeditor box:
function addText(text, is_response) {
    var oldText = CKEDITOR.instances.chatarea.getData();
    var prefix = '';
    if (is_response) {
        // Robot replies: blue, left-aligned
        prefix = "<div style='color: blue; text-align: left; padding: 5px;'>Robot: "
    } else {
        // User messages: dark green, right-aligned
        prefix = "<div style='color: darkgreen; text-align: right; padding: 5px;'>Me: "
    }
    CKEDITOR.instances.chatarea.setData(oldText + "" + prefix + text + "</div>");
}
Code: https://github.com/warmheartli/ChatBotCourse and https://github.com/warmheartli/shareditor.com
Demo: http://www.shareditor.com/chatbot/
Driving traffic. Start with the site's traffic statistics: the cnzz report on visited pages over the last half month shows which pages users concentrate on. Then add a floating image button: to attract clicks, place a small animated icon in the lower-right corner of every page that stays fixed while the page scrolls and takes the user straight to the page you want to drive traffic to. Example code can be found by searching for floating customer-service widget code. Create a JS file, lrtk.js:
$(function() {
    // Build the floating menu and wrap it in a link to the chatbot page
    var tophtml = "<a href=\"http://www.shareditor.com/chatbot/\" target=\"_blank\"><div id=\"izl_rmenu\" class=\"izl-rmenu\"><div class=\"btn btn-phone\"></div><div class=\"btn btn-top\"></div></div></a>";
    $("#top").html(tophtml);
    $("#izl_rmenu").each(function() {
        $(this).find(".btn-phone").mouseenter(function() {
            $(this).find(".phone").fadeIn("fast");
        });
        $(this).find(".btn-phone").mouseleave(function() {
            $(this).find(".phone").fadeOut("fast");
        });
        $(this).find(".btn-top").click(function() {
            $("html, body").animate({
                "scroll-top": 0
            }, "fast");
        });
    });

    // Show or hide the button as the page scrolls
    var lastRmenuStatus = false;
    $(window).scroll(function() {
        var _top = $(window).scrollTop();
        if (_top >= 0) {
            $("#izl_rmenu").data("expanded", true);
        } else {
            $("#izl_rmenu").data("expanded", false);
        }
        if ($("#izl_rmenu").data("expanded") != lastRmenuStatus) {
            lastRmenuStatus = $("#izl_rmenu").data("expanded");
            if (lastRmenuStatus) {
                $("#izl_rmenu .btn-top").slideDown();
            } else {
                $("#izl_rmenu .btn-top").slideUp();
            }
        }
    });
});
The first half sets the content of the div with id=top: a div with id izl_rmenu, whose CSS is defined in a separate file, lrtk.css:
.izl-rmenu {
    position: fixed;
    left: 85%;
    bottom: 10px;
    padding-bottom: 73px;
    z-index: 999;
}
.izl-rmenu .btn {
    width: 72px;
    height: 73px;
    margin-bottom: 1px;
    cursor: pointer;
    position: relative;
}
.izl-rmenu .btn-top {
    background: url(http://www.shareditor.com/uploads/media/default/0001/01/thumb_416_default_big.png) 0px 0px no-repeat;
    background-size: 70px 70px;
    display: none;
}
The second half expands the div when the page scrolls.
Add the following to the shared code used by all pages:
<div id="top"></div>
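To put the huge corpus to use for LSTM-RNN training, the Chinese text has to be converted into vectors that an algorithm can work with, and word2vec is a widely used word embedding tool for this.
word2vec takes a segmented text file as input. The subtitle corpus consists of complete sentences separated by newlines, so we segment it first with word_segment.py: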
# coding:utf-8
import sys

import jieba


def segment(input_path, output_path):
    # Read the corpus line by line and write each line as space-separated tokens
    with open(input_path, "r", encoding="utf-8") as input_file, \
            open(output_path, "w", encoding="utf-8") as output_file:
        for line in input_file:
            seg_list = jieba.cut(line.strip())
            output_file.write(" ".join(seg_list) + "\n")


if __name__ == '__main__':
    if 3 != len(sys.argv):
        print("Usage: ", sys.argv[0], "input output")
        sys.exit(-1)
    segment(sys.argv[1], sys.argv[2])
Usage:
python word_segment.py subtitle/raw_subtitles/subtitle.corpus segment_result
Generating word vectors with word2vec. word2vec is available at https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec ; run make to build the binaries, then run:
./word2vec -train ../segment_result -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
This produces vectors.bin, the word vectors in binary format. word2vec ships with a distance tool that can be used to check them:
./distance vectors.bin
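As far as I know, the distance tool ranks the other words by cosine similarity to the word you type. The sketch below shows that measure on a hand-made toy dictionary; the three-dimensional vectors and the helper names `cosine` and `nearest` are assumptions for illustration, not part of word2vec.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, vectors, topn=3):
    """Return up to topn other words, most similar first."""
    target = vectors[word]
    scored = [(cosine(target, v), w) for w, v in vectors.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:topn]]


# Hypothetical toy vectors; real ones come from vectors.bin
toy = {
    "really": [1.0, 0.9, 0.0],
    "truly": [0.9, 1.0, 0.1],
    "banana": [0.0, 0.1, 1.0],
}
print(nearest("really", toy, topn=2))
```

If the trained vectors are sensible, the neighbours of a word should be words used in similar contexts, which is a quick sanity check before feeding the vectors into a model.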
Loading the word-vector binary file. The binary file produced by word2vec starts with a header line containing the vocabulary size, a space, and the vector dimension; after that, each word is followed by a space and its vector stored as 4-byte floats. A Python script, word_vectors_loader.py, that loads this file:
# coding:utf-8
import struct
import sys

import numpy as np

FLOAT_SIZE = 4  # each weight is stored as a 4-byte float


def load_vectors(path):
    print("begin load vectors")
    word_vector = {}
    with open(path, "rb") as input_file:
        # Header line: vocabulary size and vector dimension
        words, size = (int(x) for x in input_file.readline().split())
        print("words =", words)
        print("size =", size)

        for _ in range(words):
            # Read one word byte by byte, up to the separating space
            chars = bytearray()
            while True:
                c = input_file.read(1)
                if not c or c == b" ":
                    break
                if c != b"\n":
                    chars.extend(c)
            word = chars.decode("utf-8", errors="ignore")

            # Read the word's vector: size floats in a row
            data = input_file.read(FLOAT_SIZE * size)
            vector = np.array(struct.unpack("%df" % size, data))

            # Store the word and its vector in the dict
            word_vector[word] = vector

    print("load vectors finish")
    return word_vector


if __name__ == '__main__':
    if 2 != len(sys.argv):
        print("Usage: ", sys.argv[0], "vectors.bin")
        sys.exit(-1)
    d = load_vectors(sys.argv[1])
    print(d['真的'])
Run it as follows:
python word_vectors_loader.py vectors.bin
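One way to check a loader without a full vectors.bin is to write a tiny file in the same layout and read it back. The sketch below assumes the layout described above (a header line, then each word, a space, the floats, and a trailing newline after each vector); `write_vectors`, `read_vectors` and the toy values are hypothetical, and only the standard library is used.

```python
import os
import struct
import tempfile


def write_vectors(path, vectors):
    """Write a dict of word -> list of floats in the assumed binary layout."""
    size = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        f.write(("%d %d\n" % (len(vectors), size)).encode("utf-8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf-8") + b" ")
            f.write(struct.pack("%df" % size, *vec))
            f.write(b"\n")


def read_vectors(path):
    """Read the file back into a dict of word -> list of floats."""
    result = {}
    with open(path, "rb") as f:
        words, size = (int(x) for x in f.readline().split())
        for _ in range(words):
            chars = bytearray()
            while True:
                c = f.read(1)
                if not c or c == b" ":
                    break
                if c != b"\n":  # skip the newline left over from the previous vector
                    chars.extend(c)
            data = f.read(4 * size)
            result[chars.decode("utf-8")] = list(struct.unpack("%df" % size, data))
    return result


# Toy values chosen to be exactly representable as 4-byte floats
toy = {"真的": [0.5, -1.0, 2.0], "ok": [1.0, 0.0, 0.25]}
path = os.path.join(tempfile.mkdtemp(), "toy_vectors.bin")
write_vectors(path, toy)
loaded = read_vectors(path)
print(loaded)
```

Because the vector is read as a fixed number of bytes, a float whose bytes happen to look like a space or a newline cannot be mistaken for a separator.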