Natural language processing (NLP) is a field spanning computer science, artificial intelligence, and linguistics that studies the interaction between computers and human (natural) languages. For NLP beginners, the author has put together a broad overview of NLP task areas, covering many artificial intelligence applications. The selected references and resources emphasize recent deep-learning research, and should give anyone who wants to dig deeper into a particular NLP task a good starting point.
https://github.com/Kyubyong/nlp_tasks#coreference-resolution
- Paper: Automatic Text Scoring Using Neural Networks: https://arxiv.org/abs/1606.04289
- Paper: A Neural Approach to Automated Essay Scoring: http://www.aclweb.org/old_anthology/D/D16/D16-1193.pdf
- Challenge: Kaggle: The Hewlett Foundation: Automated Essay Scoring: https://www.kaggle.com/c/asap-aes
- Project: Enhanced AI Scoring Engine: https://github.com/edx/ease
- Wikipedia: Speech recognition: https://en.wikipedia.org/wiki/Speech_recognition
- Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin: https://arxiv.org/abs/1512.02595
- Paper: WaveNet: A Generative Model for Raw Audio: https://arxiv.org/abs/1609.03499
- Project: A TensorFlow implementation of Baidu's Deep Speech architecture: https://github.com/mozilla/DeepSpeech
- Project: Speech-to-Text-WaveNet: End-to-end sentence-level English speech recognition using DeepMind's WaveNet: https://github.com/buriburisuri/speech-to-text-wavenet
- Challenge: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/
- Data: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/download.html
- Data: CSTR VCTK Corpus: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- Data: LibriSpeech ASR corpus: http://www.openslr.org/12/
- Data: Switchboard-1 Telephone Speech Corpus: https://catalog.ldc.upenn.edu/ldc97s62
- Data: TED-LIUM Corpus: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
- Wikipedia: Automatic summarization: https://en.wikipedia.org/wiki/Automatic_summarization
- Book: Automatic Text Summarization: https://www.amazon.com/Automatic-Text-Summarization-Juan-Manuel-Torres-Moreno/dp/1848216688/ref=sr_1_1?s=books&ie=UTF8&qid=1507782304&sr=1-1&keywords=Automatic+Text+Summarization
- Paper: Text Summarization Using Neural Networks: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.823.8025&rep=rep1&type=pdf
- Paper: Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9414/9520
- Data: Text Analysis Conference (TAC): https://tac.nist.gov/data/index.html
- Data: Document Understanding Conferences (DUC): http://www-nlpir.nist.gov/projects/duc/data.html
- Info: Coreference resolution: https://nlp.stanford.edu/projects/coref.shtml
- Paper: Deep Reinforcement Learning for Mention-Ranking Coreference Models: https://arxiv.org/abs/1609.08667
- Paper: Improving Coreference Resolution by Learning Entity-Level Distributed Representations: https://arxiv.org/abs/1606.01323
- Challenge: CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2012/task-description.html
- Challenge: CoNLL 2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2011/task-description.html
- Paper: Neural Network Translation Models for Grammatical Error Correction: https://arxiv.org/abs/1606.00189
- Challenge: CoNLL 2013 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll13st.html
- Challenge: CoNLL 2014 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll14st.html
- Data: NUS non-commercial research/trial corpus license: http://www.comp.nus.edu.sg/~nlp/conll14st/nucle_license.pdf
- Data: Lang-8 Learner Corpora: http://cl.naist.jp/nldata/lang-8/
- Data: Cornell Movie–Dialogs Corpus: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
- Project: Deep Text Corrector: https://github.com/atpaino/deep-text-corrector
- Product: Deep Grammar: http://deepgrammar.com/
- Paper: Grapheme-to-Phoneme Models for (Almost) Any Language: https://pdfs.semanticscholar.org/b9c8/fef9b6f16b92c6859f6106524fdb053e9577.pdf
- Paper: Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning: https://arxiv.org/pdf/1605.03832.pdf
- Paper: Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion: https://pdfs.semanticscholar.org/26d0/09959fa2b2e18cddb5783493738a1c1ede2f.pdf
- Project: Sequence-to-Sequence G2P toolkit: https://github.com/cmusphinx/g2p-seq2seq
- Data: Multilingual Pronunciation Data: https://drive.google.com/drive/folders/0B7R_gATfZJ2aWkpSWHpXUklWUmM
- Wikipedia: Language identification: https://en.wikipedia.org/wiki/Language_identification
- Paper: Automatic Language Identification Using Deep Neural Networks: https://repositorio.uam.es/bitstream/handle/10486/666848/automatic_lopez-moreno_ICASSP_2014_ps.pdf?sequence=1
- Challenge: 2015 Language Recognition Evaluation: https://www.nist.gov/itl/iad/mig/2015-language-recognition-evaluation
- Wikipedia: Language model: https://en.wikipedia.org/wiki/Language_model
- Toolkit: KenLM Language Model Toolkit (see the scoring sketch below): http://kheafield.com/code/kenlm/
- Paper: Distributed Representations of Words and Phrases and their Compositionality: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- Paper: Character-Aware Neural Language Models: https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewFile/12489/12017
- Data: Penn Treebank: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
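Once an n-gram model has been trained and exported with KenLM, sentences can be scored from Python in a few lines. This is a minimal sketch, assuming the kenlm Python bindings are installed and a trained model file named model.arpa exists (both the install and the file name are assumptions, not part of the list above):

```python
# Minimal sketch: score sentences with a pretrained KenLM n-gram model.
# Assumes `pip install kenlm` and a trained ARPA/binary model ("model.arpa" is
# a hypothetical file name).
import kenlm

model = kenlm.Model("model.arpa")

# Total log10 probability of the sentence, including <s> and </s> markers.
print(model.score("language models assign probabilities to sentences", bos=True, eos=True))

# Per-word breakdown: (log10 prob, length of matched n-gram, out-of-vocabulary flag).
for prob, ngram_length, oov in model.full_scores("a held-out test sentence"):
    print(prob, ngram_length, oov)
```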
- Wikipedia: Lemmatisation: https://en.wikipedia.org/wiki/Lemmatisation
- Toolkit: WordNet Lemmatizer (see the sketch below): http://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer.lemmatize
- Data: Treebank-3: https://catalog.ldc.upenn.edu/ldc99t42
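A minimal usage sketch for the NLTK WordNet lemmatizer listed above. Note that it treats every word as a noun unless a part-of-speech argument is given; downloading the WordNet data beforehand is assumed:

```python
# Minimal sketch: lemmatization with NLTK's WordNet lemmatizer.
# Assumes the WordNet data is available, e.g. via nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("churches"))          # -> "church"  (noun reading, the default)
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"     (verb reading)
```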
- Wikipedia: Lip reading: https://en.wikipedia.org/wiki/Lip_reading
- Paper: Lip Reading Sentences in the Wild: https://arxiv.org/abs/1611.05358
- Paper: https://arxiv.org/abs/1706.05739
- Project: Lip Reading – Cross Audio-Visual Recognition using 3D Convolutional Neural Networks: https://github.com/astorfi/lip-reading-deeplearning
- Data: The GRID audiovisual sentence corpus: http://spandh.dcs.shef.ac.uk/gridcorpus/
- Paper: Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
- Paper: Neural Machine Translation in Linear Time: https://arxiv.org/abs/1610.10099
- Challenge: ACL 2014 Ninth Workshop on Statistical Machine Translation: http://www.statmt.org/wmt14/translation-task.html#download
- Data: OpenSubtitles2016: http://opus.lingfil.uu.se/OpenSubtitles2016.php
- Data: WIT3: Web Inventory of Transcribed and Translated Talks: https://wit3.fbk.eu/
- Data: The QCRI Educational Domain (QED) Corpus: http://alt.qcri.org/resources/qedcorpus/
- Wikipedia: Named-entity recognition: https://en.wikipedia.org/wiki/Named-entity_recognition
- Paper: Neural Architectures for Named Entity Recognition: https://arxiv.org/abs/1603.01360
- Project: OSU Twitter NLP Tools: https://github.com/aritter/twitter_nlp
- Challenge: Named Entity Recognition in Twitter: https://noisy-text.github.io/2016/ner-shared-task.html
- Data: CoNLL-2002 NER corpus: https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002
- Data: CoNLL-2003 NER corpus: https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003
- Paper: Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.650.7199&rep=rep1&type=pdf
- Project: Paralex: Paraphrase-Driven Learning for Open Question Answering: http://knowitall.cs.washington.edu/paralex/
- Data: Microsoft Research Paraphrase Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52398
- Data: Microsoft Research Video Description Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52422&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F38cf15fd-b8df-477e-a4e4-a4680caa75af%2F
- Data: PASCAL Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/pascal-sentences/index.html
- Data: Flickr Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html
- Data: The SICK data set: http://clic.cimec.unitn.it/composes/sick.html
- Data: PPDB: The Paraphrase Database: http://www.cis.upenn.edu/~ccb/ppdb/
- Data: WikiAnswers Paraphrase Corpus: http://knowitall.cs.washington.edu/paralex/wikianswers-paraphrases-1.0.tar.gz
- Wikipedia: Parsing: https://en.wikipedia.org/wiki/Parsing
- Toolkit: The Stanford Parser: A statistical parser: https://nlp.stanford.edu/software/lex-parser.shtml
- Toolkit: spaCy parser (see the sketch below): https://spacy.io/docs/usage/dependency-parse
- Paper: A Fast and Accurate Dependency Parser Using Neural Networks: http://www.aclweb.org/anthology/D14-1082
- Challenge: CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: http://universaldependencies.org/conll17/
- Challenge: CoNLL 2016 Shared Task: Multilingual Shallow Discourse Parsing: http://www.cs.brandeis.edu/~clp/conll16st/
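A minimal dependency-parsing sketch with spaCy, assuming the package and the small English model are installed (`pip install spacy` and `python -m spacy download en_core_web_sm` are assumptions about your environment, not part of the list above):

```python
# Minimal sketch: dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Each token exposes its dependency label and its syntactic head.
for token in doc:
    print(f"{token.text:<6} {token.dep_:<8} head={token.head.text}")
```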
- Wikipedia: Part-of-speech tagging: https://en.wikipedia.org/wiki/Part-of-speech_tagging
- Paper: Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models: https://transacl.org/ojs/index.php/tacl/article/viewFile/837/192
- Data: Treebank-3: https://catalog.ldc.upenn.edu/ldc99t42
- Toolkit: nltk.tag package (see the sketch below): http://www.nltk.org/api/nltk.tag.html
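A minimal sketch of the nltk.tag package in use; it assumes the punkt and averaged_perceptron_tagger resources have already been downloaded (an assumption about your environment):

```python
# Minimal sketch: part-of-speech tagging with NLTK's default tagger.
# Assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

tokens = nltk.word_tokenize("Time flies like an arrow.")
print(nltk.pos_tag(tokens))
# e.g. [('Time', 'NNP'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN'), ('.', '.')]
```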
- Paper: Neural Network Language Model for Chinese Pinyin Input Method Engine: http://aclweb.org/anthology/Y15-1052
- Project: Neural Chinese Transliterator: https://github.com/Kyubyong/neural_chinese_transliterator
- Wikipedia: Question answering: https://en.wikipedia.org/wiki/Question_answering
- Paper: Ask Me Anything: Dynamic Memory Networks for Natural Language Processing: http://www.thespermwhale.com/jaseweston/ram/papers/paper_21.pdf
- Paper: Dynamic Memory Networks for Visual and Textual Question Answering: http://proceedings.mlr.press/v48/xiong16.pdf
- Challenge: TREC Question Answering Task: http://trec.nist.gov/data/qamain.html
- Challenge: SemEval-2017 Task 3: Community Question Answering: http://alt.qcri.org/semeval2017/task3/
- Data: MS MARCO: Microsoft MAchine Reading COmprehension Dataset: http://www.msmarco.org/
- Data: Maluuba NewsQA: https://github.com/Maluuba/newsqa
- Data: SQuAD: 100,000+ Questions for Machine Comprehension of Text: https://rajpurkar.github.io/SQuAD-explorer/
- Data: GraphQuestions: A Characteristic-rich Question Answering Dataset: https://github.com/ysu1989/GraphQuestions
- Data: Story Cloze Test and ROCStories Corpora: http://cs.rochester.edu/nlp/rocstories/
- Data: Microsoft Research WikiQA Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F4495da01-db8c-4041-a7f6-7984a4f6a905%2Fdefault.aspx
- Data: DeepMind Q&A Dataset: http://cs.nyu.edu/~kcho/DMQA/
- Data: QASent: http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz
- Wikipedia: Relationship extraction: https://en.wikipedia.org/wiki/Relationship_extraction
- Paper: A deep learning approach for relationship extraction from interaction context in social manufacturing paradigm: http://www.sciencedirect.com/science/article/pii/S0950705116001210
- Wikipedia: Semantic role labeling: https://en.wikipedia.org/wiki/Semantic_role_labeling
- Book: Semantic Role Labeling: https://www.amazon.com/Semantic-Labeling-Synthesis-Lectures-Technologies/dp/1598298313/ref=sr_1_1?s=books&ie=UTF8&qid=1507776173&sr=1-1&keywords=Semantic+Role+Labeling
- Paper: End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks: http://www.aclweb.org/anthology/P/P15/P15-1109.pdf
- Paper: Neural Semantic Role Labeling with Dependency Path Embeddings: https://arxiv.org/abs/1605.07515
- Challenge: CoNLL-2005 Shared Task: Semantic Role Labeling: http://www.cs.upc.edu/~srlconll/st05/st05.html
- Challenge: CoNLL-2004 Shared Task: Semantic Role Labeling: http://www.cs.upc.edu/~srlconll/st04/st04.html
- Toolkit: Illinois Semantic Role Labeler (SRL): http://cogcomp.org/page/software_view/SRL
- Data: CoNLL-2005 Shared Task: Semantic Role Labeling: http://www.cs.upc.edu/~srlconll/soft.html
- Wikipedia: Sentence boundary disambiguation: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
- Paper: A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001746/
- Toolkit: NLTK Tokenizers (see the sketch below): http://www.nltk.org/_modules/nltk/tokenize.html
- Data: The British National Corpus: http://www.natcorp.ox.ac.uk/
- Data: Switchboard-1 Telephone Speech Corpus: https://catalog.ldc.upenn.edu/ldc97s62
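A minimal sentence-boundary sketch using NLTK's Punkt tokenizer, one of the tokenizers linked above; downloading the punkt model first is assumed:

```python
# Minimal sketch: sentence boundary detection with NLTK's Punkt tokenizer.
# Assumes the model has been fetched via nltk.download("punkt").
from nltk.tokenize import sent_tokenize

text = "Dr. Smith arrived at 5 p.m. and sat down. The meeting had already started."
for sentence in sent_tokenize(text):
    print(sentence)
```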
- Wikipedia: Sentiment analysis: https://en.wikipedia.org/wiki/Sentiment_analysis
- Info: Awesome Sentiment Analysis: https://github.com/xiamx/awesome-sentiment-analysis
- Challenge: Kaggle: UMICH SI650 – Sentiment Classification: https://www.kaggle.com/c/si650winter11#description
- Challenge: SemEval-2017 Task 4: Sentiment Analysis in Twitter: http://alt.qcri.org/semeval2017/task4/
- Project: SenticNet: http://sentic.net/about/
- Data: Multi-Domain Sentiment Dataset (version 2.0): http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
- Data: Stanford Sentiment Treebank: https://nlp.stanford.edu/sentiment/code.html
- Data: Twitter Sentiment Corpus: http://www.sananalytics.com/lab/twitter-sentiment/
- Data: Twitter Sentiment Analysis Training Corpus: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
- Wikipedia: Source separation: https://en.wikipedia.org/wiki/Source_separation
- Paper: From Blind to Guided Audio Source Separation: https://hal-univ-rennes1.archives-ouvertes.fr/hal-00922378/document
- Paper: Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation: https://arxiv.org/abs/1502.04149
- Challenge: Signal Separation Evaluation Campaign (SiSEC): https://sisec.inria.fr/
- Challenge: CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/
- Wikipedia: Speaker recognition: https://en.wikipedia.org/wiki/Speaker_recognition
- Paper: A Novel Scheme for Speaker Recognition Using a Phonetically-Aware Deep Neural Network: https://pdfs.semanticscholar.org/204a/ff8e21791c0a4113a3f75d0e6424a003c321.pdf
- Paper: Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf
- Challenge: NIST Speaker Recognition Evaluation: https://www.nist.gov/itl/iad/mig/speaker-recognition
- Wikipedia: Speech segmentation: https://en.wikipedia.org/wiki/Speech_segmentation
- Paper: Word Segmentation by 8-Month-Olds: When Speech Cues Count More Than Statistics: http://www.utm.toronto.edu/infant-child-centre/sites/files/infant-child-centre/public/shared/elizabeth-johnson/Johnson_Jusczyk.pdf
- Paper: Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings: https://arxiv.org/abs/1603.02845
- Data: CALLHOME Spanish Speech: https://catalog.ldc.upenn.edu/ldc96s35
- Wikipedia: Speech synthesis: https://en.wikipedia.org/wiki/Speech_synthesis
- Paper: WaveNet: A Generative Model for Raw Audio: https://arxiv.org/abs/1609.03499
- Paper: Tacotron: Towards End-to-End Speech Synthesis: https://arxiv.org/abs/1703.10135
- Data: The World English Bible: https://github.com/Kyubyong/tacotron
- Data: LJ Speech Dataset: https://github.com/keithito/tacotron
- Data: Lessac Data: http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/
- Challenge: Blizzard Challenge 2017: https://synsig.org/index.php/Blizzard_Challenge_2017
- Project: The Festvox project: http://www.festvox.org/index.html
- Toolkit: Merlin: The Neural Network (NN) based Speech Synthesis System: https://github.com/CSTR-Edinburgh/merlin
- Wikipedia: Speech enhancement: https://en.wikipedia.org/wiki/Speech_enhancement
- Book: Speech Enhancement: Theory and Practice: https://www.amazon.com/Speech-Enhancement-Theory-Practice-Second/dp/1466504218/ref=sr_1_1?ie=UTF8&qid=1507874199&sr=8-1&keywords=Speech+enhancement%3A+theory+and+practice
- Paper: An Experimental Study on Speech Enhancement Based on Deep Neural Networks: http://staff.ustc.edu.cn/~jundu/Speech%20signal%20processing/publications/SPL2014_Xu.pdf
- Paper: A Regression Approach to Speech Enhancement Based on Deep Neural Networks: https://www.researchgate.net/profile/Yong_Xu63/publication/272436458_A_Regression_Approach_to_Speech_Enhancement_Based_on_Deep_Neural_Networks/links/57fdfdda08aeaf819a5bdd97.pdf
- Paper: Speech Enhancement Based on Deep Denoising Autoencoder: https://www.researchgate.net/profile/Yu_Tsao/publication/283600839_Speech_enhancement_based_on_deep_denoising_Auto-Encoder/links/577b486108ae213761c9c7f8/Speech-enhancement-based-on-deep-denoising-Auto-Encoder.pdf
- Wikipedia: Stemming: https://en.wikipedia.org/wiki/Stemming
- Paper: A Backpropagation Neural Network to Improve Arabic Stemming: http://www.jatit.org/volumes/Vol82No3/7Vol82No3.pdf
- Toolkit: NLTK Stemmers (see the sketch below): http://www.nltk.org/howto/stem.html
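A minimal sketch of the NLTK stemmers linked above, comparing the Porter and Snowball algorithms on a few English words (nothing beyond installing NLTK is required):

```python
# Minimal sketch: stemming with NLTK's Porter and Snowball stemmers.
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["running", "flies", "happily", "generously"]:
    print(f"{word:<12} porter={porter.stem(word):<10} snowball={snowball.stem(word)}")
```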
- Wikipedia: Terminology extraction: https://en.wikipedia.org/wiki/Terminology_extraction
- Paper: Neural Attention Models for Sequence Classification: Analysis and Application to Key Term Extraction and Dialogue Act Detection: https://arxiv.org/pdf/1604.00077.pdf
- Wikipedia: Text simplification: https://en.wikipedia.org/wiki/Text_simplification
- Paper: Aligning Sentences from Standard Wikipedia to Simple Wikipedia: https://ssli.ee.washington.edu/~hannaneh/papers/simplification.pdf
- Paper: Problems in Current Text Simplification Research: New Data Can Help: https://pdfs.semanticscholar.org/2b8d/a013966c0c5e020ebc842d49d8ed166c8783.pdf
- Data: Newsela Data: https://newsela.com/data/
- Wikipedia: Textual entailment: https://en.wikipedia.org/wiki/Textual_entailment
- Project: Textual Entailment with TensorFlow: https://github.com/Steven-Hewitt/Entailment-with-Tensorflow
- Competition: SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge: https://www.cs.york.ac.uk/semeval-2013/task7.html
- Wikipedia: Transliteration: https://en.wikipedia.org/wiki/Transliteration
- Paper: A Deep Learning Approach to Machine Transliteration: https://pdfs.semanticscholar.org/54f1/23122b8dd1f1d3067cf348cfea1276914377.pdf
- Project: Neural Japanese Transliteration—can you do better than SwiftKey™ Keyboard?: https://github.com/Kyubyong/neural_japanese_transliterator
- Wikipedia: Word embedding: https://en.wikipedia.org/wiki/Word_embedding
- Toolkit: Gensim: word2vec (see the sketch below): https://radimrehurek.com/gensim/models/word2vec.html
- Toolkit: fastText: https://github.com/facebookresearch/fastText
- Toolkit: GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/projects/glove/
- Info: Where to get a pretrained model?: https://github.com/3Top/word2vec-api
- Project: Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors
- Project: Polyglot: Distributed word representations for multilingual NLP: https://sites.google.com/site/rmyeid/projects/polyglot
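A minimal word2vec sketch with Gensim, using a made-up toy corpus just to show the API shape; the parameter names follow Gensim 4.x, and both that version assumption and the toy sentences are illustrative rather than part of the original list:

```python
# Minimal sketch: train a tiny word2vec model with Gensim (4.x API assumed).
from gensim.models import Word2Vec

# Toy corpus; in practice, stream pre-tokenized sentences from a real dataset.
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["language", "models", "predict", "the", "next", "word"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(model.wv["language"][:5])                   # first few dimensions of a word vector
print(model.wv.most_similar("language", topn=3))  # nearest neighbours in the toy space
```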
- Info: What is Word Prediction?: http://www2.edc.org/ncip/library/wp/what_is.htm
- Paper: The prediction of character based on recurrent neural network language model: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7960065
- Paper: An Embedded Deep Learning based Word Prediction: https://arxiv.org/abs/1707.01662
- Paper: Evaluating Word Prediction: Framing Keystroke Savings: http://aclweb.org/anthology/P08-2066
- Data: An Embedded Deep Learning based Word Prediction: https://github.com/Meinwerk/WordPrediction/master.zip
- Project: Word Prediction using Convolutional Neural Networks—can you do better than iPhone™ Keyboard?: https://github.com/Kyubyong/word_prediction
- Paper: Neural Word Segmentation Learning for Chinese: https://arxiv.org/abs/1606.04300
- Project: Convolutional neural network for Chinese word segmentation: https://github.com/chqiwang/convseg
- Toolkit: Stanford Word Segmenter: https://nlp.stanford.edu/software/segmenter.html
- Toolkit: NLTK Tokenizers (see the sketch below): http://www.nltk.org/_modules/nltk/tokenize.html
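For languages written with spaces, word segmentation largely reduces to tokenization; a minimal NLTK sketch follows (the punkt download is assumed). Chinese has no spaces between words, so a dedicated segmenter such as the Stanford Word Segmenter listed above is needed instead.

```python
# Minimal sketch: word tokenization with NLTK (assumes nltk.download("punkt")).
from nltk.tokenize import word_tokenize

print(word_tokenize("Word segmentation isn't trivial, even for English."))
# ['Word', 'segmentation', 'is', "n't", 'trivial', ',', 'even', 'for', 'English', '.']
```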
- Wikipedia: Word-sense disambiguation: https://en.wikipedia.org/wiki/Word-sense_disambiguation
- Paper: Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data: http://www.aclweb.org/anthology/D17-1008
- Data: Train-O-Matic Data: http://trainomatic.org/data/train-o-matic-data.zip
- Data: BabelNet: http://babelnet.org/
- Original project page: https://github.com/Kyubyong/nlp_tasks#speech-segmentation
karpathy/char-rnn · GitHub: A character-level RNN text generator. It can automatically generate Shakespeare-style scripts or shell code.
https://github.com/karpathy/char-rnn
phunterlau/wangfeng-rnn · GitHub: A Wang Feng lyrics generator based on char-rnn.
https://github.com/phunterlau/wangfeng-rnn
google/deepdream · GitHub: Draw the world as seen through the eyes of a neural network.
https://github.com/google/deepdream
facebook/MemNN · GitHub: An official implementation of memory networks (MemNN). It can answer questions that require reasoning and comprehension, such as: "Xiao Ming is on the playground; Xiao Wang is in the office; Xiao Ming picks up the football; Xiao Wang walks into the kitchen. Question: where was Xiao Wang before going to the kitchen?"
https://github.com/facebook/MemNN
skaae/lasagne-draw · GitHub: Generate handwritten digits with an RNN.
https://github.com/skaae/lasagne-draw
keras/addition_rnn.py at master · fchollet/keras · GitHub: Teach an RNN to learn the rules of addition automatically (see the sketch below).
https://github.com/keras-team/keras/blob/master/examples/addition_rnn.py
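The idea behind the addition example is sequence-to-sequence learning: a question such as "535+61" is fed in character by character, and the model must emit the characters of "596". The following is a minimal sketch of that setup, not the official Keras example; it assumes TensorFlow 2.x with its bundled Keras API, and all training data is generated on the fly:

```python
# Minimal sketch: train an LSTM to add two numbers given as character strings.
# Assumes TensorFlow 2.x (tensorflow.keras); hyperparameters are illustrative.
import numpy as np
from tensorflow.keras import layers, models

chars = "0123456789+ "
char_idx = {c: i for i, c in enumerate(chars)}
digits, maxlen = 3, 7  # "123+456" is at most 7 characters long

def encode(s, length):
    """One-hot encode a string, padded with spaces to a fixed length."""
    x = np.zeros((length, len(chars)), dtype=np.float32)
    for i, c in enumerate(s.ljust(length)):
        x[i, char_idx[c]] = 1.0
    return x

# Generate random addition problems and their answers.
questions, answers = [], []
for _ in range(20000):
    a, b = np.random.randint(0, 10 ** digits, size=2)
    questions.append(encode(f"{a}+{b}", maxlen))
    answers.append(encode(str(a + b), digits + 1))
x, y = np.array(questions), np.array(answers)

model = models.Sequential([
    layers.LSTM(128, input_shape=(maxlen, len(chars))),  # read the question
    layers.RepeatVector(digits + 1),                      # one step per answer character
    layers.LSTM(128, return_sequences=True),
    layers.Dense(len(chars), activation="softmax"),       # predict each output character
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x, y, batch_size=128, epochs=10, validation_split=0.1)
```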
karpathy/neuraltalk · GitHub: Automatically generate text descriptions for images.
https://github.com/karpathy/neuraltalk
ryankiros/neural-storyteller · GitHub: Tell a story from a picture.
https://github.com/ryankiros/neural-storyteller
karpathy/neuraltalk2 · GitHub: Generate captions for images.
https://github.com/karpathy/neuraltalk2
jcjohnson/neural-style · GitHub: Turn photos into paintings in the style of the masters.
https://github.com/jcjohnson/neural-style
Newmu/dcgan_code · GitHub: Deep convolutional generative adversarial networks (DCGAN) for image generation.
https://github.com/Newmu/dcgan_code
nagadomi/waifu2x · GitHub: Upscale anime-style images with a CNN.
https://github.com/nagadomi/waifu2x
Last year I ran an experiment on automatic video caption generation built on top of Neuraltalk2; the code is now published on GitHub:
GitHub - cgq5/Video-Caption-with-Neuraltalk2: Code release of captioning videos using Neuraltalk2.
https://github.com/cgq5/Video-Caption-with-Neuraltalk2