Chinese Machine Translation Datasets

Date: 2020-12-13


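Dataset

WMT 2018

AI Challenger (English-Chinese translation; the largest English-Chinese bilingual dataset in the spoken-language domain)

UM-Corpus: A Large English-Chinese Parallel Corpus

OpenSubtitles2016

Methods

AI Challenger 2017 奇遇記 (An AI Challenger 2017 Adventure)

How can machine translation cope with a small amount of data?

Datasets for Natural Language Processing Tasks

keywords: NLP, DataSet, corpus process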

Typical corpus processing steps

The following processing steps are taken from [Mikolov T, et al. Exploiting Similarities among Languages for Machine Translation[J]. Computer Science, 2013.]

Tokenization of text using scripts (from www.statmt.org)
Duplicate sentences were removed
Numeric values were rewritten as a single token
Special characters were removed (such as !?,:)
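The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the paper: a simple regular expression stands in for the tokenizer scripts from www.statmt.org, the placeholder name `<num>` is hypothetical (the list does not name the token), and only the four special characters given as examples are removed. Chinese text would additionally need a word segmenter, since the regular expression keeps a run of Chinese characters as one token.

```python
import re

NUM_TOKEN = "<num>"          # hypothetical placeholder name, not from the paper
SPECIAL_CHARS = set("!?,:")  # only the examples listed above; extend as needed

# A number (optionally with . or , separators), a word, or one punctuation mark.
_TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def preprocess(sentences):
    """Tokenize, drop duplicate sentences, rewrite numbers, remove special characters."""
    seen = set()
    result = []
    for sentence in sentences:
        # Step 1: tokenization (regex stand-in for the statmt.org scripts).
        tokens = _TOKEN_RE.findall(sentence)
        # Step 2: skip empty sentences and sentences already seen.
        key = tuple(tokens)
        if not tokens or key in seen:
            continue
        seen.add(key)
        cleaned = []
        for tok in tokens:
            if _NUMBER_RE.fullmatch(tok):
                # Step 3: every numeric value becomes the same single token.
                cleaned.append(NUM_TOKEN)
            elif tok not in SPECIAL_CHARS:
                # Step 4: tokens that are special characters are dropped.
                cleaned.append(tok)
        result.append(cleaned)
    return result
```

For example, `preprocess(["It costs 3.50 dollars: really?"])` returns `[["It", "costs", "<num>", "dollars", "really"]]`.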

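AI Challenger - English-Chinese Translation Evaluation

Application area: machine translation

The largest English-Chinese bilingual dataset in the spoken-language domain. It provides more than 10 million English-Chinese sentence pairs. All sentence pairs have been manually checked, so the dataset is assured in terms of scale, relevance, and quality.

Training set: 10,000,000 sentences
Validation set (simultaneous interpretation): 934 sentences
Validation set (text translation): 8,000 sentences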

https://challenger.ai/datasets/translation

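WMT (Workshop on Machine Translation)

Application area: machine translation

WMT provides one of the most important public datasets in machine translation. The data is fairly large: depending on the language pair, it generally ranges from millions to tens of millions of sentences.

WMT 2017 website: http://www.statmt.org/wmt17/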

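UN Parallel Corpus

Application area: machine translation

The UN Parallel Corpus consists of official United Nations records and other parliamentary documents that are in the public domain. It contains text written and manually translated between 1990 and 2014, including sentence-aligned text.

The corpus is intended to provide multilingual language resources and to support research and progress in machine translation and other natural language processing tasks. For convenience, it also provides ready-made language-specific bilingual texts and a six-language parallel subcorpus.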
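Sentence-aligned bilingual text is usually consumed as pairs. The sketch below assumes the common plain-text layout of two files with one sentence per line, where line i of one file is the translation of line i of the other. That layout, the function name `read_parallel`, and the file names in the usage note are assumptions for illustration and should be checked against the actual download.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned text files."""
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        # zip_longest pads the shorter file with None, which exposes misalignment.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if s and t:  # skip pairs where either side is empty
                yield s, t
```

A call such as `for en, zh in read_parallel("corpus.en", "corpus.zh"): ...` (hypothetical file names) then streams the pairs without loading either file into memory.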

Introduction: https://conferences.unite.un.org/UNCorpus/zh#introduction

Download: https://conferences.unite.un.org/UNCorpus/zh/DownloadOverview

(At the time of writing, the download repeatedly failed.)

2nd International Chinese Word Segmentation Bakeoff

Application area: Chinese word segmentation

This directory contains the training, test, and gold-standard data
used in the 2nd International Chinese Word Segmentation Bakeoff.

http://sighan.cs.uchicago.edu/bakeoff2005/

20 Newsgroups

Application area: text classification

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

http://qwone.com/~jason/20Newsgroups/

NLPCC 2017 News Headline Classification

Application area: text classification

http://tcci.ccf.org.cn/conference/2017/taskdata.php

Reuters-21578 Text Categorization Collection

Application area: text classification

This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

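Whole-Web News Data (SogouCA)

Application areas: text classification, event detection and tracking, new word discovery, named entity recognition, automatic summarization

News data from 18 channels (domestic, international, sports, society, entertainment, and others) of several news sites, collected from June to July 2012, with the URL and body text provided.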

http://www.sogou.com/labs/resource/ca.php

CMU World Wide Knowledge Base (Web->KB) project

Application area: knowledge extraction

To develop a probabilistic, symbolic knowledge base that mirrors the content of the world wide web. If successful, this will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/

Wikidump

Application area: word embedding

Chinese: https://dumps.wikimedia.org/zhwiki/latest/
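To use a dump for word embeddings, the article text first has to be pulled out of the XML. The following is a minimal sketch using only the Python standard library. It assumes a bz2-compressed MediaWiki XML export in which each `<page>` holds a `<title>` and a `<text>` element; the function name `iter_pages` is hypothetical. The yielded text is raw wikitext, so markup stripping and Chinese word segmentation would still be needed before training.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(dump_path):
    """Yield (title, wikitext) for each <page> in a bz2-compressed MediaWiki XML dump."""
    with bz2.open(dump_path, "rb") as f:
        title, text = None, None
        # iterparse streams the file instead of loading the whole dump.
        for _event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, None
                elem.clear()  # release the finished page's children
```

Iterating with `for title, text in iter_pages(path): ...` processes one page at a time, where `path` points to a locally downloaded dump file.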

GitHub projects

Large Scale Chinese Corpus for NLP (大規模中文自然語言處理語料)

https://github.com/brightmart/nlp_chinese_corpus
