Key Words:
Distributed Speech Recognition (DSR): the recognition functionality of an embedded speech recognition system is hosted on a server (this does not mean distributed servers, but rather that the terminal and the server form a distributed relationship [8])
Network Speech Recognition (NSR): the emphasis is on the network; the terminal transmits the speech signal efficiently in real time and the server does the processing [9]. Nowadays, speech signals from terminals are generally processed by a server or the cloud.
Emotion Speech Recognition (ESR), Spoken Information Retrieval, Speech Recognition, Spoken Term Detection, Speaker Recognition, Voice Control, Language Modeling, Speech Signal Processing / Speech Processing, Speech Enhancement, Robust Speech Recognition, Feature Compensation, Model Compensation, Automatic Speech Recognition (ASR), Speech Separation, Signal Analysis, Acoustic Speech Recognition Systems, Voice Activity Detection (VAD, detecting silence periods during communication to save bandwidth), Acoustic Feature Extraction (AFE)
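The VAD idea above (detect silence periods to save bandwidth) can be sketched with a crude frame-energy threshold. This is only a toy illustration of the concept, not any specific standard front-end; the function name and parameters are mine:

```python
import numpy as np

def simple_vad(signal, frame_len=160, threshold=0.01):
    """Toy energy-based voice activity detection: a frame counts as
    'speech' when its mean energy exceeds a fixed threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # boolean mask, one entry per frame
```

Frames flagged False could then be skipped or coded at a lower bitrate before transmission.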
A Survey of Speech Recognition Technology [1]:
Speech recognition system: acoustic model of speech (training/learning), pattern matching (recognition algorithms) | language model, linguistic processing
Acoustic models: dynamic time warping (DTW), hidden Markov models (HMM), artificial neural network models (ANN)
Language models: rule-based models, statistical models
The main current research difficulties are: (1) Poor adaptability of speech recognition systems, chiefly a strong dependence on the environment. (2) Recognition in high-noise environments is hard, because human pronunciation changes greatly under noise: the voice rises, speech slows, and pitch and formants shift, so new signal analysis and processing methods must be found. (3) How to quantify and model knowledge from linguistics, physiology, and psychology and use it effectively in speech recognition also remains difficult. (4) Our understanding of human auditory perception, knowledge accumulation and learning mechanisms, and the control mechanisms of the brain's nervous system is still very limited, which will inevitably hinder further progress in speech recognition.
Current research hotspots in speech recognition include: robust (noise-robust) speech recognition, speech input devices, refinement of acoustic HMMs, speaker adaptation techniques, large-vocabulary keyword recognition, efficient recognition (search) algorithms, confidence evaluation algorithms, applications of ANNs, language modeling, and deeper natural language understanding.
Speaker Adaptation (SA); Speaker Independent (SI); Speaker Dependent (SD) [SA + SI]
Adaptation: batch, online, instantaneous | supervised vs. unsupervised
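Of the acoustic models listed above, DTW is the simplest to sketch. A minimal, unoptimized version of the classic dynamic-programming recurrence, assuming 1-D feature sequences:

```python
def dtw_distance(x, y):
    """Dynamic time warping: minimum cumulative alignment cost between
    two sequences, allowing stretches and compressions in time."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of: deletion, insertion, match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In real recognizers the scalar cost would be a distance between acoustic feature vectors (e.g., MFCC frames) rather than `abs()` of scalars.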
An Overview of Noise-Robust Automatic Speech Recognition[2]:
Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics.
More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search and interactions with mobile devices (e.g., Siri on iPhone, Bing voice search on Windows Phone, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on Xbox), and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs.
Key Words:
Speech Transcription, Multimedia Information Retrieval, Music Search, Search Engine, Mobile Internet, Music Retrieval, Audio Information Retrieval, Audio Mining, Adaptive Music Retrieval, Music Information Retrieval, Content-based Retrieval, Music Cognition, Music Creation, Music Database Retrieval, Query by Example (QBE), Query by Humming (QBH), Query by Voice (QBV), Audio-visual Speech Recognition, Speech-reading, Multimodal Database, Optical Music Recognition, Instrument Identification, Context-aware Music Retrieval (Content-based Music Retrieval), Music Recommendation, Commercial Music Recommenders, Contextual Music Recommendation and Retrieval
Research methods: fuzzy systems, neural networks, expert systems, genetic algorithms
Cover-version (multi-version) music identification: feature extraction, key invariance, tempo invariance, structure invariance, similarity computation
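Key invariance is commonly handled by transposition: compare 12-bin chroma (pitch-class) vectors under all 12 semitone rotations and keep the best match. A minimal sketch (the function name and the Euclidean distance choice are mine):

```python
import numpy as np

def key_invariant_distance(chroma_a, chroma_b):
    """Distance between two 12-bin chroma vectors, minimized over all
    12 circular shifts so transposed (re-keyed) versions still match."""
    return min(
        float(np.linalg.norm(chroma_a - np.roll(chroma_b, k)))
        for k in range(12)
    )
```

A cover played five semitones higher then scores the same as the original under this distance.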
MIDI (Musical Instrument Digital Interface) format, WAVE (Waveform Audio File Format) format [research usually targets the MIDI format]
Feature Extraction:
Time domain: ACF (autocorrelation function), AMDF (average magnitude difference function), SIFT (simple inverse filter tracking)
Frequency domain: harmonic product spectrum, cepstrum
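As an example of the time-domain methods above, ACF pitch estimation picks the lag with the strongest autocorrelation inside a plausible F0 range. A rough sketch (parameter names and defaults are mine):

```python
import numpy as np

def acf_pitch(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate F0 of one frame via the autocorrelation function (ACF):
    the strongest peak at a lag between sr/fmax and sr/fmin samples."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return sr / lag
```

Real pitch trackers add voicing decisions and peak interpolation; this only shows the core lag search.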
Big Data for Musicology [4]: Automatic Music Transcription (AMT, the process of converting an acoustic musical signal into some form of musical notation). The most popular approach is parallelisation with MapReduce, using the Hadoop framework.
Modeling Concept Dynamics for Large Scale Music Search [5]: DMCM (Dynamic Musical Concept Mixture), SMCH (Stochastic Music Concept Histogram)
The music preprocessing layer extracts multiple acoustic features and maps them into an audio word from a precomputed codebook. The concept dynamics modeling layer derives from the underlying audio words a Stochastic Music Concept Histogram (SMCH), essentially a probability distribution over the high-level concepts.
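A toy illustration of that histogram step (not the actual DMCM/SMCH model from [5], which is probabilistic): map each quantized audio word to a high-level concept and normalize the counts into a distribution:

```python
from collections import Counter

def concept_histogram(audio_words, word_to_concept):
    """Map each audio word to a concept label and normalize the
    counts into a probability distribution over concepts."""
    counts = Counter(word_to_concept[w] for w in audio_words)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}
```

The `word_to_concept` mapping stands in for the paper's learned concept models; here it is just a lookup table.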
Other techniques: Wang J C, Shih Y C, Wu M S, et al. Colorizing tags in tag cloud: a novel query-by-tag music search system[C]// Proceedings of the 19th ACM International Conference on Multimedia. ACM, 2011. [Not closely tied to cloud computing; the emphasis is on clustering and classification. It matches my own taste, and it is quite interesting!]
Acoustic model training
Key Words: text classification, maximum entropy model
Maximum entropy: it reflects a simple principle by which we understand the world: when nothing is known about an event, choose the model whose distribution is as uniform as possible [16]
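That principle is easy to check numerically: among distributions over the same outcomes, the uniform one has the highest Shannon entropy.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# With no constraints beyond summing to 1, the uniform distribution
# maximizes entropy: entropy([0.25]*4) = log(4), and any skewed
# alternative over the same four outcomes scores lower.
```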
Text-processing cases based on cloud platforms:
[Graph Data Processing]
GraphLab: an open-source distributed computing system from CMU; a new parallel framework oriented toward machine learning
Pregel: a distributed graph computation framework proposed by Google, suited to complex machine learning
Horton: developed by Microsoft for graph-data matching
At present, existing open-source frameworks such as Hadoop and Sector/Sphere are generally used to handle speech recognition
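The MapReduce pattern those frameworks implement can be simulated locally in a few lines; this sketches only the programming model, not Hadoop's distributed runtime:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Local simulation of MapReduce: map each record to (key, value)
    pairs, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count as the mapper/reducer pair:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)
```

In an audio-mining setting, records might be audio segments and the mapper a feature extractor, with the same grouping and reduction shape.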
Multimedia Mining:
Image Mining, Video Mining, Audio Mining, Text Mining
To mine audio data, one could convert it into text using speech transcription techniques. Audio data could also be mined directly by using audio information processing techniques and then mining selected audio data. [10]
That is, either convert the audio into text and then apply text mining, or process the acoustic signal directly and then mine the useful audio data.
The text-based approach, also known as large-vocabulary continuous speech recognition (LVCSR), converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. If a word or name is not in the dictionary, the LVCSR system will choose the most similar word it can find. [11]
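The "most similar word" fallback can be illustrated with edit distance over a toy dictionary. A real LVCSR system makes this choice using acoustic and language-model scores; this sketch only conveys the idea:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_word(word, dictionary):
    """Pick the dictionary entry with the smallest edit distance."""
    return min(dictionary, key=lambda w: edit_distance(word, w))
```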
When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning." This is what some tools, such as grep, do when searching.
However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive."
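Putting the indexing steps above together: a toy inverted index with stop-word removal and a crude suffix-stripping "stemmer" (the stemming here is deliberately naive, standing in for a real algorithm such as Porter's):

```python
STOP_WORDS = {"the", "and", "a", "an", "of"}

def crude_stem(term):
    """Naive suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s", "ed"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_index(docs):
    """Inverted index: stemmed term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term in STOP_WORDS:
                continue  # too common to be useful in searching
            index.setdefault(crude_stem(term), set()).add(doc_id)
    return index
```

A query then touches only the index: look up the stemmed query term and retrieve the matching document ids, never rescanning the documents themselves.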
MIR[15]
|  | Useful | Not useful |
| --- | --- | --- |
| event-scale information (i.e., transcribing individual notes or chords) | instrument detection, QBE, QBH | describing music |
| phrase-level information (i.e., analyzing note sequences for periodicities) | analyzing longer temporal excerpts; tempo detection, playlist sequencing, music summarization |  |
| piece-level information (i.e., analyzing longer excerpts of audio tracks) | a more abstract representation of a music track, the user's perception of music; used for genre detection, content-based music recommenders |  |
Four levels of retrieval tasks [research focuses mainly on the genre, work, and instance levels]:
genre level: searching for rock songs is a task at the genre level
artist level: looking for artists similar to Björk is clearly a task at the artist level
work level: finding cover versions of the song "Let it Be" by The Beatles is a task at the work level
instance level: identifying a particular recording of Mahler's fifth symphony is a task at the instance level
Other Key Words: Activity Recognition, Computational Data Mining, Raw Audio, Clustering, Classification, Regression, Support Vector Machines, KDD (Knowledge Discovery in Databases), Acoustic Vector Sensors (AVS), Direction of Arrival (DOA), Analog-to-Digital Converter (ADC)