Key Words:
Distributed Speech Recognition (DSR): the recognition functionality of an embedded speech recognition system is hosted on a server (this does not mean distributed servers, but rather that the terminal and the server form a distributed relationship [8])
Network Speech Recognition (NSR): the emphasis is on the network; the terminal transmits the speech signal efficiently in real time and the server does the processing [9]. Nowadays, speech signals from terminals are generally processed by a server or the cloud.
Emotion Speech Recognition (ESR), Spoken Information Retrieval, Speech Recognition, Spoken Term Detection, Speaker Recognition, Voice Control, Language Modeling, Speech Signal Processing / Speech Processing, Speech Enhancement, Robust Speech Recognition, Feature Compensation, Model Compensation, Automatic Speech Recognition (ASR), Speech Separation, Signal Analysis, Acoustic Speech Recognition Systems, Voice Activity Detection (VAD, detecting silence periods during communication to save bandwidth), Acoustic Feature Extraction (AFE)
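The VAD idea above (detect silence periods to save bandwidth) can be sketched with a crude frame-energy threshold. This is only a toy illustration of the concept, not any specific standard front-end; the function name and parameters are mine:

```python
import numpy as np

def simple_vad(signal, frame_len=160, threshold=0.01):
    """Toy energy-based voice activity detection: a frame counts as
    'speech' when its mean energy exceeds a fixed threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # boolean mask, one entry per frame
```

Frames flagged False could then be skipped or coded at a lower bitrate before transmission.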
A Survey of Speech Recognition Technology [1]:
Speech recognition system: acoustic model of speech (training/learning), pattern matching (recognition algorithms) | language model, linguistic processing
Acoustic models: dynamic time warping (DTW), hidden Markov models (HMM), artificial neural network models (ANN)
Language models: rule-based models, statistical models
The main current research difficulties are: (1) Poor adaptability of speech recognition systems, chiefly a strong dependence on the environment. (2) Recognition in high-noise environments is hard, because human pronunciation changes greatly under noise: the voice rises, speech slows, and pitch and formants shift, so new signal analysis and processing methods must be found. (3) How to quantify and model knowledge from linguistics, physiology, and psychology and use it effectively in speech recognition also remains difficult. (4) Our understanding of human auditory perception, knowledge accumulation and learning mechanisms, and the control mechanisms of the brain's nervous system is still very limited, which will inevitably hinder further progress in speech recognition.
Current research hotspots in speech recognition include: robust (noise-robust) speech recognition, speech input devices, refinement of acoustic HMMs, speaker adaptation techniques, large-vocabulary keyword recognition, efficient recognition (search) algorithms, confidence evaluation algorithms, applications of ANNs, language modeling, and deeper natural language understanding.
Speaker Adaptation (SA); Speaker Independent (SI); Speaker Dependent (SD) [SA + SI]
Adaptation: batch, online, instantaneous | supervised vs. unsupervised
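Of the acoustic models listed above, DTW is the simplest to sketch. A minimal, unoptimized version of the classic dynamic-programming recurrence, assuming 1-D feature sequences:

```python
def dtw_distance(x, y):
    """Dynamic time warping: minimum cumulative alignment cost between
    two sequences, allowing stretches and compressions in time."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of: deletion, insertion, match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In real recognizers the scalar cost would be a distance between acoustic feature vectors (e.g., MFCC frames) rather than `abs()` of scalars.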
An Overview of Noise-Robust Automatic Speech Recognition[2]:
Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics.
More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search and interactions with mobile devices (e.g., Siri on iPhone, Bing voice search on Windows Phone, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on Xbox), and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs.
Key Words:
Speech Transcription, Multimedia Information Retrieval, Music Search, Search Engine, Mobile Internet, Music Retrieval, Audio Information Retrieval, Audio Mining, Adaptive Music Retrieval, Music Information Retrieval, Content-based Retrieval, Music Cognition, Music Creation, Music Database Retrieval, Query by Example (QBE), Query by Humming (QBH), Query by Voice (QBV), Audio-visual Speech Recognition, Speech-reading, Multimodal Database, Optical Music Recognition, Instrument Identification, Context-aware Music Retrieval (Content-based Music Retrieval), Music Recommendation, Commercial Music Recommenders, Contextual Music Recommendation and Retrieval
Research methods: fuzzy systems, neural networks, expert systems, genetic algorithms
Cover-version (multi-version) music identification: feature extraction, key invariance, tempo invariance, structure invariance, similarity computation
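Key invariance is commonly handled by transposition: compare 12-bin chroma (pitch-class) vectors under all 12 semitone rotations and keep the best match. A minimal sketch (the function name and the Euclidean distance choice are mine):

```python
import numpy as np

def key_invariant_distance(chroma_a, chroma_b):
    """Distance between two 12-bin chroma vectors, minimized over all
    12 circular shifts so transposed (re-keyed) versions still match."""
    return min(
        float(np.linalg.norm(chroma_a - np.roll(chroma_b, k)))
        for k in range(12)
    )
```

A cover played five semitones higher then scores the same as the original under this distance.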
MIDI (Musical Instrument Digital Interface) format, WAVE (Waveform Audio File Format) format [research usually targets the MIDI format]
Feature Extraction:
Time domain: ACF (autocorrelation function), AMDF (average magnitude difference function), SIFT (simple inverse filter tracking)
Frequency domain: harmonic product spectrum, cepstrum
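As an example of the time-domain methods above, ACF pitch estimation picks the lag with the strongest autocorrelation inside a plausible F0 range. A rough sketch (parameter names and defaults are mine):

```python
import numpy as np

def acf_pitch(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate F0 of one frame via the autocorrelation function (ACF):
    the strongest peak at a lag between sr/fmax and sr/fmin samples."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return sr / lag
```

Real pitch trackers add voicing decisions and peak interpolation; this only shows the core lag search.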
Big Data for Musicology [4]: Automatic Music Transcription (AMT, the process of converting an acoustic musical signal into some form of musical notation). The most popular approach is parallelisation with MapReduce, using the Hadoop framework.
Modeling Concept Dynamics for Large Scale Music Search [5]: DMCM (Dynamic Musical Concept Mixture), SMCH (Stochastic Music Concept Histogram)
The music preprocessing layer extracts multiple acoustic features and maps them into an audio word from a precomputed codebook. The concept dynamics modeling layer derives from the underlying audio words a Stochastic Music Concept Histogram (SMCH), essentially a probability distribution over the high-level concepts.
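A toy illustration of that histogram step (not the actual DMCM/SMCH model from [5], which is probabilistic): map each quantized audio word to a high-level concept and normalize the counts into a distribution:

```python
from collections import Counter

def concept_histogram(audio_words, word_to_concept):
    """Map each audio word to a concept label and normalize the
    counts into a probability distribution over concepts."""
    counts = Counter(word_to_concept[w] for w in audio_words)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}
```

The `word_to_concept` mapping stands in for the paper's learned concept models; here it is just a lookup table.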
Other techniques: Wang J C, Shih Y C, Wu M S, et al. Colorizing tags in tag cloud: a novel query-by-tag music search system[C]// Proceedings of the 19th ACM International Conference on Multimedia. ACM, 2011. [Not closely tied to cloud computing; the emphasis is on clustering and classification. It matches my own taste, and it is quite interesting!]
Acoustic model training
Key Words: text classification, maximum entropy model
Maximum entropy: it reflects a simple principle by which we understand the world: when nothing is known about an event, choose the model whose distribution is as uniform as possible [16]
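That principle is easy to check numerically: among distributions over the same outcomes, the uniform one has the highest Shannon entropy.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# With no constraints beyond summing to 1, the uniform distribution
# maximizes entropy: entropy([0.25]*4) = log(4), and any skewed
# alternative over the same four outcomes scores lower.
```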
Text-processing cases based on cloud platforms:
[Graph Data Processing]
GraphLab: an open-source distributed computing system from CMU; a new parallel framework oriented toward machine learning
Pregel: a distributed graph computation framework proposed by Google, suited to complex machine learning
Horton: developed by Microsoft for graph-data matching
At present, existing open-source frameworks such as Hadoop and Sector/Sphere are generally used to handle speech recognition
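The MapReduce pattern those frameworks implement can be simulated locally in a few lines; this sketches only the programming model, not Hadoop's distributed runtime:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Local simulation of MapReduce: map each record to (key, value)
    pairs, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count as the mapper/reducer pair:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)
```

In an audio-mining setting, records might be audio segments and the mapper a feature extractor, with the same grouping and reduction shape.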
Multimedia Mining:
Image Mining, Video Mining, Audio Mining, Text Mining
To mine audio data, one could convert it into text using speech transcription techniques. Audio data could also be mined directly by using audio information processing techniques and then mining selected audio data. [10]
That is, either convert the audio into text and then apply text mining, or process the acoustic signal directly and then mine the useful audio data.
The text-based approach, also known as large-vocabulary continuous speech recognition (LVCSR), converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. If a word or name is not in the dictionary, the LVCSR system will choose the most similar word it can find. [11]
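The "most similar word" fallback can be illustrated with edit distance over a toy dictionary. A real LVCSR system makes this choice using acoustic and language-model scores; this sketch only conveys the idea:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_word(word, dictionary):
    """Pick the dictionary entry with the smallest edit distance."""
    return min(dictionary, key=lambda w: edit_distance(word, w))
```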
When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning." This is what some tools, such as grep, do when searching.
However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive."
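Putting the indexing steps above together: a toy inverted index with stop-word removal and a crude suffix-stripping "stemmer" (the stemming here is deliberately naive, standing in for a real algorithm such as Porter's):

```python
STOP_WORDS = {"the", "and", "a", "an", "of"}

def crude_stem(term):
    """Naive suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s", "ed"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_index(docs):
    """Inverted index: stemmed term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term in STOP_WORDS:
                continue  # too common to be useful in searching
            index.setdefault(crude_stem(term), set()).add(doc_id)
    return index
```

A query then touches only the index: look up the stemmed query term and retrieve the matching document ids, never rescanning the documents themselves.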
MIR[15]
|  | Useful | Not useful |
| --- | --- | --- |
| event-scale information (i.e., transcribing individual notes or chords) | instrument detection, QBE, QBH | describing music |
| phrase-level information (i.e., analyzing note sequences for periodicities) | analyzing longer temporal excerpts; tempo detection, playlist sequencing, music summarization |  |
| piece-level information (i.e., analyzing longer excerpts of audio tracks) | a more abstract representation of a music track, the user's perception of music; used for genre detection, content-based music recommenders |  |
Four levels of retrieval tasks [research focuses mainly on the genre, work, and instance levels]:
genre level: searching for rock songs is a task at the genre level
artist level: looking for artists similar to Björk is clearly a task at the artist level
work level: finding cover versions of the song "Let it Be" by The Beatles is a task at the work level
instance level: identifying a particular recording of Mahler's fifth symphony is a task at the instance level
Other Key Words: Activity Recognition, Computational Data Mining, Raw Audio, Clustering, Classification, Regression, Support Vector Machines, KDD (Knowledge Discovery in Databases), Acoustic Vector Sensors (AVS), Direction of Arrival (DOA), Analog-to-Digital Converter (ADC)