To get a big-picture understanding of how GMM-HMM is applied to speech recognition, I spent some time reading part of the HTK source code (used HTK for simple isolated-word recognition) and finally have a rough grasp of the algorithm, which is what I wanted. I have to say that accessible tutorials on speech recognition are scarce online: formulas fly everywhere, but few texts discuss the concrete details, which of course requires hands-on experience. Below I summarize a few points that should give a macro-level picture (using isolated-word recognition as the example).
1. Each word's pronunciation corresponds to one HMM. Everyone knows an HMM has a state set S, so what does each state actually look like: a number? a vector? a matrix? In fact the states in this set carry no particular mathematical content; they are just labels. You can call them '1', '2', '3'... or 'a', 'b', 'c'. How many states each HMM should have is set by hand, from prior knowledge.
2. Every HMM state has an associated observation, which may be a real number or a vector, and the observations of all states must have the same dimensionality. Suppose we have the audio file of one word. It is first sampled into digital form (A/D conversion), then split into frames, and MFCC features are extracted from each frame. If each frame's MFCC feature vector has length 39, the audio file becomes a sequence of N MFCC vectors (N varies from file to file). When training the HMM parameters (e.g. with the Baum-Welch algorithm), each input fed to the HMM must be such an observation sequence. The observation attached to each state is thus a 39-dimensional vector whose elements take continuous values, so it has to be modeled with a multivariate density, usually a multivariate Gaussian. In the GMM-HMM framework this density is a mixture of K multivariate Gaussians. Once all parameters of a state's K Gaussians are known, the probability that this GMM generates a given observation vector (the MFCC coefficients of one frame) in that state can be computed (see the sketches after this list).
3. Building an HMM for each word requires training samples of that word, labeled in advance: each sample is a segment of audio containing only that word's pronunciation. Given several such samples, the Baum-Welch algorithm (an instance of EM) is used to train all GMM-HMM parameters: the initial state probability vector, the state transition matrix, and the observation model of each state (here a GMM, i.e. each state's K mixture weights plus each Gaussian's mean vector and covariance matrix). A sketch of this parameter layout also follows the list.
4. In the recognition stage, given a segment of audio: if it contains several words, first split it by hand (the simplest approach); then extract each word's MFCC feature sequence and feed that sequence into every (already trained) HMM. Use the forward algorithm to compute the probability of each HMM generating the sequence, take the model with the largest probability, and the word that model represents is the recognition result (this step is also sketched after the list).
5. When building the acoustic model, Deep Learning methods can be used in place of the GMM in GMM-HMM. A GMM's ability to approximate an arbitrary density depends on the number of mixture components, so it has limited capacity and counts as a shallow model, whereas a deep network can approximate arbitrary functions and therefore has stronger representational power. Note that the deep net used to replace the GMM here must be a generative model, such as a DBN or DBM, because training the HMM-DL network requires the probability of a sample being generated by a given HMM state.
6. Implementing GMM-HMM concretely is still fairly involved.
7. HMMs are generally used only when time series are involved, e.g. speech recognition in audio or action recognition in video. If we used GMM-HMM to classify static images, no temporal information is involved, so the number of HMM states can be set to 1, and the GMM-HMM algorithm then degenerates into plain GMM.
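
To make point 2 concrete, here is a minimal sketch (not HTK code; the variable names are illustrative and diagonal covariances are assumed) of the probability that one state's K-component GMM assigns to a single 39-dimensional MFCC frame:

% Sketch: emission probability of one 39-dim MFCC frame under a state's GMM.
% Diagonal covariances are assumed; all values below are placeholders.
D = 39; K = 4;                     % feature dimension, number of mixture components
x    = randn(D, 1);                % one MFCC frame (stand-in for real features)
w    = ones(1, K)/K;               % mixture weights, sum to 1
mu   = randn(D, K);                % one mean vector per component
sig2 = ones(D, K);                 % one diagonal-variance vector per component

p = 0;
for k = 1:K
    d = x - mu(:,k);
    expo = -0.5 * sum(d.^2 ./ sig2(:,k));               % exponent of the k-th Gaussian
    normConst = (2*pi)^(-D/2) / sqrt(prod(sig2(:,k)));  % Gaussian normalization constant
    p = p + w(k) * normConst * exp(expo);
end
fprintf('b_j(x) = %g\n', p);       % probability of this frame in state j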
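
For point 3, the complete parameter set trained for one word can be pictured roughly as follows; the field names and the left-to-right topology are only an illustrative layout, not HTK's model format:

% Sketch of the parameters trained for one word model (illustrative layout).
N = 5; K = 4; D = 39;                        % states, mixtures per state, feature dim
wordHmm.pi = [1 zeros(1, N-1)];              % initial state probabilities
wordHmm.A  = diag(0.6*ones(1, N)) + diag(0.4*ones(1, N-1), 1);
wordHmm.A(N, N) = 1;                         % left-to-right transition matrix
for j = 1:N                                  % observation model of state j: a GMM
    wordHmm.state(j).w    = ones(1, K)/K;    % mixture weights
    wordHmm.state(j).mu   = zeros(D, K);     % mean vectors
    wordHmm.state(j).sig2 = ones(D, K);      % diagonal covariances
end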
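
And for point 4, the recognition step can be sketched as a scaled forward recursion over each trained model followed by an argmax. The model layout is the illustrative one above, and gmmProb is a hypothetical helper that returns the per-state emission probability of every frame; neither is an HTK function.

function word = recognizeWord(mfccSeq, models, vocab)
% recognizeWord: minimal sketch of isolated-word recognition (not HTK code).
%   mfccSeq: D x T matrix of MFCC frames for one utterance
%   models:  cell array; models{m} follows the illustrative wordHmm layout above
%   vocab:   cell array of word labels
logLik = -inf(1, numel(models));
for m = 1:numel(models)
    logLik(m) = forwardLogLik(mfccSeq, models{m});
end
[maxLogLik, best] = max(logLik);   % pick the model with the largest log-likelihood
word = vocab{best};

function ll = forwardLogLik(O, hmm)
% Scaled forward algorithm: ll = log P(O | hmm).
% gmmProb() is a hypothetical helper returning B(j,t) = b_j(o_t),
% the emission probability of frame t under state j's GMM.
T = size(O, 2);
B = gmmProb(O, hmm);               % N x T emission probabilities
alpha = hmm.pi(:) .* B(:, 1);      % initialization
c = sum(alpha); alpha = alpha/c;
ll = log(c);
for t = 2:T
    alpha = (hmm.A' * alpha) .* B(:, t);   % induction over states
    c = sum(alpha); alpha = alpha/c;       % scale to avoid underflow
    ll = ll + log(c);
end

Scaling the forward variables at every frame and summing the logs of the scale factors is the standard way to keep the recursion from underflowing on long sequences.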
MFCC:
A MATLAB implementation tutorial for MFCC can be found on Roger Jang (張智星)'s web page: mfcc. These are the most basic 12-dimensional features.
function mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
% frame2mfcc: Frame to MFCC conversion.
%   Usage: mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
%
%   For example:
%       waveFile='what_movies_have_you_seen_recently.wav';
%       [y, fs, nbits]=wavReadInt(waveFile);
%       startIndex=12000;
%       frameSize=512;
%       frame=y(startIndex:startIndex+frameSize-1);
%       frame2mfcc(frame, fs, 20, 12, 1);

%   Roger Jang 20060417

if nargin<1, selfdemo; return; end
if nargin<2, fs=16000; end
if nargin<3, filterNum=20; end
if nargin<4, mfccNum=12; end
if nargin<5, plotOpt=0; end

frameSize=length(frame);
% ====== Preemphasis should be done at wave level
%a=0.95;
%frame2 = filter([1, -a], 1, frame);
frame2=frame;
% ====== Hamming windowing
frame3=frame2.*hamming(frameSize);
% ====== FFT
[fftMag, fftPhase, fftFreq, fftPowerDb]=fftOneSide(frame3, fs);
% ====== Triangular band-pass filter bank
triFilterBankPrm=getTriFilterBankPrm(fs, filterNum);    % Get parameters for triangular band-pass filter bank
% Triangular bandpass filter.
for i=1:filterNum
    tbfCoef(i)=dot(fftPowerDb, trimf(fftFreq, triFilterBankPrm(:,i)));  % obtain filterNum filter-bank coefficients
end
% ====== DCT
mfcc=zeros(mfccNum, 1);    % keep mfccNum coefficients after the DCT
for i=1:mfccNum
    coef = cos((pi/filterNum)*i*((1:filterNum)-0.5))';   % cosine basis for the i-th MFCC coefficient
    mfcc(i) = sum(coef.*tbfCoef');                       % directly following the DCT formula
end
% ====== Log energy
%logEnergy=10*log10(sum(frame.*frame));
%mfcc=[logEnergy; mfcc];

if plotOpt
    subplot(2,1,1);
    plot(frame, '.-'); set(gca, 'xlim', [-inf inf]); title('Input frame');
    subplot(2,1,2);
    plot(mfcc, '.-'); set(gca, 'xlim', [-inf inf]); title('MFCC vector');
end

% ====== trimf.m (from fuzzy toolbox)
function y = trimf(x, prm)
% map the frequency (x-axis) to the triangle's height (0~1)
a = prm(1); b = prm(2); c = prm(3);
y = zeros(size(x));
% Left and right shoulders (y = 0)
index = find(x <= a | c <= x);
y(index) = zeros(size(index));   % only points inside the triangle are considered
% Left slope
if (a ~= b)
    index = find(a < x & x < b);
    y(index) = (x(index)-a)/(b-a);
end
% right slope
if (b ~= c)
    index = find(b < x & x < c);
    y(index) = (c-x(index))/(c-b);
end
% Center (y = 1)
index = find(x == b);
y(index) = ones(size(index));

% ====== Self demo
function selfdemo
waveFile='what_movies_have_you_seen_recently.wav';
[y, fs, nbits]=wavReadInt(waveFile);
startIndex=12000;
frameSize=512;
frame=y(startIndex:startIndex+frameSize-1);
feval(mfilename, frame, fs, 20, 12, 1);
ZCR:
Zero-crossing detection, used to measure how many zero crossings occur in each frame. The simplest version can be found at: zeros cross rate.
waveFile='csNthu.wav';
frameSize=256;
overlap=0;
[y, fs, nbits]=wavread(waveFile);
frameMat=enframe(y, frameSize, overlap);
frameNum=size(frameMat, 2);
for i=1:frameNum
    frameMat(:,i)=frameMat(:,i)-mean(frameMat(:,i));   % remove the mean (DC offset) of each frame
end
zcr=sum(frameMat(1:end-1, :).*frameMat(2:end, :)<0);   % count sign changes between adjacent samples
sampleTime=(1:length(y))/fs;
frameTime=((0:frameNum-1)*(frameSize-overlap)+0.5*frameSize)/fs;
subplot(2,1,1); plot(sampleTime, y); ylabel('Amplitude'); title(waveFile);
subplot(2,1,2); plot(frameTime, zcr, '.-'); xlabel('Time (sec)'); ylabel('Count'); title('ZCR');
EPD:
Endpoint detection, i.e. finding the start and end points of the speech. See: EPD in Time Domain, the simplest detection method in the time domain.
waveFile='sunday.wav';
[wave, fs, nbits] = wavread(waveFile);
frameSize = 256;
overlap = 128;

wave=wave-mean(wave);                         % zero-mean subtraction
frameMat=buffer2(wave, frameSize, overlap);   % frame blocking; each column is one frame
frameNum=size(frameMat, 2);                   % no. of frames
volume=frame2volume(frameMat);                % volume: energy of each frame (sum of abs. values or of squares); a row vector
volumeTh1=max(volume)*0.1;                    % volume threshold 1
volumeTh2=median(volume)*0.1;                 % volume threshold 2
volumeTh3=min(volume)*10;                     % volume threshold 3
volumeTh4=volume(1)*5;                        % volume threshold 4
index1 = find(volume>volumeTh1);              % indices of the frames whose volume exceeds the threshold
index2 = find(volume>volumeTh2);
index3 = find(volume>volumeTh3);
index4 = find(volume>volumeTh4);
% frame2sampleIndex() maps frame indices to sample indices
% each endPointX has length 2: the sample indices of the start point and the end point
endPoint1=frame2sampleIndex([index1(1), index1(end)], frameSize, overlap);
endPoint2=frame2sampleIndex([index2(1), index2(end)], frameSize, overlap);
endPoint3=frame2sampleIndex([index3(1), index3(end)], frameSize, overlap);
endPoint4=frame2sampleIndex([index4(1), index4(end)], frameSize, overlap);

subplot(2,1,1);
time=(1:length(wave))/fs;
plot(time, wave); ylabel('Amplitude'); title('Waveform'); axis([-inf inf -1 1]);
line(time(endPoint1( 1))*[1 1], [-1, 1], 'color', 'm');   % mark the start/end boundary lines
line(time(endPoint2( 1))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3( 1))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4( 1))*[1 1], [-1, 1], 'color', 'r');
line(time(endPoint1(end))*[1 1], [-1, 1], 'color', 'm');
line(time(endPoint2(end))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3(end))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4(end))*[1 1], [-1, 1], 'color', 'r');
legend('Waveform', 'Boundaries by threshold 1', 'Boundaries by threshold 2', 'Boundaries by threshold 3', 'Boundaries by threshold 4');

subplot(2,1,2);
frameTime=frame2sampleIndex(1:frameNum, frameSize, overlap);
plot(frameTime, volume, '.-'); ylabel('Sum of Abs.'); title('Volume'); axis tight;
line([min(frameTime), max(frameTime)], volumeTh1*[1 1], 'color', 'm');
line([min(frameTime), max(frameTime)], volumeTh2*[1 1], 'color', 'g');
line([min(frameTime), max(frameTime)], volumeTh3*[1 1], 'color', 'k');
line([min(frameTime), max(frameTime)], volumeTh4*[1 1], 'color', 'r');
legend('Volume', 'Threshold 1', 'Threshold 2', 'Threshold 3', 'Threshold 4');
GMM:
A GMM is used to fit a data distribution: the samples are assumed to follow a GMM, and the GMM's parameters are then learned from many samples. In speech, GMM modeling can describe, for example, the pronunciation of a particular word or the timbre of a particular speaker. The training procedure can be found at: speaker recognition.
function [M, V, W, logProb] = gmmTrain(data, gaussianNum, dispOpt)
% gmmTrain: Parameter training for gaussian mixture model (GMM)
%   Usage: function [M, V, W, logProb] = gmm(data, gaussianNum, dispOpt)
%       data: dim x dataNum matrix where each column is a data point
%       gaussianNum: No. of Gaussians or initial centers
%       dispOpt: Option for displaying info during training
%       M: dim x meanNum matrix where each column is a mean vector
%       V: 1 x gaussianNum vector where each element is a variance for a Gaussian
%       W: 1 x gaussianNum vector where each element is a weighting factor for a Gaussian

%   Roger Jang 20000610

if nargin==0, selfdemo; return; end
if nargin<3, dispOpt=0; end

maxLoopCount = 50;                  % Max. iteration
minImprove = 1e-6;                  % Min. improvement
minVariance = 1e-6;                 % Min. variance
logProb = zeros(maxLoopCount, 1);   % Array for objective function
[dim, dataNum] = size(data);

% Set initial parameters
% Set initial M
%M = data(1+floor(rand(gaussianNum,1)*dataNum),:);   % Randomly select several data points as the centers
if length(gaussianNum)==1,
    % Using vqKmeans to find initial centers
    fprintf('Start KMEANS to find the initial mu...\n');
%   M = vqKmeansMex(data, gaussianNum, 0);
    M = vqKmeans(data, gaussianNum, 0);   % use clustering to get the initial means (cluster into gaussianNum groups)
%   M = vqLBG(data, gaussianNum, 0);
    fprintf('Start GMM training...\n');
    if any(any(~isfinite(M))); keyboard; end
else
    % gaussianNum is in fact the initial centers
    M = gaussianNum;
    gaussianNum = size(M, 2);
end

% Set initial V as the distance to the nearest center
if gaussianNum==1
    V=1;
else
    distance=pairwiseSqrDist(M);    % pairwiseSqrDist is a dll
    %distance=pairwiseSqrDist2(M);
    distance(1:(gaussianNum+1):gaussianNum^2)=inf;   % Diagonal elements are inf
    [V, index]=min(distance);       % Initial variance for each Gaussian
end

% Set initial W
W = ones(1, gaussianNum)/gaussianNum;   % Weight for each Gaussian; initialized to equal weights

if dispOpt & dim==2, displayGmm(M, V, data); end
for i = 1:maxLoopCount   % iterate to train the parameters (EM algorithm)
    % Expectation step:
    % P(i,j) is the probability of data(:,j) to the i-th Gaussian
    % prob is the probability of each sample under the GMM
    [prob, P]=gmmEval(data, M, V, W);
    logProb(i)=sum(log(prob));       % joint (log) probability of all samples
    if dispOpt
        fprintf('i = %d, log prob. = %f\n',i-1, logProb(i));
    end
    PW = diag(W)*P;
    BETA=PW./(ones(gaussianNum,1)*sum(PW));   % BETA(i,j) is beta_i(x_j)
    sumBETA=sum(BETA,2);

    % Maximization step: eqns (2.96) to (2.98) from Bishop p.67:
    M = (data*BETA')./(ones(dim,1)*sumBETA');
    DISTSQ = pairwiseSqrDist(M, data);    % Distance of M to data
    %DISTSQ = pairwiseSqrDist2(M, data);  % Distance of M to data
    V = max((sum(BETA.*DISTSQ, 2)./sumBETA)/dim, minVariance);   % (2.97)
    W = (1/dataNum)*sumBETA;                                     % (2.98)

    if dispOpt & dim==2, displayGmm(M, V, data); end
    if i>1, if logProb(i)-logProb(i-1)<minImprove, break; end; end
end
[prob, P]=gmmEval(data, M, V, W);
logProb(i)=sum(log(prob));
fprintf('Iteration count = %d, log prob. = %f\n',i, logProb(i));
logProb(i+1:maxLoopCount) = [];

% ====== Self Demo ======
function selfdemo
%[data, gaussianNum] = dcdata(2);
data = rand(1000,2);
gaussianNum = 8;
data=data';
plotOpt=1;
[M, V, W, lp] = feval(mfilename, data, gaussianNum, plotOpt);

pointNum = 40;
x = linspace(min(data(1,:)), max(data(1,:)), pointNum);
y = linspace(min(data(2,:)), max(data(2,:)), pointNum);
[xx, yy] = meshgrid(x, y);
data = [xx(:) yy(:)]';
z = gmmEval(data, M, V, W);
zz = reshape(z, pointNum, pointNum);
figure; mesh(xx, yy, zz); axis tight; box on; rotate3d on
figure; contour(xx, yy, zz, 30); axis image

% ====== Other subfunctions ======
function displayGmm(M, V, data)
% Display function for EM algorithm
figureH=findobj(0, 'tag', mfilename);
if isempty(figureH)
    figureH=figure;
    set(figureH, 'tag', mfilename);
    colordef black
    plot(data(1,:), data(2,:),'.r'); axis image
    theta=linspace(-pi, pi, 21);
    x=cos(theta); y=sin(theta);
    sigma=sqrt(V);
    for i=1:length(sigma)
        circleH(i)=line(x*sigma(i)+M(1,i), y*sigma(i)+M(2,i), 'color', 'y');
    end
    set(circleH, 'tag', 'circleH', 'erasemode', 'xor');
else
    circleH=findobj(figureH, 'tag', 'circleH');
    theta=linspace(-pi, pi, 21);
    x=cos(theta); y=sin(theta);
    sigma=sqrt(V);
    for i=1:length(sigma)
        set(circleH(i), 'xdata', x*sigma(i)+M(1,i), 'ydata', y*sigma(i)+M(2,i));
    end
    drawnow
end
Speaker identification:
Given speech data from N people, a GMM can be trained for each of the N voices; then, given a new utterance, we decide which of the N people it sounds most like. The method is to compute the probability of the utterance under the N GMM models and pick the one with the largest probability. See: speaker recognition.
function [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
% speakerIdentify: speaker identification using GMM parameters
%   Usage: [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
%       speakerData: structure array generated by speakerDataRead.m
%       speakerGmm: speakerGmm(i).gmmPrm is the GMM parameters for speaker i.
%       useIntGmm: use fixed-point GMM

%   Roger Jang, 20070517, 20080726

if nargin<3, useIntGmm=0; end

% ====== Speaker identification using GMM parameters
speakerNum=length(speakerData);
for i=1:speakerNum
%   fprintf('%d/%d: Recognizing wave files by %s\n', i, speakerNum, speakerData(i).name);
    for j=1:length(speakerData(i).sentence)
%       fprintf('\tSentence %d...\n', j);
        frameNum=size(speakerData(i).sentence(j).fea, 2);
        logProb=zeros(speakerNum, frameNum);
        % logProb(k,m) is the log probability of frame m of this sentence under speaker k's GMM
        % take one sentence and determine which speaker it belongs to
        for k=1:speakerNum,
%           fprintf('\t\tSpeaker %d...\n', k);
%           logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            if ~useIntGmm
%               logProb(k, :)=gmmEvalMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
            else
%               logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
                logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);   % evaluate under speaker k's fixed-point GMM
            end
        end
        cumLogProb=sum(logProb, 2);
        [maxProb, index]=max(cumLogProb);
        speakerData(i).sentence(j).predictedSpeaker=index;   % the identified speaker
        speakerData(i).sentence(j).logProb=logProb;
    end
end

% ====== Compute confusion matrix and recognition rate
confusionMatrix=zeros(speakerNum);
for i=1:speakerNum,
    predictedSpeaker=[speakerData(i).sentence.predictedSpeaker];
    [index, count]=elementCount(predictedSpeaker);
    confusionMatrix(i, index)=count;
end
recogRate=sum(diag(confusionMatrix))/sum(sum(confusionMatrix));
GMM-HMM:
Training phase: given the HMM's k states, the observations generated in each state are fitted with a probability distribution, here a GMM; in fact the whole GMM-HMM can be viewed as a single generative model. Given initial values for its five groups of parameters (initial state probabilities, transition matrix, and each state's mixture weights, mean vectors and covariances, obtained from a mix of random initialization and the training samples), run the E step of EM: compute how the training frames are distributed over the states, i.e. the probability of each training frame under each state. Then the M step: use these posteriors to re-estimate those parameters. A rough sketch of the E step follows.
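
The sketch below (again illustrative, not HTK's implementation) computes the per-frame state occupancies gamma(j,t) for one training utterance with scaled forward-backward recursions, reusing the illustrative wordHmm layout and the hypothetical gmmProb helper from the earlier sketches; these occupancies are what the M step then uses to re-estimate the parameters.

function gamma = stateOccupancy(O, hmm)
% stateOccupancy: minimal sketch of the E step for one training utterance.
% Returns gamma(j,t) = P(state j at frame t | O), computed with scaled
% forward-backward recursions. gmmProb() is a hypothetical helper that
% returns the N x T matrix of per-state emission probabilities B(j,t).
T = size(O, 2);
N = length(hmm.pi);
B = gmmProb(O, hmm);                         % N x T emission probabilities
alpha = zeros(N, T); beta = zeros(N, T); c = zeros(1, T);
alpha(:,1) = hmm.pi(:) .* B(:,1);            % scaled forward pass
c(1) = sum(alpha(:,1)); alpha(:,1) = alpha(:,1)/c(1);
for t = 2:T
    alpha(:,t) = (hmm.A' * alpha(:,t-1)) .* B(:,t);
    c(t) = sum(alpha(:,t)); alpha(:,t) = alpha(:,t)/c(t);
end
beta(:,T) = 1;                               % scaled backward pass
for t = T-1:-1:1
    beta(:,t) = hmm.A * (B(:,t+1) .* beta(:,t+1)) / c(t+1);
end
gamma = alpha .* beta;                       % each column sums to 1
% The M step (not shown) re-estimates pi, A and each state's GMM from gamma
% (and the analogous per-mixture occupancies), accumulated over all utterances.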
Testing phase (isolated-word recognition): given the frame matrix of a word's pronunciation, take one GMM-HMM model and compute the probability of each frame of the utterance under every state of that model; combining these with the model's transition and initial probabilities gives the corresponding clique tree, and graphical-model inference yields the probability of that model generating the utterance (for an HMM this inference is exactly the forward algorithm sketched earlier). Compare all the GMM-HMM models and take the word whose model gives the largest probability.
References:
Machine Learning & Data Mining Notes 13 (simple isolated-word recognition with HTK)