[iOS 10 SpeechRecognition] Best Practices for Transcribing Speech as the User Talks

Date: 2019-12-06

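First, I want to stress what "speech recognition" literally calls for: the user speaks, and what they said is immediately shown as text. That is the feature developers actually need.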

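Before building this I searched Google and Baidu for a ready-made wheel to reuse, and the results were laughable. Every post has a title promising an in-depth study of this library, yet the code just calls one API to recognize a local audio file loaded from a URL, and that's it? That doesn't even cover the most basic requirement.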

Today I'll go through two ways to implement this feature.

There are two request APIs, SFSpeechAudioBufferRecognitionRequest and SFSpeechURLRecognitionRequest, and two ways to receive results, a block and a delegate. I'll pair them up into two approaches so that all four are covered.

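Before writing any code you need to declare the privacy permissions in Info.plist (raw keys NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription). Everyone already knows this, but I'll mention it anyway for completeness.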

Privacy - Microphone Usage Description
Privacy - Speech Recognition Usage Description

Then call requestAuthorization to ask for permission:

    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        // Check the status enum here
    }];
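
As a minimal sketch of what "checking the status enum" could look like: the helper name SXRequestSpeechPermission and its completion block are hypothetical and not part of the original demo. The status constants are the ones declared in the Speech framework.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: asks for speech-recognition permission and reports
// whether recognition may proceed.
static void SXRequestSpeechPermission(void (^completion)(BOOL allowed)) {
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        BOOL allowed = NO;
        switch (status) {
            case SFSpeechRecognizerAuthorizationStatusAuthorized:
                allowed = YES;   // the user granted access
                break;
            case SFSpeechRecognizerAuthorizationStatusDenied:        // the user said no
            case SFSpeechRecognizerAuthorizationStatusRestricted:    // blocked on this device
            case SFSpeechRecognizerAuthorizationStatusNotDetermined: // not asked yet
                allowed = NO;
                break;
        }
        // The handler is not guaranteed to run on the main queue,
        // so hop over before the caller touches any UI.
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(allowed);
        });
    }];
}
```

A caller could enable or disable its record button inside the completion block.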

The microphone permission prompt also appears the first time recording starts.
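
If you would rather ask for the microphone up front instead of waiting for the first recording, AVAudioSession can be asked explicitly. This is only a sketch; the helper name SXRequestMicrophonePermission is made up for illustration.

```objc
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>

// Hypothetical helper: triggers the microphone prompt ahead of time.
static void SXRequestMicrophonePermission(void (^completion)(BOOL granted)) {
    [[AVAudioSession sharedInstance] requestRecordPermission:^(BOOL granted) {
        // Deliver the answer on the main queue so the caller can update UI.
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(granted);
        });
    }];
}
```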

1. SFSpeechAudioBufferRecognitionRequest with a block

This approach breaks down into the following steps.

① Create the audio engine

Add the following properties so that starting, stopping, and releasing are easy to manage.

@property(nonatomic,strong)SFSpeechRecognizer *bufferRec;
@property(nonatomic,strong)SFSpeechAudioBufferRecognitionRequest *bufferRequest;
@property(nonatomic,strong)SFSpeechRecognitionTask *bufferTask;
@property(nonatomic,strong)AVAudioEngine *bufferEngine;
@property(nonatomic,strong)AVAudioInputNode *buffeInputNode;

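I suggest doing the initialization in the start method, which keeps starting and stopping simple. If you plan to use a single global instance, initializing once is fine too.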

    self.bufferRec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];
    self.bufferEngine = [[AVAudioEngine alloc]init];
    self.buffeInputNode = [self.bufferEngine inputNode];

② Create the speech recognition request

    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];
    self.bufferRequest.shouldReportPartialResults = true;

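The shouldReportPartialResults property can be toggled as you like. It decides whether you get a single callback once the utterance is complete, or a callback for every partial fragment of speech.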

③ Create the task and run it

    // The code outside the blocks is preparation too: initial parameter setup and so on
    self.bufferRequest = [[SFSpeechAudioBufferRecognitionRequest alloc]init];
    self.bufferRequest.shouldReportPartialResults = true;
    __weak ViewController *weakSelf = self;
    self.bufferTask = [self.bufferRec recognitionTaskWithRequest:self.bufferRequest resultHandler:^(SFSpeechRecognitionResult * _Nullable result, NSError * _Nullable error) {
            // Callback invoked when a result arrives
    }];
    
    // Install a tap on bus 0 and append each captured buffer to the request
    AVAudioFormat *format =[self.buffeInputNode outputFormatForBus:0];
    [self.buffeInputNode installTapOnBus:0 bufferSize:1024 format:format block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {
        [weakSelf.bufferRequest appendAudioPCMBuffer:buffer];
    }];
    
    // Prepare and start the engine
    [self.bufferEngine prepare];
    NSError *error = nil;
    if (![self.bufferEngine startAndReturnError:&error]) {
        NSLog(@"%@",error.userInfo);
    }
    self.showBufferText.text = @"Waiting for a command.....";

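Anyone who knows a little about run loops will see that the code outside the blocks runs first, in the earlier run-loop pass. The normal startup sequence is: initialize the parameters, start the engine, and then the buffer-appending tap block is called over and over. Once enough audio has accumulated, the recognition result handler above is called. Sometimes the tap block fires even when there is no sound, but resultHandler does not. The recognizer presumably has some internal tolerance (input whose power is below some threshold appears to be ignored).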
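
The snippet above does not show any audio session setup. Assuming your app has not configured the session elsewhere, something along these lines could run before the engine is started. The helper name SXActivateRecordingSession is hypothetical, and the choice of the Record category is an assumption rather than something taken from the demo.

```objc
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>

// Hypothetical helper: puts the shared audio session into a recording
// configuration and activates it. Returns NO and fills outError on failure.
static BOOL SXActivateRecordingSession(NSError **outError) {
    AVAudioSession *session = [AVAudioSession sharedInstance];
    // Assumption: the app only needs input while recognizing.
    if (![session setCategory:AVAudioSessionCategoryRecord error:outError]) {
        return NO;
    }
    return [session setActive:YES error:outError];
}
```

If the app also plays audio, AVAudioSessionCategoryPlayAndRecord would be the more likely choice.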

④ Handle the result callback

The result arrives in the resultHandler block above. Its parameters are result and error, and you can act on them as needed.

        if (result != nil) {
            // Use weakSelf (declared above) so the block does not retain self through the task
            weakSelf.showBufferText.text = result.bestTranscription.formattedString;
        }
        if (error != nil) {
            NSLog(@"%@",error.userInfo);
        }

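Take a look at the properties of the result type, SFSpeechRecognitionResult. It has the best transcription and also an array of alternative transcriptions. If you want exact matching, you should probably check the alternatives in that array as well.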
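
A rough sketch of scanning the alternatives for a keyword follows. The function name SXResultContainsKeyword is hypothetical. The transcriptions array already includes the best transcription, so one loop covers both.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: returns YES if any candidate transcription of the
// result contains the given keyword.
static BOOL SXResultContainsKeyword(SFSpeechRecognitionResult *result, NSString *keyword) {
    // transcriptions holds every hypothesis, ordered by descending confidence.
    for (SFTranscription *candidate in result.transcriptions) {
        if ([candidate.formattedString containsString:keyword]) {
            return YES;
        }
    }
    return NO;
}
```

Inside resultHandler it could be used as `if (result.isFinal && SXResultContainsKeyword(result, keyword)) { ... }`, where keyword is whatever command your app listens for.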

⑤ Stop listening

    [self.bufferEngine stop];
    [self.buffeInputNode removeTapOnBus:0];
    self.showBufferText.text = @"";
    self.bufferRequest = nil;
    self.bufferTask = nil;

The bus here is a temporary identifier for a connection point on the node. You can think of it as roughly the same idea as a port.
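
The teardown above does not tell the request that the audio has ended. A slightly more explicit variant might look like the following sketch. The function name SXStopBufferRecognition is hypothetical, and whether you want the pending result (endAudio) or want to discard it (cancel) depends on your app.

```objc
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: stops capturing and lets the recognizer wrap up.
static void SXStopBufferRecognition(AVAudioEngine *engine,
                                    SFSpeechAudioBufferRecognitionRequest *request,
                                    SFSpeechRecognitionTask *task,
                                    BOOL discardPendingResult) {
    [engine stop];
    [engine.inputNode removeTapOnBus:0];  // same bus the tap was installed on
    [request endAudio];                   // no more buffers will be appended
    if (discardPendingResult) {
        [task cancel];                    // drop whatever is still being processed
    }
}
```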

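2. SFSpeechURLRecognitionRequest with a delegate

The main difference between the two is that the block style is concise, while the delegate leaves more room for customization because it exposes more lifecycle callbacks for the result.

There is not much to say about these six methods, as the names speak for themselves. One thing to note: the second method is called many times, and the third is called once when an utterance is finished.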

// Called when the task first detects speech in the source audio
- (void)speechRecognitionDidDetectSpeech:(SFSpeechRecognitionTask *)task;

// Called for all recognitions, including non-final hypothesis
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didHypothesizeTranscription:(SFTranscription *)transcription;

// Called only for final recognitions of utterances. No more about the utterance will be reported
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult;

// Called when the task is no longer accepting new audio but may be finishing final processing
- (void)speechRecognitionTaskFinishedReadingAudio:(SFSpeechRecognitionTask *)task;

// Called when the task has been cancelled, either by client app, the user, or the system
- (void)speechRecognitionTaskWasCancelled:(SFSpeechRecognitionTask *)task;

// Called when recognition of all requested utterances is finished.
// If successfully is false, the error property of the task will contain error information
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishSuccessfully:(BOOL)successfully;

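The idea here is to first build a recorder (started and stopped manually, or a synchronous recorder that starts and stops automatically by volume, like Talking Tom), save the recording to a local directory, and then read it back with the URL request for transcription. The steps are as follows.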

① Create the synchronous recorder

The following properties are needed.

/** Recorder that captures the audio to be recognized */
@property (nonatomic, strong) AVAudioRecorder *recorder;
/** Monitor that only measures the input level */
@property (nonatomic, strong) AVAudioRecorder *monitor;
/** URL of the recording file */
@property (nonatomic, strong) NSURL *recordURL;
/** URL of the monitor's file */
@property (nonatomic, strong) NSURL *monitorURL;
/** Timer that drives the level checks */
@property (nonatomic, strong) NSTimer *timer;

Initializing the properties

    // Recording settings
    NSDictionary *recordSettings = [[NSDictionary alloc] initWithObjectsAndKeys:
                                    [NSNumber numberWithFloat: 14400.0], AVSampleRateKey,
                                    [NSNumber numberWithInt: kAudioFormatAppleIMA4], AVFormatIDKey,
                                    [NSNumber numberWithInt: 2], AVNumberOfChannelsKey,
                                    [NSNumber numberWithInt: AVAudioQualityMax], AVEncoderAudioQualityKey,
                                    nil];
    
    NSString *recordPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"record.caf"];
    _recordURL = [NSURL fileURLWithPath:recordPath];
    
    _recorder = [[AVAudioRecorder alloc] initWithURL:_recordURL settings:recordSettings error:NULL];
    
    // Monitor
    NSString *monitorPath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"monitor.caf"];
    _monitorURL = [NSURL fileURLWithPath:monitorPath];
    _monitor = [[AVAudioRecorder alloc] initWithURL:_monitorURL settings:recordSettings error:NULL];
    _monitor.meteringEnabled = YES;

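Don't fret over the constants in the settings dictionary. The code is lifted straight from something I wrote earlier, and it asks for the highest audio quality.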

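② Starting and stopping

To start and stop based on volume, you need an extra monitor alongside the recorder to measure loudness. The peakPowerForChannel method reports the current ambient level at the microphone, and a timer controls how often the level is checked. The code looks roughly like this: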

- (void)setupTimer {
    [self.monitor record];
    self.timer = [NSTimer scheduledTimerWithTimeInterval:0.1 target:self selector:@selector(updateTimer) userInfo:nil repeats:YES]; // Dong Boran (cnblogs)
}

// Decides when to start and stop recording
- (void)updateTimer {

    // The meters must be refreshed before each reading, otherwise the value is stale
    [self.monitor updateMeters];
    
    // Peak level of channel 0 in dB: -160.0 is complete silence, 0 is the maximum
    float power = [self.monitor peakPowerForChannel:0];
    
    //        NSLog(@"%f", power);
    if (power > -20) {
        if (!self.recorder.isRecording) {
            NSLog(@"Start recording");
            [self.recorder record];
        }
    } else {
        if (self.recorder.isRecording) {
            NSLog(@"Stop recording");
            [self.recorder stop];
            [self recognition];
        }
    }
}
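
One practical wrinkle: with the logic above, a single quiet 0.1 s tick ends the recording, which can cut a sentence at a short pause. A possible refinement, offered only as a sketch, is to require several quiet ticks in a row. The class SXSilenceDetector and its properties are hypothetical and not part of the demo.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reports YES only after `requiredQuietTicks`
// consecutive readings at or below `threshold` (in dB).
@interface SXSilenceDetector : NSObject
@property (nonatomic) float threshold;
@property (nonatomic) NSUInteger requiredQuietTicks;
- (BOOL)shouldStopForPower:(float)power;
@end

@implementation SXSilenceDetector {
    NSUInteger _quietTicks;  // consecutive quiet readings seen so far
}

- (BOOL)shouldStopForPower:(float)power {
    if (power > self.threshold) {
        _quietTicks = 0;     // sound detected, start counting again
        return NO;
    }
    _quietTicks += 1;
    return _quietTicks >= self.requiredQuietTicks;
}

@end
```

With the 0.1 s timer, a threshold of -20 and requiredQuietTicks of 5 would mean roughly half a second of silence before stopping. The detector would be fed on every tick, and the recorder stopped only when it is recording and the method returns YES.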

③ The speech recognition task request

- (void)recognition {
    // Stop the timer
    [self.timer invalidate];
    // Stop the monitor as well
    [self.monitor stop];
    // Delete the monitor's recording file
    [self.monitor deleteRecording];
    
    // Create the speech recognizer
    SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"zh_CN"]];
    //            SFSpeechRecognizer *rec = [[SFSpeechRecognizer alloc]initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en_ww"]];
    
    // Recognize from a local audio file
    SFSpeechRecognitionRequest * request = [[SFSpeechURLRecognitionRequest alloc]initWithURL:_recordURL];
    [rec recognitionTaskWithRequest:request delegate:self];
}

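This snippet, which turns a local file into Chinese text, is probably the one passed around most online, because it takes no thought to write. On its own, though, it is basically useless. (Apart from WeChat's long-press feature that converts a voice message to text, I can't think of an app that needs to parse a local audio file directly. Auto-generating lyrics for an mp3? Jay Chou's songs would be hard to recognize, and speech recognition also requires the audio to be no longer than 1 minute.)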
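
The snippet also assumes the recognizer can always be used. A slightly more defensive sketch is shown below. The function name SXRecognizeFile is hypothetical. initWithLocale: can return nil for an unsupported locale, and isAvailable reports whether recognition can currently run.

```objc
#import <Foundation/Foundation.h>
#import <Speech/Speech.h>

// Hypothetical helper: starts file-based recognition only when the
// recognizer exists and is currently available. Returns nil otherwise.
static SFSpeechRecognitionTask *SXRecognizeFile(NSURL *fileURL,
                                                id<SFSpeechRecognitionTaskDelegate> delegate) {
    NSLocale *locale = [NSLocale localeWithLocaleIdentifier:@"zh_CN"];
    SFSpeechRecognizer *recognizer = [[SFSpeechRecognizer alloc] initWithLocale:locale];
    if (recognizer == nil || !recognizer.isAvailable) {
        NSLog(@"Speech recognition is not available right now");
        return nil;
    }
    SFSpeechURLRecognitionRequest *request =
        [[SFSpeechURLRecognitionRequest alloc] initWithURL:fileURL];
    return [recognizer recognitionTaskWithRequest:request delegate:delegate];
}
```

To stay under the one-minute limit mentioned above, the recorder could be started with `[self.recorder recordForDuration:60.0]` instead of `record`, so that it stops by itself.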

④ The delegate method for the result callback

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishRecognition:(SFSpeechRecognitionResult *)recognitionResult
{
    NSLog(@"%s",__FUNCTION__);
    NSLog(@"%@",recognitionResult.bestTranscription.formattedString);
    [self setupTimer];
}

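This is the method you will use most. The callbacks for the other stages can be added as needed. This is only a simple demonstration, and my demo project has more features.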

https://github.com/dsxNiubility/SXSpeechRecognitionTwoWays
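
For illustration only, two of the other delegate callbacks might be filled in as follows. The class SXDelegateDemoController and its resultLabel property are hypothetical and do not come from the demo project.

```objc
#import <UIKit/UIKit.h>
#import <Speech/Speech.h>

// Hypothetical controller that shows interim text and logs failures.
@interface SXDelegateDemoController : UIViewController <SFSpeechRecognitionTaskDelegate>
@property (nonatomic, strong) UILabel *resultLabel;  // hypothetical output label
@end

@implementation SXDelegateDemoController

// Called repeatedly with interim hypotheses while audio is being processed.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task
  didHypothesizeTranscription:(SFTranscription *)transcription {
    self.resultLabel.text = transcription.formattedString;
}

// Called once when the whole task ends; on failure the task carries the error.
- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task
        didFinishSuccessfully:(BOOL)successfully {
    if (!successfully) {
        NSLog(@"Recognition failed: %@", task.error);
    }
}

@end
```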

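iOS 10 made a big leap in speech-related features, mainly in two areas. One is the speech recognition covered above. The other is SiriKit, which passes requests from outside into the app to be handled. For now its limitations are obvious: it only supports the domains listed on the official site, such as booking a ride and sending messages, and it cannot even recognize something like "open Meituan and search for grilled fish restaurants". So there is not much more to explore for the moment, and we will have to wait for Apple's later updates.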