句子互動 | Build Your Own Raspberry Pi Voice Assistant with Snowboy

Date: 2019-11-08



Author: 梁皓然

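Founder of Xanthous Tech and a former full-stack engineer at Amazon. He returned to China in 2016 to start a company and built a team that provides chatbot consulting and development services to large companies worldwide, applying the RASA dialogue system and deeply integrating chatbots with Mini Programs on top of WeChat.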

The Idea

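A chatbot needs to understand natural language and respond accordingly. A chatbot module can be broken down into the following parts:

[Figure: breakdown of a chatbot module]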

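Developers already have plenty of open-source tools for building chatbot modules, and the major cloud platforms offer a range of services that support them and connect them to the popular chat platforms. At work I deal with bots on Slack all the time, using them for reminders and automation across our development and operations workflows.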

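Voice assistants of all kinds are also showing up around us: Xiaodu (小度) and Xiao AI (小愛), Siri, and devices such as Alexa and Google Home. I still remember the first Amazon Echo I brought home. I tried saying all sorts of things to it to see how it would reply, and friends visiting my place often pranked me by placing all kinds of Amazon orders through it. "Hey Siri" and "OK Google" on the phone are also very handy, even if I only use them to set an alarm or do a few simple things.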

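As a developer and a fan of the Marvel movies, I have often wondered whether I could build a voice assistant of my own, like Jarvis and Friday in the Iron Man films. To me, a voice chatbot can be broken down into the following parts:

[Figure: breakdown of a voice chatbot]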

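It looked like I only had to connect the components and run them on one machine! Then I thought of a problem: like the devices on the market, this assistant needs a wake word. Without a wake-up step it would have to listen all the time, which puts a heavy demand on storage and on the network connection. After some searching, I found Snowboy.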

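Snowboy is a hotword detection library made by kitt.ai. Once a hotword model has been trained, it runs offline with very low power consumption, so it can run on devices such as the Raspberry Pi. Official wrappers for Python, Golang, NodeJS, iOS, and Android make it possible to integrate it into your own code.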

Hands-On

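So I dug out my long-neglected Raspberry Pi, hooked up a microphone and a speaker, and started tinkering to see whether I could build a simple little Jarvis that understands what I say. I had also recently bought an iPad Pro, so I planned to connect it straight to the board, program over ssh, and practice some vim along the way, haha.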

Here is the setup:

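Board: NanoPi K1 Plus - I am a big fan of FriendlyARM (友善之臂) boards because they are good value for money. This board has 2 GB of RAM, Wi-Fi plus Ethernet (I need the Ethernet port to connect to the iPad), and even an onboard microphone. The OS is UbuntuCore 16.04 LTS, and most dependencies can be installed through apt.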

Microphone: Blue Snowball - I mostly work from home, so I am often in video meetings. The Blue microphone connects over USB and works on Linux without extra drivers.

Following the voice chatbot breakdown in the figure above, I decided to wire up the following services and test the complete flow:

Hotword Detection: Snowboy

Speech-to-Text: iFlytek (科大訊飛) speech dictation

Chatbot: Tuling Robot (圖靈機器人)

Text-to-Speech: iFlytek online speech synthesis

Once the machine was up, I installed nvm and used it to get the latest NodeJS v10 LTS. Then I created package.json and installed the Snowboy NodeJS wrapper:

npm init
npm install snowboy --save

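You need to read the documentation carefully and install all the dependencies Snowboy needs to compile (TODO). With the dependencies installed, let's look at Snowboy's sample code: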

// index.js

const record = require('node-record-lpcm16');
const Detector = require('snowboy').Detector;
const Models = require('snowboy').Models;

const models = new Models();

models.add({
  file: 'resources/models/snowboy.umdl',
  sensitivity: '0.5',
  hotwords : 'snowboy'
});

const detector = new Detector({
  resource: "resources/common.res",
  models: models,
  audioGain: 2.0,
  applyFrontend: true
});

detector.on('silence', function () {
  console.log('silence');
});

detector.on('sound', function (buffer) {
  // <buffer> contains the last chunk of the audio that triggers the "sound"
  // event. It could be written to a wav stream.
  console.log('sound');
});

detector.on('error', function () {
  console.log('error');
});

detector.on('hotword', function (index, hotword, buffer) {
  // <buffer> contains the last chunk of the audio that triggers the "hotword"
  // event. It could be written to a wav stream. You will have to use it
  // together with the <buffer> in the "sound" event if you want to get audio
  // data after the hotword.
  console.log(buffer);
  console.log('hotword', index, hotword);
});

const mic = record.start({
  threshold: 0,
  verbose: true
});

mic.pipe(detector);

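The sample does not pin a version of node-record-lpcm16, and after some debugging I found that the newer 1.x release has changed its API. I had to dig through the documentation to find out what changed: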

// index.js

const { record } = require('node-record-lpcm16');

const mic = record({
  sampleRate: 16000,
  threshold: 0.5,
  recorder: 'rec',
  device: 'plughw:CARD=Snowball',
}).stream();

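A few parameters are new here. The first is the hardware ID of the Snowball, which you can find with the arecord -L command. I also set the sample rate to 16k, because Snowboy's models are all trained on 16k audio and will not recognize anything if the sample rate does not match. Finally, I raised the threshold a little to block out some noise.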

Following the documentation, I switched to the Jarvis model and adjusted the sensitivity:

// index.js

models.add({
  file: 'snowboy/resources/models/jarvis.umdl',
  sensitivity: '0.8,0.80',
  hotwords : ['jarvis', 'jarvis'],
});

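Testing with the Jarvis model showed that the Jarvis hotword was recognized and the hotword callback fired. Thinking it over, I needed to save the audio stream and send it to iFlytek for dictation to get the text. So when the hotword event fires, the mic stream has to be piped into an fs write stream that writes an audio file. Snowboy's Detector also has sound and silence callbacks, so I used a simple flag to implement the recording and hand the audio to iFlytek's dictation API once the speaker stops talking.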

// index.js

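const fs = require('fs');
const { xunfeiTranscriber } = require('./xunfei_stt');

// Number of consecutive 'silence' events that ends a recording.
// The value is an assumption; the original listing does not define it.
const MAX_SILENCE_COUNT = 3;

let audios = 0;
let duplex;
let silenceCount;
let speaking;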

const init = () => {
  const filename = `audio${audios}.wav`;
  duplex = fs.createWriteStream(filename, { binary: true });
  silenceCount = 0;
  speaking = false;
  console.log(`initialized audio write stream to ${filename}`);
};

const transcribe = () => {
  console.log('transcribing');
  const filename = `audio${audios}.wav`;
  xunfeiTranscriber.push(filename);
};

detector.on('silence', function () {
  if (speaking) {
    if (++silenceCount > MAX_SILENCE_COUNT) {
      mic.unpipe(duplex);
      duplex.destroy();
      transcribe();
      audios++;
      init();
    }
  }
  console.log('silence', speaking, silenceCount);
});

detector.on('sound', function (buffer) {
  if (speaking) {
    silenceCount = 0;
  }

  console.log('sound');
});

detector.on('hotword', function (index, hotword, buffer) {
  if (!speaking) {
    silenceCount = 0;
    speaking = true;
    mic.pipe(duplex);
  }

  console.log('hotword', index, hotword);
});

mic.pipe(detector);
init();
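The flag-and-counter logic in the handlers above can also be written as a small pure function, which makes it easy to check without a microphone. This is only a sketch: `step` and `run` are hypothetical names, and the threshold value is an assumption.

```javascript
// Assumed threshold; the article does not state the value it used.
const MAX_SILENCE_COUNT = 3;

// Pure transition function: takes the current state and a detector event
// name, returns the next state plus an action for the caller to perform
// ('start' = begin recording, 'stop' = end recording and transcribe).
function step(state, event) {
  const { speaking, silenceCount } = state;

  if (event === 'hotword' && !speaking) {
    return { state: { speaking: true, silenceCount: 0 }, action: 'start' };
  }
  if (event === 'sound' && speaking) {
    // Any sound while recording resets the silence counter.
    return { state: { speaking: true, silenceCount: 0 }, action: null };
  }
  if (event === 'silence' && speaking) {
    const next = silenceCount + 1;
    if (next > MAX_SILENCE_COUNT) {
      return { state: { speaking: false, silenceCount: 0 }, action: 'stop' };
    }
    return { state: { speaking: true, silenceCount: next }, action: null };
  }
  // Events outside a recording leave the state untouched.
  return { state, action: null };
}

// Feed a list of events through the state machine and collect the actions.
function run(events) {
  let state = { speaking: false, silenceCount: 0 };
  const actions = [];
  for (const event of events) {
    const result = step(state, event);
    state = result.state;
    if (result.action) {
      actions.push(result.action);
    }
  }
  return { state, actions };
}
```

In a real handler, 'start' would correspond to `mic.pipe(duplex)` and 'stop' to unpiping, closing the file, and calling `transcribe()`.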

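In the code above, xunfeiTranscriber is our iFlytek dictation module. Since what we have is an audio file, the most comfortable option would be an API that takes the whole file and returns the text. Unfortunately iFlytek has deprecated its REST API in favor of a WebSocket-based streaming dictation API, so I had to hand-roll a client. I used an EventEmitter for messaging, which makes it quick to exchange messages with the main program.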

// xunfei_stt.js

const EventEmitter = require('events');
const WebSocket = require('ws');

let ws;
let transcriptionBuffer = '';

class XunfeiTranscriber extends EventEmitter {
  constructor() {
    super();
    this.ready = false;
    this.on('ready', () => {
      console.log('transcriber ready');
      this.ready = true;
    });
    this.on('error', (err) => {
      console.log(err);
    });
    this.on('result', () => {
      cleanupWs();
      this.ready = false;
      init();
    });
  }

  push(audioFile) {
    if (!this.ready) {
      console.log('transcriber not ready');
      return;
    }

    this.emit('push', audioFile);
  }
}

function init() {
  const host = 'iat-api.xfyun.cn';
  const path = '/v2/iat';

  const xunfeiUrl = () => {
    return `ws://${host}${path}?host=${host}&date=${encodeURIComponent(dateString)}&authorization=${authorization}`;
  };

  const url = xunfeiUrl();

  console.log(url);

  ws = new WebSocket(url);

  ws.on('open', () => {
    console.log('transcriber connection established');
    xunfeiTranscriber.emit('ready');
  });

  ws.on('message', (data) => {
    console.log('incoming xunfei transcription result');

    const payload = JSON.parse(data);

    if (payload.code !== 0) {
      cleanupWs();
      init();
      xunfeiTranscriber.emit('error', payload);
      return;
    }

    if (payload.data) {
      transcriptionBuffer += payload.data.result.ws.reduce((acc, item) => {
        // join('') avoids the commas that implicit array-to-string coercion inserts
        return acc + item.cw.map(cw => cw.w).join('');
      }, '');

      if (payload.data.status === 2) {
        xunfeiTranscriber.emit('result', transcriptionBuffer);
      }
    }
  });

  ws.on('error', (error) => {
    console.log(error);
    cleanupWs();
  });

  ws.on('close', () => {
    console.log('closed');
    init();
  });
}

const xunfeiTranscriber = new XunfeiTranscriber();

init();

module.exports = {
  xunfeiTranscriber,
};
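The listing above uses `dateString` and `authorization` without showing where they come from. The following is a minimal sketch of one way to build them, assuming the HMAC-SHA256 handshake scheme described in iFlytek's WebSocket API documentation; check the current documentation before relying on it. `buildXunfeiUrl`, `apiKey`, `apiSecret`, and the `protocol` parameter are hypothetical names that do not appear in the article.

```javascript
const crypto = require('crypto');

// Sketch of a signed connection URL (assumed scheme, see the note above).
function buildXunfeiUrl({ host, path, apiKey, apiSecret, date = new Date(), protocol = 'ws' }) {
  // RFC 1123 date string, for example "Thu, 01 Jan 1970 00:00:00 GMT"
  const dateString = date.toUTCString();

  // The string that gets signed: host, date, and the HTTP request line.
  const signatureOrigin = `host: ${host}\ndate: ${dateString}\nGET ${path} HTTP/1.1`;
  const signature = crypto
    .createHmac('sha256', apiSecret)
    .update(signatureOrigin)
    .digest('base64');

  // The authorization parameter is itself base64-encoded.
  const authorizationOrigin =
    `api_key="${apiKey}", algorithm="hmac-sha256", ` +
    `headers="host date request-line", signature="${signature}"`;
  const authorization = Buffer.from(authorizationOrigin).toString('base64');

  return `${protocol}://${host}${path}?host=${host}` +
    `&date=${encodeURIComponent(dateString)}` +
    `&authorization=${encodeURIComponent(authorization)}`;
}
```

The credentials would typically come from environment variables, in the same way the article reads `process.env.XUNFEI_APPID`.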

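Handling the push event is the tricky part. Testing showed that the iFlytek dictation API accepts only about 13k of audio data per WebSocket message. The audio is base64-encoded, so each message can carry at most roughly 9k bytes of raw audio. The file therefore has to be sent in batches as the iFlytek API documentation describes, and an end frame must be sent at the end, otherwise the API times out and closes the connection. The returned text also arrives in segments, so a buffer is needed to hold them until all the text has come back and can be joined for output.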
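As a quick check of the arithmetic: base64 turns every 3 bytes into 4 characters, so a 9000-byte chunk becomes 12000 characters, which stays under the roughly 13k limit. The sketch below splits an in-memory buffer into frames using the status codes from this article (0 for the first frame, 1 for continuation frames, 2 for the end frame). `base64Length` and `toFrames` are hypothetical helper names.

```javascript
// Raw bytes per frame; 9000 matches the chunk size used in this article.
const CHUNK_SIZE = 9000;

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteCount) {
  return Math.ceil(byteCount / 3) * 4;
}

// Split a PCM buffer into frames tagged with a status code:
// 0 = first frame, 1 = continuation, 2 = end frame (carries no audio).
function toFrames(pcm, chunkSize = CHUNK_SIZE) {
  const frames = [];
  for (let offset = 0; offset < pcm.length; offset += chunkSize) {
    const chunk = pcm.slice(offset, offset + chunkSize);
    frames.push({
      status: offset === 0 ? 0 : 1,
      audio: chunk.toString('base64'),
    });
  }
  frames.push({ status: 2 });
  return frames;
}
```

Each frame would still need to be wrapped in the JSON payload the API expects before it is sent over the socket.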

// xunfei_stt.js

const fs = require('fs');

xunfeiTranscriber.on('push', function pushAudioFile(audioFile) {
  transcriptionBuffer = '';

  const audioPayload = (statusCode, audioBase64) => ({
    common: statusCode === 0 ? {
      app_id: process.env.XUNFEI_APPID,
    } : undefined,
    business: statusCode === 0 ? {
      language: 'zh_cn',
      domain: 'iat',
      ptt: 0,
    } : undefined,
    data: {
      status: statusCode,
      format: 'audio/L16;rate=16000',
      encoding: 'raw',
      audio: audioBase64,
    },
  });

  const chunkSize = 9000;
  const buffer = Buffer.alloc(chunkSize); // new Buffer(size) is deprecated

  fs.open(audioFile, 'r', (err, fd) => {
    if (err) {
      throw err;
    }

    let i = 0;

    function readNextChunk() {
      fs.read(fd, buffer, 0, chunkSize, null, (errr, nread) => {
        if (errr) {
          throw errr;
        }

        if (nread === 0) {
          console.log('sending end frame');

          ws.send(JSON.stringify({
            data: { status: 2 },
          }));

          return fs.close(fd, (err) => {
            if (err) {
              throw err;
            }
          });
        }

        let data;
        if (nread < chunkSize) {
          data = buffer.slice(0, nread);
        } else {
          data = buffer;
        }

        const audioBase64 = data.toString('base64');
        console.log('chunk', i, 'size', audioBase64.length);
        const payload = audioPayload(i >= 1 ? 1 : 0, audioBase64);

        ws.send(JSON.stringify(payload));
        i++;

        readNextChunk();
      });
    }

    readNextChunk();
  });
});

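Careful readers will have noticed some restart logic in the xunfei_stt.js code. During testing I found that each connection to this iFlytek API accepts only one message, and a new audio stream requires reconnecting to the API... so I close the WebSocket connection myself after each message has been sent.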
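The listings call `cleanupWs()` but never show it. The sketch below is a hypothetical stand-in, not the author's implementation. It assumes the socket comes from the `ws` package, whose instances are event emitters with a `terminate()` method. The handlers are removed first so that closing the old socket does not fire the 'close' handler and start a second reconnect.

```javascript
// Hypothetical stand-in for the article's cleanupWs().
function cleanupSocket(socket) {
  if (!socket) {
    return;
  }
  // Detach the old handlers so that closing does not trigger a reconnect.
  socket.removeAllListeners();
  // Keep a no-op 'error' listener so a late error cannot crash the process.
  socket.on('error', () => {});
  // Forcibly close the connection.
  socket.terminate();
}
```

With a helper like this, the caller decides explicitly when to call `init()` again, as the result handler in the article does.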

Next comes integrating Tuling Robot to get a reply. xunfeiTranscriber provides a result event, so we listen for it and pass the received message to Tuling Robot.

// index.js

const { tulingBot } = require('./tuling_bot');

xunfeiTranscriber.on('result', async (data) => {
  console.log('transcriber result:', data);
  const response = await tulingBot(data);
  console.log(response);
});

// tuling_bot.js

const axios = require('axios');

const url = 'http://openapi.tuling123.com/openapi/api/v2';

async function tulingBot(text) {
  const response = await axios.post(url, {
    reqType: 0,
    perception: {
      inputText: {
        text,
      },
    },
    userInfo: {
      apiKey: process.env.TULING_API_KEY,
      userId: 'myUser',
    },
  });

  console.log(JSON.stringify(response.data, null, 2));
  return response.data;
}

module.exports = {
  tulingBot,
};
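The reply text has to be pulled out of the Tuling response before it can be spoken. Here is a small sketch of that step. The response shape (`results[].values.text`) is inferred from how this article's playback code reads the response, not from the official API reference, and `extractReplies` is a hypothetical helper name.

```javascript
// Collect the text replies from a Tuling-style response object.
// Entries without a text value (for example links or images) are skipped.
function extractReplies(response) {
  const results = (response && response.results) || [];
  return results
    .filter((r) => r.values && typeof r.values.text === 'string')
    .map((r) => r.values.text);
}
```

Returning an empty array for a missing or malformed response lets the caller loop over the replies without extra checks.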

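With Tuling Robot connected, we need to synthesize speech from the text it returns. iFlytek's speech synthesis WebAPI is still REST-based, and someone has already written an open-source implementation for it, so this part is fairly simple.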

// index.js

const Speaker = require('speaker'); // assumed: the `speaker` npm package
const { xunfeiTTS } = require('./xunfei_tts');

xunfeiTranscriber.on('result', async (data) => {
  console.log('transcriber result:', data);
  const response = await tulingBot(data);

  const playVoice = (filename) => {
    return new Promise((resolve, reject) => {
      const speaker = new Speaker({
        channels: 1,
        bitDepth: 16,
        sampleRate: 16000,
      });
      const outStream = fs.createReadStream(filename);
      // this is just to activate the speaker: about 1s of audio
      // (32000 bytes at 16 kHz, 16-bit, mono)
      speaker.write(Buffer.alloc(32000, 10));
      outStream.pipe(speaker);
      outStream.on('end', resolve);
    });
  };

  for (let i = 0; i < response.results.length; i++) {
    const result = response.results[i];
    if (result.values && result.values.text) {
      const outputFilename = await xunfeiTTS(result.values.text, `${audios-1}-${i}`);
      if (outputFilename) {
        await playVoice(outputFilename);
      }
    }
  }
});
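Note that `playVoice` resolves when the file has been read, which can be before the audio has finished playing. One rough way to estimate how long a raw PCM buffer plays is to divide its size by the byte rate. This is a sketch with hypothetical helper names (`pcmDurationMs`, `waitForPlayback`), not part of the original code.

```javascript
// Duration of raw PCM audio in milliseconds.
function pcmDurationMs(byteLength, { sampleRate, channels, bitDepth }) {
  const bytesPerSecond = sampleRate * channels * (bitDepth / 8);
  return (byteLength / bytesPerSecond) * 1000;
}

// Resolve once the estimated playback time has elapsed.
function waitForPlayback(byteLength, format) {
  return new Promise((resolve) => {
    setTimeout(resolve, pcmDurationMs(byteLength, format));
  });
}
```

With the format used in this article (16 kHz, 16-bit, mono), 32000 bytes correspond to one second of audio.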

// xunfei_tts.js
const fs = require('fs');
const xunfei = require('xunfeisdk');
const { promisify } = require('util');

const writeFileAsync = promisify(fs.writeFile);

const client = new xunfei.Client(process.env.XUNFEI_APPID);
client.TTSAppKey = process.env.XUNFEI_TTS_KEY;

async function xunfeiTTS(text, audios) {
  console.log('turning following text into speech:', text);

  try {
    const result = await client.TTS(
      text,
      xunfei.TTSAufType.L16_16K,
      xunfei.TTSAueType.RAW,
      xunfei.TTSVoiceName.XiaoYan,
    );

    console.log(result);

    const filename = `response${audios}.wav`;

    await writeFileAsync(filename, result.audio);

    console.log(`response written to ${filename}`);

    return filename;
  } catch (err) {
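    // err.response is only set for HTTP errors; other failures
    // (for example network errors) are logged as-is
    if (err.response) {
      console.log(err.response.status);
      console.log(err.response.headers);
      console.log(err.response.data);
    } else {
      console.log(err);
    }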

    return null;
  }
}

module.exports = {
  xunfeiTTS,
};

And at last, this bot can understand what I say!

The complete code is attached below.

Afterword

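I think it works quite well overall, and it is highly customizable. Next I would like to test speech APIs from other vendors and hook the assistant up to Rasa and Wechaty, so that I can talk to the bot at home and also get text-and-image information in WeChat. Integrating the iFlytek API turned out to be unexpectedly complicated, and one problem I consider close to fatal is that connecting to iFlytek's WebAPI is very slow. At first I blamed the board, but when I called the Tuling API and the iFlytek API separately, the Tuling API responded very quickly while the iFlytek API spent a long time just establishing the connection. So the STT module currently needs to warm up, and you can only start speaking once the connection is ready. I plan to switch to another vendor's API later to see whether that improves the experience.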

I hope this demo serves as a starting point for others, and that we will see more and cooler voice assistants and bots in the future.