課程五(Sequence Models),第三週(Sequence models & Attention mechanism) —— 2.Programming assignments: Trigger word detection

Trigger Word Detection

Welcome to the final programming assignment of this specialization!

In this week's videos, you learned about applying deep learning to speech recognition. In this assignment, you will construct a speech dataset and implement an algorithm for trigger word detection (sometimes also called keyword detection, or wakeword detection). Trigger word detection is the technology that allows devices like Amazon Alexa, Google Home, Apple Siri, and Baidu DuerOS to wake up upon hearing a certain word.

For this exercise, our trigger word will be "Activate." Every time it hears you say "activate," it will make a "chiming" sound. By the end of this assignment, you will be able to record a clip of yourself talking, and have the algorithm trigger a chime when it detects you saying "activate."

After completing this assignment, perhaps you can also extend it to run on your laptop so that every time you say "activate" it starts up your favorite app, or turns on a network connected lamp in your house, or triggers some other event.

In this assignment you will learn to:

  • Structure a speech recognition project
  • Synthesize and process audio recordings to create train/dev datasets
  • Train a trigger word detection model and make predictions

【中文翻譯】

觸發字檢測


歡迎來到本專項課程的最後一個編程任務!

在本週的視頻中, 您瞭解瞭如何將深度學習應用於語音識別。在這個任務中, 您將構造一個語音數據集並實現觸發字檢測 (有時也稱爲關鍵字檢測或 wakeword 檢測) 的算法。觸發字檢測是一種技術, 它容許亞馬遜 Alexa、Google Home、蘋果 Siri 和百度 DuerOS 等設備在聽到某個單詞時醒來。

對於本練習, 咱們的觸發器詞是 "Activate". 每次聽到您說 "Activate" 時, 都會發出 "chiming" 聲音。完成此任務後, 您將可以錄製本身說話的剪輯, 而且在檢測到您說 "activate" 時, 該算法會觸發一個chime。

完成此任務後, 您可能還能夠將其擴展到laptop上運行, 以便每次您說 "activate ", 就會啓動您喜歡的應用程序, 或打開您的房子中的連接了網絡的燈, 或觸發其餘事件。

 

在這個做業中, 您將學習:

  • 構造語音識別項目
  • 合成和處理錄音以建立訓練/開發數據集
  • 訓練觸發字檢測模型並進行預測

 

Let's get started! Run the following cell to load the packages you are going to use.

【code】

import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import IPython
from td_utils import *
%matplotlib inline

  

1 - Data synthesis: Creating a speech dataset

Let's start by building a dataset for your trigger word detection algorithm. A speech dataset should ideally be as close as possible to the application you will want to run it on. In this case, you'd like to detect the word "activate" in working environments (library, home, offices, open-spaces ...). You thus need to create recordings with a mix of positive words ("activate") and negative words (random words other than activate) on different background sounds. Let's see how you can create such a dataset.

【中文翻譯】

讓咱們首先構建一個數據集, 用於觸發詞檢測算法。語音數據集最好儘量靠近要運行它的應用程序。在這種狀況下, 您但願在工做環境 (圖書館、家裏、辦公室、開放空間...) 中檢測單詞  "激活 "。所以, 您須要在不一樣的背景聲音中建立帶有 positive words ( "activate ") 和negative words (除了激活之外的隨機單詞) 的混合的錄製。讓咱們看看如何建立這樣的數據集。

 

1.1 - Listening to the data

One of your friends is helping you out on this project, and they've gone to libraries, cafes, restaurants, homes and offices all around the region to record background noises, as well as snippets of audio of people saying positive/negative words. This dataset includes people speaking in a variety of accents.

In the raw_data directory, you can find a subset of the raw audio files of the positive words, negative words, and background noise. You will use these audio files to synthesize a dataset to train the model. The "activate" directory contains positive examples of people saying the word "activate". The "negatives" directory contains negative examples of people saying random words other than "activate". There is one word per audio recording. The "backgrounds" directory contains 10 second clips of background noise in different environments.

【中文翻譯】

你的一個朋友正在幫助你這個項目, 他們已經去圖書館, 咖啡館, 餐館, 家庭和辦公室各地的地區, 以記錄背景噪音, 以及包含人說積極/否認詞的音頻片斷。這個數據集包括用各類口音說話的人。

在 raw_data 目錄中, 您能夠找到positive words、negative words和背景噪音的原始音頻文件的子集。您將使用這些音頻文件合成數據集來訓練模型。 "activate " 目錄包含一些人說 "activate " 這個詞的正面例子。"negatives " 目錄中包含一些負面的例子, 說明 其餘人說的是隨機單詞。每一個錄音都有一個字。 "backgrounds " 目錄在不一樣環境中包含10秒的背景噪音片斷。

 

Run the cells below to listen to some examples.

【code】

IPython.display.Audio("./raw_data/activates/1.wav")

【result】

【注】原網頁是音頻,發出activate的聲音。這裏是截圖。

 

【code】

IPython.display.Audio("./raw_data/negatives/4.wav")

【result】

【注】原網頁是音頻。這裏是截圖。

 

【code】

IPython.display.Audio("./raw_data/backgrounds/1.wav")

【result】

【注】原網頁是音頻。這裏是截圖。  

 

You will use these three types of recordings (positives/negatives/backgrounds) to create a labelled dataset.

  

 

1.2 - From audio recordings to spectrograms

What really is an audio recording? A microphone records little variations in air pressure over time, and it is these little variations in air pressure that your ear also perceives as sound. You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone. We will use audio sampled at 44100 Hz (or 44100 Hertz). This means the microphone gives us 44100 numbers per second. Thus, a 10 second audio clip is represented by 441000 numbers (= 10 × 44100).

It is quite difficult to figure out from this "raw" representation of audio whether the word "activate" was said. In order to help your sequence model more easily learn to detect triggerwords, we will compute a spectrogram of the audio. The spectrogram tells us how much different frequencies are present in an audio clip at a moment in time.

(If you've ever taken an advanced class on signal processing or on Fourier transforms, a spectrogram is computed by sliding a window over the raw audio signal, and calculates the most active frequencies in each window using a Fourier transform. If you don't understand the previous sentence, don't worry about it.)

【中文翻譯】 

1.2-從錄音到圖譜
什麼是錄音?麥克風記錄了隨着時間的推移, 氣壓的微小變化, 正是這些空氣壓力的小變化, 被你的耳朵覺察爲聲音。你能夠認爲錄音是一長串的數字, 測量麥克風檢測到的小氣壓變化。咱們將使用採樣率爲 44100 赫茲的音頻。這意味着麥克風每秒給咱們44100個數字。所以, 10 秒的音頻剪輯由441000個數字 (= 10 × 44100) 表示。

 從這個 "原始 " 的音頻中,判斷"activate " 這個詞是否說了是至關困難的。爲了幫助您的序列模型更容易地學會檢測 triggerwords, 咱們將計算音頻的頻譜圖。頻譜圖告訴咱們在某一時刻音頻剪輯中有多少不一樣的頻率。

(若是您曾經上過信號處理或傅立葉變換這類的課, 則會知道, 頻譜圖是經過在原始音頻信號上滑動窗口來計算的, 並使用傅立葉變換計算每一個窗口中最活躍的頻率。若是你不理解前一句話, 不要擔憂。)
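
In this notebook the spectrogram is produced by graph_spectrogram from td_utils. As a rough, illustrative sketch (not the actual helper), a spectrogram can be computed with matplotlib's specgram; the NFFT and noverlap values below are assumptions chosen so the output shape is consistent with the (101, 5511) you will see shortly:

【code】

import matplotlib.mlab as mlab
from scipy.io import wavfile

def compute_spectrogram_sketch(wav_file, nfft=200, fs=44100, noverlap=120):
    """Illustrative only: slide a window of nfft samples over the raw audio
    (stride = nfft - noverlap) and take the Fourier transform in each window."""
    _, data = wavfile.read(wav_file)
    if data.ndim == 2:                 # stereo recording: keep one channel
        data = data[:, 0]
    # pxx has shape (n_freq, n_timesteps) = (nfft/2 + 1, number of windows)
    pxx, freqs, times = mlab.specgram(data, NFFT=nfft, Fs=fs, noverlap=noverlap)
    return pxx
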

 

Let's see an example.

 【code】

IPython.display.Audio("audio_examples/example_train.wav")

【result】

【注】原網頁是音頻。這裏是截圖。

 

【code】

x = graph_spectrogram("audio_examples/example_train.wav")

【result】

The graph above represents how active each frequency is (y axis) over a number of time-steps (x axis).

  

【中文翻譯】

圖 1: 音頻記錄的頻譜圖, 其中的顏色顯示不一樣頻率在不一樣時間點的音頻中存在 (響亮) 的程度。綠色方塊意味着某一頻率在音頻片斷中更加活躍或更多地存在 (更響亮);藍色方塊表示較不活躍的頻率。

 

The dimension of the output spectrogram depends upon the hyperparameters of the spectrogram software and the length of the input. In this notebook, we will be working with 10 second audio clips as the "standard length" for our training examples. The number of timesteps of the spectrogram will be 5511. You'll see later that the spectrogram will be the input x into the network, and so Tx=5511.

【中文翻譯】

輸出頻譜圖的維度取決於頻譜圖軟件的 hyperparameters 和輸入的長度。在本筆記本中, 咱們將使用10秒的音頻剪輯做爲咱們的訓練樣本的 "標準長度 "。頻譜圖的 timesteps 數將爲5511。稍後您將看到頻譜圖將做爲x輸入 到網絡中, 所以 Tx=5511。

 【code】

_, data = wavfile.read("audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)

【result】

Time steps in audio recording before spectrogram (441000,)
Time steps in input after spectrogram (101, 5511)

  

Now, you can define:

【code】

Tx = 5511 # The number of time steps input to the model from the spectrogram                  從頻譜圖向模型輸入的時間步長數
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram  在頻譜圖的每一個時間步驟中向模型輸入的頻率數

  

 

Note that even with 10 seconds being our default training example length, 10 seconds of time can be discretized to different numbers of values. You've seen 441000 (raw audio) and 5511 (spectrogram). In the former case, each step represents 10/441000 ≈ 0.000023 seconds. In the second case, each step represents 10/5511 ≈ 0.0018 seconds.

For the 10sec of audio, the key values you will see in this assignment are:

  • 441000(raw audio)
  • 5511=Tx (spectrogram output, and dimension of input to the neural network).
  • 10000 (used by the pydub module to synthesize audio)
  • 1375=Ty (the number of steps in the output of the GRU you'll build).

Note that each of these representations corresponds to exactly 10 seconds of time. It's just that they are discretizing them to different degrees. All of these are hyperparameters and can be changed (except the 441000, which is a function of the microphone). We have chosen values that are within the standard ranges used for speech systems.

 

Consider the Ty = 1375 number above. This means that for the output of the model, we discretize the 10s into 1375 time-intervals (each one of length 10/1375 ≈ 0.0072s) and try to predict for each of these intervals whether someone recently finished saying "activate."

Consider also the 10000 number above. This corresponds to discretizing the 10sec clip into 10/10000 = 0.001 second intervals. 0.001 seconds is also called 1 millisecond, or 1ms. So when we say we are discretizing according to 1ms intervals, it means we are using 10,000 steps.

 

【中文翻譯】 

請注意, 即便10秒是咱們的默認訓練示例長度, 10 秒的時間能夠被離散到不一樣的值數。您已經看到 441000 (原始音頻) 和 5511 (頻譜圖)。在前一種狀況下, 每一個步驟表示10秒/441000≈0.000023 秒鐘。在第二種狀況下, 每一個步驟表明 10/5511≈0.0018 秒。

對於10sec 的音頻, 您將在該做業中看到的關鍵值是:

  • 441000 (原始音頻)
  • 5511 = Tx (頻譜圖輸出和輸入到神經網絡的維度)。
  • 10000 (由 pydub 模塊合成音頻)
  • 1375 = Ty (您要構建的 GRU 的輸出中的步驟數)。

請注意, 每一個表示形式對應的時間正好爲10秒。只是他們離散的程度不一樣。全部這些都是 hyperparameters, 能夠改變 (除了 441000, 這是由麥克風決定的)。咱們選擇了在語音系統的標準範圍內使用的值。

請考慮上面的 Ty=1375 。這意味着, 對於模型的輸出, 咱們將10s 離散爲1375個時間間隔 (每一個長度 10/1375≈0.0072s), 並嘗試預測每一個間隔是否有人最近完成了 "activate"。

還要考慮上面的10000這個數字。這對應於將10s 離散爲10000個時間間隔 (每一個長度 10/10000 = 0.001s)。0.001 秒也稱爲1毫秒或1ms。因此, 當咱們說咱們根據1ms 的時間間隔離散時, 這意味着咱們使用10,000步驟。

 

【code】

Ty = 1375 # The number of time steps in the output of our model 模型輸出中的時間步長數
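
Since all of these representations cover the same 10 seconds, converting between them is just a matter of rescaling. Here is a small illustrative sketch (the helper name is made up; the constants are the same values defined above) that maps a time in seconds to its index in each discretization:

【code】

Tx = 5511        # spectrogram time steps (as defined above)
Ty = 1375        # model output time steps (as defined above)
N_RAW = 441000   # raw audio samples (44100 Hz * 10 s)
N_PYDUB = 10000  # pydub steps (1 ms each)
CLIP_SECONDS = 10.0

def indices_at(t_seconds):
    """Return the index of time t_seconds in each of the four discretizations."""
    frac = t_seconds / CLIP_SECONDS
    return {
        "raw_audio":    int(frac * N_RAW),
        "spectrogram":  int(frac * Tx),
        "pydub_ms":     int(frac * N_PYDUB),
        "model_output": int(frac * Ty),
    }

# 5 seconds into the clip falls at raw sample 220500, spectrogram step 2755,
# pydub position 5000 ms and model output step 687
print(indices_at(5.0))
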

 

1.3 - Generating a single training example

Because speech data is hard to acquire and label, you will synthesize your training data using the audio clips of activates, negatives, and backgrounds. It is quite slow to record lots of 10 second audio clips with random "activates" in it. Instead, it is easier to record lots of positives and negative words, and record background noise separately (or download background noise from free online sources).

To synthesize a single training example, you will:

  • Pick a random 10 second background audio clip
  • Randomly insert 0-4 audio clips of "activate" into this 10sec clip
  • Randomly insert 0-2 audio clips of negative words into this 10sec clip

Because you had synthesized the word "activate" into the background clip, you know exactly when in the 10sec clip the "activate" makes its appearance. You'll see later that this makes it easier to generate the labels y⟨t⟩ as well.

You will use the pydub package to manipulate audio. Pydub converts raw audio files into lists of Pydub data structures (it is not important to know the details here). Pydub uses 1ms as the discretization interval (1ms is 1 millisecond = 1/1000 seconds) which is why a 10sec clip is always represented using 10,000 steps.

【中文翻譯】

因爲語音數據難以獲取和標記, 您將使用 activates, negatives, and backgrounds 的音頻剪輯來合成訓練數據。錄製大量其中隨機出現 "activate" 的10秒音頻剪輯是至關緩慢的。相反, 錄製大量的 positive 和 negative 的單詞, 並單獨錄製背景噪音 (或從免費的在線來源下載背景噪音) 要容易得多。

要合成一個訓練樣本, 您將:

  • 選擇隨機10秒背景音頻剪輯
  • 隨機插入0-4 音頻剪輯 "activate" 到這個10sec 剪輯
  • 在10sec 剪輯中隨機插入0-2 個negative 單詞的音頻剪輯

由於您已將單詞 "activate" 合成到背景剪輯中, 因此您確切知道 "activate" 在10sec 剪輯中什麼時候出現。稍後您將看到, 這使得生成標籤 y⟨t⟩也更容易。

您將使用 pydub 包來操做音頻。Pydub 將原始音頻文件轉換爲 Pydub 數據結構列表 (在此處瞭解詳細信息,但並不重要)。Pydub 使用1ms 做爲離散化間隔 (1ms 是1毫秒 = 1/1000 秒), 這就是爲何10sec 剪輯老是使用1萬個步驟表示的緣由。
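
load_raw_audio is provided by td_utils; as a hedged sketch (the directory layout is assumed from the raw_data description above, and this is not the actual helper), it could be implemented with pydub roughly like this:

【code】

import glob
from pydub import AudioSegment

def load_raw_audio_sketch(raw_data_dir="./raw_data"):
    """Hypothetical version of load_raw_audio: load every wav clip with pydub."""
    activates = [AudioSegment.from_wav(f)
                 for f in sorted(glob.glob(raw_data_dir + "/activates/*.wav"))]
    negatives = [AudioSegment.from_wav(f)
                 for f in sorted(glob.glob(raw_data_dir + "/negatives/*.wav"))]
    backgrounds = [AudioSegment.from_wav(f)
                   for f in sorted(glob.glob(raw_data_dir + "/backgrounds/*.wav"))]
    return activates, negatives, backgrounds

Note that len() of a pydub AudioSegment is its duration in milliseconds, which is why the 10 second background prints as 10000 in the cell below.
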

 

【code】

# Load audio segments using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 

 

【result】

background len: 10000
activate[0] len: 916
activate[1] len: 1579

  

Overlaying positive/negative words on the background:

Given a 10sec background clip and a short audio clip (positive or negative word), you need to be able to "add" or "insert" the word's short audio clip onto the background. To ensure audio segments inserted onto the background do not overlap, you will keep track of the times of previously inserted audio clips. You will be inserting multiple clips of positive/negative words onto the background, and you don't want to insert an "activate" or a random word somewhere that overlaps with another clip you had previously added.

For clarity, when you insert a 1sec "activate" onto a 10sec clip of cafe noise, you end up with a 10sec clip that sounds like someone saying "activate" in a cafe, with "activate" superimposed on the background cafe noise. You do not end up with an 11 sec clip. You'll see later how pydub allows you to do this.

【中文翻譯】  

background上疊加positive/negative詞:

給定10sec 背景剪輯和短音頻剪輯 (positive or negative 單詞), 您須要可以在背景上 "添加 " 或 將單詞的短音頻剪輯插入 。要確保插入到背景上的音頻段不重疊, 您將跟蹤之前插入的音頻剪輯的時間。您將在背景上插入多個positive or negative 單詞剪輯, 而且不但願在與之前添加的其餘剪輯重疊的地方插入 "activate " 或隨機單詞。

爲清楚起見, 當您將 1sec 的 "activate" 插入到 10sec 的咖啡館噪音剪輯中時, 你最終獲得的是一個 10sec 的剪輯, 聽起來像有人在咖啡館裏說 "activate", 其中 "activate" 疊加在咖啡館的背景噪音上。你不會獲得一個11秒的剪輯。稍後您將看到 pydub 如何容許您這樣作。
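
A quick sanity check of this (a small sketch, reusing the activates and backgrounds lists loaded above) is to compare pydub lengths before and after an overlay:

【code】

# overlay mixes the short clip into the background at the given position (in ms);
# the result keeps the background's 10,000 ms length rather than growing to 11 s.
mixed = backgrounds[0].overlay(activates[0], position=2000)
print(len(backgrounds[0]), len(activates[0]), len(mixed))   # 10000, 916, 10000
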

  

Creating the labels at the same time you overlay:

Recall also that the labels y⟨t⟩ represent whether or not someone has just finished saying "activate." Given a background clip, we can initialize y⟨t⟩ = 0 for all t, since the clip doesn't contain any "activates."

When you insert or overlay an "activate" clip, you will also update the labels y⟨t⟩, so that 50 steps of the output now have target label 1. You will train a GRU to detect when someone has finished saying "activate". For example, suppose the synthesized "activate" clip ends at the 5sec mark in the 10sec audio---exactly halfway into the clip. Recall that Ty = 1375, so timestep 687 = int(1375 * 0.5) corresponds to the moment at 5sec into the audio. So, you will set y⟨688⟩ = 1. Further, you would be quite satisfied if the GRU detects "activate" anywhere within a short time-interval after this moment, so we actually set 50 consecutive values of the label y⟨t⟩ to 1. Specifically, we have y⟨688⟩ = y⟨689⟩ = ⋯ = y⟨737⟩ = 1.

This is another reason for synthesizing the training data: It's relatively straightforward to generate these labels y⟨t⟩ as described above. In contrast, if you have 10sec of audio recorded on a microphone, it's quite time consuming for a person to listen to it and mark manually exactly when "activate" finished.

【中文翻譯】 

在合成的同時建立標籤:

還記得標籤 y⟨t⟩表明是否有人剛剛說 "activate". 給定一個背景剪輯, 咱們能夠對全部 t 初始化 y⟨t⟩=0, 由於剪輯不包含任何 "激活"。

當您插入或覆蓋 "activate" 剪輯時, 您還將更新 y⟨t⟩的標籤, 以便輸出的50步如今具備目標標籤1。您將訓練一個 GRU, 以檢測什麼時候有人已完成說 "activate "。例如, 假設合成 "activate " 剪輯結束於10sec 音頻中的5sec 標記,---剛好位於剪輯的一半處。還記得Ty=1375, 所以 timestep 687 = int (1375 * 0.5) 對應於音頻的第5sec 。因此, 你會設置 y⟨688⟩=1。此外, 若是 GRU 在這時刻的以後的短期內的任何位置檢測 "activate ",咱們是很高興的。 那麼咱們實際上將連續50個標籤 y⟨t⟩的值設置爲1。具體地說, 咱們有 y⟨688⟩=y⟨689⟩=⋯=y⟨737⟩=1。

這是合成訓練數據的另外一個緣由: 按照上面的描述, 生成這些標籤 y⟨t⟩相對簡單。相比之下, 若是在麥克風上錄製了10sec 音頻, 那麼一我的聽它, 並在 "activate " 完成時手動標記是至關耗時的。

 

Here's a figure illustrating the labels y⟨t⟩, for a clip in which we have inserted "activate", "innocent", "activate", "baby." Note that the positive labels "1" are associated only with the positive words.

To implement the training set synthesis process, you will use the following helper functions. All of these functions will use a 1ms discretization interval, so the 10sec of audio is always discretized into 10,000 steps.

  1. get_random_time_segment(segment_ms) gets a random time segment in our background audio
  2. is_overlapping(segment_time, existing_segments) checks if a time segment overlaps with existing segments
  3. insert_audio_clip(background, audio_clip, existing_times) inserts an audio segment at a random time in our background audio using get_random_time_segment and is_overlapping
  4. insert_ones(y, segment_end_ms) inserts 1's into our label vector y after the word "activate"

The function get_random_time_segment(segment_ms) returns a random time segment onto which we can insert an audio clip of duration segment_ms. Read through the code to make sure you understand what it is doing.

【中文翻譯】 

 這裏有一個圖, 說明標籤 y⟨t⟩, 對於一個剪輯, 咱們已經插入  "activate", "innocent", 」activate", "baby." 。請注意, positive 標籤 " 1  "只與positive的單詞關聯。

要實現訓練集的合成過程, 您將使用如下幫助函數。全部這些函數都將使用1ms 離散化間隔, 因此10sec 的音頻老是離散爲10000步。

  1. get_random_time_segment (segment_ms) 在咱們的背景音頻中獲取隨機時間段
  2. is_overlapping (segment_time、existing_segments) 檢查時間段是否與現有段重疊
  3. insert_audio_clip( background, audio_clip, existing_times) 在咱們的背景音頻中使用 get_random_time_segment 和 is_overlapping 在隨機時間插入音頻段
  4. insert_ones (y, segment_end_ms) 在 檢查到字 "activate "後,插入1到咱們的標籤向量 y 中。

函數 get_random_time_segment (segment_ms) 返回一個隨機時間段, 咱們能夠在上面插入一個持續時間 segment_ms 的音頻剪輯. 通讀代碼以確保您瞭解它正在作什麼。

 

【code】

def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.
    
    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")
    
    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    
    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure segment doesn't run past the 10sec background 
    segment_end = segment_start + segment_ms - 1
    
    return (segment_start, segment_end)

 

Next, suppose you have inserted audio clips at segments (1000,1800) and (3400,4500). I.e., the first segment starts at step 1000, and ends at step 1800. Now, if we are considering inserting a new audio clip at (3000,3600) does this overlap with one of the previously inserted segments? In this case, (3000,3600) and (3400,4500) overlap, so we should decide against inserting a clip here.

For the purpose of this function, define (100,200) and (200,250) to be overlapping, since they overlap at timestep 200. However, (100,199) and (200,250) are non-overlapping.

【中文翻譯】  

接下來, 假設您已經在段 (1000,1800) 和 (3400,4500) 中插入了音頻剪輯。即, 第一段從步驟1000開始, 在步驟1800結束。如今, 若是咱們正在考慮在 (3000,3600) 中插入一個新的音頻剪輯, 這與之前插入的片斷有重疊嗎?在這種狀況下, (3000,3600) 和 (3400,4500) 重疊, 因此咱們應該決定不在此插入剪輯。

爲了這個函數的目的, 定義 (100,200) 和 (200,250) 重疊, 由於它們在 timestep 200 重疊。然而, (100,199) 和 (200,250) 是不重疊的。

 

Exercise: Implement is_overlapping(segment_time, existing_segments) to check if a new time segment overlaps with any of the previous segments. You will need to carry out 2 steps:

  1. Create a "False" flag, that you will later set to "True" if you find that there is an overlap.
  2. Loop over the previous_segments' start and end times. Compare these times to the segment's start and end times. If there is an overlap, set the flag defined in (1) as True. You can use:
    for ....: if ... <= ... and ... >= ...: ... 
    Hint: There is overlap if the segment starts before the previous segment ends, and the segment ends after the previous segment starts.

 

【中文翻譯】 

練習: 實現 is_overlapping (segment_time、existing_segments) 檢查新的時間段是否與前面的任何段重疊。您將須要執行2步驟:

  1. 建立一個 "False " 標誌, 若是發現有重疊, 您之後將設置爲 "True "。
  2. 在 previous_segments 的開始和結束時間循環。將這些時間與段的開始和結束時間進行比較。若是存在重疊, 請將 (1) 中定義的標誌設置爲 True。您可使用:
 for ....: if ... <= ... and ... >= ...: ...

提示: 若是段在上一個段結束以前開始, 而且段在上一段開始後結束, 則會有重疊。

 

【code】

# GRADED FUNCTION: is_overlapping

def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.
    
    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments
    
    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    
    segment_start, segment_end = segment_time
    
    ### START CODE HERE ### (≈ 4 line)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False
    
    # Step 2: loop over the previous_segments start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            overlap = True
    ### END CODE HERE ###

    return overlap
overlap1 = is_overlapping((950, 1430), [(2000, 2550), (260, 949)])
overlap2 = is_overlapping((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)])
print("Overlap 1 = ", overlap1)
print("Overlap 2 = ", overlap2)

【result】

Overlap 1 =  False
Overlap 2 =  True

Expected Output

Overlap 1	False
Overlap 2	True

  

Now, let's use the previous helper functions to insert a new audio clip onto the 10sec background at a random time, but making sure that any newly inserted segment doesn't overlap with the previous segments.

Exercise: Implement insert_audio_clip() to overlay an audio clip onto the background 10sec clip. You will need to carry out 4 steps:

  1. Get a random time segment of the right duration in ms.
  2. Make sure that the time segment does not overlap with any of the previous time segments. If it is overlapping, then go back to step 1 and pick a new time segment.
  3. Add the new time segment to the list of existing time segments, so as to keep track of all the segments you've inserted.
  4. Overlay the audio clip over the background using pydub. We have implemented this for you.

【中文翻譯】  

如今, 讓咱們使用之前的幫助函數, 在隨機時間將新的音頻剪輯插入到10sec 背景上, 但要確保任何新插入的段不會與前面的段重疊。

練習: 實現 insert_audio_clip () 將音頻剪輯覆蓋到背景10sec 剪輯上。您將須要執行4步驟:

  1. 在 ms 級別,獲取適當時間的隨機時間段。
  2. 請確保時間段與之前的任什麼時候間段不重疊。若是它是重疊的, 則返回步驟1並選取一個新的時間段。
  3. 將新的時間段添加到現有時間段的列表中, 以便跟蹤已插入的全部段。
  4. 使用 pydub 將音頻剪輯覆蓋在背景上。咱們已經爲您實施了此項措施。

【code】

# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments. 在隨機時間步驟中, 在背景噪音上插入新的音頻段, 以確保音頻段不與現有段重疊。
    
    Arguments:
    background -- a 10 second background audio recording.  
    audio_clip -- the audio clip to be inserted/overlaid. 
    previous_segments -- times where audio segments have already been placed  音頻段已放置的時間
    
    Returns:
    new_background -- the updated background audio
    """
    
    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)
    
    ### START CODE HERE ### 
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert 
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)
    
    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep 
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###
    
    # Step 4: Superpose audio segment and background 疊加音頻段和背景
    new_background = background.overlay(audio_clip, position = segment_time[0])
    
    return new_background, segment_time
np.random.seed(5)
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
audio_clip.export("insert_test.wav", format="wav")
print("Segment Time: ", segment_time)
IPython.display.Audio("insert_test.wav")

【result】

Segment Time:  (2254, 3169)

【注】原文是音頻,在背景聲音中插入了「activate」音頻。此處是圖片。

Expected Output

Segment Time	(2254, 3169)

 

【code】

# Expected audio
IPython.display.Audio("audio_examples/insert_reference.wav")

【result】

【注】原文是音頻,在背景聲音中插入了「activate」音頻。此處是圖片。 

  

Finally, implement code to update the labels y⟨t⟩, assuming you just inserted an "activate." In the code below, y is a (1,1375) dimensional vector, since Ty = 1375.

If the "activate" ended at time step t, then set yt+1=1 as well as for up to 49 additional consecutive values. However, make sure you don't run off the end of the array and try to update y[0][1375], since the valid indices are y[0][0] through y[0][1374] because Ty=1375. So if "activate" ends at step 1370, you would get only y[0][1371] = y[0][1372] = y[0][1373] = y[0][1374] = 1

Exercise: Implement insert_ones(). You can use a for loop. (If you are an expert in python's slice operations, feel free also to use slicing to vectorize this.) If a segment ends at segment_end_ms (using a 10000 step discretization), to convert it to the indexing for the outputs y (using a 1375 step discretization), we will use this formula:

segment_end_y = int(segment_end_ms * Ty / 10000.0)

【中文翻譯】  

最後, 實現代碼,來更新標籤 y⟨t⟩, 假設您剛剛插入了 "activate. " 在下面的代碼中, y 是一個 (1,1375) 維向量, 由於 Ty=1375。

若是 "activate " 在時間步驟 t 結束, 則設置 y⟨t+1⟩=1,以及多達49個附加的連續值也設置爲1。可是, 請確保不會從數組的末尾運行, 並嘗試更新 y [0] [1375], 由於有效索引是 y [0] [0] 經過 y [0] [1374], 由於 Ty=1375。因此, 若是  "activate " 結束在步驟 1370, 你會獲得只有 y [0] [1371] = y [0] [1372] = y [0] [1373] = y [0] [1374] = 1

練習: 實現 insert_ones()。可使用 for 循環。(若是您是 python 切片操做的專家, 也能夠隨意使用切片來向量化)。若是某個段在 segment_end_ms (使用10000步離散化) 結束, 要將其轉換爲輸出 y 的索引 (使用1375步離散化), 咱們將使用此公式:

segment_end_y = int(segment_end_ms * Ty / 10000.0)
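
For example, with the two segment end times used in the sanity check below (and Ty = 1375):

【code】

print(int(9700 * 1375 / 10000.0))   # 1333 -> y[0][1334..1374] are set to 1 (window clipped at Ty)
print(int(4251 * 1375 / 10000.0))   # 584  -> y[0][585..634] are set to 1
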

【code】

# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment 
    should be set to 1. By strictly we mean that the label of segment_end_y should be 0 while, the
    50 following labels should be ones.
    更新標籤向量 y。50輸出步驟的標籤在段結束後嚴格設置爲1。嚴格來講, 咱們的意思是 segment_end_y 的標籤應該是 0, 而
    50 個接下來的標籤應該是1。
    
    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms
    
    Returns:
    y -- updated labels
    """
    
    # duration of the background (in terms of spectrogram time-steps)
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    
    # Add 1 to the correct index in the background label (y)
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y + 1, segment_end_y + 51):
        if i < Ty:
            y[0,i] =1
    ### END CODE HERE ###
    
    return y
arr1 = insert_ones(np.zeros((1, Ty)), 9700)
plt.plot(insert_ones(arr1, 4251)[0,:])
print("sanity checks:", arr1[0][1333], arr1[0][634], arr1[0][635])

【result】

sanity checks: 0.0 1.0 0.0

Expected Output

 

  

Finally, you can use insert_audio_clip and insert_ones to create a new training example.

Exercise: Implement create_training_example(). You will need to carry out the following steps:

  1. Initialize the label vector y as a numpy array of zeros and shape (1,Ty).
  2. Initialize the set of existing segments to an empty list.
  3. Randomly select 0 to 4 "activate" audio clips, and insert them onto the 10sec clip. Also insert labels at the correct position in the label vector y.
  4. Randomly select 0 to 2 negative audio clips, and insert them into the 10sec clip.

【中文翻譯】  

最後, 您可使用 insert_audio_clip 和 insert_ones 建立一個新的訓練樣本。

練習: 實現 create_training_example ()。您將須要執行如下步驟:

  1. 將標籤向量 y 初始化爲形狀爲 (1, Ty) 的全零 numpy 數組。
  2. 將現有段集初始化爲空列表。
  3. 隨機選擇0到 4 "activate " 音頻剪輯, 並將它們插入到10sec 剪輯上。還要在標籤矢量 y 的正確位置插入標籤。
  4. 隨機選擇0到2個negative音頻剪輯, 並將其插入10sec 剪輯中。

【code】

# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.
    
    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"
    
    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    
    # Set the random seed
    np.random.seed(18)
    
    # Make background quieter
    background = background - 20

    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1, Ty))

    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###
    
    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]
    
    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
    ### END CODE HERE ###

    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background 
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###
    
    # Standardize the volume of the audio clip  標準化音頻剪輯的音量
    background = match_target_amplitude(background, -20.0)

    # Export new training example 
    file_handle = background.export("train" + ".wav", format="wav")
    print("File (train.wav) was saved in your directory.")
    
    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")
    
    return x, y
x, y = create_training_example(backgrounds[0], activates, negatives)

【result】

File (train.wav) was saved in your directory.

Expected Output

Now you can listen to the training example you created and compare it to the spectrogram generated above.

 【code】

IPython.display.Audio("train.wav")

【result】

【注】原網頁的背景音頻裏插入了兩個 activate 和一個非 activate 詞,此處是截圖。

Expected Output

 【code】

IPython.display.Audio("audio_examples/train_reference.wav")

【注】原網頁的背景音頻裏插入了兩個 activate 和一個非 activate 詞,此處是截圖。

 

Finally, you can plot the associated labels for the generated training example. 

【code】

plt.plot(y[0])

【result】

Expected Output

 

1.4 - Full training set

You've now implemented the code needed to generate a single training example. We used this process to generate a large training set. To save time, we've already generated a set of training examples.

【code】

# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")

  

1.5 - Development set

To test our model, we recorded a development set of 25 examples. While our training data is synthesized, we want to create a development set using the same distribution as the real inputs. Thus, we recorded 25 10-second audio clips of people saying "activate" and other random words, and labeled them by hand. This follows the principle described in Course 3 that we should create the dev set to be as similar as possible to the test set distribution; that's why our dev set uses real rather than synthesized audio.

【中文翻譯】 

1.5-開發集
爲了測試咱們的模型, 咱們錄製了一個包含25個樣本的開發集。雖然咱們的訓練數據是合成的, 咱們但願開發集使用與實際輸入相同的分佈。所以, 咱們錄製了 25 個 10 秒的音頻剪輯, 內容是人們說 "activate" 和其餘隨機詞, 並手動標記它們。這遵循了課程3中描述的原則, 即咱們應該建立一個與測試集分佈儘量類似的開發集;這就是爲何咱們的開發集使用真正的而不是合成的音頻。

【code】

# Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")

  

2 - Model

Now that you've built a dataset, let's write and train a trigger word detection model!

The model will use 1-D convolutional layers, GRU layers, and dense layers. Let's load the packages that will allow you to use these layers in Keras. This might take a minute to load.

【code】

from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

  

2.1 - Build the model

Here is the architecture we will use. Take some time to look over the model and see if it makes sense.

Figure 3

 

One key step of this model is the 1D convolutional step (near the bottom of Figure 3). It inputs the 5511 step spectrogram, and outputs a 1375 step output, which is then further processed by multiple layers to get the final Ty=1375 step output. This layer plays a role similar to the 2D convolutions you saw in Course 4, of extracting low-level features and then possibly generating an output of a smaller dimension.

Computationally, the 1-D conv layer also helps speed up the model because now the GRU has to process only 1375 timesteps rather than 5511 timesteps. The two GRU layers read the sequence of inputs from left to right, then ultimately uses a dense+sigmoid layer to make a prediction for y⟨t⟩. Because y is binary valued (0 or 1), we use a sigmoid output at the last layer to estimate the chance of the output being 1, corresponding to the user having just said "activate."

Note that we use a uni-directional RNN rather than a bi-directional RNN. This is really important for trigger word detection, since we want to be able to detect the trigger word almost immediately after it is said. If we used a bi-directional RNN, we would have to wait for the whole 10sec of audio to be recorded before we could tell if "activate" was said in the first second of the audio clip.

【中文翻譯】 

此模型的一個關鍵步驟是1D 卷積步驟 (靠近圖3的底部)。它輸入5511步頻譜圖, 輸出1375步輸出, 而後由多個層進一步處理以得到最終的 Ty=1375步驟輸出。此層扮演一個角色, 相似於您在課程4中看到的2D  convolutions, 即提取低級特徵, 而後可能生成較小維度的輸出。

計算上, 1-D  conv 層也有助於加快模型的速度, 由於如今 GRU 只能處理 1375 timesteps 而不是 5511 timesteps。兩個 GRU 層從左向右讀取輸入序列, 最後使用一個dense+sigmoid層對 y⟨t⟩進行預測。因爲 y 是二進制值 (0 或 1), 咱們使用在最後一層的sigmoid輸出來估計輸出爲1的概率, 對應於剛纔說 "activate" 的用戶。

請注意, 咱們使用的是單向 RNN, 而不是雙向 RNN。這對於觸發詞檢測很是重要, 由於咱們但願可以在說完後當即檢測到觸發器詞。若是咱們使用雙向 RNN, 咱們將不得不等待整個10sec 的音頻被記錄, 而後咱們才能夠說 "activate " 是否在第一秒的音頻剪輯裏有說到。
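
The 5511 → 1375 reduction comes straight from the Conv1D hyperparameters used below (kernel size 15, stride 4, no padding); you can check it with the usual "valid" convolution output-length formula:

【code】

Tx, kernel_size, stride = 5511, 15, 4
conv_output_steps = (Tx - kernel_size) // stride + 1   # floor((5511 - 15) / 4) + 1
print(conv_output_steps)                               # 1375, which is exactly Ty
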

 

Implementing the model can be done in four steps:

Step 1: CONV layer. Use Conv1D() to implement this, with 196 filters, a filter size of 15 (kernel_size=15), and stride of 4. [See documentation.]

Step 2: First GRU layer. To generate the GRU layer, use:

X = GRU(units = 128, return_sequences = True)(X)

Setting return_sequences=True ensures that all the GRU's hidden states are fed to the next layer. Remember to follow this with Dropout and BatchNorm layers.

Step 3: Second GRU layer. This is similar to the previous GRU layer (remember to use return_sequences=True), but has an extra dropout layer.

Step 4: Create a time-distributed dense layer as follows:

X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)

This creates a dense layer followed by a sigmoid, so that the parameters used for the dense layer are the same for every time step. [See documentation.]

Exercise: Implement model(), the architecture is presented in Figure 3.

【中文翻譯】 

實現模型能夠在四步驟中完成:

步驟 1: CONV 層。使用 Conv1D () 實現此目的, 使用196個過濾器, 過濾器大小爲 15 (kernel_size=15), 步長爲4。[請參閱文檔]。

步驟 2: 第一個 GRU 層。要生成 GRU 層, 請使用:

X = GRU(units = 128, return_sequences = True)(X)

設置 return_sequences = True 可確保全部 GRU 的隱藏狀態都被送入下一層。記住在這個層以後接着加Dropout和 BatchNorm 層。

步驟 3: 第二個 GRU 層。這相似於上一個 GRU 層 (記住使用 return_sequences = True), 但有一個額外的dropout層。

步驟 4: 建立一個時間分佈dense層, 以下所示:

X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)

這將建立一個dense層, 後跟一個 sigmoid, 所以用於dense層的參數在每一時間步中都是相同的。[請參閱文檔]。

練習: 實現模型 (), 架構如圖3所示。

【code】

# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.
    
    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    
    X_input = Input(shape = input_shape)
    
    ### START CODE HERE ###
    
    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(196, 15, strides=4)(X_input)             # CONV1D
    X = BatchNormalization()(X)                         # Batch normalization
    X = Activation('relu')(X)                           # ReLu activation
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(units = 128, return_sequences=True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization
    
    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(units = 128, return_sequences=True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                         # Batch normalization
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    
    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)

    ### END CODE HERE ###

    model = Model(inputs = X_input, outputs = X)
    
    return model  
model = model(input_shape = (Tx, n_freq))

 

Let's print the model summary to keep track of the shapes.

【code】

model.summary()

【result】

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 5511, 101)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1375, 196)         297136    
_________________________________________________________________
batch_normalization_1 (Batch (None, 1375, 196)         784       
_________________________________________________________________
activation_1 (Activation)    (None, 1375, 196)         0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1375, 196)         0         
_________________________________________________________________
gru_1 (GRU)                  (None, 1375, 128)         124800    
_________________________________________________________________
dropout_2 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 1375, 128)         512       
_________________________________________________________________
gru_2 (GRU)                  (None, 1375, 128)         98688     
_________________________________________________________________
dropout_3 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 1375, 128)         512       
_________________________________________________________________
dropout_4 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 1375, 1)           129       
=================================================================
Total params: 522,561
Trainable params: 521,657
Non-trainable params: 904

Expected Output

Total params	522,561
Trainable params	521,657
Non-trainable params	904

 

The output of the network is of shape (None, 1375, 1) while the input is (None, 5511, 101). The Conv1D has reduced the number of steps from 5511 (spectrogram time steps) to 1375.

  

2.2 - Fit the model

Trigger word detection takes a long time to train. To save time, we've already trained a model for about 3 hours on a GPU using the architecture you built above, and a large training set of about 4000 examples. Let's load the model.

【code】

model = load_model('./models/tr_model.h5')

 

You can train the model further, using the Adam optimizer and binary cross entropy loss, as follows. This will run quickly because we are training just for one epoch and with a small training set of 26 examples.  

【code】

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)

【result】

Epoch 1/1
26/26 [==============================] - 27s - loss: 0.0726 - acc: 0.9806    
<keras.callbacks.History at 0x7f82f4fa5a58>

 

2.3 - Test the model

Finally, let's see how your model performs on the dev set.

【code】

loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)

【result】

25/25 [==============================] - 4s
Dev set accuracy =  0.945978164673

 

This looks pretty good! However, accuracy isn't a great metric for this task, since the labels are heavily skewed to 0's, so a neural network that just outputs 0's would get slightly over 90% accuracy. We could define more useful metrics such as F1 score or Precision/Recall. But let's not bother with that here, and instead just empirically see how the model does. 

【中文翻譯】  

這看起來不錯!然而, 對這項任務來說, 準確率並不是一個很好的指標, 由於標籤嚴重偏向 0, 因此一個只輸出 0 的神經網絡也能獲得略高於 90% 的準確率。咱們能夠定義更有用的指標, 如 F1 分數或 Precision/Recall。可是在這裏咱們不去深究, 而只是從經驗上看看模型的表現。
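
If you did want a skew-robust metric, a rough F1 score over all (example, time-step) predictions could be computed as in the sketch below (the 0.5 threshold and the helper name are assumptions; model, X_dev and Y_dev come from the cells above):

【code】

import numpy as np

def f1_at_threshold(model, X, Y, threshold=0.5):
    """Rough F1 over every output time step, treating label 1 as the positive class."""
    probs = model.predict(X)                              # shape (m, Ty, 1)
    preds = (probs > threshold).astype(int).reshape(-1)   # flatten all time steps
    y_true = np.asarray(Y).astype(int).reshape(-1)
    tp = np.sum((preds == 1) & (y_true == 1))
    fp = np.sum((preds == 1) & (y_true == 0))
    fn = np.sum((preds == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return 2 * precision * recall / (precision + recall + 1e-8)

# print("Dev set F1 = ", f1_at_threshold(model, X_dev, Y_dev))
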

  

3 - Making Predictions

Now that you have built a working model for trigger word detection, let's use it to make predictions. This code snippet runs audio (saved in a wav file) through the network.

【中文翻譯】

如今, 您已經創建了一個觸發詞檢測的工做模型, 讓咱們使用它來進行預測。此代碼段經過網絡運行音頻 (保存在 wav 文件中)。  

【code】

def detect_triggerword(filename):
    plt.subplot(2, 1, 1)

    x = graph_spectrogram(filename)
    # the spectogram outputs (freqs, Tx) and we want (Tx, freqs) to input into the model
    x  = x.swapaxes(0,1)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)
    
    plt.subplot(2, 1, 2)
    plt.plot(predictions[0,:,0])
    plt.ylabel('probability')
    plt.show()
    return predictions

  

Once you've estimated the probability of having detected the word "activate" at each output step, you can trigger a "chiming" sound to play when the probability is above a certain threshold. Further, y⟨t⟩ might be near 1 for many values in a row after "activate" is said, yet we want to chime only once. So we will insert a chime sound at most once every 75 output steps. This will help prevent us from inserting two chimes for a single instance of "activate". (This plays a role similar to non-max suppression from computer vision.)

【中文翻譯】  

當您估計出在每一個輸出步驟中檢測到 "activate" 這個詞的機率時, 若是機率高於某一閾值, 就能夠觸發 "chiming" 聲音。此外, 在說完 "activate" 以後, y⟨t⟩可能會連續許多值都接近 1, 但咱們只想讓 chime 響一次。所以, 咱們將在每75個輸出步驟中最多插入一個 chime 聲音。這將有助於防止咱們爲單一的 "activate" 實例插入兩個 chime。(這與計算機視覺中的 non-max suppression 相似)。

 【code】

chime_file = "audio_examples/chime.wav"
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0 將連續輸出步驟的數量初始化爲0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y 
    for i in range(Ty):
        # Step 3: Increment consecutive output steps 遞增連續輸出步驟
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed 若是預測高於閾值, 超過75個連續的輸出步驟已經過
        if predictions[0,i,0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub  使用 pydub疊加音頻和背景
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
            # Step 6: Reset consecutive output steps to 0   將連續的輸出步驟重置爲0
            consecutive_timesteps = 0
        
    audio_clip.export("chime_output.wav", format='wav')

  

 

3.3 - Test on dev examples

Let's explore how our model performs on two unseen audio clips from the development set. Let's first listen to the two dev set clips.

【code】

IPython.display.Audio("./raw_data/dev/1.wav")

【result】

【注】原文是音頻,這裏是截圖。

 

【code】

IPython.display.Audio("./raw_data/dev/2.wav") 

【result】  

【注】原文是音頻,這裏是截圖。

 

Now let's run the model on these audio clips and see if it adds a chime after "activate"!

 【code】

filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

【result】

 

  

  【code】

filename  = "./raw_data/dev/2.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

【result】  

 

 

Congratulations

You've come to the end of this assignment!

Here's what you should remember:

  • Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
  • Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
  • An end-to-end deep learning approach can be used to build a very effective trigger word detection system.

Congratulations on finishing the final assignment!

Thank you for sticking with us through the end and for all the hard work you've put into learning deep learning. We hope you have enjoyed the course!

【中文翻譯】

祝賀
你已經完成任務了!

如下是你應該記住的:

  • 數據合成是建立一個用於語音問題的大型訓練集的有效方法, 特別是觸發詞檢測。
  • 在將音頻數據傳遞給 RNN、GRU 或 LSTM 以前, 使用頻譜圖和可選的 1D conv 層是一個常見的預處理步驟。
  • 端到端的深層學習方法能夠用來構建一個很是有效的觸發詞檢測系統。

恭喜你完成了最後一個任務!

謝謝你一路跟着咱們堅持到最後, 也感謝你爲學習深度學習所付出的全部努力。咱們但願你喜歡這門課!

 

4 - Try your own example! (OPTIONAL/UNGRADED)

In this optional and ungraded portion of this notebook, you can try your model on your own audio clips!

Record a 10 second audio clip of you saying the word "activate" and other random words, and upload it to the Coursera hub as myaudio.wav. Be sure to upload the audio as a wav file. If your audio is recorded in a different format (such as mp3) there is free software that you can find online for converting it to wav. If your audio recording is not 10 seconds, the code below will either trim or pad it as needed to make it 10 seconds.

 【中文翻譯】

4-嘗試你本身的例子!(可選/不評分)
在本筆記本的這個可選和不評分的部分中, 您能夠在本身的音頻剪輯上試用您的模型!

錄製一段 10 秒的音頻剪輯, 內容是你說 "activate" 和其餘隨機詞, 並將其做爲 myaudio.wav 上傳到 Coursera hub。請務必將音頻上載爲 wav 文件。若是您的音頻以不一樣的格式 (如 mp3) 錄製, 你能夠在網上找到將其轉換爲 wav 的免費軟件。若是您的錄音不是10秒, 下面的代碼將根據須要修剪或填充它, 使其成爲10秒。

【code】

# Preprocess the audio to the correct format
def preprocess_audio(filename):
    # Trim or pad audio segment to 10000ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export as wav
    segment.export(filename, format='wav')

  

 Once you've uploaded your audio file to Coursera, put the path to your file in the variable below.

 【code】

your_filename = "audio_examples/my_audio.wav"
preprocess_audio(your_filename)
IPython.display.Audio(your_filename) # listen to the audio you uploaded 

【result】

【注】原文是音頻,這裏是截圖。

 

Finally, use the model to predict when you say activate in the 10 second audio clip, and trigger a chime. If beeps are not being added appropriately, try to adjust the chime_threshold.  

【中文翻譯】

最後, 使用模型預測您在這段 10 秒音頻剪輯中何時說了 activate, 並觸發一個 chime。若是未正確添加 chime, 請嘗試調整 chime_threshold。

【code】

chime_threshold = 0.5
prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, chime_threshold)
IPython.display.Audio("./chime_output.wav")

【result】
