Classifying and Visualizing Pitch with the K-means Clustering Algorithm

Date: 2019-11-13

Translated by ggspeed for 伯樂在線, proofread by 耶魯怕冷. Do not reproduce without permission.
English source: Jared Polivka.

Using K-means Clustering to Cluster and Visualize Pitch

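Galvanize's data science curriculum covers a range of machine learning topics that are popular among data scientists in the tech industry, but the skills students gain at Galvanize are not limited to the most popular industry applications. For example, audio signal and music analysis is rarely discussed in Galvanize's data science immersive program, yet it is an interesting application of machine learning concepts. Drawing on topics from the Galvanize curriculum, this tutorial shows how to use K-means clustering to classify and visualize pitch from a recording, using the following Python packages: NumPy/SciPy, Scikit-learn, and Plotly.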

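What Is K-means Clustering?

k-means clustering is a common technique for grouping related items in an unlabeled dataset. Given a value of k, the algorithm assigns each data point to the cluster whose center is nearest to it, which partitions the whole dataset into k groups. k-means has a wide range of applications, such as identifying effective locations for cell towers or choosing clothing sizes for manufacturers. This tutorial shows how to apply k-means to classify audio by pitch.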
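The assign-then-update idea described above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the implementation used later in this tutorial; the function name `kmeans_sketch`, the fixed iteration count, and the sample points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iterations=20, seed=0):
    """Toy k-means: alternate between assigning points and moving centers."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iterations):
        # distance from every point to every center, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # assignment step: each point joins its nearest center
        labels = distances.argmin(axis=1)
        # update step: each center moves to the mean of its points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return labels, centers

# two well-separated blobs of 2-D points (made-up data for illustration)
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centers = kmeans_sketch(points, k=2)
print(labels)  # the label integers are arbitrary; only the grouping matters
```

A library implementation adds better initialization, multiple restarts, and a convergence check, which is why the rest of this tutorial relies on Scikit-learn.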

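A Brief Primer on Pitch

A musical note is a superposition of sine waves of different frequencies, and identifying the pitch of a note requires identifying the frequencies of the sine waves that sound most prominent.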

The simplest musical note contains only one sine wave:

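Plotting the power spectrum, that is, the magnitude of each component frequency, shows that the waveform above contains a single frequency.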
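As a rough illustration of this point, the sketch below synthesizes a pure sine wave and checks where its spectrum peaks. The sample rate, duration, and 220 Hz tone are arbitrary values chosen for the example, not values taken from the recording used in this tutorial.

```python
import numpy as np

sample_rate = 8000          # samples per second (assumed for this sketch)
duration = 1.0              # seconds
tone_frequency = 220.0      # Hz (an arbitrary example tone)

# time stamp of each sample, then the sine wave itself
t = np.arange(int(sample_rate * duration)) / sample_rate
wave = np.sin(2 * np.pi * tone_frequency * t)

# magnitude of each frequency component, and the frequency each one corresponds to
magnitudes = np.abs(np.fft.rfft(wave))
frequencies = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)

# the largest magnitude should sit at the frequency of the tone
peak_frequency = frequencies[np.argmax(magnitudes)]
print(peak_frequency)
```

Almost all of the energy lands in the single bin at the tone's frequency, which is what "a single frequency" means in the spectrum.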

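The sounds produced by mainstream instruments are made up of many sine-wave components, so they sound more complex than the pure sine wave shown above. The same note (E3) played on a guitar has a waveform that looks like this: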

Its power spectrum shows a much larger collection of underlying frequencies:

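k-means can use the power spectra of sample audio segments to classify the segments by pitch. Given a collection of power spectra measured at n different frequencies, k-means groups the sample spectra so that, in n-dimensional space, the Euclidean distance from each spectrum to the center of its group is as small as possible.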

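Creating a Dataset from a Recording with NumPy/SciPy

This tutorial uses a short sample recording containing 3 different pitches, each played on a guitar for 2 seconds.

SciPy's wavfile module makes it easy to convert a .wav file into a NumPy array.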

Python

import scipy.io.wavfile as wav

filename = 'Guitar - Major Chord - E Gsharp B.wav'

# wav.read returns the sample_rate and a numpy array containing each audio sample from the .wav file
sample_rate, recording = wav.read(filename)
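If the guitar recording is not at hand, a synthetic stand-in can be written and read back to see what `wav.read` returns. This is only a sketch: the 8000 Hz sample rate, the rounded tone frequencies, and the file name `synthetic_tones.wav` are assumptions made for the example, and pure sine tones are much simpler than a real guitar.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile as wav

sample_rate = 8000                       # Hz, assumed for this sketch
tone_frequencies = [82.0, 104.0, 124.0]  # Hz, rounded stand-ins for E, G#, B
seconds_per_tone = 2

# build three 2-second sine tones back to back
t = np.arange(sample_rate * seconds_per_tone) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * f * t) for f in tone_frequencies])

# scale to 16-bit integers, a common .wav sample format
samples = (signal * 32767 * 0.5).astype(np.int16)

path = os.path.join(tempfile.mkdtemp(), 'synthetic_tones.wav')
wav.write(path, sample_rate, samples)

read_rate, recording = wav.read(path)
# a stereo file would come back with shape (n_samples, 2); this one is mono
print(read_rate, recording.shape, recording.dtype)
```

The code in this tutorial treats the recording as a one-dimensional array, so a stereo file would first need to be reduced to a single channel.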

The recording should be split into short segments so that the pitch of each segment can be classified independently.

Python

def split_recording(recording, segment_length, sample_rate):
    # number of audio samples in each segment; slice indices must be integers
    samples_per_segment = int(segment_length * sample_rate)
    segments = []
    index = 0
    while index < len(recording):
        segment = recording[index:index + samples_per_segment]
        segments.append(segment)
        index += samples_per_segment
    return segments

segment_length = .5 # length in seconds
segments = split_recording(recording, segment_length, sample_rate)
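One caveat: when the recording length is not an exact multiple of the segment length, the loop above leaves a shorter final segment, and spectra of unequal length cannot be stacked into a single array. A possible variant, sketched here under the assumption of a mono (one-dimensional) recording, simply drops the leftover samples. The name `split_into_full_segments` is hypothetical.

```python
import numpy as np

def split_into_full_segments(recording, segment_length, sample_rate):
    """Split a 1-D recording into equal-length segments, dropping any leftover samples."""
    samples_per_segment = int(segment_length * sample_rate)
    number_of_segments = len(recording) // samples_per_segment
    usable = recording[:number_of_segments * samples_per_segment]
    # each row of the result is one segment
    return usable.reshape(number_of_segments, samples_per_segment)

# 10 fake samples at 4 samples/second, split into 0.5-second (2-sample) segments
example = np.arange(10)
print(split_into_full_segments(example, 0.5, 4).shape)   # (5, 2)
# with 0.75-second (3-sample) segments the last sample is dropped
print(split_into_full_segments(example, 0.75, 4).shape)  # (3, 3)
```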

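The power spectrum of each segment can be obtained with the Fourier transform, which converts waveform data from the time domain to the frequency domain. The code below shows how to use NumPy's Fourier transform module (np.fft).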

Python

import numpy as np

def calculate_normalized_power_spectrum(recording, sample_rate):
    # np.fft.fft returns the discrete fourier transform of the recording
    fft = np.fft.fft(recording)
    number_of_samples = len(recording)
    # sample_length is the length of each sample in seconds
    sample_length = 1./sample_rate
    # fftfreq is a convenience function which returns the list of frequencies measured by the fft
    frequencies = np.fft.fftfreq(number_of_samples, sample_length)
    positive_frequency_indices = np.where(frequencies>0)
    # positive frequencies returned by the fft
    frequencies = frequencies[positive_frequency_indices]
    # magnitudes of each positive frequency in the recording
    magnitudes = abs(fft[positive_frequency_indices])
    # some segments are louder than others, so normalize each segment
    magnitudes = magnitudes / np.linalg.norm(magnitudes)
    return frequencies, magnitudes

A few helper functions create an empty NumPy array and fill it with our sample power spectra.

Python

def create_power_spectra_array(segment_length, sample_rate):
    number_of_samples_per_segment = int(segment_length * sample_rate)
    time_per_sample = 1./sample_rate
    frequencies = np.fft.fftfreq(number_of_samples_per_segment, time_per_sample)
    positive_frequencies = frequencies[frequencies>0]
    power_spectra_array = np.empty((0, len(positive_frequencies)))
    return power_spectra_array

def fill_power_spectra_array(splits, power_spectra_array, fs):
    filled_array = power_spectra_array
    for segment in splits:
        freqs, mags = calculate_normalized_power_spectrum(segment, fs)
        filled_array = np.vstack((filled_array, mags))
    return filled_array

power_spectra_array = create_power_spectra_array(segment_length,sample_rate)
power_spectra_array = fill_power_spectra_array(segments, power_spectra_array, sample_rate)

"power_spectra_array" is our training dataset. It contains one power spectrum for each 0.5-second segment of the recording.

Running k-means with Scikit-learn

Scikit-learn has an easy-to-use implementation of k-means. Our audio sample contains 3 different pitches, so we set k to 3.

Python

from sklearn.cluster import KMeans

# k = 3 clusters; n_init restarts the algorithm from 100 different initializations
kmeans = KMeans(n_clusters=3, max_iter=1000, n_init=100)
kmeans.fit(power_spectra_array)
predictions = kmeans.predict(power_spectra_array)

"predictions" is a NumPy array containing the cluster label (an arbitrary integer) of each of the 12 audio segments.

Python

print(predictions)
# => [2 2 2 2 0 0 0 0 1 1 1 1]

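This array shows that consecutive audio segments were correctly grouped together, matching what you hear when you listen to the recording.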
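The same pipeline can be tried end to end without the guitar file. The sketch below is a simplified stand-in rather than the tutorial's exact code: it assumes an 8000 Hz sample rate, uses pure sine tones at rounded frequencies in place of the guitar notes, and uses `np.fft.rfft` for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

sample_rate = 8000                       # Hz, assumed
segment_samples = sample_rate // 2       # 0.5-second segments
tone_frequencies = [82.0, 104.0, 124.0]  # Hz, rounded stand-ins for E, G#, B

# three 2-second sine tones played one after another
t = np.arange(2 * sample_rate) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * f * t) for f in tone_frequencies])

# one normalized magnitude spectrum per 0.5-second segment (one row per segment)
segments = signal.reshape(-1, segment_samples)
spectra = np.abs(np.fft.rfft(segments, axis=1))
spectra = spectra / np.linalg.norm(spectra, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(spectra)
print(labels)  # 12 labels; the integers themselves are arbitrary
```

Each run of four consecutive segments should share one label, mirroring the grouping seen for the real recording. A real instrument has overtones and noise, so its clusters are less clean than in this toy case.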

Visualizing the Results with Plotly

To better understand the predictions, we plot the power spectrum of each sample, colored according to its k-means cluster assignment.

Python

import numpy as np
import plotly.graph_objects as go

# find x-values for plot (positive frequencies, matching the columns of power_spectra_array)
number_of_samples = int(segment_length*sample_rate)
sample_length = 1./sample_rate
frequencies = np.fft.fftfreq(number_of_samples, sample_length)
frequencies = frequencies[frequencies>0]

# create plot
traces = []
for pitch_id, color in enumerate(['red','blue','green']):
    for power_spectrum in power_spectra_array[predictions == pitch_id]:
        trace = go.Scatter(x=frequencies[0:500],
                           y=power_spectrum[0:500],
                           mode='lines',
                           showlegend=False,
                           line=dict(shape='linear',
                                     color=color,
                                     width=1))
        traces.append(trace)

layout = go.Layout(xaxis=dict(title=dict(text='Frequency (Hz)')),
                   yaxis=dict(title=dict(text='Amplitude (normalized)')),
                   title=dict(text='Power Spectra of Sample Audio Segments'))

fig = go.Figure(data=traces, layout=layout)

# fig.show() renders the figure inline when run in a Jupyter notebook
fig.show()

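Each thin colored line in the plot below represents the power spectrum of one of the 12 audio segments in the sample .wav file. The color of each line indicates the pitch cluster predicted by k-means. The peaks of the blue, green, and red spectra are at 82.41 Hz (E), 103.83 Hz (G#), and 123.47 Hz (B) respectively, which are the notes in the audio sample. The strongest frequencies in the sample are the low ones, so only the lowest 500 frequencies measured by the FFT (fast Fourier transform) are included in the plot.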

Plotting the amplitudes of the 2 strongest overtones shared by the 3 sampled pitches makes this natural clustering readily apparent.
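Peak frequencies like the ones quoted above can also be read off programmatically instead of from a plot. The helper below is a hypothetical sketch, demonstrated on made-up numbers. With a fitted Scikit-learn model, the `cluster_centers_` attribute and the matching array of positive frequencies would be passed in instead.

```python
import numpy as np

def peak_frequency_per_cluster(cluster_centers, frequencies):
    """Return, for each cluster center, the frequency with the largest magnitude."""
    return frequencies[np.argmax(cluster_centers, axis=1)]

# made-up example: 3 cluster centers measured over 5 frequency bins
frequencies = np.array([80.0, 82.0, 104.0, 124.0, 126.0])
cluster_centers = np.array([[0.1, 0.9, 0.2, 0.1, 0.0],
                            [0.0, 0.1, 0.8, 0.3, 0.1],
                            [0.1, 0.2, 0.1, 0.9, 0.2]])
peaks = peak_frequency_per_cluster(cluster_centers, frequencies)
print(peaks)
```

For a real instrument the strongest bin is not guaranteed to be the fundamental, since an overtone can be louder, so the result is best treated as a hint rather than a definitive pitch estimate.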