Train MFCC using Machine Learning Algorithm

Mel Frequency Cepstral Coefficients (MFCCs) are by far the most successful features used in the field of speech processing. Speech is a non-stationary signal, so standard signal processing techniques cannot be applied to it directly. Researchers in this area discovered that if the speech signal is observed through a very short window, the speech content within that window appears more or less stationary. That brought in the concept of short-time processing of speech: a short window (say 25 ms) is processed at a time. This short segment is called a frame. To process the whole speech segment, you move the window from the beginning to the end of the segment in equal steps, called the shift. The frame size and frame shift set in your code determine the number of frames, M.
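The framing scheme above can be sketched in a few lines of NumPy; the 25 ms frame and 10 ms shift below are illustrative defaults, and your own code may use different values:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames (short-time processing).

    frame_ms and shift_ms are common choices, not universal constants.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift_len = int(sample_rate * shift_ms / 1000)   # samples per shift
    num_frames = 1 + (len(signal) - frame_len) // shift_len  # the M frames
    return np.stack([signal[i * shift_len : i * shift_len + frame_len]
                     for i in range(num_frames)])    # shape (M, frame_len)

# Example: 1 second of a dummy signal at 16 kHz
sig = np.zeros(16000)
frames = frame_signal(sig, 16000)
print(frames.shape)  # 25 ms frames, 10 ms shift -> (98, 400)
```

In practice each frame would also be multiplied by a tapering window (e.g. Hamming) before any spectral analysis.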

Now, for each frame, the MFCC coefficients are computed. Visit the following site for a tutorial on MFCCs.

In general, a 39-dimensional feature vector is used, composed of the first 13 MFCCs plus their corresponding 13 delta and 13 delta-delta coefficients. I do not know which features are generated in your code, but the N you mention is the feature-vector length. Your matrix CC contains the features of all the frames in the speech segment.
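As a sketch, the deltas can be computed from the MFCC matrix with the standard regression formula d_t = Σ_{n=1..N} n·(c_{t+n} − c_{t−n}) / (2·Σ n²); the matrix `cc` below is a random stand-in for your CC:

```python
import numpy as np

def deltas(feat, N=2):
    """Delta features via the standard regression formula (N=2 is typical).

    feat has shape (num_frames, num_coeffs), e.g. an M x 13 MFCC matrix.
    Edge frames are handled by repeating the first/last frame.
    """
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(feat.shape[0])
    ])

# 13 MFCCs -> 39-dimensional vectors: [c, delta, delta-delta]
cc = np.random.randn(100, 13)        # random stand-in for the matrix CC
d = deltas(cc)
dd = deltas(d)
feat39 = np.hstack([cc, d, dd])
print(feat39.shape)  # (100, 39)
```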

Depending on what you want to learn, you can use the feature vectors in different ways. You can learn the temporal changes in the speech segment to mark landmarks; an HMM is a very good machine learning tool for learning temporal relationships in data. You can learn the distribution of the vectors in speech segments and use it for classification of test speech; for this you can use GMM modelling. You can also use an SVM for the classification purpose. Hope that helps.
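As a rough illustration of the distribution-modelling idea, here is a minimal classifier that fits one diagonal-covariance Gaussian per class — a single-component simplification of a GMM (a full GMM would mix several such components, e.g. with scikit-learn's GaussianMixture). The speaker names and random "MFCC" data are purely hypothetical:

```python
import numpy as np

class DiagGaussianModel:
    """One diagonal-covariance Gaussian per class: a minimal stand-in
    for the GMM modelling described above."""
    def fit(self, feats):
        self.mean = feats.mean(axis=0)
        self.var = feats.var(axis=0) + 1e-6  # floor to avoid divide-by-zero
        return self
    def log_likelihood(self, feats):
        # Sum of per-frame diagonal-Gaussian log-densities
        ll = -0.5 * (np.log(2 * np.pi * self.var)
                     + (feats - self.mean) ** 2 / self.var)
        return ll.sum()

def classify(segment_feats, models):
    """Pick the class whose model scores the test segment highest."""
    return max(models, key=lambda c: models[c].log_likelihood(segment_feats))

# Hypothetical two-class example with random feature matrices
rng = np.random.default_rng(0)
models = {
    "speaker_a": DiagGaussianModel().fit(rng.normal(0.0, 1.0, (500, 13))),
    "speaker_b": DiagGaussianModel().fit(rng.normal(3.0, 1.0, (500, 13))),
}
test = rng.normal(3.0, 1.0, (80, 13))
print(classify(test, models))  # -> speaker_b
```

The same train-a-model-per-class, score-at-test-time pattern carries over directly to real GMMs or HMMs.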

One approach is to consider a window of frames. Each frame is a vector of MFCCs (you can include the deltas too). You select a frame and form a window of n frames before and after it, giving a window of 2*n+1 frames. This forms a single input sample. You can then use mini-batches to select subsets of these samples from your dataset for training.
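A sketch of this windowing, assuming the frames are rows of a NumPy matrix (edge frames are skipped here for simplicity; padding is another common choice):

```python
import numpy as np

def context_windows(frames, n=4):
    """Stack each frame with its n left and n right neighbours, giving
    one flattened (2*n+1)*dim input vector per interior frame."""
    T, dim = frames.shape
    return np.stack([frames[t - n : t + n + 1].reshape(-1)
                     for t in range(n, T - n)])

# E.g. 100 frames of 13 MFCCs with n = 4 -> windows of 9 frames
mfcc = np.random.randn(100, 13)
X = context_windows(mfcc, n=4)
print(X.shape)  # (92, 117): 100 - 2*4 samples, 9 * 13 = 117 features each
```

Each row of `X` is then one training sample, and a mini-batch is simply a random subset of those rows.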

MFCC stands for Mel frequency cepstral coefficients. As you can see, there are four words in the abbreviation: Mel, frequency, cepstral and coefficients. The idea of MFCC is to convert audio from the time domain into the frequency domain so that we can access the information present in speech signals. But simply converting the time-domain signal into the frequency domain is not optimal on its own; we can do more than that. The cochlea in our ear effectively has many filters at low frequencies and very few at high frequencies, and this can be mimicked using Mel filters. So the idea of MFCC is to convert time-domain signals into frequency-domain representations while mimicking the cochlea's behaviour with Mel filters.
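A minimal sketch of such a Mel filterbank, using the common mel formula 2595·log10(1 + f/700); the filter count, FFT size and sample rate below are typical choices, not requirements:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters=26, nfft=512, sample_rate=16000):
    """Triangular filters spaced evenly on the mel scale: dense at low
    frequencies, sparse at high ones, mimicking the cochlea."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                          num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(num_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):            # rising slope
            fbank[i, k] = (k - left) / (centre - left)
        for k in range(centre, right):           # falling slope
            fbank[i, k] = (right - k) / (right - centre)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (26, 257): one row per filter over the FFT bins
```

Applying these filters to each frame's power spectrum, taking logs, and then a DCT yields the cepstral coefficients themselves.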