In this tutorial, we show how to implement a music genre classifier from scratch in TensorFlow/Keras using features calculated by the Librosa library.
We will use the most popular publicly available dataset for music genre classification: GTZAN. This dataset contains a range of recordings reflecting different circumstances; the files were gathered between 2000 and 2001 from a number of sources, including personal CDs, radio, and microphone recordings. Even though it is more than two decades old, it is still considered the go-to dataset for machine learning applications in music genre classification.
The dataset contains 10 classes, each with 100 different 30-second audio files. The classes are: blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock.
In this tutorial, we will only use 3 genres (reggae, rock and classical) for simplification purposes, but the same principles remain valid for a larger number of genres.
Let's start by downloading and extracting the dataset files.
Preparing the Dataset
We will first download the files from Google Drive using the gdrive package, and then unarchive each of the files using unrar. You can do this from the terminal or with line magic in a notebook cell.
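As a sketch, assuming one .rar archive per genre (the archive names below are placeholders for whatever files you downloaded), the line-magic version could look like this:
# In a notebook cell, line magic runs the extraction directly.
# Replace the placeholder names with your downloaded archives.
!unrar x reggae.rar
!unrar x rock.rar
!unrar x classical.rar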
Importing Python libraries
Now we'll import the needed libraries: TensorFlow for model training, evaluation and prediction, Librosa for all the audio-related manipulations including feature generation, NumPy for numerical handling and Matplotlib for plotting the features as images.
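As a minimal sketch, the imports could look like this (we use Matplotlib's pyplot interface and Librosa's display module for the plots):
import tensorflow as tf
from tensorflow import keras

import librosa
import librosa.display

import numpy
import matplotlib.pyplot as plt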
Audio features
Each music genre is characterized by its own traits: pitch, melody, chord progressions and instrumentation. To achieve a reliable classification, we will use a set of features that capture the essence of these elements, giving the model a better chance of learning to distinguish between genres.
In this tutorial, we'll construct four features that will be combined into a single feature vector for each file, on which our model will be trained.
These features are:
- Mel-frequency cepstral coefficients (MFCC)
- Mel spectrogram
- Chroma vector
- Tonal centroid features (Tonnetz)
Mel Frequency Cepstral Coefficients (MFCC)
MFCCs, or Mel-frequency cepstral coefficients, are cepstral coefficients calculated by applying a discrete cosine transform to the power spectrum of a signal, with the frequency bands of this spectrum spaced logarithmically according to the Mel scale.
Next, we'll plot the MFCC image of an example file, using Matplotlib:
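A sketch of this step, assuming the example file path below points to one of the extracted GTZAN files (any of them will do):
# Load an example audio file and compute its MFCCs with Librosa's defaults (20 coefficients).
y, sr = librosa.load("genres/reggae/reggae.00000.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr)

# Display the MFCC matrix as an image over time.
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfcc, sr=sr, x_axis="time")
plt.colorbar()
plt.title("MFCC")
plt.tight_layout()
plt.show()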
Mel Spectrogram
The Mel spectrogram is a standard spectrogram mapped onto the Mel scale, a perceptual scale of pitches that listeners perceive to be equally spaced from one another. The conversion from a frequency in hertz to the Mel scale is done using the following formula:
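$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
where f is the frequency in hertz and m the corresponding Mel value.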
Next, we plot the Mel spectrogram for the same audio file:
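A sketch for the same file, converting the power spectrogram to decibels before displaying it:
# Compute the Mel spectrogram (128 Mel bands by default) and convert it to decibels.
mel = librosa.feature.melspectrogram(y=y, sr=sr)
mel_db = librosa.power_to_db(mel, ref=numpy.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.tight_layout()
plt.show()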
Chroma Vector
The Chroma feature vector is constructed by projecting the full spectrum onto 12 bins that represent the 12 distinct semitones (or chroma) of the musical octave: C, C#, D, D#, E, F, F#, G, G#, A, A#, B. This projection gives an intriguing and powerful representation of music audio that depends strongly on the music genre.
Since notes that are exactly one octave apart are perceived as particularly similar in music, knowing the distribution of chroma, even without knowing the absolute frequency (i.e., the original octave), can provide useful musical information about the audio and may even reveal perceived musical similarities within the same genre that are not visible in the original spectra.
The Chroma vector for the same audio sample is then plotted next:
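A sketch for the same file, using Librosa's STFT-based chroma:
# Project the spectrum onto the 12 semitone bins.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

plt.figure(figsize=(10, 4))
librosa.display.specshow(chroma, sr=sr, x_axis="time", y_axis="chroma")
plt.colorbar()
plt.title("Chroma")
plt.tight_layout()
plt.show()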
Tonal Centroid Features (Tonnetz)
This representation is calculated by projecting Chroma features onto a 6-dimensional basis representing the perfect fifth, minor third, and major third each as two-dimensional coordinates.
For the same Chroma vector we have the following Tonnetz feature:
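A sketch for the same file:
# Project the chroma features onto the 6 tonal centroid dimensions.
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)

plt.figure(figsize=(10, 4))
librosa.display.specshow(tonnetz, sr=sr, x_axis="time", y_axis="tonnetz")
plt.colorbar()
plt.title("Tonnetz")
plt.tight_layout()
plt.show()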
Putting the Features together
After creating the four functions for generating the features, we implement the function get_feature, which extracts the envelope (min and max) and the mean of each feature along the time axis. This way, we obtain a feature vector of constant size regardless of the length of the audio. Finally, we concatenate the four features, using NumPy, into a single array of 498 floats.
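A minimal sketch of what get_feature could look like under these assumptions, using Librosa's default band counts (20 MFCCs, 128 Mel bands, 12 chroma bins and 6 tonal centroids), which gives (20 + 128 + 12 + 6) x 3 = 498 values:
def get_feature(file_path):
    # Load the audio and compute the four feature matrices (shape: bands x frames).
    y, sr = librosa.load(file_path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)

    def envelope_and_mean(feature):
        # Min, max and mean along the time axis give a fixed-size summary per band.
        return numpy.concatenate((feature.min(axis=1),
                                  feature.max(axis=1),
                                  feature.mean(axis=1)))

    # Concatenate the four summaries into a single 498-float vector.
    return numpy.concatenate((envelope_and_mean(mfcc),
                              envelope_and_mean(mel),
                              envelope_and_mean(chroma),
                              envelope_and_mean(tonnetz)))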
Calculating features for the full Dataset
Now we'll use three genres to train our model on: reggae, classical and rock. To capture finer nuances, such as distinguishing between genres with many similarities (rock and metal, for example), more features would need to be included.
We'll loop through each file of these three genres and, for each one, construct the feature array and store it along with the corresponding label.
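A sketch of this loop, assuming the files were extracted as WAV files into genres/<genre>/ folders (the layout is an assumption about how the archives were unpacked):
import os

genres = ["reggae", "classical", "rock"]
features = []
labels = []

for label, genre in enumerate(genres):
    genre_dir = os.path.join("genres", genre)
    for file_name in sorted(os.listdir(genre_dir)):
        if file_name.endswith(".wav"):
            features.append(get_feature(os.path.join(genre_dir, file_name)))
            labels.append(label)

features = numpy.array(features)
labels = numpy.array(labels)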
Splitting the Dataset into training, validation and testing parts
After creating the feature and label arrays, we use NumPy to shuffle the records, then split the dataset into training, validation and testing parts: 60%, 20% and 20% respectively.
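A sketch of the shuffle and split, with variable names chosen to match the evaluation snippet further down:
# Shuffle features and labels with the same permutation.
permutation = numpy.random.permutation(len(features))
features = features[permutation]
labels = labels[permutation]

# 60% training, 20% validation, 20% testing.
n_train = int(0.6 * len(features))
n_val = int(0.2 * len(features))

features_train, labels_train = features[:n_train], labels[:n_train]
features_val, labels_val = features[n_train:n_train + n_val], labels[n_train:n_train + n_val]
features_test, labels_test = features[n_train + n_val:], labels[n_train + n_val:]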
Training the model
For this model, we will use Keras to implement two regular densely connected neural network layers with the rectified linear unit ("relu") activation function, with 300 hidden units for the first layer and 200 for the second. For the output layer, we will also use a densely connected layer, this time with the "softmax" activation function, which outputs a probability distribution over the genres. We'll then train the model for 64 epochs:
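A sketch of this architecture; the optimizer and loss below are assumptions (Adam with sparse categorical cross-entropy, since the labels are integer indices):
model = keras.Sequential([
    keras.layers.Dense(300, activation="relu", input_shape=(498,)),
    keras.layers.Dense(200, activation="relu"),
    keras.layers.Dense(len(genres), activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(features_train, labels_train,
          validation_data=(features_val, labels_val),
          epochs=64)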
Model evaluation
Then, we evaluate the model:
score = model.evaluate(x=features_test.tolist(),y=labels_test.tolist(), verbose=0)
print('Accuracy : ' + str(score[1]*100) + '%')
Using this simple model with this set of features, we can achieve an accuracy of around 86%.
Accuracy : 86.33333134651184%
Classification of a YouTube video
Next, we'll use the youtube-dl library to download a video from YouTube, which we will later classify.
pip install youtube-dl
We download the video and save its audio as a WAV file. In this example, we use Bob Marley's "Is This Love" video clip.
youtube-dl -x --audio-format wav --output "audio_sample_full.wav" https://www.youtube.com/watch?v=69RdQFDuYPI
After that, we install the pydub library, which we'll use to crop the WAV file.
pip install pydub
We crop the WAV file to a 30-second section, from 01:00 to 01:30, then save the resulting file.
from pydub import AudioSegment
t1 = 60000 #Works in milliseconds
t2 = 90000
waveFile = AudioSegment.from_file("audio_sample_full.wav")
waveFile = waveFile[t1:t2]
waveFile.export('audio_sample_30s.wav', format="wav")
Then, we use our previously trained model to classify the music genre of this audio.
file_path = "audio_sample_30s.wav"
feature = get_feature(file_path)
y = model.predict(feature.reshape(1,498))
ind = numpy.argmax(y)
genres[ind]
Finally, we get the expected result:
Predicted genre : reggae
Conclusion
Audio processing machine learning projects are among the least common in the artificial intelligence literature. In this article, we gave a brief introduction to music genre classification modeling techniques, which can be useful in modern applications such as streaming sites. We introduced a set of audio features generated by Librosa, then used TensorFlow/Keras to create and train a model. Finally, we downloaded a YouTube video and classified its audio using the trained model.