This article is also available as a Jupyter Notebook that can be run from the top down. The code snippets can be run in any environment.

Below are the versions of fastai, fastcore, wwf, fastaudio, and torchaudio in use at the time of writing:

  • fastai: 2.1.5
  • fastcore: 1.3.4
  • wwf: 0.0.7
  • fastaudio: 0.1.3
  • torchaudio: 0.7.2

(Largely based on rbracco's tutorial, big thanks to him for his work on getting this going for us!)

fastai's audio module has been in development for a while by active forum members.

What makes Audio different?

While it is possible to train on raw audio (we simply pass in a 1D tensor of the signal), the usual approach today is to first convert the audio into a spectrogram, an image-like representation of how the frequency content of the signal changes over time, and train on that.
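As a rough illustration of what a spectrogram is, the sketch below computes one by hand using only the standard library: it slices a signal into overlapping windows and takes the magnitude of a discrete Fourier transform of each window. The function name `stft_magnitudes` and all parameter values are assumptions chosen for illustration; fastaudio and torchaudio rely on much faster and more configurable implementations.

```python
import cmath
import math

def stft_magnitudes(signal, n_fft=64, hop=32):
    """Hypothetical helper: return one list of n_fft // 2 + 1 magnitudes per frame."""
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        # Apply a Hann window to reduce leakage at the frame edges
        frame = [
            signal[start + n] * 0.5 * (1 - math.cos(2 * math.pi * n / n_fft))
            for n in range(n_fft)
        ]
        # Naive DFT, keeping only the non-negative frequency bins
        mags = []
        for k in range(n_fft // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames

sample_rate = 8000   # samples per second (illustrative value)
tone_hz = 1000       # a pure tone to analyse
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = stft_magnitudes(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), len(spec[0]), peak_bin, peak_hz)
```

Each entry of `spec` is one time step and each value within it is one frequency bin. The bins are 8000 / 64 = 125 Hz apart, so the 1000 Hz tone lands in bin 8.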

Free Digit Dataset

Essentially the audio version of MNIST, it contains 2,000 recordings from 10 speakers saying each digit 5 times. First, we'll grab the data and use a custom extract function:

from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *

As the name suggests, tar_extract_at_filename extracts the archive into a folder named after the archive file.

path_dig = untar_data(URLs.SPEAKERS10, extract_func=tar_extract_at_filename)
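To make the extraction step concrete, here is a minimal standard-library sketch of the idea: unpack an archive into a folder named after the archive itself. The helper name `extract_into_named_folder` is hypothetical, and this is not fastaudio's actual implementation.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_into_named_folder(archive, dest):
    """Hypothetical helper: unpack `archive` into `dest/<archive name without suffixes>`."""
    archive, dest = Path(archive), Path(dest)
    # "demo.tar.gz" -> "demo"
    target = dest / archive.name.split(".")[0]
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(target)
    return target

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a tiny archive containing one placeholder file
    clip = tmp / "clip.wav"
    clip.write_bytes(b"not real audio")
    archive = tmp / "demo.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(clip, arcname="clip.wav")

    out = extract_into_named_folder(archive, tmp / "data")
    out_name = out.name
    extracted = sorted(p.name for p in out.iterdir())

print(out_name, extracted)
```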

Now we want to grab just the audio files.

audio_extensions[:5]
('.aif', '.aifc', '.aiff', '.au', '.m3u')
fnames = get_files(path_dig, extensions=audio_extensions)
fnames[:5]
(#5) [Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0004_us_f0004_00268.wav'),
      Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0004_us_f0004_00111.wav'),
      Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/m0003_us_m0003_00309.wav'),
      Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0003_us_f0003_00255.wav'),
      Path('/root/.fastai/data/ST-AEDS-20180100_1-OS/f0002_us_f0002_00334.wav')]
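Filtering by extension is simple to picture. The sketch below does it with pathlib on a throwaway directory; `find_by_extension` is a hypothetical stand-in for illustration and does not reproduce everything get_files does.

```python
import tempfile
from pathlib import Path

def find_by_extension(root, extensions):
    """Hypothetical helper: recursively list files under `root` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    ]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "speaker_a").mkdir()
    # Two audio files (one with an upper-case suffix) and one text file
    for name in ["speaker_a/one.wav", "speaker_a/notes.txt", "two.WAV"]:
        (root / name).write_bytes(b"")

    found = sorted(p.name for p in find_by_extension(root, (".wav", ".aiff")))

print(found)
```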

We can convert any audio file to a tensor with AudioTensor. Let's try opening a file:

at = AudioTensor.create(fnames[0])
at, at.shape
(AudioTensor([[0.0000, 0.0000, 0.0000,  ..., 0.0002, 0.0002, 0.0003]]),
 torch.Size([1, 75520]))

Preparing the dataset

fastaudio has an AudioConfig class which allows us to prepare different settings for our dataset. Currently it has:

  • BasicMelSpectrogram
  • BasicMFCC
  • BasicSpectrogram
  • Voice

We'll be using the Voice configuration today, as this dataset contains only human voices.

cfg = AudioConfig.Voice()

Our configuration sets options such as the maximum frequency and the sampling rate:

cfg.f_max, cfg.sample_rate
(8000.0, 16000)
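These two values are related: a signal sampled at a given rate can only represent frequencies up to half that rate (the Nyquist frequency), which is why an f_max of 8000 Hz pairs with a sample rate of 16000 Hz. The short calculation below checks this and also estimates the length of the clip opened earlier, assuming that file was recorded at 16 kHz (the source does not state the file's own sample rate).

```python
sample_rate = 16000   # cfg.sample_rate from the output above
f_max = 8000.0        # cfg.f_max from the output above
n_samples = 75520     # length of the AudioTensor opened earlier

# Highest frequency representable at this sample rate
nyquist = sample_rate / 2

# Clip length in seconds, ASSUMING the file itself is sampled at 16 kHz
duration_s = n_samples / sample_rate

print(nyquist, duration_s)
```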

We can then make a transform from this configuration to turn raw audio into a workable spectrogram per our settings:

aud2spec = AudioToSpec.from_cfg(cfg)

For our example, we'll crop the original audio to 1000 ms:

crop1s = ResizeSignal(1000)
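Resizing to a fixed duration comes down to converting milliseconds into a sample count and then cropping or padding to that length. The sketch below shows the idea on plain Python lists. `resize_signal` is a hypothetical helper, and it always crops from the start and pads with zeros at the end; the real ResizeSignal transform may choose the crop window and padding differently.

```python
def resize_signal(samples, duration_ms, sample_rate):
    """Hypothetical helper: crop or zero-pad `samples` to last `duration_ms`."""
    target = int(duration_ms / 1000 * sample_rate)
    if len(samples) >= target:
        # Too long: keep only the first `target` samples (a simplification)
        return samples[:target]
    # Too short: append silence until the target length is reached
    return samples + [0.0] * (target - len(samples))

long_clip = [0.1] * 75520   # same length as the clip opened earlier
short_clip = [0.1] * 4000   # a quarter of a second at 16 kHz

cropped = resize_signal(long_clip, 1000, 16000)
padded = resize_signal(short_clip, 1000, 16000)
print(len(cropped), len(padded))
```

Either way, every item ends up with the same number of samples, which is what allows items to be stacked into a batch.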

Let's build a Pipeline that mirrors how we expect our data to come in:

pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
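Conceptually, a pipeline applies each step to the output of the previous one. The toy sketch below captures just that part with plain functions; `compose`, `load`, `crop`, and `describe` are hypothetical names, and fastcore's Pipeline does considerably more than this.

```python
from functools import reduce

def compose(*funcs):
    """Return a function that applies `funcs` left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

def load(name):
    # Stand-in for opening an audio file: returns fake samples
    return [0.5] * 20000

def crop(samples):
    # Stand-in for a 1-second crop at 16 kHz
    return samples[:16000]

def describe(samples):
    # Stand-in for the final transform
    return {"n_samples": len(samples), "peak": max(samples)}

toy_pipe = compose(load, crop, describe)
result = toy_pipe("fake.wav")
print(result)
```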

And try visualizing what our newly made data becomes.

First, we'll remove that cropping:

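pipe = Pipeline([AudioTensor.create, aud2spec])
for fn in fnames[:3]:
  audio = AudioTensor.create(fn)
  audio.show()     # display the raw waveform
  pipe(fn).show()  # display the spectrogram produced by the pipeline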