(Largely based on rbracco's tutorial, big thanks to him for his work on getting this going for us!)
`fastai`'s audio module has been in development for a while by active forum members:
from fastai.vision.all import *
from fastaudio.core.all import *
from fastaudio.augment.all import *
`tar_extract_at_filename` simply extracts the archive at the file name (as the name suggests):
path_dig = untar_data(URLs.SPEAKERS10, extract_func=tar_extract_at_filename)
Now we want to grab just the audio files.
audio_extensions[:5]
fnames = get_files(path_dig, extensions=audio_extensions)
fnames[:5]
We can convert any audio file to a tensor with `AudioTensor`. Let's try opening a file:
at = AudioTensor.create(fnames[0])
at, at.shape
at.show()
cfg = AudioConfig.Voice()
Our configuration will limit options like the frequency range and the sampling rate
cfg.f_max, cfg.sample_rate
We can then make a transform from this configuration to turn raw audio into a workable spectrogram per our settings:
aud2spec = AudioToSpec.from_cfg(cfg)
For our example, we'll crop the original audio down to 1000 ms:
crop1s = ResizeSignal(1000)
Let's build a `Pipeline` that mimics how we'd expect our data to come in:
pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
And try visualizing what our newly made data becomes.
First, we'll remove that cropping:
pipe = Pipeline([AudioTensor.create, aud2spec])
for fn in fnames[:3]:
    audio = AudioTensor.create(fn)
    audio.show()
    pipe(fn).show()
You can see that they're not all the same size here. Let's add that cropping back in:
pipe = Pipeline([AudioTensor.create, crop1s, aud2spec])
for fn in fnames[:3]:
    audio = AudioTensor.create(fn)
    audio.show()
    pipe(fn).show()
And now everything is 128x63.
For our transforms, we'll want the same ones we used before
item_tfms = [ResizeSignal(1000), aud2spec]
Our filenames are labelled with the spoken digit followed by the name of the speaker:
4_theo_37.wav
2_nicolas_7.wav
get_y = lambda x: x.name[0]
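As a quick sanity check, the label is just the first character of the file name, i.e. the digit being spoken:

# For a file named like 4_theo_37.wav, get_y returns '4'
fnames[0].name, get_y(fnames[0])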
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),
get_items=get_audio_files,
splitter=RandomSplitter(),
item_tfms = item_tfms,
get_y=get_y)
And now we can build our DataLoaders
dls = aud_digit.dataloaders(path_dig, bs=64)
Let's look at a batch
dls.show_batch(max_n=3)
Training
Now that we have our `DataLoaders`, we need to make a model. We'll make a function that changes a `Learner`'s first layer to accept a 1 channel input (similar to how we did for the Bengali.AI model):
def alter_learner(learn, n_channels=1):
    "Adjust a `Learner`'s model to accept `1` channel"
    # Grab the very first convolutional layer
    layer = learn.model[0][0]
    layer.in_channels = n_channels
    # Keep a single channel of the existing weights and restore the channel dimension
    layer.weight = nn.Parameter(layer.weight[:,1,:,:].unsqueeze(1))
    learn.model[0][0] = layer
learn = Learner(dls, xresnet18(), CrossEntropyLossFlat(), metrics=accuracy)
Now we need to grab our number of channels:
n_c = dls.one_batch()[0].shape[1]; n_c
alter_learner(learn, n_c)
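If you want to verify the change took, you can inspect the first layer directly (purely a sanity check):

# The first conv layer should now report a single input channel
learn.model[0][0]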
Now we can find our learning rate and fit!
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
learn.fit_one_cycle(5, 1e-3)
Not bad for zero data augmentation! But let's see if augmentation can help us out here!
DBMelSpec = SpectrogramTransformer(mel=True, to_db=True)
Let's take a look at our original settings:
aud2spec.settings
And we'll narrow this down a bit
aud2spec = DBMelSpec(n_mels=128, f_max=10000, n_fft=1024, hop_length=128, top_db=100)
For our transforms, we'll use:

- `RemoveSilence` - Splits a signal at points of silence more than 2 * `pad_ms` (default is 20)
- `ResizeSignal` - Crops a signal to `duration` and adds padding if needed
- `aud2spec` - Our `SpectrogramTransformer` with parameters
- `MaskTime` - Wrapper for `MaskFreq`, which applies `einsum` operations (there's a quick sketch of both after this list)
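As a rough sketch of what the masking transforms do, we can apply them by hand to a single spectrogram (reusing `crop1s` from earlier and the new `aud2spec`; this assumes the transforms can be called directly on one item, as fastai `Transform`s can):

# Build one spectrogram and mask it manually to see the effect
spec = aud2spec(crop1s(AudioTensor.create(fnames[0])))
MaskFreq(size=10)(spec).show()  # blanks out a horizontal band of frequencies
MaskTime(size=4)(spec).show()   # blanks out a vertical band of time steps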
Let's look a bit more at the padding `ResizeSignal` uses.
There are three different types:

- `AudioPadType.Zeros`: The default, random zeros before and after
- `AudioPadType.Repeat`: Repeat the signal until it reaches the proper length (great for acoustic scene classification and voice recognition, terrible for speech recognition)
- `AudioPadType.Zeros_After`: The default for many other libraries, just pad with zeros until you get the specified length
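For example, assuming `ResizeSignal` accepts a `pad_mode` argument (as it does in fastaudio), we could ask for repeat-padding instead of the default random zeros:

# Crop/pad to 1000 ms, repeating the signal rather than padding with zeros
crop_repeat = ResizeSignal(1000, pad_mode=AudioPadType.Repeat)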
Now let's rebuild our `DataBlock`:
item_tfms = [RemoveSilence(), ResizeSignal(1000), aud2spec, MaskTime(size=4), MaskFreq(size=10)]
aud_digit = DataBlock(blocks=(AudioBlock, CategoryBlock),
get_items=get_audio_files,
splitter=RandomSplitter(),
item_tfms = item_tfms,
get_y=get_y)
dls = aud_digit.dataloaders(path_dig, bs=128)
Let's look at some augmented data:
dls.show_batch(max_n=3)
Let's try training again. Also, since we have to keep making an adjustment to our model, let's make an `audio_learner` function similar to `cnn_learner`:
def audio_learner(dls, arch, loss_func, metrics):
    "Prepares a `Learner` for audio processing"
    learn = Learner(dls, arch, loss_func, metrics=metrics)
    # Only adjust the first layer when the input is single-channel
    n_c = dls.one_batch()[0].shape[1]
    if n_c == 1: alter_learner(learn)
    return learn
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
learn.fit_one_cycle(10, 3e-3)
learn.fit_one_cycle(10, 3e-4)
With the help of some of our data augmentation, we were able to perform a bit better!

Another representation we can try is MFCCs (Mel-Frequency Cepstral Coefficients) instead of mel spectrograms. Let's try it out!
aud2mfcc = AudioToMFCC(n_mfcc=40, melkwargs={'n_fft':2048, 'hop_length':256,
                                             'n_mels':128})
item_tfms = [ResizeSignal(1000), aud2mfcc]
There's a shortcut for replacing the item transforms in a `DataBlock`:
aud_digit.item_tfms
aud_digit.item_tfms = item_tfms
dls = aud_digit.dataloaders(path_dig, bs=128)
dls.show_batch(max_n=3)
Now let's build our learner and train again!
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
Now we can begin to see why choosing your augmentation is important! Let's see whether adding `Delta`, which stacks derivative (delta) channels onto the MFCCs, helps:
item_tfms = [ResizeSignal(1000), aud2mfcc, Delta()]
aud_digit.item_tfms = item_tfms
dls = aud_digit.dataloaders(path_dig, bs=128)
dls.show_batch(max_n=3)
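Since `Delta` stacks extra channels onto each item, the batch is no longer single-channel, so `audio_learner` will leave the first layer untouched. A quick way to confirm is to inspect one batch:

# With Delta the items should have 3 channels (original + delta + delta-delta)
dls.one_batch()[0].shape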
Let's try training one more time:
learn = audio_learner(dls, xresnet18(), CrossEntropyLossFlat(), accuracy)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
Let's try fitting for a few more epochs:
learn.fit_one_cycle(5, 1e-2/10)