Cross Validation and Ensembling

How to perform one of the basic Kaggle techniques using fastai
Lesson 3

Lesson Video:

What is Cross Validation?

Typically we have one model that sees the entire training dataset and is evaluated on some test set at the end. K-Fold ensembling works a bit differently. Instead we train one model per fold, for n folds, and each model only sees a subset of the training data. After training is complete, the outputs of all of the models are averaged and we take this as the overall prediction.

This technique is known as a Voting Ensemble where each model gets an equal “vote” towards the correct answer.
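As a minimal sketch of that vote (with hypothetical prediction tensors, not real model outputs), averaging the per-model class probabilities and taking the most likely class is all there is to it:

import torch

# Hypothetical predictions from 3 models on 4 samples with 2 classes;
# each row is a probability distribution over the classes
model_preds = [
    torch.tensor([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]),
    torch.tensor([[0.8, 0.2], [0.6, 0.4], [0.6, 0.4], [0.3, 0.7]]),
    torch.tensor([[0.7, 0.3], [0.3, 0.7], [0.8, 0.2], [0.1, 0.9]]),
]

# Equal vote: average the probabilities, then pick the most likely class
ensemble_probs = torch.stack(model_preds).mean(dim=0)
ensemble_classes = ensemble_probs.argmax(dim=1)  # tensor([0, 1, 0, 1])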

Generally this results in a set of models that generalize better as a whole, and this approach has been seen to perform much better on unseen data in Kaggle competitions.

That being said, the computational requirements are significantly higher, as is the time it takes for you to actually iterate on your results.

Let’s begin

For this problem we’ll yet again utilize the PETs dataset, since we’re intimately familiar with the various levels of the API and how it’s written.

This will be especially important as we will need to use the Datasets level of the API to do this efficiently.

First we’ll import the library:

from fastai.vision.all import *
from sklearn.model_selection import StratifiedKFold

The particular type of K-Fold validation we’ll be performing is called Stratified K-Fold, which ensures that the distribution of classes remains the same across all k folds.
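As a quick illustration (with a made-up label list, not the PETs data), you can verify that each validation fold keeps the same class proportions as the whole dataset:

import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# Toy labels: 8 cats and 4 dogs (a 2:1 ratio)
toy_labels = ['cat'] * 8 + ['dog'] * 4

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for _, valid_idx in skf.split(np.zeros(len(toy_labels)), toy_labels):
    # Every validation fold keeps the 2:1 cat/dog ratio
    print(Counter(toy_labels[i] for i in valid_idx))
# Each of the 4 folds prints Counter({'cat': 2, 'dog': 1})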

Next we’ll bring in our dataset and setup our transforms like before:

path = untar_data(URLs.PETS)
fnames = get_image_files(path/'images')
pat = r'(.+)_\d+.jpg$'
item_tfms = [RandomResizedCrop(460, min_scale=0.75, ratio=(1.,1.)), ToTensor()]
batch_tfms = [IntToFloatTensor(), *aug_transforms(size=224, max_warp=0), Normalize.from_stats(*imagenet_stats)]
batch_size = 64

This time we will utilize the IndexSplitter that fastai provides, as our K-Fold will give us a list of indices representing our validation set.
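As a small sketch of how IndexSplitter behaves (with made-up items and indices): we build it from the validation indices, and calling it on our items returns the training and validation index lists:

items = ['a', 'b', 'c', 'd', 'e']
splitter = IndexSplitter([1, 3])          # indices 1 and 3 become the validation set
train_idxs, valid_idxs = splitter(items)  # everything else goes to training
train_idxs, valid_idxs                    # ([0, 2, 4], [1, 3]) as fastai L lists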

K-Fold Validation and Data Subsets

When performing K-Fold validation we turn 2 subsets of data (train and validation) into three: train, validation, and test.

The test set should never be evaluated (validated) on until the very end of the entire training process, and it should have no impact on how we judge whether the intermediate models are training well.

The validation sets are then smaller subsets of the training dataset, such that if we had a k of 10 each validation set would be a unique 10% of the data.

Creating the splits

Let’s set aside our train/validation and test sets:

random.shuffle(fnames)

train_fnames = [filename for filename in fnames[:int(len(fnames) * .9)]]
test_fnames = [filename for filename in fnames[int(len(fnames) * .9):]]

random.shuffle(fnames)

In order to efficiently select at random from the data we can just shuffle our list of filenames in place and then slice from it.


int(len(fnames) * .9)

For this example we’ll randomly set aside 10% of the data as our test set.

We then need to actually create our K-Fold splits. To do so first let’s extract all of the labels from our dataset:

vocab = list(map(RegexLabeller(pat=r'/([^/]+)_\d+.*'), train_fnames))
pipe = Pipeline([
    RegexLabeller(pat=r'/([^/]+)_\d+.*'), Categorize(vocab=vocab)
])

Pipeline([ExampleClassA(), ExampleClassB()])

This is the basic Pipeline class which controls how transforms are applied (read more about this below)


    RegexLabeller(pat=r'/([^/]+)_\d+.*'), Categorize(vocab=vocab)

Similar to what we saw earlier, we can create a Pipeline of only the transforms that will extract and encode the label from a filename.


pipe.setup(train_fnames)

Some transforms require a certain setup to be performed before use; this is how Categorize would build its class list if we hadn’t passed vocab in explicitly. Setup is done by calling pipe.setup and passing in a list of items for the transforms to utilize
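As a quick sanity check (the breed names shown are just illustrative), a Pipeline delegates attribute lookups to its transforms, so we can peek at the class list Categorize is carrying:

pipe.vocab  # e.g. ['Abyssinian', 'Bengal', ..., 'yorkshire_terrier'], the 37 pet breeds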

What is a Pipeline?

A Pipeline class is what fastai uses under the hood to call each set of transforms. It’s similar to torchvision’s Compose that we saw earlier, except we can also control the transform behavior. For example, this is how fastai will either do random cropping or center cropping based on whether the transforms are being applied to the training dataset or the validation dataset.

When going through a Pipeline, fastai will call each transform’s __call__ method (or, for a plain function, just call it directly). This is why the previous code example used ExampleClassA(): it’s assumed that if we have tfm = ExampleClassA() we can then do tfm(some_input)
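Here is a minimal sketch of that idea with two made-up transforms (AddOne and Double are not part of fastai, just illustrations): a Pipeline composes them and calls each one in turn:

class AddOne(Transform):
    "Hypothetical transform that adds one to its input"
    def encodes(self, x): return x + 1

class Double(Transform):
    "Hypothetical transform that doubles its input"
    def encodes(self, x): return x * 2

toy_pipe = Pipeline([AddOne(), Double()])
toy_pipe(3)  # AddOne runs first, then Double: (3 + 1) * 2 = 8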

labels = list(map(pipe, train_fnames))

list(map(function, items))

The map function returns a lazy iterator that applies a single function to each item from items as you step through it. We can wrap it in list to instantly run that iterator to its end and collect all of the results.
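For instance, a quick illustration with a plain built-in function (nothing fastai-specific):

list(map(str.upper, ['beagle', 'pug']))  # ['BEAGLE', 'PUG']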


labels = list(map(pipe, train_fnames))

This lets us apply the pipeline we just created to every single fname in train_fnames and get the results back as a list, rather than having to write a for loop

Now that we have the labels we can create the folds. For this example we will split the dataset into 10 subsets:

splits = []
skf = StratifiedKFold(n_splits=10, shuffle=True)
for _, valid_indexes in skf.split(
    np.zeros(len(labels)), labels
):
    split = IndexSplitter(valid_indexes)
    splits.append(split)

splits = []

All of our splits across the ten validation folds will be stored in this array


skf = StratifiedKFold(n_splits=10, shuffle=True)

This will instantiate our StratifiedKFold class, specifying the number of splits to use and that each class’s indices should be shuffled before they’re distributed between the splits


(
    np.zeros(len(labels)), labels
)

The skf.split function needs to take in both an X and a y. Since only the labels matter for stratification, it’s possible to just pass a placeholder array of zeros as the X, which is what we’re doing here


 skf.split(
    np.zeros(len(labels)), labels
)

The split function is what will actually take our labels, work out their class distribution, and return an iterator that yields one (train indices, validation indices) pair for each of our specified folds


for _, valid_indexes in 

Each step of the iterator yields two arrays of indices: the training indices and the validation indices. Since we only need the validation indices (IndexSplitter will work out the training ones for us), we discard the first array


    split = IndexSplitter(valid_indexes)
    splits.append(split)

We then use the IndexSplitter class from fastai which will split between train and validation based on a set of indices passed in (these will be the validation indices). From there we store each splitter in an array

Now that we have our splits we can create a training loop!

The Training Loop

We’ll keep two arrays: one for each fold’s validation accuracy (our metric) and one for each model’s predictions on our test set:

valid_pcts = []
test_preds = []

And now create a train function which will take in a splitter, create a learner, and train:

def train(splitter:IndexSplitter):
    "Trains a single model over a set of splits based on `splitter`"
    dset = Datasets(
        train_fnames,
        tfms = [
            [PILImage.create], 
            [RegexLabeller(pat=r'/([^/]+)_\d+.*'), Categorize]
        ],
        splits = splitter(train_fnames)
    )
    dls = dset.dataloaders(
        bs=batch_size,
        after_item=item_tfms,
        after_batch=batch_tfms
    )
    learn = vision_learner(dls, resnet34, metrics=accuracy)
    learn.fit_one_cycle(1)
    valid_pcts.append(learn.validate()[1])
    dl = learn.dls.test_dl(test_fnames)
    preds, _ = learn.get_preds(dl=dl)
    test_preds.append(preds)

    dset = Datasets(
        train_fnames,
        tfms = [
            [PILImage.create], 
            [RegexLabeller(pat=r'/([^/]+)_\d+.*'), Categorize]
        ],
        splits = splitter(train_fnames)
    )

This will create a dataset based on all of our training filenames and apply the passed in splitter from earlier to define our splits


    dls = dset.dataloaders(
        bs=batch_size,
        after_item=item_tfms,
        after_batch=batch_tfms
    )

Then we create a set of dataloaders based on the batch size and augmentation we defined earlier.


    learn = vision_learner(dls, resnet34, metrics=accuracy)

Here we create our model and wrap it in a Learner, attaching accuracy to the Learner’s metrics.


    learn.fit_one_cycle(1)

Then we perform whatever training we want over each split. Since this is just an example, the model is trained for a single epoch using the One Cycle policy


    valid_pcts.append(learn.validate()[1])

Then we take the validation metric for that model and store it away in the valid_pcts array. learn.validate returns the validation loss followed by each metric in the order they were attached, so index 1 is our accuracy
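As a small sketch of that ordering (assuming only the single accuracy metric we attached above), the result could also be unpacked directly:

# learn.validate() returns [valid_loss, metric_1, metric_2, ...], so with
# accuracy as the only metric we can unpack both values at once
val_loss, val_acc = learn.validate()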


    dl = learn.dls.test_dl(test_fnames)
    preds, _ = learn.get_preds(dl=dl)

Afterwards we perform inference by creating a test dataloader on the final hold-out dataset. This new dl applies the validation-time transforms (no random augmentation) whenever a batch is drawn from it.


    test_preds.append(preds)

Finally we store those new predictions in the array defined earlier.

for splitter in splits:
    train(splitter)
epoch train_loss valid_loss accuracy time
0 1.201077 0.350736 0.884384 00:32
epoch train_loss valid_loss accuracy time
0 1.131611 0.360337 0.885714 00:31
epoch train_loss valid_loss accuracy time
0 1.140909 0.401147 0.881203 00:31
epoch train_loss valid_loss accuracy time
0 1.164429 0.397282 0.872180 00:32
epoch train_loss valid_loss accuracy time
0 1.178342 0.423457 0.879699 00:31
epoch train_loss valid_loss accuracy time
0 1.143618 0.354578 0.890226 00:31
epoch train_loss valid_loss accuracy time
0 1.187801 0.380514 0.872180 00:32
epoch train_loss valid_loss accuracy time
0 1.178310 0.334488 0.890226 00:32
epoch train_loss valid_loss accuracy time
0 1.172838 0.354474 0.885714 00:32
epoch train_loss valid_loss accuracy time
0 1.221658 0.352071 0.897744 00:32

Performing the Ensemble

Currently we just have a set of predictions and a number of trained models; we need a way to actually perform the ensembling (through equal voting) mentioned earlier.

Since the term equal is in there, it can be inferred that we take the average of the predictions across all of the different models and use that as the final result.

First let’s check the accuracy of one fold:

test_labels = torch.stack([pipe(fname) for fname in test_fnames])
accuracy(test_preds[0], test_labels)
TensorBase(0.8877)

Then we can get the results for all of them to see their distribution:

for preds in test_preds:
    print(accuracy(preds, test_labels))
TensorBase(0.8877)
TensorBase(0.8904)
TensorBase(0.9039)
TensorBase(0.8904)
TensorBase(0.8796)
TensorBase(0.9039)
TensorBase(0.8945)
TensorBase(0.8863)
TensorBase(0.9012)
TensorBase(0.8904)

The highest is 90.39% and the lowest is 87.96%. How will the ensemble do?

Finally we can perform our vote:

votes = torch.stack(test_preds, dim=-1).sum(-1) / len(test_preds)

And see our new accuracy:

accuracy(votes, test_labels)
TensorBase(0.9215)

We can see that ensembling worked well here and our results improved!

So, should we just shove even more folds into the mix to get a better model?

Ensembling in this way has diminishing returns, so the right number of folds is a hyperparameter found through trial and error. People will also typically combine multiple different model architectures across these folds, so there is even more variation in the ensemble and models that perform better on certain parts of the data can shine and improve the rest.
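As a hedged sketch of that idea (the particular backbones and the rotation scheme here are just assumptions for illustration, not part of the lesson), you could vary the architecture across folds while reusing the same training loop and still average the test predictions exactly as before:

# Hypothetical: rotate through a few backbones so ensemble members differ
# by architecture as well as by data split
archs = [resnet34, resnet50, resnet18]

def train_varied(splitter, arch):
    "Same loop as `train` above, but with a configurable backbone"
    dset = Datasets(
        train_fnames,
        tfms=[[PILImage.create],
              [RegexLabeller(pat=r'/([^/]+)_\d+.*'), Categorize]],
        splits=splitter(train_fnames)
    )
    dls = dset.dataloaders(bs=batch_size, after_item=item_tfms, after_batch=batch_tfms)
    learn = vision_learner(dls, arch, metrics=accuracy)
    learn.fit_one_cycle(1)
    dl = learn.dls.test_dl(test_fnames)
    preds, _ = learn.get_preds(dl=dl)
    return preds

mixed_preds = [train_varied(split, archs[i % len(archs)]) for i, split in enumerate(splits)]
mixed_votes = torch.stack(mixed_preds, dim=-1).sum(-1) / len(mixed_preds)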