from fastai.vision.all import *
Below you will find the exact imports for everything we use today
from fastcore.foundation import L
from fastai.callback.fp16 import to_fp16
from fastai.callback.progress import ProgressCallback
from fastai.callback.schedule import fit_one_cycle
from fastai.data.core import Datasets, show_at
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import IntToFloatTensor, Normalize, ToTensor, IndexSplitter, get_image_files, parent_label, Categorize
from fastai.metrics import accuracy
from fastai.vision.augment import aug_transforms, RandomResizedCrop
from fastai.vision.core import PILImage, imagenet_stats
from fastai.vision.learner import cnn_learner
import numpy as np
import random
from sklearn.model_selection import StratifiedKFold
from torchvision.models.resnet import resnet34
path = untar_data(URLs.IMAGEWOOF)
path.ls()
Scenario:
- We have a training set
- We have a separate, held-out test set
- We want to run K-Fold cross-validation on the training set and ensemble each fold's predictions on the test set
item_tfms = [ToTensor(), RandomResizedCrop(460, min_scale=0.75, ratio=(1.,1.))]
batch_tfms = [IntToFloatTensor(), *aug_transforms(size=224, max_warp=0), Normalize.from_stats(*imagenet_stats)]
bs=64
We'll use the IndexSplitter here just to get to know it; what we wind up building by hand is the equivalent of a RandomSplitter with an 80/20 split.
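For reference, a minimal sketch of what that equivalent would look like with RandomSplitter (not used below, just for comparison):
from fastai.data.transforms import RandomSplitter
splitter = RandomSplitter(valid_pct=0.2)  # hold out a random 20% of the items for validation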
We can see IndexSplitter's source code by doing:
IndexSplitter??
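As a quick toy example (not part of our pipeline), IndexSplitter takes the indices that should go to the validation set and returns a function that splits any list of items into (train indices, valid indices):
toy_splitter = IndexSplitter([2, 3, 4])  # hypothetical validation indices, just for illustration
toy_splitter(list(range(6)))             # roughly: ((#3) [0,1,5], (#3) [2,3,4])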
Next let's get our images
train_imgs = get_image_files(path/'train')
tst_imgs = get_image_files(path/'val')
We'll shuffle our training set so that getting every class into each split is almost guaranteed
random.shuffle(train_imgs)
len(train_imgs)
And then we will do the 80/20 split
train_imgs
start_val = len(train_imgs) - int(len(train_imgs)*.2)
idxs = list(range(start_val, len(train_imgs)))
splitter = IndexSplitter(idxs)
splits = splitter(train_imgs)
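As a quick sanity check (just for illustration), the two index lists should cover all of the training images, with roughly 20% landing in the validation split:
len(splits[0]), len(splits[1])  # should sum to len(train_imgs)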
Since we want to include our test set in with these splits, we'll make a split_list of all three of our splits (train, valid, test)
split_list = [splits[0], splits[1]]
And we'll add in the range for our test set here:
split_list.append(L(range(len(train_imgs), len(train_imgs)+len(tst_imgs))))
split_list
Let's check that everything worked as intended. First, building the Datasets:
dsrc = Datasets(train_imgs+tst_imgs, tfms=[[PILImage.create], [parent_label, Categorize]],
                splits=split_list)
We can look at an item:
show_at(dsrc.train, 3)
And if we check n_subsets, we can see that three are there (for our three splits):
dsrc.n_subsets
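If we'd like, we can also peek directly at that third subset; a minimal sketch, assuming dsrc.subset(2) exposes the test split the same way dsrc.train exposes the first:
show_at(dsrc.subset(2), 0)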
Now let's build some DataLoaders
dls = dsrc.dataloaders(bs=bs, after_item=item_tfms, after_batch=batch_tfms)
dls.show_batch()
We can see the number of subsets was passed down here as well:
dls.n_subsets
What this means is that while dls.train and dls.valid will return what we would expect, if we instead index into our DataLoaders we can find our testing data in there too:
dls[2].show_batch()
Let's do a quick baseline
learn = cnn_learner(dls, resnet34, pretrained=False, metrics=accuracy).to_fp16()
learn.fit_one_cycle(1)
Now how do we check it? We can run learn.validate on our test subset by passing its index:
learn.validate(ds_idx=2)
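validate returns the loss followed by each metric, so grabbing index 1 gives us the test accuracy (something we'll lean on inside the loop later); a small sketch, with the baseline_tst_acc name just for illustration:
baseline_tst_acc = learn.validate(ds_idx=2)[1]  # accuracy on the test subset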
Now let's move on to K-Fold cross-validation. First let's import our KFold:
from sklearn.model_selection import StratifiedKFold
And grab all the labels from our dataset (dsrc.tfms[1] is our label pipeline: parent_label followed by Categorize):
train_labels = L(dsrc.items).map(dsrc.tfms[1])
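StratifiedKFold will use these labels to keep the class proportions even across folds. If you want to sanity check the class balance first, a quick sketch (Counter is not something we use elsewhere, just an illustration):
from collections import Counter
Counter(train_labels.map(int))  # counts per encoded class index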
Now let's make our K-Fold
kf = StratifiedKFold(n_splits=5, shuffle=True)
Finally we need to define a training loop to go over all our folds and gather our validation and test accuracy
What's our loop going to look like?
val_pct = []
tst_preds = []
for _, val_idx in kf.split(np.array(train_imgs+tst_imgs), train_labels):
    # Build this fold's train/valid split from the indices StratifiedKFold gives us
    splits = IndexSplitter(val_idx)
    split = splits(train_imgs)
    split_list = [split[0], split[1]]
    # Tack the test set back on as a third split, just like before
    split_list.append(L(range(len(train_imgs), len(train_imgs)+len(tst_imgs))))
    dsrc = Datasets(train_imgs+tst_imgs, tfms=[[PILImage.create], [parent_label, Categorize]],
                    splits=split_list)
    dls = dsrc.dataloaders(bs=bs, after_item=item_tfms, after_batch=batch_tfms)
    # Train a fresh model on this fold
    learn = cnn_learner(dls, resnet34, pretrained=False, metrics=accuracy)
    learn.fit_one_cycle(1)
    # Record the validation accuracy and the raw test-set predictions
    val_pct.append(learn.validate()[1])
    a,b = learn.get_preds(ds_idx=2)
    tst_preds.append(a)
Now how do we combine all our predictions? We sum them all together and then divide by the number of folds, averaging the predicted probabilities (this is referred to as a voting ensemble)
First let's check the accuracy of one fold:
tst_preds_copy = tst_preds.copy()
accuracy(tst_preds_copy[0], b)
Then we can print out the accuracy for each fold. We can see our highest accuracy on the test set was 26.27%:
for i in tst_preds_copy:
    print(accuracy(i, b))
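As an aside, the whole vote could be done in one line by stacking the per-fold predictions and averaging them (hat_alt here is just an illustrative name; below we walk through it step by step instead):
import torch
hat_alt = torch.stack(list(tst_preds)).mean(dim=0)  # average probabilities across folds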
Now let's perform our vote:
hat = tst_preds[0]
for pred in tst_preds[1:]:
    hat += pred
hat
hat /= len(tst_preds)
And see what our new accuracy is
accuracy(hat, b)
That's an improvement of ~2.5%! Not bad!
Ensembling in this way can have diminishing returns, so finding the right number of folds to use is something you should figure out through trial and error on subsamples of your dataset first (or, if you're on Kaggle, see what other folks are using for theirs too!)