from fastai.vision.all import *
path = untar_data(URLs.PASCAL_2007)
Now how do we get our labels? fastai has a get_annotations function that we can use to grab the images and their bounding boxes. The one-line documentation states:
"Open a COCO style json in fname
and returns the list of filenames (with mabye prefix
) and labelled bounding boxes."
path.ls()
We'll want to read from the train.json file:
imgs, lbl_bbox = get_annotations(path/'train.json')
imgs[0]
lbl_bbox[0]
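Each entry in lbl_bbox pairs a list of boxes with the matching list of class names. Here's an illustrative way to unpack one, assuming the corner-style boxes that get_annotations returns:
# Illustrative: each lbl_bbox entry is a (boxes, labels) pair, where each
# box is [x_min, y_min, x_max, y_max] and labels holds the class name for
# the box at the same index.
bboxes, lbls = lbl_bbox[0]
len(bboxes) == len(lbls)  # one label per box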
Next, we want to be able to quickly look up the labels for a given image. We'll use a dictionary:
img2bbox = dict(zip(imgs, lbl_bbox))
Let's check the first item
first = {k: img2bbox[k] for k in list(img2bbox)[:1]}; first
Great! Now let's build our DataBlock. We'll have one input (the image) and two outputs: the bounding boxes and their labels.
getters = [lambda o: path/'train'/o, lambda o: img2bbox[o][0], lambda o: img2bbox[o][1]]
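If it helps, here's an illustrative look at what each getter hands to its matching block (the block wiring itself comes in a moment):
# Illustrative only: call each getter on one filename from `imgs` to see
# what flows into each block of the DataBlock we build below.
sample = imgs[0]
getters[0](sample)  # full path to the image file (the input, ImageBlock)
getters[1](sample)  # list of bounding boxes (first output, BBoxBlock)
getters[2](sample)  # list of class labels (second output, BBoxLblBlock)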
For our transforms, we'll use some of the ones we defined earlier
item_tfms = [Resize(128, method='pad'),]
batch_tfms = [Rotate(), Flip(), Dihedral(), Normalize.from_stats(*imagenet_stats)]
Why do we need a custom get_items function? Because we only want the images that get_annotations gave back to us, not the entire folder:
def get_train_imgs(noop): return imgs
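As an illustrative check, get_train_imgs ignores whatever source the DataBlock passes in (we'll hand it path/'train' later) and simply returns the filenames we already have:
# Illustrative: the DataBlock calls get_items(source); we ignore the source
# and return the filenames from get_annotations.
get_train_imgs(path/'train')[:3]  # first few image filenames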
We'll now make our DataBlock. We need to set n_inp=1 so fastai knows that only the first block is an input and the other two are outputs:
pascal = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock),
splitter=RandomSplitter(),
get_items=get_train_imgs,
getters=getters,
item_tfms=item_tfms,
batch_tfms=batch_tfms,
n_inp=1)
dls = pascal.dataloaders(path/'train')
dls.c = 20  # PASCAL VOC has 20 object classes
dls.show_batch()
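If you want to see what that n_inp=1 split looks like at the batch level, here's an illustrative peek (the exact shapes depend on your batch size and how the boxes get padded):
# Illustrative: one batch gives us three pieces, the images (input), the
# padded bounding boxes, and the matching labels (the two outputs).
xb, bbox_tgts, lbl_tgts = dls.one_batch()
xb.shape, bbox_tgts.shape, lbl_tgts.shape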
The Model
The architecture we are going to use is called RetinaNet
. I've exported this all myself for you guys to use quickly; if you want to explore what's going on in the code, I'd recommend the Object Detection lesson here
Let's import it:
from wwf.vision.object_detection import *
We're still going to use transfer learning here by creating an encoder
(body) of our model and a head
encoder = create_body(resnet34, pretrained=True)
Now that we have our encoder, we can build the RetinaNet architecture. We'll pass in the encoder, the number of classes, and what we want the final bias on the last convolutional layer to be (this affects how the model is initialized). Jeremy has his example at -4, so let's use that:
get_c(dls)
arch = RetinaNet(encoder, get_c(dls), final_bias=-4)
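If you're curious where that -4 comes from, it follows the RetinaNet paper's initialization trick: set the classification bias so every anchor starts out predicting a low foreground probability pi, which keeps the focal loss from being swamped by easy negatives early in training. A quick sketch of the arithmetic:
import math
# RetinaNet paper: initialize the classification bias to -log((1 - pi) / pi)
# with pi = 0.01, so the model starts out predicting ~1% foreground everywhere.
pi = 0.01
-math.log((1 - pi) / pi)  # roughly -4.6, hence the -4 used above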
Another big difference is the head of our model. Instead of the usual pooling plus linear layers:
create_head(124, 4)
We have one with smoothers, a classifier, and a box_regressor (to get our box coordinates):
arch.smoothers
arch.classifier
arch.box_regressor
Loss Function
Now we can move on to our loss function. For RetinaNet to work, we need to define what the aspect ratios and scales of our anchors should be. The paper used scales of [1, 2^(1/3), 2^(2/3)], but it also used an image size of 600 pixels, so even the anchors on the largest feature map covered less than the full image. With our smaller images those scales would spill over the edges, so we'll use 2^(-1/3) and 2^(-2/3) instead. We will need these again for inference later!
ratios = [1/2,1,2]
scales = [1,2**(-1/3), 2**(-2/3)]
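As a quick illustrative sanity check, RetinaNet lays down one anchor per (ratio, scale) pair at every feature-map location, so these choices give us the paper's 9 anchors per location:
# Illustrative: one anchor per (ratio, scale) combination at each location.
len(ratios) * len(scales)  # 9 anchors per feature-map location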
Let's make our loss function, which is RetinaNetFocalLoss
crit = RetinaNetFocalLoss(scales=scales, ratios=ratios)
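RetinaNetFocalLoss also handles matching anchors to ground-truth boxes, but the core focal-loss idea is simple: scale the usual cross-entropy by (1 - p_t)^gamma so easy, confident predictions contribute almost nothing. Here's a minimal illustrative sketch of just that piece (not the wwf implementation):
import torch
import torch.nn.functional as F

# Illustrative sketch of the focal-loss term only (not the wwf implementation):
# down-weight easy examples so training focuses on the hard anchors.
def focal_loss_sketch(logits, targets, gamma=2., alpha=0.25):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p*targets + (1 - p)*(1 - targets)
    alpha_t = alpha*targets + (1 - alpha)*(1 - targets)
    return (alpha_t * (1 - p_t)**gamma * ce).mean()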
Now let's make our Learner!
We want to freeze our encoder
and keep everything else unfrozen to start
def _retinanet_split(m): return L(m.encoder,nn.Sequential(m.c5top6, m.p6top7, m.merges, m.smoothers, m.classifier, m.box_regressor)).map(params)
learn = Learner(dls, arch, loss_func=crit, splitter=_retinanet_split)
learn.freeze()
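As an illustrative check that the splitter did what we expect, it should hand back exactly two parameter groups: the pretrained encoder (now frozen) and the freshly initialized RetinaNet layers we'll be training:
# Illustrative: the splitter returns two parameter groups, so freeze()
# leaves only the new RetinaNet layers trainable.
len(_retinanet_split(arch))  # 2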
Now let's train!
learn.fit_one_cycle(10, slice(1e-5, 1e-4))
Word of Warning:
show_results and predict do not currently work with this setup. I'd recommend using the IceVision library for your Object Detection needs.