Binary Classification
In this example we will walk through the fastai tabular API to perform binary classification on the Salary dataset.
This notebook can run alongside the first tabular lesson from Walk with fastai2, shown here.
First we need to import the tabular module:
from fastai.tabular.all import *
And grab our dataset:
path = untar_data(URLs.ADULT_SAMPLE)
If we look at the contents of our folder, we will find our data lives in adult.csv:
path.ls()
We'll go ahead and open it in Pandas and take a look:
df = pd.read_csv(path/'adult.csv')
df.head()
TabularPandas
fastai has a new way of dealing with tabular data by utilizing a TabularPandas object. It expects some dataframe, some procs, cat_names, cont_names, y_names, y_block, and some splits. We'll walk through all of them.
First we need to grab our categorical and continuous variables, along with how we want to process our data.
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
When we pre-process tabular data with fastai, we do one or more of three transforms: Categorify, FillMissing, and Normalize.
Categorify
Categorify will transform the columns that are in your cat_names into categorical columns, along with label encoding our categorical data.
First we'll make an instance of it:
cat = Categorify()
And now let's try transforming a dataframe:
to = TabularPandas(df, cat, cat_names)
We can then extract that transform from to.procs.categorify:
cats = to.procs.categorify
Let's take a look at the categories:
cats['relationship']
We can see that it added a #na# category. Let's look at the actual column:
to.show(max_n=3)
We can see now, for example, that occupation was assigned a #na# value (as it was missing).
If we call to.cats we can see our label-encoded variables:
to.cats.head()
Normalize
Normalize will standardize our continuous variables (subtracting the mean and dividing by the standard deviation). First we'll make an instance of it:
norm = Normalize()
Let's make another to with just our continuous variables:
to = TabularPandas(df, norm, cont_names=cont_names)
And extract the transform to take a closer look:
norms = to.procs.normalize
We can grab the means and standard deviations like so:
norms.means
norms.stds
And we can also call to.conts to take a look at our transformed data:
to.conts.head()
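As a quick sanity check (a sketch, not part of the original walkthrough), we can reproduce the transform by hand from the stored statistics; it should approximately match to.conts:
# Sanity check (sketch): normalized 'age' should be roughly (raw - mean) / std
manual_age = (df['age'].head() - norms.means['age']) / norms.stds['age']
manual_age.values, to.conts['age'].head().values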
FillMissing
FillMissing will fill in any missing values in our continuous variables; here we'll use the median as the fill strategy:
fm = FillMissing(fill_strategy=FillStrategy.median)
We'll recreate another TabularPandas:
to = TabularPandas(df, fm, cont_names=cont_names)
Let's look at those missing values in the first few rows:
to.conts.head()
But wait! There's more!
to.cat_names
We have categorical values?! Yes!
to.cats.head()
We now also have an additional boolean column telling us whether the value was missing or not!
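To see what FillMissing actually had to fill, we can count the missing values per continuous column in the raw DataFrame (a quick check, not from the original notebook):
# Count missing values in the raw data for each continuous column
df[cont_names].isna().sum()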
The DataLoaders
Now let's build our TabularPandas for classification. We're also going to want to split our data and declare our y_names too:
splits = RandomSplitter()(range_of(df))
splits
What is range_of?
range_of(df)[:5], len(df)
It's a list of all the indices in our DataFrame.
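By default RandomSplitter holds out 20% of those indices at random for validation. If you want a reproducible split, you can pass valid_pct and seed explicitly (a small sketch, not part of the original walkthrough):
# Hypothetical variant: a reproducible 80/20 split
seeded_splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
len(seeded_splits[0]), len(seeded_splits[1])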
We'll use all our cat and cont names, the procs, declare a y_name, and finally specify a single-label classification problem with CategoryBlock:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
Now that we have everything declared, let's build our TabularPandas:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
And now we can build the DataLoaders. We can do this one of two ways: first, by just calling to.dataloaders() on our data:
dls = to.dataloaders()
Or we can create the DataLoaders ourselves (a train and a valid). One great reason to do it this way is that we can pass different batch sizes into each TabDataLoader, along with changing options like shuffle and drop_last.
So how do we use it? Our train and validation data live in to.train and to.valid right now, so we specify those along with our options. When you make a training DataLoader, you want shuffle to be True and drop_last to be True (so we drop the last incomplete batch):
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
Now we can make some DataLoaders:
dls = DataLoaders(trn_dl, val_dl)
And show a batch of data:
dls.show_batch()
Why can we call .dataloaders()? Because TabularPandas itself is actually a set of TabDataLoaders! See below for a comparison test:
to._dbunch_type == dls._dbunch_type
Tabular Learner and Training a Model
Now we can build our Learner! But what's special about a tabular neural network?
Categorical Variables
When dealing with our categorical data, we create what is called an embedding matrix. This allows for a higher-dimensional representation of the relationships between the different categorical values. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.
This "rule of thumb" is to use either a maximum embedding space of 600, or 1.6 times the cardinality raised to the 0.56, or written out as:
$$min(600, (1.6 * {var.nunique)}^{0.56})$$
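As a rough sketch (this follows the rule of thumb above, not necessarily fastai's exact source), we could compute these sizes ourselves:
# A minimal sketch of the rule of thumb above
def emb_sz_rule_sketch(n_cat):
    "Suggested embedding size for a variable with `n_cat` categories (including #na#)"
    return min(600, round(1.6 * n_cat**0.56))

# e.g. a column with 10 categories (9 values plus #na#)
emb_sz_rule_sketch(10)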
Let's calculate these embedding sizes for our model to take a look-see:
emb_szs = get_emb_sz(to); emb_szs
If we want to see what each one aligns to, let's look at the order of our cat_names:
to.cat_names
Let's specifically look at workclass:
to['workclass'].nunique()
If you notice, we had 10 there; this takes one extra category for any missing categorical values that may show up.
We also need the number of continuous variables:
cont_len = len(to.cont_names); cont_len
And now we have all the pieces we need to build a TabularModel!
First, let's grab a batch of data to see what the model will receive:
batch = dls.one_batch(); len(batch)
batch[0][0], batch[1][0]
With the first being our categorical variables and the second being our numerical (continuous) variables.
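As a quick check (a sketch, not from the original notebook), we can also look at the shapes of all three tensors in the batch: categoricals, continuous variables, and targets:
# The batch is (categorical tensor, continuous tensor, targets), one row per sample
cats_x, conts_x, ys = batch
cats_x.shape, conts_x.shape, ys.shape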
Now let's make our model. We'll want our size of our embeddings, the number of continuous variables, the number of outputs, and how large and how many fully connected layers we want to use:
net = TabularModel(emb_szs, cont_len, 2, [200,100])
Let's see its architecture:
net
tabular_learner
Now that we know the background, let's build our model a little bit faster and generate a Learner too:
learn = tabular_learner(dls, [200,100], metrics=accuracy)
And now we can fit!
learn.lr_find()
learn.fit(3, 1e-2)
Can we speed this up a little? Yes we can! The more you can load into a batch, the faster you can process the data. This is a careful balance; for tabular data I go to a maximum of 4096 rows per batch if the dataset is large enough for a decent number of batches:
dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.fit(3, 1e-2)
We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off):
dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.fit(3, 1e-2)
Inference
Let's look at predicting on a single row with learn.predict:
row, cls, probs = learn.predict(df.iloc[0])
row.show()
Now let's try test_dl. There's something special we can do here too:
dl = learn.dls.test_dl(df.iloc[:100])
Let's look at a batch:
dl.show_batch()
We have our labels! It'll grab them if possible by default!
What does that mean? Well, besides simply calling get_preds, we can also run validate to see how a model performs. This is nice, as it can allow for efficient methods when calculating something like permutation importance:
learn.validate(dl=dl)
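For example, here is a rough sketch of single-column permutation importance built on top of learn.validate (the helper name and column choice are hypothetical, not part of fastai):
import numpy as np

def permutation_importance_sketch(learn, df, col, metric_idx=1):
    "Drop in the metric (accuracy here) after shuffling `col` in a copy of `df`"
    base = learn.validate(dl=learn.dls.test_dl(df))[metric_idx]
    shuffled = df.copy()
    shuffled[col] = np.random.permutation(shuffled[col].values)
    permuted = learn.validate(dl=learn.dls.test_dl(shuffled))[metric_idx]
    return base - permuted

# e.g. how much does accuracy drop if we scramble 'education'?
permutation_importance_sketch(learn, df.iloc[:100], 'education')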
We'll also show an example of get_preds:
preds = learn.get_preds(dl=dl)
preds[0][0]
What would happen if I accidentally passed in an unlabelled dataset to learn.validate though? Let's find out:
df2 = df.iloc[:100].drop('salary', axis=1)
df2.head()
dl = learn.dls.test_dl(df2)
learn.validate(dl=dl)
We can see it will simply return None!
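If you do want predictions for unlabelled data, get_preds still works on that same test_dl; the second element of the result (the targets) simply comes back empty (a quick sketch):
# get_preds on the unlabelled test_dl: we still get probabilities, just no targets
preds, targs = learn.get_preds(dl=dl)
preds[:3], targs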