We'll call in the tabular module:
from fastai.tabular.all import *
Below you will find exact imports for everything used today:
from fastcore.basics import range_of, ifnone
from fastai.callback.progress import ProgressCallback
from fastai.callback.schedule import lr_find
from fastai.data.block import CategoryBlock
from fastai.data.core import DataLoaders
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import RandomSplitter
from fastai.learner import load_learner, Learner
from fastai.metrics import accuracy
from fastai.tabular.core import Categorify, FillMissing, FillStrategy, Normalize, TabularPandas, TabDataLoader
from fastai.tabular.model import TabularModel
from fastai.tabular.learner import tabular_learner
import pandas as pd
And let's grab some data!
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
The data we want lives in adult.csv
df = pd.read_csv(path/'adult.csv')
Let's take a look at it:
df.head()
TabularPandas
fastai has a new way of dealing with tabular data in a TabularPandas object. It expects some dataframe, some procs, cat_names, cont_names, y_names, y_block, and some splits. We'll walk through all of them.
First we need to grab our categorical and continuous variables, along with how we want to process our data.
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
When we pre-process tabular data, fastai applies one or more of three transforms:
Categorify
Categorify will transform columns that are in your cat_names into pandas Categorical columns, along with label encoding our categorical data:
First we'll make an instance of it:
cat = Categorify()
df.dtypes
And now let's try transforming a DataFrame:
to = TabularPandas(df, cat, cat_names)
cats = to.procs.categorify
Let's take a look at the categories:
cats['race']
We can see that it added a #na# category. Let's look at the actual column:
to.show(max_n=3)
We can see that, for instance, occupation got a #na# value (as it was missing).
And if we call to.cats we can see our label-encoded variables:
to.cats.head()
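To build some intuition, here is a rough pandas-only sketch of what that label encoding amounts to (this is not fastai's actual implementation; fastai builds its vocab with '#na#' reserved at index 0):
# Hypothetical re-creation of the encoding for one column
vocab = ['#na#'] + sorted(df['race'].dropna().unique().tolist())
o2i = {c: i for i, c in enumerate(vocab)}          # category -> integer code
df['race'].map(o2i).fillna(0).astype(int).head()   # missing values fall back to 0 ('#na#')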
Normalize
To properly work with our numerical columns, we need to put them on a scale our model can understand. This is commonly done through normalization, where we compute a z-score for each value: subtract the column's mean and divide by its standard deviation.
norm = Normalize()
Let's make another to, this time passing in our cont_names:
to = TabularPandas(df, norm, cont_names=cont_names)
norms = to.procs.normalize
Let's take a look: we can grab the means and standard deviations like so:
norms.means
norms.stds
And we can also call to.conts to take a look at our transformed data:
to.conts.head()
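As a quick sanity check, we can recompute the first normalized age by hand with the z-score formula, (x - mean) / std, using the statistics Normalize stored:
# Hand-computed z-score for the first row of 'age'
raw = df['age'].iloc[0]
z = (raw - norms.means['age']) / norms.stds['age']
z, to.conts['age'].iloc[0]  # these should match (up to floating-point precision)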
FillMissing
Now the last thing we need to do is take care of any missing values in our continuous variables (categorical data already gets its special #na# category). We have three strategies we can use:
median
constant
mode
By default it uses median:
fm = FillMissing(fill_strategy=FillStrategy.median)
We'll recreate another TabularPandas:
to = TabularPandas(df, fm, cont_names=cont_names)
Let's look at those missing values in the first few rows:
to.conts.head()
But wait! There's more!
to.cat_names
We have categorical values?! Yes!
to.cats.head()
We now also get an additional boolean column that flags whether each value was missing!
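Had we wanted one of the other strategies instead, it's a one-line change. A minimal sketch using mode (the most frequent value per column):
# Fill missing continuous values with each column's mode instead of its median
fm_mode = FillMissing(fill_strategy=FillStrategy.mode)
to_mode = TabularPandas(df, fm_mode, cont_names=cont_names)
to_mode.conts.head()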
The DataLoaders
Now let's build our full TabularPandas. We're also going to want to split our data and declare our y_names:
splits = RandomSplitter()(range_of(df))
splits
What is range_of?
range_of(df)[:5], len(df)
It's a list of all the indices in our DataFrame.
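By default RandomSplitter holds out 20% of the rows for validation; we can adjust that and pass a seed so the split is reproducible:
# An 80/20 split that is the same on every run
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
len(splits[0]), len(splits[1])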
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
Now that we have everything declared, let's build our TabularPandas:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
And we can build our DataLoaders. We can do this one of two ways:
dls = to.dataloaders()
dls.show_batch()
Or we can create the DataLoaders ourselves (a train and a valid). One great reason to do it this way is that we can pass a different batch size into each TabDataLoader, along with changing options like shuffle and drop_last (at the bottom I'll show why that's super cool).
So how do we use it? Our train and validation data live in to.train and to.valid right now, so we specify those along with our options. When you make a training DataLoader, you want shuffle to be True and drop_last to be True:
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
Now we can make some DataLoaders:
dls = DataLoaders(trn_dl, val_dl)
dls.show_batch()
Why can we call .dataloaders() straight from a TabularPandas? Because it already knows what type of DataLoaders it should build, the very same type we just made by hand:
to._dbunch_type
dls._dbunch_type
TabularLearner
Now we can build our model!
Categorical Variables:
When dealing with our categorical data, we create what is called an embedding matrix. This gives us a higher-dimensional representation of the relationships between the different categorical values. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.
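An embedding matrix is just a learnable lookup table: each category's integer code indexes into a row vector. Here is a minimal PyTorch sketch (the 10 and 6 are illustrative; they mirror what we'll see for workclass below):
import torch
from torch import nn

emb = nn.Embedding(10, 6)           # 10 categories (including '#na#') -> 6-dim vectors
codes = torch.tensor([1, 4, 4, 2])  # label-encoded values for a mini-batch of 4 rows
emb(codes).shape                    # torch.Size([4, 6])
fastai picks a size for each categorical column with get_emb_sz: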
def get_emb_sz(to, sz_dict=None):
"Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
def _one_emb_sz(classes, n, sz_dict=None):
"Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
sz_dict = ifnone(sz_dict, {})
n_cat = len(classes[n])
sz = sz_dict.get(n, int(emb_sz_rule(n_cat))) # rule of thumb
return n_cat,sz
And now if we go look at Jeremy's rule of thumb:
def emb_sz_rule(n_cat):
"Rule of thumb to pick embedding size corresponding to `n_cat`"
return min(600, round(1.6 * n_cat**0.56))
We either choose a maximum size of 600, or 1.6 times the cardinality raised to the 0.56 power, whichever is smaller.
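As a quick check of the arithmetic, take a column with 10 categories (like workclass once #na# is included): 1.6 * 10**0.56 is about 5.8, which rounds to 6, well under the 600 cap:
# Hand-checking the rule for a 10-category column
n_cat = 10
min(600, round(1.6 * n_cat**0.56))  # -> 6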
emb_szs = get_emb_sz(to)
emb_szs
If we want to see what each one aligns to, let's look at the order of cat_names:
to.cat_names
to['workclass'].nunique()
If you notice, we had 10 there while nunique() reports 9. The extra slot accounts for the #na# category, to catch any missing categorical values that may show up.
For the numericals, we simply pass in how many there are:
cont_len = len(to.cont_names)
cont_len
And now we can build our model!
What makes this model a little different is that each batch actually carries two inputs (plus the targets):
batch = dls.one_batch()
len(batch)
batch[0][0], batch[1][0]
The first is our categorical variables and the second is our numericals.
Now let's make our model. We'll want our size of our embeddings, the number of continuous variables, the number of outputs, and how large and how many fully connected layers we want to use:
net = TabularModel(emb_szs, cont_len, 2, [200,100])
net
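If we wanted to train this hand-built net directly, we could wrap it in a plain Learner ourselves. A sketch (CrossEntropyLossFlat comes in with the star import at the top; tabular_learner below handles all of this, including building the model, for us):
# Wrap our hand-built TabularModel in a generic Learner
manual_learn = Learner(dls, net, loss_func=CrossEntropyLossFlat(), metrics=accuracy)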
Now that we know the background, let's do that a bit quicker:
learn = tabular_learner(dls, [200,100], metrics=accuracy)
And now we can fit!
learn.lr_find()
learn.fit(3, 1e-2)
Can we speed this up a little? Yes we can! The more you can load into a batch, the faster you can process the data. This is a careful balance; for tabular data I go to a maximum of 4096 rows per batch:
dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.lr_find()
learn.fit(3, 1e-2)
We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off):
dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.lr_find()
learn.fit_one_cycle(3, 1e-2)
Now let's export our trained Learner so we can use it for inference later, then load it back in:
learn.export('myModel.pkl')
del learn
learn = load_learner('myModel.pkl')
Once we load in our learner, we can create a test dataloader like so:
dl = learn.dls.test_dl(df.iloc[:100])
Let's look at a batch:
dl.show_batch()
You can see it's actually labelled! Is that right? Let's try a version of the data without the salary column:
df2 = df.iloc[:100].drop('salary', axis=1)
df2.head()
dl = learn.dls.test_dl(df2)
dl.show_batch()
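To actually get predictions from this unlabelled DataLoader, we can use get_preds:
# Class probabilities for the first few rows
preds, _ = learn.get_preds(dl=dl)
preds[:5]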
And we can pass either version into our learn this way! But you can't do validate on a test_dl that did not have ground-truth labels; running it on our unlabelled dl will raise an error:
learn.validate(dl=dl)
With a labelled test_dl, though, validate works as expected:
dl = learn.dls.test_dl(df.iloc[:100])
learn.validate(dl=dl)
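And for a single row, learn.predict works straight from the DataFrame:
# Returns the decoded row, the predicted class index, and the probabilities
row, clas, probs = learn.predict(df.iloc[0])
clas, probs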