This article is also a Jupyter Notebook available to be run from the top down. There will be code snippets that you can then run in any environment.

Below are the versions of fastai, fastcore, and wwf in use at the time of writing:

  • fastai: 2.1.10
  • fastcore: 1.3.13
  • wwf: 0.0.8

We'll call in the tabular module:

from fastai.tabular.all import *

Below you will find exact imports for everything used today:

from fastcore.basics import range_of, ifnone

from fastai.callback.progress import ProgressCallback
from fastai.callback.schedule import lr_find

from fastai.data.block import CategoryBlock
from fastai.data.core import DataLoaders
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import RandomSplitter

from fastai.learner import load_learner, Learner
from fastai.metrics import accuracy

from fastai.tabular.core import Categorify, FillMissing, FillStrategy, Normalize, TabularPandas, TabDataLoader
from fastai.tabular.model import TabularModel
from fastai.tabular.learner import tabular_learner

import pandas as pd

And let's grab some data!

path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/root/.fastai/data/adult_sample/models'),Path('/root/.fastai/data/adult_sample/adult.csv'),Path('/root/.fastai/data/adult_sample/export.pkl')]

The data we want lives in adult.csv

df = pd.read_csv(path/'adult.csv')

Let's take a look at it:

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

TabularPandas

fastai has a new way of dealing with tabular data: the TabularPandas object. It expects a DataFrame, some procs, cat_names, cont_names, y_names, a y_block, and some splits. We'll walk through all of them.

First we need to grab our categorical and continuous variables, along with how we want to process our data.

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']

When we pre-process tabular data, fastai applies one or more of three transforms:

Categorify converts the columns listed in your cat_names to a categorical type and label-encodes them:

First we'll make an instance of it:

cat = Categorify()
df.dtypes
age                 int64
workclass          object
fnlwgt              int64
education          object
education-num     float64
marital-status     object
occupation         object
relationship       object
race               object
sex                object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
salary             object
dtype: object

And now let's try transforming a dataframe

to = TabularPandas(df, cat, cat_names)
cats = to.procs.categorify

Let's take a look at the categories:

cats['race']
['#na#', ' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other', ' White']

We can see that it added a #na# category. Let's look at the actual column:

to.show(max_n=3)
   workclass  education   marital-status      occupation       relationship   race
0  Private    Assoc-acdm  Married-civ-spouse  #na#             Wife           White
1  Private    Masters     Divorced            Exec-managerial  Not-in-family  White
2  Private    HS-grad     Divorced            #na#             Unmarried      Black

We can see that, for instance, occupation shows a #na# value wherever the original entry was missing.

And if we call to.cats we can see our label-encoded variables (one integer code per category, not a one-hot vector):

to.cats.head()
   workclass  education  marital-status  occupation  relationship  race
0          5          8               3           0             6     5
1          5         13               1           5             2     5
2          5         12               1           0             5     3
3          6         15               3          11             1     2
4          7          6               3           9             6     3
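
To make the idea of label encoding concrete, here is a minimal pure-Python sketch. It is not fastai's implementation: `label_encode` is a hypothetical helper, and the vocabulary is built from five sample values only, so the integer codes will not match the ones in the table above (which come from the full dataset).

```python
import math

def label_encode(values):
    """Return (vocab, codes); missing values (None or NaN) map to code 0."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Index 0 is reserved for the '#na#' category, as in the output above.
    vocab = ['#na#'] + sorted({v for v in values if not is_missing(v)})
    index = {v: i for i, v in enumerate(vocab)}
    codes = [0 if is_missing(v) else index[v] for v in values]
    return vocab, codes

# The occupation column from the first five rows of the DataFrame.
occupation = [None, 'Exec-managerial', None, 'Prof-specialty', 'Other-service']
vocab, codes = label_encode(occupation)
print(vocab)
print(codes)
```

Each category becomes a single integer, and that integer is later used as a row index into an embedding matrix.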

Normalize

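To work properly with our numerical columns, we need to put them on a comparable scale that our model can handle. This is commonly done through normalization: for each value we compute a z-score by subtracting the column's mean and dividing by its standard deviation, so every column ends up with a mean of 0 and a standard deviation of 1 (the values are not bounded between -1 and 1).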

norm = Normalize()

Let's make another TabularPandas:

cont_names
['age', 'fnlwgt', 'education-num']
to = TabularPandas(df, norm, cont_names=cont_names)
norms = to.procs.normalize

Let's take a look. We can grab the means and standard deviations like so:

norms.means
{'age': 38.58164675532078,
 'education-num': 10.079815864562988,
 'fnlwgt': 189778.36651208502}
norms.stds
{'age': 13.640223192304274,
 'education-num': 2.573042731149292,
 'fnlwgt': 105548.3568809906}

And we can also call to.conts to take a look at our transformed data:

to.conts.head()
        age    fnlwgt  education-num
0  0.763796 -0.838084       0.746270
1  0.397233  0.444987       1.523560
2 -0.042642 -0.886734            NaN
3 -0.042642 -0.728873       1.912205
4  0.250608 -1.018314            NaN
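
As a sanity check, the z-score can be reproduced by hand. The sketch below plugs the means and standard deviations reported by norms.means and norms.stds into the formula for the first row of the DataFrame (age 49, fnlwgt 101320). `z_score` is a hypothetical helper written for this illustration, not a fastai function.

```python
# Statistics copied from the norms.means and norms.stds output above.
means = {'age': 38.58164675532078, 'fnlwgt': 189778.36651208502}
stds = {'age': 13.640223192304274, 'fnlwgt': 105548.3568809906}

def z_score(value, mean, std):
    """Standardize a value: subtract the mean, divide by the standard deviation."""
    return (value - mean) / std

# Raw values from row 0 of df.head().
row0 = {'age': 49, 'fnlwgt': 101320}
normalized = {name: z_score(value, means[name], stds[name])
              for name, value in row0.items()}
print({name: round(value, 6) for name, value in normalized.items()})
```

The results should agree with row 0 of to.conts.head() to about six decimal places.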

FillMissing

Now the last thing we need to do is take care of any missing values in our continuous variables (we have a special #na# for categorical data). We have three strategies we can use:

  • median
  • constant
  • mode

By default it uses median:

fm = FillMissing(fill_strategy=FillStrategy.median)

We'll recreate another TabularPandas:

to = TabularPandas(df, fm, cont_names=cont_names)

Let's look at those missing values in the first few rows:

to.conts.head()
   age  fnlwgt  education-num
0   49  101320           12.0
1   44  236746           14.0
2   38   96185           10.0
3   38  112847           15.0
4   42   82297           10.0

But wait! There's more!

to.cat_names
(#1) ['education-num_na']

We have categorical values?! Yes!

to.cats.head()
   education-num_na
0             False
1             False
2              True
3             False
4              True

We now also have an additional boolean column that records whether the original value was missing!
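
The two effects of the median strategy (filling the gap and flagging it) can be sketched in a few lines of plain Python. This is an illustration only, not fastai's code: `fill_missing_median` is a hypothetical helper, and the median here is computed from five sample values, so the fill value is 14.0 rather than the 10.0 that fastai derived from the whole column.

```python
from statistics import median

def fill_missing_median(values):
    """Replace None with the median of the present values; also return a flag per row."""
    present = [v for v in values if v is not None]
    fill_value = median(present)
    was_missing = [v is None for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, was_missing

# education-num for the first five rows of the DataFrame (None marks NaN).
education_num = [12.0, 14.0, None, 15.0, None]
filled, education_num_na = fill_missing_median(education_num)
print(filled)
print(education_num_na)
```

Keeping the flag lets the model distinguish a genuine value from a filled-in one.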

The DataLoaders

Now let's build our full TabularPandas. We'll also want to split our data and declare our y_names:

splits = RandomSplitter()(range_of(df))
splits
((#26049) [15171,25410,17757,17849,30950,11593,6452,12594,5929,18804...],
 (#6512) [32068,18868,23709,12945,29194,3945,17027,15055,12713,8583...])

What is range_of?

range_of(df)[:5], len(df)
([0, 1, 2, 3, 4], 32561)

It's a list of all the indices in our DataFrame.
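
A random split over those indices can be sketched as follows. This is a minimal stand-in written with the standard library, assuming a 20% validation share (which is consistent with the 26049 / 6512 sizes shown above); `random_split` and its `seed` parameter are hypothetical names, not fastai's API.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and cut off the first valid_pct as validation."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    # Training indices first, validation indices second.
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(32561, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))
```

The two lists are disjoint and together cover every row exactly once.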

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()

Now that we have everything declared, let's build our TabularPandas

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, y_block=y_block, splits=splits)

And we can build our DataLoaders. We can do this one of two ways:

Straight

dls = to.dataloaders()
dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private HS-grad Never-married Sales Unmarried Black False 29.000000 138190.000773 9.0 <50k
1 Private 10th Never-married Other-service Own-child White False 17.000000 166241.999332 6.0 <50k
2 Local-gov Some-college Married-civ-spouse Protective-serv Husband Black False 44.000000 212665.000517 10.0 <50k
3 Private HS-grad Married-civ-spouse Craft-repair Husband White False 37.000000 170020.000727 9.0 <50k
4 Private Assoc-acdm Married-civ-spouse Exec-managerial Husband White False 52.999999 30244.006121 12.0 >=50k
5 Self-emp-inc Some-college Married-civ-spouse Sales Husband White False 46.000000 204927.999416 10.0 <50k
6 Private Some-college Never-married Sales Not-in-family White False 26.000000 126829.001901 10.0 <50k
7 Private Bachelors Divorced Sales Not-in-family White False 41.000000 53956.000088 13.0 <50k
8 Self-emp-not-inc HS-grad Widowed Exec-managerial Not-in-family White False 82.999999 213866.000108 9.0 <50k
9 Federal-gov Bachelors Separated Prof-specialty Not-in-family White False 31.000000 381645.000307 13.0 <50k

With Two DataLoaders

We can also create our DataLoaders by hand (a train and a valid). One great reason to do it this way is that we can pass a different batch size into each TabDataLoader, along with changing options like shuffle and drop_last (at the bottom I'll show why that's super cool).

So how do we use it? Our train and validation data live in to.train and to.valid right now, so we specify that along with our options. When you make a training DataLoader, you want shuffle to be True and drop_last to be True

trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
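
To see what shuffle and drop_last do to the batches, here is a small standard-library sketch. `iter_batches` is a hypothetical generator, not part of fastai; it only yields lists of row indices, using the training and validation set sizes from the split above.

```python
import random

def iter_batches(n_items, bs, shuffle=False, drop_last=False, seed=None):
    """Yield lists of row indices, bs at a time."""
    idxs = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(idxs)
    for start in range(0, n_items, bs):
        batch = idxs[start:start + bs]
        if drop_last and len(batch) < bs:
            break  # skip the incomplete final batch
        yield batch

train_batches = list(iter_batches(26049, bs=64, shuffle=True, drop_last=True, seed=0))
valid_batches = list(iter_batches(6512, bs=128))
print(len(train_batches), len(valid_batches), len(valid_batches[-1]))
```

With 26049 training rows and a batch size of 64, the leftover batch would hold a single row, so dropping it keeps every training batch the same size. The validation loader keeps its shorter final batch so that no row is left out of the evaluation.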

Now we can make some DataLoaders:

dls = DataLoaders(trn_dl, val_dl)
dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Some-college Never-married Handlers-cleaners Own-child White False 20.000001 258517.000277 10.0 <50k
1 Self-emp-inc HS-grad Married-civ-spouse Sales Wife White False 45.000000 177542.999833 9.0 <50k
2 Private 10th Never-married Sales Own-child White False 17.000000 163493.999860 6.0 <50k
3 Private Some-college Married-civ-spouse Adm-clerical Husband Black False 33.000000 549413.003715 10.0 >=50k
4 Self-emp-not-inc Assoc-acdm Married-civ-spouse Craft-repair Husband White False 65.999999 174994.999890 12.0 <50k
5 Local-gov HS-grad Married-civ-spouse Exec-managerial Husband White False 52.999999 204397.000560 9.0 <50k
6 Private 11th Never-married Handlers-cleaners Own-child White False 17.000000 169036.999472 7.0 <50k
7 Self-emp-not-inc Masters Married-civ-spouse Exec-managerial Husband White False 42.000000 178780.000169 14.0 >=50k
8 Private Some-college Never-married Sales Own-child White False 23.000000 163665.000659 10.0 <50k
9 Private HS-grad Married-civ-spouse Adm-clerical Husband White False 34.000000 110476.001991 9.0 <50k

Why can we call .dataloaders() directly? Because TabularPandas keeps a reference to the DataLoaders class it should build:

to._dbunch_type
fastai.data.core.DataLoaders
dls._dbunch_type
fastai.data.core.DataLoaders

TabularLearner

Now we can build our model!

Categorical Variables:

When dealing with our categorical data, we create what is called an embedding matrix. This allows for a higher dimensionality for relationships between the different categorical cardinalities. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.

def get_emb_sz(to, sz_dict=None):
    "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
    return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
def _one_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat,sz

And now if we go look at his rule of thumb:

def emb_sz_rule(n_cat):
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))

We take whichever is smaller: a maximum size of 600, or 1.6 times the cardinality raised to the power of 0.56 (rounded to the nearest integer).
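
A quick way to get a feel for the rule is to evaluate it for a few cardinalities. The function body below is copied from the definition above so the snippet runs on its own; the one-million cardinality is an invented value, included only to show the cap of 600 taking effect.

```python
def emb_sz_rule(n_cat):
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))

# Small cardinalities grow slowly; a very large one hits the cap.
sizes = {n_cat: emb_sz_rule(n_cat) for n_cat in (3, 10, 17, 1_000_000)}
for n_cat, size in sizes.items():
    print(n_cat, size)
```

Because the exponent is below 1, the embedding width grows much more slowly than the number of categories.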

emb_szs = get_emb_sz(to)
emb_szs
[(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (3, 3)]

If we want to see what each one aligns to, let's look at the order of cat_names

to.cat_names
(#7) ['workclass','education','marital-status','occupation','relationship','race','education-num_na']
to['workclass'].nunique()
9

Notice that the embedding for workclass has 10 rows while the column only holds 9 unique values. The extra row is reserved for the #na# category, in case a missing categorical value shows up.

Numerical Variables

For the numerical variables, we simply pass in how many there are:

cont_len = len(to.cont_names)
cont_len
3

And now we can build our model!

What makes this model a little different is that each batch actually carries two inputs (followed by the targets, hence a length of 3):

batch = dls.one_batch()
len(batch)
3
batch[0][0], batch[1][0]
(tensor([ 5, 10,  3, 15,  1,  5,  1]), tensor([ 0.0352, -1.1254,  1.1420]))

With the first being our categorical variables and the second being our numericals.

Now let's make our model. We'll need the sizes of our embeddings, the number of continuous variables, the number of outputs, and the sizes of the fully connected layers we want to use (one entry per layer):

net = TabularModel(emb_szs, cont_len, 2, [200,100])
net
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(10, 6)
    (1): Embedding(17, 8)
    (2): Embedding(8, 5)
    (3): Embedding(16, 8)
    (4): Embedding(7, 5)
    (5): Embedding(6, 4)
    (6): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(42, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=42, out_features=200, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=False)
      (2): ReLU(inplace=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=2, bias=True)
    )
  )
)
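
The in_features=42 of the first linear layer follows directly from the inputs: the widths of all the embeddings are concatenated with the continuous variables. A short check, using the emb_szs and cont_len values printed earlier (`n_emb` and `in_features` are names chosen for this illustration):

```python
# (cardinality, embedding width) pairs as printed by get_emb_sz(to).
emb_szs = [(10, 6), (17, 8), (8, 5), (16, 8), (7, 5), (6, 4), (3, 3)]
cont_len = 3  # age, fnlwgt, education-num

# Each categorical column contributes its embedding width to the input vector.
n_emb = sum(width for _, width in emb_szs)
in_features = n_emb + cont_len
print(n_emb, cont_len, in_features)
```

So 39 embedding activations plus 3 continuous values give the 42 inputs seen in the printed model.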

Now that we know the background, let's do that a bit quicker:

learn = tabular_learner(dls, [200,100], metrics=accuracy)

And now we can fit!

learn.lr_find()
(0.012022644281387329, 0.00363078061491251)
learn.fit(3, 1e-2)
epoch train_loss valid_loss accuracy time
0 0.370782 0.411382 0.774416 00:06
1 0.357029 0.375285 0.839988 00:06
2 0.354338 0.359791 0.841216 00:06

Can we speed this up a little? Yes we can! The more you can load into a batch, the faster you can process the data. This is a careful balance, though. For tabular data I go to a maximum of 4096 rows per batch:

dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.lr_find()
(0.025118863582611083, 0.002511886414140463)
learn.fit(3, 1e-2)
epoch train_loss valid_loss accuracy time
0 0.392639 0.439587 0.764588 00:00
1 0.370950 0.361114 0.834920 00:00
2 0.360302 0.355392 0.836763 00:00

We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off). Let's push the batch size up to 4096:

dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.lr_find()
(0.010000000149011612, 0.001737800776027143)
learn.fit_one_cycle(3, 1e-2)
epoch train_loss valid_loss accuracy time
0 0.524572 0.557502 0.757371 00:00
1 0.443588 0.479469 0.774877 00:00
2 0.411876 0.455629 0.766892 00:00
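
One likely reason for the trade-off is simply the number of weight updates: a larger batch means fewer optimizer steps per epoch. The sketch below counts them for the three batch sizes used above, assuming the default batch size of the first run was 64 and that the incomplete final training batch is dropped; both are assumptions, not something the outputs above confirm.

```python
n_train = 26049   # size of the training split shown earlier
n_epochs = 3

# Full batches per epoch, and total updates over the whole run.
steps = {bs: n_train // bs for bs in (64, 1024, 4096)}
for bs, per_epoch in steps.items():
    print(bs, per_epoch, per_epoch * n_epochs)
```

With only a handful of updates per epoch at 4096 rows per batch, three epochs give the model far fewer chances to adjust its weights, so more epochs or a different learning rate may be needed to match the smaller-batch result.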

Labelled Test Sets

Now let's look at inference. We'll export our learner and load it back in as a new one:

learn.export('myModel.pkl')
del learn
learn = load_learner('myModel.pkl')

Once we load in our learner, we can create a test dataloader like so:

dl = learn.dls.test_dl(df.iloc[:100])

Let's look at a batch

dl.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.000000 101320.001784 12.0 >=50k
1 Private Masters Divorced Exec-managerial Not-in-family White False 44.000000 236746.000055 14.0 >=50k
2 Private HS-grad Divorced #na# Unmarried Black True 38.000000 96184.997114 10.0 <50k
3 Self-emp-inc Prof-school Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander False 38.000000 112847.000794 15.0 >=50k
4 Self-emp-not-inc 7th-8th Married-civ-spouse Other-service Wife Black True 42.000000 82296.997560 10.0 <50k
5 Private HS-grad Never-married Handlers-cleaners Own-child White False 19.999999 63209.999480 9.0 <50k
6 Private Some-college Divorced #na# Other-relative White False 49.000000 44434.000166 10.0 <50k
7 Private 11th Married-civ-spouse #na# Husband White False 37.000000 138939.999411 7.0 <50k
8 Private HS-grad Married-civ-spouse Craft-repair Husband White False 46.000000 328215.996244 9.0 >=50k
9 Self-emp-inc HS-grad Married-civ-spouse #na# Husband White True 36.000000 216710.999023 10.0 >=50k

You can see it's actually labelled! What happens if the labels are missing? Let's drop the salary column and check:

df2 = df.iloc[:100].drop('salary', axis=1)
df2.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States
dl = learn.dls.test_dl(df2)
dl.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.000000 101320.001784 12.0
1 Private Masters Divorced Exec-managerial Not-in-family White False 44.000000 236746.000055 14.0
2 Private HS-grad Divorced #na# Unmarried Black True 38.000000 96184.997114 10.0
3 Self-emp-inc Prof-school Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander False 38.000000 112847.000794 15.0
4 Self-emp-not-inc 7th-8th Married-civ-spouse Other-service Wife Black True 42.000000 82296.997560 10.0
5 Private HS-grad Never-married Handlers-cleaners Own-child White False 19.999999 63209.999480 9.0
6 Private Some-college Divorced #na# Other-relative White False 49.000000 44434.000166 10.0
7 Private 11th Married-civ-spouse #na# Husband White False 37.000000 138939.999411 7.0
8 Private HS-grad Married-civ-spouse Craft-repair Husband White False 46.000000 328215.996244 9.0
9 Self-emp-inc HS-grad Married-civ-spouse #na# Husband White True 36.000000 216710.999023 10.0

And now we can pass either into our Learner! (validate has nothing to compare against on a test_dl without ground-truth labels, so the first call below returns None for both the loss and the metric.)

learn.validate(dl=dl)
(#2) [None,None]
dl = learn.dls.test_dl(df.iloc[:100])
learn.validate(dl=dl)
(#2) [0.39352598786354065,0.7900000214576721]
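
The second number returned by validate is the accuracy metric, i.e. the fraction of rows whose predicted class matches the label; on 100 rows, 0.79 corresponds to 79 correct predictions. A minimal sketch of that computation, with a hypothetical `accuracy` helper and made-up toy scores (not model output):

```python
def accuracy(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

# Toy scores for 4 rows and 2 classes, invented for illustration.
toy_scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_targets = [0, 1, 1, 1]
toy_accuracy = accuracy(toy_scores, toy_targets)
print(toy_accuracy)
```

Without targets there is nothing to compare the predictions against, which is why a metric like this cannot be computed for an unlabelled test set.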