We'll call in the tabular module:
from fastai.tabular.all import *
Below you will find exact imports for everything used today:
from fastcore.basics import range_of, ifnone
from fastai.callback.progress import ProgressCallback
from fastai.callback.schedule import lr_find
from fastai.data.block import CategoryBlock
from fastai.data.core import DataLoaders
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import RandomSplitter
from fastai.learner import load_learner, Learner
from fastai.metrics import accuracy
from fastai.tabular.core import Categorify, FillMissing, FillStrategy, Normalize, TabularPandas, TabDataLoader
from fastai.tabular.model import TabularModel
from fastai.tabular.learner import tabular_learner
import pandas as pd
And let's grab some data!
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
The data we want lives in adult.csv
df = pd.read_csv(path/'adult.csv')
Let's take a look at it:
df.head()
TabularPandas
fastai has a new way of dealing with tabular data in a TabularPandas object. It expects some dataframe, some procs, cat_names, cont_names, y_names, y_block, and some splits. We'll walk through all of them.
First we need to grab our categorical and continuous variables, along with how we want to process our data.
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
When we pre-process tabular data, fastai applies one or more of three transforms:
Categorify
Categorify will transform columns that are in your cat_names into pandas Categorical columns, along with label encoding our categorical data:
First we'll make an instance of it:
cat = Categorify()
df.dtypes
And now let's try transforming a DataFrame:
to = TabularPandas(df, cat, cat_names)
cats = to.procs.categorify
Let's take a look at the categories:
cats['race']
We can see that it added a #na# category. Let's look at the actual column:
to.show(max_n=3)
We can see that, for instance, occupation got a #na# value (as it was missing).
And if we call to.cats we can see our label-encoded variables:
to.cats.head()
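To build some intuition, here is a rough pandas-only sketch of what that label encoding amounts to (this is not fastai's actual implementation; fastai builds its vocab with '#na#' reserved at index 0):
# Hypothetical re-creation of the encoding for one column
vocab = ['#na#'] + sorted(df['race'].dropna().unique().tolist())
o2i = {c: i for i, c in enumerate(vocab)}          # category -> integer code
df['race'].map(o2i).fillna(0).astype(int).head()   # missing values fall back to 0 ('#na#')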
Normalize
To properly work with our numerical columns, we need to put them on a scale our model can understand. This is commonly done through normalization, where we compute a z-score for each value: subtract the column's mean and divide by its standard deviation.
norm = Normalize()
Let's make another to, this time passing in our cont_names:
to = TabularPandas(df, norm, cont_names=cont_names)
norms = to.procs.normalize
Let's take a look: we can grab the means and standard deviations like so:
norms.means
norms.stds
And we can also call to.conts to take a look at our transformed data:
to.conts.head()
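As a quick sanity check, we can recompute the first normalized age by hand with the z-score formula, (x - mean) / std, using the statistics Normalize stored:
# Hand-computed z-score for the first row of 'age'
raw = df['age'].iloc[0]
z = (raw - norms.means['age']) / norms.stds['age']
z, to.conts['age'].iloc[0]  # these should match (up to floating-point precision)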
FillMissing
Now the last thing we need to do is take care of any missing values in our continuous variables (categorical data already gets its special #na# category). We have three strategies we can use:
median
constant
mode
By default it uses median:
fm = FillMissing(fill_strategy=FillStrategy.median)
We'll recreate another TabularPandas:
to = TabularPandas(df, fm, cont_names=cont_names)
Let's look at those missing values in the first few rows:
to.conts.head()
But wait! There's more!
to.cat_names
We have categorical values?! Yes!
to.cats.head()
We now also get an additional boolean column that flags whether each value was missing!
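Had we wanted one of the other strategies instead, it's a one-line change. A minimal sketch using mode (the most frequent value per column):
# Fill missing continuous values with each column's mode instead of its median
fm_mode = FillMissing(fill_strategy=FillStrategy.mode)
to_mode = TabularPandas(df, fm_mode, cont_names=cont_names)
to_mode.conts.head()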
The DataLoaders
Now let's build our full TabularPandas. We're also going to want to split our data and declare our y_names:
splits = RandomSplitter()(range_of(df))
splits
What is range_of?
range_of(df)[:5], len(df)
It's a list of all the indices in our DataFrame.
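By default RandomSplitter holds out 20% of the rows for validation; we can adjust that and pass a seed so the split is reproducible:
# An 80/20 split that is the same on every run
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
len(splits[0]), len(splits[1])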
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
Now that we have everything declared, let's build our TabularPandas:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
And we can build our DataLoaders. We can do this one of two ways:
dls = to.dataloaders()
dls.show_batch()
Or we can create the DataLoaders ourselves (a train and a valid). One great reason to do it this way is that we can pass a different batch size into each TabDataLoader, along with changing options like shuffle and drop_last (at the bottom I'll show why that's super cool).
So how do we use it? Our train and validation data live in to.train and to.valid right now, so we specify those along with our options. When you make a training DataLoader, you want shuffle to be True and drop_last to be True:
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
Now we can make some DataLoaders:
dls = DataLoaders(trn_dl, val_dl)
dls.show_batch()
Why can we call .dataloaders() straight from a TabularPandas? Because it already knows what type of DataLoaders it should build, the very same type we just made by hand:
to._dbunch_type
dls._dbunch_type
TabularLearner
Now we can build our model!
Categorical Variables:
When dealing with our categorical data, we create what is called an embedding matrix. This gives us a higher-dimensional representation of the relationships between the different categorical values. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.
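An embedding matrix is just a learnable lookup table: each category's integer code indexes into a row vector. Here is a minimal PyTorch sketch (the 10 and 6 are illustrative; they mirror what we'll see for workclass below):
import torch
from torch import nn

emb = nn.Embedding(10, 6)           # 10 categories (including '#na#') -> 6-dim vectors
codes = torch.tensor([1, 4, 4, 2])  # label-encoded values for a mini-batch of 4 rows
emb(codes).shape                    # torch.Size([4, 6])
fastai picks a size for each categorical column with get_emb_sz: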
def get_emb_sz(to, sz_dict=None):
"Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
def _one_emb_sz(classes, n, sz_dict=None):
"Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
sz_dict = ifnone(sz_dict, {})
n_cat = len(classes[n])
sz = sz_dict.get(n, int(emb_sz_rule(n_cat))) # rule of thumb
return n_cat,sz
And now if we go look at Jeremy's rule of thumb:
def emb_sz_rule(n_cat):
"Rule of thumb to pick embedding size corresponding to `n_cat`"
return min(600, round(1.6 * n_cat**0.56))
We either choose a maximum size of 600, or 1.6 times the cardinality raised to the 0.56 power, whichever is smaller.
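As a quick check of the arithmetic, take a column with 10 categories (like workclass once #na# is included): 1.6 * 10**0.56 is about 5.8, which rounds to 6, well under the 600 cap:
# Hand-checking the rule for a 10-category column
n_cat = 10
min(600, round(1.6 * n_cat**0.56))  # -> 6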
emb_szs = get_emb_sz(to)
emb_szs
If we want to see what each one aligns to, let's look at the order of cat_names:
to.cat_names
to['workclass'].nunique()
If you notice, we had 10 there while nunique() reports 9. The extra slot accounts for the #na# category, to catch any missing categorical values that may show up.
For the numericals, we simply pass in how many there are:
cont_len = len(to.cont_names)
cont_len
And now we can build our model!
What makes this model a little different is that each batch actually carries two inputs (plus the targets):
batch = dls.one_batch()
len(batch)
batch[0][0], batch[1][0]
The first is our categorical variables and the second is our numericals.
Now let's make our model. We'll want our size of our embeddings, the number of continuous variables, the number of outputs, and how large and how many fully connected layers we want to use:
net = TabularModel(emb_szs, cont_len, 2, [200,100])
net
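If we wanted to train this hand-built net directly, we could wrap it in a plain Learner ourselves. A sketch (CrossEntropyLossFlat comes in with the star import at the top; tabular_learner below handles all of this, including building the model, for us):
# Wrap our hand-built TabularModel in a generic Learner
manual_learn = Learner(dls, net, loss_func=CrossEntropyLossFlat(), metrics=accuracy)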
Now that we know the background, let's do that a bit quicker:
learn = tabular_learner(dls, [200,100], metrics=accuracy)
And now we can fit!
learn.lr_find()
learn.fit(3, 1e-2)
Can we speed this up a little? Yes we can! The more you can load into a batch, the faster you can process the data. This is a careful balance; for tabular data I go to a maximum of 4096 rows per batch:
dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.lr_find()
learn.fit(3, 1e-2)
We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off):
dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.lr_find()
learn.fit_one_cycle(3, 1e-2)
Now let's export our trained Learner so we can use it for inference later, then load it back in:
learn.export('myModel.pkl')
del learn
learn = load_learner('myModel.pkl')
Once we load in our learner, we can create a test dataloader like so:
dl = learn.dls.test_dl(df.iloc[:100])
Let's look at a batch:
dl.show_batch()
You can see it's actually labelled! Is that right? Let's try a version of the data without the salary column:
df2 = df.iloc[:100].drop('salary', axis=1)
df2.head()
dl = learn.dls.test_dl(df2)
dl.show_batch()
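To actually get predictions from this unlabelled DataLoader, we can use get_preds:
# Class probabilities for the first few rows
preds, _ = learn.get_preds(dl=dl)
preds[:5]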
And we can pass either version into our learn this way! But you can't do validate on a test_dl that did not have ground-truth labels; running it on our unlabelled dl will raise an error:
learn.validate(dl=dl)
With a labelled test_dl, though, validate works as expected:
dl = learn.dls.test_dl(df.iloc[:100])
learn.validate(dl=dl)
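And for a single row, learn.predict works straight from the DataFrame:
# Returns the decoded row, the predicted class index, and the probabilities
row, clas, probs = learn.predict(df.iloc[0])
clas, probs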