Binary Classification
In this example we will walk through the fastai tabular API to perform binary classification on the Salary dataset.
This notebook can run alongside the first tabular lesson from Walk with fastai2, shown here.
First we need to import the tabular module:
from fastai.tabular.all import *
And grab our dataset:
path = untar_data(URLs.ADULT_SAMPLE)
If we look at the contents of our folder, we will find our data lives in adult.csv:
path.ls()
We'll go ahead and open it in Pandas and take a look:
df = pd.read_csv(path/'adult.csv')
df.head()
TabularPandas
fastai has a new way of dealing with tabular data by utilizing a TabularPandas object. It expects some dataframe, some procs, cat_names, cont_names, y_names, y_block, and some splits. We'll walk through all of them.
First we need to grab our categorical and continuous variables, along with how we want to process our data.
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
When we pre-process tabular data with fastai, we do one or more of three transforms: Categorify, FillMissing, and Normalize.
Categorify
Categorify will transform the columns that are in your cat_names into categorical columns, along with label encoding our categorical data.
First we'll make an instance of it:
cat = Categorify()
And now let's try transforming a dataframe:
to = TabularPandas(df, cat, cat_names)
We can then extract that transform from to.procs.categorify:
cats = to.procs.categorify
Let's take a look at the categories:
cats['relationship']
We can see that it added a #na# category. Let's look at the actual column:
to.show(max_n=3)
We can see now, for example, that occupation was assigned a #na# value (as it was missing).
If we call to.cats we can see our label-encoded variables:
to.cats.head()
Normalize
Normalize will standardize our continuous variables (subtracting the mean and dividing by the standard deviation). First we'll make an instance of it:
norm = Normalize()
Let's make another to with just our continuous variables:
to = TabularPandas(df, norm, cont_names=cont_names)
And extract the transform to take a closer look:
norms = to.procs.normalize
We can grab the means and standard deviations like so:
norms.means
norms.stds
And we can also call to.conts to take a look at our transformed data:
to.conts.head()
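As a quick sanity check (a sketch, not part of the original walkthrough), we can reproduce the transform by hand from the stored statistics; it should approximately match to.conts:
# Sanity check (sketch): normalized 'age' should be roughly (raw - mean) / std
manual_age = (df['age'].head() - norms.means['age']) / norms.stds['age']
manual_age.values, to.conts['age'].head().values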
FillMissing
FillMissing will fill in any missing values in our continuous variables; here we'll use the median as the fill strategy:
fm = FillMissing(fill_strategy=FillStrategy.median)
We'll recreate another TabularPandas:
to = TabularPandas(df, fm, cont_names=cont_names)
Let's look at those missing values in the first few rows:
to.conts.head()
But wait! There's more!
to.cat_names
We have categorical values?! Yes!
to.cats.head()
We now also have an additional boolean column telling us whether the value was missing or not!
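To see what FillMissing actually had to fill, we can count the missing values per continuous column in the raw DataFrame (a quick check, not from the original notebook):
# Count missing values in the raw data for each continuous column
df[cont_names].isna().sum()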
The DataLoaders
Now let's build our TabularPandas for classification. We're also going to want to split our data and declare our y_names too:
splits = RandomSplitter()(range_of(df))
splits
What is range_of?
range_of(df)[:5], len(df)
It's a list of all the indices in our DataFrame.
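By default RandomSplitter holds out 20% of those indices at random for validation. If you want a reproducible split, you can pass valid_pct and seed explicitly (a small sketch, not part of the original walkthrough):
# Hypothetical variant: a reproducible 80/20 split
seeded_splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
len(seeded_splits[0]), len(seeded_splits[1])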
We'll use all our cat and cont names, the procs, declare a y_name, and finally specify a single-label classification problem with CategoryBlock:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
Now that we have everything declared, let's build our TabularPandas:
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
And now we can build the DataLoaders. We can do this one of two ways: first, by just calling to.dataloaders() on our data:
dls = to.dataloaders()
Or we can create the DataLoaders ourselves (a train and a valid). One great reason to do it this way is that we can pass different batch sizes into each TabDataLoader, along with changing options like shuffle and drop_last.
So how do we use it? Our train and validation data live in to.train and to.valid right now, so we specify those along with our options. When you make a training DataLoader, you want shuffle to be True and drop_last to be True (so we drop the last incomplete batch):
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
Now we can make some DataLoaders:
dls = DataLoaders(trn_dl, val_dl)
And show a batch of data:
dls.show_batch()
Why can we call .dataloaders()? Because TabularPandas itself is actually a set of TabDataLoaders! See below for a comparison test:
to._dbunch_type == dls._dbunch_type
Tabular Learner and Training a Model
Now we can build our Learner! But what's special about a tabular neural network?
Categorical Variables
When dealing with our categorical data, we create what is called an embedding matrix. This allows for a higher-dimensional representation of the relationships between the different categorical values. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.
This "rule of thumb" is to use either a maximum embedding space of 600, or 1.6 times the cardinality raised to the 0.56, or written out as:
$$min(600, (1.6 * {var.nunique)}^{0.56})$$
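As a rough sketch (this follows the rule of thumb above, not necessarily fastai's exact source), we could compute these sizes ourselves:
# A minimal sketch of the rule of thumb above
def emb_sz_rule_sketch(n_cat):
    "Suggested embedding size for a variable with `n_cat` categories (including #na#)"
    return min(600, round(1.6 * n_cat**0.56))

# e.g. a column with 10 categories (9 values plus #na#)
emb_sz_rule_sketch(10)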
Let's calculate these embedding sizes for our model to take a look-see:
emb_szs = get_emb_sz(to); emb_szs
If we want to see what each one aligns to, let's look at the order of our cat_names:
to.cat_names
Let's specifically look at workclass:
to['workclass'].nunique()
If you notice, we had 10 there; this takes one extra category for any missing categorical values that may show up.
We also need the number of continuous variables:
cont_len = len(to.cont_names); cont_len
And now we have all the pieces we need to build a TabularModel!
First, let's grab a batch of data to see what the model will receive:
batch = dls.one_batch(); len(batch)
batch[0][0], batch[1][0]
With the first being our categorical variables and the second being our numerical (continuous) variables.
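As a quick check (a sketch, not from the original notebook), we can also look at the shapes of all three tensors in the batch: categoricals, continuous variables, and targets:
# The batch is (categorical tensor, continuous tensor, targets), one row per sample
cats_x, conts_x, ys = batch
cats_x.shape, conts_x.shape, ys.shape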
Now let's make our model. We'll want our size of our embeddings, the number of continuous variables, the number of outputs, and how large and how many fully connected layers we want to use:
net = TabularModel(emb_szs, cont_len, 2, [200,100])
Let's see its architecture:
net
tabular_learner
Now that we know the background, let's build our model a little bit faster and generate a Learner too:
learn = tabular_learner(dls, [200,100], metrics=accuracy)
And now we can fit!
learn.lr_find()
learn.fit(3, 1e-2)
Can we speed this up a little? Yes we can! The more you can load into a batch, the faster you can process the data. This is a careful balance; for tabular data I go to a maximum of 4096 rows per batch if the dataset is large enough for a decent number of batches:
dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.fit(3, 1e-2)
We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off):
dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.fit(3, 1e-2)
Inference
Let's look at predicting on a single row with learn.predict:
row, cls, probs = learn.predict(df.iloc[0])
row.show()
Now let's try test_dl. There's something special we can do here too:
dl = learn.dls.test_dl(df.iloc[:100])
Let's look at a batch:
dl.show_batch()
We have our labels! It'll grab them if possible by default!
What does that mean? Well, besides simply calling get_preds, we can also run validate to see how a model performs. This is nice, as it can allow for efficient methods when calculating something like permutation importance:
learn.validate(dl=dl)
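For example, here is a rough sketch of single-column permutation importance built on top of learn.validate (the helper name and column choice are hypothetical, not part of fastai):
import numpy as np

def permutation_importance_sketch(learn, df, col, metric_idx=1):
    "Drop in the metric (accuracy here) after shuffling `col` in a copy of `df`"
    base = learn.validate(dl=learn.dls.test_dl(df))[metric_idx]
    shuffled = df.copy()
    shuffled[col] = np.random.permutation(shuffled[col].values)
    permuted = learn.validate(dl=learn.dls.test_dl(shuffled))[metric_idx]
    return base - permuted

# e.g. how much does accuracy drop if we scramble 'education'?
permutation_importance_sketch(learn, df.iloc[:100], 'education')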
We'll also show an example of get_preds:
preds = learn.get_preds(dl=dl)
preds[0][0]
What would happen if I accidentally passed in an unlabelled dataset to learn.validate though? Let's find out:
df2 = df.iloc[:100].drop('salary', axis=1)
df2.head()
dl = learn.dls.test_dl(df2)
learn.validate(dl=dl)
We can see it will simply return None!
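If you do want predictions for unlabelled data, get_preds still works on that same test_dl; the second element of the result (the targets) simply comes back empty (a quick sketch):
# get_preds on the unlabelled test_dl: we still get probabilities, just no targets
preds, targs = learn.get_preds(dl=dl)
preds[:3], targs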