We'll call in the `tabular` module:

```
from fastai.tabular.all import *
```

Below you will find exact imports for everything used today:

```
from fastcore.basics import range_of, ifnone
from fastai.callback.progress import ProgressCallback
from fastai.callback.schedule import lr_find
from fastai.data.block import CategoryBlock
from fastai.data.core import DataLoaders
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import RandomSplitter
from fastai.learner import load_learner, Learner
from fastai.metrics import accuracy
from fastai.tabular.core import Categorify, FillMissing, FillStrategy, Normalize, TabularPandas, TabDataLoader
from fastai.tabular.model import TabularModel
from fastai.tabular.learner import tabular_learner
import pandas as pd
```

And let's grab some data!

```
path = untar_data(URLs.ADULT_SAMPLE)
```

```
path.ls()
```

The data we want lives in `adult.csv`

```
df = pd.read_csv(path/'adult.csv')
```

Let's take a look at it:

```
df.head()
```

## `TabularPandas`

`fastai` has a new way of dealing with tabular data: the `TabularPandas` object. It expects a dataframe, some `procs`, `cat_names`, `cont_names`, `y_names`, a `y_block`, and some splits. We'll walk through all of them.

First we need to grab our categorical and continuous variables, along with how we want to process our data.

```
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
```

When we pre-process tabular data, `fastai` applies one or more of three transforms: `Categorify`, `FillMissing`, and `Normalize`.

### `Categorify`

`Categorify` will transform the columns in your `cat_names` into a categorical type, along with label encoding our categorical data:

First we'll make an instance of it:

```
cat = Categorify()
```

```
df.dtypes
```

And now let's try transforming a dataframe:

```
to = TabularPandas(df, cat, cat_names)
```

```
cats = to.procs.categorify
```

Let's take a look at the categories:

```
cats['race']
```

We can see that it added a `#na#` category. Let's look at the actual column:

```
to.show(max_n=3)
```

We can see that, for instance, `occupation` was given the `#na#` value (as it was missing).

And if we call `to.cats` we can see our label-encoded variables:

```
to.cats.head()
```
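Conceptually, the label encoding above just maps each category to an integer index, with `#na#` reserved at index 0 for missing or unseen values. A minimal pure-Python sketch of the idea (not fastai's actual implementation):

```
# Hypothetical sketch of label encoding with a reserved #na# slot
def make_vocab(values):
    # index 0 is reserved for missing values
    return ['#na#'] + sorted({v for v in values if v is not None})

def encode(values, vocab):
    idx = {c: i for i, c in enumerate(vocab)}
    # unseen or missing values fall back to the #na# index (0)
    return [idx.get(v, 0) for v in values]

races = ['White', 'Black', None, 'White']
vocab = make_vocab(races)
print(vocab)                  # ['#na#', 'Black', 'White']
print(encode(races, vocab))   # [2, 1, 0, 2]
```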

### `Normalize`

To properly work with our numerical columns, we need to put them all on a common scale that our model can understand. This is commonly done through normalization, where we compute a `z-score` so each column is scaled to a mean of 0 and a standard deviation of 1.
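Concretely, the `z-score` subtracts the column mean and divides by the standard deviation. A quick pure-Python sketch of the computation (fastai additionally stores the training-set statistics so the same transform can be applied later):

```
import statistics

def normalize(xs):
    # z-score: subtract the mean, divide by the standard deviation
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

ages = [25, 38, 28, 44, 18]
print(normalize(ages))
```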

```
norm = Normalize()
```

Let's make another `to`, this time passing in our continuous names:

```
cont_names
```

```
to = TabularPandas(df, norm, cont_names=cont_names)
```

```
norms = to.procs.normalize
```

We can grab the means and standard deviations like so:

```
norms.means
```

```
norms.stds
```

And we can also call `to.conts` to take a look at our transformed data:

```
to.conts.head()
```

### `FillMissing`

Now the last thing we need to do is take care of any missing values in our **continuous** variables (categorical data already gets the special `#na#` category). We have three strategies we can use: `median`, `constant`, and `mode`. By default it uses `median`:

```
fm = FillMissing(fill_strategy=FillStrategy.median)
```

We'll recreate another `TabularPandas`:

```
to = TabularPandas(df, fm, cont_names=cont_names)
```

Let's look at those missing values in the first few rows:

```
to.conts.head()
```

**But wait!** There's more!

```
to.cat_names
```

We have categorical values?! Yes!

```
to.cats.head()
```

We now have an additional boolean column telling us whether each value was missing!
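The fill-plus-flag behavior is easy to picture in plain Python: replace each missing value with the column median and record where the holes were (a sketch of the idea, not fastai's implementation):

```
import statistics

def fill_median(xs):
    # the median is computed from the non-missing values only
    med = statistics.median(x for x in xs if x is not None)
    filled = [med if x is None else x for x in xs]
    was_na = [x is None for x in xs]
    return filled, was_na

edu_num = [13, None, 9, 14, None]
filled, was_na = fill_median(edu_num)
print(filled)   # [13, 13, 9, 14, 13]
print(was_na)   # [False, True, False, False, True]
```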

## The `DataLoaders`

Now let's build our full `TabularPandas`. We're also going to want to split our data and declare our `y_names`:

```
splits = RandomSplitter()(range_of(df))
```

```
splits
```
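`RandomSplitter` shuffles the indices and carves off a validation fraction (20% by default). A minimal sketch of that behavior (fastai's version uses its own random-number handling, so treat this as illustrative):

```
import random

def random_splitter(idxs, valid_pct=0.2, seed=42):
    # shuffle a copy of the indices, then carve off the validation slice
    rng = random.Random(seed)
    idxs = list(idxs)
    rng.shuffle(idxs)
    cut = int(valid_pct * len(idxs))
    return idxs[cut:], idxs[:cut]   # (train, valid)

train, valid = random_splitter(range(10))
print(train, valid)
```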

What is `range_of`?

```
range_of(df)[:5], len(df)
```

It's a list of all the indices in our `DataFrame`.
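For a list-like argument, `range_of` from `fastcore` is essentially:

```
def range_of(x):
    # every index of a sized collection, as a list
    return list(range(len(x)))

print(range_of(['a', 'b', 'c']))   # [0, 1, 2]
```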

```
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
```

Now that we have everything declared, let's build our `TabularPandas`:

```
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, y_block=y_block, splits=splits)
```

And we can build our `DataLoaders`. We can do this one of two ways:

```
dls = to.dataloaders()
```

```
dls.show_batch()
```

The other way is to create the two `DataLoader`s (a train and a valid) ourselves. One great reason to do it this way is that we can pass a different batch size into each `TabDataLoader`, along with changing options like `shuffle` and `drop_last` (at the bottom I'll show why that's super useful).

So how do we use it? Our train and validation data live in `to.train` and `to.valid` right now, so we specify those along with our options. When you make a training `DataLoader`, you want `shuffle` to be `True` and `drop_last` to be `True`:

```
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
```
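Why `drop_last=True` only for training? If the dataset size isn't a multiple of the batch size, the last batch comes up short, and a tiny straggler batch can skew things like batch-norm statistics during training. A quick pure-Python sketch of the bookkeeping:

```
def batches(n_items, bs, drop_last=False):
    # split n_items indices into consecutive batches of size bs
    idxs = list(range(n_items))
    out = [idxs[i:i + bs] for i in range(0, n_items, bs)]
    if drop_last and out and len(out[-1]) < bs:
        out.pop()   # discard the short final batch
    return out

print(len(batches(100, 64)))                  # 2 (second batch has only 36 items)
print(len(batches(100, 64, drop_last=True)))  # 1
```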

Now we can make some `DataLoaders`:

```
dls = DataLoaders(trn_dl, val_dl)
```

```
dls.show_batch()
```

Why can we do the `.dataloaders()`? Because `TabularPandas` objects actually **are** `TabDataLoader`s!

```
to._dbunch_type
```

```
dls._dbunch_type
```

## `TabularLearner`

Now we can build our model!

## Categorical Variables:

When dealing with our categorical data, we create what is called an **embedding matrix**. This allows for a higher dimensionality of relationships between the different categorical cardinalities. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.
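To make the embedding idea concrete: an embedding matrix is just a lookup table with one learned row of floats per category, and a label-encoded value selects its row. A tiny sketch with made-up numbers (real embeddings are learned during training):

```
# Hypothetical 4-category embedding with 3-dimensional vectors
emb = [
    [0.0, 0.0, 0.0],   # row 0: #na#
    [0.1, 0.3, 0.2],   # row 1
    [0.5, 0.1, 0.9],   # row 2
    [0.2, 0.8, 0.4],   # row 3
]

def lookup(codes):
    # each label-encoded value indexes a row of the matrix
    return [emb[c] for c in codes]

print(lookup([2, 0, 3]))
```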

```
def get_emb_sz(to, sz_dict=None):
    "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
    return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]
```

```
def _one_emb_sz(classes, n, sz_dict=None):
    "Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`."
    sz_dict = ifnone(sz_dict, {})
    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat, sz
```

And now if we go look at his rule of thumb:

```
def emb_sz_rule(n_cat):
    "Rule of thumb to pick embedding size corresponding to `n_cat`"
    return min(600, round(1.6 * n_cat**0.56))
```

We take 1.6 times the cardinality raised to the 0.56 power, capped at a maximum size of 600.
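Plugging a couple of cardinalities into that rule of thumb shows how it behaves:

```
def emb_sz_rule(n_cat):
    # fastai's rule of thumb, capped at 600
    return min(600, round(1.6 * n_cat**0.56))

print(emb_sz_rule(10))       # 6
print(emb_sz_rule(100_000))  # 600 (the cap kicks in)
```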

```
emb_szs = get_emb_sz(to)
```

```
emb_szs
```

If we want to see what each one aligns to, let's look at the order of `cat_names`:

```
to.cat_names
```

```
to['workclass'].nunique()
```

If you notice, the embedding sizes showed `10` there, one more than the number of unique values; the extra slot accounts for any missing categorical values that may show up.

For numerical columns, we simply pass in how many there are:

```
cont_len = len(to.cont_names)
```

```
cont_len
```

And now we can build our model!

What makes this model a little different is that each batch actually contains two inputs:

```
batch = dls.one_batch()
```

```
len(batch)
```

```
batch[0][0], batch[1][0]
```

With the first being our categorical variables and the second being our numericals.

Now let's make our model. We'll want our size of our embeddings, the number of continuous variables, the number of outputs, and how large and how many fully connected layers we want to use:

```
net = TabularModel(emb_szs, cont_len, 2, [200,100])
```

```
net
```

Now that we know the background, let's do that a bit quicker:

```
learn = tabular_learner(dls, [200,100], metrics=accuracy)
```

And now we can fit!

```
learn.lr_find()
```

```
learn.fit(3, 1e-2)
```

Can we speed this up a little? Yes we can! The more rows you can load into a batch, the faster you can process the data. This is a careful balance; for tabular data I go up to a maximum of 4096 rows per batch:

```
dls = to.dataloaders(bs=1024)
```

```
learn = tabular_learner(dls, [200,100], metrics=accuracy)
```

```
learn.lr_find()
```

```
learn.fit(3, 1e-2)
```

We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off):

```
dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
```

```
learn.lr_find()
```

```
learn.fit_one_cycle(3, 1e-2)
```

```
learn.export('myModel.pkl')
```

```
del learn
```

```
learn = load_learner('myModel.pkl')
```

Once we load in our learner, we can create a test dataloader like so:

```
dl = learn.dls.test_dl(df.iloc[:100])
```

Let's look at a batch

```
dl.show_batch()
```

You can see it's actually labelled! Is that right?

```
df2 = df.iloc[:100].drop('salary', axis=1)
```

```
df2.head()
```

```
dl = learn.dls.test_dl(df2)
```

```
dl.show_batch()
```

And now we can pass either into our `learn`! (Though you can't call `validate` on a `test_dl` that did not have ground-truth labels.)

```
learn.validate(dl=dl)
```

```
dl = learn.dls.test_dl(df.iloc[:100])
```

```
learn.validate(dl=dl)
```