## Binary Classification

In this example we will be walking through the `fastai` tabular API to perform binary classification on the Salary dataset.

This notebook can run alongside the first tabular lesson from Walk with fastai2, shown here.

First we need to call the tabular module:

```
from fastai.tabular.all import *
```

And grab our dataset:

```
path = untar_data(URLs.ADULT_SAMPLE)
```

If we look at the contents of our folder, we will find our data lives in `adult.csv`:

```
path.ls()
```

We'll go ahead and open it in `Pandas` and take a look:

```
df = pd.read_csv(path/'adult.csv')
df.head()
```

## TabularPandas

`fastai` has a new way of dealing with tabular data by utilizing a `TabularPandas` object. It expects some dataframe, some `procs`, `cat_names`, `cont_names`, `y_names`, `y_block`, and some `splits`. We'll walk through all of them.

First we need to grab our categorical and continuous variables, along with how we want to process our data.

```
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
```

When we pre-process tabular data with `fastai`, we do one or more of three transforms:

### Categorify

`Categorify` will transform columns that are in your `cat_names` into the categorical type, along with label encoding our categorical data.
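As a rough illustration of what label encoding means, here is a plain-pandas sketch (not fastai's actual implementation, which also reserves a `#na#` category):

```python
import pandas as pd

# A toy column with a missing value, like `occupation` in the Salary data
s = pd.Series(["Private", "Self-emp", None, "Private"])

# Convert to the categorical dtype; the codes are the label encoding,
# with -1 standing in for missing values
cat = s.astype("category")
print(cat.cat.categories.tolist())  # ['Private', 'Self-emp']
print(cat.cat.codes.tolist())       # [0, 1, -1, 0]
```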

First we'll make an instance of it:

```
cat = Categorify()
```

And now let's try transforming a dataframe:

```
to = TabularPandas(df, cat, cat_names)
```

We can then extract that transform from `to.procs.categorify`:

```
cats = to.procs.categorify
```

Let's take a look at the categories:

```
cats['relationship']
```

We can see that it added a `#na#` category. Let's look at the actual column:

```
to.show(max_n=3)
```

We can see now, for example, that `occupation` returned a `#na#` value (as it was missing).

If we call `to.cats` we can see our label-encoded variables:

```
to.cats.head()
```

### Normalize

`Normalize` will normalize our continuous variables by subtracting the mean and dividing by the standard deviation. First we'll make an instance of it:

```
norm = Normalize()
```

Let's make another `to`:

```
to = TabularPandas(df, norm, cont_names=cont_names)
```

```
norms = to.procs.normalize
```

And take a closer look. We can grab the means and standard deviations like so:

```
norms.means
```

```
norms.stds
```

And we can also call `to.conts` to take a look at our transformed data:

```
to.conts.head()
```
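`Normalize` is computing the classic standardization, (value - mean) / std, using statistics from the training data. A quick sketch with made-up values (not the real dataset statistics):

```python
import pandas as pd

# Hypothetical ages, purely for illustration
ages = pd.Series([25.0, 40.0, 55.0])

mean, std = ages.mean(), ages.std()
normed = (ages - mean) / std

print(mean, std)         # 40.0 15.0
print(normed.tolist())   # [-1.0, 0.0, 1.0]
```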

### FillMissing

`FillMissing` will fill in any missing values in our continuous variables; here we'll use the median. First we'll make an instance of it:

```
fm = FillMissing(fill_strategy=FillStrategy.median)
```

We'll recreate another `TabularPandas`:

```
to = TabularPandas(df, fm, cont_names=cont_names)
```

Let's look at those missing values in the first few rows:

```
to.conts.head()
```

**But wait!** There's more!

```
to.cat_names
```

We have categorical values?! Yes!

```
to.cats.head()
```

We now have an additional boolean column indicating whether each value was missing, too!
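In plain pandas, the same idea looks roughly like this (an illustrative sketch, not fastai's code):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])

was_missing = s.isna()         # becomes the extra boolean column, e.g. `education-num_na`
filled = s.fillna(s.median())  # median of [1.0, 3.0] is 2.0

print(filled.tolist())         # [1.0, 2.0, 3.0, 2.0]
print(was_missing.tolist())    # [False, True, False, True]
```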

## The DataLoaders

Now let's build our `TabularPandas` for classifying. We're also going to want to split our data and declare our `y_names` too:

```
splits = RandomSplitter()(range_of(df))
splits
```

What is `range_of`?

```
range_of(df)[:5], len(df)
```

It's a list of all the indices in our `DataFrame`.
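Assuming fastcore's definition, `range_of` is equivalent to:

```python
# A list of row indices, 0 .. len(x) - 1
def range_of(x):
    return list(range(len(x)))

print(range_of([10, 20, 30]))  # [0, 1, 2]
```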

We'll use all our `cat` and `cont` names, the `procs`, declare a `y_name`, and finally specify a single-label classification problem with `CategoryBlock`:

```
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
```

Now that we have everything declared, let's build our `TabularPandas`:

```
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
```

And now we can build the `DataLoaders`. We can do this one of two ways; the first is just calling `to.dataloaders()` on our data:

```
dls = to.dataloaders()
```

Or we can create the `DataLoaders` ourselves (a train and a valid). One great reason to do it this way is that we can pass different batch sizes into each `TabDataLoader`, along with changing options like `shuffle` and `drop_last`. So how do we use it? Our train and validation data live in `to.train` and `to.valid` right now, so we specify those along with our options. When you make a training `DataLoader`, you want `shuffle` to be `True` and `drop_last` to be `True` (so we drop the last incomplete batch):
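A quick arithmetic sketch (with hypothetical sizes) of what `drop_last=True` does to the batches:

```python
# Hypothetical training-set size and batch size
n, bs = 1000, 64

full_batches, leftover = divmod(n, bs)
print(full_batches, leftover)  # 15 40

# With drop_last=True the 40 leftover rows are skipped each epoch,
# so every training batch has exactly bs rows (which keeps layers
# like BatchNorm happy)
```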

```
trn_dl = TabDataLoader(to.train, bs=64, shuffle=True, drop_last=True)
val_dl = TabDataLoader(to.valid, bs=128)
```

Now we can make some `DataLoaders`

:

```
dls = DataLoaders(trn_dl, val_dl)
```

And show a batch of data:

```
dls.show_batch()
```

Why can we call `.dataloaders()`? Because `TabularPandas` itself is actually a set of `TabDataLoaders`! See below for a comparison test:

```
to._dbunch_type == dls._dbunch_type
```

## Tabular Learner and Training a Model

Now we can build our `Learner`! But what's special about a tabular neural network?

### Categorical Variables

When dealing with our categorical data, we create what is called an **embedding matrix**. This allows for a higher dimensionality for relationships between the different categorical cardinalities. Finding the best size ratio was done through experiments by Jeremy on the Rossmann dataset.

This rule of thumb is to use either a maximum embedding space of 600, or 1.6 times the cardinality raised to the 0.56 power, written out as:

$$\min\left(600,\ 1.6 \times \text{var.nunique}^{0.56}\right)$$
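Written as code, the rule of thumb looks like this (a sketch mirroring fastai's `emb_sz_rule`; check the library source for the exact definition):

```python
def emb_sz_rule(n_cat: int) -> int:
    """Embedding width for a categorical variable with n_cat levels."""
    return min(600, round(1.6 * n_cat ** 0.56))

# e.g. a variable with 10 levels (9 categories plus one for #na#)
print(emb_sz_rule(10))     # 6
print(emb_sz_rule(10**6))  # capped at 600
```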

Let's calculate these embedding sizes for our model to take a look-see:

```
emb_szs = get_emb_sz(to); emb_szs
```

If we want to see what each one aligns to, let's look at the order of `cat_names`:
```
to.cat_names
```

Let's specifically look at `workclass`:

```
to['workclass'].nunique()
```

If you notice, we had `10` there; this is to take one more column for any missing categorical values that may show up. Next we'll grab the number of continuous variables:

```
cont_len = len(to.cont_names); cont_len
```

And now we have all the pieces we need to build a `TabularModel`!

```
batch = dls.one_batch(); len(batch)
```

```
batch[0][0], batch[1][0]
```

With the first being our categorical variables and the second being our continuous ones.

Now let's make our model. We'll want our size of our embeddings, the number of continuous variables, the number of outputs, and how large and how many fully connected layers we want to use:

```
net = TabularModel(emb_szs, cont_len, 2, [200,100])
```

Let's see its architecture:

```
net
```

### tabular_learner

Now that we know the background, let's build our model a little bit faster and generate a `Learner`

too:

```
learn = tabular_learner(dls, [200,100], metrics=accuracy)
```

And now we can fit!

```
learn.lr_find()
```

```
learn.fit(3, 1e-2)
```

Can we speed this up a little? Yes, we can! The more you can load into a batch, the faster you can process the data. This is a careful balance; for tabular data I go to a maximum of 4096 rows per batch, if the dataset is large enough for a decent number of batches:

```
dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.fit(3, 1e-2)
```

We can see we fit very quickly, but it didn't fit quite as well (there is a trade-off):

```
dls = to.dataloaders(bs=4096)
learn = tabular_learner(dls, [200,100], metrics=accuracy)
learn.fit(3, 1e-2)
```

Let's try running inference on a single row with `learn.predict`:

```
row, cls, probs = learn.predict(df.iloc[0])
```

```
row.show()
```

Now let's try `test_dl`. There's something special we can do here too:

```
dl = learn.dls.test_dl(df.iloc[:100])
```

Let's look at a batch:

```
dl.show_batch()
```

We have our labels! It'll grab them if possible by default!

What does that mean? Well, besides simply calling `get_preds`, we can also run `validate` to see how a model performs. This is nice, as it can allow for efficient methods when calculating something like permutation importance:

```
learn.validate(dl=dl)
```
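As a sketch of that permutation-importance idea (a hypothetical helper, not fastai's implementation): shuffle one column at a time, re-score, and compare against the baseline. Here `score_fn` is assumed to be a small wrapper that builds a `test_dl` from the dataframe and returns the metric from `learn.validate`:

```python
import numpy as np

def permutation_importance(score_fn, df, cols):
    """Drop in score when each column is shuffled; a bigger drop means
    the model leaned on that column more."""
    baseline = score_fn(df)
    importances = {}
    for col in cols:
        shuffled = df.copy()
        shuffled[col] = np.random.permutation(shuffled[col].values)
        importances[col] = baseline - score_fn(shuffled)
    return importances
```

Only one test set's worth of predictions per column is needed, which is why `validate` on a test `DataLoader` makes this reasonably efficient.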

We'll also show an example of `get_preds`:

```
preds = learn.get_preds(dl=dl)
```

```
preds[0][0]
```

What would happen if I accidentally passed an unlabelled dataset to `learn.validate`, though? Let's find out:

```
df2 = df.iloc[:100].drop('salary', axis=1)
df2.head()
```

```
dl = learn.dls.test_dl(df2)
learn.validate(dl=dl)
```

We can see it will simply return `None`!