This notebook is largely based on the fantastic work of Etienne Tremblay and what came from experiments as noted here

## AutoEncoders, just what are they?

The problem: we have far too many input variables (200+).

Rather than trying to make a tabular model with all these features, we can compress them down via a technique called AutoEncoding.

Essentially we train a model whose sole purpose is to **recreate** the original input data!

But where does the reduction happen? Let's look at a model visualization of an auto-encoder, specifically the one we will be making!

The base of this model is extremely similar to `fastai`'s `TabularModel`, minus a few distinctions:

- Our inputs immediately pass through a `BatchSwapNoise` module, based on the Porto Seguro Winning Solution, which injects random noise into our data for variability
- After going through the embedding matrix, the "layers" of our model include an `Encoder` and `Decoder` (shown below), which compress our data to a 128-long vector before blowing it back up in the decoder
- After the decoder produces its output, we specifically decode the categorical and continuous variables back to their original shapes

This encoder blows up our inputs (in this case) until we reach a 128-long vector representation. From there we pass it to the decoder (shown below):

The decoder just does the reverse of what the encoder just did
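As a rough sketch of the shapes involved: the layer widths below are the ones used by the `TabularAE` model defined later in this notebook, while `layer_pairs` is a hypothetical helper used only for illustration. The width of the very first input depends on the embedding sizes, so it is left out here:

```python
# Widths taken from the TabularAE definition later in this notebook.
encoder_widths = [1024, 512, 256, 128]   # ends at the 128-long representation
decoder_widths = encoder_widths[::-1]    # the decoder mirrors the encoder

def layer_pairs(widths):
    "Hypothetical helper: (in_features, out_features) for each linear layer"
    return list(zip(widths[:-1], widths[1:]))

print(layer_pairs(encoder_widths))
print(layer_pairs(decoder_widths))
```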

Since we are building a model that will be able to accurately recreate our data based on its input, we'll need a few modifications and special handles.

First let's set up the example Adult dataset:

```
from fastai.tabular.all import *
```

```
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
```

Next we'll want our `cat_names`, `cont_names`, `procs`, etc:

```
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
splits = RandomSplitter()(range_of(df))
```

And finally we'll get a baseline accuracy:

```
to = TabularPandas(df, procs = [Categorify, FillMissing, Normalize], cat_names=cat_names, cont_names=cont_names,
splits=splits, y_names=['salary'], y_block=CategoryBlock())
```

```
dls = to.dataloaders(bs=1024)
```

```
learn = tabular_learner(dls, layers=[200,100], metrics=[accuracy])
learn.fit(5, 1e-2)
```

### ReadTabBatchIdentity

`fastai` normally has `ReadTabBatch` as an `ItemTransform` in order to load in the outputs from a `TabularPandas` object. We need to modify this slightly, so instead of returning `x,y`, we return pairs of `x,x`:

```
class ReadTabBatchIdentity(ItemTransform):
    "Read a batch of data and return the inputs as both `x` and `y`"
    def __init__(self, to): store_attr()
    def encodes(self, to):
        if not to.with_cont: res = (tensor(to.cats).long(),) + (tensor(to.cats).long(),)
        else: res = (tensor(to.cats).long(),tensor(to.conts).float()) + (tensor(to.cats).long(), tensor(to.conts).float())
        if to.device is not None: res = to_device(res, to.device)
        return res

class TabularPandasIdentity(TabularPandas): pass
```

### TabDataLoaderIdentity

Next we need a way to use our new transform, and that is through a new `DataLoader`:

```
@delegates()
class TabDataLoaderIdentity(TabDataLoader):
    "A transformed `DataLoader` for AutoEncoder problems with Tabular data"
    do_item = noops
    def __init__(self, dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, **kwargs):
        if after_batch is None: after_batch = L(TransformBlock().batch_tfms)+ReadTabBatchIdentity(dataset)
        super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
    def create_batch(self, b): return self.dataset.iloc[b]
```

We also need to set `TabularPandasIdentity`'s `_dl_type` to `TabDataLoaderIdentity` so it knows just what type of `DataLoader` to generate:

```
TabularPandasIdentity._dl_type = TabDataLoaderIdentity
```

## Building the `TabularPandasIdentity`

Now we can build our `to` object and the `DataLoaders`:

```
to = TabularPandasIdentity(df, [Categorify, FillMissing, Normalize], cat_names, cont_names, splits=RandomSplitter(seed=32)(df))
dls = to.dataloaders(bs=1024)
```

We'll set the `n_inp` to 2 and then work on building our loss function:

```
dls.n_inp = 2
```
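To see why `n_inp` matters: our batches are 4-tuples of `(cats, conts, cats, conts)`, and `n_inp` tells the `Learner` how many leading elements are model inputs, with the rest treated as targets. Below is a minimal plain-Python sketch of that split, where `split_batch` is a hypothetical helper and the strings stand in for tensors:

```python
def split_batch(batch, n_inp):
    "Hypothetical helper: the first `n_inp` items are inputs, the rest are targets"
    return batch[:n_inp], batch[n_inp:]

# Strings stand in for the tensors produced by `ReadTabBatchIdentity`
batch = ("cats", "conts", "cats", "conts")
xb, yb = split_batch(batch, n_inp=2)
print(xb, yb)
```

With `n_inp=2` the model receives `(cats, conts)` and the loss function receives the same pair as its targets.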

Probably the most important part of this problem is creating a good loss function that can best measure how accurately our model recreates the data. Let's look at how to approach this with categorical and continuous variables.

For the categorical variables, we'll want to gather a dictionary mapping each variable to the number of unique classes it can take:

```
total_cats = {k:len(v) for k,v in to.classes.items()}
total_cats
```

We will then use this dictionary to figure out where to apply our `CrossEntropyLossFlat` for each categorical variable.

We'll also need to know the total number of outputs possible for those variables:

```
sum([v for k,v in total_cats.items()])
```
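The decoder emits one long row of logits covering every categorical variable, so each variable owns a contiguous slice of that row. Here is a small sketch of how those slices follow from the dictionary. The cardinalities are made up for illustration and `slice_bounds` is a hypothetical helper:

```python
def slice_bounds(cat_dict):
    "Hypothetical helper: map each variable to its (start, end) logit columns"
    bounds, pos = {}, 0
    for name, n_classes in cat_dict.items():
        bounds[name] = (pos, pos + n_classes)
        pos += n_classes
    return bounds

example_cats = {"workclass": 10, "education": 17, "race": 6}  # illustrative sizes only
print(slice_bounds(example_cats))
print(sum(example_cats.values()))  # total number of categorical outputs
```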

For the continuous variables we need to know the means and standard deviations:

```
to.means
```

We're going to store the `means` and `stds` in a `DataFrame` to make some further adjustments before usage:

```
means = pd.DataFrame.from_dict({k:[v] for k,v in to.means.items()})
stds = pd.DataFrame.from_dict({k:[v] for k,v in to.stds.items()})
```

The adjustment is to compute a sigmoid range: we take the minimum and maximum of each continuous column in the raw (non-normalized) data and normalize them, which bounds the values our model can output:

```
low = (df[cont_names].min().to_frame().T.values - means.values) / stds.values
high = (df[cont_names].max().to_frame().T.values - means.values) / stds.values
```

```
low, high
```
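These bounds are later handed to a `SigmoidRange` layer, which squashes each continuous output into `(low, high)`. The arithmetic can be sketched in plain Python as a sigmoid rescaled to the target interval. The numbers are illustrative, not values from the Adult dataset:

```python
import math

def sigmoid_range(x, low, high):
    "Squash `x` into the interval between `low` and `high`"
    return 1 / (1 + math.exp(-x)) * (high - low) + low

print(sigmoid_range(0.0, low=-1.0, high=3.0))   # midpoint of the interval
print(sigmoid_range(10.0, low=-1.0, high=3.0))  # approaches `high`
```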

### RecreatedLoss

Next we'll make our loss function. As mentioned, this loss function will run `MSELossFlat` on the continuous variables to measure accuracy and run `CrossEntropyLossFlat` for the categoricals:

```
class RecreatedLoss(Module):
    "Measures how well we have created the original tabular inputs"
    def __init__(self, cat_dict):
        ce = CrossEntropyLossFlat(reduction='sum')
        mse = MSELossFlat(reduction='sum')
        store_attr('cat_dict,ce,mse')
    def forward(self, preds, cat_targs, cont_targs):
        cats, conts = preds
        tot_ce, pos = cats.new([0]), 0
        for i, (k,v) in enumerate(self.cat_dict.items()):
            tot_ce += self.ce(cats[:, pos:pos+v], cat_targs[:,i])
            pos += v
        norm_cats = cats.new([len(self.cat_dict)])
        norm_conts = conts.new([conts.size(1)])
        cat_loss = tot_ce/norm_cats
        cont_loss = self.mse(conts, cont_targs)/norm_conts
        total = cat_loss+cont_loss
        return total / cats.size(0)
```

And all we need to do is pass in our `total_cats` dictionary:

```
loss_func = RecreatedLoss(total_cats)
```
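To make the arithmetic concrete, here is a framework-free sketch of the same computation on nested lists. `cross_entropy_sum` and `recreated_loss` are hypothetical helpers written for this illustration, and the inputs are toy values. The real loss operates on tensors via `CrossEntropyLossFlat` and `MSELossFlat`:

```python
import math

def cross_entropy_sum(logit_rows, targets):
    "Summed cross entropy (log-sum-exp minus the target logit) over a batch"
    total = 0.0
    for logits, t in zip(logit_rows, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[t]
    return total

def recreated_loss(cat_logits, cont_preds, cat_targs, cont_targs, cat_sizes):
    "Per-variable cross entropy plus squared error, averaged as in `RecreatedLoss`"
    tot_ce, pos = 0.0, 0
    for i, n_classes in enumerate(cat_sizes):
        rows = [row[pos:pos + n_classes] for row in cat_logits]  # this variable's slice
        tot_ce += cross_entropy_sum(rows, [t[i] for t in cat_targs])
        pos += n_classes
    sq_err = sum((p - t) ** 2 for pr, tr in zip(cont_preds, cont_targs)
                 for p, t in zip(pr, tr))
    cat_loss = tot_ce / len(cat_sizes)        # average over categorical variables
    cont_loss = sq_err / len(cont_preds[0])   # average over continuous variables
    return (cat_loss + cont_loss) / len(cat_logits)  # average over the batch

# Toy batch: two rows, two categorical variables (2 and 3 classes), one continuous
loss = recreated_loss(
    cat_logits=[[0.0] * 5, [0.0] * 5], cont_preds=[[0.0], [1.0]],
    cat_targs=[[0, 1], [1, 2]], cont_targs=[[0.0], [0.0]], cat_sizes=[2, 3])
print(loss)
```

Because every logit is zero, each categorical prediction is uniform, so the cross entropy per item is just the log of the number of classes.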

```
class BatchSwapNoise(Module):
    "Swap Noise Module"
    def __init__(self, p): store_attr()
    def forward(self, x):
        if self.training:
            mask = torch.rand(x.size()) > (1 - self.p)
            l1 = torch.floor(torch.rand(x.size()) * x.size(0)).type(torch.LongTensor)
            l2 = (mask.type(torch.LongTensor) * x.size(1))
            res = (l1 * l2).view(-1)
            idx = torch.arange(x.nelement()) + res
            idx[idx>=x.nelement()] = idx[idx>=x.nelement()]-x.nelement()
            return x.flatten()[idx].view(x.size())
        else:
            return x
```

Notice how it is like Dropout, where the noise is only added during training
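The index arithmetic in `BatchSwapNoise` can be hard to read, so here is a loop-based sketch of the idea as we understand it: each cell is replaced, with probability `p`, by the value from the same column in a randomly chosen row. `swap_noise` is a hypothetical helper for illustration, not the module used for training:

```python
import random

def swap_noise(rows, p, seed=None):
    "With probability `p`, replace each cell by the same column's value from a random row"
    rng = random.Random(seed)
    n_rows = len(rows)
    noisy = []
    for row in rows:
        new_row = []
        for col, value in enumerate(row):
            if rng.random() < p:
                value = rows[rng.randrange(n_rows)][col]  # may pick the same row
            new_row.append(value)
        noisy.append(new_row)
    return noisy

data = [[1, 10], [2, 20], [3, 30]]
print(swap_noise(data, p=0.5, seed=0))
```

Since values only ever move within a column, every noisy row still contains plausible values for each feature, which is what makes this gentler than adding Gaussian noise to categorical codes.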

Next we'll make a custom `TabularAE` model (AutoEncoder) for us to use:

```
class TabularAE(TabularModel):
    "A simple AutoEncoder model"
    def __init__(self, emb_szs, n_cont, hidden_size, cats, low, high, ps=0.2, embed_p=0.01, bswap=None):
        super().__init__(emb_szs, n_cont, layers=[1024, 512, 256], out_sz=hidden_size, embed_p=embed_p, act_cls=Mish())
        self.bswap = bswap
        self.cats = cats
        self.activation_cats = sum([v for k,v in cats.items()])
        self.layers = nn.Sequential(*L(self.layers.children())[:-1] + nn.Sequential(LinBnDrop(256, hidden_size, p=ps, act=Mish())))
        if(bswap != None): self.noise = BatchSwapNoise(bswap)
        self.decoder = nn.Sequential(
            LinBnDrop(hidden_size, 256, p=ps, act=Mish()),
            LinBnDrop(256, 512, p=ps, act=Mish()),
            LinBnDrop(512, 1024, p=ps, act=Mish())
        )
        self.decoder_cont = nn.Sequential(
            LinBnDrop(1024, n_cont, p=ps, bn=False, act=None),
            SigmoidRange(low=low, high=high)
        )
        self.decoder_cat = LinBnDrop(1024, self.activation_cats, p=ps, bn=False, act=None)
    def forward(self, x_cat, x_cont=None, encode=False):
        if(self.bswap != None):
            x_cat = self.noise(x_cat)
            x_cont = self.noise(x_cont)
        encoded = super().forward(x_cat, x_cont)
        if encode: return encoded # return the representation
        decoded_trunk = self.decoder(encoded)
        decoded_cats = self.decoder_cat(decoded_trunk)
        decoded_conts = self.decoder_cont(decoded_trunk)
        return decoded_cats, decoded_conts
```

Towards the end we will look at how to extract our vector representations, but those with keen eyes can spot where it happens in the above code. `forward` takes an `encode` parameter, and so long as it defaults to `False`, it will not break inside of the `fastai` training framework.

Now let's build a model. We'll use a hidden layer size (vector representation) of 128, 10% dropout with 1% embedding dropout, 1% weight decay and a noise level of 10%. Along with this we will pass in our `y_range` (the `low` and `high` values computed earlier):

```
emb_szs = get_emb_sz(to.train)
```

```
model = TabularAE(emb_szs, len(cont_names), 128, ps=0.1, cats=total_cats, embed_p=0.01,
bswap=.1, low=tensor(low).cuda(), high=tensor(high).cuda())
```

And finally our `Learner`:

```
learn = Learner(dls, model, loss_func=loss_func, wd=0.01, opt_func=ranger)
```

**Note:** we are using the `ranger` optimizer here. During experiments we found it could train much better representations than Adam with `fit`/`fit_one_cycle`.

And now we'll fit until we begin overfitting (with `EarlyStoppingCallback`):

```
learn.fit_flat_cos(100, cbs=[EarlyStoppingCallback()], lr=4e-3)
```
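`fit_flat_cos` holds the learning rate flat for most of training and then anneals it along a cosine curve. The sketch below assumes defaults of `pct_start=0.75` and `div_final=1e5`, so check the `fastai` documentation for your version. `flat_cos_lr` is a hypothetical helper, not part of `fastai`:

```python
import math

def flat_cos_lr(pos, lr, pct_start=0.75, div_final=1e5):
    "Learning rate at training progress `pos` in [0, 1] (defaults are assumptions)"
    if pos < pct_start: return lr                 # flat phase
    end = lr / div_final                          # final learning rate
    t = (pos - pct_start) / (1 - pct_start)       # progress within the cosine phase
    return end + (lr - end) * (1 + math.cos(math.pi * t)) / 2

for pos in (0.0, 0.5, 0.75, 0.9, 1.0):
    print(pos, flat_cos_lr(pos, lr=4e-3))
```

Note that if early stopping halts a 100-epoch run after only a handful of epochs, training never leaves the flat phase.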

As we can see the best model was only after 6 epochs! Let's see how our representations stack up

## Getting the compressed representations

We're going to first gather the representations and grade them, before attempting to train a new classification model on the dataset.

To begin we need to make a new `DataLoader` with all of our data:

```
dl = learn.dls.test_dl(df)
```

And then we will predict over all the data using raw `PyTorch`. Notice we are passing `encode=True` to grab the representations:

```
outs = []
for batch in dl:
    with torch.no_grad():
        learn.model.eval()
        learn.model.cuda()
        out = learn.model(*batch[:2], encode=True).cpu().numpy()
        outs.append(out)
outs = np.concatenate(outs)
```

And now we can verify that it is indeed a 128-long vector:

```
outs.shape
```

Finally we need the actual predictions and targets for the categorical and continuous variables:

```
(cat_preds, cont_preds), (cat_targs, cont_targs) = learn.get_preds(dl=dl)
```

For measuring their overall accuracy we will use an `R2` score:

```
cont_preds = pd.DataFrame(cont_preds, columns=cont_names)
cont_targs = pd.DataFrame(cont_targs, columns=cont_names)
```

We'll decode our values manually via our `stds` and `means`:

```
preds = pd.DataFrame((cont_preds.values * stds.values) + means.values, columns=cont_preds.columns)
targets = pd.DataFrame((cont_targs.values * stds.values) + means.values, columns=cont_targs.columns)
```
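This is simply the inverse of the normalization applied during preprocessing, which subtracts the mean and divides by the standard deviation. A tiny round-trip sketch, with made-up numbers and hypothetical helper names:

```python
def normalize(x, mean, std):
    "Center on the mean and scale by the standard deviation"
    return (x - mean) / std

def denormalize(z, mean, std):
    "Inverse of `normalize`: scale back up and re-add the mean"
    return z * std + mean

age, mean, std = 50.0, 38.0, 13.0   # illustrative numbers only
z = normalize(age, mean, std)
print(z, denormalize(z, mean, std))
```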

And measure the `min`, `max`, `mean`, `median`, and `R2` score:

```
from sklearn.metrics import r2_score
```

```
mi = (np.abs(targets-preds)).min().to_frame().T
ma = (np.abs(targets-preds)).max().to_frame().T
mean = (np.abs(targets-preds)).mean().to_frame().T
median = (np.abs(targets-preds)).median().to_frame().T
r2 = pd.DataFrame.from_dict({c:[r2_score(targets[c], preds[c])] for c in preds.columns})
```
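For intuition about what `r2_score` reports, here is a hand-rolled version of the usual definition, one minus the ratio of residual to total sum of squares. `r2_manual` is a hypothetical helper and the data is a toy example:

```python
def r2_manual(targets, preds):
    "Coefficient of determination: 1 - SS_res / SS_tot"
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1 - ss_res / ss_tot

targets = [1.0, 2.0, 3.0, 4.0]
print(r2_manual(targets, targets))     # perfect reconstruction gives 1.0
print(r2_manual(targets, [2.5] * 4))   # always predicting the mean gives 0.0
```

A score near 1 therefore means the autoencoder recovers far more of each column's variance than a constant guess would.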

```
for d,name in zip([mi,ma,mean,median,r2], ['Min', 'Max', 'Mean', 'Median', 'R2']):
    d.insert(0, 'GroupBy', name)  # `insert` modifies the DataFrame in place and returns None
```

Let's see how it looks:

```
data = pd.concat([mi,ma,mean,median,r2])
data
```

Those `R2` values look very good! Let's take their mean:

```
r2.mean(axis=1)
```

93.5% is not bad at all for 6 epochs! Let's take a look at the categorical variables next

```
cat_reduced = torch.zeros_like(cat_targs)
pos=0
for i, (k,v) in enumerate(total_cats.items()):
    cat_reduced[:,i] = cat_preds[:,pos:pos+v].argmax(dim=1)
    pos += v
cat_preds = pd.DataFrame(cat_reduced, columns=cat_names)
cat_targs = pd.DataFrame(cat_targs, columns=cat_names)
```
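The loop above walks along the concatenated logits and takes an argmax inside each variable's slice. The same idea on a single row of plain Python floats, with a hypothetical `reduce_logits` helper and made-up sizes:

```python
def reduce_logits(row, cat_sizes):
    "Hypothetical helper: pick the highest-scoring class inside each variable's slice"
    picks, pos = [], 0
    for n_classes in cat_sizes:
        chunk = row[pos:pos + n_classes]
        picks.append(chunk.index(max(chunk)))  # argmax within the slice
        pos += n_classes
    return picks

row = [0.1, 2.0, -1.0,   0.3, 0.2]   # 3 logits for one variable, 2 for the next
print(reduce_logits(row, cat_sizes=[3, 2]))
```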

We'll measure a `balanced_accuracy` as well as an `f1_score`:

```
from sklearn.metrics import balanced_accuracy_score, f1_score
```

```
accuracy = pd.DataFrame.from_dict({c:[balanced_accuracy_score(cat_targs[c], cat_preds[c])] for c in cat_preds.columns})
```

```
f1 = pd.DataFrame.from_dict({c:[f1_score(cat_targs[c], cat_preds[c], average='weighted')] for c in cat_preds.columns})
```
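Balanced accuracy is the mean of the per-class recall, which is why it is a stricter measure than plain accuracy when some classes are rare. A minimal sketch of that definition with a hypothetical `balanced_accuracy` helper and toy labels:

```python
def balanced_accuracy(y_true, y_pred):
    "Mean of per-class recall, so rare classes count as much as common ones"
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]   # never predicts the rare class
print(balanced_accuracy(y_true, y_pred))  # plain accuracy here would be 0.75
```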

```
for d,name in zip([accuracy, f1], ['Accuracy', 'F1']):
    d.insert(0, 'MetricName', name)  # in-place; `insert` returns None
pd.concat([accuracy, f1])
```

We can see that our accuracy is a *bit* lower than our continuous variables, but the F1 scores look very strong!

Let's check the overall accuracy:

```
accuracy.mean(axis=1)
```

85% is honestly quite good in this situation. Now let's take a look at how to *use* these representations

```
ys = df['salary'].to_numpy()
```

And make a dataframe that holds the representations and our `ys`:

```
df_outs = pd.DataFrame(columns=['salary'] + list(range(0,128)))
df_outs['salary'] = ys
df_outs[list(range(0,128))] = outs
df_outs[list(range(0,128))] = df_outs[list(range(0,128))].astype(np.float16)
```

Next we'll make a new `TabularPandas` object and set it up exactly like we normally would for the `Adult` problem:

```
cont_names = list(range(0,128))
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df_outs, procs = [Normalize], cont_names=cont_names, splits=splits, y_names=['salary'], reduce_memory=False,
y_block=CategoryBlock())
```

Build our `DataLoaders`:

```
dls = to.dataloaders(bs=1024)
```

And then train.

We need to redefine `accuracy` since currently it holds a `DataFrame`:

```
def accuracy(inp, targ, axis=-1):
    "Compute accuracy with `targ` when `pred` is bs * n_classes"
    pred,targ = flatten_check(inp.argmax(dim=axis), targ)
    return (pred == targ).float().mean()
```

```
learn = tabular_learner(dls, layers=[200,100], metrics=[accuracy])
```

```
learn.fit(5, 1e-2)
```

And we can see our model achieved a *better* accuracy than the original data!

**Note:** a fastai model can achieve ~84%; however, in this instance it outperformed the baseline model.