This notebook is largely based on the fantastic work of Etienne Tremblay and what came from experiments as noted here
AutoEncoders, just what are they?
The problem: we have far too many input variables (200+).
Rather than trying to make a tabular model with all these features, we can compress them down via a technique called AutoEncoding.
Essentially we train a model whose sole purpose is to recreate the original input data!
But where does the reduction happen? Let's look at a model visualization of an auto-encoder, specifically the one we will be making!
The base of this model is extremely similar to fastai
's TabularModel
, minus a few distinctions:
- Our inputs immediatly pass through a
BatchSwapNoise
module, based on the Porto Seguro Winning Solution which inputs random noise into our data for variability - After going through the embedding matrix the "layers" of our model include an
Encoder
andDecoder
(shown below) which compresses our data to a 128-long vector before blowing it back up in the decoder - After outputted from the decoder we specifically decode the categorical and continuous variables back to their original shapes
This encoder blows up our inputs (in this case) until we reach a 128-long vector representation. From there we pass it to the decoder (shown below):
The decoder just does the reverse of what the encoder just did
Since we are building a model that will be able to accurately recreate our data based on its input, we'll need a few modifications and special handles.
First let's setup the example Adult dataset:
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
Next we'll want our cat_names
, cont_names
, procs
, etc:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
splits = RandomSplitter()(range_of(df))
And finally we'll get a baseline accuracy:
to = TabularPandas(df, procs = [Categorify, FillMissing, Normalize], cat_names=cat_names, cont_names=cont_names,
splits=splits, y_names=['salary'], y_block=CategoryBlock())
dls = to.dataloaders(bs=1024)
learn = tabular_learner(dls, layers=[200,100], metrics=[accuracy])
learn.fit(5, 1e-2)
ReadTabBatchIdentity
fastai
normally has ReadTabBatch
as an ItemTransform
in order to load in the outputs from a TabularPandas
object. We need to modify this slightly, so instead of returning x,y
, we return pairs of x,x
:
class ReadTabBatchIdentity(ItemTransform):
"Read a batch of data and return the inputs as both `x` and `y`"
def __init__(self, to): store_attr()
def encodes(self, to):
if not to.with_cont: res = (tensor(to.cats).long(),) + (tensor(to.cats).long(),)
else: res = (tensor(to.cats).long(),tensor(to.conts).float()) + (tensor(to.cats).long(), tensor(to.conts).float())
if to.device is not None: res = to_device(res, to.device)
return res
class TabularPandasIdentity(TabularPandas): pass
@delegates()
class TabDataLoaderIdentity(TabDataLoader):
"A transformed `DataLoader` for AutoEncoder problems with Tabular data"
do_item = noops
def __init__(self, dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, **kwargs):
if after_batch is None: after_batch = L(TransformBlock().batch_tfms)+ReadTabBatchIdentity(dataset)
super().__init__(dataset, bs=bs, shuffle=shuffle, after_batch=after_batch, num_workers=num_workers, **kwargs)
def create_batch(self, b): return self.dataset.iloc[b]
We also need to make TabularPandasIdentity
's dl_type
to TabDataLoaderIdentity
so it knows just what type of DataLoader
to generate:
TabularPandasIdentity._dl_type = TabDataLoaderIdentity
to = TabularPandasIdentity(df, [Categorify, FillMissing, Normalize], cat_names, cont_names, splits=RandomSplitter(seed=32)(df))
dls = to.dataloaders(bs=1024)
We'll set the n_inp
to 2 and then work on building our loss function:
dls.n_inp = 2
Probably the most important part of this problem is creating a good loss function that can best measure how accurate our model could recreate the data. Let's look at how to approach this with categorical and continuous variables
For the categorical variables, we'll want to gather a dictionary of each unique possible class in it:
total_cats = {k:len(v) for k,v in to.classes.items()}
total_cats
We will then use this dictionary to figure out where to apply our CrossEntropyLossFlat
for each categorical variable.
We'll also need to know the total number of outputs possible for those variables:
sum([v for k,v in total_cats.items()])
For the continuous variables we need to know the means and standard deviations:
to.means
We're going to store the means
and stds
in a DataFrame
to make some further adjustments before usage:
means = pd.DataFrame.from_dict({k:[v] for k,v in to.means.items()})
stds = pd.DataFrame.from_dict({k:[v] for k,v in to.stds.items()})
That modification we will be making is gathering a sigmoid range based on the non-normalized data to reduce the range our values can be:
low = (df[cont_names].min().to_frame().T.values - means.values) / stds.values
high = (df[cont_names].max().to_frame().T.values - means.values) / stds.values
low, high
class RecreatedLoss(Module):
"Measures how well we have created the original tabular inputs"
def __init__(self, cat_dict):
ce = CrossEntropyLossFlat(reduction='sum')
mse = MSELossFlat(reduction='sum')
store_attr('cat_dict,ce,mse')
def forward(self, preds, cat_targs, cont_targs):
cats, conts = preds
tot_ce, pos = cats.new([0]), 0
for i, (k,v) in enumerate(self.cat_dict.items()):
tot_ce += self.ce(cats[:, pos:pos+v], cat_targs[:,i])
pos += v
norm_cats = cats.new([len(self.cat_dict)])
norm_conts = conts.new([conts.size(1)])
cat_loss = tot_ce/norm_cats
cont_loss = self.mse(conts, cont_targs)/norm_conts
total = cat_loss+cont_loss
return total / cats.size(0)
And all we need to do is pass in our total_cats
dictionary:
loss_func = RecreatedLoss(total_cats)
class BatchSwapNoise(Module):
"Swap Noise Module"
def __init__(self, p): store_attr()
def forward(self, x):
if self.training:
mask = torch.rand(x.size()) > (1 - self.p)
l1 = torch.floor(torch.rand(x.size()) * x.size(0)).type(torch.LongTensor)
l2 = (mask.type(torch.LongTensor) * x.size(1))
res = (l1 * l2).view(-1)
idx = torch.arange(x.nelement()) + res
idx[idx>=x.nelement()] = idx[idx>=x.nelement()]-x.nelement()
return x.flatten()[idx].view(x.size())
else:
return x
Notice how it is like Dropout, where the noise is only added during training
Next we'll make a custom TabularAE
model (AutoEncoder) for us to use:
class TabularAE(TabularModel):
"A simple AutoEncoder model"
def __init__(self, emb_szs, n_cont, hidden_size, cats, low, high, ps=0.2, embed_p=0.01, bswap=None):
super().__init__(emb_szs, n_cont, layers=[1024, 512, 256], out_sz=hidden_size, embed_p=embed_p, act_cls=Mish())
self.bswap = bswap
self.cats = cats
self.activation_cats = sum([v for k,v in cats.items()])
self.layers = nn.Sequential(*L(self.layers.children())[:-1] + nn.Sequential(LinBnDrop(256, hidden_size, p=ps, act=Mish())))
if(bswap != None): self.noise = BatchSwapNoise(bswap)
self.decoder = nn.Sequential(
LinBnDrop(hidden_size, 256, p=ps, act=Mish()),
LinBnDrop(256, 512, p=ps, act=Mish()),
LinBnDrop(512, 1024, p=ps, act=Mish())
)
self.decoder_cont = nn.Sequential(
LinBnDrop(1024, n_cont, p=ps, bn=False, act=None),
SigmoidRange(low=low, high=high)
)
self.decoder_cat = LinBnDrop(1024, self.activation_cats, p=ps, bn=False, act=None)
def forward(self, x_cat, x_cont=None, encode=False):
if(self.bswap != None):
x_cat = self.noise(x_cat)
x_cont = self.noise(x_cont)
encoded = super().forward(x_cat, x_cont)
if encode: return encoded # return the representation
decoded_trunk = self.decoder(encoded)
decoded_cats = self.decoder_cat(decoded_trunk)
decoded_conts = self.decoder_cont(decoded_trunk)
return decoded_cats, decoded_conts
Towards the end we will look at how to extract our vector-representations, but those with keen-eyes can spot where it is in the above code. We can pass in an encode=False
parameter to forward
, and so long as we keep it to a False
default, it will not break inside of the fastai
training framework
Now let's build a model. We'll use a hidden layer size (vector representation) of 128, 10% dropout with 1% embedding dropout, 1% weight decay and a noise level of 10%. Along with this we will pass in our y_range
:
emb_szs = get_emb_sz(to.train)
model = TabularAE(emb_szs, len(cont_names), 128, ps=0.1, cats=total_cats, embed_p=0.01,
bswap=.1, low=tensor(low).cuda(), high=tensor(high).cuda())
And finally our Learner
:
learn = Learner(dls, model, loss_func=loss_func, wd=0.01, opt_func=ranger)
ranger
optimizer here. During experiments we found it could train much better representations than Adam with fit/fit_one_cycleAnd now we'll fit until we begin overfitting (with EarlyStoppingCallback
):
learn.fit_flat_cos(100, cbs=[EarlyStoppingCallback()], lr=4e-3)
As we can see the best model was only after 6 epochs! Let's see how our representations stack up
dl = learn.dls.test_dl(df)
And then we will predict over all the data using raw PyTorch
. Notice we are passing encode=True
to grab the representations:
outs = []
for batch in dl:
with torch.no_grad():
learn.model.eval()
learn.model.cpu()
out = learn.model(*batch[:2], encode=True).cpu().numpy()
outs.append(out)
outs = np.concatenate(outs)
And now we can verify that it is indeed a 128-long vector:
outs.shape
Finally we need the actual predictions and targets for the categorical and continuous variables:
(cat_preds, cont_preds), (cat_targs, cont_targs) = learn.get_preds(dl=dl)
For measuring their overall accuracy we will use an R2
score:
cont_preds = pd.DataFrame(cont_preds, columns=cont_names)
cont_targs = pd.DataFrame(cont_targs, columns=cont_names)
We'll decode our values manually via our stds
and means
:
preds = pd.DataFrame((cont_preds.values * stds.values) + means.values, columns=cont_preds.columns)
targets = pd.DataFrame((cont_targs.values * stds.values) + means.values, columns=cont_targs.columns)
And measure the min
, max
, mean
, median
, and R2
score:
from sklearn.metrics import r2_score
mi = (np.abs(targets-preds)).min().to_frame().T
ma = (np.abs(targets-preds)).max().to_frame().T
mean = (np.abs(targets-preds)).mean().to_frame().T
median = (np.abs(targets-preds)).median().to_frame().T
r2 = pd.DataFrame.from_dict({c:[r2_score(targets[c], preds[c])] for c in preds.columns})
for d,name in zip([mi,ma,mean,median,r2], ['Min', 'Max', 'Mean', 'Median', 'R2']):
d = d.insert(0, 'GroupBy', name)
Let's see how it looks:
data = pd.concat([mi,ma,mean,median,r2])
data
Those R2
values look very good! Let's take their mean:
r2.mean(axis=1)
93.5% is not bad at all for 6 epochs! Let's take a look at the categorical variables next
cat_reduced = torch.zeros_like(cat_targs)
pos=0
for i, (k,v) in enumerate(total_cats.items()):
cat_reduced[:,i] = cat_preds[:,pos:pos+v].argmax(dim=1)
pos += v
cat_preds = pd.DataFrame(cat_reduced, columns=cat_names)
cat_targs = pd.DataFrame(cat_targs, columns=cat_names)
We'll measure a balanced_accuracy
as well as an f1_score
:
from sklearn.metrics import balanced_accuracy_score, f1_score
accuracy = pd.DataFrame.from_dict({c:[balanced_accuracy_score(cat_targs[c], cat_preds[c])] for c in cat_preds.columns})
f1 = pd.DataFrame.from_dict({c:[f1_score(cat_targs[c], cat_preds[c], average='weighted')] for c in cat_preds.columns})
for d,name in zip([accuracy, f1], ['Accuracy', 'F1']):
d = d.insert(0, 'MetricName', name)
pd.concat([accuracy, f1])
We can see that our accuracy is a bit lower than our continuous variables, but the F1 scores look very strong!
Let's check the overall accuracy:
accuracy.mean(axis=1)
85% is honestly quite good in this situation. Now let's take a look at how to use these representations
ys = df['salary'].to_numpy()
And make a dataframe that holds the representations and our ys
:
df_outs = pd.DataFrame(columns=['salary'] + list(range(0,128)))
df_outs['salary'] = ys
df_outs[list(range(0,128))] = outs
df_outs[list(range(0,128))] = df_outs[list(range(0,128))].astype(np.float16)
Next we'll make a new TabularPandas
object and set it up exactly like we normally would for the Adult
problem:
cont_names = list(range(0,128))
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df_outs, procs = [Normalize], cont_names=cont_names, splits=splits, y_names=['salary'], reduce_memory=False,
y_block=CategoryBlock())
Build our DataLoaders
:
dls = to.dataloaders(bs=1024)
And then train
We need to redefine
accuracy
since currently it holds aDataFrame
def accuracy(inp, targ, axis=-1):
"Compute accuracy with `targ` when `pred` is bs * n_classes"
pred,targ = flatten_check(inp.argmax(dim=axis), targ)
return (pred == targ).float().mean()
learn = tabular_learner(dls, layers=[200,100], metrics=[accuracy])
learn.fit(5, 1e-2)
And we can see our model achieved a better accuracy than the original data!