A guide showing how to bring pregenerated statistics into fastai's tabular preprocessing


This article is also a Jupyter Notebook available to be run from the top down. There will be code snippets that you can then run in any environment.

Below are the versions of fastai, fastcore, and wwf currently running at the time of writing this:

• fastai: 2.0.16
• fastcore: 1.1.2
• wwf: 0.0.4

## Why Use Predetermined Stats?

If fastai will simply let us pass everything to a TabularPandas object to preprocess and train on, why should we bother supplying custom statistics for our data?

Let's try to think of a scenario.

My data is a few trillion rows, so there is no way (currently) I can load this DataFrame into memory at once. What do I do? Perhaps I would want to train on batches of my data at a time (article on this will come soon). To do this though I would need all of my procs predetermined from the start so every transform is done the same across all of our mini-batches of data.
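The core idea can be sketched in plain Python: fix the statistics once up front, then apply the exact same transform to every chunk (the normalize function below is only a stand-in for what the procs will do for us):

```python
# Statistics fixed once, before any chunk of data is seen
mean, std = 3.0, 1.0

def normalize(chunk):
    """Scale a chunk with the predetermined stats, never with the chunk's own."""
    return [(x - mean) / std for x in chunk]

# Two mini-batches of a dataset too big to load at once:
# both are scaled identically because the stats never change
chunk1, chunk2 = [0, 1, 2], [3, 4]
normalize(chunk1) + normalize(chunk2)
```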

Currently there is an open PR for this integration, so for now this will live inside of Walk with fastai and we'll show how to use it as well!

Before we begin, let's import the tabular module (from fastai.tabular.all import *).

## Modifying the procs

Now let's modify each of our procs to have this ability, as it's currently not there!

### Categorify

The first one we will look at is Categorify. Currently the source code looks like so:

```python
class Categorify(TabularProc):
    "Transform the categorical variables to something similar to pd.Categorical"
    order = 1
    def setups(self, to):
        store_attr(classes={n:CategoryMap(to.iloc[:,n].items, add_na=(n in to.cat_names)) for n in to.cat_names})

    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]
```


Our modification needs to do two things: __init__ needs an option to pass in a dictionary of class mappings, and setups needs to generate class mappings only for those columns that weren't passed in. Let's do so below:

## class Categorify [source]

```
Categorify(enc=None, dec=None, split_idx=None, order=None) :: TabularProc
```

Transform the categorical variables to something similar to pd.Categorical

```python
class Categorify(TabularProc):
    "Transform the categorical variables to something similar to pd.Categorical"
    order = 1
    def __init__(self, classes=None):
        if classes is None: classes = defaultdict(L)
        store_attr()
        super().__init__()
    def setups(self, to):
        for n in to.cat_names:
            # only build a vocab for columns the user didn't pass in
            if n not in self.classes or is_categorical_dtype(to[n]):
                self.classes[n] = CategoryMap(to.iloc[:,n].items, add_na=(n in to.cat_names))

    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]
```
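Conceptually, the merge logic boils down to this plain-Python sketch (build_classes is a made-up helper standing in for what setups does with CategoryMap; fastai reserves '#na#' at index 0 and sorts the vocab):

```python
def build_classes(data, cat_names, passed=None):
    """Merge user-supplied class mappings with vocabularies built from the data.

    Columns present in `passed` keep their mapping; the rest get a vocab built
    from the column's values, with '#na#' reserved at index 0."""
    classes = dict(passed or {})
    for n in cat_names:
        if n not in classes:
            seen, vocab = set(), []
            for v in data[n]:
                if v not in seen:
                    seen.add(v)
                    vocab.append(v)
            classes[n] = ['#na#'] + sorted(vocab, key=str)
    return classes

data = {'a': [0, 1, 2, 0, 2], 'b': ['a', 'b', 'a', 'c', 'b']}
build_classes(data, ['a', 'b'], passed={'a': ['#na#', 0, 1, 2, 3, 4]})
# {'a': ['#na#', 0, 1, 2, 3, 4], 'b': ['#na#', 'a', 'b', 'c']}
```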


Now we have successfully set up our Categorify. Let's look at a quick example below.

We'll make a DataFrame with two category columns:

```python
df = pd.DataFrame({'a':[0,1,2,0,2], 'b': ['a', 'b', 'a', 'c', 'b']})
```

```
   a  b
0  0  a
1  1  b
2  2  a
3  0  c
4  2  b
```

Next we want to specify particular classes for a. We'll set a maximum range up to 4, rather than the 2 shown in our DataFrame:

```python
tst_classes = {'a':L(['#na#',0,1,2,3,4])}; tst_classes
```

```
{'a': (#6) ['#na#',0,1,2,3,4]}
```

Finally we will build a TabularPandas object with a modified version of Categorify:

```python
to = TabularPandas(df, Categorify(classes=tst_classes), ['a','b'])
```


How do we tell it worked though? Let's check out to.classes, which is a shortcut for to.procs.categorify.classes.

What we should see is our dictionary we mapped for a, which we do!

```python
to.classes
```

```
{'a': (#6) ['#na#',0,1,2,3,4], 'b': ['#na#', 'a', 'b', 'c']}
```
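To see why the extra classes matter, recall that Categorify encodes each value as its index in the class list, with index 0 reserved for '#na#' (which also catches unseen values). A quick illustration of that mapping, outside of fastai:

```python
classes = ['#na#', 0, 1, 2, 3, 4]
o2i = {v: i for i, v in enumerate(classes)}   # value -> integer code

col = [0, 1, 2, 0, 2]
codes = [o2i.get(v, 0) for v in col]          # unseen values fall back to '#na#'
# codes == [1, 2, 3, 1, 3]; the codes for 3 and 4 stay reserved for future data
```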

### Normalize

Next let's move onto Normalize. Things get a bit tricky here because we also need to update the base Normalize transform as well.

Why?

Currently fastai's Normalize tabular proc overrides the setups for Normalize by storing away our means and stds. What we need to do is have an option to pass in our means and stds in the base Normalize.

Let's do so here with @patch:

#### Normalize.__init__ [source]

```
Normalize.__init__(x:Normalize, mean=None, std=None, axes=(0, 2, 3), means=None, stds=None)
```

Initialize self. See help(type(self)) for accurate signature.

Very nice little one-liner.

Integrating with tabular though will not be so nice as a one-liner. Our user scenario looks something like so:

We can pass in custom means or custom standard deviations, and these should be in the form of a dictionary similar to how we had our classes earlier. Let's modify setups to account for this:

#### setups [source]

```
setups(to:Tabular)
```
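In plain Python, the behavior we want from the patched setups is: keep any statistic that was passed in and compute the rest from the data. A sketch of that logic (build_stats is an illustrative helper, not fastai API; fastai itself computes the std with ddof=0 and adds a small epsilon):

```python
def build_stats(data, cont_names, means=None, stds=None):
    """Merge user-supplied means/stds with ones computed from the data."""
    means, stds = dict(means or {}), dict(stds or {})
    for n in cont_names:
        col = data[n]
        mu = sum(col) / len(col)                      # population mean
        if n not in means:
            means[n] = mu
        if n not in stds:
            stds[n] = (sum((x - mu) ** 2 for x in col) / len(col)) ** 0.5
    return means, stds

build_stats({'a': [0, 1, 2, 3, 4]}, ['a'], means={'a': 3.}, stds={'a': 1.})
# ({'a': 3.0}, {'a': 1.0})  -- the passed-in stats win
```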

How do we test this?

We'll do a similar scenario to our Categorify example earlier. We'll have one column:

```python
df = pd.DataFrame({'a':[0,1,2,3,4]})
```

```
   a
0  0
1  1
2  2
3  3
4  4
```

And normalize it with some custom statistics. In our case we'll use 3 for the mean and 1 for the standard deviation:

```python
tst_means,tst_stds = {'a':3.}, {'a': 1.}
```


We'll pass this into Normalize and build a TabularPandas object:

```python
norm = Normalize(means=tst_means, stds=tst_stds)
to = TabularPandas(df, norm, cont_names='a')
```


We can then check our mean and std values:

```python
to.means, to.stds
```

```
({'a': 3.0}, {'a': 1.0})
```

And they line up!

### FillMissing

The last preprocessor is FillMissing. For this one we want to give fastai the ability to accept custom na_dicts, as this is where the information on which continuous columns contain missing values is stored!

Compared to the last two, this integration is fairly trivial. First we'll give __init__ the ability to accept an na_dict, then our setups needs to check whether we already have an na_dict and which columns are missing from it. First let's look at the old version:

```python
class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        store_attr()

    def setups(self, dsets):
        missing = pd.isnull(dsets.conts).any()
        store_attr(na_dict={n:self.fill_strategy(dsets[n], self.fill_vals[n])
                            for n in missing[missing].keys()})
        self.fill_strategy = self.fill_strategy.__name__

    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any()[missing.any()].keys():
            assert n in self.na_dict, f"nan values in {n} but not in setup training set"
        for n in self.na_dict.keys():
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')
```

Followed by our new:

## class FillMissing [source]

```
FillMissing(fill_strategy=median, add_col=True, fill_vals=None) :: TabularProc
```

Fill the missing values in continuous columns.
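Stripped of the fastai plumbing, the new setups logic amounts to this plain-Python sketch (build_na_dict is a made-up helper; fastai's default fill strategy is the column median):

```python
import math
import statistics

def build_na_dict(data, cont_names, na_dict=None, fill_strategy=statistics.median):
    """Merge a user-supplied na_dict with computed fill values for any other
    continuous column that actually contains missing values."""
    na_dict = dict(na_dict or {})
    for n in cont_names:
        present = [x for x in data[n] if not math.isnan(x)]
        has_na = len(present) < len(data[n])
        if has_na and n not in na_dict:
            na_dict[n] = fill_strategy(present)
    return na_dict

data = {'a': [0, 1, float('nan'), 1, 2, 3, 4], 'b': [float('nan'), 1, 2, 3, 4, 5, 6]}
build_na_dict(data, ['a', 'b'], na_dict={'a': 2.0})
# {'a': 2.0, 'b': 3.5}  -- 'a' keeps the passed value, 'b' gets its median
```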

We can see our setups checks for what new cont_names we have and then updates our na_dict with those missing keys. Let's test it out below:

```python
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4], 'b': [np.nan,1,2,3,4,5,6]})
```

```
     a    b
0  0.0  NaN
1  1.0  1.0
2  NaN  2.0
3  1.0  3.0
4  2.0  4.0
```

We'll pass in a dictionary for a but not b:

```python
fill = FillMissing(na_dict={'a': 2.0})
to = TabularPandas(df, fill, cont_names=['a', 'b'])
```


And now let's look at our na_dict:

```python
to.na_dict
```

```
{'a': 2.0, 'b': 3.5}
```

We can see that it all works!
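For completeness, here's what encodes then does with that na_dict, in plain-Python terms: replace NaNs with the stored fill value and (when add_col=True) record where they were in a new boolean _na column:

```python
import math

na_dict = {'a': 2.0, 'b': 3.5}
data = {'a': [0., 1., float('nan'), 1.], 'b': [float('nan'), 1., 2., 3.]}

filled = {}
for n, fill in na_dict.items():
    was_na = [math.isnan(x) for x in data[n]]
    filled[n] = [fill if m else x for m, x in zip(was_na, data[n])]
    filled[n + '_na'] = was_na          # becomes a new categorical column
# filled['a'] == [0.0, 1.0, 2.0, 1.0]; filled['b_na'] == [True, False, False, False]
```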

## Full Integration Example

Now, for those folks who don't particularly care about how we got to this point and simply want to use it, we'll do the following:

```python
from wwf.tab.stats import *
from fastai.tabular.all import *
```


We'll make an example from the ADULT_SAMPLE dataset:

```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
```

```
  age  workclass         fnlwgt  education    education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary
0  49  Private           101320  Assoc-acdm   12.0           Married-civ-spouse  NaN              Wife           White               Female  0             1902          40              United-States   >=50k
1  44  Private           236746  Masters      14.0           Divorced            Exec-managerial  Not-in-family  White               Male    10520         0             45              United-States   >=50k
2  38  Private           96185   HS-grad      NaN            Divorced            NaN              Unmarried      Black               Female  0             0             32              United-States   <50k
3  38  Self-emp-inc      112847  Prof-school  15.0           Married-civ-spouse  Prof-specialty   Husband        Asian-Pac-Islander  Male    0             0             40              United-States   >=50k
4  42  Self-emp-not-inc  82297   7th-8th      NaN            Married-civ-spouse  Other-service    Wife           Black               Female  0             0             50              United-States   <50k
```

We'll set everything up as we normally would for our TabularPandas:

```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
splits = RandomSplitter()(range_of(df))
```


Except we'll define every proc ourselves. For our Categorify example we will use relationship, Normalize will use age, and FillMissing will use education-num:

### Categorify

First let's find those values:

```python
df['relationship'].unique()
```

```
array([' Wife', ' Not-in-family', ' Unmarried', ' Husband', ' Own-child',
       ' Other-relative'], dtype=object)
```

And we'll set that as a dictionary, adding an extra ' Single ' class as well:

```python
classes = {'relationship': list(df['relationship'].unique()) + [' Single ']}
```


And pass it to our Categorify:

```python
cat = Categorify(classes=classes)
```


### Normalize

Next we have Normalize. We'll use a (very) wrong mean and standard deviation of 15. and 7.:

```python
means,stds = {'age':15.}, {'age': 7.}
```


And pass it to Normalize:

```python
norm = Normalize(means=means, stds=stds)
```


### FillMissing

Lastly we have our FillMissing, which we will simply fill with 5.:

```python
na_dict = {'education-num':5.}
```


And pass it in:

```python
fill = FillMissing(na_dict=na_dict)
```


### Bringing it together

Now let's build our TabularPandas object:

```python
procs = [cat, norm, fill]
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=['salary'], splits=splits)
```


And we're done!