Why Use Predetermined Stats?
If fastai will simply let us pass everything to a TabularPandas object to preprocess and train on, why would we want custom statistics for our data?
Let's try to think of a scenario.
My data is a few trillion rows, so there is no way (currently) I can load this DataFrame into memory at once. What do I do? Perhaps I would want to train on batches of my data at a time (an article on this will come soon). To do this though, I would need all of my procs predetermined from the start, so every transform is applied the same way across all of our mini-batches of data.
Currently there is an open PR for this integration, so for now this will live inside of Walk with fastai and we'll show how to use it as well!
Before we begin, let's import the tabular module:
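from fastai.tabular.all import *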
Categorify
The first one we will look at is Categorify. Currently the source code looks like so:
class Categorify(TabularProc):
    "Transform the categorical variables to something similar to `pd.Categorical`"
    order = 1
    def setups(self, to):
        store_attr(classes={n:CategoryMap(to.iloc[:,n].items, add_na=(n in to.cat_names)) for n in to.cat_names})
    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]
For our modification, __init__ needs an option to pass in a dictionary of class mappings, and setups needs to generate class mappings only for the columns that weren't passed in. Let's do so below:
class Categorify(TabularProc):
    "Transform the categorical variables to something similar to `pd.Categorical`"
    order = 1
    def __init__(self, classes=None):
        # Columns without a passed-in mapping default to an empty `L`, filled during `setups`
        if classes is None: classes = defaultdict(L)
        store_attr()
        super().__init__()
    def setups(self, to):
        for n in to.cat_names:
            # Only build a `CategoryMap` from the data if the column wasn't passed in
            # (or is already a pandas Categorical, whose categories we want to respect)
            if n not in self.classes or is_categorical_dtype(to[n]):
                self.classes[n] = CategoryMap(to.iloc[:,n].items, add_na=n)  # `add_na=n` is truthy, so '#na#' is always added
    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]
Now we have successfully set up our Categorify. Let's look at a quick example below. We'll make a DataFrame with two category columns:
df = pd.DataFrame({'a':[0,1,2,0,2], 'b': ['a', 'b', 'a', 'c', 'b']})
df.head()
Next we want to specify custom classes for a. We'll set a maximum range up to 4, rather than the 2 shown in our DataFrame:
tst_classes = {'a':L(['#na#',0,1,2,3,4])}; tst_classes
Finally we will build a TabularPandas object with a modified version of Categorify:
to = TabularPandas(df, Categorify(classes=tst_classes), ['a','b'])
How can we tell it worked though? Let's check out to.classes, which is a shortcut for to.procs.categorify.classes. What we should see is the dictionary we mapped for a, and we do!
to.classes
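If everything worked, we should see the mapping we passed in for a plus a generated mapping for b, something like {'a': ['#na#', 0, 1, 2, 3, 4], 'b': ['#na#', 'a', 'b', 'c']}.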
Normalize
Next let's move on to Normalize. Things get a bit tricky here, because we also need to update the base Normalize transform.
Why?
Currently fastai's tabular Normalize proc overrides the setups of Normalize by storing away our means and stds. What we need is an option to pass in our means and stds to the base Normalize.
Let's do so here with @patch:
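A minimal sketch of that patch, assuming fastai's base Normalize signature of mean, std, and axes:

@patch
def __init__(self:Normalize, mean=None, std=None, axes=(0,2,3), means=None, stds=None): store_attr()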
Very nice little one-liner.
Integrating with tabular though will not be as nice as a one-liner. Our user scenario looks something like so: we can pass in custom means or custom standard deviations, and these should be in the form of a dictionary, similar to how we had our classes earlier. Let's modify setups to account for this:
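Here's a sketch of what that modified setups can look like (assuming the patched __init__ above; we only compute statistics for columns the user didn't supply):

@patch
def setups(self:Normalize, to:Tabular):
    # Start from any user-passed statistics
    means = self.means if self.means is not None else {}
    stds  = self.stds  if self.stds  is not None else {}
    for n in to.cont_names:
        # Fill in anything missing from the training data
        if n not in means: means[n] = to[n].mean()
        if n not in stds:  stds[n]  = to[n].std(ddof=0) + 1e-7
    self.means, self.stds = means, stds
    return self(to)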
How do we test this? We'll use a similar scenario to our Categorify example earlier, with one column:
df = pd.DataFrame({'a':[0,1,2,3,4]})
df.head()
And normalize it with some custom statistics. In our case we'll use 3 for the mean and 1 for the standard deviation:
tst_means,tst_stds = {'a':3.}, {'a': 1.}
We'll pass these into Normalize and build a TabularPandas object:
norm = Normalize(means=tst_means, stds=tst_stds)
to = TabularPandas(df, norm, cont_names='a')
We can then check our mean and std values:
to.means, to.stds
And they line up!
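Since a was normalized as (x - 3) / 1, we can also sanity-check the encoded values themselves (a hypothetical check):

to.conts['a'].values  # expect array([-3., -2., -1.,  0.,  1.])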
FillMissing
The last preprocessor is FillMissing. For this one we want to give fastai the ability to accept a custom na_dict, as this is where the information is stored about which continuous columns contain missing values!
Compared to the last two, this integration is pretty trivial. First we'll give __init__ the ability to accept an na_dict, then our setups needs to check whether we already have an na_dict and which columns are missing from it. First let's look at the old:
class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        store_attr()
    def setups(self, dsets):
        missing = pd.isnull(dsets.conts).any()
        store_attr(na_dict={n:self.fill_strategy(dsets[n], self.fill_vals[n])
                            for n in missing[missing].keys()})
        self.fill_strategy = self.fill_strategy.__name__
    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any()[missing.any()].keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
        for n in self.na_dict.keys():
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')
Followed by our new:
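Something like the following (a sketch; encodes stays the same as before, while setups only fills in columns not already present in our na_dict):

class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None, na_dict=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        # Accept a user-passed `na_dict`, defaulting to an empty one
        if na_dict is None: na_dict = {}
        store_attr()
    def setups(self, dsets):
        missing = pd.isnull(dsets.conts).any()
        # Only compute fill values for missing columns we weren't given
        missing_conts = [n for n in missing[missing].keys() if n not in self.na_dict]
        self.na_dict.update({n:self.fill_strategy(dsets[n], self.fill_vals[n]) for n in missing_conts})
        self.fill_strategy = self.fill_strategy.__name__
    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any()[missing.any()].keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
        for n in self.na_dict.keys():
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')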
We can see our setups checks which of our cont_names aren't already in the na_dict, and then updates it with those missing keys. Let's test it out below:
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4], 'b': [np.nan,1,2,3,4,5,6]})
df.head()
We'll pass in a dictionary for a but not b:
fill = FillMissing(na_dict={'a': 2.0})
to = TabularPandas(df, fill, cont_names=['a', 'b'])
And now let's look at our na_dict:
to.na_dict
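We'd expect our custom value for a alongside a computed median for b, something like {'a': 2.0, 'b': 3.5}.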
We can see that it all works!
A Full Example
Now let's see all of these modified procs in action on real data. First, our imports:

from wwf.tab.stats import *
from fastai.tabular.all import *
We'll make an example from the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
We'll set everything up as we normally would for our TabularPandas:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
splits = RandomSplitter()(range_of(df))
Except we'll define every proc ourselves. For our Categorify example we will use relationship, Normalize will use age, and FillMissing will use education-num.
Categorify
First let's find the unique values of relationship:
df['relationship'].unique()
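In ADULT_SAMPLE these look something like array([' Not-in-family', ' Husband', ' Wife', ' Own-child', ' Unmarried', ' Other-relative'], dtype=object); note the leading spaces in this dataset's strings.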
And we'll set that as a dictionary, adding an extra ' Single ' class as well:

classes = {'relationship': list(df['relationship'].unique()) + [' Single ']}
And pass it to our Categorify:
cat = Categorify(classes=classes)
Normalize
Next we'll do the same for Normalize, with a custom mean of 15 and standard deviation of 7 for age:

means,stds = {'age':15.}, {'age': 7.}

And pass it to Normalize:
norm = Normalize(means=means, stds=stds)
FillMissing
Lastly we have our FillMissing, whose missing values we will simply fill with a value of 5:
na_dict = {'education-num':5.}
And pass it in:
fill = FillMissing(na_dict=na_dict)
Bringing it together
Now let's build our TabularPandas
object:
procs = [cat, norm, fill]
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=['salary'], splits=splits)
One thing to note about our cat_names and cont_names here: FillMissing may override them (it can append new _na columns to our categorical variables). And we're done!
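As a final (hypothetical) sanity check, we can confirm our custom statistics made it through, then build our DataLoaders and train as usual:

# Our custom classes, statistics, and fill values should all be present
to.classes['relationship'], to.means, to.na_dict
# From here everything proceeds as normal
dls = to.dataloaders(bs=64)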
Thanks again for reading, and I hope this article helps you with your tabular endeavors!