A guide showing how to bring in pregenerated statistics for tabular data

This article is also a Jupyter Notebook that can be run from the top down. The code snippets in it can be run in any environment.

Below are the versions of fastai, fastcore, and wwf in use at the time of writing:

  • fastai: 2.0.16
  • fastcore: 1.1.2
  • wwf: 0.0.4

Why Use Predetermined Stats?

If fastai will simply let us pass everything to a TabularPandas object to preprocess and train on, why should we want custom statistics for our data?

Let's try to think of a scenario.

Suppose my data is a few trillion rows, so there is no way (currently) I can load the whole DataFrame into memory at once. What do I do? Perhaps I would want to train on batches of my data at a time (an article on this will come soon). To do this, though, I would need all of my procs predetermined from the start, so that every transform is applied the same way across all of our mini-batches of data.
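
To make the scenario a little more concrete, here is a minimal sketch of how a mean and standard deviation could be gathered one batch at a time, without ever holding the full dataset in memory. It is plain Python rather than anything from fastai or wwf, and the names update_stats and finalize are hypothetical. It assumes a population standard deviation and uses simple running sums; for very large or poorly scaled data a numerically sturdier scheme such as Welford's algorithm would be a better choice.

```python
import math

def update_stats(state, batch):
    """Fold one batch of numbers into running (count, total, total_sq) sums."""
    count, total, total_sq = state
    for x in batch:
        count += 1
        total += x
        total_sq += x * x
    return count, total, total_sq

def finalize(state):
    """Turn the running sums into a mean and a population standard deviation."""
    count, total, total_sq = state
    mean = total / count
    var = total_sq / count - mean * mean
    # Guard against tiny negative values caused by floating point rounding
    return mean, math.sqrt(max(var, 0.0))

# Two small batches stand in for chunks of a dataset too large to load at once
batches = [[0, 1, 2], [3, 4]]
state = (0, 0.0, 0.0)
for batch in batches:
    state = update_stats(state, batch)

mean, std = finalize(state)
```

Statistics gathered this way are the kind of values we would then want to hand to our procs up front.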

Currently there is an open PR for this integration, so for now it will live inside Walk with fastai, and we'll show how to use it as well!

Before we begin, let's import the tabular module:

from fastai.tabular.all import *

Modifying the procs

Now let's modify each of our procs to have this ability, since it's currently not there!

Categorify

The first one we will look at is Categorify. Currently the source code looks like so:

class Categorify(TabularProc):
    "Transform the categorical variables to something similar to `pd.Categorical`"
    order = 1
    def setups(self, to):
        store_attr(classes={n:CategoryMap(to.iloc[:,n].items, add_na=(n in to.cat_names)) for n in to.cat_names})

    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]

Our modification needs to do two things: __init__ needs an option to pass in a dictionary of class mappings, and setups needs to generate class mappings for any columns not passed in. Let's do so below:

class Categorify(TabularProc):
    "Transform the categorical variables to something similar to `pd.Categorical`"
    order = 1
    def __init__(self, classes=None):
        if classes is None: classes = defaultdict(L)
        store_attr()
        super().__init__()
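    def setups(self, to):
        for n in to.cat_names:
            # Keep any mapping that was passed in; build one only for columns
            # without a mapping or that are already a pandas categorical dtype
            if n not in self.classes or is_categorical_dtype(to[n]):
                self.classes[n] = CategoryMap(to.iloc[:,n].items, add_na=True)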

    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]

Now we have successfully set up our Categorify. Let's look at a quick example below.

We'll make a DataFrame with two category columns:

df = pd.DataFrame({'a':[0,1,2,0,2], 'b': ['a', 'b', 'a', 'c', 'b']})
df.head()
a b
0 0 a
1 1 b
2 2 a
3 0 c
4 2 b

Next we want to specify the classes for a ourselves. We'll let them range up to 4 rather than the maximum of 2 shown in our DataFrame:

tst_classes = {'a':L(['#na#',0,1,2,3,4])}; tst_classes
{'a': (#6) ['#na#',0,1,2,3,4]}

Finally we will build a TabularPandas object with a modified version of Categorify:

to = TabularPandas(df, Categorify(classes=tst_classes), ['a','b'])

How do we tell it worked, though? Let's check out to.classes, which is a shortcut for to.procs.categorify.classes.

What we should see is the dictionary we passed in for a, which we do!

to.classes
{'a': (#6) ['#na#',0,1,2,3,4], 'b': ['#na#', 'a', 'b', 'c']}
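
Why does a fixed mapping matter? Here is a minimal plain-Python sketch, a stand-in rather than fastai's actual encoding code, of how a predetermined vocabulary keeps the integer codes stable from one batch to the next. The encode helper and batch_2 are hypothetical, and the sketch assumes that index 0 is reserved for '#na#' so that unknown values fall back to it.

```python
def encode(values, vocab):
    """Map each value to its index in `vocab`; unknown values fall back to 0 ('#na#')."""
    o2i = {v: i for i, v in enumerate(vocab)}
    return [o2i.get(v, 0) for v in values]

vocab = ['#na#', 0, 1, 2, 3, 4]   # same contents as tst_classes['a'] above
batch_1 = [0, 1, 2, 0, 2]         # column `a` from the DataFrame above
batch_2 = [4, 3, 0, 7]            # hypothetical later batch: 3 and 4 are new, 7 is unknown

codes_1 = encode(batch_1, vocab)  # [1, 2, 3, 1, 3]
codes_2 = encode(batch_2, vocab)  # [5, 4, 1, 0]
```

Because the vocabulary was fixed up front, the value 0 maps to code 1 in both batches, and the values 3 and 4 already have a slot even though the first batch never contained them.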

Normalize

Next let's move on to Normalize. Things get a bit tricky here because we also need to update the base Normalize transform.

Why?

Currently fastai's tabular Normalize proc overrides the setups of the base Normalize and stores away our means and stds there. What we need is an option to pass our own means and stds into the base Normalize.

Let's do so here with @patch:

Normalize.__init__(x:Normalize, mean=None, std=None, axes=(0, 2, 3), means=None, stds=None)

Very nice little one-liner.

Integrating with tabular, though, will not be as nice as a one-liner. Our user scenario looks something like so:

We can pass in custom means or custom standard deviations, and these should be in the form of a dictionary similar to how we had our classes earlier. Let's modify setups to account for this:

setups(to:Tabular)

How do we test this?

We'll do a similar scenario to our Categorify example earlier. We'll have one column:

df = pd.DataFrame({'a':[0,1,2,3,4]})
df.head()
a
0 0
1 1
2 2
3 3
4 4

And normalize it with some custom statistics. In our case we'll use 3 for the mean and 1 for the standard deviation:

tst_means,tst_stds = {'a':3.}, {'a': 1.}

We'll pass this into Normalize and build a TabularPandas object:

norm = Normalize(means=tst_means, stds=tst_stds)
to = TabularPandas(df, norm, cont_names='a')

We can then check our mean and std values:

to.means, to.stds
({'a': 3.0}, {'a': 1.0})

And they line up!
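
To see what those statistics then do to the data, here is a minimal plain-Python sketch of the idea: any statistic we pass in is kept, anything we did not pass is computed from the data, and each value becomes (x - mean) / std. The helpers resolve_stats and normalize and the extra column b are hypothetical, and the population standard deviation used for the fallback is an assumption of this sketch, not a statement about the wwf code.

```python
import statistics

def resolve_stats(columns, means=None, stds=None):
    """Keep any user-supplied statistic; compute the rest from the data."""
    means, stds = dict(means or {}), dict(stds or {})
    for name, values in columns.items():
        if name not in means:
            means[name] = statistics.fmean(values)
        if name not in stds:
            stds[name] = statistics.pstdev(values)  # population std (assumption)
    return means, stds

def normalize(values, mean, std):
    """Standardize a column with the given mean and standard deviation."""
    return [(v - mean) / std for v in values]

# `a` matches the DataFrame above; `b` is a hypothetical column without custom stats
columns = {'a': [0, 1, 2, 3, 4], 'b': [2, 4, 6, 8, 10]}
means, stds = resolve_stats(columns, means={'a': 3.0}, stds={'a': 1.0})

normed_a = normalize(columns['a'], means['a'], stds['a'])  # [-3.0, -2.0, -1.0, 0.0, 1.0]
```

With our custom mean of 3 and standard deviation of 1, column a is simply shifted down by 3, while b falls back to statistics computed from its own values.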

FillMissing

The last preprocessor is FillMissing. For this one we want to give fastai the ability to accept custom na_dicts, as this is where the information is stored on which continuous columns contain missing values!

Compared to the last two, this integration is pretty trivial. First we'll give __init__ the ability to accept an na_dict, then our setups needs to check whether we already have an na_dict and which columns are missing from it. First let's look at the old:

class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        store_attr()

    def setups(self, dsets):
        missing = pd.isnull(dsets.conts).any()
        store_attr(na_dict={n:self.fill_strategy(dsets[n], self.fill_vals[n])
                            for n in missing[missing].keys()})
        self.fill_strategy = self.fill_strategy.__name__

    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any()[missing.any()].keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
        for n in self.na_dict.keys():
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')

Followed by our new:

FillMissing(fill_strategy=median, add_col=True, fill_vals=None) :: TabularProc

Fill the missing values in continuous columns.

We can see that our setups checks which cont_names are not yet in our na_dict and then updates it with those missing keys. Let's test it out below:

df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4], 'b': [np.nan,1,2,3,4,5,6]})
df.head()
a b
0 0.0 NaN
1 1.0 1.0
2 NaN 2.0
3 1.0 3.0
4 2.0 4.0

We'll pass in a dictionary for a but not b:

fill = FillMissing(na_dict={'a': 2.0}) 
to = TabularPandas(df, fill, cont_names=['a', 'b'])

And now let's look at our na_dict:

to.na_dict
{'a': 2.0, 'b': 3.5}

We can see that it all works!
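
For anyone curious about the logic behind that result, here is a minimal plain-Python sketch of the same idea: fill values we supply are kept, and any other column with missing data gets its median. The helpers build_na_dict and fill_missing are hypothetical stand-ins, and the median fallback is assumed here because it is the default fill_strategy shown in the original class.

```python
import math
import statistics

def build_na_dict(columns, na_dict=None):
    """Keep supplied fill values; use the median for other columns with missing data."""
    na_dict = dict(na_dict or {})
    for name, values in columns.items():
        present = [v for v in values if not math.isnan(v)]
        has_missing = len(present) < len(values)
        if has_missing and name not in na_dict:
            na_dict[name] = statistics.median(present)
    return na_dict

def fill_missing(values, fill_value):
    """Replace every NaN in a column with the chosen fill value."""
    return [fill_value if math.isnan(v) else v for v in values]

nan = float('nan')
# Same data as the DataFrame above
columns = {'a': [0, 1, nan, 1, 2, 3, 4], 'b': [nan, 1, 2, 3, 4, 5, 6]}

na_dict = build_na_dict(columns, na_dict={'a': 2.0})  # {'a': 2.0, 'b': 3.5}
filled_a = fill_missing(columns['a'], na_dict['a'])
```

Column a keeps the 2.0 we passed in, while b picks up 3.5, the median of its six non-missing values.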

Full Integration Example

Now for those folks who don't particularly care about how we got to this point and simply want to use it, we'll do the following:

from wwf.tab.stats import *
from fastai.tabular.all import *

We'll make an example from the ADULT_SAMPLE dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

We'll set everything up as we normally would for our TabularPandas:

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
splits = RandomSplitter()(range_of(df))

Except we'll define every proc ourselves. For our Categorify example we will use relationship, Normalize will use age, and FillMissing will use education-num:

Categorify

First let's find those values:

df['relationship'].unique()
array([' Wife', ' Not-in-family', ' Unmarried', ' Husband', ' Own-child',
       ' Other-relative'], dtype=object)

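And we'll set that as a dictionary, wrapped in an L with the '#na#' placeholder first (as in the earlier example) and with a Single class added as well:

classes = {'relationship': L(['#na#'] + list(df['relationship'].unique()) + [' Single '])}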

And pass it to our Categorify:

cat = Categorify(classes=classes)

Normalize

Next we have Normalize. We'll use a (very) wrong mean and standard deviation of 15. and 7.:

means,stds = {'age':15.}, {'age': 7.}

And pass it to Normalize:

norm = Normalize(means=means, stds=stds)

FillMissing

Lastly we have our FillMissing, where we will simply fill education-num with 5.:

na_dict = {'education-num':5.}

And pass it in:

fill = FillMissing(na_dict=na_dict) 

Bringing it together

Now let's build our TabularPandas object:

procs = [cat, norm, fill]
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=['salary'], splits=splits)

And we're done!

Thanks again for reading, and I hope this article helps you with your tabular endeavors!