Why Use Predetermined Stats?
If fastai will simply let us pass everything to a TabularPandas object to preprocess and train on, why would we want custom statistics for our data?
Let's try to think of a scenario.
My data is a few trillion rows, so there is no way (currently) I can load this DataFrame into memory at once. What do I do? Perhaps I would want to train on batches of my data at a time (an article on this will come soon). To do this though, I would need all of my procs predetermined from the start, so every transform is applied the same way across all of our mini-batches of data.
Currently there is an open PR for this integration, so for now this will live inside of Walk with fastai and we'll show how to use it as well!
Before we begin, let's import the tabular module:
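from fastai.tabular.all import *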
Categorify
The first one we will look at is Categorify. Currently the source code looks like so:
class Categorify(TabularProc):
    "Transform the categorical variables to something similar to `pd.Categorical`"
    order = 1
    def setups(self, to):
        store_attr(classes={n:CategoryMap(to.iloc[:,n].items, add_na=(n in to.cat_names)) for n in to.cat_names})
    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]
For our modification, __init__ needs an option to pass in a dictionary of class mappings, and setups needs to generate class mappings only for the columns that weren't passed in. Let's do so below:
class Categorify(TabularProc):
    "Transform the categorical variables to something similar to `pd.Categorical`"
    order = 1
    def __init__(self, classes=None):
        # Columns without a passed-in mapping default to an empty `L`, filled during `setups`
        if classes is None: classes = defaultdict(L)
        store_attr()
        super().__init__()
    def setups(self, to):
        for n in to.cat_names:
            # Only build a `CategoryMap` from the data if the column wasn't passed in
            # (or is already a pandas Categorical, whose categories we want to respect)
            if n not in self.classes or is_categorical_dtype(to[n]):
                self.classes[n] = CategoryMap(to.iloc[:,n].items, add_na=n)  # `add_na=n` is truthy, so '#na#' is always added
    def encodes(self, to): to.transform(to.cat_names, partial(_apply_cats, self.classes, 1))
    def decodes(self, to): to.transform(to.cat_names, partial(_decode_cats, self.classes))
    def __getitem__(self,k): return self.classes[k]
Now we have successfully set up our Categorify. Let's look at a quick example below. We'll make a DataFrame with two category columns:
df = pd.DataFrame({'a':[0,1,2,0,2], 'b': ['a', 'b', 'a', 'c', 'b']})
df.head()
Next we want to specify custom classes for a. We'll set a maximum range up to 4, rather than the 2 shown in our DataFrame:
tst_classes = {'a':L(['#na#',0,1,2,3,4])}; tst_classes
Finally we will build a TabularPandas object with a modified version of Categorify:
to = TabularPandas(df, Categorify(classes=tst_classes), ['a','b'])
How can we tell it worked though? Let's check out to.classes, which is a shortcut for to.procs.categorify.classes. What we should see is the dictionary we mapped for a, and we do!
to.classes
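If everything worked, we should see the mapping we passed in for a plus a generated mapping for b, something like {'a': ['#na#', 0, 1, 2, 3, 4], 'b': ['#na#', 'a', 'b', 'c']}.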
Normalize
Next let's move on to Normalize. Things get a bit tricky here, because we also need to update the base Normalize transform.
Why?
Currently fastai's tabular Normalize proc overrides the setups of Normalize by storing away our means and stds. What we need is an option to pass in our means and stds to the base Normalize.
Let's do so here with @patch:
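A minimal sketch of that patch, assuming fastai's base Normalize signature of mean, std, and axes:

@patch
def __init__(self:Normalize, mean=None, std=None, axes=(0,2,3), means=None, stds=None): store_attr()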
Very nice little one-liner.
Integrating with tabular though will not be as nice as a one-liner. Our user scenario looks something like so: we can pass in custom means or custom standard deviations, and these should be in the form of a dictionary, similar to how we had our classes earlier. Let's modify setups to account for this:
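Here's a sketch of what that modified setups can look like (assuming the patched __init__ above; we only compute statistics for columns the user didn't supply):

@patch
def setups(self:Normalize, to:Tabular):
    # Start from any user-passed statistics
    means = self.means if self.means is not None else {}
    stds  = self.stds  if self.stds  is not None else {}
    for n in to.cont_names:
        # Fill in anything missing from the training data
        if n not in means: means[n] = to[n].mean()
        if n not in stds:  stds[n]  = to[n].std(ddof=0) + 1e-7
    self.means, self.stds = means, stds
    return self(to)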
How do we test this? We'll use a similar scenario to our Categorify example earlier, with one column:
df = pd.DataFrame({'a':[0,1,2,3,4]})
df.head()
And normalize it with some custom statistics. In our case we'll use 3 for the mean and 1 for the standard deviation:
tst_means,tst_stds = {'a':3.}, {'a': 1.}
We'll pass these into Normalize and build a TabularPandas object:
norm = Normalize(means=tst_means, stds=tst_stds)
to = TabularPandas(df, norm, cont_names='a')
We can then check our mean and std values:
to.means, to.stds
And they line up!
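Since a was normalized as (x - 3) / 1, we can also sanity-check the encoded values themselves (a hypothetical check):

to.conts['a'].values  # expect array([-3., -2., -1.,  0.,  1.])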
FillMissing
The last preprocessor is FillMissing. For this one we want to give fastai the ability to accept a custom na_dict, as this is where the information is stored about which continuous columns contain missing values!
Compared to the last two, this integration is pretty trivial. First we'll give __init__ the ability to accept an na_dict, then our setups needs to check whether we already have an na_dict and which columns are missing from it. First let's look at the old:
class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        store_attr()
    def setups(self, dsets):
        missing = pd.isnull(dsets.conts).any()
        store_attr(na_dict={n:self.fill_strategy(dsets[n], self.fill_vals[n])
                            for n in missing[missing].keys()})
        self.fill_strategy = self.fill_strategy.__name__
    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any()[missing.any()].keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
        for n in self.na_dict.keys():
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')
Followed by our new:
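Something like the following (a sketch; encodes stays the same as before, while setups only fills in columns not already present in our na_dict):

class FillMissing(TabularProc):
    "Fill the missing values in continuous columns."
    def __init__(self, fill_strategy=FillStrategy.median, add_col=True, fill_vals=None, na_dict=None):
        if fill_vals is None: fill_vals = defaultdict(int)
        # Accept a user-passed `na_dict`, defaulting to an empty one
        if na_dict is None: na_dict = {}
        store_attr()
    def setups(self, dsets):
        missing = pd.isnull(dsets.conts).any()
        # Only compute fill values for missing columns we weren't given
        missing_conts = [n for n in missing[missing].keys() if n not in self.na_dict]
        self.na_dict.update({n:self.fill_strategy(dsets[n], self.fill_vals[n]) for n in missing_conts})
        self.fill_strategy = self.fill_strategy.__name__
    def encodes(self, to):
        missing = pd.isnull(to.conts)
        for n in missing.any()[missing.any()].keys():
            assert n in self.na_dict, f"nan values in `{n}` but not in setup training set"
        for n in self.na_dict.keys():
            to[n].fillna(self.na_dict[n], inplace=True)
            if self.add_col:
                to.loc[:,n+'_na'] = missing[n]
                if n+'_na' not in to.cat_names: to.cat_names.append(n+'_na')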
We can see our setups checks which of our cont_names aren't already in the na_dict, and then updates it with those missing keys. Let's test it out below:
df = pd.DataFrame({'a':[0,1,np.nan,1,2,3,4], 'b': [np.nan,1,2,3,4,5,6]})
df.head()
We'll pass in a dictionary for a but not b:
fill = FillMissing(na_dict={'a': 2.0})
to = TabularPandas(df, fill, cont_names=['a', 'b'])
And now let's look at our na_dict:
to.na_dict
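We'd expect our custom value for a alongside a computed median for b, something like {'a': 2.0, 'b': 3.5}.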
We can see that it all works!
A Full Example
Now let's see all of these modified procs in action on real data. First, our imports:

from wwf.tab.stats import *
from fastai.tabular.all import *
We'll make an example from the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
We'll set everything up as we normally would for our TabularPandas:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
splits = RandomSplitter()(range_of(df))
Except we'll define every proc ourselves. For our Categorify example we will use relationship, Normalize will use age, and FillMissing will use education-num.
Categorify
First let's find the unique values of relationship:
df['relationship'].unique()
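In ADULT_SAMPLE these look something like array([' Not-in-family', ' Husband', ' Wife', ' Own-child', ' Unmarried', ' Other-relative'], dtype=object); note the leading spaces in this dataset's strings.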
And we'll set that as a dictionary, adding an extra ' Single ' class as well:

classes = {'relationship': list(df['relationship'].unique()) + [' Single ']}
And pass it to our Categorify:
cat = Categorify(classes=classes)
Normalize
Next we'll do the same for Normalize, with a custom mean of 15 and standard deviation of 7 for age:

means,stds = {'age':15.}, {'age': 7.}

And pass it to Normalize:
norm = Normalize(means=means, stds=stds)
FillMissing
Lastly we have our FillMissing, whose missing values we will simply fill with a value of 5:
na_dict = {'education-num':5.}
And pass it in:
fill = FillMissing(na_dict=na_dict)
Bringing it together
Now let's build our TabularPandas
object:
procs = [cat, norm, fill]
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=['salary'], splits=splits)
One thing to note about our cat_names and cont_names here: FillMissing may override them (it can append new _na columns to our categorical variables). And we're done!
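As a final (hypothetical) sanity check, we can confirm our custom statistics made it through, then build our DataLoaders and train as usual:

# Our custom classes, statistics, and fill values should all be present
to.classes['relationship'], to.means, to.na_dict
# From here everything proceeds as normal
dls = to.dataloaders(bs=64)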
Thanks again for reading, and I hope this article helps you with your tabular endeavors!