A guide for exporting `TabularPandas` and using it for inference with non-neural networks

This article is also a Jupyter Notebook available to be run from the top down. There will be code snippets that you can then run in any environment.

Below are the versions of fastai, fastcore, and wwf currently running at the time of writing this:

  • fastai: 2.0.16
  • fastcore: 1.1.2
  • wwf: 0.0.4

Using TabularPandas as a Preprocessor

As mentioned in the documentation, using fastai to preprocess our tabular data is a convenient way for the library to integrate with XGBoost and Random Forests.

The issue, though, comes at inference time: currently there is no way to export our `TabularPandas` object, so we can't do inference without building `DataLoaders` and exporting a `Learner`. We'll solve this problem here and explain what we are doing.

This is a much shorter article as it's currently an active PR, but it will live here until the functionality is merged.

Grab the Data

Let's grab the ADULT_SAMPLE dataset quickly and work with it:

from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

Building our TabularPandas

Next we'll want to make our sample TabularPandas object. For our added export and import functionality we will use the `@patch` decorator from fastcore, which lets us attach new methods to an existing class after the fact.
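If `@patch` is new to you, it is essentially a tidy way of assigning a function onto an existing class. A plain-Python sketch of the same idea (`Greeter` and `shout` are made-up names just for illustration, not fastai API):

```python
# A plain-Python sketch of what fastcore's `@patch` enables:
# attaching a new method to an already-defined class.
class Greeter:
    def __init__(self, name): self.name = name

def shout(self):
    "A method added after `Greeter` was defined"
    return f"HELLO, {self.name.upper()}!"

# Assigning the function onto the class is the core of the trick;
# `@patch` uses the annotation (e.g. `self:TabularPandas`) to do this for you.
Greeter.shout = shout

print(Greeter("fastai").shout())
```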

Let's build our `to` object:

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
splits = RandomSplitter()(range_of(df))

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=['salary'], splits=splits)

The nice part about TabularPandas is that our data is now completely preprocessed, as we can see below by looking at a few rows of our `xs`:

to.train.xs.head()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
18966 5 10 3 11 1 5 1 0.030856 -0.858388 1.146335
11592 5 13 5 5 2 2 1 -0.702314 -0.021613 1.537389
548 5 16 3 0 1 2 2 -0.115778 -0.336138 -0.026827
32008 8 2 3 2 1 5 1 0.764027 0.129564 -1.199989
23657 5 10 3 13 1 5 1 0.324124 1.550315 1.146335

Exporting our TabularPandas

The next bit we want to do is actually add our export functionality. We'll save our `TabularPandas` away as a pickle file:

TabularPandas.export[source]

TabularPandas.export(fname='export.pkl', pickle_protocol=2)

Export the contents of self without the items

@patch
def export(self:TabularPandas, fname='export.pkl', pickle_protocol=2):
    "Export the contents of `self` without the items"
    old_to = self
    self = self.new_empty()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pickle.dump(self, open(Path(fname), 'wb'), protocol=pickle_protocol)
    self = old_to
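The pattern here is: swap in an empty copy of ourselves, pickle that, then restore the original, so the file carries the procs but none of the raw rows. That pattern can be sketched with a plain class (`Store`, its `new_empty`, and `export` below are toy stand-ins, not fastai API):

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory

class Store:
    "A toy container mimicking the export pattern above"
    def __init__(self, items): self.items = items
    def new_empty(self):
        "Return a copy of `self` with no items, keeping everything else"
        return Store([])

def export(store, fname, pickle_protocol=2):
    "Pickle an emptied copy, so no raw data travels with the file"
    pickle.dump(store.new_empty(), open(Path(fname), 'wb'),
                protocol=pickle_protocol)

with TemporaryDirectory() as d:
    s = Store([1, 2, 3])
    export(s, Path(d)/'to.pkl')
    loaded = pickle.load(open(Path(d)/'to.pkl', 'rb'))

print(len(loaded.items))  # the exported copy carries no items
```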

And now we can directly use it:

to.export('to.pkl')

Loading It Back In

Now that we have exported our TabularPandas, how do we use it in deployment? We'll make a load_pandas function to bring our pickle in:

load_pandas[source]

load_pandas(fname)

Load in a TabularPandas object from fname

def load_pandas(fname):
    "Load in a `TabularPandas` object from `fname`"
    distrib_barrier()
    res = pickle.load(open(fname, 'rb'))
    return res

Let's do so for our newly exported `to`:

to_load = load_pandas('to.pkl')

And we can see it has no data:

len(to_load)
0

So how do we process some new data? The key is a combination of two functions:

  • to.train.new()
  • to.process()

The first will set up our data as though it is based on our training data, and the second will run our procs over it. Let's try it out on a subset of our DataFrame:

to_new = to_load.train.new(df.iloc[:10])
to_new.process()
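Conceptually, `train.new` plus `process` reuse the statistics fit on the training set rather than refitting on the new rows. A minimal sketch of that idea with a hand-rolled normalizer (`TrainStats` is a made-up stand-in, not fastai API):

```python
import pandas as pd

class TrainStats:
    "Toy stand-in: hold normalization stats fit on the training data"
    def fit(self, df):
        self.means, self.stds = df.mean(), df.std()
        return self
    def process(self, df):
        "Apply the *training* stats to any new frame, as `to_new.process()` does"
        return (df - self.means) / self.stds

train = pd.DataFrame({'age': [20.0, 30.0, 40.0]})
new   = pd.DataFrame({'age': [30.0]})

stats = TrainStats().fit(train)
# The new row is normalized with the training mean/std
# (30 equals the training mean, so it maps to 0)
print(stats.process(new))
```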

And if we examine our data, we can see it's processed!

to_new.xs.head()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 5 8 3 0 6 5 1 0.764027 -0.840572 0.755281
1 5 13 1 5 2 5 1 0.397441 0.451042 1.537389
2 5 12 1 0 5 3 2 -0.042461 -0.889547 -0.026827
3 6 15 3 11 1 2 1 -0.042461 -0.730635 1.928443
4 7 6 3 9 6 3 2 0.250807 -1.022003 -0.026827
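With a processed `to_new.xs` in hand, handing the data to a non-neural model is just a matter of converting to arrays. A sketch with scikit-learn's `RandomForestClassifier` (assumes scikit-learn is installed; the DataFrame below is a stand-in, as in practice you would pass `to.train.xs` and `to.train.ys`):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the processed `to.train.xs` / `to.train.ys` from above
xs = pd.DataFrame({'age':       [0.76, 0.40, -0.04, -0.04, 0.25],
                   'workclass': [5, 5, 5, 6, 7]})
ys = pd.Series([1, 1, 0, 1, 0], name='salary')

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(xs.values, ys.values)

# New, already-processed rows (e.g. `to_new.xs`) predict the same way
preds = model.predict(xs.values[:2])
print(preds.shape)
```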

To use this with your own projects, simply make sure you've pip installed wwf and do:

from wwf.tabular.export import *

to.export(fname)

after training, then follow the steps above to use your exported TabularPandas object with new data.