Using TabularPandas as a Preprocessor
As mentioned in the documentation, using fastai to preprocess our tabular data is a nice way in which the library integrates with XGBoost and Random Forests. The issue, though, is that when doing inference there is currently no way to export our TabularPandas object, so we cannot run inference without building DataLoaders and exporting a Learner. We'll solve this problem here and explain what we are doing.
This is a much shorter article as it's currently an active PR, but it will live here until the functionality is merged.
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
Building our TabularPandas
Next we'll want to make our sample TabularPandas object. For our added export and import functionality we will use the @patch decorator from fastcore, which lets us add these methods to the class after it has been defined.
Let's build our to object:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=['salary'], splits=splits)
The nice part about TabularPandas is that our data is now completely preprocessed, as we can see below by looking at a few rows of our xs:
to.train.xs.head()
Exporting our TabularPandas
The next bit we want to do is actually add our export functionality. We'll save it away as a pickle file:
@patch
def export(self:TabularPandas, fname='export.pkl', pickle_protocol=2):
    "Export the contents of `self` without the items"
    old_to = self
    self = self.new_empty()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pickle.dump(self, open(Path(fname), 'wb'), protocol=pickle_protocol)
    self = old_to
And now we can directly use it:
to.export('to.pkl')
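Under the hood this follows a simple pattern: swap in an empty copy that keeps only the fitted state, pickle that, and restore the original. A toy analogue of the idea (the Processor class here is hypothetical, purely for illustration):

```python
import pickle

class Processor:
    "Toy stand-in: `mean` is fitted state, `items` is the (large) data"
    def __init__(self, items, mean=None): self.items, self.mean = items, mean
    def new_empty(self):
        # Keep the fitted statistics, drop the items --
        # analogous to `TabularPandas.new_empty`
        return Processor([], mean=self.mean)

proc = Processor([1, 2, 3], mean=2.0)
blob = pickle.dumps(proc.new_empty())       # "export": small and data-free
restored = pickle.loads(blob)
print(restored.mean, len(restored.items))   # 2.0 0
```

The exported file stays tiny because none of the training rows travel with it; only the state needed to process new data does.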
Loading It Back In
Now that we have exported our TabularPandas, how do we use it in deployment? We'll make a load_pandas function to bring our pickle in:
def load_pandas(fname):
    "Load in a `TabularPandas` object from `fname`"
    distrib_barrier()
    res = pickle.load(open(fname, 'rb'))
    return res
Let's do so for our newly exported to object:
to_load = load_pandas('to.pkl')
And we can see it has no data:
len(to_load)
So how do we process some new data? The key is a combination of two functions:
to.train.new()
to.process()
The first will set up our new data as though it were based on our training data, and the second will run our procs over it. Let's try it out on a subset of our DataFrame:
to_new = to_load.train.new(df.iloc[:10])
to_new.process()
And if we examine our data, we can see it's processed!
to_new.xs.head()
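The important point is that the new rows are transformed with the statistics learned at training time, not recomputed from the new data. A toy analogue (the Normalizer class is hypothetical, standing in for a fitted proc):

```python
class Normalizer:
    "Toy analogue of a fitted proc"
    def fit(self, xs):
        n = len(xs)
        self.mean = sum(xs) / n
        self.std = (sum((x - self.mean) ** 2 for x in xs) / n) ** 0.5
        return self
    def process(self, xs):
        # New data is transformed with the *stored training* statistics,
        # mirroring what `to.train.new(df).process()` does
        return [(x - self.mean) / self.std for x in xs]

norm = Normalizer().fit([1, 2, 3, 4, 5])
print(norm.process([3, 5]))
```

If process recomputed the statistics on the two new values instead, the model would see inputs on a different scale than it was trained on.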
To use this with your own projects, simply make sure you've pip installed wwf and, after training, do:
from wwf.tabular.export import *
to.export(fname)
Then follow what we did above to use your exported TabularPandas object with new data.