Using TabularPandas as a Preprocessor
As mentioned in the documentation, using fastai to preprocess our tabular data is a nice way in which the library integrates with XGBoost and Random Forests. The issue, though, is that when doing inference there is currently no way to export our TabularPandas object, so we cannot run inference without building DataLoaders and exporting a Learner. We'll solve this problem here and explain what we are doing.
This is a much shorter article as it's currently an active PR, but it will live here until the functionality is merged.
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.head()
Building our TabularPandas
Next we'll want to make our sample TabularPandas object. For our added export and import functionality we will use the @patch decorator from fastcore, which lets us add these methods to the class after it has been defined.
Let's build our to object:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=['salary'], splits=splits)
The nice part about TabularPandas is that our data is now completely preprocessed, as we can see below by looking at a few rows of our xs:
to.train.xs.head()
Exporting our TabularPandas
The next bit we want to do is actually add our export functionality. We'll save it away as a pickle file:
@patch
def export(self:TabularPandas, fname='export.pkl', pickle_protocol=2):
    "Export the contents of `self` without the items"
    old_to = self
    self = self.new_empty()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pickle.dump(self, open(Path(fname), 'wb'), protocol=pickle_protocol)
    self = old_to
And now we can directly use it:
to.export('to.pkl')
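Under the hood this follows a simple pattern: swap in an empty copy that keeps only the fitted state, pickle that, and restore the original. A toy analogue of the idea (the Processor class here is hypothetical, purely for illustration):

```python
import pickle

class Processor:
    "Toy stand-in: `mean` is fitted state, `items` is the (large) data"
    def __init__(self, items, mean=None): self.items, self.mean = items, mean
    def new_empty(self):
        # Keep the fitted statistics, drop the items --
        # analogous to `TabularPandas.new_empty`
        return Processor([], mean=self.mean)

proc = Processor([1, 2, 3], mean=2.0)
blob = pickle.dumps(proc.new_empty())       # "export": small and data-free
restored = pickle.loads(blob)
print(restored.mean, len(restored.items))   # 2.0 0
```

The exported file stays tiny because none of the training rows travel with it; only the state needed to process new data does.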
Loading It Back In
Now that we have exported our TabularPandas, how do we use it in deployment? We'll make a load_pandas function to bring our pickle in:
def load_pandas(fname):
    "Load in a `TabularPandas` object from `fname`"
    distrib_barrier()
    res = pickle.load(open(fname, 'rb'))
    return res
Let's do so for our newly exported to object:
to_load = load_pandas('to.pkl')
And we can see it has no data:
len(to_load)
So how do we process some new data? The key is a combination of two functions:
to.train.new()
to.process()
The first will set up our new data as though it were based on our training data, and the second will run our procs over it. Let's try it out on a subset of our DataFrame:
to_new = to_load.train.new(df.iloc[:10])
to_new.process()
And if we examine our data, we can see it's processed!
to_new.xs.head()
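The important point is that the new rows are transformed with the statistics learned at training time, not recomputed from the new data. A toy analogue (the Normalizer class is hypothetical, standing in for a fitted proc):

```python
class Normalizer:
    "Toy analogue of a fitted proc"
    def fit(self, xs):
        n = len(xs)
        self.mean = sum(xs) / n
        self.std = (sum((x - self.mean) ** 2 for x in xs) / n) ** 0.5
        return self
    def process(self, xs):
        # New data is transformed with the *stored training* statistics,
        # mirroring what `to.train.new(df).process()` does
        return [(x - self.mean) / self.std for x in xs]

norm = Normalizer().fit([1, 2, 3, 4, 5])
print(norm.process([3, 5]))
```

If process recomputed the statistics on the two new values instead, the model would see inputs on a different scale than it was trained on.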
To use this with your own projects, simply make sure you've pip installed wwf and, after training, do:
from wwf.tabular.export import *
to.export(fname)
Then follow what we did above to use your exported TabularPandas object with new data.