Rossmann's Kaggle competition was a business sales prediction competition with $35,000 in prize money for the winners.
The premise is that we're given a few years of sales data and information about each store, and we need to build a model that can predict future sales.
We can do this through a tabular regression model.
Jeremy walks through the feature engineering for this problem, but today we will download a cleaned, pre-engineered dataset straight from Kaggle. To download it:
- Go to: https://www.kaggle.com/init27/fastai-v3-rossman-data-clean
- Go to the "Output" tab
- Right-click the download button for each of the train and test files
- Click "Copy link location"
Then we can download each file with `!wget {url}`, pasting in the links below. (Note: to walk through the feature engineering, see this notebook.)
train = 'ENTER_URL_HERE' # paste the link you copied for the train data
test = 'ENTER_URL_HERE' # paste the link you copied for the test data
!wget {train} -q
!wget {test} -q
And now that we have our data, let's import what we need from the fastai library:
from fastai.tabular.all import *
train_df = pd.read_pickle('train_clean')
test_df = pd.read_pickle('test_clean')
train_df.head().T
With our time-series-based approach, the feature engineering created a number of date-related categorical columns that we can utilize in our embeddings:
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
'SchoolHoliday_fw', 'SchoolHoliday_bw', 'Promo', 'SchoolHoliday']
cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h',
'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
'AfterStateHoliday', 'BeforeStateHoliday']
dep_var = 'Sales'
When doing regression with large values like these, we often take the log of the values for our `y`'s; that way the model is effectively penalized for percentage errors rather than absolute ones, which matches the metric we'll use later. Let's transform them real quick:
train_df[dep_var] = np.log(train_df[dep_var])
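As a quick sanity check on that reasoning (with made-up numbers): equal percentage changes give equal differences in log space, no matter the scale:
a = np.array([100., 10000.]) # two stores at very different sales scales
b = a * 1.10                 # the same 10% increase for both
np.log(b) - np.log(a)        # both differences come out to ~0.0953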
Now let's declare our `procs`: `Categorify` encodes the categorical columns as integers (which our embeddings will consume), `FillMissing` fills in missing continuous values (adding a boolean `_na` column to flag where it did so), and `Normalize` standardizes the continuous columns:
procs = [FillMissing, Normalize, Categorify]
And our splits. Since this is a time series, we want to ensure our validation set resembles the test set: the test set covers the dates immediately after our training data, so we'll hold out the most recent dates in `train_df` (roughly the same number of rows as `test_df`) for validation:
len(train_df), len(test_df)
test_df['Date'].min(), test_df['Date'].max()
Let's find that particular index
idx = train_df['Date'][(train_df['Date'] == train_df['Date'][len(test_df)])].index.max()
idx
So now our splits: the training set will be every index from the 41,395th item onward, and the validation set everything before it:
splits = (L(range(idx, len(train_df))),L(range(idx)))
splits
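To sanity-check the split, we can compare the dates on each side of the cut-off (this assumes, as in this cleaned dataset, that the rows are sorted by date in descending order, so the first `idx` rows are the most recent):
# Newest date overall, last validation row, and first training row
train_df['Date'].iloc[0], train_df['Date'].iloc[idx-1], train_df['Date'].iloc[idx]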
Let's make our `TabularPandas`! Since we have a large `DataFrame`, we can set `inplace` to `True` and `reduce_memory` to `True` to save on some memory (note: `reduce_memory` is `True` by default). To use `inplace`, we need to set `chained_assignment` to `None` in pandas:
pd.options.mode.chained_assignment=None
to = TabularPandas(train_df, procs, cat_vars, cont_vars, dep_var, y_block=RegressionBlock(),
splits=splits, inplace=True, reduce_memory=True)
And now let's build our `DataLoaders`!
dls = to.dataloaders(bs=512)
dls.show_batch()
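Note that `show_batch` decodes everything back to human-readable values; internally the categorical columns have been encoded as integers (which is what the embeddings consume). You can peek at the processed rows through the `items` attribute, the same attribute our permutation importance class will use later:
to.items.head()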
As we're doing regression, we want to dictate what the maximum (and minimum) predicted value can be, so we will use a `y_range`:
max_log_y = np.max(train_df['Sales'])*1.2
max_log_y
And now we can make our `y_range`:
y_range = torch.tensor([0, max_log_y]); y_range
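Why multiply by 1.2 first? The `y_range` is applied through a scaled sigmoid on the model's final activations, and a sigmoid only approaches its endpoints asymptotically, so padding the maximum lets the model actually reach the true maximum. Here's a minimal sketch of the idea (the function name is ours; fastai implements this as `SigmoidRange`):
def sigmoid_range_sketch(x, lo, hi):
    "Squash raw activations into the (lo, hi) range with a scaled sigmoid"
    return torch.sigmoid(x) * (hi - lo) + lo

sigmoid_range_sketch(torch.tensor([-10., 0., 10.]), *y_range) # ~lo, the midpoint, ~hi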
Next comes our `Learner`. We'll walk through each step:
tc = tabular_config(ps=[0.001, 0.01], embed_p=0.04, y_range=y_range)
learn = tabular_learner(dls, layers=[1000,500],
metrics=exp_rmspe,
config=tc,
loss_func=MSELossFlat())
So we have a lot going on right there: `ps` is the dropout applied to each of our linear layers (which helps with overfitting), `embed_p` is the dropout applied to the embedding weights, and `exp_rmspe` is Root Mean Squared Percentage Error.
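As a rough sketch of what `exp_rmspe` computes (see the fastai source for the exact implementation): the predictions and targets are exponentiated first, undoing our log transform, before the percentage error is taken:
def exp_rmspe_sketch(preds, targs):
    "Undo the log transform, then take Root Mean Squared Percentage Error"
    preds, targs = torch.exp(preds), torch.exp(targs)
    pct_var = (targs - preds) / targs
    return torch.sqrt((pct_var**2).mean())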
Let's look at our model:
learn.summary()
learn.lr_find()
learn.fit_one_cycle(5, 3e-3, wd=0.2)
For comparison, an `exp_rmspe` of 0.108 would have placed 10th in the competition. Now let's export our trained model, delete it, and load it back in:
learn.export('myModel')
del learn
learn = load_learner('myModel')
Now we generate our `test_dl` from our `test_df`:
dl = learn.dls.test_dl(test_df)
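The test `DataLoader` applies the same `procs` we fit on the training data; you can verify by peeking at its processed `items`:
dl.items.head()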
We need to tell the `Learner` to return the predictions for the test set we added:
raw_test_preds = learn.get_preds(dl=dl)
Let's take a peek
raw_test_preds
You'll notice `[0]` contains our predictions, and `[1]` contains the labels (if we had any). This is nice because if we accidentally run `learn.validate()` on an unlabeled test set, we get the following:
learn.validate(dl=dl)
It still runs; there are just no labels, so we get `None` back.
Now back to our predictions! We need to undo our `log` transform first:
np.exp(raw_test_preds[0])
test_preds = np.exp(raw_test_preds[0]).numpy().T[0]
(If you want to see what `.T` is doing there, compare the two outputs below: it transposes the `(n, 1)` array of predictions so that `[0]` gives us a flat vector.)
raw_test_preds[0].numpy()
test_preds
Now we can submit to Kaggle!
test_df['Sales'] = test_preds
test_df[['Id', "Sales"]] = test_df[['Id', 'Sales']].astype('int')
And finally we make our submission. NOTE: always remove the index when generating your submission!
test_df[['Id', 'Sales']].to_csv('submission.csv', index=False)
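It's worth reading the file back to double-check the format before uploading it:
pd.read_csv('submission.csv').head()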
Permutation Importance
Permutation importance is a technique in which we shuffle each column in a `DataFrame` and analyze how that shuffling affects our model's predictions of the `y` values. The more the error changes, the more "important" we can (generally) call that variable to our neural network. Let's build a quick algorithm to do this:
class PermutationImportance():
    "Calculate and plot the permutation importance"
    def __init__(self, learn:Learner, df=None, bs=None):
        "Initialize with a test dataframe, a learner, and a metric"
        self.learn = learn
        self.df = df
        bs = bs if bs is not None else learn.dls.bs
        if self.df is not None:
            self.dl = learn.dls.test_dl(self.df, bs=bs)
        else:
            self.dl = learn.dls[1]
        self.x_names = learn.dls.x_names.filter(lambda x: '_na' not in x)
        self.na = learn.dls.x_names.filter(lambda x: '_na' in x)
        self.y = learn.dls.y_names
        self.results = self.calc_feat_importance()
        self.plot_importance(self.ord_dic_to_df(self.results))

    def measure_col(self, name:str):
        "Measures change after column shuffle"
        col = [name]
        # shuffle the matching `_na` indicator column along with its value column
        if f'{name}_na' in self.na: col.append(f'{name}_na')
        orig = self.dl.items[col].values
        perm = np.random.permutation(len(orig))
        self.dl.items[col] = self.dl.items[col].values[perm]
        metric = self.learn.validate(dl=self.dl)[1]
        self.dl.items[col] = orig
        return metric

    def calc_feat_importance(self):
        "Calculates permutation importance by shuffling a column on a percentage scale"
        print('Getting base error')
        base_error = self.learn.validate(dl=self.dl)[1]
        self.importance = {}
        pbar = progress_bar(self.x_names)
        print('Calculating Permutation Importance')
        for col in pbar:
            self.importance[col] = self.measure_col(col)
        for key, value in self.importance.items():
            self.importance[key] = (base_error-value)/base_error #this can be adjusted
        return OrderedDict(sorted(self.importance.items(), key=lambda kv: kv[1], reverse=True))

    def ord_dic_to_df(self, d:OrderedDict):
        return pd.DataFrame([[k, v] for k, v in d.items()], columns=['feature', 'importance'])

    def plot_importance(self, df:pd.DataFrame, limit=20, asc=False, **kwargs):
        "Plot importance with an optional limit to how many variables shown"
        df_copy = df.copy()
        df_copy['feature'] = df_copy['feature'].str.slice(0,25)
        df_copy = df_copy.sort_values(by='importance', ascending=asc)[:limit].sort_values(by='importance', ascending=not(asc))
        ax = df_copy.plot.barh(x='feature', y='importance', sort_columns=True, **kwargs)
        for p in ax.patches:
            ax.annotate(f'{p.get_width():.4f}', ((p.get_width() * 1.005), p.get_y() * 1.005))
And now we can simply call `PermutationImportance` to run it!
res = PermutationImportance(learn, train_df.iloc[:1000], bs=64)
res.importance
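And if you'd rather inspect the scores as a `DataFrame` instead of the plot, the class's own helper can convert them:
res.ord_dic_to_df(res.results)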