Often for tabular problems, we deal with ensembling multiple models together. Today we'll look at mixing XGBoost (gradient boosting) in with fastai, and you'll notice we'll be using fastai to prepare our data!
from fastai.tabular.all import *
Let's first build our TabularPandas object:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
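Before training anything, it can help to peek at what the procs produced. A quick sanity check (show decodes back to human-readable values, while xs holds the numericalized columns the models will actually see):
to.show(3)      # decoded, human-readable rows
to.xs.iloc[:3]  # the processed data as the models will see it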
XGBoost
XGBoost is a gradient boosting library; see its documentation for the full API.
import xgboost as xgb
We'll need our x's and our y's:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
model = xgb.XGBClassifier(n_estimators = 100, max_depth=8, learning_rate=0.1, subsample=0.5)
And now we can fit our classifier:
xgb_model = model.fit(X_train, y_train)
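As a side note, XGBoost can monitor a held-out set while training and stop early once the metric stops improving. A hedged variant, not what we ran above (the exact API depends on your xgboost version; in 1.6+ early_stopping_rounds is passed to the constructor):
model_es = xgb.XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1,
                             subsample=0.5, early_stopping_rounds=10)
model_es.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)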
And we'll grab the raw probabilities from our test data:
xgb_preds = xgb_model.predict_proba(X_test)
xgb_preds
And check its accuracy:
accuracy(tensor(xgb_preds), tensor(y_test))
We can even plot the feature importance:
from xgboost import plot_importance
plot_importance(xgb_model)
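By default plot_importance ranks features by weight (how many times each feature is used in a split); asking for gain instead often tells a different story:
plot_importance(xgb_model, importance_type='gain')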
Now let's build our neural network with fastai, using the same TabularPandas object:
dls = to.dataloaders()
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
learn.fit(5, 1e-2)
As we can see, our neural network scored 83.84% accuracy, slightly higher than the GBT.
Now we'll grab its predictions on the validation set:
nn_preds = learn.get_preds()[0]
nn_preds
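Since we'll be averaging these with the XGBoost probabilities row by row, it's worth a quick sanity check that the two arrays line up (a minimal check; get_preds runs over the validation DataLoader, which for tabular data keeps rows in order):
nn_preds.shape, xgb_preds.shape  # both should be (n_valid, 2) over the same rows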
Let's check to see if our feature importance changed at all
class PermutationImportance():
    "Calculate and plot the permutation importance"
    def __init__(self, learn:Learner, df=None, bs=None):
        "Initialize with a test dataframe, a learner, and a metric"
        self.learn = learn
        self.df = df
        bs = bs if bs is not None else learn.dls.bs
        self.dl = learn.dls.test_dl(self.df, bs=bs) if self.df is not None else learn.dls[1]
        self.x_names = learn.dls.x_names.filter(lambda x: '_na' not in x)
        self.na = learn.dls.x_names.filter(lambda x: '_na' in x)
        self.y = learn.dls.y_names
        self.results = self.calc_feat_importance()
        self.plot_importance(self.ord_dic_to_df(self.results))

    def measure_col(self, name:str):
        "Measures change in the metric after shuffling a column"
        col = [name]
        # if the column has a matching `_na` indicator, shuffle it too
        if f'{name}_na' in self.na: col.append(f'{name}_na')
        orig = self.dl.items[col].values
        perm = np.random.permutation(len(orig))
        self.dl.items[col] = self.dl.items[col].values[perm]
        metric = self.learn.validate(dl=self.dl)[1]
        self.dl.items[col] = orig  # restore the original values
        return metric

    def calc_feat_importance(self):
        "Calculates permutation importance by shuffling a column on a percentage scale"
        print('Getting base error')
        base_error = self.learn.validate(dl=self.dl)[1]
        self.importance = {}
        pbar = progress_bar(self.x_names)
        print('Calculating Permutation Importance')
        for col in pbar:
            self.importance[col] = self.measure_col(col)
        for key, value in self.importance.items():
            self.importance[key] = (base_error-value)/base_error # this can be adjusted
        return OrderedDict(sorted(self.importance.items(), key=lambda kv: kv[1], reverse=True))

    def ord_dic_to_df(self, d:OrderedDict):
        return pd.DataFrame([[k, v] for k, v in d.items()], columns=['feature', 'importance'])

    def plot_importance(self, df:pd.DataFrame, limit=20, asc=False, **kwargs):
        "Plot importance with an optional limit to how many variables are shown"
        df_copy = df.copy()
        df_copy['feature'] = df_copy['feature'].str.slice(0,25)
        df_copy = df_copy.sort_values(by='importance', ascending=asc)[:limit].sort_values(by='importance', ascending=not(asc))
        ax = df_copy.plot.barh(x='feature', y='importance', sort_columns=True, **kwargs)
        for p in ax.patches:
            ax.annotate(f'{p.get_width():.4f}', ((p.get_width() * 1.005), p.get_y() * 1.005))
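Note how the scores are scaled: each feature's importance is the relative drop in the metric after shuffling, (base_error - value) / base_error. With purely illustrative numbers: if the base accuracy is 0.83 and shuffling age drops it to 0.80, age scores (0.83 - 0.80) / 0.83 ≈ 0.036, a roughly 3.6% relative drop.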
imp = PermutationImportance(learn)
And it did! Is that bad? No, it's actually what we want. If the two models utilized the features in the same way, we'd expect very similar predictions. We bring in other models in the hope that they offer a different outlook on how the features are used.
Now we can perform our ensembling! To do so we'll average our predictions together (take the sum and divide by 2):
avgs = (nn_preds + xgb_preds) / 2
avgs
And now we'll take the argmax to get our predictions:
argmax = avgs.argmax(dim=1)
argmax
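Equivalently, we can score those hard labels directly; this should agree with calling accuracy on the averaged probabilities (a minimal check):
(argmax == tensor(y_test)).float().mean()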
How do we know if it worked? Let's grade our predictions:
y_test
accuracy(tensor(nn_preds), tensor(y_test))
accuracy(tensor(xgb_preds), tensor(y_test))
accuracy(tensor(avgs), tensor(y_test))
As you can see, the ensemble scored a bit higher than either model alone!
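A straight mean gives both models an equal vote. Since the neural network scored a bit higher on its own, you could also try a weighted average; the 0.6/0.4 split below is purely illustrative and would normally be tuned on a separate split:
w_avg = 0.6 * nn_preds + 0.4 * tensor(xgb_preds)  # hypothetical weights
accuracy(w_avg, tensor(y_test))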
from sklearn.ensemble import RandomForestClassifier
tree = RandomForestClassifier(n_estimators=100)
Now let's fit it:
tree.fit(X_train, y_train);
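Before looking at importances, we can get a quick read on how the forest does on its own (in scikit-learn, score reports mean accuracy for classifiers):
tree.score(X_test, y_test)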
Now, we are not going to use the default importances. Why? Read up here:
Beware Default Random Forest Importances by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard
Instead, based on their recommendations, we'll be utilizing their rfpimp package:
from rfpimp import *
imp = importances(tree, X_test, to.valid.ys)
plot_importances(imp)
As we can see, the forest's importances are also very different.
Now we can get our raw probabilities:
forest_preds = tree.predict_proba(X_test)
forest_preds
And now we can add it to our ensemble:
avgs = (nn_preds + xgb_preds + forest_preds) / 3
accuracy(tensor(avgs), tensor(y_test))
As we can see, it didn't quite work how we wanted it to. But that's okay; the goal was to experiment!
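If you want to keep experimenting, averaging probabilities isn't the only combiner; a simple majority vote over each model's hard predictions is another classic option. A minimal sketch for this binary task (class 1 wins with at least two of three votes):
votes = (nn_preds.argmax(dim=1).numpy()
         + xgb_preds.argmax(axis=1)
         + forest_preds.argmax(axis=1))
majority = (votes >= 2).astype(int)  # at least 2 of the 3 models voted for class 1
(majority == y_test).mean()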