Often for tabular problems, we deal with ensembling multiple models together. Today we'll look at mixing XGBoost (gradient boosting) in with fastai, and you'll notice we'll be using fastai to prepare our data!
from fastai.tabular.all import *
Let's first build our TabularPandas object:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
y_names = 'salary'
y_block = CategoryBlock()
splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names=y_names, y_block=y_block, splits=splits)
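Before training anything, it can help to peek at what the procs produced. A quick sanity check (show decodes back to human-readable values, while xs holds the numericalized columns the models will actually see):
to.show(3)      # decoded, human-readable rows
to.xs.iloc[:3]  # the processed data as the models will see it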
XGBoost
XGBoost is a gradient boosting library; see its documentation for the full API.
import xgboost as xgb
We'll need our x's and our y's:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
model = xgb.XGBClassifier(n_estimators = 100, max_depth=8, learning_rate=0.1, subsample=0.5)
And now we can fit our classifier:
xgb_model = model.fit(X_train, y_train)
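As a side note, XGBoost can monitor a held-out set while training and stop early once the metric stops improving. A hedged variant, not what we ran above (the exact API depends on your xgboost version; in 1.6+ early_stopping_rounds is passed to the constructor):
model_es = xgb.XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1,
                             subsample=0.5, early_stopping_rounds=10)
model_es.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)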
And we'll grab the raw probabilities from our test data:
xgb_preds = xgb_model.predict_proba(X_test)
xgb_preds
And check its accuracy:
accuracy(tensor(xgb_preds), tensor(y_test))
We can even plot the feature importance:
from xgboost import plot_importance
plot_importance(xgb_model)
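By default plot_importance ranks features by weight (how many times each feature is used in a split); asking for gain instead often tells a different story:
plot_importance(xgb_model, importance_type='gain')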
Now let's build our neural network with fastai, using the same TabularPandas object:
dls = to.dataloaders()
learn = tabular_learner(dls, layers=[200,100], metrics=accuracy)
learn.fit(5, 1e-2)
As we can see, our neural network scored 83.84% accuracy, slightly higher than the GBT.
Now we'll grab its predictions on the validation set:
nn_preds = learn.get_preds()[0]
nn_preds
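Since we'll be averaging these with the XGBoost probabilities row by row, it's worth a quick sanity check that the two arrays line up (a minimal check; get_preds runs over the validation DataLoader, which for tabular data keeps rows in order):
nn_preds.shape, xgb_preds.shape  # both should be (n_valid, 2) over the same rows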
Let's check to see if our feature importance changed at all
class PermutationImportance():
    "Calculate and plot the permutation importance"
    def __init__(self, learn:Learner, df=None, bs=None):
        "Initialize with a test dataframe, a learner, and a metric"
        self.learn = learn
        self.df = df
        bs = bs if bs is not None else learn.dls.bs
        self.dl = learn.dls.test_dl(self.df, bs=bs) if self.df is not None else learn.dls[1]
        self.x_names = learn.dls.x_names.filter(lambda x: '_na' not in x)
        self.na = learn.dls.x_names.filter(lambda x: '_na' in x)
        self.y = learn.dls.y_names
        self.results = self.calc_feat_importance()
        self.plot_importance(self.ord_dic_to_df(self.results))

    def measure_col(self, name:str):
        "Measures change in the metric after shuffling a column"
        col = [name]
        # if the column has a matching `_na` indicator, shuffle it too
        if f'{name}_na' in self.na: col.append(f'{name}_na')
        orig = self.dl.items[col].values
        perm = np.random.permutation(len(orig))
        self.dl.items[col] = self.dl.items[col].values[perm]
        metric = self.learn.validate(dl=self.dl)[1]
        self.dl.items[col] = orig  # restore the original values
        return metric

    def calc_feat_importance(self):
        "Calculates permutation importance by shuffling a column on a percentage scale"
        print('Getting base error')
        base_error = self.learn.validate(dl=self.dl)[1]
        self.importance = {}
        pbar = progress_bar(self.x_names)
        print('Calculating Permutation Importance')
        for col in pbar:
            self.importance[col] = self.measure_col(col)
        for key, value in self.importance.items():
            self.importance[key] = (base_error-value)/base_error # this can be adjusted
        return OrderedDict(sorted(self.importance.items(), key=lambda kv: kv[1], reverse=True))

    def ord_dic_to_df(self, d:OrderedDict):
        return pd.DataFrame([[k, v] for k, v in d.items()], columns=['feature', 'importance'])

    def plot_importance(self, df:pd.DataFrame, limit=20, asc=False, **kwargs):
        "Plot importance with an optional limit to how many variables are shown"
        df_copy = df.copy()
        df_copy['feature'] = df_copy['feature'].str.slice(0,25)
        df_copy = df_copy.sort_values(by='importance', ascending=asc)[:limit].sort_values(by='importance', ascending=not(asc))
        ax = df_copy.plot.barh(x='feature', y='importance', sort_columns=True, **kwargs)
        for p in ax.patches:
            ax.annotate(f'{p.get_width():.4f}', ((p.get_width() * 1.005), p.get_y() * 1.005))
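Note how the scores are scaled: each feature's importance is the relative drop in the metric after shuffling, (base_error - value) / base_error. With purely illustrative numbers: if the base accuracy is 0.83 and shuffling age drops it to 0.80, age scores (0.83 - 0.80) / 0.83 ≈ 0.036, a roughly 3.6% relative drop.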
imp = PermutationImportance(learn)
And it did! Is that bad? No, it's actually what we want. If the two models utilized the features in the same way, we'd expect very similar predictions. We bring in other models in the hope that they offer a different outlook on how the features are used.
Now we can perform our ensembling! To do so we'll average our predictions together (take the sum and divide by 2):
avgs = (nn_preds + xgb_preds) / 2
avgs
And now we'll take the argmax to get our predictions:
argmax = avgs.argmax(dim=1)
argmax
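Equivalently, we can score those hard labels directly; this should agree with calling accuracy on the averaged probabilities (a minimal check):
(argmax == tensor(y_test)).float().mean()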
How do we know if it worked? Let's grade our predictions:
y_test
accuracy(tensor(nn_preds), tensor(y_test))
accuracy(tensor(xgb_preds), tensor(y_test))
accuracy(tensor(avgs), tensor(y_test))
As you can see, the ensemble scored a bit higher than either model alone!
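A straight mean gives both models an equal vote. Since the neural network scored a bit higher on its own, you could also try a weighted average; the 0.6/0.4 split below is purely illustrative and would normally be tuned on a separate split:
w_avg = 0.6 * nn_preds + 0.4 * tensor(xgb_preds)  # hypothetical weights
accuracy(w_avg, tensor(y_test))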
from sklearn.ensemble import RandomForestClassifier
tree = RandomForestClassifier(n_estimators=100)
Now let's fit it:
tree.fit(X_train, y_train);
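Before looking at importances, we can get a quick read on how the forest does on its own (in scikit-learn, score reports mean accuracy for classifiers):
tree.score(X_test, y_test)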
Now, we are not going to use the default importances. Why? Read up here:
Beware Default Random Forest Importances by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard
Instead, based on their recommendations, we'll be utilizing their rfpimp package:
from rfpimp import *
imp = importances(tree, X_test, to.valid.ys)
plot_importances(imp)
As we can see, the forest's importances are also very different.
Now we can get our raw probabilities:
forest_preds = tree.predict_proba(X_test)
forest_preds
And now we can add it to our ensemble:
avgs = (nn_preds + xgb_preds + forest_preds) / 3
accuracy(tensor(avgs), tensor(y_test))
As we can see, it didn't quite work how we wanted it to. But that's okay; the goal was to experiment!
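If you want to keep experimenting, averaging probabilities isn't the only combiner; a simple majority vote over each model's hard predictions is another classic option. A minimal sketch for this binary task (class 1 wins with at least two of three votes):
votes = (nn_preds.argmax(dim=1).numpy()
         + xgb_preds.argmax(axis=1)
         + forest_preds.argmax(axis=1))
majority = (votes >= 2).astype(int)  # at least 2 of the 3 models voted for class 1
(majority == y_test).mean()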