CatBoostClassifier on Polynomial Features

Feature engineering meets gradient boosting on the Palmer Penguins dataset

Introduction

CatBoost handles categorical features natively during training; no separate preprocessing step is needed. Pass the column indices via cat_features and it deals with encoding internally using ordered target statistics.

The ordered statistics approach works roughly like this: for each categorical value, CatBoost estimates the target mean using only the rows that appeared before the current one in a shuffled order. This prevents target leakage that you would get from naive target encoding, where computing statistics over the whole dataset lets the model see future target values during training.
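The idea above can be sketched in a few lines. This is a simplified illustration, not CatBoost's actual implementation (which uses multiple permutations and tree-level bookkeeping); the function name, the smoothing weight `a`, and the prior value are assumptions for the sketch.

```python
import numpy as np

def ordered_target_stats(cats, y, prior=0.5, a=1.0, seed=0):
    """Toy sketch of ordered target statistics.

    Each row is encoded using only the target values of rows that
    appear *earlier* in a random permutation, plus a smoothing prior,
    so the row's own target (and any "future" row) never leaks in.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    sums, counts = {}, {}
    enc = np.empty(len(y))
    for i in order:
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + a * prior) / (n + a)  # history only
        sums[c] = s + y[i]                  # update *after* encoding
        counts[c] = n + 1
    return enc

cats = np.array(["a", "a", "b", "b", "a"])
y = np.array([1, 0, 1, 1, 0])
print(ordered_target_stats(cats, y))
```

Note that the first row seen for each category gets exactly the prior, since no history exists yet; a naive whole-dataset target encoding would instead give it the category's full target mean, including its own label.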

Polynomial features expand the numeric input space by adding interaction terms and powers. With degree=2, a feature x becomes x and x², and two features x, y also get a cross term x·y. The model then picks up on relationships it would miss if only raw columns were passed in. For the three numeric columns in the Penguins dataset (bill_length_mm, bill_depth_mm, flipper_length_mm), degree=2 expansion produces 9 features total before the bias term (3 linear + 3 squares + 3 cross terms), tripling the numeric feature count.

Feature Engineering

The pipeline keeps numeric and categorical columns on separate tracks, then recombines them before training:

  • Numeric columns: median imputation for missing values (the Penguins dataset has a handful of NaN rows), then PolynomialFeatures(degree=2, include_bias=False) to expand the feature space.
  • Categorical columns: most-frequent imputation only. CatBoost's cat_features parameter handles species and sex natively, so no one-hot encoding or label encoding is needed.
  • Both tracks are concatenated into a single matrix, with the numeric polynomial columns first and the categorical columns appended at the end.
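The two-track assembly can be sketched as follows. The toy arrays stand in for the Penguins columns (column names and values are assumptions), and the CatBoost call at the end is shown as a comment so the sketch only depends on numpy and scikit-learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

# Toy stand-ins for the numeric and categorical columns, NaNs included.
num = np.array([[39.1, 18.7, 181.0],
                [np.nan, 17.4, 186.0],
                [40.3, np.nan, 195.0]])
cat = np.array([["Torgersen", "male"],
                ["Torgersen", np.nan],
                ["Biscoe", "female"]], dtype=object)

# Numeric track: median imputation, then degree-2 polynomial expansion.
num_imp = SimpleImputer(strategy="median").fit_transform(num)
num_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(num_imp)

# Categorical track: most-frequent imputation only.
cat_imp = SimpleImputer(strategy="most_frequent").fit_transform(cat)

# Recombine: polynomial block first, categoricals appended at the end.
X = np.hstack([num_poly.astype(object), cat_imp])
cat_idx = list(range(num_poly.shape[1], X.shape[1]))
print(X.shape, cat_idx)  # (3, 11) [9, 10]

# Training would then be (assuming catboost is installed):
# from catboost import CatBoostClassifier
# CatBoostClassifier(cat_features=cat_idx, verbose=0).fit(X, y)
```

Keeping the categorical columns last makes the cat_features indices trivial to compute from the width of the numeric block, which is handy because the polynomial expansion changes that width.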

Model evaluation uses stratified 5-fold cross-validation to preserve class distribution across folds. Stratification matters here because the three species have unequal sample counts: Adelie has 152 samples, Gentoo 124, and Chinstrap only 68. A random split could create folds that underrepresent Chinstrap and produce overly optimistic scores on the majority classes.
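Using the class counts from the text, a quick check shows how StratifiedKFold keeps the minority class represented in every fold (the label array here is synthetic, built only from those counts):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Class counts from the text: Adelie 152, Gentoo 124, Chinstrap 68.
y = np.array([0] * 152 + [1] * 124 + [2] * 68)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for _, test_idx in skf.split(np.zeros(len(y)), y):
    counts = np.bincount(y[test_idx], minlength=3)
    print(counts)  # each fold keeps roughly the 152/124/68 proportions
```

Every fold ends up with 13 or 14 Chinstrap samples, whereas an unstratified shuffle could leave a fold with far fewer.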

Results & Notebook

Cross-validation scores across 5 folds: 0.9818, 0.9455, 0.9455, 1.0, 1.0. Final held-out test accuracy: 97.1% on 69 test samples.

The two folds that scored below 0.97 likely reflect the trickier boundary cases between Adelie and Chinstrap, which overlap more in bill dimensions than either does with Gentoo. The polynomial cross terms (bill_length × bill_depth in particular) help define this boundary. Running CatBoost on the raw numeric features without polynomial expansion drops mean CV accuracy by a few points, confirming the expansion is doing real work.

97.1% on the 69-sample held-out test set is solid. The Penguins dataset is relatively clean and separable by bill morphology, but at 344 total samples every percentage point carries noise. The CV consistency (four of five folds above 0.94) is more informative than the single test number.