Decision Tree Classifier from Scratch
Building Gini-based tree splitting in pure NumPy on the Iris dataset
Introduction
A Decision Tree splits data recursively. At each node it picks the feature and threshold that best separates the classes among the current samples, then repeats on each resulting subset until a stopping condition is reached. Leaves vote by majority class.
Every prediction follows a traceable path through the tree that you can read like a set of if-else rules; nothing is hidden. The implementation is pure NumPy, with no scikit-learn used for the tree logic.
One useful property of decision trees is that the algorithm is fully deterministic given the training data. No random initialization, no gradient descent, no convergence questions. Run it twice on the same data and you get the same tree. The only design choices are the splitting criterion, depth limit, and minimum samples to split a node. Everything else follows from those.
How It Works
The core of the algorithm is the split selection criterion:
- Gini Impurity: measures class mixing at a node, computed as 1 − Σ(pᵢ²), where pᵢ is the proportion of class i. A pure node has Gini = 0; a maximally mixed two-class node has Gini = 0.5. Gini is cheaper to compute than entropy (which uses logarithms) and produces very similar splits in practice.
- Best split search: for each feature, every unique value is considered as a candidate threshold. Each candidate is scored by computing the weighted Gini of the two resulting child nodes: (n_left/n) * gini_left + (n_right/n) * gini_right. The feature-threshold pair with the lowest weighted Gini is selected.
- Stopping conditions: recursion halts when the maximum depth is reached, the node is already pure (Gini = 0), or the number of samples falls below min_samples_split.
- Leaf assignment: each leaf stores the majority class among its training samples and outputs that class for any test point that reaches it.
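The Gini computation and weighted split search described above can be sketched in a few lines of NumPy. The function names here are illustrative, not the article's actual implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions at a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(y_left, y_right):
    """Weighted Gini of a candidate split: (n_left/n)*G_left + (n_right/n)*G_right."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

def best_split(X, y):
    """Brute-force search: try every feature and every unique value as a threshold."""
    best = (None, None, np.inf)  # (feature index, threshold, weighted Gini)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or not mask.any():  # skip splits that leave one child empty
                continue
            score = weighted_gini(y[mask], y[~mask])
            if score < best[2]:
                best = (f, t, score)
    return best
```

A pure node scores 0.0 and a perfectly mixed two-class node scores 0.5, which makes these helpers easy to sanity-check before wiring them into the recursion.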
The brute-force search over all features and thresholds is O(n² * d) per node: d features, up to n candidate thresholds each, and O(n) work to score each candidate, where d is the number of features and n is the number of samples at that node. For small datasets like Iris this is instant. For larger datasets with many unique feature values, you would typically sort each feature column once and scan thresholds in order while maintaining running class counts, bringing the per-node cost down to O(d * n log n), or use approximate histogram-based methods.
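One small piece of the sorted-scan idea can be sketched directly: instead of testing every raw value, sort the column once (np.unique sorts as a side effect) and use midpoints between consecutive distinct values as candidate thresholds. This is a sketch of the optimization the text mentions, not the article's implementation:

```python
import numpy as np

def candidate_thresholds(col):
    """Midpoints between consecutive distinct values of one feature column."""
    v = np.unique(col)           # sorted distinct values
    return (v[:-1] + v[1:]) / 2  # one candidate per gap, not per sample

col = np.array([5.1, 4.9, 4.9, 6.3])
print(candidate_thresholds(col))  # [5.  5.7]
```

Midpoint thresholds also avoid a subtle edge case of thresholding on raw values: a split at the largest value sends every sample to one side and has to be skipped.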
Results & Notebook
Tested on the Iris dataset (150 samples, 3 classes) with an 80/20 split and max_depth = 5. The from-scratch tree hits 100% test accuracy, matching scikit-learn's DecisionTreeClassifier on the same split.
Iris is a clean starting point: the classes are separable enough that a correct implementation should hit 100%, making it easy to catch logic errors in the Gini computation or split search. A mistake in the weighted Gini formula, for instance, typically produces splits that look reasonable but miss the true optimum, causing a few misclassifications you can directly inspect.
Setting max_depth = 5 gives more capacity than Iris actually needs. In practice the tree stops growing at depth 3 because all leaves are pure by then. Increasing max_depth further has no effect on this dataset. The interesting experiment is reducing it: at max_depth = 2 you can still achieve 96-97% accuracy, which shows how much structure Iris has along just two or three decision boundaries.
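A quick way to rerun this depth experiment is to sweep max_depth with scikit-learn's DecisionTreeClassifier as a reference. The split seed here is arbitrary and the exact accuracies depend on it, so treat this as a sketch rather than a reproduction of the article's numbers:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # 80/20 split, seed is arbitrary
)

for depth in (1, 2, 3, 5):
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    # get_depth() shows where the tree actually stopped growing
    print(f"max_depth={depth}: accuracy={clf.score(X_te, y_te):.3f}, "
          f"actual depth={clf.get_depth()}")
```

Comparing these numbers against the from-scratch tree on the same split is also a cheap regression test for the Gini and split-search code.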