Backpropagation from Scratch
Training a neural network by hand with NumPy, no autograd, no shortcuts
Introduction
Every neural network library hides the training loop behind a .backward() call. You define a loss, call backward, and the gradients appear. This is great for building models quickly, but it makes it easy to treat backpropagation as a black box. Writing it yourself, even once, changes that permanently.
This notebook builds a fully connected neural network from scratch using only NumPy. No PyTorch, no TensorFlow, no autograd. The network trains on a binary classification task using sigmoid activations, binary cross-entropy loss, and gradient descent. Every gradient is computed by hand following the chain rule, and every weight update is written out explicitly.
The goal is not to produce a production-ready library. It is to make the math concrete enough that you can follow what a real framework does when you look under the hood. Once you have written the update rule for a single layer, the rest is just repeating the same logic backwards through the network.
The Math & Implementation
Training a neural network comes down to two passes. The forward pass computes predictions. The backward pass figures out how much each weight contributed to the error, then nudges them in the direction that reduces it.
Forward pass: For each layer, compute a linear combination of the inputs and weights: Z = W @ X + b, then apply an activation function to get the output A = activation(Z). The output of one layer feeds into the next. The final layer uses sigmoid to produce a probability between 0 and 1.
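The forward step for one layer can be sketched in a few lines of NumPy. This is a minimal sketch, not the notebook's exact code; `layer_forward` and the cache tuple are illustrative names:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, A_prev):
    Z = W @ A_prev + b      # linear combination of inputs and weights
    A = sigmoid(Z)          # activation
    return A, (Z, A_prev)   # cache what the backward pass will need
```

Chaining calls to `layer_forward`, feeding each layer's `A` into the next, produces the full forward pass.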
Loss: Binary cross-entropy measures how far the predicted probabilities are from the true labels: L = -[y * log(a) + (1-y) * log(1-a)]. This loss is close to zero when the prediction is confident and correct, and blows up when the prediction is confident and wrong.
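A direct NumPy version of this loss, averaged over the batch. The clipping epsilon is a numerical guard I've added so `log` never sees an exact zero; it is not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(A, Y, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 (numerical guard only)
    A = np.clip(A, eps, 1 - eps)
    # Mean of -[y*log(a) + (1-y)*log(1-a)] over the batch
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
```

For a balanced dataset, a constant prediction of 0.5 gives a loss of log(2) ≈ 0.693, which is the "random chance" baseline quoted in the results below.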
Backward pass: This is where the chain rule comes in. The gradient of the loss with respect to any weight is computed by chaining together the partial derivatives through every operation that weight touched. Working backwards from the output:
- Output layer gradient: dL/dZ = A - Y for sigmoid + cross-entropy. This clean form is one reason these two are almost always paired together.
- Weight gradients: dL/dW = (1/m) * dZ @ A_prev.T and dL/db = (1/m) * sum(dZ), where m is the batch size and A_prev is the activation from the layer before.
- Propagate backwards: Pass the gradient through the weight matrix transposed to get dA_prev = W.T @ dZ, then multiply by the activation derivative to get dZ_prev = dA_prev * activation_deriv(Z_prev).
- Weight update: Subtract the scaled gradient: W = W - lr * dW, b = b - lr * db.
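The gradient steps above, for a single layer, might look like this in NumPy. A sketch under the document's conventions (columns are samples); `layer_backward` and its signature are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_backward(dZ, W, A_prev, Z_prev=None):
    # Gradients for one layer, given dL/dZ flowing in from the layer above
    m = A_prev.shape[1]                     # batch size
    dW = dZ @ A_prev.T / m                  # dL/dW
    db = dZ.sum(axis=1, keepdims=True) / m  # dL/db
    dZ_prev = None
    if Z_prev is not None:                  # nothing to propagate at the input layer
        dA_prev = W.T @ dZ                  # route gradient through W transposed
        s = sigmoid(Z_prev)
        dZ_prev = dA_prev * s * (1 - s)     # sigmoid'(z) = s(z) * (1 - s(z))
    return dW, db, dZ_prev
```

The update step is then simply `W -= lr * dW` and `b -= lr * db` for each layer.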
The implementation stores the intermediate values from the forward pass (the Z and A arrays for each layer) in a cache, because the backward pass needs them. Without the cache, you would have to recompute half the forward pass again for each gradient, which gets expensive fast.
Weight initialization matters more than it looks. Initializing all weights to zero fails to break symmetry: every neuron in a layer computes the same gradient and learns the same thing, so you effectively have a single neuron no matter how many you add. Small random initialization (scaled by sqrt(1/n_prev) for Xavier-style scaling) breaks this symmetry without sending activations into the saturated tails of the sigmoid from the start.
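An initialization helper along these lines (a sketch; `init_params` is an illustrative name, and zero biases are a common choice rather than anything the text mandates):

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    # Xavier-style scaling: weights drawn with std sqrt(1 / n_prev),
    # keeping pre-activations near 0 where sigmoid's gradient is largest
    rng = np.random.default_rng(seed)
    Ws, bs = [], []
    for n_prev, n in zip(layer_sizes[:-1], layer_sizes[1:]):
        Ws.append(rng.standard_normal((n, n_prev)) * np.sqrt(1.0 / n_prev))
        bs.append(np.zeros((n, 1)))  # zero biases are fine; only weights need asymmetry
    return Ws, bs
```

Zero biases do not reintroduce the symmetry problem because the random weights already make each neuron's output distinct.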
Results & Notebook
The network is a two-hidden-layer architecture: input layer matching the feature count, two hidden layers with 8 and 4 units respectively, and a single sigmoid output unit. The training dataset is a synthetic binary classification problem generated with make_classification, 1000 samples with 10 features.
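The data setup might look like the following, assuming scikit-learn is available for make_classification; the split ratio, seeds, and the column-major layout are illustrative choices, not confirmed details of the notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 1000 samples, 10 features, binary labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Layout assumed by the layer equations: features x samples, labels as a row
X_train, X_test = X_train.T, X_test.T
y_train, y_test = y_train.reshape(1, -1), y_test.reshape(1, -1)
```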
After 1000 gradient descent steps with a learning rate of 0.01, cross-entropy loss drops from around 0.69 (random chance for balanced classes) to below 0.15. Test accuracy comes in at 96%. The loss curve drops steeply in the first 200 iterations and flattens out afterward, which is the typical pattern for a well-tuned learning rate on a well-separable problem.
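Putting the pieces together, a condensed end-to-end version of the training loop. This sketch substitutes a random linearly separable dataset for make_classification to stay dependency-free, and the learning rate and step count are illustrative rather than the notebook's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in data: random 10-d points, labeled by a random hyperplane
m = 200
X = rng.standard_normal((10, m))
Y = (rng.standard_normal((1, 10)) @ X > 0).astype(float)

# 10 -> 8 -> 4 -> 1 architecture with Xavier-style init
sizes = [10, 8, 4, 1]
Ws = [rng.standard_normal((n, p)) * np.sqrt(1.0 / p) for p, n in zip(sizes, sizes[1:])]
bs = [np.zeros((n, 1)) for n in sizes[1:]]

lr = 1.0  # illustrative; tuned for this toy problem, not the sklearn data
for step in range(2000):
    # Forward pass, caching every activation
    As = [X]
    for W, b in zip(Ws, bs):
        As.append(sigmoid(W @ As[-1] + b))
    # Backward pass starts from dZ = A - Y (sigmoid + cross-entropy)
    dZ = As[-1] - Y
    for l in reversed(range(len(Ws))):
        dW = dZ @ As[l].T / m
        db = dZ.sum(axis=1, keepdims=True) / m
        if l > 0:
            # sigmoid'(Z) written via the cached activation: A * (1 - A)
            dZ = (Ws[l].T @ dZ) * As[l] * (1 - As[l])
        Ws[l] -= lr * dW
        bs[l] -= lr * db

# Final loss after training
A = X
for W, b in zip(Ws, bs):
    A = sigmoid(W @ A + b)
loss = -np.mean(Y * np.log(A + 1e-12) + (1 - Y) * np.log(1 - A + 1e-12))
```

Note how little state the loop carries: just the cached activations from the most recent forward pass, consumed in reverse order.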
One thing that becomes obvious writing this from scratch: the gradient computation for a 3-layer network is not complicated, it is just tedious. The chain rule always works the same way. Each layer only needs to know three things: the gradient flowing in from the layer above, its own cached Z, and the previous layer's activation. That locality is actually what makes deep networks tractable to train, and it is also why the computational graph abstraction that frameworks like PyTorch use makes sense once you have done this by hand.
A common issue when implementing this for the first time is vanishing gradients. With many sigmoid layers, the gradient gets multiplied by the sigmoid derivative, which is at most 0.25, at every step, so it shrinks to near-zero by the time it reaches the early layers. ReLU avoids this because its derivative is either 0 or 1, which does not compound the shrinkage. The notebook sticks with sigmoid to keep the math self-contained, but the fix for deeper networks is simply switching to ReLU in the hidden layers.
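The shrinkage is easy to quantify. Even in the best case, where every pre-activation sits exactly at zero, each sigmoid layer contributes a factor of 0.25, so the gradient reaching the first layer of a 10-layer stack is under a millionth of the output gradient before the weights enter the picture:

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)  # peaks at 0.25 when z = 0

# Best case for sigmoid: every pre-activation is 0, giving the
# maximum per-layer factor of 0.25
depth = 10
shrink = sigmoid_deriv(np.zeros(depth)).prod()  # 0.25 ** 10, about 9.5e-7
```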