Optimization / Deep Learning / From Scratch

Building Adam from Scratch

From plain gradient descent to adaptive moment estimation. Each optimizer fixes one problem the previous one had. The companion notebook benchmarks three of them (SGD, Momentum, Adam) on XOR in plain NumPy.

The Basics: Loss, Parameters, and Gradients

Before we talk about any optimizer, we need three building blocks. If you already know these, skip ahead. If not, this section sets up the foundation.

Parameters (θ)

A neural network is, at its core, a big pile of numbers. Weights, biases, thousands or millions of them. These numbers are the parameters, and we write them collectively as θ (the Greek letter theta). At the start of training, these parameters are random guesses. The entire point of training is to adjust them until the network does something useful.

Analogy
The Mixing Board

Think of a sound mixing board with hundreds of knobs. Each knob controls some aspect of the output: volume on one channel, treble on another, reverb on a third. Right now they are all set randomly and the sound is garbage. Your job is to turn these knobs until the output sounds right. The knobs are the parameters θ.

The Loss Function - L(θ)

L(θ) is the loss function, sometimes also called the cost function or objective function. It takes the current parameter values θ, runs the network on your training data, and spits out a single number that measures how wrong the network's predictions are.

If the network predicts a cat when the image is a dog, the loss is high. If it correctly predicts dog with high confidence, the loss is low. The notation L(θ) just means "the loss is a function of the parameters." Change the parameters, and the loss changes.

A common loss function for classification is cross-entropy. For regression, it is mean squared error. But the specific formula does not matter for understanding optimizers. What matters is that L(θ) produces one number, and you want that number to go down.
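Both losses boil down to a few lines of NumPy. A minimal sketch (the function names and toy predictions here are illustrative, not from the notebook):

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean squared error: average of squared differences, one number out.
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(p_pred, y_true):
    # Binary cross-entropy; clip predictions to avoid log(0).
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0])
good = mse(np.array([0.9, 0.1, 0.8]), y_true)   # close predictions -> small loss
bad  = mse(np.array([0.1, 0.9, 0.2]), y_true)   # far-off predictions -> large loss
```

Either way, the output is a single scalar, and lower is better.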

Analogy
The Score Counter

Imagine a judge watching you tune that mixing board and holding up a score card after every attempt. A high score means "that sounds terrible." A low score means "getting close." The score is L(θ). You cannot see the judge's criteria directly, but you can see the score change every time you turn a knob. Your goal: minimize the score.

The Gradient - ∇L(θ)

You have hundreds of knobs and a score. How do you know which ones to turn, and in which direction?

Imagine you wiggle just one knob, knob #47, a tiny bit to the right. The score goes up (gets worse). So you know: knob #47 should go left. Now imagine you wiggle knob #112 a tiny bit to the right. The score goes down (gets better). Great, knob #112 should go right. If you did this for every single knob, you would end up with a full set of instructions: "turn knob #1 a bit left, turn knob #2 a lot right, barely touch knob #3..." and so on.

That set of instructions is the gradient. It is just a list of numbers, one per knob (parameter), where each number tells you two things: which direction to turn that knob to make the loss worse, and how sensitive the loss is to that knob. A big number means the loss reacts strongly to that knob. A tiny number means the loss barely cares.

Analogy
The Slope Under Your Feet

Picture yourself standing on a hilly landscape where height represents the loss. The gradient is the slope under your feet. It tells you which direction is uphill and how steep it is. You want to go downhill (reduce loss), so you step in the opposite direction of the slope. The steeper the slope, the more confident you are about which way to go.

Since you want to reduce the loss, you move each parameter in the opposite direction of its gradient value. If the gradient says "increasing this weight makes the loss worse," you decrease that weight. That is the core idea behind every optimizer on this page.

In mathematical notation, the gradient is written as ∇L(θ). The ∇ symbol (called "nabla" or "del") just means "the collection of all those sensitivities." Each entry, written ∂L/∂θ, is the sensitivity of the loss to one particular parameter:

∇L(θ) = [ ∂L/∂θ₁ , ∂L/∂θ₂ , ∂L/∂θ₃ , ... , ∂L/∂θₙ ]

Do not be intimidated by the notation. Each ∂L/∂θi is just the answer to "if I nudge parameter i, how much does the loss change?" The formula is the formal way of writing what you already understood from the knob-wiggling example.

How the gradient is computed: backpropagation

In practice, you do not actually wiggle each knob one at a time. That would take forever with millions of parameters. Instead, neural networks use a clever algorithm called backpropagation that computes all the sensitivities at once in a single backward pass through the network. The details of how backprop works are a separate topic, but the result is what matters here: after one forward pass (make a prediction) and one backward pass (run backprop), you get the full gradient and know exactly which direction to adjust every parameter.
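The knob-wiggling version does exist, though: it is called a finite-difference (numerical) gradient, and it is useful for checking backprop implementations. A minimal sketch on a toy loss where the true gradient is known analytically:

```python
import numpy as np

def loss(theta):
    # Toy loss: L(theta) = sum(theta_i^2). Its exact gradient is 2*theta.
    return np.sum(theta ** 2)

def numerical_gradient(f, theta, h=1e-5):
    # "Wiggle each knob": nudge one parameter at a time and watch the loss.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        up, down = theta.copy(), theta.copy()
        up[i] += h
        down[i] -= h
        grad[i] = (f(up) - f(down)) / (2 * h)  # central difference
    return grad

theta = np.array([1.0, -2.0, 0.5])
num_grad = numerical_gradient(loss, theta)
true_grad = 2 * theta  # what backprop would hand back in one pass
```

One loss evaluation per knob-wiggle is exactly why this does not scale: backprop gets the same answer for all parameters in a single backward pass.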

Those three pieces, parameters θ, loss L(θ), and gradient ∇L(θ), are all an optimizer needs. The rest is just deciding how to use them. That starts with the simplest possible approach: gradient descent.

Gradient Descent: Hiking Downhill in Fog

Now that you know what a loss function and a gradient are, the simplest optimizer is surprisingly straightforward. Imagine you are standing on that hilly loss landscape, blindfolded, and your only goal is to reach the lowest valley. You feel the slope (compute the gradient), take a step in the steepest downhill direction, then feel the slope again and repeat.

That is gradient descent. At each step, the gradient tells you which way is uphill. You go the opposite way. The size of each step is controlled by a single number called the learning rate. Too large, and you overshoot the valley, bouncing from one hillside to the other. Too small, and you inch forward so slowly that training never finishes.

Analogy
The Blindfolded Hiker

Every step is the same fixed length. The hiker cannot look ahead to see if a cliff is coming or if the valley is just two inches away. All decisions are local, based only on the slope at the current position. That simplicity is also the core problem with vanilla gradient descent.

The update rule

At every step, you update each parameter like this:

θ = θ - η · ∇L(θ)

The whole algorithm is that one line. Compute ∇L(θ), scale it by η, subtract. Each step nudges θ toward a configuration that makes predictions less wrong.
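In NumPy, the whole optimizer really is one line inside a loop. A minimal sketch on the same toy quadratic loss (the toy gradient and starting point are illustrative):

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss L(theta) = sum(theta^2): points straight uphill.
    return 2 * theta

eta = 0.1                      # learning rate
theta = np.array([5.0, -3.0])  # arbitrary starting point
for _ in range(100):
    theta = theta - eta * grad(theta)  # the entire algorithm
```

For this loss, each step multiplies θ by (1 - 2η), so anything with 0 < η < 1 converges; push η past 1 and the iterates diverge, which is the "too large a step" failure mode made concrete.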

Where it breaks down

This simple rule has real problems. If the loss landscape is shaped like a long, narrow ravine, as it very often is in high-dimensional neural networks, the gradient points mostly across the ravine rather than along it. The optimizer zigzags back and forth, barely making progress toward the actual minimum. It also treats every parameter identically: parameters whose gradients are naturally large get the same learning rate as parameters whose gradients are tiny. And in noisy settings (stochastic mini-batches), each gradient estimate is imprecise, so the path jitters even more.

Momentum: The Rolling Ball

Now change the image. Instead of a cautious hiker taking one careful step at a time, picture a heavy ball rolling down the mountain. The ball does not stop and recalculate after every inch. It accumulates speed. If the slope keeps pointing in the same direction, the ball goes faster. If the slope suddenly reverses, the ball's built-up inertia dampens the reversal instead of immediately zigzagging.

Analogy
The Bowling Ball on a Hillside

Momentum smooths out the path. On a consistent downhill slope, it accelerates. In a noisy zigzag ravine, the lateral oscillations cancel out while the forward movement accumulates. The ball reaches the bottom of the valley faster and with less jittering.

The update rule

vₜ = β · vₜ₋₁ + η · ∇L(θ)
θ = θ - vₜ

The new variable v is the velocity. The β term (typically 0.9) controls how much of the previous velocity carries over. At 0.9, the optimizer "remembers" roughly the last 10 gradients. This is also called an exponential moving average, because older gradients fade away exponentially.
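Adding momentum to the loop means carrying one extra array between steps. A minimal sketch on the same toy quadratic (toy loss and constants are illustrative):

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss L(theta) = sum(theta^2).
    return 2 * theta

eta, beta = 0.1, 0.9
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)               # velocity starts at rest
for _ in range(300):
    v = beta * v + eta * grad(theta)   # accumulate speed along consistent slopes
    theta = theta - v                  # step by the velocity, not the raw gradient
```

The only state added over plain gradient descent is `v`, one number per parameter.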

Momentum solves the zigzag problem, but it still uses one global learning rate for every parameter. If some parameters need large steps and others need tiny ones, you are stuck picking a compromise that is not ideal for either.

Adaptive Learning Rates: Not All Parameters Are Equal

Think about a team of runners in a relay race. Some runners are sprinters on flat ground, others are climbers on steep terrain. Giving everyone the same speed instruction makes no sense. The sprinter needs to go fast, the climber needs to go slow and steady. What if each runner could self-adjust?

Analogy
The Self-Adjusting Relay Team

Optimizers like AdaGrad and RMSProp give each parameter its own learning rate. Parameters that have seen large gradients historically get a smaller effective rate (slow down, you are oscillating). Parameters with small gradients get a larger effective rate (speed up, you are barely moving).

RMSProp's idea

vₜ = β₂ · vₜ₋₁ + (1 - β₂) · gₜ²

θ = θ - η · gₜ / √vₜ

Here gₜ is shorthand for the current gradient ∇L(θ), and the square is taken element-wise, one value per parameter.

The variable v here tracks the exponential moving average of squared gradients. It is the "second moment", a measure of how volatile a parameter's gradient has been. Dividing by its square root normalizes the update: wild parameters get tamed, sleepy parameters get amplified.
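The effect is easiest to see on a badly scaled loss, where one parameter's gradients are a hundred times larger than the other's. A minimal sketch (the toy loss and hyperparameters are illustrative; the small ε in the denominator guards against division by zero):

```python
import numpy as np

def grad(theta):
    # Badly scaled toy loss L = 100*x^2 + y^2: the x-gradient is 100x steeper.
    return np.array([200.0 * theta[0], 2.0 * theta[1]])

eta, beta2, eps = 0.01, 0.9, 1e-8
theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(2000):
    g = grad(theta)
    v = beta2 * v + (1 - beta2) * g**2            # EMA of squared gradients
    theta = theta - eta * g / (np.sqrt(v) + eps)  # per-parameter normalized step
```

A single global learning rate would have to be tiny to keep the steep direction stable; the division by √v lets both directions make progress at once.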

RMSProp handles the per-parameter scaling problem, but it has no momentum. The two ideas were developed independently. The obvious next step was combining them.

Adam: The Best of Both Worlds

Adam, short for Adaptive Moment Estimation (Kingma and Ba, 2015), does exactly that. It keeps two running statistics for each parameter: a momentum-like first moment (the direction smoother) and an RMSProp-like second moment (the per-parameter rate adjuster). Then it combines them into a single, elegant update rule.

Analogy
The GPS Navigator with Traffic Data

Think of Adam as a GPS with two data feeds. The first is your recent travel direction and speed (momentum, the first moment) — it knows you have been heading north and smooths out small detours. The second is per-road traffic data (adaptive rate, the second moment) — Main Street is congested (high gradient variance) so it routes you slowly; the highway is clear (low variance) so it says floor it. Both signals together produce routes that are smoother and individually tuned per road.

Adam, step by step

1. Compute the gradient. Run a forward pass, compute the loss, then backpropagate to get gradients for every parameter.

2. Update the first moment (m). A running average of the gradient. This is the "momentum" part. It smooths out noise and remembers which direction you have been heading.

3. Update the second moment (v). A running average of the squared gradient. This is the "adaptive rate" part. It tracks how volatile each parameter's gradients are.

4. Bias-correct both. Since m and v start at zero, the early estimates are biased too low. Dividing by (1 - βᵗ) inflates them back to their true scale.

The complete update rule

First moment: mₜ = β₁ · mₜ₋₁ + (1 - β₁) · gₜ
Second moment: vₜ = β₂ · vₜ₋₁ + (1 - β₂) · gₜ²
Bias correction: m̂ₜ = mₜ / (1 - β₁ᵗ), v̂ₜ = vₜ / (1 - β₂ᵗ)
Parameter update: θ = θ - η · m̂ₜ / (√v̂ₜ + ε)
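The four equations translate line for line into NumPy. A minimal sketch of one Adam step, driven on a toy quadratic loss (the helper name `adam_step` and the toy setup are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update. t is the 1-based step counter (needed for bias correction).
    m = beta1 * m + (1 - beta1) * g           # first moment: direction smoother
    v = beta2 * v + (1 - beta2) * g**2        # second moment: per-parameter scaler
    m_hat = m / (1 - beta1**t)                # undo the cold-start bias
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy loss L(theta) = sum(theta^2) from an arbitrary start.
theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    g = 2 * theta                             # gradient of the toy loss
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)
```

Note that `t` starts at 1, not 0; with t = 0 the correction terms would divide by zero.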

Why bias correction matters

This is a subtle but important detail that many explanations gloss over. Both m and v are initialized to zero vectors. At the very first step, m becomes (1 - β₁) · g, which with β₁ = 0.9 means m is only 10% of the actual gradient. That is a huge underestimate.

The bias correction term 1 / (1 - β₁ᵗ) fixes this. At step 1, it divides by 1 - 0.9 = 0.1, effectively multiplying by 10 and recovering the true scale. At step 2, it divides by 1 - 0.81 = 0.19. By step 10, β₁¹⁰ = 0.349, so the correction is only about 1.5x. The correction naturally fades as the running average warms up.
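You can verify the arithmetic directly. If the gradient were a constant g, then mₜ = (1 - β₁ᵗ) · g exactly, so the corrected estimate recovers g at every step. A minimal sketch with an illustrative constant gradient of 1.0:

```python
import numpy as np

beta1 = 0.9
g = 1.0        # pretend the true gradient is constant at 1.0
m = 0.0
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)   # bias-corrected estimate
# After step 1 the raw m is only 0.1, yet m_hat is already 1.0, the true value.
```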

Analogy
The New Restaurant's Rating

Imagine a restaurant review site that shows a running average rating. A brand-new restaurant with one 5-star review shows 5.0, which seems too confident. A better system would say "we do not have enough data yet, so let us weight this carefully." That is what bias correction does: it compensates for the lack of history in the early steps, preventing the optimizer from being overconfident about near-zero moment estimates.

Default hyperparameters

Parameter Default What it controls
η (called α in the paper) 0.001 Overall step size. The master dial.
β₁ 0.9 Decay rate for the first moment. Higher = smoother, more inertia.
β₂ 0.999 Decay rate for the second moment. Higher = longer memory of gradient magnitudes.
ε 1e-8 Tiny constant to prevent division by zero. Rarely needs tuning.

The paper's defaults hold up well on most problems. In practice, η is the only one that usually needs touching.

Interactive: Watch the Optimizers Descend

Below is a 2D loss surface (a classic bowl-shaped function with an elongated ravine). You can watch three optimizers try to reach the minimum from the same starting point. Notice how vanilla SGD zigzags, Momentum smooths the path but overshoots, and Adam converges quickly and cleanly.


Convergence on XOR: Adam vs SGD vs Momentum

The notebook trains a two-layer network on XOR using all three optimizers. Below is the convergence comparison.

Loss over 3,000 epochs

Adam reaches near-zero loss much faster. By epoch 200 it is already below 0.05, while SGD and Momentum are still grinding through higher error. Adam's per-parameter rate scaling plus momentum means far fewer wasted steps on the way down.

Optimizer Learning Rate Final Loss Final Accuracy
Adam 0.01 0.000122 100%
SGD 0.10 0.225 100%
Momentum 0.01 0.00118 100%

All three hit 100% accuracy on XOR eventually, but the final loss numbers show how confident each model actually is. Adam's output probabilities land within 0.001 of the target. SGD still carries measurable residual error at epoch 3,000.
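The benchmark setup can be sketched in plain NumPy. The hidden width, random seed, and MSE loss below are illustrative choices, not necessarily the notebook's exact configuration, but the Adam loop is the same four equations applied to every parameter array:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Two-layer network: 2 -> 8 (tanh) -> 1 (sigmoid).
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
params = [W1, b1, W2, b2]
ms = [np.zeros_like(p) for p in params]   # first moments, one per array
vs = [np.zeros_like(p) for p in params]   # second moments, one per array
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(1, 3001):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass (MSE loss)
    dz2 = 2 * (p - y) / len(X) * p * (1 - p)
    dz1 = (dz2 @ W2.T) * (1 - h**2)
    grads = [X.T @ dz1, dz1.sum(axis=0), h.T @ dz2, dz2.sum(axis=0)]
    # Adam update for every parameter array
    for i, (param, g) in enumerate(zip(params, grads)):
        ms[i] = beta1 * ms[i] + (1 - beta1) * g
        vs[i] = beta2 * vs[i] + (1 - beta2) * g**2
        m_hat = ms[i] / (1 - beta1**t)
        v_hat = vs[i] / (1 - beta2**t)
        param -= eta * m_hat / (np.sqrt(v_hat) + eps)  # in-place update

# Final forward pass to check the trained network.
h = np.tanh(X @ W1 + b1)
p = sigmoid(h @ W2 + b2)
preds = (p > 0.5).astype(float)
final_loss = float(np.mean((p - y) ** 2))
```

Swapping the inner update for the vanilla or momentum rules from earlier sections turns this same scaffold into the other two benchmark runs.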

The Family Tree

Each optimizer builds directly on the one before it: gradient descent takes fixed steps downhill; Momentum adds a velocity term that accumulates consistent directions; RMSProp adds a per-parameter step size based on squared-gradient history; Adam combines the velocity and the per-parameter scaling, and adds bias correction on top.

When to use Adam

Adam is the default starting point for most training runs, and it earns that. It converges fast, adjusts step sizes per parameter automatically, and rarely blows up without warning. Pure SGD with a well-tuned schedule can edge it out on some large-scale vision tasks — there are published ImageNet results where that holds — but those experiments involve weeks of tuning the schedule. For anything involving language models, generative networks, or early-stage experiments, Adam is the practical choice. Start there, tune the learning rate, and move on.

Summary
The One-Sentence Version

Adam = Momentum (smooth out the direction using a running average of gradients) + RMSProp (adapt the step size per-parameter using a running average of squared gradients) + bias correction (compensate for cold-start zero initialization).