Opening hook
Andrej Karpathy built micrograd in around 100 lines of Python. It is an autograd engine that implements backpropagation over a dynamically built computation graph. If you understand micrograd, you understand the engine that trains every modern model from a 10M-parameter character-level transformer to a frontier model with hundreds of billions of parameters. The math is the same. The scale is different.
Core teaching
The first principle: a neural network is a math expression. Specifically, it is a parameterized function that takes inputs, multiplies them by weights, adds biases, applies a nonlinearity, and produces outputs. Training is the process of finding weight values that minimize a loss function over a dataset. That sentence sounds abstract until you build it from scratch. After you build it, it sounds obvious.
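To make "a neural network is a math expression" concrete, here is a minimal sketch using plain Python floats, before any autograd is involved. The function names `neuron` and `loss`, the two data points, and the parameter values are all made up for illustration; they are not taken from micrograd.

```python
import math

# Hypothetical toy example: one neuron with two inputs, written as a plain expression.
def neuron(x1, x2, w1, w2, b):
    """Weighted sum of the inputs plus a bias, passed through tanh."""
    return math.tanh(w1 * x1 + w2 * x2 + b)

def loss(params, dataset):
    """Sum of squared errors over the dataset for one choice of parameters."""
    w1, w2, b = params
    return sum((neuron(x1, x2, w1, w2, b) - y) ** 2 for (x1, x2), y in dataset)

dataset = [((2.0, 3.0), 1.0), ((3.0, -1.0), -1.0)]  # made-up points and targets

loss_a = loss((0.0, 0.0, 0.0), dataset)   # all-zero parameters predict 0 everywhere
loss_b = loss((-1.0, 1.0, 0.0), dataset)  # a hand-picked setting that fits better
print(loss_a, loss_b)
```

Training automates the search that was done by hand here: it looks for parameter values that push the loss lower, using gradients instead of guesses.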
Karpathy's micrograd teaches the foundation by building one piece at a time (Karpathy, Neural Networks: Zero to Hero, 2022). The first piece is the Value object. A Value wraps a single scalar and tracks what backpropagation needs: the operands and operation that produced it, its gradient, and a function that pushes that gradient back to its operands. Every arithmetic operation between Value objects produces a new Value that records its parents and how to apply the local derivative. This is the computation graph. The graph is built dynamically as your Python code runs.
Here is what the core of micrograd looks like, simplified to the first principle:
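class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data                # the scalar this node holds
        self.grad = 0.0                 # d(output)/d(this node), filled in by backprop
        self._prev = set(_children)     # parent nodes that produced this value
        self._op = _op                  # label of the operation, for inspection
        self._backward = lambda: None   # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # Local derivative of a + b is 1 for both operands.
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # Local derivative of a * b is b for a, and a for b.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out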
That code is the engine of deep learning. The + operation passes gradients straight through (the local derivative of a + b with respect to either input is 1). The * operation scales the gradient by the other operand (the derivative of a * b with respect to a is b). Both rules accumulate with += rather than assign, because a value that is used more than once receives a gradient contribution from every use. Once you have these two rules plus a nonlinearity like tanh, you can build any feedforward neural network and train it with gradient descent. That is the whole game.
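The nonlinearity follows the same pattern as + and *. The sketch below adds a `tanh` method to a trimmed-down copy of the class and applies the local rules by hand, output first. The input values are arbitrary, and the manual `_backward()` calls are only for illustration; a full backward pass would order them automatically.

```python
import math

class Value:
    """Trimmed-down Value: only what is needed to show tanh."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # d/dx tanh(x) = 1 - tanh(x)^2
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

# One neuron with one input: o = tanh(x * w + b). Values are arbitrary.
x, w, b = Value(0.5), Value(-2.0), Value(1.5)
xw = x * w
n = xw + b
o = n.tanh()

# Apply the local rules by hand, starting from the output.
o.grad = 1.0
o._backward()    # fills n.grad
n._backward()    # fills xw.grad and b.grad
xw._backward()   # fills x.grad and w.grad
print(x.grad, w.grad, b.grad)
```

Each gradient is the tanh slope at n, multiplied by whatever the + and * rules contribute on the way back to that input.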
The second principle: backpropagation is the chain rule applied recursively to the computation graph. The chain rule says if y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx). Backpropagation walks the graph in reverse topological order, multiplying each node's local derivative by the gradient flowing in from its output (Goodfellow, Bengio, Courville, Deep Learning, 2016, Chapter 6). Karpathy's micrograd implements this in about a dozen lines of Python:
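def backward(self):
    # Method of Value: order the graph so every node comes after its parents.
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)

    build_topo(self)

    # Seed with d(self)/d(self) = 1, then apply each local rule, output first.
    self.grad = 1.0
    for v in reversed(topo):
        v._backward()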
That is it. That is what PyTorch, JAX, and TensorFlow do under the hood, with vectorized tensors and GPU kernels and many engineering layers, but the core algorithm is the same recursive chain-rule pass.
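A small worked example shows the reverse pass producing the numbers the chain rule predicts. This sketch combines the two snippets above into one runnable class; the expression L = (a * b + c) * f and its input values are chosen only for illustration.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b, c, f = Value(2.0), Value(-3.0), Value(10.0), Value(-2.0)
e = a * b      # -6.0
d = e + c      #  4.0
L = d * f      # -8.0
L.backward()

# Chain rule by hand:
#   dL/dd = f = -2, dL/df = d = 4
#   dL/de = dL/dc = dL/dd * 1 = -2
#   dL/da = dL/de * b = 6, dL/db = dL/de * a = -4
print(a.grad, b.grad, c.grad, f.grad)  # 6.0 -4.0 -2.0 4.0
```

The order in which the parents of a node are visited does not matter; what matters is that a node's `_backward` runs only after every node that consumes it has already contributed its gradient.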
The third principle: a neuron is just a weighted sum plus a nonlinearity. A multilayer perceptron is just stacked layers of neurons. Training is just running examples through the network, computing a loss, calling backward to populate gradients, and stepping the parameters in the direction that reduces loss. Karpathy walks through building all of this on top of his Value class. Once you have built it, you have demystified the entire field.
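The loop described above can be sketched end to end for a single neuron. This is an illustrative sketch, not Karpathy's code: the `Neuron` helper, the four data points, the learning rate of 0.01, and the 100 steps are all assumptions chosen to keep the example small. A layer would be a list of such neurons, and an MLP a list of layers.

```python
import math
import random

class Value:
    """Compact Value with +, *, tanh, and backward (same rules as above)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

class Neuron:
    """Hypothetical helper: tanh(w . x + b) with randomly initialized weights."""
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(0.0)

    def __call__(self, xs):
        act = self.b
        for wi, xi in zip(self.w, xs):
            act = act + wi * Value(xi)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]

random.seed(0)
neuron = Neuron(2)
xs = [(2.0, 3.0), (3.0, -1.0), (0.5, 1.0), (1.0, 1.0)]  # made-up inputs
ys = [1.0, -1.0, -1.0, 1.0]                              # made-up targets

losses = []
for step in range(100):
    # Forward pass: sum of squared errors over the dataset.
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        diff = neuron(x) + Value(-y)
        loss = loss + diff * diff
    # Backward pass: reset stale gradients, then backpropagate.
    for p in neuron.parameters():
        p.grad = 0.0
    loss.backward()
    # Gradient descent step: move each parameter against its gradient.
    for p in neuron.parameters():
        p.data -= 0.01 * p.grad
    losses.append(loss.data)

print(losses[0], losses[-1])
```

Resetting the gradients before each `backward()` call matters because the `_backward` rules accumulate with +=; without the reset, every step would add on top of the previous step's gradients.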
The fourth principle: why this matters for an AI engineer. You will spend most of your career using PyTorch or JAX, not micrograd. The point is not to write your own autograd in production. The point is that when a model fails, when a loss curve goes flat, when a gradient explodes, when a layer norm placement is wrong, you need to be able to reason about what is happening at the autograd level. Engineers who have built autograd from scratch can debug models. Engineers who have only used model.fit cannot. This is why Karpathy's pedagogy starts here, and why every serious AI engineer should walk through micrograd at least once.
The fifth principle: the bitter lesson preview. As Sutton's bitter lesson (which we cover in Lesson 1.6) makes clear, the field has consistently rewarded methods that scale with compute and data over methods that bake in human knowledge. Backpropagation through computation graphs is the scaling-friendly substrate that everything else has been built on. Understanding it from first principles is non-negotiable.
The sixth principle: build, then read. The fast.ai philosophy is to build first, theorize second (Howard, Practical Deep Learning, 2024). Walk through Karpathy's micrograd video while implementing it yourself. Do not skip the implementation. Implementation is where the understanding lives. Once your implementation works, read the relevant chapters of Goodfellow, Bengio, and Courville. The theory will land differently after your hands have done the work.
AI-specific application
For the AI engineer in 2026, micrograd is the first checkpoint. Frontier model training is conceptually identical to training a 100-line MLP on the moons dataset, except the model has 100 billion parameters instead of 100, the data is 15 trillion tokens instead of 200 points, the compute is thousands of H100s instead of your laptop, and the optimizer is AdamW with carefully tuned schedules instead of vanilla SGD. The math is the same. The chain rule is the same. The autograd is the same.
This matters because the AI engineering interview at Anthropic, OpenAI, DeepMind, or any AI-first company will probe whether you understand the foundation. You will not be asked to write micrograd from scratch in 45 minutes (probably), but you will be asked to reason about gradient flow in a transformer, to explain why a layer norm placement matters, to debug a training run that has stopped converging. Engineers who skipped this layer cannot answer those questions credibly. Engineers who walked through micrograd answer them naturally because they built the underlying mental model.
Practice exercises
- Implement micrograd from scratch. Do not copy from the repo. Watch Karpathy's video at 1.0 speed and pause to write each piece. Implement Value with +, *, tanh, and backward. Test it on a single neuron and verify the gradient matches PyTorch's autograd to numerical precision.
- Train a 2-layer MLP on the moons dataset using only your micrograd. No PyTorch. Plot the loss curve. Plot the decision boundary. Confirm convergence.
- Break it intentionally. Introduce a bug in your _backward for *. Train and observe what fails. Restore. Introduce a different bug in the topological sort. Observe. The point is to internalize what each piece does by removing it.
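As one possible starting point for the "break it intentionally" exercise, the sketch below plants a single bug (overwriting gradients with = instead of accumulating with +=) and exposes it with a finite-difference check. The names `BuggyValue`, `analytic_grad`, and `numeric_grad` are hypothetical helpers invented for this example, and the test function y = x * x is chosen because it uses x twice.

```python
class Value:
    """Minimal Value with * and backward, enough to test one rule."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

class BuggyValue(Value):
    """Deliberate bug: '=' overwrites the gradient instead of accumulating."""
    def __mul__(self, other):
        out = BuggyValue(self.data * other.data, (self, other))
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

def analytic_grad(cls, x0):
    """Gradient of y = x * x at x0, as computed by the given Value class."""
    x = cls(x0)
    y = x * x
    y.backward()
    return x.grad

def numeric_grad(x0, h=1e-6):
    """Central finite difference for y = x * x at x0."""
    return ((x0 + h) ** 2 - (x0 - h) ** 2) / (2 * h)

print(analytic_grad(Value, 3.0))       # 6.0, matches d(x^2)/dx = 2x
print(analytic_grad(BuggyValue, 3.0))  # 3.0, the second use overwrote the first
print(numeric_grad(3.0))               # approximately 6.0
```

The buggy class still gives correct gradients on expressions where every value is used exactly once, which is why a check on a reused value is the one that catches it.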
Knowledge check
Question 1. What does micrograd's Value class track in addition to a scalar data? a) Only the value b) The value, the gradient, the parent operations, and the backward function [correct, these four pieces are what enable autograd] c) The value and a label d) The value and a learning rate
Question 2. When two Value objects are added, what is the local gradient that propagates back to each operand? a) The product of the operands b) 1 for each operand [correct, derivative of a + b with respect to either input is 1] c) 0 d) The sum of the operands
Question 3. When two Value objects are multiplied, what is the local gradient that propagates back to operand a? a) 1 b) The data of the other operand b [correct, derivative of a * b with respect to a is b] c) The data of a d) The output gradient times itself
Question 4. What does backpropagation do, in one sentence? a) It computes loss b) It applies the chain rule recursively in reverse topological order over the computation graph to populate gradients [correct, that is the entire algorithm] c) It updates weights d) It samples from the model
Question 5. Why does Karpathy teach micrograd before teaching production frameworks like PyTorch? a) Because micrograd is faster b) Because the math is the same as production frameworks, and building it from scratch creates the mental model that lets you debug production systems [correct, pedagogical foundation] c) Because PyTorch is too complex for beginners d) Because micrograd scales to large models
Question 6. What is a computation graph? a) A static neural network architecture b) A directed graph of operations and operands built dynamically as Python executes, used to enable automatic differentiation [correct] c) A loss curve d) A training schedule
Question 7. Why does backward initialize the output gradient to 1.0 before walking the graph in reverse? a) To zero the gradients b) Because the derivative of the output with respect to itself is 1, and that seeds the chain-rule walk [correct] c) Because the loss is always 1 d) Because PyTorch does it that way
Slide deck outline
- Title slide: "Lesson 1.1, Neural networks from scratch (Karpathy's micrograd)"
- Hook: 100 lines of Python is the engine of frontier models
- The first principle: a neural network is a math expression
- The Value object: data, grad, parents, backward
- Code walkthrough: __add__
- Code walkthrough: __mul__
- The chain rule visualization
- Reverse topological order on a small graph
- Code walkthrough: backward()
- From Value to neuron to MLP
- Loss, gradient, parameter step
- The full training loop in pseudocode
- Why this matters: debugging at the autograd level
- The bitter lesson preview
- Build first, theory second (fast.ai philosophy)
- From micrograd to PyTorch: same algorithm, vectorized
- Frontier model training: same math, different scale
- The AI engineering interview implication
- Common implementation mistakes
- Citations: Karpathy, Goodfellow, fast.ai
- Practice exercise summary
- Transition to Lesson 1.2
Reference reading
- Karpathy, micrograd repo: https://github.com/karpathy/micrograd
- Karpathy, Neural Networks: Zero to Hero playlist: https://karpathy.ai/zero-to-hero.html
- Goodfellow, Bengio, Courville, Deep Learning, Chapter 6: https://www.deeplearningbook.org/
- fast.ai, Practical Deep Learning for Coders: https://course.fast.ai/
Transition
You have the autograd engine. Next, you build a generative model on top of it. Lesson 1.2 covers the makemore series: a character-level language model that takes you from a bigram baseline through an MLP to a recurrent model, all on the foundation you just built.