Free sample lesson: Lesson 1.1, Neural networks from scratch (Karpathy's micrograd)

Opening hook

Andrej Karpathy built micrograd in around 100 lines of Python. It is an autograd engine that implements backpropagation over a dynamically built computation graph. If you understand micrograd, you understand the engine that trains every modern model from a 10M-parameter character-level transformer to a frontier model with hundreds of billions of parameters. The math is the same. The scale is different.

Core teaching

The first principle: a neural network is a math expression. Specifically, it is a parameterized function that takes inputs, multiplies them by weights, adds biases, applies a nonlinearity, and produces outputs. Training is the process of finding weight values that minimize a loss function over a dataset. That sentence sounds abstract until you build it from scratch. After you build it, it sounds obvious.
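To make "a neural network is a math expression" concrete, here is a minimal sketch of a single neuron and a loss written with plain Python floats. The function names and the input, weight, and bias values are hypothetical, chosen only for illustration; nothing here is micrograd yet.

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, squashed by tanh."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(pre_activation)

def squared_error(prediction, target):
    """Loss for a single example: smaller means a better fit."""
    return (prediction - target) ** 2

# Hypothetical values, picked so the arithmetic is easy to follow.
x = [2.0, 0.0]
w = [-3.0, 1.0]
b = 6.0

y_pred = neuron(x, w, b)            # tanh(-6 + 0 + 6) = tanh(0) = 0.0
loss = squared_error(y_pred, 1.0)   # (0 - 1)^2 = 1.0
print(y_pred, loss)
```

Training means changing w and b so that loss goes down. The rest of the lesson is about computing how each of those numbers should change.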

Karpathy's micrograd teaches the foundation by building one piece at a time (Karpathy, Neural Networks: Zero to Hero, 2022). The first piece is the Value object. A Value wraps a single scalar number and tracks two things: the operations that produced it, and a gradient. Every arithmetic operation between Value objects produces a new Value that records its parents and a small function that applies the local derivative. This is the computation graph. The graph is built dynamically as your Python code runs.

Here is what the core of micrograd looks like, simplified to the first principle:

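class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data                 # the scalar this node holds
        self.grad = 0.0                  # d(output)/d(this node), filled in by backprop
        self._prev = set(_children)      # parent nodes that produced this value
        self._op = _op                   # operation label, useful for debugging
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # d(a + b)/da = d(a + b)/db = 1, so the gradient passes straight through.
            # += accumulates when a node feeds more than one operation.
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out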

That code is the engine of deep learning. The + operation passes gradients straight through (the local derivative of a + b with respect to either input is 1). The * operation scales the incoming gradient by the other operand (the derivative of a * b with respect to a is b). Once you have these two rules plus a nonlinearity like tanh, you can build any feedforward neural network and train it with gradient descent. That is the whole game.
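The nonlinearity follows the same pattern as + and *. The sketch below adds a tanh method to a trimmed-down Value that keeps only the constructor from the listing above. The method body is an illustration written for this lesson, not a copy of the micrograd source; it relies on the standard identity that the derivative of tanh(x) is 1 - tanh(x)^2.

```python
import math

class Value:
    """Trimmed-down Value holding only what the tanh example needs."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # Local derivative of tanh(x) is 1 - tanh(x)^2.
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

x = Value(0.5)
y = x.tanh()
y.grad = 1.0     # seed: dy/dy = 1
y._backward()    # one manual backpropagation step
print(x.grad)
```

Seeding y.grad by hand and calling _backward once is what the full backward pass automates for every node in the graph.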

The second principle: backpropagation is the chain rule applied recursively to the computation graph. The chain rule says that if y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx). Backpropagation walks the graph in reverse topological order, multiplying each node's gradient by the local derivatives and summing the contributions wherever a value feeds more than one operation (Goodfellow, Bengio, Courville, Deep Learning, 2016, Chapter 6). Karpathy's micrograd implements this in about a dozen lines of Python:

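# Method of the Value class, shown here on its own.
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        # Depth-first search: a node is appended only after all of its parents.
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1.0             # seed: d(output)/d(output) = 1
    for v in reversed(topo):    # output first, inputs last
        v._backward()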

That is it. PyTorch, JAX, and TensorFlow do the same thing under the hood. They add vectorized tensors, GPU kernels, and many engineering layers, but the core algorithm is the same reverse-mode chain-rule pass.
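To see the pass run end to end, the following sketch restates the Value class compactly, attaches backward as a method, and differentiates a small expression in which one input is used twice. The expression and the input values are hypothetical, chosen so the result can be checked by hand: for e = a * b + a, de/da = b + 1 and de/db = a.

```python
class Value:
    """Compact copy of the lesson's Value, with backward attached as a method."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a = Value(2.0)
b = Value(-3.0)
e = a * b + a    # a feeds both the multiplication and the addition

e.backward()
print(e.data, a.grad, b.grad)   # -4.0 -2.0 2.0
```

The gradient of a is the sum of two paths: -3.0 through the multiplication and 1.0 through the addition. This is why each _backward uses += rather than =. With plain assignment, one of the two contributions would be overwritten.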

The third principle: a neuron is just a weighted sum plus a nonlinearity. A multilayer perceptron is just stacked layers of neurons. Training is just running examples through the network, computing a loss, resetting the gradients to zero, calling backward to populate them, and stepping the parameters in the direction that reduces loss. The reset matters because every _backward accumulates with +=, so stale gradients from the previous step would otherwise be added to the new ones. Karpathy walks through building all of this on top of his Value class. Once you have built it, you have demystified the entire field.
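The loop described above can be sketched in well under 100 lines. The block below is a minimal illustration, not Karpathy's code: the Neuron class, the four-point dataset, the starting weights, and the learning rate are all hypothetical, and it trains a single neuron rather than a full MLP to stay short. Stacking several such neurons into layers follows the same pattern.

```python
import math

class Value:
    """Scalar autograd node; a compact restatement of the lesson's Value."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

class Neuron:
    """Hypothetical helper: tanh(w . x + b) built from Value objects."""
    def __init__(self, weights, bias):
        self.w = [Value(w) for w in weights]
        self.b = Value(bias)

    def __call__(self, xs):
        act = self.b
        for w, x in zip(self.w, xs):
            act = act + w * Value(x)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]

# Toy data (assumed for illustration): the target is the sign of the first input.
xs = [[1.0, 0.5], [-1.0, 0.3], [0.8, -0.6], [-0.7, -0.9]]
ys = [1.0, -1.0, 1.0, -1.0]
neuron = Neuron([0.1, -0.2], 0.0)
learning_rate = 0.05
losses = []

for step in range(100):
    # Forward pass: sum of squared errors over the dataset.
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        diff = neuron(x) + Value(-y)
        loss = loss + diff * diff
    # Zero old gradients, backpropagate, then take one gradient-descent step.
    for p in neuron.parameters():
        p.grad = 0.0
    loss.backward()
    for p in neuron.parameters():
        p.data -= learning_rate * p.grad
    losses.append(loss.data)

print(losses[0], losses[-1])
```

The graph is rebuilt from scratch on every step, which is what "built dynamically" means in practice: only the parameters persist between iterations.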

The fourth principle: why this matters for an AI engineer. You will spend most of your career using PyTorch or JAX, not micrograd. The point is not to write your own autograd in production. The point is that when a model fails, when a loss curve goes flat, when a gradient explodes, when a layer norm placement is wrong, you need to be able to reason about what is happening at the autograd level. Engineers who have built autograd from scratch can debug models. Engineers who have only used model.fit cannot. This is why Karpathy's pedagogy starts here, and why every serious AI engineer should walk through micrograd at least once.

The fifth principle: the bitter lesson preview. As Sutton's bitter lesson (which we cover in Lesson 1.6) makes clear, the field has consistently rewarded methods that scale with compute and data over methods that bake in human knowledge. Backpropagation through computation graphs is the scaling-friendly substrate that everything else has been built on. Understanding it from first principles is non-negotiable.

The sixth principle: build, then read. The fast.ai philosophy is to build first, theorize second (Howard, Practical Deep Learning, 2024). Walk through Karpathy's micrograd video while implementing it yourself. Do not skip the implementation. Implementation is where the understanding lives. After implementation, then read the relevant chapters of Goodfellow, Bengio, and Courville. The theory will land differently after your hands have done the work.

AI-specific application

For the AI engineer in 2026, micrograd is the first checkpoint. Frontier model training is conceptually identical to training a 100-line MLP on the moons dataset, except the model has 100 billion parameters instead of 100, the data is 15 trillion tokens instead of 200 points, the compute is thousands of H100s instead of your laptop, and the optimizer is AdamW with carefully tuned schedules instead of vanilla SGD. The math is the same. The chain rule is the same. The autograd is the same.

This matters because the AI engineering interview at Anthropic, OpenAI, DeepMind, or any AI-first company will probe whether you understand the foundation. You will not be asked to write micrograd from scratch in 45 minutes (probably), but you will be asked to reason about gradient flow in a transformer, to explain why a layer norm placement matters, to debug a training run that has stopped converging. Engineers who skipped this layer cannot answer those questions credibly. Engineers who walked through micrograd answer them naturally because they built the underlying mental model.

Practice exercises

Implement micrograd from scratch. Do not copy from the repo. Watch Karpathy's video at 1.0 speed and pause to write each piece. Implement Value with +, *, tanh, and backward. Test it on a single neuron and verify the gradient matches PyTorch's autograd to numerical precision.
Train a 2-layer MLP on the moons dataset using only your micrograd. No PyTorch. Plot the loss curve. Plot the decision boundary. Confirm convergence.
Break it intentionally. Introduce a bug in your _backward for *. Train and observe what fails. Restore. Introduce a different bug in the topological sort. Observe. The point is to internalize what each piece does by removing it.
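One way to check your gradients, and to see a broken _backward get caught, is to compare them against finite differences. The sketch below uses that as a dependency-free stand-in for the PyTorch comparison in the first exercise. The buggy_mul switch, the test expression, and the helper names are hypothetical and exist only for this demonstration.

```python
class Value:
    buggy_mul = False  # hypothetical switch, exists only for this demonstration

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            if Value.buggy_mul:
                # Deliberate bug: each operand is scaled by its own data.
                self.grad += self.data * out.grad
                other.grad += other.data * out.grad
            else:
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

def expr(a, b, c):
    # Works on plain floats and on Value objects alike.
    return a * b + c * a

def autograd_grads(inputs):
    vals = [Value(x) for x in inputs]
    expr(*vals).backward()
    return [v.grad for v in vals]

def numeric_grads(inputs, h=1e-6):
    # Central differences: nudge one input at a time in both directions.
    grads = []
    for i in range(len(inputs)):
        plus, minus = list(inputs), list(inputs)
        plus[i] += h
        minus[i] -= h
        grads.append((expr(*plus) - expr(*minus)) / (2 * h))
    return grads

def max_abs_error(inputs):
    pairs = zip(autograd_grads(inputs), numeric_grads(inputs))
    return max(abs(auto - num) for auto, num in pairs)

point = [2.0, -3.0, 4.0]
ok_error = max_abs_error(point)     # correct rules: tiny rounding error only
Value.buggy_mul = True
bug_error = max_abs_error(point)    # broken rule: gradients are far off
Value.buggy_mul = False
print(ok_error, bug_error)
```

A check like this takes seconds to run and points at a wrong local derivative directly, whereas a broken rule inside a training run may only show up as a loss curve that refuses to go down.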

Knowledge check

Question 1. What does micrograd's Value class track for each scalar?
    a) Only the value
    b) The value, the gradient, the parent operations, and the backward function [correct, these four pieces are what enable autograd]
    c) The value and a label
    d) The value and a learning rate

Question 2. When two Value objects are added, what is the local gradient that propagates back to each operand?
    a) The product of the operands
    b) 1 for each operand [correct, derivative of a + b with respect to either input is 1]
    c) 0
    d) The sum of the operands

Question 3. When two Value objects a and b are multiplied, what is the local gradient that propagates back to operand a?
    a) 1
    b) The data of the other operand, b [correct, derivative of a * b with respect to a is b]
    c) The data of a
    d) The output gradient times itself

Question 4. What does backpropagation do, in one sentence?
    a) It computes loss
    b) It applies the chain rule recursively in reverse topological order over the computation graph to populate gradients [correct, that is the entire algorithm]
    c) It updates weights
    d) It samples from the model

Question 5. Why does Karpathy teach micrograd before teaching production frameworks like PyTorch?
    a) Because micrograd is faster
    b) Because the math is the same as production frameworks, and building it from scratch creates the mental model that lets you debug production systems [correct, pedagogical foundation]
    c) Because PyTorch is too complex for beginners
    d) Because micrograd scales to large models

Question 6. What is a computation graph?
    a) A static neural network architecture
    b) A directed graph of operations and operands built dynamically as Python executes, used to enable automatic differentiation [correct]
    c) A loss curve
    d) A training schedule

Question 7. Why does backward initialize the output gradient to 1.0 before walking the graph in reverse?
    a) To zero the gradients
    b) Because the derivative of the output with respect to itself is 1, and that seeds the chain-rule walk [correct]
    c) Because the loss is always 1
    d) Because PyTorch does it that way

Slide deck outline

Title slide: "Lesson 1.1, Neural networks from scratch (Karpathy's micrograd)"
Hook: 100 lines of Python is the engine of frontier models
The first principle: a neural network is a math expression
The Value object: data, grad, parents, backward
Code walkthrough: __add__
Code walkthrough: __mul__
The chain rule visualization
Reverse topological order on a small graph
Code walkthrough: backward()
From Value to neuron to MLP
Loss, gradient, parameter step
The full training loop in pseudocode
Why this matters: debugging at the autograd level
The bitter lesson preview
Build first, theory second (fast.ai philosophy)
From micrograd to PyTorch: same algorithm, vectorized
Frontier model training: same math, different scale
The AI engineering interview implication
Common implementation mistakes
Citations: Karpathy, Goodfellow, fast.ai
Practice exercise summary
Transition to Lesson 1.2

Reference reading

Karpathy, micrograd repo: https://github.com/karpathy/micrograd
Karpathy, Neural Networks: Zero to Hero playlist: https://karpathy.ai/zero-to-hero.html
Goodfellow, Bengio, Courville, Deep Learning, Chapter 6: https://www.deeplearningbook.org/
fast.ai, Practical Deep Learning for Coders: https://course.fast.ai/

Transition

You have the autograd engine. Next, you build a generative model on top of it. Lesson 1.2 covers the makemore series: a character-level language model that takes you from a bigram baseline through an MLP to a recurrent model, all on the foundation you just built.

That was one lesson. The course has 50.
