Deep Reinforcement Learning · Zero to Hero

Line by line.
By you.
From MDPs to AlphaZero and GRPO.

The most hands‑on reinforcement learning course on the internet. 18 guided notebooks, every algorithm built from scratch, with a voice‑capable AI tutor watching over your shoulder.

18 notebooks · MIT licensed · Docker-ready

RL tackles one question: how can an agent learn to make optimal decisions by interacting with an environment? No one feeds it answers. It learns by trying, failing, and being nudged by a sparse reward — much like we do.

This course is the full pipeline: you'll implement DQN, SAC, PPO, AlphaZero, Dreamer, and GRPO from scratch. You'll play Atari, land on the Moon, train robots, and align LLMs — by filling in # TODO blocks, with a solution/ folder (or the Companion) one click away.

Live · training now

An agent, learning in front of you.

A tabular Q-learning agent in a tiny gridworld. No video, no slideshow — it is actually training in your browser right now. This is the pattern you'll build, formalize, and scale across 18 notebooks.
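
For readers who want to see the pattern in code before opening a notebook, here is a minimal sketch of tabular Q-learning with ε-greedy exploration. It is not the demo's actual source: the 4x4 grid, the wall position, the reward values, the hyperparameters, and the names `step`, `train`, and `greedy_rollout` are all assumptions chosen for illustration.

```python
import random

# Hypothetical 4x4 gridworld: start top-left, goal bottom-right, one wall.
SIZE = 4
START, GOAL = (0, 0), (3, 3)
WALLS = {(1, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell; hitting a wall or the grid edge leaves the agent in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE) or (r, c) in WALLS:
        r, c = state
    done = (r, c) == GOAL
    reward = 1.0 if done else -0.01  # assumed: small step cost, +1 at the goal
    return (r, c), reward, done


def train(episodes=2000, alpha=0.5, gamma=0.95, eps_decay=0.99, seed=0):
    """Learn a Q-table with epsilon-greedy exploration that decays per episode."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
    eps = 1.0
    for _ in range(episodes):
        state, done, steps = START, False, 0
        while not done and steps < 100:
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state, steps = nxt, steps + 1
        eps = max(0.05, eps * eps_decay)
    return q


def greedy_rollout(q, max_steps=50):
    """Follow the highest-valued action from START and return the visited cells."""
    state, path = START, [START]
    for _ in range(max_steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path
```

Calling `greedy_rollout(train())` returns the cells the trained agent visits once it stops exploring. The table is a plain dictionary on purpose, so every update is easy to inspect.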

[Interactive demo: a gridworld showing the agent, goal, walls, and Q-values, with live counters for episode, steps, ε (exploration), and average reward (100 episodes), plus a reward curve.]

New · AI Companion

A tutor that actually watches you code.

The DRL‑ZH Companion is a VS Code extension. It sees which TODO your cursor is on, notices when you're stuck, and nudges with Socratic hints — never spoilers. Prefer to talk? Flip on voice mode and have a conversation.

03_DQN.ipynb
04_PG.ipynb
05_AC.ipynb
drl-zh · VS Code
In [3]:
class DQN(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # TODO: build a 2-layer MLP that maps obs → Q-values.
        # Hint: nn.Sequential with Linear and ReLU layers.
        self.net = None

    def forward(self, obs):
        return self.net(obs)
In [4]:
# Replay buffer: uniform sampling of past transitions.
buffer = ReplayBuffer(capacity=100_000)
DRL Companion
idle reading stuck drift flow
watching 03_DQN.ipynb · awareness: attentive
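
The notebook preview above uses a `ReplayBuffer` without showing it. As a rough sketch of what "uniform sampling of past transitions" can look like, here is one possible version. The method names `push` and `sample` and the transition layout are assumptions, not the course's actual API.

```python
import random


class ReplayBuffer:
    """Fixed-capacity ring buffer of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.pos = 0  # index of the next slot to write

    def push(self, obs, action, reward, next_obs, done):
        # Hypothetical transition layout: (obs, action, reward, next_obs, done).
        transition = (obs, action, reward, next_obs, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition  # overwrite the oldest entry
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement within a single batch.
        batch = random.sample(self.storage, batch_size)
        # Regroup into five tuples: obs, actions, rewards, next_obs, dones.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)
```

Sampling uniformly breaks the correlation between consecutive transitions, which is the reason experience replay is used with DQN.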

Awareness signals

Detects idle, stuck, reading, confusion, drift, flow — and tunes when to speak up.

Voice mode

Whisper STT + local neural TTS (Kokoro). Talk it out; it talks back. When you use Kokoro, no audio leaves your machine.

Bring your own LLM

Groq, Gemini, OpenAI, or Anthropic. Your keys, your bill. Free tiers cover most of the course.

Curriculum

18 notebooks. Four acts. Zero to hero.

Every chapter is a notebook with # TODO blocks you fill in, and a mirrored solution/ for when you're stuck. Progress is linear, but branches are welcome.

Act I · 3 notebooks

Foundations

The vocabulary of RL. Agent, environment, reward, policy, value.

00

Introduction

Why RL. How this course works. How to use the Companion.

01

Markov Decision Processes

The math behind decisions. Bellman, policies, value iteration.

02

RL Foundations

Agents, environments, states, actions, rewards — and Gym.

Act II · 4 notebooks

Deep RL Core

The classics, from scratch. Value‑based, policy‑based, actor‑critic.

03

Deep Q-Learning

The value-based classic that cracked Atari. Experience replay, target nets.

04

Policy Gradient

REINFORCE and descendants. Directly optimize behavior.

05

Actor-Critic & SAC

Combine value and policy. Soft Actor-Critic, maximum-entropy RL.

06

PPO

The workhorse of modern RL. Clip, advantage, stability.

Act III · 5 notebooks

Advanced

Beyond single‑agent, single‑task. Where RL gets interesting.

07

Bridge to Advanced

A map of the frontier. Where each branch leads and why.

08

Exploration & Curiosity

Sparse rewards, intrinsic motivation, RND, ICM.

09

Multi-Agent RL

Cooperation, competition, CTDE, self-play.

10

Offline & Imitation

Learning from logs. From demonstrations. BC, CQL.

11

MCTS & AlphaZero

Tree search meets deep nets. Self-play champions.

Act IV · 7 notebooks

Frontier

Where research lives today. LLM alignment, world models, meta‑learning.

12

RLHF & GRPO

How modern LLMs get aligned. Reward heuristics, PPO-for-text, GRPO.

13

Decision Transformers & VLA

RL as sequence modeling. Vision-language-action for robots.

14

Productionizing RL

Logging, eval, TensorBoard, reproducibility, pitfalls.

15

Model-Based RL

Plan inside a learned world. MBPO, PETS.

16

Dreamer & World Models

Imagination-based RL. Latent dynamics, actor-critic in a dream.

17

Meta-RL

Agents that learn to learn. MAML, RL².

18

Conclusion

What you've built. Where to go next.

Trailer

Two minutes, then you're in.

Before you start

Are you ready? Self-check.

Tick the boxes you're comfortable with. You don't need all three, but the more you tick, the smoother the course will go.

Quickstart

Three commands. You're training.

The repo ships a Docker environment with notebooks, Python, and the Companion wired up. No local setup needed beyond Docker and git.

bash
# Clone, then cd into the repo
git clone https://github.com/alessiodm/drl-zh.git && cd drl-zh

# (Linux/macOS) pin your UID/GID for the container
printf "UID=$(id -u)\nGID=$(id -g)\n" > .env

# Build and launch. Then open http://localhost:8080
docker compose up --build -d

GPU? Append -f docker-compose.gpu.yml. Prefer a manual Python setup? See the README.

I gathered years of scattered code I'd written and shaped it into a coherent course. For this third edition, AI became a collaborator — helping polish the prose, sharpen the algorithms, and build the Companion itself.