Introduction
Why RL. How this course works. How to use the Companion.
The most hands‑on reinforcement learning course on the internet. 18 guided notebooks, every algorithm built from scratch, with a voice‑capable AI tutor watching over your shoulder.
RL tackles one question: how can an agent learn to make optimal decisions by interacting with an environment? No one feeds it answers. It learns by trying, failing, and being nudged by a sparse reward — much like we do.
This course is the full pipeline: you'll implement DQN, SAC, PPO, AlphaZero, Dreamer, and GRPO from scratch. You'll play Atari, land on the Moon, train robots, and align LLMs — by filling in # TODO blocks, with a solution/ folder (or the Companion) one click away.
A tabular Q-learning agent in a tiny gridworld. No video, no slideshow — it is actually training in your browser right now. This is the pattern you'll build, formalize, and scale across 18 notebooks.
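For readers who want to see the pattern in code, here is a minimal sketch of tabular Q-learning. It is not the course's notebook code or the in-browser demo: the 4x4 grid, the reward of 1 at the goal, the hyperparameters, and every function name (`step`, `greedy`, `train`, `greedy_path_length`) are assumptions made for illustration.

```python
import random

# Hypothetical 4x4 gridworld: start in the top-left corner, goal in the bottom-right.
SIZE = 4
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell, staying inside the grid. Reward is 1 only on reaching the goal."""
    row = min(max(state[0] + ACTIONS[action][0], 0), SIZE - 1)
    col = min(max(state[1] + ACTIONS[action][1], 0), SIZE - 1)
    next_state = (row, col)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def greedy(q, state, rng):
    """Pick a highest-valued action, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a, v in enumerate(q[state]) if v == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=200, seed=0):
    """Run epsilon-greedy tabular Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(max_steps):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = greedy(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value (zero at the goal).
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_path_length(q, max_steps=50):
    """Follow the greedy policy from the start; return steps to the goal, or None."""
    state, rng = (0, 0), random.Random(0)
    for steps in range(1, max_steps + 1):
        state, _, done = step(state, greedy(q, state, rng))
        if done:
            return steps
    return None
```

Calling `greedy_path_length(train())` follows the learned policy from the start cell. The shortest possible route in this grid is 6 moves, so a well-trained table should land at or near that number.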
The DRL‑ZH Companion is a VS Code extension. It sees which TODO your cursor is on, notices when you're stuck, and nudges with Socratic hints — never spoilers. Prefer to talk? Flip on voice mode and have a conversation.
class DQN(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # TODO: build a 2-layer MLP that maps obs → Q-values.
        # Hint: nn.Sequential with Linear and ReLU layers.
        self.net = None

    def forward(self, obs):
        return self.net(obs)


# Replay buffer: uniform sampling of past transitions.
buffer = ReplayBuffer(capacity=100_000)
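The snippet above uses a `ReplayBuffer` without defining it. As a rough sketch of what "uniform sampling of past transitions" can look like, here is one possible version. Only the `capacity` argument comes from the snippet; the `push` and `sample` method names and the `(obs, action, reward, next_obs, done)` transition layout are assumptions, not the course's actual class.

```python
import random


class ReplayBuffer:
    """Fixed-capacity ring buffer of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index of the next slot to write

    def push(self, obs, action, reward, next_obs, done):
        """Store one transition, overwriting the oldest once the buffer is full."""
        transition = (obs, action, reward, next_obs, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        batch = random.sample(self.storage, batch_size)
        # Transpose a list of transitions into one tuple per field.
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones

    def __len__(self):
        return len(self.storage)
```

Sampling uniformly from old and new transitions breaks the correlation between consecutive steps. A training loop would usually wait until `len(buffer)` reaches the batch size before calling `sample`.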
Detects idle, stuck, reading, confusion, drift, flow — and tunes when to speak up.
Whisper STT + local neural TTS (Kokoro). Talk it out; it talks back. With Kokoro, no audio leaves your machine.
Groq, Gemini, OpenAI, or Anthropic. Your keys, your bill. Free tiers cover most of the course.
Every chapter is a notebook with # TODO blocks you fill in, and a mirrored solution/ for when you're stuck. Progress is linear, but branches are welcome.
The vocabulary of RL. Agent, environment, reward, policy, value.
Why RL. How this course works. How to use the Companion.
The math behind decisions. Bellman, policies, value iteration.
Agents, environments, states, actions, rewards — and Gym.
The classics, from scratch. Value‑based, policy‑based, actor‑critic.
The value-based classic that cracked Atari. Experience replay, target nets.
REINFORCE and descendants. Directly optimize behavior.
Combine value and policy. Soft Actor-Critic, maximum-entropy RL.
The workhorse of modern RL. Clip, advantage, stability.
Beyond single‑agent, single‑task. Where RL gets interesting.
A map of the frontier. Where each branch leads and why.
Sparse rewards, intrinsic motivation, RND, ICM.
Cooperation, competition, CTDE, self-play.
Learning from logs. From demonstrations. BC, CQL.
Tree search meets deep nets. Self-play champions.
Where research lives today. LLM alignment, world models, meta‑learning.
How modern LLMs get aligned. Reward heuristics, PPO-for-text, GRPO.
RL as sequence modeling. Vision-language-action for robots.
Logging, eval, TensorBoard, reproducibility, pitfalls.
Plan inside a learned world. MBPO, PETS.
Imagination-based RL. Latent dynamics, actor-critic in a dream.
Agents that learn to learn. MAML, RL².
What you've built. Where to go next.
Tick the boxes you're comfortable with. You don't need all three — the more, the smoother.
The repo ships a Docker environment with notebooks, Python, and the Companion wired up. No local setup needed beyond Docker and git.
# Clone, then cd into the repo
git clone https://github.com/alessiodm/drl-zh.git && cd drl-zh
# (Linux/macOS) pin your UID/GID for the container
printf "UID=$(id -u)\nGID=$(id -g)\n" > .env
# Build and launch. Then open http://localhost:8080
docker compose up --build -d
GPU? Add -f docker-compose.gpu.yml to the docker compose command (file flags go before up). Prefer a manual Python setup? See the README.
“I gathered years of scattered code I'd written and shaped it into a coherent course. For this third edition, AI became a collaborator — helping polish the prose, sharpen the algorithms, and build the Companion itself.