A from-scratch PyTorch implementation of DreamerV3 (Hafner et al., 2023), a state-of-the-art model-based reinforcement learning algorithm. The agent learns a world model from pixel observations, then trains its policy entirely on trajectories imagined by that model.
Observation (64x64 RGB)
→ CNN Encoder (stride-2 convolutions, SiLU, LayerNorm)
→ Embedding (1024-dim)
→ RSSM World Model
├── Deterministic path: LayerNorm GRU (512-dim hidden state)
├── Stochastic path: 32 categorical distributions × 32 classes
├── Prior: h_t, action → h_{t+1}, z_{t+1}_prior
└── Posterior: h_{t+1}, embed → z_{t+1}_posterior
→ Latent state (512 + 1024 = 1536-dim)
├── CNN Decoder → Reconstructed image
├── Reward Head (4-layer MLP, symlog) → Predicted reward
├── Continue Head (4-layer MLP, binary) → Episode termination
├── Actor (4-layer MLP, tanh-squashed Normal) → Action distribution
└── Critic (4-layer MLP, EMA target) → State value
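The latent sizes in the diagram follow from simple arithmetic, sketched below. The `LatentDims` class and its field names are hypothetical and only illustrate how the 1536-dim feature vector is assembled; the repository may organize these values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentDims:
    """Hypothetical container for the sizes listed in the diagram."""
    deter: int = 512    # deterministic GRU hidden state h_t
    groups: int = 32    # number of categorical distributions
    classes: int = 32   # classes per categorical distribution

    @property
    def stoch(self) -> int:
        # Each categorical is a one-hot vector; all of them are flattened together.
        return self.groups * self.classes

    @property
    def feature(self) -> int:
        # The heads receive the concatenation of h_t and the flattened z_t.
        return self.deter + self.stoch


dims = LatentDims()
print(dims.stoch, dims.feature)  # 1024 1536
```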
- Categorical latents: 32 distributions with 32 classes each, sampled via straight-through Gumbel-Softmax
- Symlog transforms: Applied to observations and reward predictions for handling diverse value scales
- KL balancing: 80% forward KL (train prior to match posterior) + 20% reverse KL, with 1.0 free nats
- Straight-through actor gradients: Policy loss backpropagates through the imagined trajectory via rsample(), with percentile-based return normalization (5th/95th quantile)
- EMA target critic: Slow target network updated at rate 0.02
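Three of the items above are small enough to sketch without a deep-learning framework. The following is a plain-Python illustration, not the repository's code: symlog and its inverse, the 5th/95th percentile return scale, and the EMA target update. The function names, the linear-interpolation quantile, and the lower bound of 1.0 on the scale are assumptions made for illustration.

```python
import math


def symlog(x: float) -> float:
    """Compress large magnitudes symmetrically: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


def quantile(values, q: float) -> float:
    """Linear-interpolation quantile of a non-empty sequence, q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def return_scale(returns, low: float = 0.05, high: float = 0.95, floor: float = 1.0) -> float:
    """Spread between the 5th and 95th percentile of imagined returns.

    The floor (an assumption here) keeps small returns from being amplified.
    """
    return max(quantile(returns, high) - quantile(returns, low), floor)


def ema_update(target, online, rate: float = 0.02):
    """Move each target weight a fraction `rate` of the way toward the online weight."""
    return [(1.0 - rate) * t + rate * o for t, o in zip(target, online)]
```

Dividing returns by `return_scale(...)` before computing the policy loss keeps the gradient magnitude comparable across tasks with very different reward scales.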
| Component | Parameters |
|---|---|
| CNN Encoder | 4.2M |
| RSSM (GRU + prior/posterior MLPs) | 4.8M |
| CNN Decoder | 4.3M |
| Reward Head | 4.3M |
| Continue Head | 4.3M |
| Actor | 1.3M |
| Critic (online + target) | 2.0M |
| Total | 29.5M |
# Full training (500K env steps, ~10h on RTX-class GPU)
python -m scripts.train --device cuda --steps 500000
# Resume from checkpoint
python -m scripts.train --device cuda --resume checkpoints/dreamer/dreamer_step100000.pt

Training alternates between environment interaction and world model + actor-critic gradient updates. With train_every=2, one gradient step occurs every 2 environment steps (replay ratio 0.5).
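The schedule described above can be sketched as a simple counter. This is an illustration under stated assumptions, not the logic in scripts/train.py: the function name is hypothetical, and it assumes that the 5000-step prefill listed in the hyperparameters counts toward the step budget and that no updates happen during prefill.

```python
def gradient_steps(total_env_steps: int, prefill: int = 5000, train_every: int = 2) -> int:
    """Count gradient updates for a run of `total_env_steps` environment steps."""
    updates = 0
    for step in range(1, total_env_steps + 1):
        # Collect data only during prefill, then update once every `train_every` steps.
        if step > prefill and step % train_every == 0:
            updates += 1
    return updates


print(gradient_steps(500_000))  # 247500
```

Under these assumptions a 500K-step run performs 247,500 gradient updates, which matches the replay ratio of 1 / train_every = 0.5 once prefill is excluded.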
| Parameter | Value |
|---|---|
| Image size | 64x64 |
| Action repeat | 2 |
| Batch size | 16 sequences × 64 steps |
| Learning rate | 1e-4 (Adam) |
| Gradient clip | 100.0 |
| Imagination horizon | 15 steps |
| Discount | 0.997 |
| Replay capacity | 1M transitions |
| Prefill | 5000 random steps |
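A few quantities implied by this table are worth working out. The snippet below is plain arithmetic on the listed values; the variable names are illustrative and are not taken from configs/default.py.

```python
batch_sequences = 16
sequence_length = 64
discount = 0.997
imagination_horizon = 15

# Replayed timesteps consumed by one world-model update.
steps_per_batch = batch_sequences * sequence_length

# Rough planning horizon implied by the discount: 1 / (1 - gamma), about 333 steps.
effective_horizon = 1.0 / (1.0 - discount)

# Weight of a reward at the end of the imagined rollout, about 0.956.
# Most of the return signal therefore comes from the critic's bootstrap value.
weight_at_horizon = discount ** imagination_horizon

print(steps_per_batch, round(effective_horizon), round(weight_at_horizon, 3))
```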
| Metric | Value |
|---|---|
| Return (mean over 20 episodes) | 171.3 |
| Return (std) | 39.5 |
| Return (median) | 174.7 |
| Return (min / max) | 98.1 / 260.2 |
| Episode length | 500 (max) |
| Reconstruction SSIM | 0.685 |
| Reconstruction PSNR | 20.1 dB |
python -m scripts.evaluate \
--checkpoint checkpoints/dreamer/dreamer_best.pt \
--env walker-walk --suite dmcontrol \
--episodes 20 --device cuda --save-dir results/dreamer

Evaluation runs the learned policy (deterministic, using distribution mean) and reports:
- Episode returns (mean, std, median, min, max)
- Reconstruction quality (SSIM, PSNR) of the world model decoder
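The reported statistics can be reproduced from raw episode returns and pixel arrays along the following lines. This is a minimal standard-library sketch rather than the code in scripts/evaluate.py; the function names are hypothetical, the population standard deviation is an assumption (the script may use the sample version), and images are assumed to be flattened to sequences of floats in [0, 1].

```python
import math
import statistics


def summarize_returns(returns):
    """Summary statistics over a list of per-episode returns."""
    return {
        "mean": statistics.fmean(returns),
        "std": statistics.pstdev(returns),  # population std; an assumption
        "median": statistics.median(returns),
        "min": min(returns),
        "max": max(returns),
    }


def psnr(reference, reconstruction, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For intuition, a PSNR near 20 dB on pixels scaled to [0, 1] corresponds to a root-mean-square pixel error of about 0.1.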
Walker-Walk (DeepMind Control Suite): A planar biped must learn to walk forward. Reward is based on upright posture and forward velocity.
- Observation: 64x64 RGB pixels
- Action: 6D continuous (hip, knee, ankle joints for both legs)
- Max episode length: 500 agent steps (with action_repeat=2 → 1000 underlying environment steps)
- Reward range: [0, ~1000] per episode
| File | Description |
|---|---|
| models/rssm.py | RSSM world model with categorical latents and LayerNorm GRU |
| models/encoder_decoder.py | CNN encoder/decoder with symlog, MLP prediction heads |
| models/actor_critic.py | Actor (tanh-squashed Normal), Critic (EMA target), lambda returns |
| models/agent.py | Full DreamerV3 agent tying all components together |
| envs/wrappers.py | Unified wrapper for dm_control, robosuite, gymnasium |
| data/replay_buffer.py | Experience replay buffer with sequence sampling |
| scripts/train.py | Training loop with prefill, periodic eval, checkpointing |
| scripts/evaluate.py | Evaluation with reconstruction quality metrics |
| configs/default.py | Full hyperparameter configuration |
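The lambda returns mentioned for models/actor_critic.py can be sketched in plain Python as below. This follows the standard recursion R_t = r_t + γ c_t ((1 − λ) v_{t+1} + λ R_{t+1}), bootstrapped from the final value estimate. The function name, the argument layout, and λ = 0.95 are assumptions; the repository's version presumably operates on batched tensors.

```python
def lambda_returns(rewards, values, continues, discount: float = 0.997, lam: float = 0.95):
    """Compute lambda returns for one imagined trajectory.

    rewards[t] and continues[t] describe the transition out of step t (H entries each).
    values has H + 1 entries; values[H] bootstraps the final return.
    """
    horizon = len(rewards)
    if len(values) != horizon + 1 or len(continues) != horizon:
        raise ValueError("expected len(values) == len(rewards) + 1 == len(continues) + 1")
    returns = [0.0] * horizon
    next_return = values[-1]
    # Work backwards so each step can reuse the return of its successor.
    for t in reversed(range(horizon)):
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + discount * continues[t] * bootstrap
        returns[t] = next_return
    return returns
```

Setting `lam=0.0` reduces this to one-step TD targets, while `lam=1.0` gives the discounted Monte Carlo return plus the bootstrapped tail value.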
- Performance gap vs. published results: Our 29.5M-parameter model reaches a mean return of ~171 on walker-walk at 500K steps, while the published DreamerV3 (200M+ parameters, heavily tuned) reports ~900+. We attribute the gap primarily to smaller model capacity and shorter training; longer training (2M+ steps) and a larger model are expected to narrow it.
- Tanh-squashed Normal is used instead of the paper's truncated Normal distribution. The tanh squashing bounds actions to [-1, 1] but differs slightly in density at the boundaries.
- No block-diagonal GRU: The paper uses a block-diagonal GRU for efficiency; our implementation uses a standard (dense) GRU with LayerNorm, which plays the same role in the RSSM but is less parameter-efficient at larger hidden sizes.
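The density difference noted for the tanh-squashed Normal comes from the change-of-variables term, which the scalar sketch below makes explicit. It is an illustration in plain Python, not the code in models/actor_critic.py; the function names are hypothetical, and a real actor would apply this per action dimension and sum the results.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tanh_normal_log_prob(u: float, mean: float, std: float) -> float:
    """Log density of a = tanh(u), where u ~ Normal(mean, std)."""
    z = (u - mean) / std
    log_prob_u = -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)
    # log|da/du| = log(1 - tanh(u)^2), written in a form that stays finite for large |u|.
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return log_prob_u - log_det


def sample_action(mean: float, std: float, rng: random.Random):
    """Reparameterized sample: returns (action in (-1, 1), log probability)."""
    u = mean + std * rng.gauss(0.0, 1.0)
    return math.tanh(u), tanh_normal_log_prob(u, mean, std)


def deterministic_action(mean: float) -> float:
    """Evaluation-time action: squash the distribution mean."""
    return math.tanh(mean)


action, log_prob = sample_action(0.3, 0.5, random.Random(0))
```

As the pre-squash value `u` grows, `log_det` becomes very negative and the log probability rises sharply, which is the boundary behavior that a truncated Normal avoids.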
- Hafner, D. et al. "Mastering Diverse Domains through World Models" (DreamerV3, 2023)
- Hafner, D. et al. "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer, 2020)