DreamerV3 for Robotic Control

A from-scratch PyTorch implementation of DreamerV3 (Hafner et al., 2023), a state-of-the-art model-based reinforcement learning algorithm. The agent learns a world model from pixel observations, then trains its policy entirely on trajectories imagined by that world model.

Architecture

Observation (64x64 RGB)
    → CNN Encoder (stride-2 convolutions, SiLU, LayerNorm)
    → Embedding (1024-dim)
    → RSSM World Model
        ├── Deterministic path: LayerNorm GRU (512-dim hidden state)
        ├── Stochastic path: 32 categorical distributions × 32 classes
        ├── Prior:     h_t, z_t, action → h_{t+1}, z_{t+1}_prior
        └── Posterior:  h_{t+1}, embed → z_{t+1}_posterior
    → Latent state (512 + 1024 = 1536-dim)
        ├── CNN Decoder → Reconstructed image
        ├── Reward Head (4-layer MLP, symlog) → Predicted reward
        ├── Continue Head (4-layer MLP, binary) → Episode termination
        ├── Actor (4-layer MLP, tanh-squashed Normal) → Action distribution
        └── Critic (4-layer MLP, EMA target) → State value
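
As a rough illustration of how the latent state in the diagram is assembled, the sketch below samples 32 one-hot categoricals of 32 classes each and concatenates them with the 512-dim deterministic state. It is a minimal sketch, not the code in `models/rssm.py`: the function names and the temperature `tau` are assumptions, and `F.gumbel_softmax(..., hard=True)` is just one way to get one-hot samples with straight-through gradients.

```python
import torch
import torch.nn.functional as F

# Sizes taken from the architecture diagram.
NUM_DISTS, NUM_CLASSES, DETER_DIM = 32, 32, 512


def sample_stochastic(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sample one-hot latents with straight-through gradients (hypothetical helper).

    logits: (batch, NUM_DISTS, NUM_CLASSES).
    hard=True returns one-hot vectors in the forward pass while the backward
    pass uses the gradient of the soft Gumbel-Softmax sample.
    """
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
    return one_hot.flatten(start_dim=1)  # (batch, 32 * 32 = 1024)


def latent_features(h: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Concatenate deterministic and stochastic parts: (batch, 512 + 1024)."""
    return torch.cat([h, z], dim=-1)


batch = 4
h = torch.zeros(batch, DETER_DIM)
logits = torch.randn(batch, NUM_DISTS, NUM_CLASSES, requires_grad=True)
z = sample_stochastic(logits)
feat = latent_features(h, z)
print(feat.shape)  # torch.Size([4, 1536])
```

Every head in the diagram (decoder, reward, continue, actor, critic) would take a feature vector of this shape as input.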

Key Implementation Details

Categorical latents: 32 distributions with 32 classes each, sampled via straight-through Gumbel-Softmax
Symlog transforms: Applied to observations and reward predictions for handling diverse value scales
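KL balancing: The KL between posterior and prior is split into two weighted terms: 80% on the term that trains the prior to match the posterior and 20% on the term that regularizes the posterior toward the prior, with 1.0 free nats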
Straight-through actor gradients: The policy loss backpropagates through the imagined trajectory via reparameterized sampling (rsample()), with percentile-based return normalization (5th to 95th percentile range)
EMA target critic: Slow target network updated at rate 0.02
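
Two of the items above, symlog and KL balancing, are compact enough to sketch. The code below is a hedged illustration rather than the repository's implementation: the function names are hypothetical, the stop-gradient placement follows the usual KL-balancing recipe, and applying the free-nats clamp to the per-sample KL (summed over all 32 distributions) is an assumption.

```python
import torch
import torch.nn.functional as F


def symlog(x: torch.Tensor) -> torch.Tensor:
    """Compress large magnitudes symmetrically: sign(x) * ln(1 + |x|)."""
    return torch.sign(x) * torch.log1p(torch.abs(x))


def symexp(x: torch.Tensor) -> torch.Tensor:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return torch.sign(x) * torch.expm1(torch.abs(x))


def categorical_kl(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """KL(p || q) summed over all categoricals; inputs are (batch, dists, classes)."""
    p_logp = F.log_softmax(p_logits, dim=-1)
    q_logp = F.log_softmax(q_logits, dim=-1)
    return (p_logp.exp() * (p_logp - q_logp)).sum(dim=(-2, -1))


def balanced_kl_loss(post_logits, prior_logits, prior_weight=0.8, free_nats=1.0):
    """80/20 KL balancing with free nats (assumed clamp placement)."""
    # Posterior detached: gradients only reach the prior.
    train_prior = categorical_kl(post_logits.detach(), prior_logits)
    # Prior detached: gradients only reach the posterior.
    train_post = categorical_kl(post_logits, prior_logits.detach())
    # Below the free-nats threshold the term is constant, so it gives no gradient.
    train_prior = torch.clamp(train_prior, min=free_nats)
    train_post = torch.clamp(train_post, min=free_nats)
    return (prior_weight * train_prior + (1.0 - prior_weight) * train_post).mean()


torch.manual_seed(0)
post_logits = (5.0 * torch.randn(16, 32, 32)).requires_grad_()
prior_logits = torch.zeros(16, 32, 32, requires_grad=True)
loss = balanced_kl_loss(post_logits, prior_logits)
loss.backward()
```

With identical posterior and prior logits both KL terms are zero, so the loss sits at the free-nats floor of 1.0 and contributes no gradient.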

Parameters

Component	Parameters
CNN Encoder	4.2M
RSSM (GRU + prior/posterior MLPs)	4.8M
CNN Decoder	4.3M
Reward Head	4.3M
Continue Head	4.3M
Actor	1.3M
Critic (online + target)	2.0M
Total	29.5M

Training

# Full training (500K env steps, ~10h on an RTX-class GPU)
python -m scripts.train --device cuda --steps 500000

# Resume from checkpoint
python -m scripts.train --device cuda --resume checkpoints/dreamer/dreamer_step100000.pt

Training alternates between environment interaction and world model + actor-critic gradient updates. With train_every=2, one gradient step occurs every 2 environment steps (replay ratio 0.5).
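
The schedule described above can be sketched as a simple counter. This is an illustration only: it assumes the 5000 prefill steps from the configuration count toward the total step budget and that no gradient updates happen during prefill; `count_gradient_steps` is a hypothetical helper, not a function in `scripts/train.py`.

```python
def count_gradient_steps(total_env_steps: int, prefill: int, train_every: int) -> int:
    """Count gradient updates under the assumed schedule."""
    grad_steps = 0
    for env_step in range(1, total_env_steps + 1):
        # Here the real loop would act in the environment and store the
        # transition in the replay buffer.
        if env_step > prefill and env_step % train_every == 0:
            # Here the real loop would sample a batch and update the world
            # model, actor, and critic.
            grad_steps += 1
    return grad_steps


steps = count_gradient_steps(total_env_steps=500_000, prefill=5_000, train_every=2)
print(steps)  # 247500
```

Under these assumptions a 500K-step run performs 247,500 gradient updates, slightly under the nominal 0.5 updates per environment step because of the prefill phase.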

Training Configuration

Parameter	Value
Image size	64x64
Action repeat	2
Batch size	16 sequences × 64 steps
Learning rate	1e-4 (Adam)
Gradient clip	100.0
Imagination horizon	15 steps
Discount	0.997
Replay capacity	1M transitions
Prefill	5000 random steps
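
The imagination horizon and discount above feed into the lambda returns that the critic regresses toward. The sketch below shows one common formulation in plain Python. The value `lam=0.95` is an assumption (it is not listed in the table), as are the indexing convention and the function name; the actual code in `models/actor_critic.py` may differ.

```python
def lambda_returns(rewards, values, continues, discount=0.997, lam=0.95):
    """Compute lambda returns over an imagined trajectory (hypothetical helper).

    rewards[t] and continues[t] cover t = 0..H-1; values has H+1 entries so
    that values[H] bootstraps the final step. Recursion:
        R_t = r_t + discount * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1})
    with R_H = v_H.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[-1]
    for t in reversed(range(horizon)):
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + discount * continues[t] * bootstrap
        returns[t] = next_return
    return returns


# With lam=0 the recursion reduces to one-step TD targets r_t + discount * v_{t+1}.
print(lambda_returns([1.0, 2.0], [10.0, 20.0, 30.0], [1.0, 1.0], discount=0.5, lam=0.0))
# [11.0, 17.0]

# A 15-step imagined rollout, matching the horizon in the table.
horizon = 15
targets = lambda_returns([1.0] * horizon, [0.0] * (horizon + 1), [1.0] * horizon)
```

Setting `lam=1.0` turns the targets into plain discounted Monte Carlo returns with a bootstrap at the horizon, while smaller values lean more on the critic's own estimates.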

Results (walker-walk, 500K env steps)

Metric	Value
Return (mean over 20 episodes)	171.3
Return (std)	39.5
Return (median)	174.7
Return (min / max)	98.1 / 260.2
Episode length	500 (max)
Reconstruction SSIM	0.685
Reconstruction PSNR	20.1 dB

Evaluation

python -m scripts.evaluate \
    --checkpoint checkpoints/dreamer/dreamer_best.pt \
    --env walker-walk --suite dmcontrol \
    --episodes 20 --device cuda --save-dir results/dreamer

Evaluation runs the learned policy deterministically (using the action distribution's mean) and reports:

Episode returns (mean, std, median, min, max)
Reconstruction quality (SSIM, PSNR) of the world model decoder
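
For reference, the return statistics and PSNR can be computed as in the sketch below. It uses only the standard library and makes two assumptions not stated above: the standard deviation is the population form, and pixels are scaled to [0, 1]. The function names are hypothetical, and SSIM is left out because it needs a windowed implementation.

```python
import math
import statistics


def summarize_returns(returns):
    """Summary statistics over per-episode returns (population std assumed)."""
    return {
        "mean": statistics.fmean(returns),
        "std": statistics.pstdev(returns),
        "median": statistics.median(returns),
        "min": min(returns),
        "max": max(returns),
    }


def psnr(original, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two flat pixel sequences."""
    squared_errors = [(a - b) ** 2 for a, b in zip(original, reconstruction)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


stats = summarize_returns([100.0, 200.0, 300.0])
quality = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])  # about 20 dB
```

A uniform pixel error of 0.1 on a [0, 1] scale corresponds to roughly 20 dB, which gives a feel for the PSNR figure in the results table.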

Environment

Walker-Walk (DeepMind Control Suite): A planar biped must learn to walk forward. Reward is based on upright posture and forward velocity.

Observation: 64x64 RGB pixels
Action: 6D continuous (hip, knee, ankle joints for both legs)
Max episode length: 500 agent steps (with action_repeat=2 → 1000 underlying environment steps)
Return range: [0, 1000] per episode (per-step reward in [0, 1])
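
The relationship between agent steps and underlying environment steps can be illustrated with a toy action-repeat wrapper. Both classes below are hypothetical stand-ins: `CountingEnv` is not the dm_control API, and `ActionRepeat` is a sketch of the idea rather than the wrapper in `envs/wrappers.py`.

```python
class CountingEnv:
    """Toy environment: reward 1.0 per step, episode ends after `limit` steps."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.limit
        return self.t, 1.0, done


class ActionRepeat:
    """Repeat each action `repeat` times and sum the rewards."""

    def __init__(self, env, repeat=2):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break  # do not step past the end of the episode
        return obs, total_reward, done


env = ActionRepeat(CountingEnv(limit=1000), repeat=2)
env.reset()
agent_steps, episode_return, done = 0, 0.0, False
while not done:
    _, reward, done = env.step(action=None)
    agent_steps += 1
    episode_return += reward
print(agent_steps, episode_return)  # 500 1000.0
```

Because rewards are summed across repeats, the maximum episode return stays at 1000 even though the agent only acts 500 times.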

Files

File	Description
`models/rssm.py`	RSSM world model with categorical latents and LayerNorm GRU
`models/encoder_decoder.py`	CNN encoder/decoder with symlog, MLP prediction heads
`models/actor_critic.py`	Actor (tanh-squashed Normal), Critic (EMA target), lambda returns
`models/agent.py`	Full DreamerV3 agent tying all components together
`envs/wrappers.py`	Unified wrapper for dm_control, robosuite, gymnasium
`data/replay_buffer.py`	Experience replay buffer with sequence sampling
`scripts/train.py`	Training loop with prefill, periodic eval, checkpointing
`scripts/evaluate.py`	Evaluation with reconstruction quality metrics
`configs/default.py`	Full hyperparameter configuration

Known Limitations

Performance gap vs published results: Our 29.5M model achieves ~171 return on walker-walk at 500K steps, while the published DreamerV3 (200M+ parameters, heavily tuned) reports ~900+. We attribute the gap primarily to smaller model capacity and shorter training; longer training (2M+ steps) and a larger model should narrow it.
Action distribution: A tanh-squashed Normal is used instead of the paper's truncated Normal distribution. The tanh squashing bounds actions to [-1, 1], but its density differs slightly from the truncated Normal's near the boundaries.
No block-diagonal GRU: The paper uses a block-diagonal GRU for efficiency; our implementation uses a standard (dense) GRU with LayerNorm, which fills the same role but needs more parameters and compute at larger hidden sizes.
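
To make the action-distribution note concrete, a tanh-squashed Normal can be built from standard `torch.distributions` components as sketched below. This is an illustrative construction, not necessarily how `models/actor_critic.py` does it; the helper name, the batch size, and using `tanh(mean)` as the deterministic evaluation action are assumptions.

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform


def tanh_normal(mean: torch.Tensor, std: torch.Tensor) -> TransformedDistribution:
    """Normal distribution squashed through tanh so samples lie in [-1, 1]."""
    # cache_size=1 lets log_prob reuse the pre-tanh sample instead of
    # inverting tanh numerically, which is unstable near the boundaries.
    return TransformedDistribution(Normal(mean, std), [TanhTransform(cache_size=1)])


torch.manual_seed(0)
mean = torch.zeros(5, 6, requires_grad=True)  # 6-D actions, as in walker-walk
std = torch.ones(5, 6)
dist = tanh_normal(mean, std)

action = dist.rsample()                   # reparameterized, so gradients reach `mean`
log_prob = dist.log_prob(action).sum(-1)  # sum over action dimensions
eval_action = torch.tanh(mean)            # assumed deterministic action for evaluation
```

The `log_prob` here already includes the change-of-variables correction for the tanh, which is where the density differs from a truncated Normal near the action bounds.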

References

Hafner, D. et al. "Mastering Diverse Domains through World Models" (DreamerV3, 2023)
Hafner, D. et al. "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer, 2020)
