A-SHOJAEI/dreamerv3-robotic-control

DreamerV3 for Robotic Control

A from-scratch PyTorch implementation of DreamerV3 (Hafner et al., 2023), a state-of-the-art model-based reinforcement learning algorithm. The agent learns a world model from pixel observations, then trains its actor and critic entirely on trajectories imagined by that model.

Architecture

Observation (64x64 RGB)
    → CNN Encoder (stride-2 convolutions, SiLU, LayerNorm)
    → Embedding (1024-dim)
    → RSSM World Model
        ├── Deterministic path: LayerNorm GRU (512-dim hidden state)
        ├── Stochastic path: 32 categorical distributions × 32 classes
        ├── Prior:     h_t, action → h_{t+1}, z_{t+1}_prior
        └── Posterior:  h_{t+1}, embed → z_{t+1}_posterior
    → Latent state (512 + 1024 = 1536-dim)
        ├── CNN Decoder → Reconstructed image
        ├── Reward Head (4-layer MLP, symlog) → Predicted reward
        ├── Continue Head (4-layer MLP, binary) → Episode termination
        ├── Actor (4-layer MLP, tanh-squashed Normal) → Action distribution
        └── Critic (4-layer MLP, EMA target) → State value
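
The latent sizes in the diagram follow from the deterministic and stochastic dimensions. A minimal sketch of that arithmetic, using hypothetical names that are not taken from the repository:

```python
# Sizes quoted in the architecture diagram.
DETER_DIM = 512        # LayerNorm GRU hidden state h_t
NUM_CATEGORICALS = 32  # independent categorical distributions
NUM_CLASSES = 32       # classes per categorical


def latent_dims(deter=DETER_DIM, groups=NUM_CATEGORICALS, classes=NUM_CLASSES):
    """Return (stochastic_dim, full_latent_dim) for one-hot categorical latents."""
    stoch = groups * classes  # each categorical is flattened as a one-hot vector
    return stoch, deter + stoch


stoch_dim, latent_dim = latent_dims()
print(stoch_dim, latent_dim)  # 1024 1536
```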

Key Implementation Details

  • Categorical latents: 32 distributions with 32 classes each, sampled via straight-through Gumbel-Softmax
  • Symlog transforms: Applied to observations and reward predictions for handling diverse value scales
  • KL balancing: 80% weight on the KL term that trains the prior toward the posterior, 20% on the term that regularizes the posterior toward the prior, with 1.0 free nats
  • Straight-through (reparameterized) actor gradients: The policy loss backpropagates through the imagined trajectory via rsample(), with returns normalized by the spread between their 5th and 95th percentiles
  • EMA target critic: Slow target network updated at rate 0.02
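
Three of the details above reduce to short formulas. The sketch below shows symlog and its inverse, a percentile-based return scale, and an EMA target update in plain Python rather than PyTorch. All function names are hypothetical, and the floor of 1.0 on the return scale follows the DreamerV3 paper; whether this repository applies the same floor is an assumption.

```python
import math


def symlog(x):
    """Compress large magnitudes symmetrically: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


def percentile(values, q):
    """Linearly interpolated q-th percentile, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def return_scale(returns, low_q=5.0, high_q=95.0, floor=1.0):
    """Scale = max(floor, P95 - P5); returns are divided by this value.

    The floor (assumed here, taken from the paper) avoids amplifying
    small returns when the percentile spread is tiny.
    """
    spread = percentile(returns, high_q) - percentile(returns, low_q)
    return max(floor, spread)


def ema_update(target, online, rate=0.02):
    """Move each target weight a fraction `rate` toward the online weight."""
    return [(1.0 - rate) * t + rate * o for t, o in zip(target, online)]


returns = [float(i) for i in range(101)]
scale = return_scale(returns)
normalized = [r / scale for r in returns]
```

With rate 0.02, each update keeps 98% of the previous target weights, so the target critic lags the online critic by roughly 50 updates.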

Parameters

Component                           Parameters
CNN Encoder                         4.2M
RSSM (GRU + prior/posterior MLPs)   4.8M
CNN Decoder                         4.3M
Reward Head                         4.3M
Continue Head                       4.3M
Actor                               1.3M
Critic (online + target)            2.0M
Total                               29.5M

Training

# Full training (500K env steps, ~10h on RTX-class GPU)
python -m scripts.train --device cuda --steps 500000

# Resume from checkpoint
python -m scripts.train --device cuda --resume checkpoints/dreamer/dreamer_step100000.pt

Training alternates between environment interaction and gradient updates to the world model and the actor-critic. With train_every=2, one gradient step occurs every 2 environment steps (replay ratio 0.5).
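
To make the schedule concrete, the sketch below counts gradient updates for a full run. It assumes that the prefill steps count toward the total and that no updates run during prefill; neither is confirmed by this README, and the function name is hypothetical.

```python
def gradient_steps(total_env_steps, prefill=5000, train_every=2):
    """Count gradient updates when one update runs every `train_every`
    environment steps once the prefill phase is over (assumed schedule)."""
    updates = 0
    for step in range(1, total_env_steps + 1):
        if step > prefill and step % train_every == 0:
            updates += 1
    return updates


updates = gradient_steps(500_000)
print(updates, updates / 500_000)  # 247500 0.495
```

Under these assumptions the effective ratio is slightly below 0.5 because the prefill phase contributes environment steps but no updates.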

Training Configuration

Parameter             Value
Image size            64x64
Action repeat         2
Batch size            16 sequences × 64 steps
Learning rate         1e-4 (Adam)
Gradient clip         100.0
Imagination horizon   15 steps
Discount              0.997
Replay capacity       1M transitions
Prefill               5000 random steps
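
The imagination horizon and discount feed the lambda-return targets used to train the critic. A minimal sketch of the standard recursion in plain Python, assuming a lambda of 0.95 (the DreamerV3 paper's default, not stated in this README); the function and argument names are hypothetical:

```python
def lambda_returns(rewards, values, continues, discount=0.997, lam=0.95):
    """Compute lambda returns over an imagined rollout.

    rewards[t] and continues[t] cover t = 0..H-1; values[t] covers
    t = 0..H, so values[H] is the bootstrap value at the horizon.
    Recursion: R_t = r_t + discount * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1}).
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]  # bootstrap from the critic at the horizon
    for t in reversed(range(horizon)):
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + discount * continues[t] * blended
        returns[t] = next_return
    return returns


# Horizon of 15 imagined steps with a constant reward and a zero critic.
H = 15
targets = lambda_returns([1.0] * H, [0.0] * (H + 1), [1.0] * H)
```

Setting lam=0 recovers one-step TD targets, while lam=1 gives the discounted Monte Carlo return bootstrapped at the horizon.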

Results (walker-walk, 500K env steps)

Metric                           Value
Return (mean over 20 episodes)   171.3
Return (std)                     39.5
Return (median)                  174.7
Return (min / max)               98.1 / 260.2
Episode length                   500 (max)
Reconstruction SSIM              0.685
Reconstruction PSNR              20.1 dB

Evaluation

python -m scripts.evaluate \
    --checkpoint checkpoints/dreamer/dreamer_best.pt \
    --env walker-walk --suite dmcontrol \
    --episodes 20 --device cuda --save-dir results/dreamer

Evaluation runs the learned policy (deterministic, using distribution mean) and reports:

  • Episode returns (mean, std, median, min, max)
  • Reconstruction quality (SSIM, PSNR) of the world model decoder
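
The reported statistics can be reproduced from raw per-episode returns and pixel arrays. The sketch below is an illustration in plain Python, not the code in scripts/evaluate.py: the function names are hypothetical, and both the use of the population standard deviation and a pixel range of [0, 1] are assumptions.

```python
import math
import statistics


def summarize_returns(returns):
    """Summary statistics over per-episode returns."""
    return {
        "mean": statistics.fmean(returns),
        "std": statistics.pstdev(returns),  # population std (assumed convention)
        "median": statistics.median(returns),
        "min": min(returns),
        "max": max(returns),
    }


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


summary = summarize_returns([1.0, 2.0, 3.0])
quality = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Under this definition, and if pixels are scaled to [0, 1], a PSNR near 20 dB corresponds to a mean squared error of roughly 0.01.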

Environment

Walker-Walk (DeepMind Control Suite): A planar biped must learn to walk forward. Reward is based on upright posture and forward velocity.

  • Observation: 64x64 RGB pixels
  • Action: 6D continuous (hip, knee, ankle joints for both legs)
  • Max episode length: 500 agent steps (with action_repeat=2 → 1000 underlying environment steps)
  • Return range: [0, 1000] per episode (per-step reward in [0, 1])
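
The relationship between agent steps and underlying environment steps comes from the action-repeat wrapper. The sketch below uses a toy stand-in environment and a hypothetical ActionRepeat class; it is not the wrapper in envs/wrappers.py, and the (observation, reward, done) step interface is an assumption.

```python
class ToyEnv:
    """Stand-in environment: reward 1.0 per step, time limit of 1000 steps."""

    def __init__(self, time_limit=1000):
        self.time_limit = time_limit
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # placeholder observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.time_limit
        return self.t, 1.0, done


class ActionRepeat:
    """Apply each action `repeat` times and sum the rewards."""

    def __init__(self, env, repeat=2):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break  # stop repeating once the episode has ended
        return obs, total_reward, done


def run_episode(env):
    """Roll out one episode and return (agent_steps, episode_return)."""
    env.reset()
    steps, episode_return, done = 0, 0.0, False
    while not done:
        _, reward, done = env.step(None)  # the toy environment ignores the action
        steps += 1
        episode_return += reward
    return steps, episode_return


print(run_episode(ActionRepeat(ToyEnv(), repeat=2)))  # (500, 1000.0)
```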

Files

File                        Description
models/rssm.py              RSSM world model with categorical latents and LayerNorm GRU
models/encoder_decoder.py   CNN encoder/decoder with symlog, MLP prediction heads
models/actor_critic.py      Actor (tanh-squashed Normal), Critic (EMA target), lambda returns
models/agent.py             Full DreamerV3 agent tying all components together
envs/wrappers.py            Unified wrapper for dm_control, robosuite, gymnasium
data/replay_buffer.py       Experience replay buffer with sequence sampling
scripts/train.py            Training loop with prefill, periodic eval, checkpointing
scripts/evaluate.py         Evaluation with reconstruction quality metrics
configs/default.py          Full hyperparameter configuration

Known Limitations

  • Performance gap vs published results: Our 29.5M-parameter model achieves a mean return of about 171 on walker-walk at 500K steps, while the published DreamerV3 (200M+ parameters, heavily tuned) reports 900+. The gap is likely due mainly to smaller model capacity and shorter training; longer training (2M+ steps) and a larger model may narrow it.
  • Tanh-squashed Normal is used instead of the paper's truncated Normal distribution. The tanh squashing bounds actions to [-1, 1] but differs slightly in density at the boundaries.
  • No block-diagonal GRU: The paper uses a block-diagonal GRU for efficiency; our implementation uses a standard GRU with LayerNorm, which fills the same role but is less parameter-efficient at larger hidden sizes.
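
For the tanh-squashed Normal mentioned above, the density correction near the action bounds comes from the change-of-variables term. The following is a one-dimensional sketch in plain Python with hypothetical function names; the repository's actor presumably does this with PyTorch distributions, which is not shown here.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def normal_log_prob(u, mean, std):
    """Log-density of Normal(mean, std) at u."""
    z = (u - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def squashed_log_prob(u, mean, std):
    """Log-density of a = tanh(u) where u ~ Normal(mean, std).

    Uses log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)), which stays
    finite even when tanh(u) saturates near -1 or 1.
    """
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return normal_log_prob(u, mean, std) - log_det


def sample_action(mean, std, rng=random):
    """Draw a pre-squash sample, squash it, and return (action, log_prob)."""
    u = rng.gauss(mean, std)
    return math.tanh(u), squashed_log_prob(u, mean, std)


action, log_prob = sample_action(0.0, 1.0, random.Random(0))
```

For the 6D walker action, the per-dimension log-probabilities would be summed, assuming independent dimensions.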

References

  • Hafner, D. et al. "Mastering Diverse Domains through World Models" (DreamerV3, 2023)
  • Hafner, D. et al. "Dream to Control: Learning Behaviors by Latent Imagination" (Dreamer, 2020)
