Add Docker virtual environments for reproducible development #14

Open
supmo668 wants to merge 3 commits into WooooDyy:main from supmo668:feat/docker-virtual-environments

supmo668 commented Dec 2, 2025

Overview

This PR introduces Docker-based development environments to solve the reproducibility problem in AgentGym-RL. Currently, setting up the training environment requires careful manual configuration of CUDA, PyTorch, flash-attention, and other dependencies, a process that is error-prone and time-consuming.

With this change, contributors can get a working environment with:

make docker-build-train
make docker-train-shell

What's Changed

1. Docker Infrastructure

Four purpose-built images. The train and scripts images layer on a shared base; the eval image is built separately:

| Image | Purpose | Base |
| --- | --- | --- |
| agentgym-rl/base | CUDA 12.4 + PyTorch 2.4 + flash-attn | nvidia/cuda |
| agentgym-rl/train | Full RL training with verl | base |
| agentgym-rl/scripts | Model merging utilities | base |
| agentgym/eval | Lightweight evaluation | python:3.10-slim |

2. Service Orchestration

docker-compose.yml with profile-based services:

  • train: GPU-enabled training container
  • env: Environment servers (searchqa, babyai, etc.)
  • eval: Evaluation against environment servers

3. Developer Tooling

  • Makefile: Simple commands (make docker-build, make docker-train-shell)
  • .env.example: Template for API keys and configuration
  • .dockerignore: Keeps builds fast by excluding large files

4. Documentation

DOCKER.md covers quick start, common workflows, and troubleshooting.

Design Decisions

Why separate images? The base image with CUDA/PyTorch is large (~15GB). By layering, we can rebuild train/scripts quickly when only code changes.
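
As a rough sketch of the rebuild flow this layering allows: the image names come from the table above and the Dockerfile paths from the commit message, but the exact build arguments (build context `.`, untagged image names) are assumptions, and the real Makefile targets may differ. The `run` wrapper is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build the heavy base image once (CUDA, PyTorch, flash-attn).
run docker build -f docker/base.Dockerfile -t agentgym-rl/base .

# After a code-only change, rebuild just the images layered on top of base.
run docker build -f docker/train.Dockerfile -t agentgym-rl/train .
run docker build -f docker/scripts.Dockerfile -t agentgym-rl/scripts .
```

Because the base image is left untouched on the second pass, only the much smaller train and scripts builds need to be repeated.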

Why profiles? Not everyone needs all services. docker compose --profile train up starts only what's needed.
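
To illustrate how profiles can be combined, here is a small sketch that only prints the compose command for a given set of profiles. The profile names (train, env, eval) are the ones listed in this PR; `compose_up_cmd` is a hypothetical helper, and running detached with `-d` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print (rather than run) the compose command
# that starts the services belonging to the given profiles.
compose_up_cmd() {
  cmd="docker compose"
  for profile in "$@"; do
    cmd="$cmd --profile $profile"
  done
  printf '%s up -d\n' "$cmd"
}

compose_up_cmd train      # docker compose --profile train up -d
compose_up_cmd env eval   # docker compose --profile env --profile eval up -d
```

Docker Compose also reads a comma-separated `COMPOSE_PROFILES` environment variable, which may be more convenient than repeating `--profile`.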

Why volume mounts for models? Baking large model files into images would make them huge and slow to transfer. Mounts are more flexible.
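
A minimal sketch of what such a mount might look like when starting the training image by hand. The host directory, the `/models` mount point, the read-only flag, and the `bash` entry command are all assumptions for illustration; the compose file in this PR may mount different paths. `train_shell_cmd` is a hypothetical helper that prints the command instead of running it.

```shell
#!/bin/sh
# Assumed host directory holding model weights; override via MODEL_DIR.
MODEL_DIR=${MODEL_DIR:-$PWD/models}

# Hypothetical helper: print a docker run command that bind-mounts the
# model directory read-only at /models (an assumed mount point).
train_shell_cmd() {
  printf 'docker run --rm -it --gpus all -v %s:/models:ro agentgym-rl/train bash\n' "$1"
}

train_shell_cmd "$MODEL_DIR"
```

Swapping models then only requires pointing `MODEL_DIR` at a different directory, with no image rebuild.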

Testing

$ make test-docker
Docker version 28.5.1
Docker Compose version v2.40.3
docker-compose.yml: OK
base.Dockerfile: OK
train.Dockerfile: OK
scripts.Dockerfile: OK
Dockerfile.eval: OK
.dockerignore: OK
.env.example: OK
All tests PASSED

$ make docker-build-eval  # Built successfully
$ docker run --rm agentgym/eval:latest python -c "import agentenv"  # Works
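
For readers curious what a build-free check of this kind could look like, the sketch below reports each expected file as OK or MISSING. It is not the PR's actual `test-docker` target, whose implementation is not shown here; `check_files` is a hypothetical helper, and the demo runs against a throwaway directory rather than the repository.

```shell
#!/bin/sh
# Hypothetical helper: check_files ROOT FILE...
# Prints "<name>: OK" or "<name>: MISSING" for each file under ROOT and
# returns non-zero if any file is missing.
check_files() {
  root=$1
  shift
  status=0
  for f in "$@"; do
    if [ -f "$root/$f" ]; then
      printf '%s: OK\n' "$(basename "$f")"
    else
      printf '%s: MISSING\n' "$(basename "$f")"
      status=1
    fi
  done
  return "$status"
}

# Demo against a temporary directory that lacks .env.example.
demo=$(mktemp -d)
mkdir -p "$demo/docker"
touch "$demo/docker-compose.yml" "$demo/docker/base.Dockerfile"
check_files "$demo" docker-compose.yml docker/base.Dockerfile .env.example \
  || echo "Some checks FAILED"
rm -rf "$demo"
```

A fuller check would likely also run `docker compose config` to validate the compose file's syntax, which requires Docker to be installed.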

Commits

  1. Add Docker infrastructure - Dockerfiles and compose configuration
  2. Add developer tooling - Makefile, .env.example, .dockerignore
  3. Add documentation - DOCKER.md usage guide

Checklist

  • Dockerfiles build successfully
  • Compose configuration validates
  • Makefile commands work
  • Documentation is clear and complete
  • No breaking changes to existing workflows

supmo668 force-pushed the feat/docker-virtual-environments branch from 7ba91ac to 9542a57 on December 2, 2025 08:12

supmo668 commented Dec 2, 2025

Test Results

All validation tests pass. The Docker setup is safe for running alongside existing agent environments.

make test-docker Output

Docker version 28.5.1, build e180ab8
Docker Compose version v2.40.3-desktop.1
docker-compose.yml: OK
base.Dockerfile: OK
train.Dockerfile: OK  
scripts.Dockerfile: OK
Dockerfile.eval: OK
.dockerignore: OK
.env.example: OK
All tests PASSED

Eval Image Test

$ docker run --rm agentgym/eval:latest python -c "import agentenv; print('OK')"
OK

make docker-status Output

NAMES     STATUS
REPOSITORY          TAG       IMAGE ID       CREATED        SIZE
agentgym/eval       latest    1a1f846ded51   5 hours ago    12.7GB

Safety Notes

  • All make targets tested and working
  • No port conflicts with existing services
  • docker-status shows current container/image state
  • docker-down safely removes only AgentGym containers

Introduce containerized development environments that ensure
consistent setup across machines. This eliminates "works on
my machine" issues and simplifies onboarding.

What's included:
- docker/base.Dockerfile: CUDA 12.4 + PyTorch 2.4 + flash-attn
- docker/train.Dockerfile: Full RL training environment
- docker/scripts.Dockerfile: Model merging utilities
- Dockerfile.eval: Lightweight evaluation runner
- docker-compose.yml: Service orchestration with profiles

Key design decisions:
- Multi-stage builds reduce final image size
- Profile-based services (train/eval/env) for flexibility
- GPU resources allocated via nvidia runtime
- Volume mounts for models/checkpoints (not baked into image)

Provide convenient commands and configuration to streamline
the Docker-based development experience.

Makefile targets:
- docker-build-*: Build individual or all images
- docker-train-shell: Interactive training environment
- docker-status: Quick health check of containers/images
- test-docker: Validate setup without building

.env.example:
- Template for required environment variables
- Includes API keys, ports, model settings

.dockerignore:
- Excludes checkpoints, caches, and large data
- Keeps build context small for faster builds

Comprehensive guide for using the Docker infrastructure,
written for both new contributors and experienced users.

Contents:
- Quick start (build → run → develop)
- Image descriptions and when to use each
- Common workflows (training, model merging, evaluation)
- Environment variable reference
- Troubleshooting common issues
- CI/CD integration patterns

supmo668 force-pushed the feat/docker-virtual-environments branch from 9542a57 to 514f2bc on December 2, 2025 08:17
supmo668 changed the title from "feat: Docker virtual environments for reproducible training and scripts" to "Add Docker virtual environments for reproducible development" on Dec 2, 2025