This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
RubyLLM::Agents is a Rails engine gem for building, managing, and monitoring LLM-powered AI agents. It provides a DSL for agent configuration, a middleware pipeline for execution, automatic tracking with cost analytics, and a mountable dashboard UI.
Requirements: Ruby >= 3.1, Rails >= 7.0, RubyLLM >= 1.16.0
bundle exec rspec # Run full test suite (~3700+ specs)
bundle exec rspec spec/agents/routing_spec.rb # Run a single spec file
bundle exec rspec spec/agents/ -e "parses" # Run specs matching description
bundle exec standardrb # Lint (StandardRB, targets Ruby 3.1)
bundle exec standardrb --fix # Auto-fix lint issues
bundle exec rake # Run both specs and linter (default task)
RUN_INTEGRATION=1 bundle exec rspec # Include integration tests (skipped by default)
Pre-commit hook: A git pre-commit hook runs standardrb --no-fix and blocks commits on lint failures. Fix with bundle exec standardrb --fix before committing.
CI: Runs lint on Ruby 3.4, tests on Ruby 3.2/3.3/3.4.
From John Ousterhout's A Philosophy of Software Design: The best modules are deep — they hide significant complexity behind a simple interface.
Module Depth = (Complexity Hidden) / (Interface Exposed)
A deep module has a small public API that hides substantial logic. A shallow module exposes an interface almost as complex as its implementation. Do not create shallow modules.
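As a rough illustration of depth, the sketch below exposes a single method while hiding the attempt counting and backoff arithmetic. The module name `Retryable`, its options, and the injectable `sleeper` are hypothetical and are not the gem's actual Reliability code.

```ruby
# Hypothetical illustration only; not the gem's Reliability implementation.
module Retryable
  # Public interface: one method with a few options.
  # `sleeper` is injectable so callers (and tests) can avoid real sleeping.
  def self.with_retries(max: 3, base_delay: 0.1, sleeper: ->(seconds) { sleep(seconds) })
    attempt = 0
    begin
      attempt += 1
      yield attempt
    rescue StandardError
      # Give up once the retry budget is spent; otherwise back off and retry.
      raise if attempt > max
      sleeper.call(backoff_delay(attempt, base_delay))
      retry
    end
  end

  # Hidden detail: exponential backoff, doubling the delay on each attempt.
  def self.backoff_delay(attempt, base_delay)
    base_delay * (2**(attempt - 1))
  end
  private_class_method :backoff_delay
end
```

Callers only see `Retryable.with_retries { ... }`; the backoff formula can change without touching any call site, which is the property the depth ratio is meant to capture.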
Before creating a new class, module, concern, or middleware, answer:
- What complexity does it hide? If "not much" or "it just delegates," don't create it.
- Is the interface simpler than the implementation? If the public API has as many concepts as the internals, it's shallow.
- Does it have a reason to change independently? If it always changes in lockstep with another module, merge them.
- Can I name it with a specific noun/verb? Vague names (Manager, Handler, Processor, Utils) usually signal shallow design.
| Layer | Deep | Shallow |
|---|---|---|
| Middleware | Reliability: hides retries, exponential backoff, fallback model switching, circuit breaker state machine | A middleware that just logs and delegates |
| DSL Module | Dsl::Reliability: simple on_failure { retries 3 } hides complex config normalization and validation | A DSL module that wraps a single attr_accessor |
| Concern | Execution::Analytics: hides aggregate queries, trend calculations, grouping logic | Execution::Timestamps: just reformats created_at |
| Infrastructure | BudgetTracker: facade hiding query/record/forecast/alert subsystems | A class that just calls update_column |
| Pipeline::Context | Explicit data carrier: all middleware reads/writes named attributes | Would be shallow if it were just a hash wrapper |
| Controller | Intentionally thin: delegates to models. Controllers should be shallow in this gem | |
| Job | ExecutionLoggerJob: thin wrapper calling deep model logic (correct; jobs are infrastructure glue) | |
Pass-through methods:
# BAD: Wraps a call without hiding anything
class AgentExecutor
  def run(agent, params)
    agent.call(**params)
  end
end
# Just call agent.call(**params) directly.
Premature extraction:
# BAD: Concern used by only one class, hides nothing meaningful
module Pipeline::Middleware::Logging
  def log_start = Rails.logger.info("Starting")
end
# Inline it. Extract when a second middleware needs it.
Needless indirection:
# BAD: Repository pattern over ActiveRecord
class ExecutionRepository
  def find(id) = Execution.find(id)
  def all = Execution.all
end
# ActiveRecord IS the repository. This adds a layer that hides nothing.
Config objects for simple cases:
# BAD: Over-engineered wrapper
class RetryConfig
  attr_reader :max, :backoff

  def initialize(max: 3, backoff: :linear)
    @max = max
    @backoff = backoff
  end
end
# Use keyword arguments or constants until complexity warrants it.
RubyLLM::Agents::BaseAgent # Core: middleware pipeline, DSL, execute
└── RubyLLM::Agents::Base # Adds before_call/after_call callbacks
    └── ApplicationAgent # User's base class in host app
        └── ConcreteAgent # User's agent
Specialized types (Embedder, Speaker, Transcriber, image agents) also inherit from BaseAgent but have their own execution logic.
Agent execution flows through a middleware stack assembled by Pipeline::Builder:
- Tenant → resolves tenant context
- Budget → checks spending limits, records costs after
- Cache → returns cached result or stores new one
- Instrumentation → logs execution to DB via ExecutionLoggerJob
- Reliability → retries, fallback models, circuit breakers
- Core executor → calls the agent's execute method (builds messages, calls LLM)
Pipeline::Context is the data carrier that flows through each middleware layer.
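The layering described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the gem's Pipeline::Builder: the `PipelineSketch` module, its `Timing` and `Cache` classes, and the `trace` attribute are all hypothetical names chosen to show how each layer wraps the next and how a cache hit short-circuits the inner layers.

```ruby
# Hypothetical stand-ins; the gem's Pipeline::Builder and Context are richer.
module PipelineSketch
  # Explicit data carrier with named attributes.
  Context = Struct.new(:input, :output, :trace, keyword_init: true)

  class Middleware
    def initialize(app)
      @app = app
    end
  end

  class Timing < Middleware
    def call(context)
      context.trace << "timing:pre"   # pre-phase: runs before inner layers
      @app.call(context)
      context.trace << "timing:post"  # post-phase: runs after inner layers
      context
    end
  end

  class Cache < Middleware
    STORE = {} # deliberately mutable: a toy in-process cache

    def call(context)
      context.trace << "cache:pre"
      if STORE.key?(context.input)
        context.output = STORE[context.input] # hit: skip the inner layers
        return context
      end
      @app.call(context)
      STORE[context.input] = context.output   # miss: store the fresh result
      context
    end
  end

  # Innermost layer, standing in for the agent's execute step.
  CORE = lambda do |context|
    context.trace << "core"
    context.output = context.input.upcase
    context
  end

  # Wrap the core from the inside out so the first class is the outermost layer.
  def self.build(middleware_classes, core = CORE)
    middleware_classes.reverse.inject(core) { |app, klass| klass.new(app) }
  end
end
```

Calling `PipelineSketch.build([PipelineSketch::Timing, PipelineSketch::Cache])` yields an object whose `call` runs Timing first, then Cache, then the core; a second call with the same input never reaches the core.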
Two layers — both can be mixed:
Declarative (recommended): model, temperature, system, user/prompt (with {placeholder} auto-registering params), assistant (prefill), returns (structured output), cache for:, on_failure { retries/fallback/circuit_breaker }, tools, streaming, before/after callbacks, aliases (previous class names for rename tracking).
Method overrides: system_prompt, user_prompt, process_response(response), messages, schema, metadata.
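To make the placeholder behaviour concrete, here is a small sketch of how a class-level `user` declaration could register params from `{placeholder}` tokens. It is an assumption-laden illustration: `PromptDsl` and `GreeterSketch` are hypothetical, and the gem's real DSL may parse and validate differently.

```ruby
# Hypothetical mini-DSL; not the gem's Dsl::Base implementation.
module PromptDsl
  PLACEHOLDER = /\{(\w+)\}/

  # With an argument: store the template and register its placeholders.
  # Without an argument: return the stored template.
  def user(template = nil)
    return @user_template if template.nil?

    @user_template = template
    template.scan(PLACEHOLDER).flatten.each do |name|
      params << name.to_sym unless params.include?(name.to_sym)
    end
  end

  def params
    @params ||= []
  end
end

class GreeterSketch
  extend PromptDsl

  user "Summarize {topic} for {audience}"

  def initialize(**values)
    missing = self.class.params - values.keys
    raise ArgumentError, "missing params: #{missing.join(", ")}" unless missing.empty?
    @values = values
  end

  # Method-level view of the same template, with values substituted in.
  def user_prompt
    self.class.user.gsub(PromptDsl::PLACEHOLDER) do
      @values.fetch(Regexp.last_match(1).to_sym).to_s
    end
  end
end
```

The declaration reads like configuration, while parsing, de-duplication, and missing-param checks stay behind it.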
- executions: lean analytics columns (agent_type, model, tokens, costs, timing, status, tenant_id, metadata JSON)
- execution_details: large payloads via has_one :detail (prompts, response, tool_calls, attempts, fallback_chain)
- tenants: multi-tenancy budget tracking (limits, usage counters, enforcement mode)
lib/ruby_llm/agents/
├── core/ # Configuration, version, errors, instrumentation, LLM tenant
├── dsl/ # Base, Reliability, Caching DSL modules
├── pipeline/ # Context, Builder, Executor, middleware/
├── infrastructure/ # Reliability (retry/fallback/circuit breaker), budget, cache, alerts
├── routing/ # Classification concern (ClassMethods, Result)
├── results/ # Result classes for each agent type
├── text/ # Embedder
├── audio/ # Speaker, Transcriber, pricing
├── image/ # Generator, Analyzer, Editor, Transformer, etc.
└── rails/ # Engine
app/
├── models/ # Execution, ExecutionDetail, Tenant, TenantBudget
├── controllers/ # Dashboard, Agents, Executions, Tenants, SystemConfig
├── views/ # ERB templates with Tailwind CSS + Alpine.js
├── services/ # AgentRegistry
└── helpers/
spec/
├── agents/ # Core agent/pipeline/DSL/reliability specs
├── models/ # ActiveRecord model specs
├── controllers/ # Request specs
├── views/ # View specs
├── generators/ # Generator specs
├── migrations/ # Migration upgrade path specs (type: :migration)
├── factories/ # FactoryBot definitions
├── support/ # Schema builder, mock objects, shared examples
└── dummy/ # Minimal Rails app with SQLite in-memory
Located at example/ — a full Rails app demonstrating all agent types. Uses storage/development.sqlite3 (not db/). Agents organized in app/agents/ with subdirectories: embedders/, audio/, images/, routers/, concerns/.
Depth in lib/ (gem internals): Middleware, infrastructure services, DSL modules, and the pipeline are where complexity lives. Each should hide meaningful logic behind a simple interface.
Shallow in app/ (Rails engine): Controllers, jobs, and views are intentionally thin. They delegate to deep model methods and gem internals.
Concerns earn their existence: Every concern must hide real complexity. Don't extract a concern for a single method or trivial logic — keep it on the model until a second consumer needs it or the logic is substantial.
# GOOD: Deep concern that hides aggregate queries, trend math, grouping
module Execution::Analytics
  def self.cost_trends(period:, group_by:)
    # Complex SQL aggregation, date bucketing, trend calculation
  end
end
# BAD: Shallow concern with trivial delegation
module Execution::Status
  def success? = status == "success"
end
# Just put this on the model.
Each middleware must follow the pre/post pattern and hide meaningful logic:
# GOOD: Deep middleware that hides retry logic, backoff calculation,
# fallback switching, circuit breaker state
class Pipeline::Middleware::Reliability < Base
  def call(context)
    with_retries(context) do
      with_fallbacks(context) do
        with_circuit_breaker(context) do
          @app.call(context)
        end
      end
    end
  end
end
# BAD: Shallow middleware that just logs and passes through
class Pipeline::Middleware::Logger < Base
  def call(context)
    log("starting")
    @app.call(context)
  end
end
# Inline this into an existing middleware.
DSL modules provide the declarative interface for agent authors. Each module should feel like configuration, not programming:
# GOOD: Simple declaration hides complex configuration normalization
class MyAgent < Base
  on_failure do
    retries 3, backoff: :exponential
    fallback "gpt-4o-mini"
    circuit_breaker threshold: 5, reset_after: 60
  end
end
# BAD: DSL that mirrors the implementation (no depth gained)
class MyAgent < Base
  set_retry_count 3
  set_backoff_type :exponential
end
# This is just attr_writers with extra steps.
Configuration: use the existing singleton pattern.
RubyLLM::Agents.configure do |config|
  config.default_model = "gpt-4o"
end
# Do NOT introduce alternative config mechanisms. One way to configure.
Results: typed result objects per agent type.
Each agent type returns a specific Result subclass. Don't return raw hashes or untyped data. The result class hides response normalization.
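A typed result might look roughly like the sketch below. The class name `SummaryResultSketch` and its fields are illustrative assumptions, not one of the gem's Result subclasses; the point is that key and type normalization happens once, inside the result, instead of in every caller.

```ruby
# Hypothetical result class; field names are illustrative assumptions.
class SummaryResultSketch
  attr_reader :content, :input_tokens, :output_tokens

  # Accepts a raw hash with string or symbol keys and loosely typed values.
  def initialize(raw)
    data = raw.transform_keys(&:to_s)
    @content = data.fetch("content", "").to_s.strip
    @input_tokens = Integer(data["input_tokens"] || 0)
    @output_tokens = Integer(data["output_tokens"] || 0)
    freeze # results are read-only once built
  end

  def total_tokens
    input_tokens + output_tokens
  end

  def success?
    !content.empty?
  end
end
```

Callers then work with `result.content` and `result.total_tokens` rather than guessing at hash keys.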
Pipeline::Context — explicit data carrier: All data flows through named attributes on Context. Don't store data in instance variables on middleware or use thread-local storage. Context is the single source of truth.
Agent Registry — two-source discovery: Combines filesystem scanning with execution history. Don't simplify to one source — both are needed (filesystem for current agents, history for deleted/renamed ones).
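The two-source idea can be sketched as a set union. Everything here is hypothetical: `RegistrySketch`, its keyword arguments, and the `:active`/`:removed` labels are assumptions for illustration, and the real AgentRegistry reads from the filesystem and the executions table rather than from arrays.

```ruby
require "set"

# Hypothetical sketch; the real AgentRegistry gathers these names itself.
module RegistrySketch
  # file_names:    agent class names found on disk (assumed input)
  # history_names: agent types seen in past executions (assumed input)
  def self.discover(file_names:, history_names:)
    on_disk = file_names.to_set
    (on_disk | history_names).sort.map do |name|
      # Names present only in history belong to deleted or renamed agents.
      {name: name, status: on_disk.include?(name) ? :active : :removed}
    end
  end
end
```

Dropping either source loses information: without history, past executions of removed agents become orphans; without the filesystem, agents that have never run are invisible.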
| DO | DON'T |
|---|---|
| Follow existing patterns (middleware, DSL, Context) | Introduce new architectural patterns without justification |
| Keep diffs minimal and focused | Mix refactors with behavior changes |
| Add to existing middleware when appropriate | Create new middleware for trivial logic |
| Use concerns that hide real complexity | Extract concerns for one method or one consumer |
| Design simple public APIs on classes | Expose implementation details to callers |
| Keep controllers thin (delegate to models) | Put business logic in controllers |
| Keep jobs as thin infrastructure glue | Put business logic in jobs |
| Return typed Result objects | Return raw hashes or unstructured data |
| Use Pipeline::Context for data flow | Use instance variables or thread-local state |
| Validate at system boundaries (user input, LLM responses) | Add redundant validation for internal data |
| Type | Pattern | Example |
|---|---|---|
| Tables | ruby_llm_agents_ prefix | ruby_llm_agents_executions |
| Models | RubyLLM::Agents:: namespace | RubyLLM::Agents::Execution |
| Generators | RubyLlmAgents:: module | Note different casing from runtime |
| Middleware | Pipeline::Middleware:: namespace | Pipeline::Middleware::Budget |
| DSL Modules | Dsl:: namespace | Dsl::Reliability |
| Results | Results:: namespace + Result suffix | Results::EmbeddingResult |
| Agent types | app/agents/ with subdirs | embedders/, audio/, images/, routers/ |
Naming red flags: Manager, Handler, Processor, Helper, Utils, Wrapper. If you reach for these names, reconsider whether the abstraction earns its keep.
Always prefer exercising real code paths over mocking. Mocks hide bugs and make tests brittle to refactoring. Only mock when absolutely necessary (external HTTP calls to LLM APIs). Specifically:
- Do NOT mock internal classes like Pipeline::Executor, middleware, Result, DSL methods, or ActiveRecord models. Instantiate real objects and call real methods.
- Do NOT stub methods on the class under test. If you need to control inputs, pass them as arguments or set up proper test state.
- Do NOT replace RubyLLM::Agents.configuration with a test double. Use RubyLLM::Agents.reset_configuration! followed by RubyLLM::Agents.configure { |c| ... } to set real config values. Config doubles become stale whenever a new config attribute is added, causing hard-to-diagnose CI failures.
- DO mock the external LLM API boundary (RubyLLM::Chat, RubyLLM::Embedding, etc.) to avoid real network calls. Use the existing helpers in spec/support/ruby_llm_mock.rb and spec/support/chat_mock_helpers.rb.
- When testing process_response, pass a real or simple struct response object directly; don't mock the entire call chain to get there.
- When testing DSL behavior (routes, params, schema), define a real test class with the DSL and assert against its class-level state. No mocks needed.
- For database-touching specs, use FactoryBot and let the real ActiveRecord models run against SQLite in-memory.
The goal: if the real code breaks, the test should break too.
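For instance, the process_response guideline above could look like the following sketch. `TagAgentSketch` and `FakeResponse` are hypothetical names, and plain Ruby is used here instead of RSpec so the example stays self-contained; in the real suite the same call would sit inside a spec.

```ruby
# Hypothetical agent; only process_response matters for this illustration.
class TagAgentSketch
  # Turns a comma-separated LLM reply into a clean, de-duplicated tag list.
  def process_response(response)
    response.content.split(",").map(&:strip).reject(&:empty?).uniq
  end
end

# A simple struct stands in for the LLM response: no mocks, no call chain.
FakeResponse = Struct.new(:content)

tags = TagAgentSketch.new.process_response(FakeResponse.new("ruby, rails, ,ruby"))
# tags => ["ruby", "rails"]
```

Because the real method runs, a regression in the parsing logic fails the test directly, which is the outcome the guideline asks for.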
| Change Type | Required Tests |
|---|---|
| New middleware | Middleware spec exercising pre/post logic and edge cases |
| New DSL method | DSL spec with real test agent class asserting class-level state |
| New model/concern | Model spec with validations, associations, scopes |
| New controller action | Request spec |
| Bug fix | Regression test proving the fix |
| New agent type | Agent spec with mocked LLM boundary |
| New infrastructure | Unit spec for the service, integration spec for the pipeline |
- Tests use SQLite in-memory database; schema loaded from spec/dummy/db/schema.rb
- DatabaseCleaner with transaction strategy (truncation for migration tests)
- spec/support/ruby_llm_mock.rb mocks RubyLLM API calls to avoid real LLM requests
- Migration specs (spec/migrations/, type: :migration) use SchemaBuilder to rebuild schema at historical versions
- Integration tests require RUN_INTEGRATION=1 environment variable
- FactoryBot factories in spec/factories/
- Shared examples in spec/support/shared_examples/
- Ruby 3.4+/4.0 compatibility: require "ostruct" explicitly when using OpenStruct in specs
- Never log or store API keys, tokens, or credentials in execution records or metadata
- Filter sensitive parameters: ensure filter_parameters covers all secret fields
- Validate at boundaries: validate user input in controllers, validate LLM responses in result classes
- Strong parameters: always use permit in controllers, never raw params
- Dashboard auth: respect config.dashboard_auth lambda for access control
- Avoid command injection: never interpolate user input into shell commands or raw SQL
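As one way to picture the first two rules, the sketch below masks secret-looking keys before metadata is stored. `MetadataFilterSketch`, its key pattern, and the mask string are assumptions for illustration; the pattern is deliberately broad and would also mask harmless keys that merely contain "key". In a Rails app, ActiveSupport::ParameterFilter covers similar ground.

```ruby
# Hypothetical helper; not part of the gem.
module MetadataFilterSketch
  SENSITIVE = /key|token|secret|password|credential/i
  MASK = "[FILTERED]"

  # Returns a scrubbed copy; the input hash is left untouched.
  def self.scrub(value)
    case value
    when Hash
      value.each_with_object({}) do |(key, inner), out|
        out[key] = SENSITIVE.match?(key.to_s) ? MASK : scrub(inner)
      end
    when Array
      value.map { |item| scrub(item) }
    else
      value
    end
  end
end
```

Running the scrub at the point where metadata enters an execution record keeps the check at a single boundary instead of scattering it across callers.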
The engine includes a mountable dashboard using ERB + Tailwind CSS + Alpine.js.
- Use existing Tailwind utilities — don't introduce custom CSS unless absolutely necessary
- Use Alpine.js for interactivity — don't add other JS frameworks
- Keep views thin — complex logic belongs in helpers or model methods, not ERB
- Use partials for reuse — extract repeated markup into partials
- Accessibility — include focus states, ARIA labels on interactive elements
| DO | DON'T |
|---|---|
| Use Tailwind utility classes | Write custom CSS for things Tailwind handles |
| Use Alpine.js for dynamic behavior | Add jQuery, React, or other JS frameworks |
| Extract repeated markup into partials | Copy-paste view code between templates |
| Put display logic in helpers | Put complex conditionals in ERB |
| Use turbo_stream for dynamic updates | Full page reloads for partial updates |
When making significant changes to architecture, database schema, or public API, document the decision in changelog/.
- Database schema changes (new tables, columns, migrations)
- Public API changes (new DSL methods, changed signatures, deprecations)
- Middleware pipeline changes (new middleware, reordering)
- Major refactors affecting multiple files
- New architectural patterns introduced
Create a new file: changelog/YYYY-MM-DD-short-title.md
# Title
## Context
What is the background? What problem are we solving?
## Decision
What change was made?
## Consequences
- What are the implications?
- What migrations or follow-up work is needed?
- Design deep modules: every abstraction must hide meaningful complexity behind a simple interface
- Reject shallow wrappers: no pass-through methods, trivial concerns, or needless indirection
- Depth in lib/, thin in app/: middleware, DSL, and infrastructure are deep; controllers, jobs, and views are thin
- One way to do things: follow existing patterns (middleware pipeline, DSL, Context, Result objects)
- Keep diffs minimal: only change what's necessary, don't mix refactors with features
- Test real code: mock only the LLM API boundary, exercise everything else
- Explicit data flow: use Pipeline::Context, not instance variables or thread-locals
- Typed results: return Result subclasses, not raw hashes
- Concerns earn their keep: extract only when complexity warrants it
- Security by default: validate at boundaries, filter secrets, respect auth config