Lemonade Server Configuration

Overview

Lemonade Server starts automatically with the OS after installation. Configuration is managed through a single config.json file stored in the lemonade cache directory.

config.json

If you used an installer from a Lemonade release, config.json is at one of these locations, depending on your OS:

Linux — apt/.deb (Debian/Ubuntu): /var/lib/lemonade/.cache/lemonade/config.json
Linux — dnf/.rpm (Fedora/Red Hat): /opt/var/lib/lemonade/.cache/lemonade/config.json
Windows: %USERPROFILE%\.cache\lemonade\config.json
macOS: /Library/Application Support/lemonade/.cache/config.json

Note: For Debian/Ubuntu, upgrading the package automatically migrates data from the old /opt/var/lib/lemonade path to /var/lib/lemonade.

If you are using a standalone lemond executable, the default location is ~/.cache/lemonade/config.json.

Note: If config.json doesn't exist, it's created automatically with default values on first run.

Seeding defaults for packaged installs

On first run, config.json is initialized from the defaults baked into the release (resources/defaults.json). Packagers can override those defaults without editing the release by using the following mechanisms, listed in increasing precedence:

On Linux, lemond also merges /usr/share/lemonade/defaults.json if it exists, so distro packages can ship their own defaults (e.g. backend *_bin paths pointing at system-installed binaries).
Set the LEMONADE_DEFAULTS_PATH environment variable to a defaults.json at any location to merge it on top. This is the seam for non-FHS distros (Nix, Guix) that cannot write under /usr/share.

Values set in the user's config.json always take precedence over these seeded defaults.
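
The layering described above can be pictured as a series of merges, lowest precedence first. The following is only a minimal sketch: the function names are hypothetical, the distro path is made up for illustration, and it assumes nested sections are merged key by key rather than replaced wholesale, which this document does not state.

```python
import copy

def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`.

    Nested objects are merged key by key (an assumption, see the lead-in).
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

def effective_config(*layers: dict) -> dict:
    """Merge layers given in increasing precedence (lowest first)."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result

# Illustrative layers only; real files contain many more keys.
release_defaults = {"port": 13305, "llamacpp": {"backend": "auto", "vulkan_bin": "builtin"}}
distro_defaults = {"llamacpp": {"vulkan_bin": "/usr/lib/llama.cpp"}}  # hypothetical distro path
env_defaults = {}            # LEMONADE_DEFAULTS_PATH not set in this example
user_config = {"port": 9000}

config = effective_config(release_defaults, distro_defaults, env_defaults, user_config)
```

Here the user's port wins over the release default, while the distro-provided vulkan_bin survives because the user never set it.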

Example config.json

{
  "_generated": "GENERATED by docs/tools/gen_backend_boilerplate.py -- do not hand-edit per-recipe sections (they come from each backend's descriptor config_defaults()). Global keys are hand-maintained in this file. Regenerate and verify with that script; CI --check fails on drift.",
  "cloud_providers": [],
  "config_version": 2,
  "ctx_size": -1,
  "disable_model_filtering": false,
  "enable_dgpu_gtt": false,
  "extra_models_dir": "",
  "flm": {
    "args": ""
  },
  "global_timeout": 600,
  "host": "localhost",
  "kokoro": {
    "cpu_bin": "builtin"
  },
  "llamacpp": {
    "args": "",
    "backend": "auto",
    "cpu_args": "",
    "cpu_bin": "builtin",
    "cuda_bin": "builtin",
    "prefer_system": true,
    "rocm_args": "",
    "rocm_bin": "builtin",
    "vulkan_args": "",
    "vulkan_bin": "builtin"
  },
  "log_level": "info",
  "max_loaded_models": 1,
  "models_dir": "auto",
  "moonshine": {
    "args": "",
    "cpu_args": "",
    "cpu_bin": "builtin"
  },
  "no_broadcast": false,
  "no_fetch_executables": false,
  "offline": false,
  "port": 13305,
  "rocm_channel": "stable",
  "ryzenai": {
    "server_bin": "builtin"
  },
  "sdcpp": {
    "args": "",
    "backend": "auto",
    "cfg_scale": 7.0,
    "cpu_args": "",
    "cpu_bin": "builtin",
    "height": 512,
    "rocm_args": "",
    "rocm_bin": "builtin",
    "steps": 20,
    "vulkan_args": "",
    "vulkan_bin": "builtin",
    "width": 512
  },
  "telemetry": {
    "enabled": false,
    "hide_inputs": false,
    "hide_outputs": false,
    "hide_thinking": false,
    "max_queue_capacity": 1000,
    "otlp": {
      "batch_timeout_s": 1.0,
      "endpoint": "http://localhost:4318/v1/traces",
      "headers": {},
      "max_retries": 0,
      "protocol": "http/protobuf",
      "retry_backoff_base_s": 5.0,
      "semantics": [
        "openinference",
        "otel_genai"
      ],
      "send_batch_size": 100
    }
  },
  "vllm": {
    "args": "",
    "backend": "auto"
  },
  "websocket_port": "auto",
  "whispercpp": {
    "args": "",
    "backend": "auto",
    "cpu_args": "",
    "cpu_bin": "builtin",
    "npu_args": "",
    "npu_bin": "builtin"
  }
}

Settings Reference

Key	Type	Default	Description
`port`	int	13305	Port number for the HTTP server
`host`	string	"localhost"	Address to bind for connections
`log_level`	string	"info"	Logging level (trace, debug, info, warning, error, fatal, none)
`global_timeout`	int	600	Timeout in seconds for HTTP, inference, and readiness checks
`max_loaded_models`	int	1	Max models per type slot. Use -1 for unlimited
`no_broadcast`	bool	false	Disable UDP broadcasting for server discovery
`extra_models_dir`	string	""	Secondary directory to scan for GGUF model files
`models_dir`	string	"auto"	Directory for cached model files. "auto" follows HF_HUB_CACHE / HF_HOME / platform default
`ctx_size`	int	-1	Default context size for LLM models. Use `-1` for auto-resolution: the server computes the largest context that fits in available device memory using GGUF architecture metadata. Use a positive integer to set an explicit size.
`offline`	bool	false	Skip model downloads
`no_fetch_executables`	bool	false	Prevent downloading backend executable artifacts; backends must already be installed or use the system backend
`disable_model_filtering`	bool	false	Show all models regardless of hardware capabilities
`enable_dgpu_gtt`	bool	false	Include GTT for hardware-based model filtering
`rocm_channel`	string	"stable"	ROCm backend channel: "stable" (default) or "nightly". See llama.cpp Backend for details
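
As an illustration of the models_dir "auto" rule in the table above, the lookup order could be sketched as follows. This is an assumption-laden sketch, not Lemonade's code: resolve_models_dir is a hypothetical helper, and the "hub" subdirectory and the ~/.cache/huggingface/hub fallback are the usual Hugging Face conventions, which Lemonade's platform default may or may not match.

```python
from pathlib import Path
from typing import Mapping

def resolve_models_dir(models_dir: str, env: Mapping[str, str]) -> Path:
    """Pick the model cache directory for a given models_dir setting."""
    if models_dir != "auto":
        return Path(models_dir)              # explicit directory wins
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])     # first choice for "auto"
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"  # assumed Hugging Face layout
    # Assumed platform default; the real default may differ per OS.
    return Path.home() / ".cache" / "huggingface" / "hub"
```

In practice you would pass os.environ as env; taking it as a parameter keeps the sketch easy to test.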

Backend Configuration

Backend-specific settings are nested under their backend name:

llamacpp — LLM inference via llama.cpp:

Key	Default	Description
`backend`	"auto"	Backend to use: "auto" means "choose for me"
`args`	""	Custom arguments to pass to llama-server (fallback; unused when backend-specific args are defined)
`*_args`	""	Backend-specific custom arguments to pass to llama-server
`device`	""	Comma-separated list of devices to use for offloading. Empty is auto.
`prefer_system`	false	Prefer system-installed llama.cpp over bundled
`*_bin`	"builtin"	Backend binary selection — see Backend binary selection

whispercpp — Audio transcription:

Key	Default	Description
`backend`	"auto"	Backend to use: "auto" means "choose for me"
`args`	""	Custom arguments to pass to whisper-server (fallback; unused when backend-specific args are defined)
`*_args`	""	Backend-specific custom arguments to pass to whisper-server
`*_bin`	"builtin"	Backend binary selection — see Backend binary selection

sdcpp — Image generation:

Key	Default	Description
`backend`	"auto"	Backend to use: "auto" means "choose for me"
`args`	""	Custom arguments to pass to `sd-server` (fallback; unused when backend-specific args are defined)
`*_args`	""	Backend-specific custom arguments to pass to `sd-server`
`steps`	20	Number of inference steps
`cfg_scale`	7.0	Classifier-free guidance scale
`width`	512	Image width in pixels
`height`	512	Image height in pixels
`*_bin`	"builtin"	Backend binary selection — see Backend binary selection

flm — FastFlowLM NPU inference:

Key	Default	Description
`args`	""	Custom arguments to pass to flm serve

ryzenai — RyzenAI NPU inference:

Key	Default	Description
`server_bin`	"builtin"	Backend binary selection — see Backend binary selection

kokoro — Text-to-speech:

Key	Default	Description
`cpu_bin`	"builtin"	Backend binary selection — see Backend binary selection

cloud_providers — Cloud OpenAI-compatible providers (see Cloud Offload). Array, one object per installed provider:

Key	Description
`name`	Short identifier (e.g. `fireworks`). Used as the model-name prefix.
`base_url`	OpenAI-compatible base URL ending in `/v1` (or equivalent).

API keys for these providers are not stored in config.json — they live in LEMONADE_<PROVIDER>_API_KEY env vars (persistent) or lemond process memory via POST /v1/cloud/auth (ephemeral). Manage providers with lemonade cloud install/uninstall/auth/list rather than editing this section by hand.

telemetry — Unified telemetry and tracing configurations:

Key	Type	Default	Description
`enabled`	bool	false	Enable or disable telemetry tracing.
`hide_inputs`	bool	false	Redact prompt message content from spans.
`hide_outputs`	bool	false	Redact generated assistant message content from spans.
`hide_thinking`	bool	false	Redact reasoning/thought content from spans.
`max_queue_capacity`	int	1000	The maximum capacity of the in-memory telemetry queue buffer. Oldest spans are dropped when full. Must be `> 0`.
`otlp`	object	(nested object)	Sub-block grouping OTLP transport details (see below).

telemetry.otlp — Nested OTLP settings:

Key	Type	Default	Description
`endpoint`	string	"http://localhost:4318/v1/traces"	The OTLP endpoint to send traces to.
`protocol`	string	"http/protobuf"	Supported OTLP trace protocol: `"http/protobuf"` or `"http/json"`.
`semantics`	array of strings	["openinference", "otel_genai"]	Active trace semantics. Supported values: `"openinference"` and `"otel_genai"`.
`headers`	object	{}	Map of custom HTTP headers to pass to the OTLP receiver.
`max_retries`	int	0	Maximum number of retry attempts for failed exports. Set to `0` to disable retries and discard failed spans immediately. Must be `>= 0`.
`retry_backoff_base_s`	double	5.0	Base delay in seconds for exponential backoff retries. Must be `>= 0`.
`send_batch_size`	int	100	Target maximum number of spans to group in a single batched OTLP request. Must be `>= 1`.
`batch_timeout_s`	double	1.0	Maximum time to wait in seconds before exporting a partially filled batch of spans. Must be `> 0`.
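
The constraints listed in the two telemetry tables can be checked before writing a config by hand. The validator below is a hypothetical helper written for this illustration; it only encodes the bounds and allowed values stated in the tables and says nothing about how lemond itself reports invalid values.

```python
from typing import List

ALLOWED_PROTOCOLS = {"http/protobuf", "http/json"}
ALLOWED_SEMANTICS = {"openinference", "otel_genai"}

def validate_telemetry(telemetry: dict) -> List[str]:
    """Return human-readable problems found in a `telemetry` block."""
    errors: List[str] = []
    otlp = telemetry.get("otlp", {})

    # Missing keys fall back to the documented defaults.
    if telemetry.get("max_queue_capacity", 1000) <= 0:
        errors.append("max_queue_capacity must be > 0")
    if otlp.get("max_retries", 0) < 0:
        errors.append("otlp.max_retries must be >= 0")
    if otlp.get("retry_backoff_base_s", 5.0) < 0:
        errors.append("otlp.retry_backoff_base_s must be >= 0")
    if otlp.get("send_batch_size", 100) < 1:
        errors.append("otlp.send_batch_size must be >= 1")
    if otlp.get("batch_timeout_s", 1.0) <= 0:
        errors.append("otlp.batch_timeout_s must be > 0")
    if otlp.get("protocol", "http/protobuf") not in ALLOWED_PROTOCOLS:
        errors.append("otlp.protocol must be http/protobuf or http/json")
    unknown = set(otlp.get("semantics", [])) - ALLOWED_SEMANTICS
    if unknown:
        errors.append("otlp.semantics has unsupported values: " + ", ".join(sorted(unknown)))
    return errors
```

An empty list means the block satisfies every documented bound.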

Telemetry and Tracing Details

Lemonade uses a unified telemetry subsystem to trace requests and capture critical execution spans. The following technical behaviors apply:

Multi-Standard Semantic Conventions: Supports exporting traces using two co-existing semantics:
- OpenInference: Uses Arize Phoenix-compatible properties (always prefixed with openinference.span.kind, llm.model_name, llm.token_count.*).
- OpenTelemetry GenAI: Uses standard OpenTelemetry GenAI properties (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.input.messages, gen_ai.output.messages).
When both semantics are specified in telemetry.otlp.semantics, trace spans carry attributes for both conventions in a single network payload. This allows the collector to parse either convention without duplicate network requests.
Dynamic Attribute Prefixing: Span attributes are dynamically prefixed based on the query type to simplify filtering:
- llm.* for standard chat and completion spans.
- embedding.* for text embedding generation spans.
- reranker.* for document reranking spans.
Token Tracking: Captures and reports token usage metrics using semantic attributes depending on the enabled semantics:
- For OpenInference: Token count is prefixed with llm.token_count across all span kinds (llm.token_count.prompt, llm.token_count.completion, llm.token_count.total) alongside legacy keys like llm.usage.prompt_tokens.
- For OpenTelemetry GenAI: Token count uses standard fields like gen_ai.usage.input_tokens and gen_ai.usage.output_tokens.
Calculated Performance Metrics: In streaming mode, the server automatically computes and records throughput (llm.performance.tokens_per_second / gen_ai.usage.tokens_per_second depending on semantic conventions) and prefill latency (llm.performance.time_to_first_token / gen_ai.performance.time_to_first_token) if not natively returned by the backend (e.g., for vLLM and Cloud models).
vLLM Engine Telemetry: For the vLLM backend, the server queries the local /metrics endpoint on completion to attach scheduler queue metrics (llm.vllm.num_requests_waiting, llm.vllm.num_requests_running, llm.vllm.num_requests_swapped) and KV cache utilization (llm.vllm.gpu_cache_usage_factor, llm.vllm.cpu_cache_usage_factor) directly to the trace spans.
Reasoning Model Support: For reasoning models (e.g., DeepSeek models), the server extracts and records reasoning_content from the assistant's generation. Any variant thought-termination tags (e.g., </think|>) are automatically standardized to the canonical </think> tag.
Exporter Retry Backoff: When retries are enabled (i.e., max_retries > 0), the exporter uses an exponential backoff strategy combined with randomized jitter for failed posts. The base retry interval starts at retry_backoff_base_s seconds (defaulting to 5), doubling on each subsequent failure (e.g., 5s, 10s, 20s, 40s), up to a maximum cap of 60 seconds. A randomized jitter factor between 0.5 and 1.5 is applied to each calculated delay to prevent a "thundering herd" when the collector recovers. Permanent client errors (4xx HTTP status codes, excluding 429 Too Many Requests) are classified as non-retryable and cause the batch to be dropped immediately to save resources.
OTLP Trace Batching: Spans are aggregated in an in-memory queue buffer and exported in batches to minimize network overhead and maximize compression efficiency. Batching operates on a dual-trigger system: a batch is immediately serialized and dispatched if it reaches send_batch_size (default: 100), or if batch_timeout_s (default: 1.0 second) has elapsed since the oldest span in the batch arrived. All remaining traces are flushed cleanly to the OTel collector upon server shutdown. Users can also trigger a manual flush at any time via the POST /internal/telemetry/flush endpoint.
Request Failure Tracing: Captures request failures directly on the telemetry spans. If a model fails to load, a request is rejected by the router, or a streaming connection encounters an exception or a non-200 HTTP status code from the backend, the span is ended with Error status and the specific error message is attached.
Queue Blocking & Thundering Herd Prevention: To prevent client requests from hanging and to avoid exhausting resources when the telemetry receiver endpoint is down, Lemonade employs a fail-fast mechanism. The exporter memory buffer is strictly bounded to a capacity of max_queue_capacity spans (default: 1000). When full, a head-drop (FIFO) eviction policy is applied to drop the oldest telemetry spans to make room for newer ones, prioritizing current application state. If a telemetry transmission task fails all of its retries and is dropped, the endpoint is marked as unreachable. While in this unreachable state, subsequent spans in the transmission queue are attempted only once and immediately dropped without backoff delay if they fail, preventing the telemetry queue from blocking server operations. A single successful span delivery to the endpoint automatically resets the unreachable state and restores normal retry behavior.
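
To make the retry schedule and the head-drop queue described above concrete, here is a small sketch. It is not the exporter's code: retry_delay is a hypothetical name, and applying the 60-second cap before the jitter factor (rather than after) is an assumption, since the text does not specify the order.

```python
import random
from collections import deque

MAX_DELAY_S = 60.0  # documented cap on the backoff delay

def retry_delay(attempt: int, base_s: float = 5.0, rng=random) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The delay doubles per attempt, is capped, then multiplied by a
    jitter factor drawn uniformly from [0.5, 1.5].
    """
    capped = min(base_s * (2 ** attempt), MAX_DELAY_S)
    return capped * rng.uniform(0.5, 1.5)

# Head-drop (FIFO) eviction: a bounded deque discards its oldest
# entry when a new one is appended to a full buffer.
span_queue = deque(maxlen=3)  # stands in for max_queue_capacity
for span_id in range(5):
    span_queue.append(span_id)
# span_queue now holds the three newest spans: 2, 3, 4
```

With the jitter factor fixed at 1.0, the first five delays are 5, 10, 20, 40 and 60 seconds, matching the sequence given in the text.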

Backend binary selection

Every *_bin key (e.g. llamacpp.vulkan_bin, whispercpp.cpu_bin, sdcpp.rocm_bin) accepts the same set of values:

Value	Meaning
`"builtin"` (default)	Use the version of the upstream backend that lemonade pins in its release. Recommended for most users — these versions are tested with this lemonade build.
`""`	Same as `"builtin"`.
`"latest"`	Resolve to the most-recent upstream GitHub release on first install or first status query for that backend, then install on demand. The resolved tag is recorded in `<lemonade-home>/bin/<recipe>/<backend>/version.txt`.
`"b8664"` / `"v1.8.2"` / etc.	A specific upstream release tag. Lemonade downloads that exact version from GitHub.
`"/path/to/bin"`	A directory you populated yourself (e.g. a local build). Lemonade uses the executable inside this directory and never downloads. The path must exist when set.

Note: the latest setting is experimental.

Note: llamacpp.rocm_bin version tags are channel-specific. Each ROCm channel downloads from a different GitHub repository, so you must set the correct rocm_channel before pinning rocm_bin to a specific tag. See Pinning to a Specific Version Tag for details.
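
The value table above can be read as a small classification. The helper below is a hypothetical sketch of that reading; in particular, treating any value that contains a path separator as a directory and everything else as a release tag is an assumption of this example, not a documented rule.

```python
def classify_bin_value(value: str) -> str:
    """Map a *_bin setting to one of: builtin, latest, path, tag."""
    if value in ("", "builtin"):
        return "builtin"   # the empty string behaves like "builtin"
    if value == "latest":
        return "latest"
    if "/" in value or "\\" in value:
        return "path"      # assumed heuristic for a local directory
    return "tag"           # e.g. "b8664" or "v1.8.2"
```

For example, classify_bin_value("b8664") yields "tag", while classify_bin_value("/home/me/llama.cpp/build/bin") yields "path".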

Examples:

# Track upstream llama.cpp Vulkan releases (resolved on first install or status query)
lemonade config set llamacpp.vulkan_bin=latest

# Pin to a specific llama.cpp build
lemonade config set llamacpp.vulkan_bin=b8664

# Use your own llama.cpp build
lemonade config set llamacpp.vulkan_bin=/home/me/llama.cpp/build/bin

# Revert to the version lemonade ships
lemonade config set llamacpp.vulkan_bin=builtin

Behavior when `*_bin` changes

Changing a *_bin value applies live: lemonade unloads any model currently using that backend, downloads the new binary if needed, and reloads the model on the new binary. No lemond restart is required.

`latest` re-resolution

"latest" is resolved once per lemond process. The first install or status query for a latest-pinned backend hits the GitHub API; the resolved tag is then cached in memory for the rest of the process lifetime. Subsequent installs and status queries (including manual lemonade backends install) reuse the cached tag and do not re-query GitHub. Restart lemond to pick up a newer upstream release.
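
The once-per-process behavior amounts to memoizing the lookup in memory. A minimal sketch, assuming a hypothetical fetch_latest_tag callable that stands in for the GitHub API query:

```python
from typing import Callable, Dict

class LatestTagCache:
    """Remember the resolved tag per backend for the process lifetime."""

    def __init__(self, fetch_latest_tag: Callable[[str], str]) -> None:
        self._fetch = fetch_latest_tag   # hypothetical network lookup
        self._resolved: Dict[str, str] = {}

    def resolve(self, backend: str) -> str:
        # Only the first call per backend reaches the network.
        if backend not in self._resolved:
            self._resolved[backend] = self._fetch(backend)
        return self._resolved[backend]
```

Because the cache lives only in memory, a new process starts empty, which is why a restart is what picks up a newer release.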

Upgrade signals in `lemonade backends`

The lemonade backends listing surfaces two upgrade signals for backends pinned to "latest":

update_available — A newer upstream release exists than what's installed. The backend keeps running on the installed version; the listed action is the install command to apply the upgrade when you're ready.
update_required — The installed version is older than the version lemonade ships in this release. This forces an upgrade prompt because running below the lemonade-shipped baseline is not supported.

Backends pinned to a specific tag (e.g. b8664) do not get either signal — they're treated as an explicit user choice.

Interactions with other config

offline: true blocks the GitHub call for "latest". If a previously-installed version.txt exists in the install directory, lemonade reuses that version with a warning. Otherwise the install fails.
no_fetch_executables: true blocks all downloads, including resolving and installing "latest" and any version-tag pin. Existing installs continue to work.
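
The two interactions above can be summarized as a decision function for a backend pinned to "latest". This is a sketch with hypothetical names and return labels; checking no_fetch_executables before offline when both are set is an assumption, since the text does not say which wins.

```python
from typing import Optional

def plan_latest(offline: bool, no_fetch_executables: bool,
                installed_tag: Optional[str]) -> str:
    """Decide what to do for a backend whose *_bin is "latest".

    `installed_tag` is the tag recorded in an existing version.txt,
    or None when nothing is installed yet.
    """
    if no_fetch_executables:
        # All downloads are blocked; only an existing install can be used.
        return "use-installed" if installed_tag else "fail"
    if offline:
        # The GitHub call is blocked; fall back to version.txt with a warning.
        return "use-installed-with-warning" if installed_tag else "fail"
    return "query-github"
```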

Editing Configuration

lemonade config (recommended)

Use the lemonade config CLI to view and modify settings while the server is running. Changes are applied immediately and persisted to config.json.

# View all current settings
lemonade config

# Set one or more values
lemonade config set key=value [key=value ...]

Top-level settings use their JSON key name directly. Nested backend settings use dot notation (section.key=value):

# Change the server port and log level
lemonade config set port=9000 log_level=debug

# Change a backend setting
lemonade config set llamacpp.backend=rocm

# Set multiple values at once
lemonade config set port=9000 llamacpp.backend=rocm sdcpp.steps=30
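
To show how a dotted key=value pair maps onto the nested JSON structure, here is a minimal sketch. apply_setting is a hypothetical helper, and the value coercion (try JSON first, otherwise keep the raw string) is an assumption of this example; the CLI's actual type handling is not described in this document.

```python
import json

def apply_setting(config: dict, assignment: str) -> None:
    """Apply one `section.key=value` assignment to a config dict in place."""
    key_path, sep, raw = assignment.partition("=")
    if not sep or not key_path:
        raise ValueError("expected key=value, got: " + repr(assignment))
    try:
        value = json.loads(raw)      # 9000 -> int, true -> bool, 7.5 -> float
    except json.JSONDecodeError:
        value = raw                  # anything else stays a string
    *parents, leaf = key_path.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # walk or create nested sections
    node[leaf] = value

config = {"port": 13305, "llamacpp": {"backend": "auto"}}
for item in ["port=9000", "llamacpp.backend=rocm", "sdcpp.steps=30"]:
    apply_setting(config, item)
```

After the loop, port is the integer 9000, llamacpp.backend is "rocm", and a new sdcpp section holds steps set to 30.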

lemond CLI arguments (fallback)

If the server cannot start (e.g., invalid port in config.json), lemond accepts --port and --host as CLI arguments to override config.json. These overrides are persisted so the server can start normally next time:

lemond --port 9000 --host 0.0.0.0

Edit config.json manually (last resort)

If the server won't start and CLI arguments aren't sufficient, you can edit config.json directly. Restart the server after making changes:

# Linux (Debian/Ubuntu)
sudo nano /var/lib/lemonade/.cache/lemonade/config.json

# Linux (Fedora/Red Hat)
sudo nano /opt/var/lib/lemonade/.cache/lemonade/config.json

sudo systemctl restart lemond

# Windows — edit with your preferred text editor:
# %USERPROFILE%\.cache\lemonade\config.json
# Then quit and relaunch from the Start Menu

lemond CLI

lemond [cache_dir] [--port PORT] [--host HOST]

cache_dir — Path to the lemonade cache directory containing config.json and model data. Optional; defaults to platform-specific location.
--port — Port to serve on (overrides config.json, persisted). Use as a fallback if the server cannot start.
--host — Address to bind (overrides config.json, persisted). Use as a fallback if the server cannot start.

API Key and Security

Regular API Key

The LEMONADE_API_KEY environment variable sets an API key for authentication on regular API endpoints (/api/*, /v0/*, /v1/*). On Linux with systemd, set it in the service environment (e.g., via a systemd override or drop-in file). On Windows, set it as a system environment variable.

When LEMONADE_API_KEY is set, the inference and model-management endpoints reject any request that does not present a matching Bearer token. This is the only credential that gates those endpoints, so it controls whether unauthenticated clients can reach the server at all. When it is unset, those endpoints are reachable without authentication.
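
From the client side, presenting the key means adding a standard Bearer Authorization header. The sketch below only builds the request and does not send it; build_request is a hypothetical helper, and /v1/models is used as an assumed example path under /v1/*.

```python
import os
import urllib.request
from typing import Optional

def build_request(path: str, api_key: Optional[str] = None,
                  host: str = "localhost", port: int = 13305) -> urllib.request.Request:
    """Build a request to the server, attaching a Bearer token if a key is given."""
    request = urllib.request.Request("http://" + host + ":" + str(port) + path)
    if api_key:
        request.add_header("Authorization", "Bearer " + api_key)
    return request

# Read the key from the same environment variable the server uses.
request = build_request("/v1/models", api_key=os.environ.get("LEMONADE_API_KEY"))
```

Passing the result to urllib.request.urlopen would send it; without a matching token, a server with LEMONADE_API_KEY set rejects the call.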

Admin API Key

The LEMONADE_ADMIN_API_KEY environment variable provides elevated access to both regular API endpoints and internal endpoints (/internal/*). When set, it takes precedence over LEMONADE_API_KEY for client authentication.

LEMONADE_ADMIN_API_KEY enables privilege separation between two classes of authenticated clients. Holders of LEMONADE_API_KEY can reach the regular API endpoints, while only holders of LEMONADE_ADMIN_API_KEY can reach the internal control endpoints (/internal/*, e.g. shutdown and configuration). A client presenting only LEMONADE_API_KEY cannot reach /internal/* if LEMONADE_ADMIN_API_KEY is set to a distinct value. If LEMONADE_ADMIN_API_KEY is not set, it defaults to the value of LEMONADE_API_KEY, so the regular key then also authenticates against /internal/* and no privilege separation exists.

Authentication Hierarchy:

Scenario	`LEMONADE_API_KEY`	`LEMONADE_ADMIN_API_KEY`	Internal Endpoints	Regular API Endpoints
No keys set	(not set)	(not set)	No auth required	No auth required
Only API key	"secret"	(not set)	Requires key	Requires key
Only admin key	(not set)	"admin"	Requires admin key	No auth required
Both keys different	"regular"	"admin"	Requires admin key	Either key accepted

Client Behavior: Clients (CLI, tray app) automatically prefer LEMONADE_ADMIN_API_KEY if set, otherwise fall back to LEMONADE_API_KEY.
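
The authentication table can be expressed as one predicate. The following is a sketch of the documented rules, not the server's implementation: is_authorized is a hypothetical name, and a production check would normally use a constant-time comparison such as hmac.compare_digest.

```python
from typing import Optional

def is_authorized(path: str, token: Optional[str],
                  api_key: Optional[str], admin_key: Optional[str]) -> bool:
    """Apply the authentication hierarchy to one request.

    `token` is the Bearer token presented by the client, if any.
    """
    if path.startswith("/internal/"):
        # The admin key defaults to the regular key when unset.
        required = admin_key if admin_key is not None else api_key
        return required is None or token == required
    if api_key is None:
        return True   # regular endpoints are open without LEMONADE_API_KEY
    accepted = {api_key}
    if admin_key is not None:
        accepted.add(admin_key)   # the admin key also unlocks regular endpoints
    return token in accepted
```

For instance, with both keys set to different values, the regular key is accepted on /v1/* but rejected on /internal/*.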

Remote Server Connection

To make Lemonade Server accessible from other machines on your network, set the host to 0.0.0.0:

lemonade config set host=0.0.0.0

Warning: Using host: "0.0.0.0" allows connections from any machine on the network — including to the internal control endpoints (/internal/*, e.g. shutdown and config). Only do this on trusted networks, and set an API key to manage access. LEMONADE_API_KEY secures all endpoints; LEMONADE_ADMIN_API_KEY on its own secures only /internal/* and leaves the inference and model-management endpoints (/api, /v0, /v1) open, so set LEMONADE_API_KEY to protect those too. The server logs a warning at startup when bound to a non-loopback host without the regular key.

Next Steps

The Server Specification provides more information about how to integrate Lemonade Server into an application.
