Drop-in cost optimization for LangChain. One line of config routes your existing ChatOpenAI / ChatAnthropic / ChatMistralAI / ChatGroq / ChatCohere through the Tessera optimization proxy — auto-route to cheaper-equivalent models, exact + provider-prompt-cache hits, prompt compression with per-stack quality canary, batch arbitrage on async-tolerant calls. Free Sandbox tier: 60M tokens/month, no card. Paid tiers: flat monthly subscription by token volume, keep 100% of savings.
Companion to tessera-sdk (vanilla provider SDKs), tessera-vercel-ai (Vercel AI SDK integration), tessera-llamaindex (LlamaIndex integration), tessera-mastra (Mastra Agent framework integration), tessera-pydantic-ai (Pydantic AI integration), tessera-crewai (CrewAI multi-agent integration), and tessera-autogen (AutoGen 0.4+ multi-agent integration). Same proxy, same mechanic stack, LangChain-shaped API.
▶ 41-second walkthrough: live counter ticks · baseline $74,800 → actual $30,000 ($44,800 saved, 60% reduction) · audit-immutable savings ledger. Click to play.
Worked example, customer-support agent on gpt-4o, 5B tokens/month:
| Stage | Cost / month | Saved |
|---|---|---|
| Baseline — OpenAI direct | $24,000 | — |
| + Tessera (route, cache, prompt-cache headers, compress, output-length ceiling, batch) | $9,400 | $14,600 |
| Tessera subscription (Growth tier, flat) | $999 | — |
| You net pay | $10,399 | $13,601 / mo saved |
Verify the savings math yourself. Every billable line traces back to two immutable cost figures pinned to a multi-source pricing catalog snapshot captured at request time. Two engineers, three hours, can re-derive any month from raw inputs. Full procedure at tesseraai.io/trust.
Quality canary across the full mechanic stack: mean-score 0.96 (floor 0.95) — 0.95 SLA held all 30 days. Full walkthrough on /blog/cut-openai-bill-48-percent-without-quality-regression.
pip install tessera-langchain # Python
npm install @tessera-llm/langchain # Node / TypeScriptGet a free API key (60M tokens/mo, no card) — tesseraai.io/dev. Sign-up takes about 30 seconds and returns an instant tk_… key plus magic-link dashboard access.
from langchain_openai import ChatOpenAI
from tessera_langchain import tessera_openai_config
llm = ChatOpenAI(
model="gpt-4o",
openai_api_key="sk-...", # your OpenAI key, unchanged
**tessera_openai_config(api_key="tk_..."), # one line, routes through Tessera
)
# Existing LangChain code runs unchanged.
response = llm.invoke("Summarize the architecture of a Kubernetes operator in 3 bullets.")Same pattern for the other providers:
from langchain_anthropic import ChatAnthropic
from tessera_langchain import tessera_anthropic_config
llm = ChatAnthropic(
model="claude-sonnet-4-5-20250929",
anthropic_api_key="sk-ant-...",
**tessera_anthropic_config(api_key="tk_..."),
)Or wrap an existing ChatModel instance instead:
from langchain_openai import ChatOpenAI
from tessera_langchain import wrap_openai
base = ChatOpenAI(model="gpt-4o", openai_api_key="sk-...")
llm = wrap_openai(base, tessera_api_key="tk_...") # returns a new ChatOpenAI routed through Tesseraimport { ChatOpenAI } from "@langchain/openai";
import { tesseraOpenAIConfig } from "@tessera-llm/langchain";
const llm = new ChatOpenAI({
model: "gpt-4o",
apiKey: process.env.OPENAI_API_KEY!, // your OpenAI key, unchanged
...tesseraOpenAIConfig({ apiKey: process.env.TESSERA_API_KEY! }),
});
const response = await llm.invoke(
"Summarize the architecture of a Kubernetes operator in 3 bullets."
);Anthropic / Mistral / Groq / Cohere mirror the same shape — see examples/.
Wrap an existing instance:
import { ChatOpenAI } from "@langchain/openai";
import { wrapOpenAI } from "@tessera-llm/langchain";
const base = new ChatOpenAI({ model: "gpt-4o", apiKey: process.env.OPENAI_API_KEY! });
const llm = wrapOpenAI(base, process.env.TESSERA_API_KEY!);Same mechanic stack as the main tessera-sdk. Each mechanic is opt-in per workload, observable per request, and bypasses when its quality canary drops below the per-stack 0.95 floor.
| Mechanic | What it does | Typical savings |
|---|---|---|
| Auto-route (m1) | Route to a cheaper-equivalent model gated by a daily promptfoo canary on your eval set | 15–35% on routed calls |
| Auto-cache (m2) | sha256 cache on the canonical request body, 7-day TTL, Cloudflare edge KV | 5–40% depending on prompt repetition |
| Auto-compress (m3) | Per-role heuristic compression (system + user toggles independent). Preserves code fences and JSON shapes. | 5–15% on prompt tokens |
| Prompt cache (m6) | Inject provider-native cache headers — OpenAI cached-input (50% off), Anthropic cache_control: ephemeral (90% off cache reads) |
50–90% on cached prefixes |
| Context prune (m7) | Conservative trim on long conversations (system + last 8 turns; TF-IDF rerank on RAG attachments) | 5–25% on multi-turn workloads |
| Output-length ceiling (m9) | Daily compute fits p90 of completion length per workload, injects max_tokens = p90 × 1.3 |
5–15% on completion cost |
| Batch arbitrage (m10) | Route async-tolerant calls to provider Batch APIs (OpenAI Batch + Anthropic Message Batches both 50% off) | 50% on batch-eligible traffic |
| Per-provider circuit breaker | (Reliability primitive, above the mechanics.) Rolling 5xx-rate state machine per upstream — when a provider degrades, auto-route skips its intra-provider alternative mappings until the half-open probe succeeds. Details on /how-it-works. | n/a — keeps the savings stack honest |
| Provider | LangChain class (Py) | LangChain.js class | Tessera proxy URL |
|---|---|---|---|
| OpenAI | ChatOpenAI |
ChatOpenAI |
https://api.tesseraai.io/v1/openai |
| Anthropic | ChatAnthropic |
ChatAnthropic |
https://api.tesseraai.io/v1/anthropic |
| Mistral | ChatMistralAI |
ChatMistralAI |
https://api.tesseraai.io/v1/mistral |
| Groq | ChatGroq |
ChatGroq |
https://api.tesseraai.io/v1/groq |
| Cohere | ChatCohere |
ChatCohere |
https://api.tesseraai.io/v1/cohere |
Other LangChain provider integrations can use the OpenAI-compat base URL by configuring base_url=https://api.tesseraai.io/v1/openai and default_headers={"x-tessera-api-key": "tk_..."} directly — works for anything that speaks the OpenAI wire format.
- Free Sandbox — 60M tokens/month, 30 requests/minute, full mechanic stack active (route · cache · compress · batch). No card. Forever.
- Paid tiers — flat monthly subscription by token volume: Starter $199 (≤1B), Growth $999 (≤5B), Scale $3,999 (≤20B), Enterprise custom (20B+). You keep 100% of measured savings.
Existing early customers of tessera-sdk keep their rate_locked_pct (25% Founding Pilot) on this package too — same tk_… key, same billing record.
Same proxy. Same mechanics. Same billing. tessera-sdk patches the underlying provider clients (OpenAI, Anthropic, Mistral, Groq, Cohere) directly. tessera-langchain plugs into the LangChain ChatModel constructor — useful when you're already on LangChain and want to keep the abstraction.
If you're on LangChain, install this one. If you're using the raw provider SDK without LangChain, install tessera-sdk. Both packages are safe to install side by side.
No. Your LangChain ChatModel object behaves identically — same .invoke(), same .stream(), same .bind_tools(), same .with_structured_output(). Only the upstream HTTP endpoint changes, and the response shape is the OpenAI / Anthropic / etc. wire format unchanged.
Your application gets HTTP errors instead of LLM responses. To passthrough on error, configure your LangChain max_retries to fall back to a non-Tessera client (we'll document this pattern explicitly in a future release). On the proxy side, a per-provider circuit breaker tracks rolling 5xx rates and skips degraded providers in auto-route decisions — cross-provider failover (re-routing to a different provider entirely when an upstream is down) is on the roadmap, not shipped yet.
They pass through. Tessera does not aggregate quotas across customers. Your provider rate limits apply normally; the proxy enforces only the Tessera tier limits (30 rpm Free Sandbox, 60 rpm Production by default — higher on request).
No. We log only token counts, cost deltas, mechanics_stack, and provider response status. Prompts and completions are never persisted. Full data handling on tesseraai.io/security.
Yes — every request gets a row in /portal/audit showing the canonical mechanics_stack, the model that fired, the original vs actual cost, and the pricing_catalog snapshot id used. Export to CSV any time.
Per-workload in /portal/settings. Each mechanic has its own toggle. Per-role compression has independent toggles (compress system, compress user turns). Mechanics also auto-disable per-stack if the daily quality canary drops below 0.95 for the affected mechanic combination.
Tessera built and maintains this package. It uses public LangChain constructor APIs (no monkey-patching, no private imports). LangChain's own pluggability is what makes this clean.
See examples/:
openai-langchain.py— ChatOpenAI through Tesseraanthropic-langchain.py— ChatAnthropic through Tesseraopenai-langchain.ts— TypeScript / LangChain.js
- AI-native teams spending $5k+/month on OpenAI / Anthropic / Gemini and wanting that bill cut without re-architecting.
- LangChain users who do not want to swap to a different abstraction just to add an optimization proxy.
- Production workloads with eval sets — Tessera's mechanic stack only fires when the per-stack canary holds 0.95 quality. If you don't have an eval set yet, the Free Sandbox tier is the right place to start.
- Hobby projects under ~$500/month total bill — the Free Sandbox tier covers you with the same full mechanic stack; a paid tier isn't worth the subscription cost at that volume.
- Air-gapped / on-prem deployments — Tessera is hosted-only.
- Workloads with no repetition AND no stable prefix — exact cache and prompt-cache headers won't fire. Auto-route and batch arbitrage might still help; worth measuring on Free Sandbox first.
- High-latency-sensitivity workloads with <10ms p50 SLO — the proxy adds 15-25 ms p50 from the Cloudflare edge.
Open-source SDK ↔ closed-source proxy. This package is a thin client that adds one HTTP hop. The actual mechanic decisions (route, cache, compress, etc.) run inside the Tessera Cloudflare Worker proxy at api.tesseraai.io. The split is intentional: the wire format is open so you can audit what we send; the mechanic implementations are closed because that's the asymmetric IP. See the tessera-sdk README's "Architecture" note for the longer explanation.
Apache-2.0. See LICENSE.
We accept PRs that:
- Add support for a new LangChain provider class (paste-and-mirror the existing config function shape)
- Improve typing precision (TypeScript strict, Python
mypy --strict) - Add concrete example scripts under
examples/showing a real LangChain pipeline - Improve tests or test infrastructure
We do not accept PRs that change the proxy's HTTP contract — that lives in the closed-source worker.
See CONTRIBUTING.md (TODO — same as tessera-sdk contributing guide).
Semver. Wire format compatibility committed across minor releases; breaking changes only on major bumps. Independent versioning from tessera-sdk (per-package CHANGELOG).
See SECURITY.md. Coordinated disclosure address: security@tesseraai.io.
Tessera is the substrate layer for LLM cost optimization, also called the Optimize Layer in our product surface. A thin proxy that sits in your application's request-path, applies a conservative cascade of optimization mechanics, and measures every saved dollar against an audit-immutable baseline. We bill a flat monthly subscription by token volume (Starter $199, Growth $999, Scale $3,999, Enterprise custom); you keep 100% of measured savings. No per-token gateway fee; the category we operate in is "LLM cost optimizer," distinct from per-token AI gateways and observability dashboards.
Where observability tools tell you what you spent and AI gateways re-shape the request without measuring the cost delta, Tessera is the layer that does both, and proves the measured savings line by line. The verified-savings ledger at ledger.tesseraai.io shows every original-vs-actual cost pair, snapshot-pinned to a pricing_catalog version captured at request time. Mid-contract price changes don't retroactively alter past savings. This is the FinOps-friendly model for AI inference: every line of the bill traces to a code-enforced rule.
Operated by Fintechagency OÜ (Tallinn, Estonia, registry code 16638667).
- Developer entry: tesseraai.io/dev
- Mechanic reference: tesseraai.io/how-it-works
- Dashboard: ledger.tesseraai.io
- Engineering blog: tesseraai.io/blog
