citeformer is deliberately thin. The hard technical work — token masking, CSL rendering, PDF extraction, NLI — already lives in well-maintained dependencies. Our job is to compose them behind a single honest API.
CLI → orchestration (Citeformer) → verify → render → backends → grammar → core
Upper layers depend only on lower. A render module must never import from backends; a backend must never reach up into orchestration. Break this and the refactor radius explodes.
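The layering rule can be checked mechanically. Below is a minimal sketch, not citeformer's actual tooling: the layer names are plain labels taken from the chain above (not confirmed package paths), `import_allowed` and `violations` are hypothetical helpers, and the render-to-backends prohibition stated in the text is encoded as an explicit forbidden pair on top of the ordering rule. Same-layer imports are assumed to be fine.

```python
# Layer order from the chain above, highest first. These are illustrative
# labels, not confirmed citeformer package paths.
LAYERS = ["cli", "orchestration", "verify", "render", "backends", "grammar", "core"]
RANK = {name: position for position, name in enumerate(LAYERS)}

# Pairs the text rules out explicitly, in addition to the ordering rule.
FORBIDDEN = {("render", "backends")}


def import_allowed(importer: str, imported: str) -> bool:
    """Return True if a module in layer `importer` may import from `imported`."""
    if (importer, imported) in FORBIDDEN:
        return False
    if importer == imported:
        return True  # assumption: imports within one layer are allowed
    # A layer may only reach layers that sit later (lower) in the chain.
    return RANK[imported] > RANK[importer]


def violations(edges):
    """Filter a list of (importer, imported) pairs down to the disallowed ones."""
    return [(a, b) for a, b in edges if not import_allowed(a, b)]


if __name__ == "__main__":
    sample = [("cli", "core"), ("backends", "orchestration"), ("render", "backends")]
    for importer, imported in violations(sample):
        print(f"layering violation: {importer} imports {imported}")
```

A check like this could run in CI over the edges of an import graph, so that a violation fails the build instead of surfacing during a refactor.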
Before writing new code, ask: is this already done by one of these?
| We piggyback on | For |
|---|---|
| XGrammar / llguidance | Grammar-level token masking at generation time |
| transformers (HF) | Running local causal LMs |
| vLLM | High-throughput inference with --guided-decoding-backend |
| llama.cpp (llama-cpp-python) | CPU / Apple Silicon inference with GBNF grammars |
| openai / anthropic / google-genai / mistralai | API-provider generation clients (the openai SDK is also the wire client for OpenRouter) |
| lark | Authoring the citation grammar before handing off to the decoder |
| httpx + diskcache | Metadata fetchers (Crossref, arXiv) with polite caching |
| pypdf / grobid-client-python | PDF text extraction — pypdf default, GROBID opt-in for cleaner scientific-paper parsing |
| readability-lxml | URL extraction |
| DeBERTa-v3-MNLI (via transformers) | NLI entailment for verify() |
| pydantic + typer + rich | Types, CLI, pretty output |
The parts citeformer owns are the glue plus the render layer: the citation grammar shape (§10.1), the CSL-JSON source metadata contract (§10.2), the output pydantic models (§10.3), the inline-marker-to-reference coupling, the orchestration loop, and the six hand-written CSL formatters (APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver — see ADR-004). Everything else is a composition.
v0.1.0 shipped on 2026-04-24. Each phase was a mergeable milestone with its own exit criterion; see the frozen genesis at docs/spec/v0.md for the original plan.
| Phase | Scope | Exit criterion |
|---|---|---|
| P0 | Scaffolding: pyproject, CI, docs skeleton, .claude/ | make lint && make test && make docs-build green; v0.0.1 publishes to TestPyPI |
| P1 | Core types: Source, Citation, Reference, GenerationResult, Policy, Backend ABC | Contracts locked; mock backend works end-to-end |
| P2 | HF backend with grammar-level logit enforcement (the flagship) | Smoke test: given N sources, model cannot emit [N+k] for any k > 0, across 100+ prompts |
| P3 | Deterministic CSL reference rendering (home-grown, see ADR-004) | APA 7, MLA 9, Chicago author-date, IEEE, Nature, Vancouver render cleanly on the fixture set |
| P4 | Metadata adapters: DOI, arXiv, PDF, URL, BibTeX, Zotero | VCR-backed CI tests plus a live smoke script |
| P5 | vLLM and llama.cpp backends | All three local backends pass the same conformance suite |
| P6 | NLI verification + hand-curated AI-papers benchmark | Coverage report shows support-rate gains; see benchmarks/README.md |
| Polish | REQUIRED progression fix (ADR-009), real CLI, examples as living reports | ADR-009 integration test passes; citeformer CLI covers generate/verify/render; examples/ has runnable scripts with findings READMEs |
| Expansion | Marker-shape enum (ADR-011), OpenAI + Anthropic + Gemini + Mistral API backends, threshold calibration, multi-prompt + ALCE benchmarks, literature-review notebook, HF Space demo, GROBID PDF extractor | Seven backends pass a shared contract; 40-run multi-prompt sweep reports 0.0 ± 0.0 fabrication; PREPRINT.md describes the v0.1 design + evaluation |
| P7 (shipped) | v0.1.0 on PyPI + GitHub Release | pip install citeformer==0.1.0 works; docs built on RTD; CI green across Python 3.11–3.14 |
Next-up (v0.2 scope TBD): full-ALCE reproducibility (ASQA / QAMPARI / ELI5), per-chunk NLI during generation, streaming refinements on API backends, and a possible citeformer-ts sibling if ecosystem demand materialises.
v0.1 framed the API/local split as "schema-tier vs logit-tier", but as of late 2025 that's no longer the honest line: every modern provider's strict structured-outputs mode is real token-level constrained sampling inside their runtime, not post-hoc validation. The current honest distinction is where the masking runs — in your process, or inside the provider:
| Backend | Where the masking runs | Mechanism | Notes |
|---|---|---|---|
| HFBackend | In-process | XGrammar LogitsProcessor | The flagship: you own the runtime. |
| VLLMBackend | In-process | XGrammar / llguidance via GuidedDecodingParams | Linux/CUDA only. |
| LlamaCppBackend | In-process | Native GBNF (Llama(grammar=...)) | CPU + Metal + CUDA. |
| OpenAIBackend | Provider runtime | Strict JSON schema | Token-level constrained sampling on gpt-4o-2024-08-06+ and successors per OpenAI's Aug 2024 announcement. |
| AnthropicBackend | Provider runtime | Native Citations API + cache_control | Provider enforces that every cite references a supplied document. Prompt-caching on by default; repeat-source RAG bills cache-read prices on subsequent calls. |
| OpenRouterBackend | Provider runtime (per upstream) | Strict JSON via OpenAI wire format | Routes to Anthropic / OpenAI / Google / Mistral / Groq / Fireworks / Together / Cohere. provider.require_parameters: true (default) refuses to land on upstreams that don't honour strict mode, which preserves the guarantee end-to-end. |
| FireworksBackend | Provider runtime | Native GBNF (type: grammar) | The cleanest "logit-tier on a hosted API" backend: citeformer's cite-id GBNF rule is dropped in unchanged via Fireworks's grammar mode. Same constraint that masks logits inside HFBackend, just running on Fireworks's GPUs. |
| TogetherBackend | Provider runtime | Strict json_schema | Strict structured outputs on Together's open-weight upstreams (Llama / Qwen / DeepSeek / …). |
| GeminiBackend | Provider runtime | response_schema (OpenAPI subset) | Constrained generation on Gemini 1.5+ / 2.x. |
| MistralBackend | Provider runtime | response_format strict JSON | mistral-large-2411+. |
All ten backends in the table produce the same GenerationResult, so the orchestration, verify, and render layers are backend-agnostic. The choice between in-process and provider-runtime masking is mostly an operational question: do you want to host the model, or pay someone to do it? The structural guarantee (fabricated cite ids are token-impossible to emit) holds either way.
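To make "token-impossible" concrete, here is a minimal sketch of a cite-id rule in GBNF, plus a post-hoc check in the spirit of the P2 smoke test. It is illustrative only: citeformer's real grammar shape (§10.1) is not reproduced here, `cite_id_rule` and `fabricated_ids` are hypothetical names, and the sketch assumes numeric bracket markers such as [3].

```python
import re


def cite_id_rule(n_sources: int) -> str:
    """Build a GBNF rule that admits only the markers [1]..[n_sources]."""
    if n_sources < 1:
        raise ValueError("need at least one source")
    # Enumerating the ids as literals leaves the decoder no path to [n_sources + k].
    ids = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    return f'cite ::= "[" ({ids}) "]"'


# Assumed marker shape: an integer in square brackets.
MARKER = re.compile(r"\[(\d+)\]")


def fabricated_ids(text: str, n_sources: int) -> list[int]:
    """Return the cite ids found in `text` that fall outside 1..n_sources."""
    found = (int(match) for match in MARKER.findall(text))
    return [cite_id for cite_id in found if not 1 <= cite_id <= n_sources]


if __name__ == "__main__":
    print(cite_id_rule(3))
    print(fabricated_ids("Claim A [1] and claim B [4].", n_sources=3))
```

With the rule applied during decoding, `fabricated_ids` should always return an empty list; running it over generated text is a cheap regression check that the constraint is actually wired in.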
The bibliography pipeline is unchanged regardless: references are rendered deterministically by our home-grown formatters, never by the model.
On API backends, GenerationResult carries a usage: TokenUsage | None field with input_tokens, output_tokens, optional cache_creation_input_tokens / cache_read_input_tokens (Anthropic prompt-caching), and cost_credits. OpenRouter exposes a per-call cost in OR credits; 1 credit ≈ $1 USD by default, but the unit is credits, not dollars. Other providers leave cost_credits as None, and consumers price tokens themselves. Local backends leave usage = None, since token accounting is meaningless when you control the runtime.
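The shape of the usage record can be sketched as follows. This is an assumption-laden stand-in: the real TokenUsage is presumably a pydantic model, and a stdlib dataclass is used here only to keep the example dependency-free. The field names come from the paragraph above, while `price_tokens` and its per-million-token rates are hypothetical and ignore cache tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenUsage:
    """Stand-in for the usage record described above (field names from the text)."""
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: Optional[int] = None  # Anthropic prompt-caching
    cache_read_input_tokens: Optional[int] = None      # Anthropic prompt-caching
    cost_credits: Optional[float] = None               # OpenRouter credits, not dollars


def price_tokens(
    usage: Optional[TokenUsage],
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> Optional[float]:
    """Hypothetical consumer-side pricing from caller-supplied per-million-token rates.

    Returns None when there is no usage record, as on local backends.
    Cache tokens are deliberately left out of this sketch.
    """
    if usage is None:
        return None
    cost_in = usage.input_tokens / 1_000_000 * usd_per_mtok_in
    cost_out = usage.output_tokens / 1_000_000 * usd_per_mtok_out
    return cost_in + cost_out


if __name__ == "__main__":
    usage = TokenUsage(input_tokens=1_000_000, output_tokens=500_000)
    print(price_tokens(usage, usd_per_mtok_in=2.0, usd_per_mtok_out=10.0))
    print(price_tokens(None, usd_per_mtok_in=2.0, usd_per_mtok_out=10.0))
```

Keeping cost_credits separate from any dollar estimate avoids silently treating credits as USD; a consumer that wants dollars has to apply a rate explicitly.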