Skip to content

GenAI semconv needs vendor-neutral reasoning parts and continuity tokens #192

Description

@mikeldking

Area(s)

area:gen-ai

What's missing?

Reasoning-capable models now return more than assistant text and tool calls.

They can return ordered reasoning or thinking parts, plus opaque continuity values that the next request must carry forward. Those values are not ordinary metadata. They are part of the provider protocol for continuing a reasoning turn, especially when reasoning happens before a tool call.

Today, the GenAI semantic conventions mostly model assistant output as text plus tool calls. That leaves no vendor-neutral place to represent:

  • a reasoning/thinking part in the assistant message
  • the original order between reasoning, text, and tool calls
  • opaque continuity tokens or signatures attached to those parts
  • signatures attached directly to a tool call
  • reasoning-specific token accounting and request-side reasoning controls

The practical problem is replay.

A trace can show that a model called a tool, and it can show the tool result, but it may not contain enough information to reconstruct the next provider request. If a tool runner rebuilds the next turn from only the tool call and tool result, it can silently drop the reasoning part that preceded the tool call.

That makes the trace incomplete in a way that matters operationally. You cannot reliably replay the conversation from telemetry, compare provider behavior, debug a failed tool loop, or prove what was sent on turn N+1 from the trace of turn N.

This is not about standardizing or exposing private chain-of-thought text. It is about preserving the provider-visible message structure and the opaque continuity values required to continue the conversation, subject to normal redaction and privacy policy.

Cross-vendor pattern

The providers use different names and wire formats, but the shape is similar.

OpenAI Responses API

  • Assistant output can include an output[] item with type: "reasoning".
  • Reasoning continuity can be carried by encrypted_content and the reasoning item identity.
  • Reasoning token usage can be reported separately, for example in output_tokens_details.reasoning_tokens.
  • Request-side controls include fields such as reasoning.effort and reasoning summary settings.

Anthropic Messages API

  • Assistant content can include thinking and redacted_thinking blocks.
  • Continuity is represented through values such as signature for visible thinking, or redacted data for redacted thinking.
  • Thinking tokens are counted in output tokens, but are not always exposed as a separate usage bucket.
  • Request-side controls include thinking type, token budget, and display settings.

Google Gemini API

  • Response parts[] can include thinking parts, for example with thought: true.
  • Continuity is represented with thoughtSignature.
  • The signature can be attached to a data-bearing part, including a functionCall, so this cannot be modeled only as metadata on a reasoning text block.
  • Usage can include thoughtsTokenCount.
  • Request-side controls include thinking budget, thinking level, and whether thoughts are included.

The common failure mode is the same: a reasoning/thinking part immediately precedes a tool call, but the next request is rebuilt from only the tool call and tool result. Depending on the provider and model, dropping that reasoning continuity can cause a hard request error or a lower-quality continuation because the model has to reason again from an incomplete context.

Describe the solution you'd like

Add a cross-vendor abstraction for reasoning content and reasoning continuity in GenAI semantic conventions.

A useful model would cover:

  1. Reasoning as an ordered message part

    Add a message part type for reasoning/thinking content that can be interleaved with text, tool calls, and tool results. The original provider order should be recoverable from telemetry.

  2. Continuity tokens and signatures

    Provide a place to record the presence and metadata for opaque continuity values such as OpenAI encrypted_content, Anthropic signature / redacted data, and Gemini thoughtSignature.

    The conventions should distinguish between values that carry hidden state and values that authenticate visible reasoning text, even if both need to be preserved by the client.

  3. Tool-call-attached continuity

    Allow continuity values to attach to a tool call part, not only to a reasoning part. Gemini's thoughtSignature on a functionCall is the motivating example.

  4. Replay guidance

    Define what it means for a trace to be replayable enough to reconstruct the next provider request. At minimum, replay needs the original part order and any provider-required continuity fields, when recording those fields is allowed by policy.

    If payload capture is disabled, the trace should still be able to say that a continuity value was present, how large it was, and that full replay is intentionally not possible from the redacted trace.

  5. Reasoning token accounting

    Standardize a reasoning/thinking token breakdown where providers expose it, while documenting that some providers include thinking in output tokens without a separate bucket.

  6. Request-side reasoning config

    Record the client's requested reasoning configuration: effort, budget, level, display/include-thoughts options, or the closest provider-specific equivalent.

  7. Privacy and redaction guidance

    Continuity values can be large, opaque, and state-bearing. The default guidance should favor recording presence, length, type, and attachment point rather than raw bytes. If an instrumentation does capture payloads for replay, that should be opt-in and governed by the same masking rules as other sensitive GenAI content.

Why this matters

This gap shows up in ordinary tool-calling systems, not only in specialized research setups.

A common production pattern is:

  1. call the model
  2. receive assistant reasoning plus a tool call
  3. execute the tool
  4. send the tool result back to the model

For reasoning models, step 4 may need more than the tool result. It may also need the reasoning or continuity part from step 2. If telemetry drops that part, the trace no longer represents what the application needed to send.

That hurts several practical workflows:

  • debugging rejected follow-up requests
  • replaying incidents from traces
  • comparing behavior across providers
  • migrating tool loops from one provider API to another
  • verifying that instrumentation preserved the full assistant turn
  • evaluating quality regressions when reasoning continuity was accidentally dropped

Related work

This overlaps with, but is broader than:

OpenInference has also prototyped this across OpenAI, Anthropic, and Gemini: capture the assistant turn into span attributes, reconstruct the next turn from those attributes, and include negative checks that remove the continuity value to prove that it is load-bearing.

I would be happy to contribute the vendor comparison and prototype findings to a GenAI SIG discussion or a spec PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:messagesShape/content of input and output messages, parts, instructions, modalities, citationsarea:token-countsToken usage: input/output counts, breakdowns (cached, reasoning), per-modality, totals

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions