
fix: stream Qwen3 tool call string arguments#46351

Merged
chaunceyjiang merged 8 commits into
vllm-project:mainfrom
Palaiologos1453:fix-tool-call-argument-streaming-43267
Jun 23, 2026

Conversation

@Palaiologos1453

Contributor

Fixes #43267.

Summary

  • Allow the parser engine to stream prefix-stable trailing string argument values instead of buffering them until tool-call end.
  • Keep the optimization schema-aware so fields that may be coerced to bool/number/null/object/array still wait until their serialized form is stable.
  • Teach the Qwen3 parser to treat </parameter> as a lexer terminal while preserving it in the arg stream, preventing partial closing-tag text from leaking into streamed arguments.
  • Align Qwen3 partial argument conversion whitespace handling with completed parameters so streamed prefixes remain monotonic.
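
The schema gating in the second bullet can be pictured with a minimal sketch. The function below is a hypothetical stand-in, not the code in `parser_engine.py`: it assumes the tool schema is a plain JSON Schema `properties` mapping and treats a field as streamable only when its `type` is exactly `"string"`.

```python
from typing import Any


def streamable_string_keys(properties: dict[str, Any]) -> set[str]:
    """Return parameter names whose schema type is exactly "string".

    Hypothetical helper: fields that may be coerced (number, bool, null,
    object, array, unions, or untyped fields) are left out, so their
    values would only be emitted once the serialized form is final.
    """
    keys: set[str] = set()
    for name, schema in properties.items():
        if isinstance(schema, dict) and schema.get("type") == "string":
            keys.add(name)
    return keys


# Illustrative schema, invented for this sketch.
properties = {
    "query": {"type": "string"},
    "limit": {"type": "integer"},
    "tags": {"type": "array", "items": {"type": "string"}},
    "note": {"type": ["string", "null"]},
}
safe_keys = streamable_string_keys(properties)
```

Only `query` qualifies here; `note` is excluded because a union type could still serialize as `null`.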

Tests

  • tests/parser/engine/test_qwen3.py with local stubs for vllm.third_party.pynvml and uvloop
  • tests/parser/engine/test_parser_engine.py with local stubs for vllm.third_party.pynvml and uvloop
  • python -m py_compile vllm/parser/engine/parser_engine.py vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py tests/parser/engine/test_parser_engine.py
  • git diff --check

Note: on this Windows checkout, plain pytest imports try to load the unbuilt CUDA extension (vllm._C_stable_libtorch), so the parser tests above were run through an in-process pynvml/uvloop stub to bypass platform detection.

@mergify mergify Bot added qwen Related to Qwen models tool-calling labels Jun 22, 2026
@Palaiologos1453 Palaiologos1453 force-pushed the fix-tool-call-argument-streaming-43267 branch 2 times, most recently from 274f002 to 8e6ad8d Compare June 22, 2026 08:30
@abinggo

abinggo commented Jun 22, 2026

Contributor

Nice approach — promoting </parameter> to a lexer terminal so the existing prefix-buffering handles the partial closing-tag case is cleaner than the converter-side stable-prefix hook I'd sketched. Skimmed the diff: the {"string"}-only schema gating and the partial-value .strip() both look right to me, and the regression tests cover the two prefix-stability traps directly.

Since the PR builds on the root cause and the two prefix-stability issues worked out above, would you be open to adding a co-author trailer for the analysis? Something like:

Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>

No worries either way — glad it's getting fixed properly. Thanks for picking it up and pushing it over the line.

@Palaiologos1453 Palaiologos1453 force-pushed the fix-tool-call-argument-streaming-43267 branch from 8e6ad8d to 637a483 Compare June 22, 2026 09:42
@Palaiologos1453

Contributor Author

Thanks, added the co-author trailer in 637a483:

Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>

No code diff changed in that amend.

@abinggo

abinggo commented Jun 22, 2026

Contributor

Appreciate it, thanks! 🙏

Comment thread vllm/parser/qwen3.py
TOOL_CALL_END = "</tool_call>"
FUNC_PREFIX = "<function="
FUNC_END = "</function>"
PARAM_END = "</parameter>"

Collaborator

Add <parameter= as a terminal so that the lexer buffers the <param prefix.

Signed-off-by: Rui Yin <2260891073@qq.com>

Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>
@Palaiologos1453 Palaiologos1453 force-pushed the fix-tool-call-argument-streaming-43267 branch from 637a483 to 33792a2 Compare June 22, 2026 10:01
@Palaiologos1453

Palaiologos1453 commented Jun 22, 2026

Contributor Author

Addressed in 33792a2.

I added <parameter= as a Qwen3 lexer terminal so a split opening tag like <param is buffered instead of being emitted as part of the previous parameter value. I also added test_streaming_split_next_parameter_tag_is_buffered to cover the regression: after the partial <param chunk, the streamed args still only contain the stable query prefix, and the final args parse as both query and limit.
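
For readers following along, the buffering idea can be sketched as below. This is an illustration under assumptions, not vLLM's lexer: `TERMINALS` and `split_safe` are hypothetical names, and the sketch only handles the trailing partial tag (a real lexer would also consume complete terminals).

```python
TERMINALS = ("<parameter=", "</parameter>", "</function>", "</tool_call>")


def split_safe(
    buffer: str, terminals: tuple[str, ...] = TERMINALS
) -> tuple[str, str]:
    """Split buffer into (emit_now, hold_back).

    hold_back is the longest suffix of buffer that is a proper prefix of
    some terminal, i.e. text that might still grow into a tag.
    """
    longest = max(len(t) for t in terminals)
    for size in range(min(len(buffer), longest - 1), 0, -1):
        tail = buffer[-size:]
        if any(len(tail) < len(t) and t.startswith(tail) for t in terminals):
            return buffer[:-size], tail
    return buffer, ""


emit_now, hold_back = split_safe("vllm docs\n<param")
```

With the opening tag registered, the trailing `<param` is held back rather than emitted as part of the previous value; without it in `TERMINALS`, the same call would emit the whole buffer.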

Local checks:

  • python -m ruff check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • python -m ruff format --check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • python -m py_compile vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • tests/parser/engine/test_qwen3.py: 51 passed with local Windows native-extension stubs.

Comment thread vllm/parser/qwen3.py
(ParserState.TOOL_ARGS, "PARAM_END"): Transition(
ParserState.TOOL_ARGS,
(EventType.ARG_VALUE_CHUNK,),
),

Collaborator

Suggested change
- ),
+ ),
+ (ParserState.TOOL_ARGS, "PARAM_START"): Transition(
+ ParserState.TOOL_ARGS,
+ (EventType.ARG_VALUE_CHUNK,),
+ ),

@Palaiologos1453

Contributor Author

Addressed in 93d7de1. I renamed the terminal to PARAM_START and added the explicit (ParserState.TOOL_ARGS, PARAM_START) -> ARG_VALUE_CHUNK transition so the opening parameter tag is still forwarded into the arg stream after lexer buffering.
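
A rough sketch of what such a transition does, using stand-in types rather than the real engine classes: `ParserState`, `EventType`, and `Transition` below are simplified re-declarations for illustration, and `step` is a hypothetical driver.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ParserState(Enum):
    TOOL_ARGS = auto()


class EventType(Enum):
    ARG_VALUE_CHUNK = auto()


@dataclass(frozen=True)
class Transition:
    next_state: ParserState
    events: tuple[EventType, ...]


# Both tags keep the parser in TOOL_ARGS and are forwarded as raw arg text.
TRANSITIONS = {
    (ParserState.TOOL_ARGS, "PARAM_START"): Transition(
        ParserState.TOOL_ARGS, (EventType.ARG_VALUE_CHUNK,)
    ),
    (ParserState.TOOL_ARGS, "PARAM_END"): Transition(
        ParserState.TOOL_ARGS, (EventType.ARG_VALUE_CHUNK,)
    ),
}


def step(state: ParserState, terminal: str, text: str):
    """Look up the transition and emit its events with the matched text."""
    transition = TRANSITIONS[(state, terminal)]
    return transition.next_state, [(event, text) for event in transition.events]


next_state, events = step(ParserState.TOOL_ARGS, "PARAM_START", "<parameter=")
```

The point of the explicit entry is that the tag text is buffered by the lexer until complete, then still reaches the arg stream so the converter can see where the next parameter begins.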

Local checks:

  • tests/parser/engine/test_qwen3.py: 51 passed with local Windows native-extension stubs
  • python -m ruff check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • python -m ruff format --check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • python -m py_compile vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py

@Palaiologos1453

Contributor Author

@chaunceyjiang I believe the latest review comments are addressed in 93d7de1ca.

The current diff includes:

  • <parameter= registered as the Qwen3 PARAM_START terminal, so the lexer buffers split prefixes like <param.
  • an explicit (ParserState.TOOL_ARGS, PARAM_START) -> ARG_VALUE_CHUNK transition, so the opening tag is still forwarded into the raw arg stream.
  • test_streaming_split_next_parameter_tag_is_buffered covering the split opening tag regression.

Local checks already run:

  • tests/parser/engine/test_qwen3.py: 51 passed with local Windows native-extension stubs
  • python -m ruff check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • python -m ruff format --check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
  • python -m py_compile vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py

Could you please re-review when you have a chance?

@chaunceyjiang chaunceyjiang added the verified Run pre-commit for new contributors without triggering other tests label Jun 22, 2026

@chaunceyjiang chaunceyjiang left a comment

Collaborator

LGTM cc @bbrowning

@bbrowning bbrowning left a comment

Collaborator

I want to give this a quick test on a live server, but the logic looks sound on the surface for being able to stream back these large string deltas.

One comment for future work (smarter whitespace stripping), and one efficiency comment (recomputing safe string keys on every delta) that may be worth tackling now if it's quick, but neither impacts correctness.

Comment thread vllm/parser/qwen3.py
  value = m.group(2)
  if name:
- params[name] = value
+ params[name] = value.strip()

Collaborator

This is ok for now to align the partial and complete paths.

But, we should separately track and fix this to be less aggressive as this will strip things like indentation from edit parameter values. I say separately because it's orthogonal to the scope of this PR, which is streaming large string deltas back.
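
A small illustration of the concern, with an invented parameter value. The newline-only trimming shown second is just one possible less aggressive option, assumed here for contrast; it is not something proposed in this PR.

```python
# Hypothetical raw value of an "edit" parameter as it appears between
# <parameter=...> and </parameter>, wrapped in newlines by the model.
raw_value = "\n    return x + 1\n"

# Current behavior after this PR: all surrounding whitespace is removed,
# including the four-space indentation.
stripped = raw_value.strip()

# One assumed alternative: drop only the single wrapping newline on each
# side and keep the indentation (str.removeprefix needs Python 3.9+).
newline_only = raw_value.removeprefix("\n").removesuffix("\n")
```

Here `stripped` loses the leading indentation while `newline_only` keeps it, which matters for tools that write the value back into source files.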

Comment thread vllm/parser/engine/parser_engine.py Outdated

string_keys: set[str] | None = None
if slot.name:
string_keys = self._streamable_string_keys(

Collaborator

I wonder whether we should set this in _emit_name_delta when we set up the ToolSlot, so we only do this once. It could have a string_keys property that we set, so the safe string keys get computed in one place and we just pass slot.string_keys into the _safe_arg_prefix below instead of recomputing this list of safe string keys on every iteration.

@Palaiologos1453

Contributor Author

@bbrowning thanks for the review. I addressed the efficiency comment in c47e34631 by caching the streamable string key set on ToolCallSlot when the tool name is emitted, then reusing slot.string_keys in _safe_arg_prefix() instead of recomputing it for every arg delta. I also added test_streamable_string_keys_cached_after_name_delta to cover that behavior.
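
The shape of that caching change, as a simplified sketch. The names `ToolCallSlot`, `string_keys`, and `_streamable_string_keys` come from this thread, but the class bodies, the `Engine` wrapper, `can_stream`, and the `schema_scans` counter are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolCallSlot:
    name: str | None = None
    name_sent: bool = False
    string_keys: set[str] | None = None  # computed once, when the name is known


class Engine:
    def __init__(self, tools: dict[str, dict]):
        self._tools = tools  # tool name -> JSON Schema "properties" (assumed shape)
        self.schema_scans = 0  # counter added only to make the caching visible

    def _streamable_string_keys(self, properties: dict) -> set[str]:
        self.schema_scans += 1
        return {k for k, v in properties.items() if v.get("type") == "string"}

    def emit_name_delta(self, slot: ToolCallSlot, name: str) -> None:
        slot.name = name
        slot.name_sent = True
        slot.string_keys = self._streamable_string_keys(self._tools.get(name, {}))

    def can_stream(self, slot: ToolCallSlot, key: str) -> bool:
        # Called per arg delta; reuses the cached set instead of rescanning.
        return slot.string_keys is not None and key in slot.string_keys


engine = Engine({"search": {"query": {"type": "string"}, "limit": {"type": "integer"}}})
slot = ToolCallSlot()
engine.emit_name_delta(slot, "search")
results = [engine.can_stream(slot, "query") for _ in range(3)]
```

After three arg deltas the schema has still been scanned only once, at name emission.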

I agree the whitespace stripping should be tracked separately since it is orthogonal to this PR's large-string streaming path.

Local checks:

  • tests/parser/engine/test_parser_engine.py::TestArgDeltaWithConverter + tests/parser/engine/test_qwen3.py: 54 passed with local Windows native-extension stubs
  • python -m ruff check vllm/parser/engine/parser_engine.py tests/parser/engine/test_parser_engine.py tests/parser/engine/test_qwen3.py
  • python -m ruff format --check vllm/parser/engine/parser_engine.py tests/parser/engine/test_parser_engine.py tests/parser/engine/test_qwen3.py
  • python -m py_compile vllm/parser/engine/parser_engine.py tests/parser/engine/test_parser_engine.py tests/parser/engine/test_qwen3.py
  • git diff --check

@bbrowning

Collaborator

Note that this actually turns streaming of string arguments on for any tool parser using the new parser engine that sets stream_arg_deltas=True, so I'm also running a spot-check against a live Gemma 4 model and a GLM 4.7.

@bbrowning bbrowning left a comment

Collaborator

I ran live tests with this against Qwen/Qwen3.6-27B, google/gemma-4-31B-it, and zai-org/GLM-4.7-Flash to cover a few popular model families that use the new parsers. Before this fix, all of those would send large string arg values back as a single delta. After the fix, all of them stream those back incrementally.

I used a script at https://gist.github.com/bbrowning/465ff39855189bbd6d7b68b7eec0e377 to help me test this against live servers.

Before this change, I got 7 deltas back from the Qwen 3.6 test. After this change, 244 deltas. If I enable spec decoding or stream_interval > 1, I get fewer than 244 deltas (as expected), but still more than the 7.

The before/after with Gemma4 is not quite as dramatic - 7 deltas before this change, 20 after. This will vary for each model, and Gemma4 has some additional logic that holds back arguments in partial parsing for correctness' sake. But, still a nice improvement there.

For GLM 4.7 Flash, 7 deltas before and 191 deltas afterwards.

In all cases, the streamed deltas were able to accumulate and parse into valid JSON with the appropriate coerced types and no visible leakage of partial tags into the content.
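
The accumulate-and-parse check amounts to something like the sketch below. The deltas are invented for illustration and are not output from the live runs; the actual test script is the gist linked above.

```python
import json


def accumulate(deltas: list[str]) -> dict:
    """Join streamed argument fragments and parse the result as JSON."""
    return json.loads("".join(deltas))


# Hypothetical argument deltas for one tool call: the string value arrives
# in pieces, the integer arrives only once its serialized form is final.
deltas = ['{"query": "', "stream", "ing tool", ' calls", ', '"limit": 5}']
args = accumulate(deltas)
```

A client that concatenates every fragment in order should end with valid JSON whose non-string fields keep their coerced types, here an integer `limit`.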

@bbrowning

Collaborator

@chaunceyjiang This looks good to me in manual testing. MiniMax M2 will need some adjustments to start streaming partial tool call args, because it doesn't look at the partial flag, doesn't have a PARTIAL_PARAM_RE, and would need things like the </parameter> registration so the lexer holds partial variants of that back.

But, it looks safe as-is with that model because its regex only matches on complete parameter values today.

@chaunceyjiang

Collaborator

> @chaunceyjiang This looks good to me in manual testing. MiniMax M2 will need some adjustments to start streaming partial tool call args, because it doesn't look at the partial flag, doesn't have a PARTIAL_PARAM_RE, and would need things like the </parameter> registration so the lexer holds partial variants of that back.

Yes, I also noticed that when testing this PR locally. I'll submit a PR for MiniMax after this one is merged.

@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 22, 2026
slot.name = name
slot.name_sent = True
slot.string_keys = self._streamable_string_keys(
find_tool_properties(self._tools, name)

Collaborator

One non-blocking comment: find_tool_properties should be cached here.

@Palaiologos1453

Contributor Author

Current Buildkite failure looks unrelated to this PR's tool-call parser changes. The failed Buildkite build is #73461, job 019eeffe-0bd5-4861-84f6-769b95e4fdce:

  • Step: AMD: Entrypoints Integration (API Server OpenAI - Part 2) (mi325_1)
  • Command: pytest -v -s entrypoints/openai/completion --ignore=entrypoints/openai/completion/test_tensorizer_entrypoint.py
  • Failing test: entrypoints/openai/completion/test_shutdown.py::test_shutdown_on_engine_failure
  • Failure: Server failed to start in 120 seconds
  • Summary: 1 failed, 104 passed, 21 warnings

The log also shows repeated HuggingFace HTTP Error 429 / rate-limit waits during this job, including around the shutdown tests. This PR only touches parser engine / Qwen3 parser files and parser tests:

  • vllm/parser/engine/parser_engine.py
  • vllm/parser/qwen3.py
  • tests/parser/engine/test_parser_engine.py
  • tests/parser/engine/test_qwen3.py

Could someone with Buildkite permissions retry the failed AMD job?

@chaunceyjiang chaunceyjiang merged commit 8db1216 into vllm-project:main Jun 23, 2026
52 checks passed

Labels

qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed tool-calling verified Run pre-commit for new contributors without triggering other tests

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: Support streaming output for tool_calls arguments

4 participants