fix: stream Qwen3 tool call string arguments by Palaiologos1453 · Pull Request #46351 · vllm-project/vllm

Palaiologos1453 · 2026-06-22T08:27:18Z

Summary

Allow the parser engine to stream prefix-stable trailing string argument values instead of buffering them until tool-call end.
Keep the optimization schema-aware so fields that may be coerced to bool/number/null/object/array still wait until their serialized form is stable.
Teach the Qwen3 parser to treat </parameter> as a lexer terminal while preserving it in the arg stream, preventing partial closing-tag text from leaking into streamed arguments.
Align Qwen3 partial argument conversion whitespace handling with completed parameters so streamed prefixes remain monotonic.
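
The schema-aware gating described above can be sketched roughly as follows. This is a simplified illustration, not the code in `parser_engine.py`: the helper names `streamable_string_keys` and `safe_arg_prefix`, their signatures, and the JSON-Schema-style `properties` mapping are all assumptions made for the example.

```python
import json
from typing import Optional


def streamable_string_keys(properties: dict) -> set:
    """Hypothetical helper: keys whose declared type is exactly "string"."""
    keys = set()
    for name, spec in properties.items():
        declared = spec.get("type")
        types = {declared} if isinstance(declared, str) else set(declared or ())
        # Fields that may also be null/number/etc. can still change their
        # serialized form, so only pure string fields qualify.
        if types == {"string"}:
            keys.add(name)
    return keys


def safe_arg_prefix(
    complete: dict,
    partial_key: Optional[str],
    partial_value: str,
    string_keys: set,
) -> str:
    """Serialize finished params plus, when safe, the growing trailing string."""
    prefix = json.dumps(complete, ensure_ascii=False)[:-1]  # drop closing "}"
    if partial_key is None or partial_key not in string_keys:
        return prefix  # hold the trailing value back until it is complete
    sep = ", " if complete else ""
    # Drop the closing quote so later chunks can extend the same string.
    value = json.dumps(partial_value, ensure_ascii=False)[:-1]
    return f"{prefix}{sep}{json.dumps(partial_key)}: {value}"
```

Because JSON string escaping works character by character, the output for a longer partial value always extends the output for a shorter one, which is the prefix-stability property the summary relies on.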

Tests

tests/parser/engine/test_qwen3.py with local stubs for vllm.third_party.pynvml and uvloop
tests/parser/engine/test_parser_engine.py with local stubs for vllm.third_party.pynvml and uvloop
python -m py_compile vllm/parser/engine/parser_engine.py vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py tests/parser/engine/test_parser_engine.py
git diff --check

Note: on this Windows checkout, plain pytest imports try to load the unbuilt CUDA extension (vllm._C_stable_libtorch), so the parser tests above were run through an in-process pynvml/uvloop stub to bypass platform detection.

abinggo · 2026-06-22T09:29:51Z

Nice approach — promoting </parameter> to a lexer terminal so the existing prefix-buffering handles the partial closing-tag case is cleaner than the converter-side stable-prefix hook I'd sketched. Skimmed the diff: the {"string"}-only schema gating and the partial-value .strip() both look right to me, and the regression tests cover the two prefix-stability traps directly.

Since the PR builds on the root cause and the two prefix-stability issues worked out above, would you be open to adding a co-author trailer for the analysis? Something like:

Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>

No worries either way — glad it's getting fixed properly. Thanks for picking it up and pushing it over the line.

Palaiologos1453 · 2026-06-22T09:42:46Z

Thanks, added the co-author trailer in 637a483:

Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>

No code diff changed in that amend.

abinggo · 2026-06-22T09:47:59Z

Appreciate it, thanks! 🙏

chaunceyjiang · 2026-06-22T09:51:45Z

 TOOL_CALL_END = "</tool_call>"
 FUNC_PREFIX = "<function="
 FUNC_END = "</function>"
+PARAM_END = "</parameter>"


Add <parameter= as a terminal so that the lexer buffers the <param prefix.

Signed-off-by: Rui Yin <2260891073@qq.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>

Palaiologos1453 · 2026-06-22T10:02:08Z

Addressed in 33792a2.

I added <parameter= as a Qwen3 lexer terminal so a split opening tag like <param is buffered instead of being emitted as part of the previous parameter value. I also added test_streaming_split_next_parameter_tag_is_buffered to cover the regression: after the partial <param chunk, the streamed args still only contain the stable query prefix, and the final args parse as both query and limit.
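
For readers unfamiliar with the buffering behaviour, here is a minimal sketch of how a lexer can hold back a chunk tail that might be the start of a terminal. The function name `split_stable` and the `TERMINALS` tuple are illustrative assumptions, not the engine's actual API, and a real lexer would also tokenize complete terminals.

```python
# Terminals taken from the Qwen3 tags discussed in this PR; the tuple itself
# is an assumption made for this sketch.
TERMINALS = ("</parameter>", "<parameter=", "</function>", "</tool_call>")


def split_stable(buffer: str, terminals=TERMINALS) -> tuple:
    """Return (emit, held).

    `held` is the longest suffix of `buffer` that is a proper prefix of some
    terminal, so it cannot yet be told apart from ordinary value text.
    """
    max_len = min(len(buffer), max(len(t) for t in terminals) - 1)
    for size in range(max_len, 0, -1):
        tail = buffer[-size:]
        if any(t.startswith(tail) for t in terminals):
            return buffer[:-size], tail
    return buffer, ""


emit, held = split_stable("weather in Paris<param")
# emit == "weather in Paris", held == "<param"
```

Only `emit` is forwarded as argument text; `held` stays in the buffer until the next chunk either completes the terminal or shows that it was plain text.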

Local checks:

python -m ruff check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
python -m ruff format --check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
python -m py_compile vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
tests/parser/engine/test_qwen3.py: 51 passed with local Windows native-extension stubs.

chaunceyjiang · 2026-06-22T10:04:30Z

+            (ParserState.TOOL_ARGS, "PARAM_END"): Transition(
+                ParserState.TOOL_ARGS,
+                (EventType.ARG_VALUE_CHUNK,),
+            ),


Suggested change

-            ),
+            ),
+            (ParserState.TOOL_ARGS, "PARAM_START"): Transition(
+                ParserState.TOOL_ARGS,
+                (EventType.ARG_VALUE_CHUNK,),
+            ),

Palaiologos1453 · 2026-06-22T11:59:21Z

Addressed in 93d7de1. I renamed the terminal to PARAM_START and added the explicit (ParserState.TOOL_ARGS, PARAM_START) -> ARG_VALUE_CHUNK transition so the opening parameter tag is still forwarded into the arg stream after lexer buffering.
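
A stripped-down sketch of what such a transition table looks like. The names `ParserState`, `EventType`, and `Transition` come from the diff above; the field names `next_state` and `events`, the `step` helper, and the single-state enum are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ParserState(Enum):
    TOOL_ARGS = auto()


class EventType(Enum):
    ARG_VALUE_CHUNK = auto()


@dataclass(frozen=True)
class Transition:
    next_state: ParserState  # assumed field name
    events: tuple            # assumed field name


TRANSITIONS = {
    # Both tags stay in TOOL_ARGS and are forwarded into the raw arg stream.
    (ParserState.TOOL_ARGS, "PARAM_START"): Transition(
        ParserState.TOOL_ARGS,
        (EventType.ARG_VALUE_CHUNK,),
    ),
    (ParserState.TOOL_ARGS, "PARAM_END"): Transition(
        ParserState.TOOL_ARGS,
        (EventType.ARG_VALUE_CHUNK,),
    ),
}


def step(state: ParserState, terminal: str, text: str):
    """Hypothetical driver: look up the transition and emit its events."""
    transition = TRANSITIONS[(state, terminal)]
    return transition.next_state, [(event, text) for event in transition.events]
```

Registering the terminal makes the lexer buffer split prefixes; the explicit transition makes sure the buffered tag text is not dropped once it is recognised.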

Local checks:

tests/parser/engine/test_qwen3.py: 51 passed with local Windows native-extension stubs
python -m ruff check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
python -m ruff format --check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
python -m py_compile vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py

Palaiologos1453 · 2026-06-22T12:15:05Z

@chaunceyjiang I believe the latest review comments are addressed in 93d7de1ca.

The current diff includes:

<parameter= registered as the Qwen3 PARAM_START terminal, so the lexer buffers split prefixes like <param.
an explicit (ParserState.TOOL_ARGS, PARAM_START) -> ARG_VALUE_CHUNK transition, so the opening tag is still forwarded into the raw arg stream.
test_streaming_split_next_parameter_tag_is_buffered covering the split opening tag regression.

Local checks already run:

tests/parser/engine/test_qwen3.py: 51 passed with local Windows native-extension stubs
python -m ruff check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
python -m ruff format --check vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py
python -m py_compile vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py

Could you please re-review when you have a chance?

chaunceyjiang

LGTM cc @bbrowning

bbrowning

I want to give this a quick test on a live server, but the logic looks sound on the surface for being able to stream back these large string deltas.

One comment for future work (smarter whitespace stripping), and one efficiency comment (recomputing safe string keys on every delta) that may be worth tackling now if it's quick, but neither impacts correctness.

bbrowning · 2026-06-22T12:30:37Z

            value = m.group(2)
            if name:
-                params[name] = value
+                params[name] = value.strip()


This is ok for now to align the partial and complete paths.

But, we should separately track and fix this to be less aggressive as this will strip things like indentation from edit parameter values. I say separately because it's orthogonal to the scope of this PR, which is streaming large string deltas back.
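
To make the concern concrete, here is a small illustration of what `.strip()` does to an edit-style value. The regex is an assumed approximation of the Qwen3 parameter format, and `trim_framing_newlines` is only one hypothetical gentler option, not a fix proposed in this PR.

```python
import re

# Assumed approximation of the Qwen3 parameter syntax, for illustration only.
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_params(text: str, strip: bool = True) -> dict:
    """Collect parameter values, optionally applying the aggressive strip."""
    params = {}
    for m in PARAM_RE.finditer(text):
        name, value = m.group(1).strip(), m.group(2)
        if name:
            params[name] = value.strip() if strip else value
    return params


def trim_framing_newlines(value: str) -> str:
    """Hypothetical alternative: drop only the newline framing the value."""
    if value.startswith("\n"):
        value = value[1:]
    if value.endswith("\n"):
        value = value[:-1]
    return value


text = "<parameter=new_text>\n    return x\n</parameter>"
stripped = parse_params(text)["new_text"]             # indentation is lost
raw = parse_params(text, strip=False)["new_text"]
gentler = trim_framing_newlines(raw)                  # indentation is kept
```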

bbrowning · 2026-06-22T12:39:07Z


+        string_keys: set[str] | None = None
        if slot.name:
+            string_keys = self._streamable_string_keys(


I wonder whether we should set this in _emit_name_delta when we set up the ToolSlot, so we only do this once. The slot could have a string_keys property that we set there, so the safe string keys get computed in one place and we just pass slot.string_keys into _safe_arg_prefix below instead of recomputing the list of safe string keys on every iteration.

Palaiologos1453 · 2026-06-22T13:19:37Z

@bbrowning thanks for the review. I addressed the efficiency comment in c47e34631 by caching the streamable string key set on ToolCallSlot when the tool name is emitted, then reusing slot.string_keys in _safe_arg_prefix() instead of recomputing it for every arg delta. I also added test_streamable_string_keys_cached_after_name_delta to cover that behavior.
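
The caching pattern is roughly the following. This is a sketch under assumptions: the `Engine` class, its method names without underscores, the `schema_scans` counter, and the plain `name -> properties` mapping are invented for the example; only `ToolCallSlot`, `string_keys`, `name`, and `name_sent` mirror names that appear in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCallSlot:
    name: Optional[str] = None
    name_sent: bool = False
    string_keys: Optional[set] = None  # filled once, when the name is emitted


class Engine:
    """Hypothetical stand-in for the parser engine."""

    def __init__(self, tool_properties: dict):
        self._tool_properties = tool_properties  # assumed: name -> properties
        self.schema_scans = 0  # counter used only to demonstrate the caching

    def _streamable_string_keys(self, properties: dict) -> set:
        self.schema_scans += 1
        return {k for k, spec in properties.items() if spec.get("type") == "string"}

    def emit_name_delta(self, slot: ToolCallSlot, name: str) -> None:
        slot.name = name
        slot.name_sent = True
        # Compute the safe keys once per tool call.
        slot.string_keys = self._streamable_string_keys(
            self._tool_properties.get(name, {})
        )

    def emit_arg_delta(self, slot: ToolCallSlot, key: str, chunk: str) -> str:
        # Reuse the cached set instead of rescanning the schema per delta.
        return chunk if key in (slot.string_keys or ()) else ""
```

With this shape, the schema scan count stays at one per tool call regardless of how many argument deltas follow.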

I agree the whitespace stripping should be tracked separately since it is orthogonal to this PR's large-string streaming path.

Local checks:

tests/parser/engine/test_parser_engine.py::TestArgDeltaWithConverter + tests/parser/engine/test_qwen3.py: 54 passed with local Windows native-extension stubs
python -m ruff check vllm/parser/engine/parser_engine.py tests/parser/engine/test_parser_engine.py tests/parser/engine/test_qwen3.py
python -m ruff format --check vllm/parser/engine/parser_engine.py tests/parser/engine/test_parser_engine.py tests/parser/engine/test_qwen3.py
python -m py_compile vllm/parser/engine/parser_engine.py tests/parser/engine/test_parser_engine.py tests/parser/engine/test_qwen3.py
git diff --check

bbrowning · 2026-06-22T14:34:38Z

Note that this actually turns streaming of string arguments on for any tool parser using the new parser engine that sets stream_arg_deltas=True, so I'm also running a spot-check against a live Gemma 4 model and a GLM 4.7.

bbrowning

I ran live tests with this against Qwen/Qwen3.6-27B, google/gemma-4-31B-it, and zai-org/GLM-4.7-Flash to cover a few popular model families that use the new parsers. Before this fix, all of those would send large string arg values back as a single delta. After the fix, all of them stream those back incrementally.

I used a script at https://gist.github.com/bbrowning/465ff39855189bbd6d7b68b7eec0e377 to help me test this against live servers.

Before this change, I got 7 deltas back from the Qwen 3.6 test. After this change, 244 deltas. If I enable spec decoding or stream_interval > 1, I get fewer than 244 deltas (as expected), but still more than the 7.

The before/after with Gemma4 is not quite as dramatic: 7 deltas before this change, 20 after. This will vary for each model, and Gemma4 has some additional logic that holds back arguments in partial parsing for correctness' sake. Still, it is a nice improvement there.

For GLM 4.7 Flash, 7 deltas before and 191 deltas afterwards.

In all cases, the streamed deltas were able to accumulate and parse into valid JSON with the appropriate coerced types and no visible leakage of partial tags into the content.
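
A check of this kind can be sketched as below. This is not the linked gist; the function name, the sample deltas, and the crude substring test for leaked tags are assumptions made for illustration.

```python
import json


def accumulate_arg_deltas(deltas: list) -> dict:
    """Concatenate streamed argument deltas and parse the result as JSON."""
    joined = "".join(deltas)
    args = json.loads(joined)  # raises ValueError if the stream is not valid JSON
    for value in args.values():
        # Crude heuristic: a partial closing tag should never reach a value.
        if isinstance(value, str) and "</param" in value:
            raise ValueError("partial tag leaked into argument value")
    return args


# Made-up deltas resembling an incrementally streamed string argument.
deltas = [
    '{"path": "a.py", "content": "',
    "def f():",
    "\\n    return 1",
    '", "overwrite": true}',
]
args = accumulate_arg_deltas(deltas)
```

The number of deltas is what changes with this PR; the accumulated result should parse to the same arguments before and after.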

bbrowning · 2026-06-22T15:22:44Z

@chaunceyjiang This looks good to me in manual testing. MiniMax M2 will need some adjustments to start streaming partial tool call args, because it doesn't look at the partial flag, doesn't have a PARTIAL_PARAM_RE, and would need things like the </parameter> registration so the lexer holds partial variants of that back.

But, it looks safe as-is with that model because its regex only matches on complete parameter values today.

chaunceyjiang · 2026-06-22T15:30:02Z

> @chaunceyjiang This looks good to me in manual testing. MiniMax M2 will need some adjustments to start streaming partial tool call args, because it doesn't look at the partial flag, doesn't have a PARTIAL_PARAM_RE, and would need things like the </parameter> registration so the lexer holds partial variants of that back.

Yes, I also noticed that when testing this PR locally. I'll submit a PR for MiniMax after this one is merged.

chaunceyjiang · 2026-06-22T15:41:41Z

        slot.name = name
        slot.name_sent = True
+        slot.string_keys = self._streamable_string_keys(
+            find_tool_properties(self._tools, name)


One non-blocking comment: find_tool_properties should be cached here.

Palaiologos1453 · 2026-06-22T17:06:52Z

Current Buildkite failure looks unrelated to this PR's tool-call parser changes. The failed Buildkite build is #73461, job 019eeffe-0bd5-4861-84f6-769b95e4fdce:

Step: AMD: Entrypoints Integration (API Server OpenAI - Part 2) (mi325_1)
Command: pytest -v -s entrypoints/openai/completion --ignore=entrypoints/openai/completion/test_tensorizer_entrypoint.py
Failing test: entrypoints/openai/completion/test_shutdown.py::test_shutdown_on_engine_failure
Failure: Server failed to start in 120 seconds
Summary: 1 failed, 104 passed, 21 warnings

The log also shows repeated HuggingFace HTTP Error 429 / rate-limit waits during this job, including around the shutdown tests. This PR only touches parser engine / Qwen3 parser files and parser tests:

vllm/parser/engine/parser_engine.py
vllm/parser/qwen3.py
tests/parser/engine/test_parser_engine.py
tests/parser/engine/test_qwen3.py

Could someone with Buildkite permissions retry the failed AMD job?

Palaiologos1453 requested review from aarnphm, bbrowning, chaunceyjiang and sfeng33 as code owners June 22, 2026 08:27

mergify Bot added the qwen (Related to Qwen models) and tool-calling labels Jun 22, 2026

github-project-automation Bot added this to Tool Calling Jun 22, 2026

Palaiologos1453 force-pushed the fix-tool-call-argument-streaming-43267 branch 2 times, most recently from 274f002 to 8e6ad8d Compare June 22, 2026 08:30

Palaiologos1453 mentioned this pull request Jun 22, 2026

[Feature]: Support streaming output for tool_calls arguments #43267

Closed

1 task

Palaiologos1453 force-pushed the fix-tool-call-argument-streaming-43267 branch from 8e6ad8d to 637a483 Compare June 22, 2026 09:42

chaunceyjiang reviewed Jun 22, 2026

fix: stream qwen3 tool call string arguments

33792a2

Signed-off-by: Rui Yin <2260891073@qq.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>

Palaiologos1453 force-pushed the fix-tool-call-argument-streaming-43267 branch from 637a483 to 33792a2 Compare June 22, 2026 10:01

chaunceyjiang reviewed Jun 22, 2026

Palaiologos1453 added 2 commits June 22, 2026 19:52

Merge branch 'main' into fix-tool-call-argument-streaming-43267

ad8964c

fix: add qwen3 parameter start transition

93d7de1

Signed-off-by: Rui Yin <2260891073@qq.com>

chaunceyjiang added the verified (Run pre-commit for new contributors without triggering other tests) label Jun 22, 2026

chaunceyjiang reviewed Jun 22, 2026

bbrowning reviewed Jun 22, 2026

Palaiologos1453 added 2 commits June 22, 2026 21:08

Merge branch 'main' into fix-tool-call-argument-streaming-43267

761a25b

perf: cache streamable parser arg keys

c47e346

Signed-off-by: Rui Yin <2260891073@qq.com>

Merge branch 'main' into fix-tool-call-argument-streaming-43267

3ed2c65

Merge branch 'main' into fix-tool-call-argument-streaming-43267

db5e7dd

bbrowning approved these changes Jun 22, 2026

Merge branch 'main' into fix-tool-call-argument-streaming-43267

fca0d6d

chaunceyjiang mentioned this pull request Jun 22, 2026

[Bugfix] fix: stream Mimimax m2 tool call string arguments #46382

Open

4 tasks

chaunceyjiang added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 22, 2026

chaunceyjiang reviewed Jun 22, 2026

jesco-absolut mentioned this pull request Jun 22, 2026

[rust][tool-parser] Add streaming invariant tests #46416

Open

chaunceyjiang merged commit 8db1216 into vllm-project:main Jun 23, 2026
52 checks passed

github-project-automation Bot moved this to Done in Tool Calling Jun 23, 2026
