fix: stream Qwen3 tool call string arguments#46351
Conversation
274f002 to
8e6ad8d
Compare
|
Nice approach — promoting Since the PR builds on the root cause and the two prefix-stability issues worked out above, would you be open to adding a co-author trailer for the analysis? Something like: No worries either way — glad it's getting fixed properly. Thanks for picking it up and pushing it over the line. |
8e6ad8d to
637a483
Compare
|
Thanks, added the co-author trailer in 637a483:\n\nCo-authored-by: abinggo 107740309+abinggo@users.noreply.github.com\n\nNo code diff changed in that amend. |
|
Appreciate it, thanks! 🙏 |
| TOOL_CALL_END = "</tool_call>" | ||
| FUNC_PREFIX = "<function=" | ||
| FUNC_END = "</function>" | ||
| PARAM_END = "</parameter>" |
There was a problem hiding this comment.
Add <parameter= as a terminal so that the lexer buffers the <param prefix.
Signed-off-by: Rui Yin <2260891073@qq.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>
637a483 to
33792a2
Compare
|
Addressed in 33792a2. I added Local checks:
|
| (ParserState.TOOL_ARGS, "PARAM_END"): Transition( | ||
| ParserState.TOOL_ARGS, | ||
| (EventType.ARG_VALUE_CHUNK,), | ||
| ), |
There was a problem hiding this comment.
| ), | |
| ), | |
| (ParserState.TOOL_ARGS, "PARAM_START"): Transition( | |
| ParserState.TOOL_ARGS, | |
| (EventType.ARG_VALUE_CHUNK,), | |
| ), |
Signed-off-by: Rui Yin <2260891073@qq.com>
|
Addressed in 93d7de1. I renamed the terminal to Local checks:
|
|
@chaunceyjiang I believe the latest review comments are addressed in The current diff includes:
Local checks already run:
Could you please re-review when you have a chance? |
chaunceyjiang
left a comment
There was a problem hiding this comment.
LGTM cc @bbrowning
bbrowning
left a comment
There was a problem hiding this comment.
I want to give this a quick test on a live server, but the logic looks sound on the surface for being able to stream back these large string deltas.
One comment for future work (smarter whitespace stripping), and one efficiency comment (recomputing safe string keys on every delta) that may be worth tackling now if it's quick, but neither impact correctness.
| value = m.group(2) | ||
| if name: | ||
| params[name] = value | ||
| params[name] = value.strip() |
There was a problem hiding this comment.
This is ok for now to align the partial and complete paths.
But, we should separately track and fix this to be less aggressive as this will strip things like indentation from edit parameter values. I say separately because it's orthogonal to the scope of this PR, which is streaming large string deltas back.
|
|
||
| string_keys: set[str] | None = None | ||
| if slot.name: | ||
| string_keys = self._streamable_string_keys( |
There was a problem hiding this comment.
I wonder should we set this in _emit_name_delta when we setup the ToolSlot so we only do this once? It could have a string_keys property we set and then the safe string keys get computed in one place and we just pass slot.string_keys into the _safe_arg_prefix below instead of recomputing this list of safe string keys every iteration.
Signed-off-by: Rui Yin <2260891073@qq.com>
|
@bbrowning thanks for the review. I addressed the efficiency comment in I agree the whitespace stripping should be tracked separately since it is orthogonal to this PR's large-string streaming path. Local checks:
|
|
Note that this actually turns streaming of string arguments on for any tool parser using the new parser engine that sets |
bbrowning
left a comment
There was a problem hiding this comment.
I ran live tests with this against Qwen/Qwen3.6-27B, google/gemma-4-31B-it, and zai-org/GLM-4.7-Flash to cover a few popular model families that use the new parsers. Before this fix, all of those would single large string arg values back as a single delta. After the fix, all of them stream those back incrementally.
I used a script at https://gist.github.com/bbrowning/465ff39855189bbd6d7b68b7eec0e377 to help me test this against live servers.
Before this change, I got 7 deltas back from the Qwen 3.6 test. After this change, 244 deltas. If I enable spec decoding or stream_interval > 1, I get fewer than 244 deltas (as expected), but still more than the 7.
The before/after with Gemma4 is not quite as dramatic - 7 deltas before this change, 20 after. This will vary for each model, and Gemma4 has some additional logic that holds back arguments in partial parsing for correctness sake. But, still a nice improvement there.
For GLM 4.7 Flash, 7 deltas before and 191 deltas afterwards.
In all cases, the streamed deltas were able to accumulate and parse into valid JSON with the appropriate coerced types and no visible leakage of partial tags into the content.
|
@chaunceyjiang This looks good to me in manual testing. MiniMax M2 will need some adjustments to start streaming partial tool call args, because it doesn't look at the But, it looks safe as-is with that model because its regex only matches on complete parameter values today. |
Yes, I also noticed that when testing this PR locally. I'll submit a PR about minmax after this one is merged. |
| slot.name = name | ||
| slot.name_sent = True | ||
| slot.string_keys = self._streamable_string_keys( | ||
| find_tool_properties(self._tools, name) |
There was a problem hiding this comment.
One non-blocking comment: find_tool_properties should be cached here.
|
Current Buildkite failure looks unrelated to this PR's tool-call parser changes. The failed Buildkite build is #73461, job
The log also shows repeated HuggingFace
Could someone with Buildkite permissions retry the failed AMD job? |
Fixes #43267.
Summary
</parameter>as a lexer terminal while preserving it in the arg stream, preventing partial closing-tag text from leaking into streamed arguments.Tests
tests/parser/engine/test_qwen3.pywith local stubs forvllm.third_party.pynvmlanduvlooptests/parser/engine/test_parser_engine.pywith local stubs forvllm.third_party.pynvmlanduvlooppython -m py_compile vllm/parser/engine/parser_engine.py vllm/parser/qwen3.py tests/parser/engine/test_qwen3.py tests/parser/engine/test_parser_engine.pygit diff --checkNote: on this Windows checkout, plain pytest imports try to load the unbuilt CUDA extension (
vllm._C_stable_libtorch), so the parser tests above were run through an in-processpynvml/uvloopstub to bypass platform detection.