feat(evaluate): add Pass@N retry scheduler with VLM Judge #61
Open
Daiyimo wants to merge 1 commit into
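Adds a new evaluation module, evaluate/, which brings Pass@N retries and automatic VLM Judge verdicts to task execution:
- evaluate/runner.py: Pass@N scheduler. The Judge runs immediately after each attempt; Judge=pass exits early and Judge=fail triggers a retry. Automatically detects "read the full text" scenarios and injects a scroll constraint. The Judge model name is always read from model_config.yaml, with nothing hard-coded.
- evaluate/judge.py: VLM Judge core. Verdicts are tiered (ABORT/keyword → 0 LLM calls; screenshots available → full VLM). Screenshots are cross-checked against the return value to catch image-reading hallucinations. Failed LLM calls are retried automatically, up to 3 times.
- pu_client.py: screenshot forking. Each step's screenshot is also cached for the Judge (rolling buffer that keeps the last 5). return_log adds last_step_screenshot / history_actions / return_val. The Judge uses real screenshots taken during execution rather than delayed captures taken afterwards.
- examples/run_single_task.py: wired to the new entry point. The terminal report shows Agent output and the Judge verdict in separate sections.
- tools/ask_llm_v2.py: print now tolerates UTF-8 encoding errors, avoiding exceptions on Windows GBK terminals.
- 极简运行指南_CN.md: adds Section 6, usage instructions for the Pass@N evaluation entry point.
Co-Authored-By: Claude <noreply@anthropic.com>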
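For reviewers, the Pass@N control flow described for evaluate/runner.py can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Verdict`, `PassAtNResult`, `run_pass_at_n`, and the `run_attempt` / `judge` callables are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Verdict:
    """Hypothetical Judge result: pass/fail plus a short reason."""
    passed: bool
    reason: str = ""


@dataclass
class PassAtNResult:
    """Hypothetical summary of one Pass@N run."""
    passed: bool
    attempts_used: int
    verdicts: List[Verdict]


def run_pass_at_n(
    run_attempt: Callable[[int], Any],
    judge: Callable[[Any], Verdict],
    n: int,
) -> PassAtNResult:
    """Run up to n attempts, judging each one right after it finishes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    verdicts: List[Verdict] = []
    for attempt in range(1, n + 1):
        return_log = run_attempt(attempt)   # execute the task once
        verdict = judge(return_log)         # judge immediately
        verdicts.append(verdict)
        if verdict.passed:                  # Judge=pass: exit early
            return PassAtNResult(True, attempt, verdicts)
        # Judge=fail: fall through and retry
    return PassAtNResult(False, n, verdicts)
```

Passing the attempt runner and the Judge in as callables keeps the loop testable without a device or a model endpoint.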
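The tiered Judge and the rolling screenshot cache described for evaluate/judge.py and pu_client.py might look something like the sketch below. It rests on several assumptions that the description does not confirm: the names `ScreenshotCache`, `judge`, and `ask_vlm` are hypothetical, "up to 3 times" is read as three calls in total, and the `"unknown"` verdict (for missing screenshots or exhausted retries) is an invention of this example.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

MAX_SCREENSHOTS = 5          # rolling window size from the description
MAX_VLM_TRIES = 3            # assumption: 3 calls in total, not 1 + 3 retries
ABORT_KEYWORDS = ("ABORT",)  # assumption: the real keyword list is not shown


class ScreenshotCache:
    """Keeps only the most recent screenshots taken during execution."""

    def __init__(self, maxlen: int = MAX_SCREENSHOTS) -> None:
        # A bounded deque drops the oldest frame automatically.
        self._frames: Deque[bytes] = deque(maxlen=maxlen)

    def add(self, frame: bytes) -> None:
        self._frames.append(frame)

    def frames(self) -> List[bytes]:
        return list(self._frames)


def judge(
    return_val: str,
    screenshots: Sequence[bytes],
    ask_vlm: Callable[[str, Sequence[bytes]], bool],
) -> str:
    """Return "pass", "fail", or "unknown" (the last is an assumption)."""
    # Tier 1: an abort keyword decides the verdict with zero model calls.
    if any(keyword in return_val for keyword in ABORT_KEYWORDS):
        return "fail"
    # Assumption: without screenshots there is nothing to cross-check.
    if not screenshots:
        return "unknown"
    # Tier 2: full VLM check of screenshots against the return value,
    # retrying when the model call itself raises.
    for _ in range(MAX_VLM_TRIES):
        try:
            return "pass" if ask_vlm(return_val, screenshots) else "fail"
        except Exception:
            continue
    return "unknown"
```

Checking the cheap keyword tier first means an aborted run never spends a model call, which matches the "0 LLM calls" tier in the description.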