feat(evaluate): add Pass@N retry scheduler with VLM Judge #61

Open: Daiyimo wants to merge 1 commit into stepfun-ai:main from Daiyimo:feat/evaluate-pass-at-n-judge

Daiyimo (Contributor) commented May 13, 2026

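Adds a new evaluation module, evaluate/, which brings Pass@N retries and automatic VLM Judge verdicts to task execution:

  • evaluate/runner.py: Pass@N scheduler. The Judge runs immediately after each attempt; Judge=pass exits early and Judge=fail triggers a retry. It automatically detects "read the full text" scenarios and injects a scrolling constraint. The Judge model name is always read from model_config.yaml, with nothing hardcoded in the code.

  • evaluate/judge.py: core of the VLM Judge. Verdicts are tiered (ABORT/keyword match → 0 LLM calls; screenshots available → full VLM evaluation). Screenshots are cross-checked against the return value to catch hallucinated screen reads. Failed LLM calls are retried automatically, up to 3 times.

  • pu_client.py: screenshot fork. Each step's screenshot is also cached for the Judge (rolling window of the last 5). return_log gains last_step_screenshot / history_actions / return_val, so the Judge works from the real screenshots captured during execution rather than delayed ones taken afterwards.

  • examples/run_single_task.py: wired to the new entry point. The terminal report shows the Agent output and the Judge verdict in separate sections.

  • tools/ask_llm_v2.py: print now tolerates UTF-8 encoding errors, avoiding exceptions on Windows GBK terminals.

  • 极简运行指南_CN.md: adds section 6, usage instructions for the Pass@N evaluation entry point.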
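
For readers who want a concrete picture of the Pass@N behaviour described for evaluate/runner.py, here is a minimal sketch. It is not the code in this PR: `run_pass_at_n`, `AttemptResult`, `PassAtNReport`, and the `run_task` / `judge` callables are hypothetical names chosen for illustration, and the "pass"/"fail" strings are an assumed verdict encoding.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AttemptResult:
    attempt: int   # 1-based attempt index
    output: str    # whatever the agent returned (hypothetical shape)
    verdict: str   # "pass" or "fail" (assumed encoding of the Judge verdict)


@dataclass
class PassAtNReport:
    passed: bool
    attempts: List[AttemptResult] = field(default_factory=list)


def run_pass_at_n(run_task: Callable[[int], str],
                  judge: Callable[[str], str],
                  n: int) -> PassAtNReport:
    """Run up to n attempts, judging each one right after it finishes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    report = PassAtNReport(passed=False)
    for attempt in range(1, n + 1):
        output = run_task(attempt)
        verdict = judge(output)  # the Judge runs immediately after the attempt
        report.attempts.append(AttemptResult(attempt, output, verdict))
        if verdict == "pass":
            report.passed = True
            break  # Judge=pass: exit early, skip the remaining attempts
        # Judge=fail: fall through to the next attempt (the retry)
    return report
```

Keeping the Judge inside the loop, rather than judging all attempts at the end, is what makes the early exit possible: a task that passes on its first attempt costs one run and one verdict.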

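The rolling screenshot cache and the tiered verdict can be sketched in the same spirit. Everything named here is an assumption made for illustration: `ScreenshotCache`, `tiered_judge`, `FAIL_KEYWORDS`, and the `ask_vlm` callable do not come from the PR. The sketch reads "retried up to 3 times" as one initial call plus at most three retries, and it treats a run with no screenshots as a fail, a case the description does not cover.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

MAX_CACHED_SCREENSHOTS = 5   # "last 5" from the PR description
MAX_LLM_RETRIES = 3          # "up to 3 times" from the PR description
FAIL_KEYWORDS = ("ABORT",)   # hypothetical keyword list


class ScreenshotCache:
    """Keeps only the most recent screenshots taken during execution."""

    def __init__(self, maxlen: int = MAX_CACHED_SCREENSHOTS) -> None:
        self._frames: Deque[bytes] = deque(maxlen=maxlen)

    def add(self, frame: bytes) -> None:
        self._frames.append(frame)  # the oldest frame is dropped when full

    def latest(self) -> List[bytes]:
        return list(self._frames)


def tiered_judge(return_val: str,
                 screenshots: Sequence[bytes],
                 ask_vlm: Callable[[str, Sequence[bytes]], str]) -> str:
    """Return "pass" or "fail", calling the model only when needed."""
    # Tier 1: ABORT/keyword match, decided with zero model calls.
    if any(keyword in return_val for keyword in FAIL_KEYWORDS):
        return "fail"
    # Assumption: with no screenshots there is nothing to cross-check.
    if not screenshots:
        return "fail"
    # Tier 2: full VLM check of the return value against the screenshots.
    last_error: Optional[Exception] = None
    for _ in range(1 + MAX_LLM_RETRIES):
        try:
            answer = ask_vlm(return_val, screenshots)
        except Exception as exc:  # a failed call triggers a retry
            last_error = exc
            continue
        return "pass" if answer.strip().lower() == "pass" else "fail"
    raise RuntimeError("judge model call kept failing") from last_error
```

A `deque` with `maxlen` gives the rolling window without any manual trimming, which suits a cache that is written on every step and read once per attempt.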
Co-Authored-By: Claude <noreply@anthropic.com>
Daiyimo force-pushed the feat/evaluate-pass-at-n-judge branch from 0b00225 to 8acd2c4 on May 13, 2026 09:55