feat(evaluate): add Pass@N retry scheduler with VLM Judge by Daiyimo · Pull Request #61 · stepfun-ai/gelab-zero

Daiyimo · 2026-05-13T09:40:49Z

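Adds a new evaluation module, evaluate/, which brings Pass@N retries and automatic VLM Judge verdicts to task execution:

- evaluate/runner.py: Pass@N scheduler. The Judge runs immediately after each attempt; Judge=pass exits early and Judge=fail triggers a retry. It automatically detects "read the full text" scenarios and injects a scrolling constraint. The Judge model name is always read from model_config.yaml, with nothing hard-coded.
- evaluate/judge.py: VLM Judge core. Verdicts are tiered (ABORT/keyword → 0 LLM calls; screenshots available → full VLM evaluation). Screenshots are cross-checked against the return value to catch image-reading hallucinations, and failed LLM calls are retried automatically up to 3 times.
- pu_client.py: screenshot fork. Each step's screenshot is also cached for the Judge (rolling window of the last 5). return_log gains last_step_screenshot / history_actions / return_val, so the Judge uses real screenshots captured during execution rather than delayed ones taken afterwards.
- examples/run_single_task.py: wired to the new entry point; the terminal report shows the Agent output and the Judge verdict in separate sections.
- tools/ask_llm_v2.py: print now tolerates UTF-8 encoding errors, avoiding exceptions on Windows GBK terminals.
- 极简运行指南_CN.md: adds section 6, with usage instructions for the Pass@N evaluation entry point.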
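The Pass@N control flow described for evaluate/runner.py can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_pass_at_n`, `AttemptResult`, and the `run_attempt` / `judge` callables are hypothetical names, and the verdict strings "pass" / "fail" are assumed from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AttemptResult:
    attempt: int   # 1-based attempt index
    verdict: str   # "pass" or "fail", as returned by the judge


def run_pass_at_n(
    run_attempt: Callable[[int], dict],
    judge: Callable[[dict], str],
    n: int,
) -> List[AttemptResult]:
    """Run up to n attempts, judging each one right after it finishes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    results: List[AttemptResult] = []
    for attempt in range(1, n + 1):
        return_log = run_attempt(attempt)  # hypothetical: executes the task once
        verdict = judge(return_log)        # hypothetical: returns "pass" or "fail"
        results.append(AttemptResult(attempt, verdict))
        if verdict == "pass":
            break                          # Judge=pass: exit early
        # Judge=fail: fall through to the next attempt (the retry)
    return results


def passed(results: List[AttemptResult]) -> bool:
    """Pass@N succeeds if any attempt was judged a pass."""
    return any(r.verdict == "pass" for r in results)
```

Judging inside the loop, rather than after all N attempts, is what makes the early exit possible: a task that passes on its first attempt costs one run and one Judge call.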

Co-Authored-By: Claude <noreply@anthropic.com>
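The rolling screenshot cache and the tiered Judge described for pu_client.py and evaluate/judge.py might look something like the sketch below. All names here (`ScreenshotCache`, `tiered_judge`, `ask_vlm`) are hypothetical. Three behaviours are assumptions rather than facts from the PR: that an ABORT/keyword hit yields "fail", that a run with no screenshots also yields "fail", and that "up to 3 times" means three calls in total.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

MAX_CACHED_SCREENSHOTS = 5  # the PR keeps the last 5 screenshots
MAX_LLM_ATTEMPTS = 3        # assumption: "up to 3 times" means 3 calls in total


class ScreenshotCache:
    """Rolling buffer holding only the most recent screenshots."""

    def __init__(self, maxlen: int = MAX_CACHED_SCREENSHOTS) -> None:
        self._frames: Deque[bytes] = deque(maxlen=maxlen)

    def add(self, image: bytes) -> None:
        self._frames.append(image)  # the oldest frame is dropped automatically

    def frames(self) -> List[bytes]:
        return list(self._frames)


def tiered_judge(
    return_val: str,
    screenshots: Sequence[bytes],
    ask_vlm: Callable[[str, Sequence[bytes]], str],
    abort_keywords: Sequence[str] = ("ABORT",),
) -> str:
    """Return "pass" or "fail", calling the VLM only when it is needed."""
    # Tier 1: keyword match, zero LLM calls (assumed to mean "fail").
    if any(keyword in return_val for keyword in abort_keywords):
        return "fail"
    # Assumption: with no screenshots there is nothing to cross-check.
    if not screenshots:
        return "fail"
    # Tier 2: full VLM check of the screenshots against the return value,
    # retrying when the (hypothetical) ask_vlm call raises.
    last_error = None
    for _ in range(MAX_LLM_ATTEMPTS):
        try:
            return ask_vlm(return_val, screenshots)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("VLM judge call failed after retries") from last_error
```

A `deque` with `maxlen` gives the rolling window without any manual trimming, and checking the cheap keyword tier first keeps aborted runs from spending a VLM call.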

Daiyimo force-pushed the feat/evaluate-pass-at-n-judge branch from 0b00225 to 8acd2c4 on May 13, 2026 09:55

Daiyimo wants to merge 1 commit into stepfun-ai:main from Daiyimo:feat/evaluate-pass-at-n-judge
