Exercise 07: Run a Benchmark Review

Objective

Delegate to jg-benchmarker to collect benchmark data, evaluate current model assignments for all sandbox agents, and produce an advisory recommendation report. This is the operational skill that keeps a multi-agent system current as models evolve.

Required Reading

.cursor-expert/agents/jg-benchmarker.md -- Full agent definition, per-agent benchmark focus, execution style
.cursor-expert/skills/jg-benchmark-ops/SKILL.md -- Collection/evaluation workflow, verdict definitions
.cursor-expert/AGENTS.md -- Agent registry with current model assignments
Custom Agents | Cursor Docs -- How the model field in agent frontmatter controls model assignment
Agent Skills | Cursor Docs -- How skills like jg-benchmark-ops activate on-demand
Agent Skills Guide -- Additional guide on skill activation and format

Cursor: The benchmarker workflow uses Cursor's Task tool to dispatch jg-benchmarker. The snapshot format, verdict definitions, and recommendation structure are identical across environments.

Claude Code: The benchmarker workflow is IDE-agnostic. In Claude Code, you invoke the same collection and evaluation logic via sequential prompting with model selection. Only the invocation mechanism differs (Task tool in Cursor vs. direct prompting in Claude Code).

Context

Your sandbox has 9 agents (8 core pipeline + team-linter). Each has a model: field in its frontmatter. Over time, new models release, prices change, and capabilities shift. The benchmarker agent automates the research and evaluation so you can make data-driven decisions about model assignments.
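Before delegating, it can help to see which model each agent currently uses. The following is a minimal sketch, assuming each agent file is Markdown that opens with a `---`-delimited frontmatter block containing a plain `model: value` line. The helper names `read_model_field` and `list_model_assignments` are hypothetical and not part of the sandbox.

```python
from pathlib import Path
from typing import Dict, Optional


def read_model_field(agent_markdown: str) -> Optional[str]:
    """Return the `model:` value from a leading frontmatter block, if present."""
    lines = agent_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter block at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore anything in the prompt body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model":
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("\"'") or None
    return None


def list_model_assignments(agents_dir: str) -> Dict[str, Optional[str]]:
    """Map each agent file stem in agents_dir to its assigned model (or None)."""
    return {
        path.stem: read_model_field(path.read_text(encoding="utf-8"))
        for path in sorted(Path(agents_dir).glob("*.md"))
    }
```

Calling `list_model_assignments("sandbox/.cursor/agents")` would give a quick baseline to compare against the benchmarker's report. The parser handles only simple single-line values; a full YAML parser would be the safer choice for anything more complex.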

This exercise teaches the Expert-level workflow: collect -> evaluate -> recommend -> (optionally) apply.

Tasks

Step 1: Collect Benchmark Data

Delegate to jg-benchmarker with this prompt:

Collect the latest benchmark data for models currently used by our agents. Focus on models referenced in sandbox/.cursor/agents/. Store the snapshot at docs/expert/tutorials/outputs/07-benchmark-snapshot.json. Include scores for: reasoning, coding, instruction-following, speed, and pricing. Record the source URL and retrieval date for every score. Include "produced_by": "jg-benchmarker" in the snapshot JSON. Do not overwrite existing snapshots.

The snapshot should be structured JSON with entries per model, each containing benchmark scores and metadata.
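One possible shape for such a snapshot is sketched below. Only the `produced_by` value comes from the exercise; every other key name (`collected_on`, `models`, `value`, `source_url`, `retrieved`) and the helper names are assumptions, so treat the agent and skill definitions as the authority on the real format. The save helper also illustrates the "do not overwrite" rule by opening the file in exclusive-create mode.

```python
import json
from datetime import date
from pathlib import Path
from typing import Any, Dict


def score_entry(value: float, source_url: str, retrieved: str) -> Dict[str, Any]:
    """One score plus the provenance the prompt asks for (hypothetical keys)."""
    return {"value": value, "source_url": source_url, "retrieved": retrieved}


def build_snapshot(models: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap per-model entries with the required produced_by marker."""
    return {
        "produced_by": "jg-benchmarker",
        "collected_on": date.today().isoformat(),  # assumed metadata field
        "models": models,
    }


def save_snapshot(snapshot: Dict[str, Any], path: str) -> None:
    """Write the snapshot, refusing to overwrite an existing file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file is already there.
    with target.open("x", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2)
```

A per-model entry could then be a dictionary keyed by reasoning, coding, instruction_following, speed, and pricing, each built with `score_entry(...)` using values and URLs the benchmarker actually retrieved.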

Step 2: Evaluate Current Assignments

Delegate to jg-benchmarker with this prompt:

Using the snapshot at docs/expert/tutorials/outputs/07-benchmark-snapshot.json, evaluate the current model assignments for all agents listed in sandbox/.cursor/agents/. For each agent, determine which benchmarks are primary (use the per-agent benchmark focus table in your agent definition), compare the current model to alternatives, and assign a verdict (Excellent/Correct/Monitor/Tune/Upgrade). Write the evaluation report to docs/expert/tutorials/outputs/07-benchmark-report.md. Include a Produced by: jg-benchmarker line in the report header.
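The comparison step in that prompt can be pictured roughly as follows. This is an illustrative sketch only: it assumes scores are plain numbers on a shared scale and that an agent's primary benchmarks are averaged with equal weight, neither of which is confirmed by the exercise. It deliberately stops short of assigning a verdict, because the verdict definitions live in the jg-benchmark-ops skill. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Scores = Dict[str, Dict[str, float]]  # model name -> benchmark name -> score


def primary_score(model_scores: Dict[str, float], primary: List[str]) -> float:
    """Average a model's scores over one agent's primary benchmarks."""
    return sum(model_scores[name] for name in primary) / len(primary)


def best_alternative(
    scores: Scores, current_model: str, primary: List[str]
) -> Optional[Tuple[str, float]]:
    """Return (model, margin over current) for the strongest alternative.

    A positive margin means the alternative beats the current model on the
    primary benchmarks; None means there is nothing to compare against.
    """
    current = primary_score(scores[current_model], primary)
    candidates = {
        model: primary_score(model_scores, primary)
        for model, model_scores in scores.items()
        if model != current_model
    }
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, candidates[best] - current
```

The margin is only one input to a verdict. Cost and speed also matter, which is why the report pairs each recommendation with before/after metrics and cost impact.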

Step 3: Review the Report

The report should include:

Agent Evaluation Table: Agent | Current Model | Verdict | Key Metrics (primary benchmark scores)
Recommendations: For any agent with verdict Monitor/Tune/Upgrade, suggest a specific model change with before/after metrics and cost impact
Cost Impact Summary: Total estimated monthly cost change if all recommendations were applied
Overall Assessment: Whether the current agent configuration is healthy or needs attention
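The Cost Impact Summary is simple arithmetic: the sum, over all recommended changes, of the proposed monthly cost minus the current monthly cost. A minimal sketch follows, assuming each recommendation already carries estimated monthly costs in the same currency. The `Recommendation` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Recommendation:
    agent: str
    current_monthly_cost: float   # estimated cost on the current model
    proposed_monthly_cost: float  # estimated cost on the recommended model


def total_monthly_cost_change(recommendations: Iterable[Recommendation]) -> float:
    """Sum the per-agent cost deltas; a negative result means net savings."""
    return sum(
        rec.proposed_monthly_cost - rec.current_monthly_cost
        for rec in recommendations
    )
```

An upgrade for one agent can be offset by a cheaper model for another, so the total is more informative than any single line item.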

Step 4: Do NOT Apply Changes

The benchmarker is advisory by default. Review the recommendations but do not modify any agent files. In a real workflow, an Expert would review the report, discuss with the team, and then explicitly instruct the benchmarker to apply changes.

Output

Two files in docs/expert/tutorials/outputs/:

07-benchmark-snapshot.json -- Structured benchmark data with model scores, sources, and dates
07-benchmark-report.md -- Evaluation report with agent table, recommendations, and cost impact

Validation

python3 docs/expert/tutorials/verify.py --exercise 07

Checks:

Snapshot JSON exists and is valid, with produced_by: "jg-benchmarker"
Report markdown exists with the required sections (agent evaluation table, recommendations, cost impact) and a Produced by: jg-benchmarker line
Report mentions 5+ agent names
Report references all 5 verdict terms
Report has sufficient depth
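Some of the checks above can be approximated locally before running the verifier. The sketch below is not the implementation of verify.py, whose internals are not shown here. It is a hypothetical pre-check that uses plain substring matching and leaves out the section and depth checks. The caller supplies the agent names.

```python
import json
from typing import Iterable, List

VERDICTS = ("Excellent", "Correct", "Monitor", "Tune", "Upgrade")


def check_snapshot(snapshot_text: str) -> List[str]:
    """Return a list of problems found in the snapshot JSON text."""
    try:
        data = json.loads(snapshot_text)
    except json.JSONDecodeError as exc:
        return [f"snapshot is not valid JSON: {exc}"]
    if not isinstance(data, dict) or data.get("produced_by") != "jg-benchmarker":
        return ['snapshot lacks "produced_by": "jg-benchmarker"']
    return []


def check_report(report_text: str, agent_names: Iterable[str]) -> List[str]:
    """Return a list of problems found in the report markdown text."""
    problems = []
    if "Produced by: jg-benchmarker" not in report_text:
        problems.append("report lacks the 'Produced by: jg-benchmarker' line")
    missing = [verdict for verdict in VERDICTS if verdict not in report_text]
    if missing:
        problems.append("report is missing verdict terms: " + ", ".join(missing))
    mentioned = [name for name in agent_names if name in report_text]
    if len(mentioned) < 5:
        problems.append(f"report mentions only {len(mentioned)} agent names")
    return problems
```

An empty list from both functions suggests the outputs are at least structurally plausible. The verifier remains the source of truth.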

Reflection

Which agents had the most room for improvement? Why?
Did any agent get a "Tune" verdict where a cheaper model outperforms? What are the cost savings?
How often should you rerun this review in a production environment?
What would change if a new model releases tomorrow that's 50% cheaper with similar performance?
How would you automate this review on a quarterly schedule?

Answer

Snapshot (07-benchmark-snapshot.json): Must include per-model entries with benchmark scores (reasoning, coding, instruction_following, speed), pricing, source URLs, and produced_by: "jg-benchmarker".

Report (07-benchmark-report.md): Must include agent evaluation table, recommendations for Monitor/Tune/Upgrade verdicts, cost impact summary, and overall assessment.

Typical verdicts: Subplanner and debugger score Excellent (best-in-class for their roles). Worker and tester score Correct (good cost-performance ratio). The benchmarker itself often scores Tune (a flash-tier model's reasoning is too low for model comparison research).

See docs/expert/tutorials/solutions/07-benchmark-guide.md in the source repo for a complete exemplar.