ClawBench master architecture document
This is the canonical orientation document for the current repo. The final state is one ClawBench-hosted MCP control plane that lets a user's own agent run benchmark tasks while ClawBench owns sandboxing, scoring, trace capture, policy, and leaderboard gating.
MCP-first custom agent runs
A user connects Claude Code, Codex, OpenCode, or another agent to the ClawBench MCP server. The agent receives run-scoped tools and never receives platform internals or raw sandbox credentials.
ClawBench owns benchmark sandboxes, tool routing, artifact collection, traces, scorer execution, run finalization, and leaderboard eligibility.
The earlier local bridge, custom WebSocket proxy, ngrok-primary, and user-hosted endpoint direction is not the v1 path. Keep it out of docs and tickets unless a new owner explicitly reopens it.
1. Claude Code, Codex, OpenCode, or another MCP-capable agent.
2. Backend-hosted Streamable HTTP server with OAuth/device auth.
3. Every tool call carries explicit run_id and dedupe metadata (see the payload sketch after this list).
4. ClawBench routes to Docker, Daytona, E2B, Modal, or browser providers behind one interface.
5. SWE, Terminal, Web Tasks, or answer-style tools execute inside controlled sandboxes.
6. Scores, events, artifacts, traces, and visibility policy land in Postgres.
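The exact tool names and argument fields are not final until KAN-156 lands, but the call shape this flow implies can be sketched. In the payload below, tools/call is the standard MCP JSON-RPC method; the tool name and argument fields are illustrative assumptions, not the shipped contract.

```python
# Hypothetical shape of a run-scoped MCP tool call. "tools/call" is the
# standard MCP JSON-RPC method; the tool name and argument fields are
# illustrative assumptions, not the final KAN-156 contract.
tool_call = {
    "method": "tools/call",
    "params": {
        "name": "swe_run_tests",         # hypothetical run-scoped tool
        "arguments": {
            "run_id": "run-abc123",       # explicit run scope on every call
            "client_call_id": "cc-7f3a",  # dedupe metadata for safe retries
            "command": "pytest -q",
        },
    },
}
```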
| Phase | What Happens | Primary Code |
|---|---|---|
| Create | API validates benchmark, submission mode, agent identity, and metadata. | apps/backend/api/routers/runs.py, apps/backend/services/runs.py, apps/backend/run_payloads.py |
| Queue | Run is stored with pending output/trace records and worker-visible state. | apps/backend/repositories/runs.py, supabase/migrations/* |
| Claim | Worker claims a lease or the MCP session starts a run-scoped tool workflow. | apps/backend/api/routers/worker.py, apps/backend/services/workers.py, apps/backend/workers/runner.py |
| Execute | Final-state execution happens through MCP tools and ClawBench-controlled providers. Hosted curated BenchFlow remains useful groundwork. | apps/backend/workers/runner.py, future mcp/sandbox modules |
| Finalize | Scores, metrics, events, artifacts, and trace summaries are imported and redacted for public views. | apps/backend/services/traces.py, apps/backend/services/standardized_metrics.py, apps/backend/public_redaction.py |
| Publish | Public leaderboard visibility is positive-gated. MCP runs stay dev-only until publishability gates pass. | apps/backend/services/benchmarks.py, apps/backend/lanes.py, apps/frontend/main.ts |
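For a concrete feel of the Create phase, a minimal sketch of a run-creation request follows. The endpoint path, field names, and auth scheme here are assumptions for illustration, not the actual contract defined in apps/backend/api/routers/runs.py.

```python
import httpx

# Illustrative only: the endpoint path, field names, and auth scheme are
# assumptions, not the real contract in apps/backend/api/routers/runs.py.
def create_run(base_url: str, token: str) -> dict:
    response = httpx.post(
        f"{base_url}/runs",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "benchmark": "swe-bench-verified",   # must be in the public catalog
            "submission_mode": "mcp",            # hypothetical mode name
            "agent_name": "Tester_agent",        # public participant identity
            "metadata": {"batch": "local-canary"},
        },
    )
    response.raise_for_status()
    return response.json()  # assumed to include the new run_id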
- agents.md and CLAUDE.md: agent behavior, benchmark catalog control, Linear ticket lifecycle.
- README.md: local app setup, deployment entrypoint, environment inventory.
- docs/repo-system-map.html: this master document.
- docs/engineer-onboarding.html: new engineer and agent onboarding pack.
- apps/backend/main.py: FastAPI application assembly.
- apps/backend/config.py: runtime settings and env parsing.
- apps/backend/api/routers/*: public/admin/worker route contracts.
- apps/backend/schemas/*: request and response models.
- apps/backend/services/benchmarks.py: public benchmark catalog and config validation.
- apps/backend/services/runs.py: run creation and lifecycle orchestration.
- apps/backend/services/workers.py: worker claim, heartbeat, finalize, and tick control.
- apps/backend/repositories/*: asyncpg persistence layer.
- apps/backend/workers/runner.py: current worker shell and BenchFlow/hosted-run import surface.
- apps/backend/workers/tick.py: operational worker tick entrypoint.
- apps/backend/services/traces.py: trace retrieval and public trace shaping.
- integrations/harbor-claw-bench: Harbor adapter integration work.
- apps/frontend/main.ts: route rendering, benchmark pages, traces, dashboards.
- apps/frontend/routes.ts: route helpers and path constants.
- apps/frontend/environment.ts: frontend API base selection.
- apps/frontend/competitionCuration.ts: home/grid benchmark catalog entries.
- apps/frontend/styles.css: global app styles.
- supabase/migrations/*: production schema source of truth.
- apps/backend/sql/*: local/schema reference SQL.
- scripts/deploy-cloud-run.sh: Cloud Run deploy entrypoint.
- scripts/generate-sitemap.js: static sitemap generation.
- scripts/*benchmark*, scripts/*terminal*, and scripts/*web_tasks*: canary and corpus tooling, not public proof by themselves.

This map summarizes the local KAN-139 through KAN-160 buildout. It is based on the repo handoff and roadmap artifacts; check Linear before moving issue status or assigning owners.
| Ticket | Architecture Lane | Role In The Buildout |
|---|---|---|
| KAN-139 | Worker + sandbox routing | Routes MCP tool calls through ClawBench-controlled providers, including BenchFlowDockerSandboxProvider. |
| KAN-140 | Validation | Contract, auth, reconnect, provider preflight, browser policy, redaction, and publishability tests. |
| KAN-141 | Real canaries | Claude Code v1 proof canaries for answer-style and SWE-Bench Verified MCP runs. |
| KAN-142 | Onboarding + closure | Docs, runbooks, CI/CD evidence, and Linear/GitHub closure after canary proof exists. |
| KAN-143 | Runtime foundation | Hosted curated SWE and Docker/BenchFlow baseline groundwork; not MCP custom-agent sign-off by itself. |
| KAN-144 | SWE task materialization | Exports official SWE-Bench Verified rows into runnable task directories; scoring parity still needs proof. |
| KAN-145 | Web Tasks browser tools | Sandboxed browser sessions, proxy policy, URL policy, screenshots, and provenance for Web Tasks. |
| KAN-146 | Terminal Bench materialization | Materializes Terminal Bench tasks and environment requirements for BenchFlow-compatible execution. |
| KAN-147 | Entry Test decision | Decides whether ClawBench Entry Test needs BenchFlow materialization or stays outside this path. |
| KAN-148-150 | Non-integration | Local roadmap marks these as unrelated SEO work; do not count them toward MCP/BenchFlow readiness. |
| KAN-151 | Superseded bridge | Retires local bridge, ngrok, endpoint registration, custom WebSocket proxy, and bridge states from v1. |
| KAN-152 | Publishability + abuse controls | Dev-only defaults, leaderboard gates, usage caps, wall-time limits, queueing, and anti-cheat policy. |
| KAN-153 | SWE MCP tool implementation | Exposes HostedSweToolSession.handle_call through MCP instead of duplicating SWE tooling. |
| KAN-154 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-155 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-156 | MCP server core | Backend-hosted Streamable HTTP MCP server with run-scoped tools and client_call_id dedupe (sketched below). |
| KAN-157 | Auth + scoped sessions | OAuth/device auth, read/run scopes, ownership, reconnect behavior, and onboarding copy. |
| KAN-158 | Phase 0 MCP spike | Proves Streamable HTTP MCP with Claude Code before the full server/tool implementation. |
| KAN-159 | Cleanup | Audits and retires pre-MCP bridge/proxy scaffolding after KAN-156 lands. |
| KAN-160 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
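One contract detail from the table above is worth illustrating: the client_call_id dedupe from KAN-156. The sketch below shows one way a server could make tool calls idempotent on that key; the in-memory cache and function names are assumptions, since the real implementation belongs to the MCP server core.

```python
# Idempotency sketch for client_call_id dedupe. Assumption: the real KAN-156
# server persists results durably; an in-memory dict stands in here.
from typing import Callable

_completed_calls: dict[tuple[str, str], dict] = {}

def handle_tool_call(
    run_id: str,
    client_call_id: str,
    execute: Callable[[], dict],
) -> dict:
    key = (run_id, client_call_id)
    if key in _completed_calls:
        # A retried call with the same dedupe key returns the stored result
        # instead of re-running the tool inside the sandbox.
        return _completed_calls[key]
    result = execute()
    _completed_calls[key] = result
    return result
```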
The complete public benchmark catalog is limited to Terminal Bench, SWE-Bench Verified, ClawBench Entry Test, and Web Tasks Benchmark. Do not seed, surface, or document new benchmark families without explicit owner permission that names the benchmark.
ClawBench agent names are public participant identities. For local/prod tests use Tester_agent. For Codex/GPT-5.5 runs from this workspace use Codex. Put batch labels, model IDs, and canary names in metadata, not public agent names.
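A small sketch of that split, with illustrative field names rather than the actual run payload schema:

```python
# Illustrative split between public identity and private run metadata.
# Field names are assumptions, not the actual run payload schema.
run_payload = {
    "agent_name": "Codex",  # public participant identity only
    "metadata": {
        "batch": "canary-batch-01",  # batch label stays out of the public name
        "model_id": "gpt-5.5",       # model ID lives in metadata
        "canary": "swe-canary-a",    # canary name lives in metadata
    },
}
```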
Do not use oracle mode for user-requested benchmark runs or public benchmark claims. Public claims must use real agent trajectories and durable trace evidence.
Tickets need Objective, scoped execution details, References, GitHub metadata, validation evidence, PR links, and CI/CD evidence before Done. In Progress tickets must not go stale.
| Need | Where To Look | Notes |
|---|---|---|
| Local app | README.md, package.json, pyproject.toml | Frontend is Vite. Backend is FastAPI. Database is Postgres/Supabase-compatible. |
| Production deploy | scripts/deploy-cloud-run.sh, Dockerfile, vercel.json | Cloud Run owns backend/API/worker deployment. Vercel config exists for the static frontend path. |
| Database changes | supabase/migrations | Use migrations as source of truth. Validate locally before applying to production. |
| Trace and leaderboard debugging | apps/backend/services/traces.py, apps/frontend/main.ts, tests/test_traces_access_and_scores.py | Check both API payloads and frontend rendering before claiming a public result works. |
| Docs publication | .claude/skills/cloudflare-pages-publisher, ~/.codex/skills/cloudflare-pages-publisher | Deploy only current docs. Do not keep old reports in the Cloudflare Pages bundle. |