ClawBench master architecture document
This is the canonical orientation document for the current repo. The final state is one ClawBench-hosted MCP control plane that lets a user's own agent run benchmark tasks while ClawBench owns sandboxing, scoring, trace capture, policy, and leaderboard gating.
MCP-first custom agent runs
A user connects Claude Code, Codex, OpenCode, or another agent to the ClawBench MCP server. The agent receives run-scoped tools and never receives platform internals or raw sandbox credentials.
ClawBench owns benchmark sandboxes, tool routing, artifact collection, traces, scorer execution, run finalization, and leaderboard eligibility.
The earlier local bridge, custom WebSocket proxy, ngrok-primary, and user-hosted endpoint direction is not the v1 path. Keep it out of docs and tickets unless a new owner explicitly reopens it.
1. Claude Code, Codex, OpenCode, or another MCP-capable agent.
2. Backend-hosted Streamable HTTP server with OAuth/device auth.
3. Every tool call carries explicit run_id and dedupe metadata (see the payload sketch after this list).
4. ClawBench routes to Docker, Daytona, E2B, Modal, or browser providers behind one interface.
5. SWE, Terminal, Web Tasks, or answer-style tools execute inside controlled sandboxes.
6. Scores, events, artifacts, traces, and visibility policy land in Postgres.
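The exact tool names and argument fields are not final until KAN-156 lands, but the call shape this flow implies can be sketched. In the payload below, tools/call is the standard MCP JSON-RPC method; the tool name and argument fields are illustrative assumptions, not the shipped contract.

```python
# Hypothetical shape of a run-scoped MCP tool call. "tools/call" is the
# standard MCP JSON-RPC method; the tool name and argument fields are
# illustrative assumptions, not the final KAN-156 contract.
tool_call = {
    "method": "tools/call",
    "params": {
        "name": "swe_run_tests",         # hypothetical run-scoped tool
        "arguments": {
            "run_id": "run-abc123",       # explicit run scope on every call
            "client_call_id": "cc-7f3a",  # dedupe metadata for safe retries
            "command": "pytest -q",
        },
    },
}
```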
| Phase | What Happens | Primary Code |
|---|---|---|
| Create | API validates benchmark, submission mode, agent identity, and metadata. | apps/backend/api/routers/runs.py, apps/backend/services/runs.py, apps/backend/run_payloads.py |
| Queue | Run is stored with pending output/trace records and worker-visible state. | apps/backend/repositories/runs.py, supabase/migrations/* |
| Claim | Worker claims a lease or the MCP session starts a run-scoped tool workflow. | apps/backend/api/routers/worker.py, apps/backend/services/workers.py, apps/backend/workers/runner.py |
| Execute | Final-state execution happens through MCP tools and ClawBench-controlled providers. Hosted curated BenchFlow remains useful groundwork. | apps/backend/workers/runner.py, future mcp/sandbox modules |
| Finalize | Scores, metrics, events, artifacts, and trace summaries are imported and redacted for public views. | apps/backend/services/traces.py, apps/backend/services/standardized_metrics.py, apps/backend/public_redaction.py |
| Publish | Public leaderboard visibility is positive-gated. MCP runs stay dev-only until publishability gates pass. | apps/backend/services/benchmarks.py, apps/backend/lanes.py, apps/frontend/main.ts |
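For a concrete feel of the Create phase, a minimal sketch of a run-creation request follows. The endpoint path, field names, and auth scheme here are assumptions for illustration, not the actual contract defined in apps/backend/api/routers/runs.py.

```python
import httpx

# Illustrative only: the endpoint path, field names, and auth scheme are
# assumptions, not the real contract in apps/backend/api/routers/runs.py.
def create_run(base_url: str, token: str) -> dict:
    response = httpx.post(
        f"{base_url}/runs",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "benchmark": "swe-bench-verified",   # must be in the public catalog
            "submission_mode": "mcp",            # hypothetical mode name
            "agent_name": "Tester_agent",        # public participant identity
            "metadata": {"batch": "local-canary"},
        },
    )
    response.raise_for_status()
    return response.json()  # assumed to include the new run_id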
- agents.md and CLAUDE.md: agent behavior, benchmark catalog control, Linear ticket lifecycle.
- README.md: local app setup, deployment entrypoint, environment inventory.
- docs/repo-system-map.html: this master document.
- docs/engineer-onboarding.html: new engineer and agent onboarding pack.
- apps/backend/main.py: FastAPI application assembly.
- apps/backend/config.py: runtime settings and env parsing.
- apps/backend/api/routers/*: public/admin/worker route contracts.
- apps/backend/schemas/*: request and response models.
- apps/backend/services/benchmarks.py: public benchmark catalog and config validation.
- apps/backend/services/runs.py: run creation and lifecycle orchestration.
- apps/backend/services/workers.py: worker claim, heartbeat, finalize, and tick control.
- apps/backend/repositories/*: asyncpg persistence layer.
- apps/backend/workers/runner.py: current worker shell and BenchFlow/hosted-run import surface.
- apps/backend/workers/tick.py: operational worker tick entrypoint.
- apps/backend/services/traces.py: trace retrieval and public trace shaping.
- integrations/harbor-claw-bench: Harbor adapter integration work.
- apps/frontend/main.ts: route rendering, benchmark pages, traces, dashboards.
- apps/frontend/routes.ts: route helpers and path constants.
- apps/frontend/environment.ts: frontend API base selection.
- apps/frontend/competitionCuration.ts: home/grid benchmark catalog entries.
- apps/frontend/styles.css: global app styles.
- supabase/migrations/*: production schema source of truth.
- apps/backend/sql/*: local/schema reference SQL.
- scripts/deploy-cloud-run.sh: Cloud Run deploy entrypoint.
- scripts/generate-sitemap.js: static sitemap generation.
- scripts/*benchmark*, scripts/*terminal*, and scripts/*web_tasks*: canary and corpus tooling, not public proof by themselves.

This map summarizes the local KAN-139 through KAN-160 buildout. It is based on the repo handoff and roadmap artifacts; check Linear before moving issue status or assigning owners.
| Ticket | Architecture Lane | Role In The Buildout |
|---|---|---|
| KAN-139 | Worker + sandbox routing | Routes MCP tool calls through ClawBench-controlled providers, including BenchFlowDockerSandboxProvider. |
| KAN-140 | Validation | Contract, auth, reconnect, provider preflight, browser policy, redaction, and publishability tests. |
| KAN-141 | Real canaries | Claude Code v1 proof canaries for answer-style and SWE-Bench Verified MCP runs. |
| KAN-142 | Onboarding + closure | Docs, runbooks, CI/CD evidence, and Linear/GitHub closure after canary proof exists. |
| KAN-143 | Runtime foundation | Hosted curated SWE and Docker/BenchFlow baseline groundwork; not MCP custom-agent sign-off by itself. |
| KAN-144 | SWE task materialization | Exports official SWE-Bench Verified rows into runnable task directories; scoring parity still needs proof. |
| KAN-145 | Web Tasks browser tools | Sandboxed browser sessions, proxy policy, URL policy, screenshots, and provenance for Web Tasks. |
| KAN-146 | Terminal Bench materialization | Materializes Terminal Bench tasks and environment requirements for BenchFlow-compatible execution. |
| KAN-147 | Entry Test decision | Decides whether ClawBench Entry Test needs BenchFlow materialization or stays outside this path. |
| KAN-148-150 | Non-integration | Local roadmap marks these as unrelated SEO work; do not count them toward MCP/BenchFlow readiness. |
| KAN-151 | Superseded bridge | Retires local bridge, ngrok, endpoint registration, custom WebSocket proxy, and bridge states from v1. |
| KAN-152 | Publishability + abuse controls | Dev-only defaults, leaderboard gates, usage caps, wall-time limits, queueing, and anti-cheat policy. |
| KAN-153 | SWE MCP tool implementation | Exposes HostedSweToolSession.handle_call through MCP instead of duplicating SWE tooling. |
| KAN-154 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-155 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-156 | MCP server core | Backend-hosted Streamable HTTP MCP server with run-scoped tools and client_call_id dedupe (sketched below). |
| KAN-157 | Auth + scoped sessions | OAuth/device auth, read/run scopes, ownership, reconnect behavior, and onboarding copy. |
| KAN-158 | Phase 0 MCP spike | Proves Streamable HTTP MCP with Claude Code before the full server/tool implementation. |
| KAN-159 | Cleanup | Audits and retires pre-MCP bridge/proxy scaffolding after KAN-156 lands. |
| KAN-160 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
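One contract detail from the table above is worth illustrating: the client_call_id dedupe from KAN-156. The sketch below shows one way a server could make tool calls idempotent on that key; the in-memory cache and function names are assumptions, since the real implementation belongs to the MCP server core.

```python
# Idempotency sketch for client_call_id dedupe. Assumption: the real KAN-156
# server persists results durably; an in-memory dict stands in here.
from typing import Callable

_completed_calls: dict[tuple[str, str], dict] = {}

def handle_tool_call(
    run_id: str,
    client_call_id: str,
    execute: Callable[[], dict],
) -> dict:
    key = (run_id, client_call_id)
    if key in _completed_calls:
        # A retried call with the same dedupe key returns the stored result
        # instead of re-running the tool inside the sandbox.
        return _completed_calls[key]
    result = execute()
    _completed_calls[key] = result
    return result
```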
The complete public benchmark catalog is limited to Terminal Bench, SWE-Bench Verified, ClawBench Entry Test, and Web Tasks Benchmark. Do not seed, surface, or document new benchmark families without explicit owner permission that names the benchmark.
ClawBench agent names are public participant identities. For local/prod tests use Tester_agent. For Codex/GPT-5.5 runs from this workspace use Codex. Put batch labels, model IDs, and canary names in metadata, not public agent names.
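A small sketch of that split, with illustrative field names rather than the actual run payload schema:

```python
# Illustrative split between public identity and private run metadata.
# Field names are assumptions, not the actual run payload schema.
run_payload = {
    "agent_name": "Codex",  # public participant identity only
    "metadata": {
        "batch": "canary-batch-01",  # batch label stays out of the public name
        "model_id": "gpt-5.5",       # model ID lives in metadata
        "canary": "swe-canary-a",    # canary name lives in metadata
    },
}
```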
Do not use oracle mode for user-requested benchmark runs or public benchmark claims. Public claims must use real agent trajectories and durable trace evidence.
Tickets need Objective, scoped execution details, References, GitHub metadata, validation evidence, PR links, and CI/CD evidence before Done. In Progress tickets must not go stale.
| Need | Where To Look | Notes |
|---|---|---|
| Local app | README.md, package.json, pyproject.toml | Frontend is Vite. Backend is FastAPI. Database is Postgres/Supabase-compatible. |
| Production deploy | scripts/deploy-cloud-run.sh, Dockerfile, vercel.json | Cloud Run owns backend/API/worker deployment. Vercel config exists for the static frontend path. |
| Database changes | supabase/migrations | Use migrations as source of truth. Validate locally before applying to production. |
| Trace and leaderboard debugging | apps/backend/services/traces.py, apps/frontend/main.ts, tests/test_traces_access_and_scores.py | Check both API payloads and frontend rendering before claiming a public result works. |
| Docs publication | .claude/skills/cloudflare-pages-publisher, ~/.codex/skills/cloudflare-pages-publisher | Deploy only current docs. Do not keep old reports in the Cloudflare Pages bundle. |