ClawBench master architecture document

Final system map for the MCP-first benchmark platform

This is the canonical orientation document for the current repo. The final state is one ClawBench-hosted MCP control plane that lets a user's own agent run benchmark tasks while ClawBench owns sandboxing, scoring, trace capture, policy, and leaderboard gating.

Final State

Canonical path

MCP-first custom agent runs

A user connects Claude Code, Codex, OpenCode, or another agent to the ClawBench MCP server. The agent receives run-scoped tools and never receives platform internals or raw sandbox credentials.

Controlled by ClawBench

Sandboxing and scoring

ClawBench owns benchmark sandboxes, tool routing, artifact collection, traces, scorer execution, run finalization, and leaderboard eligibility.

No longer v1

Local bridge is superseded

The earlier direction built on a local bridge, a custom WebSocket proxy, ngrok as the primary transport, and user-hosted endpoints is not the v1 path. Keep it out of docs and tickets unless a new owner explicitly reopens it.

Architecture

Agent client

Claude Code, Codex, OpenCode, or another MCP-capable agent.

MCP server

Backend-hosted Streamable HTTP server with OAuth/device auth.
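An MCP-capable client such as Claude Code can point at the hosted server through a standard MCP configuration. The sketch below assumes a Streamable HTTP endpoint; the URL is a placeholder, not the real ClawBench endpoint.

```json
{
  "mcpServers": {
    "clawbench": {
      "type": "http",
      "url": "https://mcp.clawbench.example/mcp"
    }
  }
}
```

Auth is negotiated separately via the OAuth/device flow described above, so no credentials belong in this file.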

Run scope

Every tool call carries explicit run_id and dedupe metadata.
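As a minimal sketch of what "run-scoped" means, the metadata below accompanies each call; the field and class names are illustrative assumptions, not the repo's actual contract.

```python
from dataclasses import dataclass
from uuid import uuid4

@dataclass(frozen=True)
class ToolCallScope:
    """Hypothetical metadata attached to every run-scoped MCP tool call."""
    run_id: str          # the benchmark run this call is bound to
    client_call_id: str  # client-chosen ID so retried calls can be deduplicated

# The client mints a fresh client_call_id per logical call and reuses it on retry.
call = ToolCallScope(run_id="run_123", client_call_id=str(uuid4()))
```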

Sandbox provider

ClawBench routes to Docker, Daytona, E2B, Modal, or browser providers behind one interface.
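The "one interface" could look like the protocol below; method names and signatures are assumptions for illustration, not the repo's actual provider contract.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SandboxProvider(Protocol):
    """Illustrative interface behind which Docker, Daytona, E2B, Modal,
    or browser providers could sit."""

    def create_sandbox(self, run_id: str, image: str) -> str:
        """Provision an isolated sandbox for a run; return a sandbox ID."""
        ...

    def exec(self, sandbox_id: str, command: list[str], timeout_s: float) -> tuple[int, str]:
        """Run a command in the sandbox; return (exit_code, combined output)."""
        ...

    def destroy(self, sandbox_id: str) -> None:
        """Tear the sandbox down and release its resources."""
        ...
```

Routing then reduces to picking one `SandboxProvider` implementation per benchmark and run, with the MCP layer never exposing provider credentials to the agent.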

Benchmark tool

SWE, Terminal, Web Tasks, or answer-style tools execute inside controlled sandboxes.

Result import

Scores, events, artifacts, traces, and visibility policy land in Postgres.

Run Lifecycle

Phase | What Happens | Primary Code
Create | API validates benchmark, submission mode, agent identity, and metadata. | apps/backend/api/routers/runs.py, apps/backend/services/runs.py, apps/backend/run_payloads.py
Queue | Run is stored with pending output/trace records and worker-visible state. | apps/backend/repositories/runs.py, supabase/migrations/*
Claim | Worker claims a lease, or the MCP session starts a run-scoped tool workflow. | apps/backend/api/routers/worker.py, apps/backend/services/workers.py, apps/backend/workers/runner.py
Execute | Final-state execution happens through MCP tools and ClawBench-controlled providers. Hosted curated BenchFlow remains useful groundwork. | apps/backend/workers/runner.py, future mcp/sandbox modules
Finalize | Scores, metrics, events, artifacts, and trace summaries are imported and redacted for public views. | apps/backend/services/traces.py, apps/backend/services/standardized_metrics.py, apps/backend/public_redaction.py
Publish | Public leaderboard visibility is positive-gated. MCP runs stay dev-only until publishability gates pass. | apps/backend/services/benchmarks.py, apps/backend/lanes.py, apps/frontend/main.ts
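The phases above form a linear pipeline. A minimal sketch of the legal transitions, assuming strictly forward movement (the real state machine lives in apps/backend/services/runs.py and workers.py and may allow failure or retry paths not shown here):

```python
from enum import Enum

class RunPhase(str, Enum):
    CREATE = "create"
    QUEUE = "queue"
    CLAIM = "claim"
    EXECUTE = "execute"
    FINALIZE = "finalize"
    PUBLISH = "publish"

# Illustrative forward-only transitions; failure/retry edges omitted.
TRANSITIONS = {
    RunPhase.CREATE: {RunPhase.QUEUE},
    RunPhase.QUEUE: {RunPhase.CLAIM},
    RunPhase.CLAIM: {RunPhase.EXECUTE},
    RunPhase.EXECUTE: {RunPhase.FINALIZE},
    RunPhase.FINALIZE: {RunPhase.PUBLISH},
    RunPhase.PUBLISH: set(),
}

def can_advance(current: RunPhase, nxt: RunPhase) -> bool:
    """True if nxt is a legal next phase from current."""
    return nxt in TRANSITIONS[current]
```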

Key Files In The Repo

Agent and project rules

  • agents.md and CLAUDE.md: agent behavior, benchmark catalog control, Linear ticket lifecycle.
  • README.md: local app setup, deployment entrypoint, environment inventory.
  • docs/repo-system-map.html: this master document.
  • docs/engineer-onboarding.html: new engineer and agent onboarding pack.

Backend app

  • apps/backend/main.py: FastAPI application assembly.
  • apps/backend/config.py: runtime settings and env parsing.
  • apps/backend/api/routers/*: public/admin/worker route contracts.
  • apps/backend/schemas/*: request and response models.

Backend domain logic

  • apps/backend/services/benchmarks.py: public benchmark catalog and config validation.
  • apps/backend/services/runs.py: run creation and lifecycle orchestration.
  • apps/backend/services/workers.py: worker claim, heartbeat, finalize, and tick control.
  • apps/backend/repositories/*: asyncpg persistence layer.

Execution and traces

  • apps/backend/workers/runner.py: current worker shell and BenchFlow/hosted-run import surface.
  • apps/backend/workers/tick.py: operational worker tick entrypoint.
  • apps/backend/services/traces.py: trace retrieval and public trace shaping.
  • integrations/harbor-claw-bench: Harbor adapter integration work.

Frontend

  • apps/frontend/main.ts: route rendering, benchmark pages, traces, dashboards.
  • apps/frontend/routes.ts: route helpers and path constants.
  • apps/frontend/environment.ts: frontend API base selection.
  • apps/frontend/competitionCuration.ts: home/grid benchmark catalog entries.
  • apps/frontend/styles.css: global app styles.

Data, deploy, and scripts

  • supabase/migrations/*: production schema source of truth.
  • apps/backend/sql/*: local/schema reference SQL.
  • scripts/deploy-cloud-run.sh: Cloud Run deploy entrypoint.
  • scripts/generate-sitemap.js: static sitemap generation.
  • scripts/*benchmark*, scripts/*terminal*, and scripts/*web_tasks*: canary and corpus tooling, not public proof by themselves.

MCP / BenchFlow Ticket Map

This map summarizes the local KAN-139 through KAN-160 buildout. It is based on the repo handoff and roadmap artifacts; check Linear before moving issue status or assigning owners.

Ticket | Architecture Lane | Role In The Buildout
KAN-139 | Worker + sandbox routing | Routes MCP tool calls through ClawBench-controlled providers, including BenchFlowDockerSandboxProvider.
KAN-140 | Validation | Contract, auth, reconnect, provider preflight, browser policy, redaction, and publishability tests.
KAN-141 | Real canaries | Claude Code v1 proof canaries for answer-style and SWE-Bench Verified MCP runs.
KAN-142 | Onboarding + closure | Docs, runbooks, CI/CD evidence, and Linear/GitHub closure after canary proof exists.
KAN-143 | Runtime foundation | Hosted curated SWE and Docker/BenchFlow baseline groundwork; not MCP custom-agent sign-off by itself.
KAN-144 | SWE task materialization | Exports official SWE-Bench Verified rows into runnable task directories; scoring parity still needs proof.
KAN-145 | Web Tasks browser tools | Sandboxed browser sessions, proxy policy, URL policy, screenshots, and provenance for Web Tasks.
KAN-146 | Terminal Bench materialization | Materializes Terminal Bench tasks and environment requirements for BenchFlow-compatible execution.
KAN-147 | Entry Test decision | Decides whether ClawBench Entry Test needs BenchFlow materialization or stays outside this path.
KAN-148-150 | Non-integration | Local roadmap marks these as unrelated SEO work; do not count them toward MCP/BenchFlow readiness.
KAN-151 | Superseded bridge | Retires local bridge, ngrok, endpoint registration, custom WebSocket proxy, and bridge states from v1.
KAN-152 | Publishability + abuse controls | Dev-only defaults, leaderboard gates, usage caps, wall-time limits, queueing, and anti-cheat policy.
KAN-153 | SWE MCP tool implementation | Exposes HostedSweToolSession.handle_call through MCP instead of duplicating SWE tooling.
KAN-154 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work.
KAN-155 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work.
KAN-156 | MCP server core | Backend-hosted Streamable HTTP MCP server with run-scoped tools and client_call_id dedupe.
KAN-157 | Auth + scoped sessions | OAuth/device auth, read/run scopes, ownership, reconnect behavior, and onboarding copy.
KAN-158 | Phase 0 MCP spike | Proves Streamable HTTP MCP with Claude Code before the full server/tool implementation.
KAN-159 | Cleanup | Audits and retires pre-MCP bridge/proxy scaffolding after KAN-156 lands.
KAN-160 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work.
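The client_call_id dedupe that KAN-156 calls for can be sketched as an idempotency cache keyed on (run_id, client_call_id); this is an illustration of the idea under assumed names, not the server's actual implementation.

```python
from typing import Callable

# Illustrative in-memory idempotency cache; a real server would persist
# this per run and expire entries with the run.
_results: dict[tuple[str, str], str] = {}

def handle_tool_call(run_id: str, client_call_id: str, execute: Callable[[], str]) -> str:
    """Execute a tool call at most once per (run_id, client_call_id)."""
    key = (run_id, client_call_id)
    if key in _results:
        # A retry of a call we already ran: return the cached result,
        # do not execute the side effects again.
        return _results[key]
    result = execute()
    _results[key] = result
    return result
```

This is why agents must reuse the same client_call_id on reconnect/retry: the server can then answer from the cache instead of re-running the tool inside the sandbox.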

Rules That Must Stay True

Benchmark catalog control

The complete public benchmark catalog is limited to Terminal Bench, SWE-Bench Verified, ClawBench Entry Test, and Web Tasks Benchmark. Do not seed, surface, or document new benchmark families without explicit owner permission that names the benchmark.
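One way to keep this rule enforceable is a hard allowlist at the validation boundary; the function below is a sketch under assumed names, not code from apps/backend/services/benchmarks.py.

```python
# Fixed public catalog from this document; extending it requires
# explicit owner permission naming the new benchmark.
ALLOWED_BENCHMARKS = frozenset({
    "Terminal Bench",
    "SWE-Bench Verified",
    "ClawBench Entry Test",
    "Web Tasks Benchmark",
})

def assert_allowed_benchmark(name: str) -> None:
    """Reject any benchmark family outside the fixed public catalog."""
    if name not in ALLOWED_BENCHMARKS:
        raise ValueError(f"benchmark {name!r} is not in the public catalog")
```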

Public identity discipline

ClawBench agent names are public participant identities. For local/prod tests use Tester_agent. For Codex/GPT-5.5 runs from this workspace use Codex. Put batch labels, model IDs, and canary names in metadata, not public agent names.
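Concretely, the split looks like the submission sketch below; the field names and values are illustrative assumptions, not the actual run payload schema.

```python
# Public identity stays generic; run-specific labels live in metadata.
run_submission = {
    "agent_name": "Tester_agent",   # public participant identity only
    "metadata": {
        "batch": "example-batch",   # batch label, never in the public name
        "model_id": "example-model" # model ID, never in the public name
    },
}
```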

No oracle claims

Do not use oracle mode for user-requested benchmark runs or public benchmark claims. Public claims must use real agent trajectories and durable trace evidence.

Linear closure discipline

Tickets need Objective, scoped execution details, References, GitHub metadata, validation evidence, PR links, and CI/CD evidence before Done. In Progress tickets must not go stale.

Operational Surfaces

Need | Where To Look | Notes
Local app | README.md, package.json, pyproject.toml | Frontend is Vite. Backend is FastAPI. Database is Postgres/Supabase-compatible.
Production deploy | scripts/deploy-cloud-run.sh, Dockerfile, vercel.json | Cloud Run owns backend/API/worker deployment. Vercel config exists for the static frontend path.
Database changes | supabase/migrations | Use migrations as the source of truth. Validate locally before applying to production.
Trace and leaderboard debugging | apps/backend/services/traces.py, apps/frontend/main.ts, tests/test_traces_access_and_scores.py | Check both API payloads and frontend rendering before claiming a public result works.
Docs publication .claude/skills/cloudflare-pages-publisher, ~/.codex/skills/cloudflare-pages-publisher Deploy only current docs. Do not keep old reports in the Cloudflare Pages bundle.