ClawBench engineering onboarding

How new engineers and agents work in this repo

This is the single starting point for goals, architecture, tickets, PRs, verification, test coverage, and the agent rules that keep ClawBench implementation disciplined.

Project Goals

MVP 1

Benchmark local agents live

Users run their own agents from local machines against ClawBench tasks. ClawBench scores the runs, publishes traces and leaderboards, and turns that evidence into shareable competitive content.

Roadmap

Improve agents on tasks

The roadmap tab should show what exists competitively now and what comes next: agent harness engineering, custom task onboarding, and task-specific improvement loops.

TBA

AI consulting surface

Consulting is a future track. Do not present it as shipped or guaranteed. It may be referenced as "to be announced" until the owners decide the positioning.

Custom tasks and custom benchmarks are roadmap work. Examples such as browsing LinkedIn or X/Twitter through OpenClaw should stay internal or clearly experimental until the owner approves a named public benchmark family.

Product Surface | Purpose | Engineering Implication
MVP 1 competition | Benchmark user-run local agents, score them live, make results competitive, and generate public evidence/content. | Prioritize stable registration, run submission, scoring, traces, leaderboards, sharing, and result integrity.
Roadmap tab | Show what exists today, where agents fail, and which improvements are planned next. | Separate current competitive proof from future harness engineering. Do not blur shipped and planned work.
Phase 1.5 harness work | Onboard custom tasks, create owner-approved custom benchmarks, and support requested workflows such as browser tasks through OpenClaw. | Add task ingestion, harness contracts, verifier evidence, sandbox routing, and canary coverage before public exposure.
AI consulting | Future commercial surface tied to agent evaluation and improvement. | Keep as TBA until product owners define offer, scope, and claims.

Architecture Diagram

Diagram summary (see docs/repo-system-map.html for the rendered version): an Engineer works through Linear, PRs, and reviews; an Agent Owner runs a local agent runner; GitHub handles branches, PRs, and CI. The Frontend is the Vite public app. The FastAPI Backend contains routers, services, and repos plus the MCP server (KAN-156/157), backed by Supabase (Postgres, migrations). The Worker Layer claims, runs, and finalizes jobs through BenchFlow (ACP, sandbox, verifier), which feeds Leaderboards (scores, lanes, Elo), Traces (evidence, artifacts), the Roadmap Tab (gaps, priorities), Harness R&D (custom tasks, MCP), and the Improvement Loop (better agents). KAN-139-160 buildout on this architecture: 139/153 routing + SWE tools, 140/141 validation + canaries, 145 browser tools, 152 publishability, 156/157 MCP + auth, 158 spike, 142/159 docs + cleanup, 143/144/146/147 materialization, 151 bridge retired, 148-150 non-integration, 154/155/160 check Linear.

Current runtime is FastAPI, Postgres/Supabase, Vite, GitHub Actions, Cloud Run, and Vercel. The final benchmark control plane is MCP-first, with ClawBench owning sandboxing, scoring, trace capture, and public visibility gates.

How BenchFlow Works Here

Boundary

BenchFlow is the ACP benchmark runner. It creates task environments, launches the selected agent adapter, passes prompts, captures trajectories, runs verifiers, and writes artifacts.

ClawBench stays the product surface: public agents, run IDs, benchmark IDs, leaderboards, traces, redaction, visibility, and ticket/PR discipline.

Run contract

Use execution_mode: benchflow_acp for BenchFlow-backed runs. Keep public ClawBench agent names separate from BenchFlow adapter names such as claude-agent-acp, codex-acp, opencode, and openclaw.

Import result.json rewards, verifier errors, timing, and trajectory/acp_trajectory.jsonl into ClawBench outputs, events, traces, and score fields.
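The import step above can be sketched as a small mapping function. This is an illustrative assumption, not the real ClawBench schema: the field names (`score`, `verifier_errors`, `wall_time_s`, `trace_events`) and the exact shape of `result.json` are placeholders to show the flow from the BenchFlow jobs directory into ClawBench run fields.

```python
import json
from pathlib import Path

def import_benchflow_result(jobs_dir: str) -> dict:
    """Map a BenchFlow result directory onto ClawBench run fields.

    Field names here are illustrative assumptions, not the real schema.
    """
    root = Path(jobs_dir)
    result = json.loads((root / "result.json").read_text())

    run_update = {
        "score": result.get("reward"),  # assumed reward field in result.json
        "verifier_errors": result.get("errors", []),
        "wall_time_s": result.get("timing", {}).get("wall_time"),
        "trace_events": [],
    }

    # acp_trajectory.jsonl holds one JSON event per line; each becomes a trace row.
    trajectory = root / "trajectory" / "acp_trajectory.jsonl"
    if trajectory.exists():
        with trajectory.open() as fh:
            run_update["trace_events"] = [
                json.loads(line) for line in fh if line.strip()
            ]
    return run_update
```

The real importer must also apply redaction and score normalization before anything reaches public traces or leaderboards.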

Step | BenchFlow Owns | ClawBench Owns
Task materialization | task.toml, instruction.md, environment/, tests/, optional solution. | Allowed benchmark catalog, task provenance, public benchmark identity, and owner approval.
Agent execution | ACP adapter startup, sandbox backend, prompt delivery, verifier execution, result directory. | Run queue, public agent identity, metadata, quotas, and no-oracle public-claim policy.
Artifact import | result.json, rewards, errors, logs, timing, ACP trajectory, verifier output. | Redaction, trace rows, score normalization, leaderboard eligibility, and public evidence.
execution_mode: benchflow_acp
benchflow_agent: claude-agent-acp
benchflow_model: claude-haiku-4-5-20251001
benchflow_environment: docker
benchflow_jobs_dir: /tmp/clawbench-benchflow/<run_id>

The first provider in the MCP-first plan is BenchFlowDockerSandboxProvider. Daytona, E2B, and Modal remain future providers until preflight and provider contract tests pass.
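A minimal sketch of what a provider contract with a preflight gate could look like. The interface below is an assumption for illustration: the real `SandboxProvider` contract and the actual `BenchFlowDockerSandboxProvider` live in the worker layer, and a real preflight would ping the Docker daemon and check image policy rather than return a stub.

```python
from abc import ABC, abstractmethod

class SandboxProvider(ABC):
    """Illustrative provider contract; the real interface lives in the worker layer."""

    name: str

    @abstractmethod
    def preflight(self) -> bool:
        """Return True only when the backend is reachable and policy checks pass."""

    @abstractmethod
    def run_task(self, task_dir: str) -> dict:
        """Execute one materialized task and return its result payload."""

class BenchFlowDockerSandboxProvider(SandboxProvider):
    name = "benchflow_docker"

    def preflight(self) -> bool:
        # Stub: a real check would verify the Docker daemon and image policy.
        return True

    def run_task(self, task_dir: str) -> dict:
        if not self.preflight():
            # Fail closed: never route a run through an unverified provider.
            raise RuntimeError("preflight failed: refusing to route run")
        return {"status": "completed", "task_dir": task_dir}
```

The point of the shape is the gate: future providers (Daytona, E2B, Modal) plug in behind the same contract, and none becomes routable until its preflight and contract tests pass.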

Get The Repo Started

Install and run locally

npm install
uv sync

docker run --name clawbench-pg \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  -p 54322:5432 -d postgres:16

cat apps/backend/sql/001_initial_schema.sql \
  | docker exec -i clawbench-pg psql -U postgres -d postgres

Run app surfaces

DATABASE_URL=postgresql://postgres:postgres@127.0.0.1:54322/postgres \
FASTAPI_BASE_URL=http://127.0.0.1:8080 \
CLAWBENCH_PUBLIC_APP_ORIGIN=http://localhost:3000 \
CLAWBENCH_ADMIN_TOKEN=replace-me \
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8080 --reload

VITE_API_BASE_URL=http://127.0.0.1:8080/api/v1 npm run dev
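Once both surfaces are up, a quick smoke check confirms they respond. The endpoint paths below are assumptions for illustration (`/api/v1/health` may not be the real health route; confirm against the backend routers before relying on it):

```python
import urllib.error
import urllib.request

def smoke_targets(api_base: str = "http://127.0.0.1:8080") -> list[tuple[str, int]]:
    """URLs a local smoke check hits, paired with the expected status."""
    return [
        (f"{api_base}/api/v1/health", 200),  # assumed health route
        ("http://localhost:3000/", 200),     # Vite dev server
    ]

def check(url: str, expected: int, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with the expected status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expected
    except urllib.error.HTTPError as exc:
        return exc.code == expected
    except OSError:
        return False  # connection refused, DNS failure, timeout

if __name__ == "__main__":
    for url, expected in smoke_targets():
        print(f"{url}: {'ok' if check(url, expected) else 'FAIL'}")
```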

Read README.md before changing setup or deployment. Read apps/backend/ARCHITECTURE.md before changing backend boundaries.

Linear To PR Workflow

Step | What Engineers Do | Verification
Create ticket | Add Objective, scoped details, references, GitHub section, definition of done, and expected validation. | Ticket is clear enough for another engineer or agent to execute without private context.
Start work | Move Linear to In Progress only when implementation starts. Create a kan-<id>-<slug> branch. | Progress comment links the branch and restates the first milestone.
Implement | Change only files required by the ticket. Keep architecture boundaries intact. Update tests with the behavior. | Targeted tests pass locally, and the diff maps directly to ticket scope.
Open PR | Link the PR in Linear, move the ticket to In Review, and include validation evidence and screenshots where relevant. | GitHub Actions starts, and the PR description explains risk and coverage.
Merge and close | After merge, verify production CI/CD or deploy checks, then move Linear to Done with closure links. | Final comment includes the merged PR, CI/CD result, validation evidence, and follow-up tickets if any.

KAN-139 To KAN-160 Build Map

Linear MCP was unavailable during this refresh, so this table is based on the local coordinator handoff, roadmap artifact, and worktree evidence. Check Linear before moving status or assigning ownership.

Ticket | Architecture Lane | Role In The Buildout
KAN-139 | Worker + sandbox routing | Routes MCP tool calls through ClawBench-controlled providers, including BenchFlowDockerSandboxProvider.
KAN-140 | Validation | Contract, auth, reconnect, provider preflight, browser policy, redaction, and publishability tests.
KAN-141 | Real canaries | Claude Code v1 proof canaries for answer-style and SWE-Bench Verified MCP runs.
KAN-142 | Onboarding + closure | Docs, runbooks, CI/CD evidence, and Linear/GitHub closure after canary proof exists.
KAN-143 | Runtime foundation | Hosted curated SWE and Docker/BenchFlow baseline groundwork; not MCP custom-agent sign-off by itself.
KAN-144 | SWE task materialization | Exports official SWE-Bench Verified rows into runnable task directories; scoring parity still needs proof.
KAN-145 | Web Tasks browser tools | Sandboxed browser sessions, proxy policy, URL policy, screenshots, and provenance for Web Tasks.
KAN-146 | Terminal Bench materialization | Materializes Terminal Bench tasks and environment requirements for BenchFlow-compatible execution.
KAN-147 | Entry Test decision | Decides whether the ClawBench Entry Test needs BenchFlow materialization or stays outside this path.
KAN-148-150 | Non-integration | The local roadmap marks these as unrelated SEO work; do not count them toward MCP/BenchFlow readiness.
KAN-151 | Superseded bridge | Retires the local bridge, ngrok, endpoint registration, custom WebSocket proxy, and bridge states from v1.
KAN-152 | Publishability + abuse controls | Dev-only defaults, leaderboard gates, usage caps, wall-time limits, queueing, and anti-cheat policy.
KAN-153 | SWE MCP tool implementation | Exposes HostedSweToolSession.handle_call through MCP instead of duplicating SWE tooling.
KAN-154 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work.
KAN-155 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work.
KAN-156 | MCP server core | Backend-hosted Streamable HTTP MCP server with run-scoped tools and client_call_id dedupe.
KAN-157 | Auth + scoped sessions | OAuth/device auth, read/run scopes, ownership, reconnect behavior, and onboarding copy.
KAN-158 | Phase 0 MCP spike | Proves Streamable HTTP MCP with Claude Code before the full server/tool implementation.
KAN-159 | Cleanup | Audits and retires pre-MCP bridge/proxy scaffolding after KAN-156 lands.
KAN-160 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work.

Useful Agent Rules

agents.md

Shared working agreement

  • Start from Linear tickets and use PRs for implementation.
  • Respect benchmark catalog control and public run identity.
  • Keep architecture changes inside router, service, repository, worker, and migration boundaries.
  • Verify before claiming completion.
CLAUDE.md

Behavioral guardrails

  • Think before coding and surface uncertainty.
  • Prefer minimum code and no speculative abstractions.
  • Make surgical changes and leave unrelated code alone.
  • Transform work into verifiable goals and checks.
Design decisions

How to adhere

Use docs/repo-system-map.html for final-state architecture, apps/backend/ARCHITECTURE.md for backend boundaries, and the current Linear ticket for local scope. If they conflict, stop and ask the owner to resolve it.

Infrastructure

How to adhere

Use Supabase migrations for database changes, GitHub Actions for CI/CD, Cloud Run for backend/API deployment, Vercel for frontend deployment, and Cloudflare Pages only for static docs/reports.

Test Suite And Coverage

Commands

  • npm test: Vitest frontend and TypeScript contract tests.
  • npm run build: sitemap generation plus Vite production build.
  • npx tsc --noEmit: TypeScript type check when TS surfaces change.
  • uv run pytest: Python service/API/repository tests.
  • python -m py_compile $(find apps/backend scripts -name '*.py' -print): Python syntax check.

CI coverage

.github/workflows/deploy.yml runs npm test, npm run build, selected Python onboarding/agent tests, Python compile checks, Supabase migration validation, and production deploy jobs on main.

Area | Representative Tests | What They Protect
Frontend routes and rendering | routes.test.ts, challenge_page_rendering.test.ts, live_arena_index.test.ts | SPA routes, public competition pages, live arena copy, and removed legacy route behavior.
Benchmark catalog and lanes | competition_curation.test.ts, test_benchmark_discovery_filtering.py, test_competitions_benchmark_lanes.py | Approved public benchmark formats, lane grouping, hidden benchmark filtering, and fallback catalog control.
Agent onboarding and accounts | home_onboarding.test.ts, test_onboarding_contract.py, test_agent_registration_service.py, test_account_route.py | Agent registration, claim/enrollment flows, human session access, account dashboard behavior, and skill docs.
Runs, workers, and traces | test_runs_contract.py, test_run_reconciliation.py, test_worker_claim_controls.py, test_traces_access_and_scores.py | Benchmark-only run submission, fail-closed runner behavior, worker safety controls, public trace access, and score enrichment.
Security and public data | test_public_response_redaction.py, test_public_schema_rls.py, test_backend_cors_origins.py | Secret redaction, public schema RLS expectations, and CORS origin handling.
Leaderboards and ratings | test_elo_ratings.py, test_global_leaderboard_lanes.py, leaderboard_rendering.test.ts | Score-first leaderboard output, Elo ordering, confidence intervals, and frontend threshold behavior.
Observability and publishing | posthog_analytics.test.ts, test_posthog_error_tracking.py, sentry_frontend.test.ts, seo_crawlability.test.ts | Analytics init, exception capture, Sentry setup, robots/sitemap coverage, and crawlable public pages.

Known coverage gap: full public benchmark execution is intentionally fail-closed until the runner/MCP path is wired and verified with real canaries. Do not claim production benchmark execution from this repo without new end-to-end evidence.
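The fail-closed posture described above reduces to a simple pattern: refuse loudly rather than succeed silently. A minimal sketch, assuming a hypothetical feature flag (the real gate in the runner code may be named and wired differently):

```python
import os

class RunnerDisabled(Exception):
    """Raised when public benchmark execution is attempted before it is wired."""

# CLAWBENCH_PUBLIC_RUNNER_ENABLED is an assumed flag name for illustration.
def submit_public_run(run_id: str) -> str:
    if os.environ.get("CLAWBENCH_PUBLIC_RUNNER_ENABLED") != "1":
        # Fail closed: reject the run instead of pretending it executed.
        raise RunnerDisabled(f"public execution disabled; run {run_id} rejected")
    return f"run {run_id} queued"
```

The key property is that the default is rejection: nothing reaches a public leaderboard unless someone has explicitly turned the path on after end-to-end canary evidence exists.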

Find Work And History

Find what needs doing

  • Start in Linear Backlog and In Progress queues.
  • Use parent/child relationships around program tickets such as MCP custom-agent work.
  • Read the Roadmap tab for product direction and visible gaps.
  • Read docs/repo-system-map.html before changing architecture.
  • Search local plans with rg --files docs/superpowers/plans.

Find what already happened

  • Check Linear Done tickets for closure comments and validation evidence.
  • Search merged PRs and branches by KAN ID.
  • Use git log --oneline --grep KAN- when commit messages include ticket IDs.
  • Check GitHub Actions for post-merge deploy and migration results.
  • Use trace and leaderboard tests to understand what behavior is already protected.

Create And Publish HTML Docs

Step | Rule | Current Command Or File
Author | Put canonical docs under docs/. Avoid stale report pages and outdated benchmark families. | docs/repo-system-map.html, docs/engineer-onboarding.html
Stage | Copy only current docs into a temporary Cloudflare Pages bundle. Do not deploy the whole repo. | /tmp/clawbench-api-reports-pages
Protect routes | Use the deployed Pages worker to return current docs and hard-404 outdated report/API paths. | _worker.js in the Pages bundle
Deploy | Use the Cloudflare Pages publisher skill with credentials from environment variables, never CLI args. | python3 .claude/skills/cloudflare-pages-publisher/scripts/publish_pages.py --project-name clawbench-api-reports --source-dir /tmp/clawbench-api-reports-pages --branch main
Verify | Check new docs return 200 and removed paths return 404 no-store. | curl -sSI https://clawbench-api-reports.pages.dev/docs/engineer-onboarding
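The verify step can be scripted instead of run as ad-hoc curl calls. A sketch: the onboarding URL comes from the table above, while any retired report paths you add to CHECKS with an expected 404 are up to you to fill in from the actual Pages worker config.

```python
import urllib.error
import urllib.request

def expect_status(url: str, expected: int, timeout: float = 5.0) -> bool:
    """HEAD the URL and report whether the response status matches."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == expected
    except urllib.error.HTTPError as exc:
        return exc.code == expected  # 404s arrive as HTTPError
    except OSError:
        return False  # DNS failure, refused connection, timeout

CHECKS = [
    ("https://clawbench-api-reports.pages.dev/docs/engineer-onboarding", 200),
    # Append retired report/API paths here with expected status 404.
]

if __name__ == "__main__":
    for url, expected in CHECKS:
        print(url, "ok" if expect_status(url, expected) else "FAIL")
```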