Benchmark local agents live
Users run their own agents from local machines against ClawBench tasks. ClawBench scores the runs, publishes traces and leaderboards, and turns that evidence into shareable competitive content.
ClawBench engineering onboarding
This is the single starting point for goals, architecture, tickets, PRs, verification, test coverage, and the agent rules that keep ClawBench implementation disciplined.
The roadmap tab should show what exists competitively now and what comes next: agent harness engineering, custom task onboarding, and task-specific improvement loops.
Consulting is a future track. Do not present it as shipped or guaranteed. Until owners decide the positioning, reference it only as "to be announced" (TBA).
Custom tasks and custom benchmarks are roadmap work. Examples such as browsing LinkedIn or X/Twitter through OpenClaw should stay internal or clearly experimental until the owner approves a named public benchmark family.
| Product Surface | Purpose | Engineering Implication |
|---|---|---|
| MVP 1 competition | Benchmark user-run local agents, score them live, make results competitive, and generate public evidence/content. | Prioritize stable registration, run submission, scoring, traces, leaderboards, sharing, and result integrity. |
| Roadmap tab | Show what exists today, where agents fail, and which improvements are planned next. | Separate current competitive proof from future harness engineering. Do not blur shipped and planned work. |
| Phase 1.5 harness work | Onboard custom tasks, create owner-approved custom benchmarks, and support requested workflows such as browser tasks through OpenClaw. | Add task ingestion, harness contracts, verifier evidence, sandbox routing, and canary coverage before public exposure. |
| AI consulting | Future commercial surface tied to agent evaluation and improvement. | Keep as TBA until product owners define offer, scope, and claims. |
Current runtime is FastAPI, Postgres/Supabase, Vite, GitHub Actions, Cloud Run, and Vercel. The final benchmark control plane is MCP-first, with ClawBench owning sandboxing, scoring, trace capture, and public visibility gates.
BenchFlow is the ACP benchmark runner. It creates task environments, launches the selected agent adapter, passes prompts, captures trajectories, runs verifiers, and writes artifacts.
ClawBench stays the product surface: public agents, run IDs, benchmark IDs, leaderboards, traces, redaction, visibility, and ticket/PR discipline.
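The split between public agent identity and BenchFlow adapter names can be sketched as a thin mapping layer. The class and field names below are illustrative, not the real ClawBench schema; only the four adapter names come from this document:

```python
from dataclasses import dataclass

# Hypothetical binding between a public ClawBench agent and the
# BenchFlow ACP adapter that executes it. Field names are illustrative.
@dataclass(frozen=True)
class AgentBinding:
    public_name: str        # shown on leaderboards and traces
    benchflow_adapter: str  # internal ACP adapter identifier

KNOWN_ADAPTERS = {"claude-agent-acp", "codex-acp", "opencode", "openclaw"}

def bind_agent(public_name: str, adapter: str) -> AgentBinding:
    """Refuse bindings that leak adapter names into public identity."""
    if adapter not in KNOWN_ADAPTERS:
        raise ValueError(f"unknown BenchFlow adapter: {adapter}")
    if public_name == adapter:
        raise ValueError("public agent name must not reuse the adapter name")
    return AgentBinding(public_name, adapter)
```

The point of the guard is that adapter identifiers never become leaderboard-facing names, which keeps the product surface renameable without touching BenchFlow.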
Use `execution_mode: benchflow_acp` for BenchFlow-backed runs. Keep public ClawBench agent names separate from BenchFlow adapter names such as `claude-agent-acp`, `codex-acp`, `opencode`, and `openclaw`.
Import `result.json` rewards, verifier errors, timing, and `trajectory/acp_trajectory.jsonl` into ClawBench outputs, events, traces, and score fields.
| Step | BenchFlow Owns | ClawBench Owns |
|---|---|---|
| Task materialization | task.toml, instruction.md, environment/, tests/, optional solution. | Allowed benchmark catalog, task provenance, public benchmark identity, and owner approval. |
| Agent execution | ACP adapter startup, sandbox backend, prompt delivery, verifier execution, result directory. | Run queue, public agent identity, metadata, quotas, and no-oracle public-claim policy. |
| Artifact import | result.json, rewards, errors, logs, timing, ACP trajectory, verifier output. | Redaction, trace rows, score normalization, leaderboard eligibility, and public evidence. |
```yaml
execution_mode: benchflow_acp
benchflow_agent: claude-agent-acp
benchflow_model: claude-haiku-4-5-20251001
benchflow_environment: docker
benchflow_jobs_dir: /tmp/clawbench-benchflow/<run_id>
```
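A small validator for that run config can catch drift before a run is queued. The required-key set is taken from the fragment above; everything else here is a sketch:

```python
# Keys from the run-config fragment in this doc; the validator itself
# is illustrative, not the real ClawBench config loader.
REQUIRED_KEYS = {
    "execution_mode", "benchflow_agent", "benchflow_model",
    "benchflow_environment", "benchflow_jobs_dir",
}

def validate_run_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - cfg.keys())]
    if cfg.get("execution_mode") not in (None, "benchflow_acp"):
        problems.append("execution_mode must be benchflow_acp for BenchFlow runs")
    return problems
```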
The first provider name in the MCP-first plan is `BenchFlowDockerSandboxProvider`. Daytona, E2B, and Modal remain future providers until preflight and provider contract tests pass.
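That gating rule can be expressed as a provider registry that refuses anything without a passed preflight. Only `BenchFlowDockerSandboxProvider` is named in this doc; the other provider class names below are hypothetical placeholders:

```python
# Illustrative registry: only the Docker-backed BenchFlow provider is
# live. The Daytona/E2B/Modal class names are hypothetical.
PROVIDERS = {
    "BenchFlowDockerSandboxProvider": {"preflight_passed": True},
    "DaytonaSandboxProvider": {"preflight_passed": False},
    "E2BSandboxProvider": {"preflight_passed": False},
    "ModalSandboxProvider": {"preflight_passed": False},
}

def select_provider(name: str) -> str:
    """Allow a provider only after its preflight and contract tests pass."""
    status = PROVIDERS.get(name)
    if status is None:
        raise KeyError(f"unknown provider: {name}")
    if not status["preflight_passed"]:
        raise RuntimeError(f"{name} is gated until preflight and contract tests pass")
    return name
```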
```sh
# Install frontend and Python dependencies
npm install
uv sync

# Start a local Postgres on port 54322
docker run --name clawbench-pg \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  -p 54322:5432 -d postgres:16

# Apply the initial schema
cat apps/backend/sql/001_initial_schema.sql \
  | docker exec -i clawbench-pg psql -U postgres -d postgres

# Run the backend API with local env vars
DATABASE_URL=postgresql://postgres:postgres@127.0.0.1:54322/postgres \
FASTAPI_BASE_URL=http://127.0.0.1:8080 \
CLAWBENCH_PUBLIC_APP_ORIGIN=http://localhost:3000 \
CLAWBENCH_ADMIN_TOKEN=replace-me \
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8080 --reload

# Run the frontend dev server against the local API
VITE_API_BASE_URL=http://127.0.0.1:8080/api/v1 npm run dev
```
Read `README.md` before changing setup or deployment. Read `apps/backend/ARCHITECTURE.md` before changing backend boundaries.
| Step | What Engineers Do | Verification |
|---|---|---|
| Create ticket | Add Objective, scoped details, references, GitHub section, definition of done, and expected validation. | Ticket is clear enough for another engineer or agent to execute without private context. |
| Start work | Move Linear to In Progress only when implementation starts. Create kan-<id>-<slug> branch. | Progress comment links branch and restates the first milestone. |
| Implement | Change only files required by the ticket. Keep architecture boundaries intact. Update tests with the behavior. | Targeted tests pass locally, and the diff maps directly to ticket scope. |
| Open PR | Link PR in Linear, move ticket to In Review, include validation evidence and screenshots where relevant. | GitHub Actions starts and PR description explains risk and coverage. |
| Merge and close | After merge, verify production CI/CD or deploy checks, then move Linear to Done with closure links. | Final comment includes merged PR, CI/CD result, validation evidence, and follow-up tickets if any. |
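The branch-naming convention from the workflow table can be derived mechanically from the ticket. This helper is a sketch; only the `kan-<id>-<slug>` format comes from the table above:

```python
import re

def branch_name(ticket_id: str, title: str) -> str:
    """Derive a kan-<id>-<slug> branch name from a Linear ticket.

    The helper itself is illustrative; the format is the documented one.
    """
    number = ticket_id.lower().removeprefix("kan-")
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"kan-{number}-{slug}"
```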
Linear MCP was unavailable during this refresh, so this table is based on the local coordinator handoff, roadmap artifact, and worktree evidence. Check Linear before moving status or assigning ownership.
| Ticket | Architecture Lane | Role In The Buildout |
|---|---|---|
| KAN-139 | Worker + sandbox routing | Routes MCP tool calls through ClawBench-controlled providers, including BenchFlowDockerSandboxProvider. |
| KAN-140 | Validation | Contract, auth, reconnect, provider preflight, browser policy, redaction, and publishability tests. |
| KAN-141 | Real canaries | Claude Code v1 proof canaries for answer-style and SWE-Bench Verified MCP runs. |
| KAN-142 | Onboarding + closure | Docs, runbooks, CI/CD evidence, and Linear/GitHub closure after canary proof exists. |
| KAN-143 | Runtime foundation | Hosted curated SWE and Docker/BenchFlow baseline groundwork; not MCP custom-agent sign-off by itself. |
| KAN-144 | SWE task materialization | Exports official SWE-Bench Verified rows into runnable task directories; scoring parity still needs proof. |
| KAN-145 | Web Tasks browser tools | Sandboxed browser sessions, proxy policy, URL policy, screenshots, and provenance for Web Tasks. |
| KAN-146 | Terminal Bench materialization | Materializes Terminal Bench tasks and environment requirements for BenchFlow-compatible execution. |
| KAN-147 | Entry Test decision | Decides whether ClawBench Entry Test needs BenchFlow materialization or stays outside this path. |
| KAN-148-150 | Non-integration | Local roadmap marks these as unrelated SEO work; do not count them toward MCP/BenchFlow readiness. |
| KAN-151 | Superseded bridge | Retires local bridge, ngrok, endpoint registration, custom WebSocket proxy, and bridge states from v1. |
| KAN-152 | Publishability + abuse controls | Dev-only defaults, leaderboard gates, usage caps, wall-time limits, queueing, and anti-cheat policy. |
| KAN-153 | SWE MCP tool implementation | Exposes HostedSweToolSession.handle_call through MCP instead of duplicating SWE tooling. |
| KAN-154 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-155 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-156 | MCP server core | Backend-hosted Streamable HTTP MCP server with run-scoped tools and client_call_id dedupe. |
| KAN-157 | Auth + scoped sessions | OAuth/device auth, read/run scopes, ownership, reconnect behavior, and onboarding copy. |
| KAN-158 | Phase 0 MCP spike | Proves Streamable HTTP MCP with Claude Code before the full server/tool implementation. |
| KAN-159 | Cleanup | Audits and retires pre-MCP bridge/proxy scaffolding after KAN-156 lands. |
| KAN-160 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
Use `docs/repo-system-map.html` for final-state architecture, `apps/backend/ARCHITECTURE.md` for backend boundaries, and the current Linear ticket for local scope. If they conflict, stop and ask the owner to resolve it.
Use Supabase migrations for database changes, GitHub Actions for CI/CD, Cloud Run for backend/API deployment, Vercel for frontend deployment, and Cloudflare Pages only for static docs/reports.
- `npm test`: Vitest frontend and TypeScript contract tests.
- `npm run build`: sitemap generation plus Vite production build.
- `npx tsc --noEmit`: TypeScript type check when TS surfaces change.
- `uv run pytest`: Python service/API/repository tests.
- `python -m py_compile $(find apps/backend scripts -name '*.py' -print)`: Python syntax check.
`.github/workflows/deploy.yml` runs `npm test`, `npm run build`, selected Python onboarding/agent tests, Python compile checks, Supabase migration validation, and production deploy jobs on main.
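Locally, you only need the checks that match the surfaces you touched. A sketch of that selection logic — the path suffixes and the `apps/frontend` prefix are assumptions, and CI always runs the full deploy.yml gate regardless:

```python
def verification_commands(changed_paths: list[str]) -> list[str]:
    """Pick local checks from the verification matrix by touched surface.

    Path conventions here are assumed for illustration.
    """
    cmds: list[str] = []
    if any(p.endswith((".ts", ".tsx")) for p in changed_paths):
        cmds += ["npm test", "npx tsc --noEmit"]
    if any(p.endswith(".py") for p in changed_paths):
        cmds += [
            "uv run pytest",
            "python -m py_compile $(find apps/backend scripts -name '*.py' -print)",
        ]
    if any(p.startswith("apps/frontend") for p in changed_paths):
        cmds.append("npm run build")
    return cmds
```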
| Area | Representative Tests | What They Protect |
|---|---|---|
| Frontend routes and rendering | routes.test.ts, challenge_page_rendering.test.ts, live_arena_index.test.ts | SPA routes, public competition pages, live arena copy, and removed legacy route behavior. |
| Benchmark catalog and lanes | competition_curation.test.ts, test_benchmark_discovery_filtering.py, test_competitions_benchmark_lanes.py | Approved public benchmark formats, lane grouping, hidden benchmark filtering, and fallback catalog control. |
| Agent onboarding and accounts | home_onboarding.test.ts, test_onboarding_contract.py, test_agent_registration_service.py, test_account_route.py | Agent registration, claim/enrollment flows, human session access, account dashboard behavior, and skill docs. |
| Runs, workers, and traces | test_runs_contract.py, test_run_reconciliation.py, test_worker_claim_controls.py, test_traces_access_and_scores.py | Benchmark-only run submission, fail-closed runner behavior, worker safety controls, public trace access, and score enrichment. |
| Security and public data | test_public_response_redaction.py, test_public_schema_rls.py, test_backend_cors_origins.py | Secret redaction, public schema RLS expectations, and CORS origin handling. |
| Leaderboards and ratings | test_elo_ratings.py, test_global_leaderboard_lanes.py, leaderboard_rendering.test.ts | Score-first leaderboard output, Elo ordering, confidence intervals, and frontend threshold behavior. |
| Observability and publishing | posthog_analytics.test.ts, test_posthog_error_tracking.py, sentry_frontend.test.ts, seo_crawlability.test.ts | Analytics init, exception capture, Sentry setup, robots/sitemap coverage, and crawlable public pages. |
Known coverage gap: full public benchmark execution is intentionally fail-closed until the runner/MCP path is wired and verified with real canaries. Do not claim production benchmark execution from this repo without new end-to-end evidence.
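The fail-closed rule above can be stated as a tiny guard: no public production-execution claim without fresh end-to-end evidence. The evidence keys below are illustrative, not a real schema:

```python
def may_claim_production_execution(evidence: dict) -> bool:
    """Fail closed: public benchmark-execution claims require both a
    wired runner/MCP path and at least one passing real canary.

    Keys (runner_path_wired, canary_runs_passed) are assumptions.
    """
    return bool(
        evidence.get("runner_path_wired")
        and evidence.get("canary_runs_passed", 0) > 0
    )
```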
- `docs/repo-system-map.html` before changing architecture.
- `rg --files docs/superpowers/plans`
- `git log --oneline --grep KAN-` when commit messages include ticket IDs.

| Step | Rule | Current Command Or File |
|---|---|---|
| Author | Put canonical docs under docs/. Avoid stale report pages and outdated benchmark families. | docs/repo-system-map.html, docs/engineer-onboarding.html |
| Stage | Copy only current docs into a temporary Cloudflare Pages bundle. Do not deploy the whole repo. | /tmp/clawbench-api-reports-pages |
| Protect routes | Use the deployed Pages worker to return current docs and hard-404 outdated report/API paths. | _worker.js in the Pages bundle |
| Deploy | Use the Cloudflare Pages publisher skill with credentials from environment variables, never CLI args. | python3 .claude/skills/cloudflare-pages-publisher/scripts/publish_pages.py --project-name clawbench-api-reports --source-dir /tmp/clawbench-api-reports-pages --branch main |
| Verify | Check new docs return 200 and removed paths return 404 no-store. | curl -sSI https://clawbench-api-reports.pages.dev/docs/engineer-onboarding |
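The verify step can be automated over a batch of `curl -sSI` observations. The `/reports/` prefix for retired paths is an assumption for illustration; only the 200 / 404 no-store policy comes from the table above:

```python
def check_deploy_policy(observations: list[tuple[str, int, str]]) -> list[str]:
    """Verify a Pages deploy: current docs return 200, retired paths
    return 404 with no-store.

    Each observation is (path, status, cache_control), e.g. collected
    with curl -sSI. The retired-path prefix is assumed.
    """
    failures = []
    for path, status, cache_control in observations:
        retired = path.startswith("/reports/")  # assumed retired prefix
        if retired and (status != 404 or "no-store" not in cache_control):
            failures.append(f"{path}: expected 404 no-store")
        elif not retired and status != 200:
            failures.append(f"{path}: expected 200")
    return failures
```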