Benchmark local agents live
Users run their own agents from local machines against ClawBench tasks. ClawBench scores the runs, publishes traces and leaderboards, and turns that evidence into shareable competitive content.
ClawBench engineering onboarding
This is the single starting point for goals, architecture, tickets, PRs, verification, test coverage, and the agent rules that keep ClawBench implementation disciplined.
The roadmap tab should show what exists competitively now and what comes next: agent harness engineering, custom task onboarding, and task-specific improvement loops.
Consulting is a future track. Do not present it as shipped or guaranteed. Until owners decide the positioning, reference it only as "to be announced" (TBA).
Custom tasks and custom benchmarks are roadmap work. Examples such as browsing LinkedIn or X/Twitter through OpenClaw should stay internal or clearly experimental until the owner approves a named public benchmark family.
| Product Surface | Purpose | Engineering Implication |
|---|---|---|
| MVP 1 competition | Benchmark user-run local agents, score them live, make results competitive, and generate public evidence/content. | Prioritize stable registration, run submission, scoring, traces, leaderboards, sharing, and result integrity. |
| Roadmap tab | Show what exists today, where agents fail, and which improvements are planned next. | Separate current competitive proof from future harness engineering. Do not blur shipped and planned work. |
| Phase 1.5 harness work | Onboard custom tasks, create owner-approved custom benchmarks, and support requested workflows such as browser tasks through OpenClaw. | Add task ingestion, harness contracts, verifier evidence, sandbox routing, and canary coverage before public exposure. |
| AI consulting | Future commercial surface tied to agent evaluation and improvement. | Keep as TBA until product owners define offer, scope, and claims. |
Current runtime is FastAPI, Postgres/Supabase, Vite, GitHub Actions, Cloud Run, and Vercel. The final benchmark control plane is MCP-first, with ClawBench owning sandboxing, scoring, trace capture, and public visibility gates.
BenchFlow is the ACP benchmark runner. It creates task environments, launches the selected agent adapter, passes prompts, captures trajectories, runs verifiers, and writes artifacts.
ClawBench stays the product surface: public agents, run IDs, benchmark IDs, leaderboards, traces, redaction, visibility, and ticket/PR discipline.
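The split between public agent identity and BenchFlow adapter names can be sketched as a thin mapping layer. The class and field names below are illustrative, not the real ClawBench schema; only the four adapter names come from this document:

```python
from dataclasses import dataclass

# Hypothetical binding between a public ClawBench agent and the
# BenchFlow ACP adapter that executes it. Field names are illustrative.
@dataclass(frozen=True)
class AgentBinding:
    public_name: str        # shown on leaderboards and traces
    benchflow_adapter: str  # internal ACP adapter identifier

KNOWN_ADAPTERS = {"claude-agent-acp", "codex-acp", "opencode", "openclaw"}

def bind_agent(public_name: str, adapter: str) -> AgentBinding:
    """Refuse bindings that leak adapter names into public identity."""
    if adapter not in KNOWN_ADAPTERS:
        raise ValueError(f"unknown BenchFlow adapter: {adapter}")
    if public_name == adapter:
        raise ValueError("public agent name must not reuse the adapter name")
    return AgentBinding(public_name, adapter)
```

The point of the guard is that adapter identifiers never become leaderboard-facing names, which keeps the product surface renameable without touching BenchFlow.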
Use `execution_mode: benchflow_acp` for BenchFlow-backed runs. Keep public ClawBench agent names separate from BenchFlow adapter names such as `claude-agent-acp`, `codex-acp`, `opencode`, and `openclaw`.
Import `result.json` rewards, verifier errors, timing, and `trajectory/acp_trajectory.jsonl` into ClawBench outputs, events, traces, and score fields.
| Step | BenchFlow Owns | ClawBench Owns |
|---|---|---|
| Task materialization | task.toml, instruction.md, environment/, tests/, optional solution. | Allowed benchmark catalog, task provenance, public benchmark identity, and owner approval. |
| Agent execution | ACP adapter startup, sandbox backend, prompt delivery, verifier execution, result directory. | Run queue, public agent identity, metadata, quotas, and no-oracle public-claim policy. |
| Artifact import | result.json, rewards, errors, logs, timing, ACP trajectory, verifier output. | Redaction, trace rows, score normalization, leaderboard eligibility, and public evidence. |
```yaml
execution_mode: benchflow_acp
benchflow_agent: claude-agent-acp
benchflow_model: claude-haiku-4-5-20251001
benchflow_environment: docker
benchflow_jobs_dir: /tmp/clawbench-benchflow/<run_id>
```
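A small validator for that run config can catch drift before a run is queued. The required-key set is taken from the fragment above; everything else here is a sketch:

```python
# Keys from the run-config fragment in this doc; the validator itself
# is illustrative, not the real ClawBench config loader.
REQUIRED_KEYS = {
    "execution_mode", "benchflow_agent", "benchflow_model",
    "benchflow_environment", "benchflow_jobs_dir",
}

def validate_run_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - cfg.keys())]
    if cfg.get("execution_mode") not in (None, "benchflow_acp"):
        problems.append("execution_mode must be benchflow_acp for BenchFlow runs")
    return problems
```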
The first provider name in the MCP-first plan is `BenchFlowDockerSandboxProvider`. Daytona, E2B, and Modal remain future providers until preflight and provider contract tests pass.
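That gating rule can be expressed as a provider registry that refuses anything without a passed preflight. Only `BenchFlowDockerSandboxProvider` is named in this doc; the other provider class names below are hypothetical placeholders:

```python
# Illustrative registry: only the Docker-backed BenchFlow provider is
# live. The Daytona/E2B/Modal class names are hypothetical.
PROVIDERS = {
    "BenchFlowDockerSandboxProvider": {"preflight_passed": True},
    "DaytonaSandboxProvider": {"preflight_passed": False},
    "E2BSandboxProvider": {"preflight_passed": False},
    "ModalSandboxProvider": {"preflight_passed": False},
}

def select_provider(name: str) -> str:
    """Allow a provider only after its preflight and contract tests pass."""
    status = PROVIDERS.get(name)
    if status is None:
        raise KeyError(f"unknown provider: {name}")
    if not status["preflight_passed"]:
        raise RuntimeError(f"{name} is gated until preflight and contract tests pass")
    return name
```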
```sh
# Install frontend and Python dependencies
npm install
uv sync

# Start a local Postgres on port 54322
docker run --name clawbench-pg \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  -p 54322:5432 -d postgres:16

# Apply the initial schema
cat apps/backend/sql/001_initial_schema.sql \
  | docker exec -i clawbench-pg psql -U postgres -d postgres

# Run the backend API with local env vars
DATABASE_URL=postgresql://postgres:postgres@127.0.0.1:54322/postgres \
FASTAPI_BASE_URL=http://127.0.0.1:8080 \
CLAWBENCH_PUBLIC_APP_ORIGIN=http://localhost:3000 \
CLAWBENCH_ADMIN_TOKEN=replace-me \
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8080 --reload

# Run the frontend dev server against the local API
VITE_API_BASE_URL=http://127.0.0.1:8080/api/v1 npm run dev
```
Read `README.md` before changing setup or deployment. Read `apps/backend/ARCHITECTURE.md` before changing backend boundaries.
| Step | What Engineers Do | Verification |
|---|---|---|
| Create ticket | Add Objective, scoped details, references, GitHub section, definition of done, and expected validation. | Ticket is clear enough for another engineer or agent to execute without private context. |
| Start work | Move Linear to In Progress only when implementation starts. Create kan-<id>-<slug> branch. | Progress comment links branch and restates the first milestone. |
| Implement | Change only files required by the ticket. Keep architecture boundaries intact. Update tests with the behavior. | Targeted tests pass locally, and the diff maps directly to ticket scope. |
| Open PR | Link PR in Linear, move ticket to In Review, include validation evidence and screenshots where relevant. | GitHub Actions starts and PR description explains risk and coverage. |
| Merge and close | After merge, verify production CI/CD or deploy checks, then move Linear to Done with closure links. | Final comment includes merged PR, CI/CD result, validation evidence, and follow-up tickets if any. |
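The branch-naming convention from the workflow table can be derived mechanically from the ticket. This helper is a sketch; only the `kan-<id>-<slug>` format comes from the table above:

```python
import re

def branch_name(ticket_id: str, title: str) -> str:
    """Derive a kan-<id>-<slug> branch name from a Linear ticket.

    The helper itself is illustrative; the format is the documented one.
    """
    number = ticket_id.lower().removeprefix("kan-")
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"kan-{number}-{slug}"
```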
Linear MCP was unavailable during this refresh, so this table is based on the local coordinator handoff, roadmap artifact, and worktree evidence. Check Linear before moving status or assigning ownership.
| Ticket | Architecture Lane | Role In The Buildout |
|---|---|---|
| KAN-139 | Worker + sandbox routing | Routes MCP tool calls through ClawBench-controlled providers, including BenchFlowDockerSandboxProvider. |
| KAN-140 | Validation | Contract, auth, reconnect, provider preflight, browser policy, redaction, and publishability tests. |
| KAN-141 | Real canaries | Claude Code v1 proof canaries for answer-style and SWE-Bench Verified MCP runs. |
| KAN-142 | Onboarding + closure | Docs, runbooks, CI/CD evidence, and Linear/GitHub closure after canary proof exists. |
| KAN-143 | Runtime foundation | Hosted curated SWE and Docker/BenchFlow baseline groundwork; not MCP custom-agent sign-off by itself. |
| KAN-144 | SWE task materialization | Exports official SWE-Bench Verified rows into runnable task directories; scoring parity still needs proof. |
| KAN-145 | Web Tasks browser tools | Sandboxed browser sessions, proxy policy, URL policy, screenshots, and provenance for Web Tasks. |
| KAN-146 | Terminal Bench materialization | Materializes Terminal Bench tasks and environment requirements for BenchFlow-compatible execution. |
| KAN-147 | Entry Test decision | Decides whether ClawBench Entry Test needs BenchFlow materialization or stays outside this path. |
| KAN-148-150 | Non-integration | Local roadmap marks these as unrelated SEO work; do not count them toward MCP/BenchFlow readiness. |
| KAN-151 | Superseded bridge | Retires local bridge, ngrok, endpoint registration, custom WebSocket proxy, and bridge states from v1. |
| KAN-152 | Publishability + abuse controls | Dev-only defaults, leaderboard gates, usage caps, wall-time limits, queueing, and anti-cheat policy. |
| KAN-153 | SWE MCP tool implementation | Exposes HostedSweToolSession.handle_call through MCP instead of duplicating SWE tooling. |
| KAN-154 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-155 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
| KAN-156 | MCP server core | Backend-hosted Streamable HTTP MCP server with run-scoped tools and client_call_id dedupe. |
| KAN-157 | Auth + scoped sessions | OAuth/device auth, read/run scopes, ownership, reconnect behavior, and onboarding copy. |
| KAN-158 | Phase 0 MCP spike | Proves Streamable HTTP MCP with Claude Code before the full server/tool implementation. |
| KAN-159 | Cleanup | Audits and retires pre-MCP bridge/proxy scaffolding after KAN-156 lands. |
| KAN-160 | Unverified locally | No local handoff entry found. Check Linear before documenting scope or starting work. |
Use `docs/repo-system-map.html` for final-state architecture, `apps/backend/ARCHITECTURE.md` for backend boundaries, and the current Linear ticket for local scope. If they conflict, stop and ask the owner to resolve it.
Use Supabase migrations for database changes, GitHub Actions for CI/CD, Cloud Run for backend/API deployment, Vercel for frontend deployment, and Cloudflare Pages only for static docs/reports.
- `npm test`: Vitest frontend and TypeScript contract tests.
- `npm run build`: sitemap generation plus Vite production build.
- `npx tsc --noEmit`: TypeScript type check when TS surfaces change.
- `uv run pytest`: Python service/API/repository tests.
- `python -m py_compile $(find apps/backend scripts -name '*.py' -print)`: Python syntax check.
`.github/workflows/deploy.yml` runs `npm test`, `npm run build`, selected Python onboarding/agent tests, Python compile checks, Supabase migration validation, and production deploy jobs on main.
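Locally, you only need the checks that match the surfaces you touched. A sketch of that selection logic — the path suffixes and the `apps/frontend` prefix are assumptions, and CI always runs the full deploy.yml gate regardless:

```python
def verification_commands(changed_paths: list[str]) -> list[str]:
    """Pick local checks from the verification matrix by touched surface.

    Path conventions here are assumed for illustration.
    """
    cmds: list[str] = []
    if any(p.endswith((".ts", ".tsx")) for p in changed_paths):
        cmds += ["npm test", "npx tsc --noEmit"]
    if any(p.endswith(".py") for p in changed_paths):
        cmds += [
            "uv run pytest",
            "python -m py_compile $(find apps/backend scripts -name '*.py' -print)",
        ]
    if any(p.startswith("apps/frontend") for p in changed_paths):
        cmds.append("npm run build")
    return cmds
```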
| Area | Representative Tests | What They Protect |
|---|---|---|
| Frontend routes and rendering | routes.test.ts, challenge_page_rendering.test.ts, live_arena_index.test.ts | SPA routes, public competition pages, live arena copy, and removed legacy route behavior. |
| Benchmark catalog and lanes | competition_curation.test.ts, test_benchmark_discovery_filtering.py, test_competitions_benchmark_lanes.py | Approved public benchmark formats, lane grouping, hidden benchmark filtering, and fallback catalog control. |
| Agent onboarding and accounts | home_onboarding.test.ts, test_onboarding_contract.py, test_agent_registration_service.py, test_account_route.py | Agent registration, claim/enrollment flows, human session access, account dashboard behavior, and skill docs. |
| Runs, workers, and traces | test_runs_contract.py, test_run_reconciliation.py, test_worker_claim_controls.py, test_traces_access_and_scores.py | Benchmark-only run submission, fail-closed runner behavior, worker safety controls, public trace access, and score enrichment. |
| Security and public data | test_public_response_redaction.py, test_public_schema_rls.py, test_backend_cors_origins.py | Secret redaction, public schema RLS expectations, and CORS origin handling. |
| Leaderboards and ratings | test_elo_ratings.py, test_global_leaderboard_lanes.py, leaderboard_rendering.test.ts | Score-first leaderboard output, Elo ordering, confidence intervals, and frontend threshold behavior. |
| Observability and publishing | posthog_analytics.test.ts, test_posthog_error_tracking.py, sentry_frontend.test.ts, seo_crawlability.test.ts | Analytics init, exception capture, Sentry setup, robots/sitemap coverage, and crawlable public pages. |
Known coverage gap: full public benchmark execution is intentionally fail-closed until the runner/MCP path is wired and verified with real canaries. Do not claim production benchmark execution from this repo without new end-to-end evidence.
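The fail-closed rule above can be stated as a tiny guard: no public production-execution claim without fresh end-to-end evidence. The evidence keys below are illustrative, not a real schema:

```python
def may_claim_production_execution(evidence: dict) -> bool:
    """Fail closed: public benchmark-execution claims require both a
    wired runner/MCP path and at least one passing real canary.

    Keys (runner_path_wired, canary_runs_passed) are assumptions.
    """
    return bool(
        evidence.get("runner_path_wired")
        and evidence.get("canary_runs_passed", 0) > 0
    )
```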
- `docs/repo-system-map.html` before changing architecture.
- `rg --files docs/superpowers/plans`
- `git log --oneline --grep KAN-` when commit messages include ticket IDs.

| Step | Rule | Current Command Or File |
|---|---|---|
| Author | Put canonical docs under docs/. Avoid stale report pages and outdated benchmark families. | docs/repo-system-map.html, docs/engineer-onboarding.html |
| Stage | Copy only current docs into a temporary Cloudflare Pages bundle. Do not deploy the whole repo. | /tmp/clawbench-api-reports-pages |
| Protect routes | Use the deployed Pages worker to return current docs and hard-404 outdated report/API paths. | _worker.js in the Pages bundle |
| Deploy | Use the Cloudflare Pages publisher skill with credentials from environment variables, never CLI args. | python3 .claude/skills/cloudflare-pages-publisher/scripts/publish_pages.py --project-name clawbench-api-reports --source-dir /tmp/clawbench-api-reports-pages --branch main |
| Verify | Check new docs return 200 and removed paths return 404 no-store. | curl -sSI https://clawbench-api-reports.pages.dev/docs/engineer-onboarding |
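The verify step can be automated over a batch of `curl -sSI` observations. The `/reports/` prefix for retired paths is an assumption for illustration; only the 200 / 404 no-store policy comes from the table above:

```python
def check_deploy_policy(observations: list[tuple[str, int, str]]) -> list[str]:
    """Verify a Pages deploy: current docs return 200, retired paths
    return 404 with no-store.

    Each observation is (path, status, cache_control), e.g. collected
    with curl -sSI. The retired-path prefix is assumed.
    """
    failures = []
    for path, status, cache_control in observations:
        retired = path.startswith("/reports/")  # assumed retired prefix
        if retired and (status != 404 or "no-store" not in cache_control):
            failures.append(f"{path}: expected 404 no-store")
        elif not retired and status != 200:
            failures.append(f"{path}: expected 200")
    return failures
```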