ZeroNoise
Multi-agent reality check: worktree-based parallelism, new Claude Code skills, and Codex 5.3 low-level wins
Feb 28
6 min read
162 docs
Today’s highest-signal theme: multi-agent setups break down on research rigor, even as raw coding capabilities keep climbing. You’ll get concrete tool updates (Claude Code /batch + /simplify, Remote Control rollout), replicable workflows (spec→async agent run→deploy, worktree-based parallelism), and two watchable clips on long-horizon loops and evaluation scaffolding.

🔥 TOP SIGNAL

Multi-agent coding looks very different when the task isn’t “implement this,” but “do research.” Andrej Karpathy tried running 8 agents (4 Claude + 4 Codex) in parallel on nanochat experiments (1 GPU each) and found the system “doesn’t work” largely because agents’ idea generation and experimental rigor are weak—they skip solid baselines/ablations and run nonsensical variations, even if they can implement well-scoped instructions quickly. His framing: the real target is “programming an organization”—prompts, skills, tools, and rituals (even “daily standup”) become the “org code,” and the eval is how fast that org makes progress on arbitrary tasks.

🛠️ TOOLS & MODELS

  • Claude Code (next version): new Skills /simplify + /batch

    • /simplify: run parallel agents to improve code quality, tune efficiency, and ensure CLAUDE.md compliance.
    • /batch: interactively plan migrations, then execute with dozens of isolated agents using git worktrees; each agent tests before opening a PR.
    • Intended use: automate much of the work to shepherd PRs to production and to do straightforward, parallelizable migrations.
  • Claude Code Remote Control: rolling out to Pro users

    • Rollout: 10% and ramping; Team/Enterprise “coming soon.”
    • Enablement checklist: update to claude v2.1.58+, log out/in, then run /remote-control.
  • GPT-5.3-Codex: “default choice” signals for automation

    • OpenAI’s Tibo Sottiaux: since its release in the API, he’s “consistently hearing” at meetups that GPT-5.3-Codex is the model to use to “get actual work done,” and a “clear winner” for background agents / automation at scale.
    • Also notes it’s breaking through on raw coding ability and that “the secret is out” on best results per $.
    • Docs: https://developers.openai.com/api/docs/models/gpt-5.3-codex.
  • Codex 5.3-high: one-shot, low-level infra surgery

    • Reported “one-shotted” task: bypassed HuggingFace KV cache abstraction, monkey-patched attention at module level, handled M-RoPE, coordinated prompt-memory state with KV cache state, and performed granular eviction with span tracking.
    • Greg Brockman points to Codex 5.3 for “complicated software engineering.”
  • Cursor adoption lens (workflow evolution)

    • Karpathy’s sketch of the “optimal setup” evolution as capabilities improve: None → Tab → Agent → Parallel agents → Agent Teams (?) → ???.
    • His process heuristic: 80% of time on what reliably works, 20% exploring the next step up—even if it’s messy.

💡 WORKFLOWS & TRICKS

  • Parallel agents with real isolation: git worktrees are emerging as the default primitive

    • Karpathy’s research-org simulation: each “research program” is a git branch, each scientist forks a feature branch, and git worktrees provide isolation; “simple files” handle comms.
    • Claude Code’s /batch mirrors this: each migration agent runs in full isolation via git worktrees, tests, then opens a PR.
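The worktree-per-agent primitive both bullets describe can be sketched in a few lines. A minimal, self-contained example — the repo location, branch names, and agent names are illustrative, not taken from either tool:

```shell
# Sketch: one isolated checkout per agent via git worktrees.
set -eu
root=$(mktemp -d)                 # stand-in for wherever you keep checkouts
git init -q "$root/repo"
cd "$root/repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

for agent in agent-1 agent-2 agent-3; do
  # Each agent gets its own branch and its own working directory,
  # so parallel edits never collide in a shared checkout.
  git worktree add -q -b "migrate/$agent" "$root/$agent"
done

git worktree list                 # main checkout plus one worktree per agent
```

Each agent can then run, test, and open a PR from its own directory; `git worktree remove` cleans up after the branch merges.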
  • “Research org” orchestration pattern (Karpathy): tmux as your control plane

    • One setup: a tmux window grid of interactive agent sessions so you can watch work, and “take over” when needed.
    • His finding: agents are strong at implementation, weak at experiment design (baselines, ablations, runtime/FLOPs controls), so expect humans to still provide taste + rigor.
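The tmux-grid idea above can be sketched as a small launcher script. Everything here is a placeholder — the session name `org` and the `claude`/`codex` commands stand in for whatever agent CLIs you actually run:

```shell
# Sketch: generate a launcher for a tiled tmux grid of agent panes
# you can watch and take over. All names are placeholders.
cat > launch_org.sh <<'EOF'
#!/bin/sh
tmux new-session -d -s org 'claude'   # pane 1: first agent, detached
tmux split-window -t org 'claude'     # pane 2
tmux split-window -t org 'codex'      # pane 3
tmux split-window -t org 'codex'      # pane 4
tmux select-layout -t org tiled       # arrange the panes as a grid
tmux attach -t org                    # watch everything; jump into any pane
EOF
chmod +x launch_org.sh
cat launch_org.sh
```

“Taking over” is then just clicking into a pane: the agent session is interactive, so the human and the agent share the same terminal.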
  • Fast app-to-prod loop with the Codex app (from a live demo)

    • Romain Huet highlights a <30 min workflow: scaffold the app, use docs + Playwright MCP, add features with plan mode, then use skills for OpenAI image generation and Vercel deploy.
    • Demo link: https://x.com/kagigz/status/2027444590895063313.
  • Spec-first → async agent run against a real repo (Simon Willison)

  • Context-window hygiene via “stop-and-reset” loops (Ringo/OpenClaw example)

    • Ringo’s “RALPH loop” executes a task markdown file one step at a time, then stops so the next step starts with a fresh context window.
    • Practical takeaway: if your runs degrade over time, consider deliberately chunking work into restartable steps instead of trying to one-shot long horizons.
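A minimal sketch of the stop-and-reset shape, in the spirit of the RALPH loop (not Ringo’s actual implementation): one unchecked task per run, fresh process each time. `run_agent` is a stand-in for a single fresh-context agent invocation; the GNU `sed -i` flag is assumed.

```shell
# Sketch: execute a task markdown file one step per fresh run.
set -eu
cat > tasks.md <<'EOF'
- [ ] scaffold project
- [ ] add tests
- [ ] deploy
EOF

run_agent() {                 # placeholder: a real setup would launch an agent
  echo "fresh-context run: $1"
}

while grep -q '^- \[ \]' tasks.md; do
  task=$(grep -m1 '^- \[ \]' tasks.md | sed 's/^- \[ \] //')
  run_agent "$task"           # each step starts with a clean context window
  # mark the step done, then stop; the next iteration is a fresh start
  sed -i '0,/^- \[ \]/s//- [x]/' tasks.md
done
cat tasks.md
```

The key property is that no state crosses iterations except the task file itself, so a degraded or crashed run costs one step, not the whole horizon.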
  • Safety guardrails for agentic tools with destructive capabilities (OpenClaw talk)

    • Patterns called out: mandatory confirmations for destructive actions, sandboxing/read-only modes, and using a separate phone number/SIM for the bot.
    • Failure mode to design around: rules stored only in the model’s working memory can be lost after context compaction—leading to destructive behavior.
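The fix for that failure mode is to enforce the rule outside the model entirely. A hypothetical sketch — `confirm` and `guarded_rm` are made-up names, not from the talk — of a confirmation gate that survives any amount of context compaction because it lives in the tool layer, not the prompt:

```shell
# Sketch: mandatory confirmation wrapper for a destructive action,
# enforced in the tool layer rather than in the model's context.
confirm() {
  printf 'About to run: %s. Proceed? [y/N] ' "$*"
  read -r ans
  [ "$ans" = "y" ]            # anything but an explicit "y" refuses
}

guarded_rm() {
  if confirm "rm -rf $1"; then
    rm -rf "$1"
  else
    echo "refused: $1"
  fi
}
```

The same wrapper pattern generalizes to any destructive verb (deploy, drop table, send message); sandboxed or read-only modes are the stricter version where the verb simply isn’t exposed.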
  • Eval realism check: scaffolding juice is real, but overfit risk is too

    • METR’s Joel Becker describes harness/scaffold tuning for high performance on dev tasks while trying to avoid overfitting; they invest heavily in scaffolds to upper-bound model capabilities for safety analysis.
    • He also notes how measuring productivity got harder: developers may refuse “AI-disallowed” randomization, and today’s concurrent workflows (multiple issues in parallel) don’t fit old study designs.

👤 PEOPLE TO WATCH

  • Andrej Karpathy — concrete, instrumented look at why “agent research orgs” are still messy: implementation is easy; ideas + rigor are the bottleneck.
  • Boris Cherny (Claude Code) — shipping practical agent “skills” that encode repeatable team workflows: /simplify + /batch, plus Remote Control rollout details.
  • Romain Huet (OpenAI/Codex) — curating high-signal Codex workflows and capability examples (rapid app shipping; low-level infra tasks).
  • Max Woolf — detailed “skeptic tries agent coding” writeup; notable claim that Opus 4.6/Codex 5.3 feel “an order of magnitude better” for complex tasks than models from months earlier.
  • Simon Willison — repeatable “spec → async agent run → deploy” patterns with publicly inspectable artifacts.

🎬 WATCH & LISTEN

1) OpenClaw Manila — Ringo’s “idea → live prototype” loop (≈24:15–27:55)

How it works under the hood: a ReAct-style loop that writes a task file, executes one task per fresh context window, and uses infra integrations (GitHub/Cloudflare/etc.) to ship prototypes fast.

2) METR (Joel Becker) — harness/scaffold tuning and the overfit trap (≈56:25–57:35)

A grounded explanation of why different harnesses can swing results—and why METR invests in scaffolds to estimate “best possible” model capability without fooling themselves via overfitting.

Editorial take: Raw coding is getting solved; the leverage is moving to orchestration + isolation + guardrails—and the hardest remaining gap is still tasteful, rigorous idea generation, not implementation.