ZeroNoise
Multi-agent reality check: worktree-based parallelism, new Claude Code skills, and Codex 5.3 low-level wins
Feb 28
6 min read
162 docs
Today’s highest-signal theme: multi-agent setups break down on research rigor, even as raw coding capabilities keep climbing. You’ll get concrete tool updates (Claude Code /batch + /simplify, Remote Control rollout), replicable workflows (spec→async agent run→deploy, worktree-based parallelism), and two watchable clips on long-horizon loops and evaluation scaffolding.

🔥 TOP SIGNAL

Multi-agent coding looks very different when the task isn’t “implement this” but “do research.” Andrej Karpathy tried running 8 agents (4 Claude + 4 Codex) in parallel on nanochat experiments (1 GPU each) and found the system “doesn’t work,” largely because the agents’ idea generation and experimental rigor are weak: they skip solid baselines and ablations and run nonsensical variations, even though they can implement well-scoped instructions quickly. His framing: the real target is “programming an organization.” Prompts, skills, tools, and rituals (even a “daily standup”) become the “org code,” and the eval is how fast that org makes progress on arbitrary tasks.

🛠️ TOOLS & MODELS

  • Claude Code (next version): new Skills /simplify + /batch

    • /simplify: run parallel agents to improve code quality, tune efficiency, and ensure CLAUDE.md compliance.
    • /batch: interactively plan migrations, then execute with dozens of isolated agents using git worktrees; each agent tests before opening a PR.
    • Intended use: automate much of the work to shepherd PRs to production and to do straightforward, parallelizable migrations.
  • Claude Code Remote Control: rolling out to Pro users

    • Rollout: 10% and ramping; Team/Enterprise “coming soon.”
    • Enablement checklist: update to claude v2.1.58+, log out/in, then run /remote-control.
  • GPT-5.3-Codex: “default choice” signals for automation

    • OpenAI’s Tibo Sottiaux: since release in the API, he’s “consistently hearing” at meetups that GPT-5.3-Codex is the model to use to “get actual work done,” and a “clear winner” for background agents / automation at scale.
    • Also notes it’s breaking through on raw coding ability and that “the secret is out” on best results per $.
    • Docs: https://developers.openai.com/api/docs/models/gpt-5.3-codex.
  • Codex 5.3-high: one-shot, low-level infra surgery

    • Reported “one-shotted” task: bypassed HuggingFace KV cache abstraction, monkey-patched attention at module level, handled M-RoPE, coordinated prompt-memory state with KV cache state, and performed granular eviction with span tracking.
    • Greg Brockman points to Codex 5.3 for “complicated software engineering.”
  • Cursor adoption lens (workflow evolution)

    • Karpathy’s sketch of the “optimal setup” evolution as capabilities improve: None → Tab → Agent → Parallel agents → Agent Teams (?) → ???.
    • His process heuristic: 80% of time on what reliably works, 20% exploring the next step up—even if it’s messy.

💡 WORKFLOWS & TRICKS

  • Parallel agents with real isolation: git worktrees are emerging as the default primitive

    • Karpathy’s research-org simulation: each “research program” is a git branch, each scientist forks a feature branch, and git worktrees provide isolation; “simple files” handle comms.
    • Claude Code’s /batch mirrors this: each migration agent runs in full isolation via git worktrees, tests, then opens a PR.
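Both workflows lean on the same git primitive. A minimal sketch of the isolation setup (branch and directory names are illustrative, not taken from either tool):

```shell
# Work in a scratch area, then give each agent its own branch + checkout.
cd "$(mktemp -d)"
git init demo-repo && cd demo-repo
git config user.email "agent@example.com" && git config user.name "demo"
git commit --allow-empty -m "init"   # worktrees need at least one commit

for agent in agent-1 agent-2 agent-3; do
  # -b creates the branch; each worktree is a separate directory, so
  # agents can edit files and run tests without touching each other.
  git worktree add -b "migrate/$agent" "../wt-$agent"
done

git worktree list   # one line per isolated checkout
```

Cleanup is `git worktree remove <dir>`; a tool like /batch presumably automates an equivalent per-agent lifecycle.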
  • “Research org” orchestration pattern (Karpathy): tmux as your control plane

    • One setup: a tmux window grid of interactive agent sessions so you can watch the work and “take over” when needed.
    • His finding: agents are strong at implementation but weak at experiment design (baselines, ablations, runtime/FLOPs controls), so expect humans to keep providing taste and rigor.
  • Fast app-to-prod loop with the Codex app (from a live demo)

    • Romain Huet highlights a <30 min workflow: scaffold the app, use docs + Playwright MCP, add features with plan mode, then use skills for OpenAI image generation and Vercel deploy.
    • Demo link: https://x.com/kagigz/status/2027444590895063313.
  • Spec-first → async agent run against a real repo (Simon Willison)

    • Pattern: brainstorm the use case with Claude, have it generate a spec, then launch an asynchronous Claude Code run against a real repo to produce working code, a report, and a deployed demo.

  • Context-window hygiene via “stop-and-reset” loops (Ringo/OpenClaw example)

    • Ringo’s “RALPH loop” executes a task markdown file one step at a time, then stops so the next step starts with a fresh context window.
    • Practical takeaway: if your runs degrade over time, deliberately chunk work into restartable steps instead of trying to one-shot long horizons.
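A minimal sketch of the stop-and-reset pattern (the file name and checkbox format are assumptions, not Ringo’s actual implementation): a driver pulls one unchecked step from a task markdown file, hands only that step to a fresh agent invocation, marks it done, and stops.

```python
import re
from pathlib import Path

TASKS = Path("tasks.md")  # hypothetical task file: one "- [ ] step" per line

def next_step(text):
    """Return the first unchecked '- [ ] ...' item, or None when done."""
    m = re.search(r"^- \[ \] (.+)$", text, re.MULTILINE)
    return m.group(1) if m else None

def mark_done(text, step):
    return text.replace(f"- [ ] {step}", f"- [x] {step}", 1)

def run_one_step(run_agent):
    """Execute exactly one step, then stop; the caller restarts the loop,
    so the next step begins with a fresh context window."""
    text = TASKS.read_text()
    step = next_step(text)
    if step is None:
        return False  # all steps done
    run_agent(step)   # e.g. a subprocess call to your coding agent
    TASKS.write_text(mark_done(text, step))
    return True
```

Wrapping `run_one_step` in any supervisor (cron, a while-loop, a fresh process per tick) yields the restartable behavior described above.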
  • Safety guardrails for agentic tools with destructive capabilities (OpenClaw talk)

    • Patterns called out: mandatory confirmations for destructive actions, sandboxing/read-only modes, and using a separate phone number/SIM for the bot.
    • Failure mode to design around: rules stored only in the model’s working memory can be lost after context compaction—leading to destructive behavior.
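A thin dispatcher gate is one way to implement both points at once (tool names and the dict-based dispatch are illustrative, not from the talk): the destructive-action allowlist lives in code rather than in the model’s context, so it cannot be lost to compaction.

```python
# Illustrative allowlist; these tool names are hypothetical.
DESTRUCTIVE = {"delete_file", "drop_table", "send_money"}

def dispatch(tool, args, confirm, handlers):
    """Route a tool call, requiring an out-of-band human confirmation for
    destructive tools. The rule lives in code, not in the model's working
    memory, so it survives context compaction."""
    if tool in DESTRUCTIVE and not confirm(tool, args):
        return {"status": "blocked", "reason": "human declined"}
    return {"status": "ok", "result": handlers[tool](**args)}
```

Read-only or sandbox modes fit the same shape: swap in a `handlers` dict whose destructive entries are stubs.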
  • Eval realism check: scaffolding juice is real, but overfit risk is too

    • METR’s Joel Becker describes harness/scaffold tuning for high performance on dev tasks while trying to avoid overfitting; METR invests heavily in scaffolds to upper-bound model capabilities for safety analysis.
    • He also notes that measuring productivity has gotten harder: developers may refuse “AI-disallowed” randomization, and today’s concurrent workflows (multiple issues in parallel) don’t fit old study designs.

👤 PEOPLE TO WATCH

  • Andrej Karpathy — concrete, instrumented look at why “agent research orgs” are still messy: implementation is easy; ideas + rigor are the bottleneck.
  • Boris Cherny (Claude Code) — shipping practical agent “skills” that encode repeatable team workflows: /simplify + /batch, plus Remote Control rollout details.
  • Romain Huet (OpenAI/Codex) — curating high-signal Codex workflows and capability examples (rapid app shipping; low-level infra tasks).
  • Max Woolf — detailed “skeptic tries agent coding” writeup; notable claim that Opus 4.6/Codex 5.3 feel “an order of magnitude better” for complex tasks than models from months earlier.
  • Simon Willison — repeatable “spec → async agent run → deploy” patterns with publicly inspectable artifacts.

🎬 WATCH & LISTEN

1) OpenClaw Manila — Ringo’s “idea → live prototype” loop (≈24:15–27:55)

How it works under the hood: a ReAct-style loop that writes a task file, executes one task per fresh context window, and uses infra integrations (GitHub/Cloudflare/etc.) to ship prototypes fast.

2) METR (Joel Becker) — harness/scaffold tuning and the overfit trap (≈56:25–57:35)

A grounded explanation of why different harnesses can swing results—and why METR invests in scaffolds to estimate “best possible” model capability without fooling themselves via overfitting.

Editorial take: Raw coding is getting solved; the leverage is moving to orchestration + isolation + guardrails, and the hardest remaining gap is still tasteful, rigorous idea generation, not implementation.

Peter Steinberger
Profile 1 doc

Jesse Palaniban’s Ringo OpenClaw agent for prototyping (engineering leader at fintech startup irudify): Builds full prototypes, from idea to live URL/repo/screenshot announcement on Slack, in 30 minutes or less. Uses a ReAct loop (RALPH) with the o1-preview model, PM2 for process management, and a Caddy server; accesses GitHub/Cloudflare/Notion/GDrive; runs on an Azure VM via Mosh/SSH; company chats via Slack. Workflow: Chat an idea referencing existing codebases (e.g., “build an on-brand landing page at quiz.ringo.irudify.com”); the agent creates a task MD file, iterates through tasks with fresh context windows, pushes code, and deploys. Tips: Build on prior successes and existing repos; copy code locally (remote is slow); orchestrate and delegate coding to opencode/Claude Code to avoid memory overload. Firsthand production use for fast ideation/validation.

SIV agency Aria OpenClaw (Raven, CTO): Multiplayer Discord orchestration has shipped 169 products (39 live) via ReAct loops + Claude/agent SDK (80–90% automated; humans refine/deploy); market research, competitor analysis, coding, and deploys to standards; human–AI collab (e.g., tag an engineer for SynthPay context). Dashboards/products are auto-generated and editable via chat (files in workspace); ~$30k USD/mo on frontier models. Pattern: Shared institutional knowledge, dept-specific flows, build-first-optimize-later.

Jeff J. Hunter’s AI Persona OS skill (early OpenClaw contributor, VA business owner): ClawHub.ai skill (3000+ downloads) with Soulmaker (a prompt interview that builds soul.md), the Never Forget memory protocol (context checkpoints at 75%), security hardening, 8-step workflows, and a knowledge base; plus a VPS remote-desktop script (~$40/mo for 2 agents, Hostinger/Kimmy K API). Timeless: Hardcode rules in system files (not context) so they survive compaction.

Sai’s dev setup: soul.md personality; GitHub PR/issue review; cron jobs for research/QA (unit/E2E tests, human eval). Safety patterns: mandatory confirmations for destructive actions, sandbox/read-only modes, a separate phone/SIM, DM scopes.

Nicholas Reyes’s custom outreach skill: Prompt OpenClaw to build the pipeline itself (Slack → AgentMail API/webhook via a Flask server on a GCP VM); building your own is safer than downloading a third-party skill.
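The receiving end of such a pipeline can be sketched as a minimal Flask webhook (the `url_verification` handshake is real Slack Events API behavior; the AgentMail forwarding call is a placeholder, since that API isn’t shown in the source):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def forward_to_agentmail(event):
    """Placeholder: the real pipeline would call the AgentMail API here."""

@app.route("/slack/events", methods=["POST"])
def slack_events():
    payload = request.get_json(force=True)
    # Slack's Events API verifies a new endpoint with a one-time challenge.
    if payload.get("type") == "url_verification":
        return jsonify({"challenge": payload["challenge"]})
    forward_to_agentmail(payload.get("event", {}))
    return "", 200
```

On a GCP VM you would typically run this behind a WSGI server rather than Flask’s dev server.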

Simon Willison's Weblog

Max Woolf (@minimaxir), a skeptic with coding pedigree, details using coding agents for projects escalating from YouTube metadata scrapers to porting scikit-learn to Rust: a rustlearn crate with fast implementations of logistic regression and k-means clustering that beat scikit-learn benchmarks, built via a three-step agent pipeline.

Opus 4.6 and Codex 5.3 are “an order of magnitude better” than prior coding LLMs, correctly handling complex tasks that would take months manually—challenging perceptions of hype.

Inspired, Simon Willison used Claude Code to build a Rust word-cloud CLI tool.

Firsthand account from Max Woolf on serious side projects.

Latent Space
youtube 1 doc

METR Developer Productivity RCT (March 2025, Joel Becker reporting): AI slowed developers on real issues vs. a no-AI baseline. A redo is challenging: devs refuse no-AI randomization (selection bias), and concurrent multi-issue workflows go unmeasured.

Firsthand Workflow (Swyx): Shifted from Cursor paired with Claude Code to async Claude Code sessions plus review/iterate.

o1 (Claude 3.5 Sonnet?) Impact (Joel Becker observation): Top engineers went from avoiding AI coding to writing almost no code manually.

Contrarian Takes on Gains: Long agent runs (e.g., Claude Code for 5–30 hours) are anecdotal/cherry-picked, with dubious output quality. Speedup claims (e.g., 10x) are overstated; new AI-enabled tasks are often low-value; organizations can’t absorb 10x output.

Coding Benchmarks (METR): HCAST tasks (bug fixes, up to 30 human-hours); SWAA atomic actions (e.g., identify the password file among a set of files). Harness variations cause ~10pp score differences; METR optimizes for max performance.

Simon Willison's Weblog

Simon Willison shares a firsthand workflow for rapid prototyping using Claude and Claude Code:

  • Brainstormed the use case (binary search on Unicode data) with Claude.
  • Had Claude generate a spec.
  • Launched an asynchronous research project with Claude Code against his simonw/research repo to produce working code.

Resulting code and report: github.com/simonw/research/tree/main/unicode-explorer-binary-search#readme. Deployed demo: tools.simonwillison.net/unicode-binary-search.

Built as an experiment from his phone to explore HTTP range requests.
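The underlying technique is easy to sketch: binary search over sorted, fixed-width records, fetching only the bytes each probe needs via an HTTP Range header. (The 8-byte record layout and function names here are illustrative; see the linked repo for the real implementation.)

```python
import urllib.request

RECORD = 8  # hypothetical fixed-width record size in bytes, key-sorted

def fetch_range(url, start, end):
    """Fetch bytes [start, end] of a remote file via an HTTP Range request,
    so a lookup touches only O(log n) small slices, never the whole file."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def bsearch(key, n_records, get):
    """Binary search over n_records fixed-width records; get(i) returns
    record i's bytes (e.g. via fetch_range). Returns the index or None."""
    lo, hi = 0, n_records - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rec = get(mid)
        if rec == key:
            return mid
        if rec < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

With a sorted index file on static hosting, this gives logarithmic lookups with no server-side code, which is what makes the range-request trick so lightweight.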

Geoffrey Huntley

Geoffrey Huntley (prolific writer, international keynote speaker, creator of latentpatterns.com; discovered Ralph ~1 year ago) shares firsthand coding-agent experiences.

  • Ralph (Claude loops agent): Viral; used for game-theory analysis; enabled a PE firm to profit shorting Atlassian. Resource: https://www.theregister.com/2026/01/27/ralph_wiggum_claude_loops/.
  • Ralph Loop workflow: Automated a full database migration from Cloudflare D1 to PlanetScale Postgres; succeeded unattended.
  • Cursor: Enables non-devs to build software; observed at a meetup with attendees showcasing creations.
  • Claude Code technique: Rebuild SaaS features from screenshots; cloned PostHog/Jira/Pipedrive/Calendly features for a hyper-personalized latentpatterns.com.

Quantitative: Software development now costs $10.42/hour, below minimum wage.

Contrarian: Dev skills are commoditizing; AI erases specialist identities (backend/frontend/etc.).

Related: z80 LLM porting example.

Andrej Karpathy
x 2 docs

Andrej Karpathy (@karpathy, ex-Director of AI @ Tesla, OpenAI founding team, Stanford PhD) shares a firsthand multi-agent setup in nanochat for NanoGPT speedrun experiments (e.g., deleting the logit softcap without regression).

Tools/Agents: 4 Claude + 4 Codex agents, 1 GPU each .

Workflow Setups:

  • 8 independent solo researchers
  • 1 chief scientist directing 8 juniors

Infrastructure:

  • Git branches per program; feature forks & worktrees for isolation
  • Simple files for communication
  • Tmux window grids for interactive sessions (no Docker/VMs; instructions prevent interference)
  • Human takeover possible (no -p flag)

Limitations: Agents excel at implementing well-scoped ideas but generate poor ones; they are weak on experiment design, baselines, ablations, and controls (e.g., a spurious hidden-size discovery).

Philosophy: Treat this as “programming an organization” via prompts, skills, tools, and processes (e.g., daily standups); nanochat optimization serves as the task/eval for speed of progress on arbitrary tasks.

Visual: Tmux grid video.

Context: Inspired by @Thom_Wolf questioning whether the NanoGPT speedrun can be automated.

swyx
x 5 docs

@swyx (@cognition) states the Harbor framework (@harborframework) has come to dominate RL infra and evals; his team at Cognition is prioritizing migrating all evals to Harbor, which launched ~3 months ago for TerminalBench 2.

Harbor standardizes benchmarks via one interface: repeatable runs, standardized traces, production-grade practice—born from TerminalBench. Launch video: https://youtu.be/5wo0mLlG0fk?si=IYMJ4gRtBffq3G7d. He expects a mini-industry of Harbor-based evals/benchmarks.

Related: A packed Modal SF event on running complex RL environments, scaling sandboxes, and LLM-as-judge evals.

Firsthand production-migration insight from a Cognition engineer.

Andrej Karpathy
x 2 docs

Andrej Karpathy (ex-Director of AI @ Tesla, OpenAI founding team, Stanford PhD) on Cursor agent adoption from tab/agent request ratios:

Evolution of optimal setups: None → Tab → Agent → Parallel agents → Agent Teams (?) → ???

Too conservative leaves leverage on the table; too aggressive creates chaos.

80/20 process: 80% of time in the comfortable setup, 20% exploring the next steps.

Chart source: https://x.com/mntruell/status/2026736314272591924

Ben Tossell
x 2 docs

@bentossell, dev tool investor and Makerpad founder (acq. by Zapier), is building an open-source API + frontend interface for building/working with files—resembling an IDE—that agents can infinitely extend.

Looking for testers.

Firsthand project announcement from practitioner.

Latent.Space
latent 1 doc

Claude Code adoption jumped from 2% to 4% of GitHub commits in one month, indicating AI escape velocity and coding agents as the first real trillion-dollar unlock; total AI-generated code is likely ~10% or more (incl. Copilot, Cognition/Devin).

Dylan Patel (SemiAnalysis Founder/CEO, advising AI labs/hyperscalers) reports ~1/3 of his 60-person team (engineers + hedge-fund analysts) now uses Claude Code for data scraping and pro forma financial modeling—firsthand production usage.

Recent coding-agent developments: Claude Code/Bot, Multi-book, Kimi 2.5 agent swarms, Codex 5.3.

Jason Zhou
x 2 docs

DeerFlow 2.0 is a new open-source general agent architecture rebuilt from scratch on LangGraph 1.0, designed for long-horizon tasks with planning, long-term memory, file system, and skills.

Previous version has 20k+ GitHub stars.

Repo: https://github.com/bytedance/deer-flow.

Endorsement by @jasonzhou1993 (AI builder at @SuperDesignDev): “awesome” for long-running complex tasks.

Firsthand announcement from ByteDance engineer @henry19840301.

Boris Cherny
x 2 docs

Claude Code’s Remote Control feature is rolling out to Pro users (10% rollout, ramping; Team/Enterprise soon), allowing use away from your desk.

Enablement steps:

  • Update to claude v2.1.58+
  • Log out/log in for fresh flag values
  • Run /remote-control

Includes a demo video.

Firsthand rollout announcement by Anthropic engineer @noahzweben, shared by @bcherny.

Romain Huet
x 2 docs

Romain Huet (Head of Developer Experience @OpenAI, working on Codex) shares @kagigz's firsthand workflow to build and deploy apps in <30 minutes using the Codex app.

Demo: Photobooth app from scratch:

  • Scaffold the app
  • Use docs and Playwright MCP
  • Add features with plan mode
  • Use skills for OpenAI image generation and Vercel deployment

Video: https://x.com/kagigz/status/2027444590895063313.

Boris Cherny
x 4 docs

Claude Code’s next version introduces two new skills, /simplify and /batch, used daily by author Boris Cherny (@bcherny) to automate shepherding PRs to production and parallelizable code migrations.

/simplify: Parallel agents improve code quality, tune efficiency, and ensure CLAUDE.md compliance. Usage: “hey claude make this code change then run /simplify”.

/batch: Interactively plan code migrations, then execute in parallel with dozens of agents using git worktrees for isolation, each testing before opening a PR. Usage: “/batch migrate src/ from Solid to React”.

Romain Huet
x 2 docs

@eigenron reports codex-5.3-high (part of GPT-5.3-Codex) one-shotted a complex task: bypassing HuggingFace’s entire KV-cache abstraction, monkey-patching attention at the module level, dealing with M-RoPE, coordinating prompt-level memory state with KV-cache state, and performing granular surgical eviction with span tracking.
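The “monkey-patching attention at the module level” step can be illustrated generically; this toy module stands in for the HuggingFace internals, which the source doesn’t show:

```python
import types

# Stand-in for a library module whose attention function we want to replace.
model = types.ModuleType("toy_model")

def _orig_attention(q, k, v):
    return sum(q) + sum(k) + sum(v)  # placeholder math

model.attention = _orig_attention

def forward(q, k, v):
    # Library code calls attention *through the module attribute*, which is
    # why patching at module level (not a local reference) takes effect.
    return model.attention(q, k, v)

model.forward = forward

calls = []

def patched_attention(q, k, v):
    """Wrapper that records each call (e.g. to manage KV-cache state)
    before delegating to the original implementation."""
    calls.append((tuple(q), tuple(k), tuple(v)))
    return _orig_attention(q, k, v)

model.attention = patched_attention  # the monkey-patch
```

The same idea, applied to a real transformers module, lets external code intercept attention to manage cache eviction without forking the library.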

@romainhuet (Head of Developer Experience @OpenAI, working on Codex) highlights this as epic, saying GPT-5.3-Codex “keeps raising the bar.”

Greg Brockman
x 2 docs

Codex 5.3-high one-shotted a complex low-level engineering task: bypassing HuggingFace’s entire KV-cache abstraction, monkey-patching attention at the module level, dealing with M-RoPE, coordinating prompt-level memory state with KV-cache state, and performing granular surgical eviction with span tracking.

@eigenron (firsthand): “my jaw is on the floor.”

@gdb (OpenAI President & Co-Founder) highlights Codex 5.3 for complicated software engineering.

Theo - t3.gg
x 1 doc

@jullerino, a contributor to T3 Code, has nearly completed Claude Code integration.

He shared that he is down to 30% remaining Codex usage, so he is using his Claude Code subscription for easier tasks.

Firsthand account relayed by Theo (@theo, CEO @t3.gg).

Tibo
x 2 docs

GPT-5.3-Codex, released in the OpenAI API by @thsottiaux's team, is reported as the top model for actual work, background agents, and automation at scale, based on feedback from meetups. It excels in raw coding ability and delivers the best results per dollar spent.

Docs: https://developers.openai.com/api/docs/models/gpt-5.3-codex.

Author context: An OpenAI Codex team member sharing firsthand release insights and community feedback.