🔥 TOP SIGNAL
Multi-agent coding looks very different when the task isn’t “implement this,” but “do research.” Andrej Karpathy tried running 8 agents (4 Claude + 4 Codex) in parallel on nanochat experiments (1 GPU each) and found the system “doesn’t work” largely because agents’ idea generation and experimental rigor are weak—they skip solid baselines/ablations and run nonsensical variations, even if they can implement well-scoped instructions quickly. His framing: the real target is “programming an organization”—prompts, skills, tools, and rituals (even “daily standup”) become the “org code,” and the eval is how fast that org makes progress on arbitrary tasks.
🛠️ TOOLS & MODELS
Claude Code (next version): new Skills
- /simplify: run parallel agents to improve code quality, tune efficiency, and ensure CLAUDE.md compliance.
- /batch: interactively plan migrations, then execute with dozens of isolated agents using git worktrees; each agent tests before opening a PR.
- Intended use: automate much of the work of shepherding PRs to production and of straightforward, parallelizable migrations.
Claude Code Remote Control: rolling out to Pro users
- Rollout: 10% and ramping; Team/Enterprise “coming soon”.
- Enablement checklist: update to claude v2.1.58+, log out/in, then run /remote-control.
GPT-5.3-Codex: “default choice” signals for automation
- OpenAI’s Tibo Sottiaux: since its API release, he’s been “consistently hearing” at meetups that GPT-5.3-Codex is the model to use to “get actual work done,” and a “clear winner” for background agents / automation at scale.
- Also notes it’s breaking through on raw coding ability and that “the secret is out” on best results per $.
- Docs: https://developers.openai.com/api/docs/models/gpt-5.3-codex.
Codex 5.3-high: one-shot, low-level infra surgery
- Reported “one-shotted” task: bypassed HuggingFace KV cache abstraction, monkey-patched attention at module level, handled M-RoPE, coordinated prompt-memory state with KV cache state, and performed granular eviction with span tracking.
- Greg Brockman points to Codex 5.3 for “complicated software engineering”.
Cursor adoption lens (workflow evolution)
- Karpathy’s sketch of the “optimal setup” evolution as capabilities improve: None → Tab → Agent → Parallel agents → Agent Teams (?) → ???.
- His process heuristic: 80% of time on what reliably works, 20% exploring the next step up—even if it’s messy.
💡 WORKFLOWS & TRICKS
Parallel agents with real isolation: git worktrees are emerging as the default primitive
- Karpathy’s research-org simulation: each “research program” as a git branch, each scientist forks a feature branch, and git worktrees provide isolation; “simple files” handle comms.
- Claude Code’s /batch mirrors this: each migration agent runs in full isolation via git worktrees, tests, then opens a PR.
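The worktree-per-agent primitive can be sketched in a few lines. The `agent/<name>` branch scheme and `wt-<name>` directory layout below are illustrative assumptions, not Claude Code’s actual internals; the `git worktree add -b` flags are standard:

```python
# Hedged sketch: give each parallel agent its own branch and working
# directory via git worktrees, so edits never collide until PR time.
import subprocess
from pathlib import Path

def worktree_cmd(repo: Path, agent: str) -> list[str]:
    """Build the `git worktree add` argv for one agent's sandbox."""
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{agent}",            # fresh branch per agent
        str(repo.parent / f"wt-{agent}"),  # separate working directory
    ]

def spawn_worktrees(repo: Path, agents: list[str], run=subprocess.run) -> None:
    """Create one isolated checkout per agent; `run` is injectable for dry runs."""
    for agent in agents:
        run(worktree_cmd(repo, agent), check=True)
```

Because `run` is injectable, the same helper works for a dry run (collect the argv lists) or a real invocation with `subprocess.run`.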
“Research org” orchestration pattern (Karpathy): tmux as your control plane
- One setup: a tmux window grid of interactive agent sessions so you can watch work, and “take over” when needed.
- His finding: agents are strong at implementation, weak at experiment design (baselines, ablations, runtime/FLOPs controls), so expect humans to still provide taste + rigor.
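A grid like that can be stood up programmatically. The session name and agent commands below are illustrative; the tmux subcommands (`new-session -d -s`, `split-window -t`, `select-layout tiled`) are standard:

```python
# Hedged sketch: build the tmux invocations for a grid of interactive
# agent panes. Returned as argv lists so they can be inspected first
# or passed straight to subprocess.run.
def tmux_grid_cmds(session: str, agent_cmds: list[str]) -> list[list[str]]:
    """tmux commands for one detached session with one pane per agent."""
    cmds = [["tmux", "new-session", "-d", "-s", session, agent_cmds[0]]]
    for cmd in agent_cmds[1:]:
        cmds.append(["tmux", "split-window", "-t", session, cmd])
    cmds.append(["tmux", "select-layout", "-t", session, "tiled"])
    return cmds
```

Attach with `tmux attach -t <session>` to watch any pane, or take over its agent directly.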
Fast app-to-prod loop with the Codex app (from a live demo)
- Romain Huet highlights a <30 min workflow: scaffold the app, use docs + Playwright MCP, add features with plan mode, then use skills for OpenAI image generation and Vercel deploy.
- Demo link: https://x.com/kagigz/status/2027444590895063313.
Spec-first → async agent run against a real repo (Simon Willison)
- Willison’s loop: brainstorm the use case with Claude, have Claude write a spec, then kick off an asynchronous Claude Code “for web” research project against his simonw/research repo to turn the spec into working code.
- Shipped artifacts:
Context-window hygiene via “stop-and-reset” loops (Ringo/OpenClaw example)
- Ringo’s “RALPH loop” executes a task markdown file one step at a time, then stops so the next step starts with a fresh context window.
- Practical takeaway: if your runs degrade over time, consider deliberately chunking work into restartable steps instead of trying to one-shot long horizons.
Safety guardrails for agentic tools with destructive capabilities (OpenClaw talk)
- Patterns called out: mandatory confirmations for destructive actions, sandboxing/read-only modes, and using a separate phone number/SIM for the bot.
- Failure mode to design around: rules stored only in the model’s working memory can be lost after context compaction—leading to destructive behavior.
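Two of those patterns can be sketched together: a hard confirmation gate in the tool layer, and rules that are re-read on every call so they survive context compaction instead of living only in the model’s working memory. The tool names and rules format here are illustrative assumptions, not the talk’s actual implementation:

```python
# Hedged sketch: guardrails enforced outside the model's context.
DESTRUCTIVE = {"delete_repo", "send_sms", "drop_table"}

def load_rules() -> set[str]:
    # In a real system this would read a rules file from disk on every
    # call; hard-coded here so the sketch is self-contained.
    return {"delete_repo:forbidden"}

def call_tool(name: str, confirmed: bool = False) -> str:
    rules = load_rules()                    # re-read each call, never cached
    if f"{name}:forbidden" in rules:
        return "blocked-by-rule"
    if name in DESTRUCTIVE and not confirmed:
        return "needs-confirmation"         # never auto-run destructive tools
    return "executed"
```

Because the gate lives in the tool layer, a compaction that drops the model’s memory of the rules cannot bypass it.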
Eval realism check: scaffolding juice is real, but overfit risk is too
- METR’s Joel Becker describes harness/scaffold tuning for high performance on dev tasks while trying to avoid overfitting; they invest heavily in scaffolds to upper-bound model capabilities for safety analysis.
- He also notes how measuring productivity got harder: developers may refuse “AI-disallowed” randomization, and today’s concurrent workflows (multiple issues in parallel) don’t fit old study designs .
👤 PEOPLE TO WATCH
- Andrej Karpathy — concrete, instrumented look at why “agent research orgs” are still messy: implementation is easy; ideas + rigor are the bottleneck.
- Boris Cherny (Claude Code) — shipping practical agent “skills” that encode repeatable team workflows: /simplify and /batch, plus Remote Control rollout details.
- Romain Huet (OpenAI/Codex) — curating high-signal Codex workflows and capability examples (rapid app shipping; low-level infra tasks).
- Max Woolf — detailed “skeptic tries agent coding” writeup; notable claim that Opus 4.6/Codex 5.3 feel “an order of magnitude better” for complex tasks than models from months earlier.
- Simon Willison — repeatable “spec → async agent run → deploy” patterns with publicly inspectable artifacts.
🎬 WATCH & LISTEN
1) OpenClaw Manila — Ringo’s “idea → live prototype” loop (≈24:15–27:55)
How it works under the hood: a ReAct-style loop that writes a task file, executes one task per fresh context window, and uses infra integrations (GitHub/Cloudflare/etc.) to ship prototypes fast.
2) METR (Joel Becker) — harness/scaffold tuning and the overfit trap (≈56:25–57:35)
A grounded explanation of why different harnesses can swing results—and why METR invests in scaffolds to estimate “best possible” model capability without fooling themselves via overfitting.
📊 PROJECTS & REPOS
DeerFlow 2.0 (ByteDance) — long-horizon agent architecture
- Rebuilt on LangGraph 1.0 with planning, long-term memory, file system, and skills.
- Repo: https://github.com/bytedance/deer-flow
- Prior version: 20k+ GitHub stars.
Unicode Explorer (Simon Willison) — binary search over HTTP range requests
Rust wordcloud CLI (Claude Code-built) — small, shippable agent output
Decompile-driven porting example (Huntley link roundup)
- ls → Rust port via objdump: https://github.com/DanielJoyce/ls-rs
Ben Tossell’s “files interface” (open-source, looking for testers)
- Described as an API + frontend that looks IDE-like, designed so agents can extend it.
Editorial take: Raw coding is getting solved; the leverage is moving to orchestration + isolation + guardrails—and the hardest remaining gap is still tasteful, rigorous idea generation, not implementation.