Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars
Feb 27
6 min read
155 docs
Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

🔥 TOP SIGNAL

Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is orchestrating a modest set of agents with control/traceability, not running huge swarms. In practice, that shows up as spec-first work: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement.

🛠️ TOOLS & MODELS

  • Codex CLI v0.105 (major QoL upgrade)

    • New: syntax highlighting, dictate prompts by holding spacebar, better multi-agent workflows, improved approval controls, plus other QoL changes.
    • Install/upgrade: $ npm i -g @openai/codex@latest.
    • Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now”.
  • Codex app (Windows) — first waitlist batch invited

    • Team says they’ll “expand from there” as they iterate through feedback.
  • Model preference + benchmarking signals (Codex 5.3)

    • Mitchell Hashimoto: Codex 5.3 felt “much more effective” than Opus 4.6; after switching back and forth, he hasn’t touched Opus for a week.
    • Romain Huet: GPT-5.3-Codex hit 90% on IBench at xhigh reasoning; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore”.
    • Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months”.
  • Cursor — Bugbot Autofix (PR issues → auto-fixes)

  • Devin AI (real production debugging)

    • swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix.
  • FactoryAI Droids — “Missions” + terminal “Mission Control”

    • “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work.
    • Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress.
    • Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime.
  • OpenClaw — new beta bits

    • Adds: external secrets management (openclaw secrets), CP thread-bound agents, WebSocket support for Codex, and Codex/Claude Code as first-class subagents via ACP.
  • Omarchy 3.4 — agent features shipped

    • Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup.
  • Harbor framework — shared agent eval infra momentum

    • Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice).
    • swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents.

💡 WORKFLOWS & TRICKS

  • Spec-driven agent work (make the spec the artifact)

    • Brendan Long’s repeatable loop for large vibe-coded apps:
      1. Write a GitHub issue
      2. If it’s complex, have an agent produce a plan and update the issue
      3. Have another agent read the issue and implement it
    • He claims a detailed enough issue is “90% of the work” and rewriting it is often what fixes problems.
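The three-step loop above can be sketched as a small driver. Everything here is hypothetical: run_agent stands in for whatever agent CLI or API you actually invoke, and the prompts are placeholders, not Brendan Long’s wording.

```python
# Hypothetical sketch of the issue -> plan -> implement loop.
# run_agent() is a stand-in for a real agent invocation (CLI call, API request, etc.).

def run_agent(role: str, prompt: str) -> str:
    """Placeholder for an agent call; returns a fake transcript."""
    return f"[{role} output for: {prompt[:40]}...]"

def spec_driven_loop(issue_text: str, complex_task: bool = False) -> str:
    # Step 1: the GitHub issue itself is the artifact -- most of the work is writing it.
    issue = issue_text
    # Step 2: for complex tasks, have one agent expand the issue into a plan
    # and fold the plan back into the issue.
    if complex_task:
        plan = run_agent("planner", f"Produce an implementation plan for:\n{issue}")
        issue = issue + "\n\n## Plan\n" + plan
    # Step 3: a second agent reads the (possibly updated) issue and implements it.
    return run_agent("implementer", f"Implement the following issue:\n{issue}")

result = spec_driven_loop("Add CSV export to the reports page", complex_task=True)
```

The point of the structure is that each step consumes and produces the same artifact (the issue), so you can swap agents, rerun any step, or rewrite the issue when results go wrong.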
  • Enterprise-grade orchestration guidance (modest fleets, strong controls)

    • Addy Osmani’s concrete advice: spend 30–40% of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns.
    • For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable.
    • He also flags the real bottleneck: “Not generation, but coordination”.
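One way to make that concrete is to materialize the spec and gathered context as files an agent can read. A minimal sketch, assuming a resources/ directory with a markdown spec; the layout and section names are my assumptions, not a format Osmani prescribes.

```python
# Hypothetical sketch: write the spec (constraints, success criteria, stack)
# into a resources/ directory next to the gathered context.
from pathlib import Path
import tempfile

SPEC_TEMPLATE = """\
# Spec: {title}

## Constraints
{constraints}

## Success criteria
{criteria}

## Stack / architecture
{stack}
"""

def write_spec(root: Path, title: str, constraints: str, criteria: str, stack: str) -> Path:
    resources = root / "resources"        # gathered context lives alongside the spec
    resources.mkdir(parents=True, exist_ok=True)
    spec_path = resources / "SPEC.md"
    spec_path.write_text(SPEC_TEMPLATE.format(
        title=title, constraints=constraints, criteria=criteria, stack=stack))
    return spec_path

root = Path(tempfile.mkdtemp())
spec = write_spec(root, "CSV export",
                  constraints="No new dependencies; must stream large files.",
                  criteria="Report downloads as valid CSV; e2e test passes.",
                  stack="TypeScript, Next.js, Postgres.")
```

A checked-in file like this is also the natural place for the team-level best practices he mentions, since every agent run can be pointed at the same directory.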
  • Close the loop: isolate the runtime so agents can run it

    • Kent C. Dodds: “get your app running in an isolated environment to close the agent loop”.
    • He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier.
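A minimal illustration of what “closing the loop” buys you: the app runs locally and offline, and there is one command an agent can run to prove it is up. The stdlib server below stands in for a real app; the /health endpoint and smoke_check name are my assumptions.

```python
# Sketch: run the "app" in isolation and give the agent a single verifiable check.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    """Tiny stand-in for your real app: answers 200 "ok" to any GET."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # keep the check's output clean
        pass

def smoke_check() -> str:
    # Port 0 -> the OS picks a free port, so parallel agent runs never collide.
    server = HTTPServer(("127.0.0.1", 0), Health)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/health"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode()
    finally:
        server.shutdown()
```

An agent that can invoke smoke_check (or the real equivalent: start the app, hit an endpoint, run tests) can verify its own changes instead of handing you unverified diffs.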
  • Hoard working examples, then recombine (prompt with concrete known-good snippets)

    • Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly.
    • His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses.
    • Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies curl explicitly to fetch raw HTML instead of a summarizing fetch tool.
  • Tests aren’t a moat anymore (agents can recreate them fast)

    • tldraw moved its tests to a closed-source repo to deter “Slop Fork”-style forks.
    • Armin Ronacher’s counterpoint: agents can quickly generate language- and implementation-agnostic test suites if there’s a reference implementation.
  • Security footnote from a vibe-coded app

    • In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., /next, /prev), which he notes opens up CSRF vulnerabilities—though he didn’t mind for this particular app.

👤 PEOPLE TO WATCH

  • Addy Osmani (Google Cloud AI) — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn.
  • Simon Willison — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts).
  • Brendan Long — practical decomposition: write issues like a system design interview, then let agents execute.
  • Nicholas Moy (DeepMind) — framing: the “10x engineer” becomes a “10-agent orchestrator,” measured by how many concurrent agents you can run effectively.
  • Dylan Patel (Semianalysis) — adoption signal: Claude Code’s share of GitHub commits went from 2% to 4% in a month, with a broader estimate of total AI-written code at ~10%.

🎬 WATCH & LISTEN

1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)

Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms.

2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)

Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused.

📊 PROJECTS & REPOS


Editorial take: The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.