ZeroNoise
Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars
Feb 27
6 min read
155 docs
Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

🔥 TOP SIGNAL

Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is orchestrating a modest set of agents with control and traceability, not running huge swarms. In practice, that shows up as spec-first work: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement.

🛠️ TOOLS & MODELS

  • Codex CLI v0.105 (major QoL upgrade)

    • New: syntax highlighting, dictate prompts by holding spacebar, better multi-agent workflows, improved approval controls, plus other QoL changes.
    • Install/upgrade: `npm i -g @openai/codex@latest`
    • Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now”.
  • Codex app (Windows) — first waitlist batch invited

    • Team says they’ll “expand from there” as they iterate through feedback.
  • Model preference + benchmarking signals (Codex 5.3)

    • Mitchell Hashimoto: Codex 5.3 felt “much more effective” than Opus 4.6; after switching back-and-forth, he hasn’t touched Opus for a week.
    • Romain Huet: GPT-5.3-Codex hit 90% on IBench at xhigh reasoning; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore”.
    • Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months”.
  • Cursor — Bugbot Autofix (PR issues → auto-fixes)

  • Devin AI (real production debugging)

    • swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix.
  • FactoryAI Droids — “Missions” + terminal “Mission Control”

    • “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work.
    • Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress.
    • Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime.
  • OpenClaw — new beta bits

    • Adds: external secrets management (openclaw secrets), CP thread-bound agents, WebSocket support for Codex, and Codex/Claude Code as first-class subagents via ACP.
  • Omarchy 3.4 — agent features shipped

    • Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup.
  • Harbor framework — shared agent eval infra momentum

    • Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice).
    • swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents.

💡 WORKFLOWS & TRICKS

  • Spec-driven agent work (make the spec the artifact)

    • Brendan Long’s repeatable loop for large vibe-coded apps:
      1. Write a GitHub issue
      2. If it’s complex, have an agent produce a plan and update the issue
      3. Have another agent read the issue and implement it
    • He claims a detailed enough issue is “90% of the work” and rewriting it is often what fixes problems.
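A hypothetical skeleton of the kind of issue this loop relies on (the structure and names below are illustrative, not Long’s actual template):

```markdown
## Goal
Add CSV export to the reports page.

## Constraints
- Reuse the existing `ReportQuery` service; no new dependencies.
- Must stream rows (reports can exceed 100k rows).

## Success criteria
- `GET /reports/:id/export.csv` returns valid CSV.
- Existing report tests still pass; a new test covers the streaming path.

## Plan
_(a planning agent appends concrete steps here before implementation starts)_
```

The point is that the issue, not the chat transcript, becomes the durable artifact: the implementing agent reads only this, so anything missing here is missing from the work.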
  • Enterprise-grade orchestration guidance (modest fleets, strong controls)

    • Addy Osmani’s concrete advice: spend 30–40% of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns.
    • For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable.
    • He also flags the real bottleneck: “Not generation, but coordination”.
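One lightweight way to follow the resources-directory advice (a generic layout sketch, not Osmani’s prescribed structure):

```text
resources/
  spec.md           # constraints, success criteria, non-goals
  architecture.md   # stack choices and why, so agents don't fall back
                    # to "lowest common denominator" defaults
  conventions.md    # team best practices the agent must follow
  examples/         # known-good snippets to recombine
```

Pointing every agent at the same directory is the cheap version of the MCP-callable systems he mentions: shared context, versioned with the repo.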
  • Close the loop: isolate the runtime so agents can run it

    • Kent C. Dodds: “get your app running in an isolated environment to close the agent loop”.
    • He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier.
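One common way to get an isolated, offline-runnable environment (a generic sketch, not Epic Stack’s actual setup) is a compose file the agent can bring up and test against:

```yaml
# docker-compose.yml: everything local, no external services to stub.
services:
  app:
    build: .
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app
    ports: ["3000:3000"]
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
```

With something like this, an agent can run `docker compose up`, hit `localhost:3000`, and verify its own changes without touching shared infrastructure—which is what “closing the loop” means in practice.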
  • Hoard working examples, then recombine (prompt with concrete known-good snippets)

    • Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly.
    • His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses.
    • Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies curl explicitly to fetch raw HTML instead of a summarizing fetch tool.
  • Tests aren’t a moat anymore (agents can recreate them fast)

    • tldraw moved its tests to a closed-source repo to prevent “Slop Fork”-style forks.
    • Armin Ronacher’s counterpoint: agents can generate language/implementation-agnostic test suites quickly if there’s a reference implementation.
  • Security footnote from a vibe-coded app

    • In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., /next, /prev), which he notes opens up CSRF vulnerabilities, a risk he chose not to worry about for that application.

👤 PEOPLE TO WATCH

  • Addy Osmani (Google Cloud AI) — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn.
  • Simon Willison — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts).
  • Brendan Long — practical decomposition: write issues like a system design interview, then let agents execute.
  • Nicholas Moy (DeepMind) — framing: “10x engineer” becomes “10 agent orchestrator,” measured by concurrent agents you can run effectively.
  • Dylan Patel (Semianalysis) — adoption signal: Claude Code’s share of GitHub commits going 2%→4% in a month, with a broader estimate of roughly 10% of all code now AI-written.

🎬 WATCH & LISTEN

1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)

Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms.

2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)

Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused.

Editorial take: The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.

Summary
Coverage: Feb 26 at 7:00 AM – Feb 27 at 7:00 AM
Frequency: Daily
Published: Feb 27 at 8:09 AM
Reading time: 6 min
Research time: 4 hrs 11 min
Documents scanned: 155
Documents used: 29
Citations: 55
Sources monitored: 110 / 110
Source details
Source Docs Insights Status
Lukas Möller 0 0
Jediah Katz 0 0
Aman Karmani 0 0
Jacob Jackson 0 0
Cursor Blog | RSS Feed 0 0
Nicholas Moy 3 2
Mike Krieger 0 0
Sualeh Asif 0 0
Michael Truell 0 0
Google Antigravity 3 0
Aman Sanger 0 0
cat 0 0
Mark Chen 0 0
Greg Brockman 4 0
Tongzhou Wang 0 0
fouad 0 0
Calvin French-Owen 0 0
Hanson Wang 0 0
Ed Bayes 4 0
Alexander Embiricos 4 1
Tibo 3 2
Romain Huet 4 1
DHH 12 1
Jane Street Blog 0 0
Miguel Grinberg's Blog: AI 0 0
xxchan's Blog 0 0
<antirez> 0 0
Brendan Long 1 1
The Pragmatic Engineer 0 0
David Heinemeier Hansson 0 0
Armin Ronacher ⇌ 6 1
Mitchell Hashimoto 0 0
Armin Ronacher's Thoughts and Writings 0 0
Peter Steinberger 0 0
Theo - t3.gg 3 0
Sourcegraph 0 0
Anthropic 1 0
Cursor 0 0
LangChain 0 0
Anthropic 0 0
LangChain Blog 0 0
LangChain 3 0
Cursor 2 1
Riley Brown 0 0
Riley Brown 3 0
Jason Zhou 0 0
Boris Cherny 4 0
Mckay Wrigley 4 0
geoff 17 1
Peter Steinberger 🦞 6 1
AI Jason 0 0
Alex Albert 0 0
Latent.Space 1 0
Logan Kilpatrick 2 0
Fireship 0 0
Fireship 1 0
Kent C. Dodds ⚡ 7 2
Practical AI 0 0
Practical AI Clips 0 0
Stories by Steve Yegge on Medium 0 0
Kent C. Dodds Blog 0 0
ThePrimeTime 0 0
Theo - t3.gg 1 0
ThePrimeagen 11 0
Ben Tossell 5 1
swyx 23 4
AI For Developers 0 0
Geoffrey Huntley 0 0
Addy Osmani 3 0
Andrej Karpathy 0 0
Simon Willison 6 1
Matthew Berman 0 0
Changelog 0 0
Simon Willison’s Newsletter 1 1
Agentic Coding Newsletter 0 0
Latent Space 1 1
Simon Willison's Weblog 2 2
Elevate 0 0
Lukas Möller 0 0
Jediah Katz 0 0
Sualeh Asif 0 0
Mike Krieger 0 0
Michael Truell 0 0
Cat Wu 0 0
Kevin Hou 0 0
Aman Sanger 0 0
Nicholas Moy 0 0
Andrey Mishchenko 0 0
Jerry Tworek 0 0
Romain Huet 0 0
Thibault Sottiaux 0 0
Alexander Embiricos 0 0
xxchan 0 0
Salvatore Sanfilippo 1 0
Armin Ronacher 0 0
David Heinemeier Hansson (DHH) 0 0
Alex Albert 0 0
Logan Kilpatrick 0 0
Shawn "swyx" Wang 2 1
Jason Zhou 0 0
Riley Brown 0 0
McKay Wrigley 0 0
Boris Cherny 0 0
Ben Tossell 0 0
Geoffrey Huntley 0 0
Peter Steinberger 0 0
Addy Osmani 1 1
Simon Willison 0 0
Andrej Karpathy 0 0
Harrison Chase 0 0