Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars
Feb 27
6 min read
155 docs
Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

🔥 TOP SIGNAL

Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is orchestrating a modest set of agents with control/traceability, not running huge swarms. In practice, that shows up as spec-first work: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement.

🛠️ TOOLS & MODELS

  • Codex CLI v0.105 (major QoL upgrade)

    • New: syntax highlighting, dictate prompts by holding spacebar, better multi-agent workflows, improved approval controls, plus other QoL changes.
    • Install/upgrade: $ npm i -g @openai/codex@latest.
    • Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now”.
  • Codex app (Windows) — first waitlist batch invited

    • Team says they’ll “expand from there” as they iterate through feedback.
  • Model preference + benchmarking signals (Codex 5.3)

    • Mitchell Hashimoto: Codex 5.3 felt “much more effective” than Opus 4.6; after switching back and forth, he hasn’t touched Opus for a week.
    • Romain Huet: GPT-5.3-Codex hit 90% on IBench at xhigh reasoning; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore”.
    • Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months”.
  • Cursor — Bugbot Autofix (PR issues → auto-fixes)

  • Devin AI (real production debugging)

    • swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix.
  • FactoryAI Droids — “Missions” + terminal “Mission Control”

    • “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work.
    • Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress.
    • Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime.
  • OpenClaw — new beta bits

    • Adds: external secrets management (openclaw secrets), CP thread-bound agents, WebSocket support for Codex, and Codex/Claude Code as first-class subagents via ACP.
  • Omarchy 3.4 — agent features shipped

    • Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup.
  • Harbor framework — shared agent eval infra momentum

    • Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice).
    • swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents.

💡 WORKFLOWS & TRICKS

  • Spec-driven agent work (make the spec the artifact)

    • Brendan Long’s repeatable loop for large vibe-coded apps:
      1. Write a GitHub issue
      2. If it’s complex, have an agent produce a plan and update the issue
      3. Have another agent read the issue and implement it
    • He claims a detailed enough issue is “90% of the work” and rewriting it is often what fixes problems.
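The three-step loop above can be sketched as a small driver. Everything here is hypothetical: run_agent stands in for whatever agent CLI or API you actually invoke, and the prompts are placeholders, not Brendan Long’s wording.

```python
# Hypothetical sketch of the issue -> plan -> implement loop.
# run_agent() is a stand-in for a real agent invocation (CLI call, API request, etc.).

def run_agent(role: str, prompt: str) -> str:
    """Placeholder for an agent call; returns a fake transcript."""
    return f"[{role} output for: {prompt[:40]}...]"

def spec_driven_loop(issue_text: str, complex_task: bool = False) -> str:
    # Step 1: the GitHub issue itself is the artifact -- most of the work is writing it.
    issue = issue_text
    # Step 2: for complex tasks, have one agent expand the issue into a plan
    # and fold the plan back into the issue.
    if complex_task:
        plan = run_agent("planner", f"Produce an implementation plan for:\n{issue}")
        issue = issue + "\n\n## Plan\n" + plan
    # Step 3: a second agent reads the (possibly updated) issue and implements it.
    return run_agent("implementer", f"Implement the following issue:\n{issue}")

result = spec_driven_loop("Add CSV export to the reports page", complex_task=True)
```

The point of the structure is that each step consumes and produces the same artifact (the issue), so you can swap agents, rerun any step, or rewrite the issue when results go wrong.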
  • Enterprise-grade orchestration guidance (modest fleets, strong controls)

    • Addy Osmani’s concrete advice: spend 30–40% of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns.
    • For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable.
    • He also flags the real bottleneck: “Not generation, but coordination”.
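One way to make that concrete is to materialize the spec and gathered context as files an agent can read. A minimal sketch, assuming a resources/ directory with a markdown spec; the layout and section names are my assumptions, not a format Osmani prescribes.

```python
# Hypothetical sketch: write the spec (constraints, success criteria, stack)
# into a resources/ directory next to the gathered context.
from pathlib import Path
import tempfile

SPEC_TEMPLATE = """\
# Spec: {title}

## Constraints
{constraints}

## Success criteria
{criteria}

## Stack / architecture
{stack}
"""

def write_spec(root: Path, title: str, constraints: str, criteria: str, stack: str) -> Path:
    resources = root / "resources"        # gathered context lives alongside the spec
    resources.mkdir(parents=True, exist_ok=True)
    spec_path = resources / "SPEC.md"
    spec_path.write_text(SPEC_TEMPLATE.format(
        title=title, constraints=constraints, criteria=criteria, stack=stack))
    return spec_path

root = Path(tempfile.mkdtemp())
spec = write_spec(root, "CSV export",
                  constraints="No new dependencies; must stream large files.",
                  criteria="Report downloads as valid CSV; e2e test passes.",
                  stack="TypeScript, Next.js, Postgres.")
```

A checked-in file like this is also the natural place for the team-level best practices he mentions, since every agent run can be pointed at the same directory.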
  • Close the loop: isolate the runtime so agents can run it

    • Kent C. Dodds: “get your app running in an isolated environment to close the agent loop”.
    • He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier.
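A minimal illustration of what “closing the loop” buys you: the app runs locally and offline, and there is one command an agent can run to prove it is up. The stdlib server below stands in for a real app; the /health endpoint and smoke_check name are my assumptions.

```python
# Sketch: run the "app" in isolation and give the agent a single verifiable check.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    """Tiny stand-in for your real app: answers 200 "ok" to any GET."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # keep the check's output clean
        pass

def smoke_check() -> str:
    # Port 0 -> the OS picks a free port, so parallel agent runs never collide.
    server = HTTPServer(("127.0.0.1", 0), Health)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/health"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode()
    finally:
        server.shutdown()
```

An agent that can invoke smoke_check (or the real equivalent: start the app, hit an endpoint, run tests) can verify its own changes instead of handing you unverified diffs.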
  • Hoard working examples, then recombine (prompt with concrete known-good snippets)

    • Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly.
    • His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses.
    • Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies curl explicitly to fetch raw HTML instead of a summarizing fetch tool.
  • Tests aren’t a moat anymore (agents can recreate them fast)

    • tldraw moved its tests to a closed-source repo to deter “Slop Fork”-style forks.
    • Armin Ronacher’s counterpoint: agents can quickly generate language- and implementation-agnostic test suites if there’s a reference implementation.
  • Security footnote from a vibe-coded app

    • In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., /next, /prev), which he notes opens up CSRF vulnerabilities—though he didn’t mind for this particular app.

👤 PEOPLE TO WATCH

  • Addy Osmani (Google Cloud AI) — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn.
  • Simon Willison — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts).
  • Brendan Long — practical decomposition: write issues like a system design interview, then let agents execute.
  • Nicholas Moy (DeepMind) — framing: the “10x engineer” becomes a “10-agent orchestrator,” measured by how many concurrent agents you can run effectively.
  • Dylan Patel (Semianalysis) — adoption signal: Claude Code’s share of GitHub commits went from 2% to 4% in a month, with a broader estimate of total AI-written code at ~10%.

🎬 WATCH & LISTEN

1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)

Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms.

2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)

Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused.

📊 PROJECTS & REPOS


Editorial take: The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.