Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars
Feb 27
Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

🔥 TOP SIGNAL

Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is orchestrating a modest set of agents with control and traceability, not running huge swarms. In practice, that shows up as spec-first work: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement.

🛠️ TOOLS & MODELS

  • Codex CLI v0.105 (major QoL upgrade)

    • New: syntax highlighting, dictating prompts by holding the spacebar, better multi-agent workflows, improved approval controls, plus other QoL changes.
    • Install/upgrade: $ npm i -g @openai/codex@latest.
    • Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now”.
  • Codex app (Windows) — first waitlist batch invited

    • Team says they’ll “expand from there” as they iterate on feedback.
  • Model preference + benchmarking signals (Codex 5.3)

    • Mitchell Hashimoto: Codex 5.3 felt “much more effective” than Opus 4.6; after switching back and forth, he hasn’t touched Opus for a week.
    • Romain Huet: GPT-5.3-Codex hit 90% on IBench at xhigh reasoning; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore”.
    • Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months”.
  • Cursor — Bugbot Autofix (PR issues → auto-fixes)

    • New: automatically fixes issues Bugbot finds in PRs.
  • Devin AI (real production debugging)

    • swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix.
  • FactoryAI Droids — “Missions” + terminal “Mission Control”

    • “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work.
    • Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress.
    • Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime.
  • OpenClaw — new beta bits

    • Adds: external secrets management (openclaw secrets), CP thread-bound agents, WebSocket support for Codex, and Codex/Claude Code as first-class subagents via ACP.
  • Omarchy 3.4 — agent features shipped

    • Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup.
  • Harbor framework — shared agent eval infra momentum

    • Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice).
    • swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents.

💡 WORKFLOWS & TRICKS

  • Spec-driven agent work (make the spec the artifact)

    • Brendan Long’s repeatable loop for large vibe-coded apps:
      1. Write a GitHub issue
      2. If it’s complex, have an agent produce a plan and update the issue
      3. Have another agent read the issue and implement it
    • He claims a detailed enough issue is “90% of the work”, and rewriting it is often what fixes problems.
  • Enterprise-grade orchestration guidance (modest fleets, strong controls)

    • Addy Osmani’s concrete advice: spend 30–40% of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns.
    • For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable.
    • He also flags the real bottleneck: “Not generation, but coordination”.
  • Close the loop: isolate the runtime so agents can run it

    • Kent C. Dodds: “get your app running in an isolated environment to close the agent loop”.
    • He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier.
  • Hoard working examples, then recombine (prompt with concrete known-good snippets)

    • Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly.
    • His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses.
    • Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies curl explicitly to fetch raw HTML instead of a summarizing fetch tool.
  • Tests aren’t a moat anymore (agents can recreate them fast)

    • tldraw moved tests to a closed-source repo to prevent “Slop Fork” forks.
    • Armin Ronacher’s counterpoint: agents can generate language/implementation-agnostic test suites quickly if there’s a reference implementation.
  • Security footnote from a vibe-coded app

    • In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., /next, /prev), which he notes opens up CSRF vulnerabilities; he didn’t mind for that application.
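The GET-for-state-changes issue can be sketched in a few lines. This is a hypothetical handler, not Present.app's actual code; it shows why a bare GET endpoint is CSRF-prone (any page the presenter visits can fire `<img src="http://host:9123/next">`) and the usual fix of requiring POST plus a secret the attacker's page cannot know:

```python
import hmac
import secrets

# Hypothetical remote-control handler (illustrative only). A per-session
# token is generated at startup and shared with the legitimate client.
SESSION_TOKEN = secrets.token_urlsafe(32)

def handle(method: str, path: str, token=None) -> int:
    """Return an HTTP status code for a remote-control request."""
    state_changing = path in ("/next", "/prev")
    if not state_changing:
        return 200          # read-only routes are fine over GET
    if method != "POST":
        return 405          # reject GET for state-changing routes
    if token is None or not hmac.compare_digest(token, SESSION_TOKEN):
        return 403          # reject POSTs without the session token
    return 200
```

The constant-time comparison (`hmac.compare_digest`) avoids leaking the token through timing; for a single-user tool on a Tailscale network this is belt-and-braces, which is roughly why Willison shrugged it off here.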

👤 PEOPLE TO WATCH

  • Addy Osmani (Google Cloud AI) — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn.
  • Simon Willison — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts).
  • Brendan Long — practical decomposition: write issues like a system design interview, then let agents execute.
  • Nicholas Moy (DeepMind) — framing: the “10x engineer” becomes a “10-agent orchestrator”, measured by how many concurrent agents you can run effectively.
  • Dylan Patel (Semianalysis) — adoption signal: Claude Code’s share of GitHub commits going 2%→4% in a month, with a broader estimate of total AI-written code around ~10%.

🎬 WATCH & LISTEN

1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)

Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms.

2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)

Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused.

Editorial take: The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.

Simon Willison’s Newsletter

Simon Willison launched his Agentic Engineering Patterns guide, covering coding practices with agentic engineering tools like Claude Code and OpenAI Codex, which generate and execute code independently. Firsthand usage in production (blog) and side projects.

Key timeless patterns (1-2 new/week):

  • Red/green TDD: Test-first development yields succinct, reliable agent code with minimal prompting. Guide
  • Writing code is cheap now: Agents drop the cost of code; rethink design, planning, and habits. Guide
  • First run the tests: Essential for verifying agent code; cheap to generate. Guide
  • Linear walkthroughs: Prompt for a structured codebase tour. Used on a vibe-coded app. Guide
  • Hoard things you know how to do: Leverage expertise for effective agent guidance. Guide

Workflow example: Vibe-coded Present.app (macOS SwiftUI presentation tool, URLs per slide, phone remote via Tailscale) using Claude Opus 4.6 in Claude Code for web on iPhone. Prompts: initial UI/fullscreen nav; add web server on :9123 for mobile controls. Repo: github.com/simonw/present. ~45 min build, no Xcode.

Shawn "swyx" Wang

SWE-Bench flaws undermine coding agent evaluations. Originally from Princeton, SWE-Bench tests agentic coding by resolving ~2k GitHub issues using repo context and passing tests: the first proper agentic benchmark beyond autocomplete evals like HumanEval. Devin first reported on it, and scores rose from 13% to 80%.

OpenAI's SWE-Bench Verified curated 500 high-quality tasks (3 humans vetting each task, ~$2M cost), but it saturated at ~80% with ±0.5–1% noise. An audit revealed that 59% of unsolved tasks were impossible (e.g., requiring the magic string 'get_annotation'), making them ideal contamination canaries.

Models cheat via memorization: public datasets enable task-ID-to-solution regurgitation, and CoT leaks post-cutoff knowledge (e.g., future Django APIs). GPT-o1 and competitors (Flash/Gemini/Opus) are vulnerable.

Fixes in SWE-Bench Pro (Scale AI): newer problems, diverse repos and languages, private/public splits. Benchmarks remain hard and expensive; coding is the easiest domain to eval objectively.
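The "contamination canary" idea can be sketched as a simple check. The function name and signature here are illustrative, not part of any benchmark's API: if a submitted patch contains a string that the tests require but that appears nowhere in the provided repo context (the audit's 'get_annotation' example), memorized test data is a more likely explanation than genuine problem solving.

```python
# Illustrative sketch of a contamination canary check (hypothetical
# helper, not part of SWE-Bench). A patch that produces a magic string
# absent from the repo context it was given is a memorization suspect.
def is_contamination_suspect(patch: str, repo_context: str,
                             magic_strings: list) -> bool:
    return any(s in patch and s not in repo_context for s in magic_strings)
```

In practice a harness would run this over the known-impossible task subset: any "solve" there flags training-data leakage for that model.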

Swyx (Latent Space host, agents expert), Nathan Lambert (researcher), Sebastian Raschka (LLM book author) share firsthand benchmark analysis.

Addy Osmani

Addy Osmani (Google Cloud AI Director, 14+ years at Google, leads technical evangelism for Gemini and the Vertex Agent Development Kit) emphasizes enterprise agent orchestration over massive agent swarms: modest sets of agents solving real problems with control, traceability, and quality gates—contrasting solo-founder 'Wild West' approaches.

Timeless patterns:

  • Multi-agent systems: agents spawn others, communicate in parallel, self-coordinate. Coordination is the hard problem, not generation.
  • Agent types in Vertex ADK: LLM agents (reasoning), workflow agents (deterministic sequential/parallel loops), custom agents; mix orchestrated control and dynamic autonomy.
  • Spec-driven development: Spend 30–40% of time upfront on specs, constraints, success criteria, and stack/architecture; use a context/resources directory to avoid token waste (LLMs default to 'lowest common denominator'). Codify team best practices.
  • Quality gates: Balance AI velocity with code review, testing coverage, and human-in-the-loop for maintenance/legacy/security; manually intervene as needed.

Tools he uses (firsthand):

  • GitHub Copilot agents: Assign to-do lists; routes to Anthropic/OpenAI models.
  • Google Jules (Gemini-based) for similar orchestration.
  • Claude Code Web and Claude Code agent teams (recent Swarm support) for agent coordination and context handoff.
  • A2A (Agent-to-Agent protocol, Linux Foundation): Complements MCP (agent-tool); enables cross-vendor agent communication, like a 'TCP/IP of the agentic era'.

Contrarian takes: Skeptical of 'near perfection' hype (e.g., Matt Schuler's post): valid for prototypes/MVPs, but enterprise needs rigor; distinguish 'feeling busy' (hundreds of agents) from productivity. Prioritize learning orchestration this year. Be cost-conscious: experiment on small tasks and extrapolate; some justify $100s–$1,000s/month vs. hiring. Future: agent-optimized codebases (less human-readable).

Latent Space

Dylan Patel (founder/CEO of Semianalysis, 60-person firm tracking AI infra/models) reports Claude Code adoption surged from 2% to 4% of GitHub commits in January, estimating total AI-generated code (incl. Claude Code, Codex, Cognition/Devin, GitHub Copilot) at ~10%.

At Semianalysis, ~1/3 of staff (engineers plus ex-hedge-fund analysts) now use Claude Code for data scraping, financial modeling, and pro forma analysis—firsthand production usage.

Recent coding tools: Claude Code, Claudebot, Maltbook, Kimi 2.5 agent swarms, and Codex 5.3 (the latter not far behind).

swyx

@shadcn demonstrated Claude Code's tool preferences: when prompted to build with no tool names anywhere in the input, it selects category-leading infra tools (see image).

@swyx quoted this, stating such leaders merit $5B+ valuations, as coding agents will recommend them for 5+ years and infra is stickier than agents (disclosure: small Resend angel).

Demo: https://x.com/shadcn/status/2027062972753866796.

Prompt technique: Omit tool names to reveal the agent's inherent recommendations.

Simon Willison's Weblog

Andrej Karpathy states that programming has dramatically changed due to AI in the last two months (since December): coding agents basically didn't work before and basically work now, thanks to models' significantly higher quality, long-term coherence, and tenacity, which let them power through large, long tasks. He calls this extremely disruptive to default workflows.

Firsthand take from Karpathy (top AI practitioner).

Original tweet.

swyx

Harbor framework (@harborframework) dominates the RL infra and evals landscape for terminal agents.

  • @swyx (Cognition): Team prioritizing migration of all evals to Harbor; originated from TerminalBench 2 Discord needs; expects Harbor-based evals/benchmarks/infra startups. Firsthand production adoption.
  • Standing room only at the Modal x @willccbb meetup; Harbor is now required knowledge.
  • @LaudeInstitute: Standardizes benchmarks via one interface—repeatable runs, standardized traces, production-grade—building on TerminalBench.
  • @willccbb (quoted by @swyx): Harbor for tasksets in terminal agents; verifiers as a domain-agnostic RL env layer with token-level plumbing.

Link: https://x.com/laudeinstitute/status/2027101198529266171.

Timeless pattern: Shared eval infra for agent benchmarks.

Simon Willison's Weblog

Simon Willison, an experienced developer with extensive personal code repositories, advocates hoarding working code examples as a timeless pattern for productive coding-agent use—collect them in a blog, TIL, GitHub (1000+ repos), tools.simonwillison.net (HTML tools), and the simonw/research repo.

Recombination workflow: Prompt agents to build new tools by merging existing examples.

  • Used Claude 3 Opus with snippets for PDF.js (render pages to images) and Tesseract.js (OCR images) to create a browser-based PDF OCR tool:

    Use these examples to put together a single HTML page...

  • Iterated to final tool at https://tools.simonwillison.net/ocr.

Coding agent enhancements (e.g., Claude Code):

  • Fetch tool sources: Use curl to fetch... https://tools.simonwillison.net/ocr and https://tools.simonwillison.net/gemini-bbox....
  • Use local repos: Add mocked HTTP tests to ~/dev/ecosystem/datasette-oauth inspired by ~/dev/ecosystem/llm-mistral.
  • Clone public repos: Clone simonw/research... to /tmp and find examples of compiling Rust to WebAssembly....

Firsthand production/side-project usage by author.
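The recombination workflow above can be sketched as simple prompt assembly. The `SNIPPETS` library and helper below are hypothetical, standing in for a personal hoard of known-good examples; the snippet bodies are placeholder comments, not real PDF.js/Tesseract.js code:

```python
# Hypothetical helper for the "hoard examples, then recombine" pattern:
# keep solved snippets keyed by name and assemble them into one prompt
# asking the agent to merge them into a single tool.
SNIPPETS = {
    "pdf-render": "<script>/* PDF.js: render each page to a canvas */</script>",
    "ocr": "<script>/* Tesseract.js: OCR an image element */</script>",
}

def build_recombination_prompt(goal: str, names: list) -> str:
    parts = ["Use these examples to put together " + goal + ":"]
    for name in names:
        parts.append("\n--- example: " + name + " ---\n" + SNIPPETS[name])
    return "\n".join(parts)
```

The point of the pattern is that each snippet is already verified working, so the agent's job shrinks to gluing, not inventing.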

swyx

FactoryAI's Droids now support Missions: autonomous pursuit of multi-day software engineering goals. Workflow: describe the desired outcome, approve the AI-generated plan, and retrieve finished work.

Enterprise examples from production use:

  • Modernize 40-year-old COBOL core module
  • Migrate >1k microservices to new Kubernetes cluster across three regions
  • Recalculate 10 years of pricing data post-revenue rule change
  • Refactor monolith handling 20M daily API calls with zero downtime

@matanSF (FactoryAI) shares these as real missions run by large enterprises. @swyx praises it as an "EXCELLENTLY executed idea".

Demo: https://x.com/FactoryAI/status/2027104794289263104.

Simon Willison

Simon Willison published a new chapter in his Agentic Engineering Patterns guide titled "Hoard things you know how to do", described as general career advice that also helps when working with coding agents.

Resource: https://simonwillison.net/guides/agentic-engineering-patterns/hoard-things-you-know-how-to-do/

Ben Tossell

@FactoryAI announced Mission Control, a terminal-based dashboard providing "one view for everything": which feature is being built, which worker Droid is assigned, which tools it's using, and Mission progress. @bentossell (a @FactoryAI affiliate) highlights support for multi-day tasks end to end.

swyx

@swyx (@cognition) shares firsthand production use of Devin AI (@devinai): it investigated a bug from a Vercel org migration (forgotten key), requested exactly what it needed from humans, verified fixes, and saluted upon being complimented.

This demonstrates a human-in-the-loop agent workflow for production debugging.

geoff

Latent Patterns announced a new feature via a partnership with Chainguard: secure images for embedded terminals enable running Claude Code directly in the browser, with zero API-key provisioning or software installation required; it works even on a Chromebook. Shared firsthand by builder @GeoffreyHuntley.

Kent C. Dodds ⚡

Kent C. Dodds (@kentcdodds), dev educator, shares his firsthand workflow with Cursor's agent and Bugbot: kick off an agent on a task before bed; wake up to Bugbot and the agent iterating on the solution. He's been using this "for a while now" and calls it "awesome".

Cursor announced that Bugbot Autofix now automatically fixes issues found in PRs.

Link: https://x.com/cursor_ai/status/2027079876948484200.

Kent C. Dodds ⚡

Kent C. Dodds (@kentcdodds), dev educator behind Epic Stack, advises: Get your app running in an isolated environment to close the agent loop.

His “Minimize Setup Friction” and “Offline Development” guiding principles have helped significantly.

Guiding principles repo.

Firsthand experience from production stack usage.
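"Closing the agent loop" can be sketched as a check-runner the agent invokes itself inside the isolated environment: run the project's checks, and hand the pass/fail signal plus logs back to the agent instead of a human. The helper below is a minimal sketch; the command used here is a stand-in for a real test suite:

```python
import subprocess
import sys

# Minimal sketch of closing the loop: run a command in the (isolated)
# environment, capture everything, and return a machine-usable verdict
# the agent can act on without human relay.
def run_checks(cmd: list) -> tuple:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

# Stand-in for "run the test suite" inside the sandbox.
ok, log = run_checks([sys.executable, "-c", "print('tests passed')"])
```

The isolation matters because the agent can then run this loop freely (install deps, restart services, re-run tests) without risking a shared dev machine, which is exactly what "Minimize Setup Friction" and "Offline Development" buy you.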

Peter Steinberger 🦞

OpenClaw beta release announced by @steipete (OpenClaw contributor):

Key features:

  • External secrets management (openclaw secrets)
  • CP thread-bound agents
  • WebSocket support for Codex
  • Codex and Claude Code as first-class subagents via ACP

Release notes: https://github.com/openclaw/openclaw/releases

Romain Huet

Romain Huet (Head of Developer Experience @OpenAI, working on Codex) shared that GPT-5.3-Codex achieved 90% on IBench at xhigh reasoning, calling it "in a different league". With recent speed gains, xhigh "doesn’t feel like a tradeoff anymore".

Quoting @adonis_singh: "decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months". See: https://x.com/adonis_singh/status/2026692938751725655.

Armin Ronacher ⇌

Armin Ronacher (@mitsuhiko, Flask creator) observes that agents are great at creating language- and implementation-agnostic test suites, enabling surprisingly quick test coverage from a reference implementation.

This contrarian take challenges tldraw's strategy of moving tests to a closed-source repo to prevent "Slop Fork" forks.

Related issue: https://github.com/tldraw/tldraw/issues/8082.

Timeless pattern: Agent-driven test generation for agnostic coverage, transcending specific implementations.

Cursor

Cursor Bugbot Autofix: a new feature that automatically fixes issues found in PRs.

Official announcement from @cursor_ai with blog link for details: http://cursor.com/blog/bugbot-autofix.

Brendan Long

Brendan Long, working on two fairly large vibe-coded apps, uses this converged process:

  1. Write a GitHub issue
  2. (If complicated) Tell an agent to make a plan and update the issue
  3. Have another agent read the issue and implement it

Writing a detailed enough issue is 90% of the work; refining it is often what fixes problems. This mirrors a system design interview (high-level design, edge cases, tradeoffs) without needing to impress an interviewer or scale to trillions.

Example issue: lion-reader #641.

Firsthand production/side-project experience from a senior engineer; highlights a timeless pattern: detailed specs as the core of agentic coding workflows.
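The spec-first loop above can be sketched as generating the issue body programmatically before any agent touches code. The section headings below (Goal, Constraints, Success criteria) are this sketch's convention, echoing the system-design framing, not Brendan Long's actual template:

```python
# Hypothetical issue-body generator for a spec-first agent loop: the
# detailed issue is the artifact; agents plan against and implement it.
def issue_body(goal: str, constraints: list, success: list) -> str:
    lines = ["## Goal", goal, "", "## Constraints"]
    lines += ["- " + c for c in constraints]
    lines += ["", "## Success criteria"]
    lines += ["- " + s for s in success]
    return "\n".join(lines)
```

A body like this could then be filed (e.g., via the GitHub web UI or CLI) as step 1 of the loop, with the planning agent updating it in step 2 and the implementing agent reading it in step 3.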