ZeroNoise
Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars
Feb 27
6 min read
155 docs
Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

🔥 TOP SIGNAL

Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is orchestrating a modest set of agents with control and traceability, not running huge swarms. In practice, that shows up as spec-first work: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement.

🛠️ TOOLS & MODELS

  • Codex CLI v0.105 (major QoL upgrade)

    • New: syntax highlighting, dictate prompts by holding spacebar, better multi-agent workflows, improved approval controls, plus other QoL changes.
    • Install/upgrade: `npm i -g @openai/codex@latest`
    • Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now”.
  • Codex app (Windows) — first waitlist batch invited

    • Team says they’ll “expand from there” as they iterate through feedback.
  • Model preference + benchmarking signals (Codex 5.3)

    • Mitchell Hashimoto: Codex 5.3 felt “much more effective” than Opus 4.6; after switching back-and-forth, he hasn’t touched Opus for a week.
    • Romain Huet: GPT-5.3-Codex hit 90% on IBench at xhigh reasoning; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore”.
    • Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months”.
  • Cursor — Bugbot Autofix (PR issues → auto-fixes)

  • Devin AI (real production debugging)

    • swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix.
  • FactoryAI Droids — “Missions” + terminal “Mission Control”

    • “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work.
    • Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress.
    • Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime.
  • OpenClaw — new beta bits

    • Adds: external secrets management (openclaw secrets), CP thread-bound agents, WebSocket support for Codex, and Codex/Claude Code as first-class subagents via ACP.
  • Omarchy 3.4 — agent features shipped

    • Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup.
  • Harbor framework — shared agent eval infra momentum

    • Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice).
    • swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents.

💡 WORKFLOWS & TRICKS

  • Spec-driven agent work (make the spec the artifact)

    • Brendan Long’s repeatable loop for large vibe-coded apps:
      1. Write a GitHub issue
      2. If it’s complex, have an agent produce a plan and update the issue
      3. Have another agent read the issue and implement it
    • He claims a detailed enough issue is “90% of the work” and rewriting it is often what fixes problems.
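A hypothetical skeleton of the kind of issue this loop relies on (the structure and names below are illustrative, not Long’s actual template):

```markdown
## Goal
Add CSV export to the reports page.

## Constraints
- Reuse the existing `ReportQuery` service; no new dependencies.
- Must stream rows (reports can exceed 100k rows).

## Success criteria
- `GET /reports/:id/export.csv` returns valid CSV.
- Existing report tests still pass; a new test covers the streaming path.

## Plan
_(a planning agent appends concrete steps here before implementation starts)_
```

The point is that the issue, not the chat transcript, becomes the durable artifact: the implementing agent reads only this, so anything missing here is missing from the work.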
  • Enterprise-grade orchestration guidance (modest fleets, strong controls)

    • Addy Osmani’s concrete advice: spend 30–40% of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns.
    • For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable.
    • He also flags the real bottleneck: “Not generation, but coordination”.
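One lightweight way to follow the resources-directory advice (a generic layout sketch, not Osmani’s prescribed structure):

```text
resources/
  spec.md           # constraints, success criteria, non-goals
  architecture.md   # stack choices and why, so agents don't fall back
                    # to "lowest common denominator" defaults
  conventions.md    # team best practices the agent must follow
  examples/         # known-good snippets to recombine
```

Pointing every agent at the same directory is the cheap version of the MCP-callable systems he mentions: shared context, versioned with the repo.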
  • Close the loop: isolate the runtime so agents can run it

    • Kent C. Dodds: “get your app running in an isolated environment to close the agent loop”.
    • He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier.
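One common way to get an isolated, offline-runnable environment (a generic sketch, not Epic Stack’s actual setup) is a compose file the agent can bring up and test against:

```yaml
# docker-compose.yml: everything local, no external services to stub.
services:
  app:
    build: .
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app
    ports: ["3000:3000"]
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
```

With something like this, an agent can run `docker compose up`, hit `localhost:3000`, and verify its own changes without touching shared infrastructure—which is what “closing the loop” means in practice.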
  • Hoard working examples, then recombine (prompt with concrete known-good snippets)

    • Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly.
    • His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses.
    • Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies curl explicitly to fetch raw HTML instead of a summarizing fetch tool.
  • Tests aren’t a moat anymore (agents can recreate them fast)

    • tldraw moved its tests to a closed-source repo to prevent “Slop Fork”-style forks.
    • Armin Ronacher’s counterpoint: agents can generate language/implementation-agnostic test suites quickly if there’s a reference implementation.
  • Security footnote from a vibe-coded app

    • In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., /next, /prev), which he notes opens up CSRF vulnerabilities, a risk he chose not to worry about for that application.

👤 PEOPLE TO WATCH

  • Addy Osmani (Google Cloud AI) — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn.
  • Simon Willison — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts).
  • Brendan Long — practical decomposition: write issues like a system design interview, then let agents execute.
  • Nicholas Moy (DeepMind) — framing: “10x engineer” becomes “10 agent orchestrator,” measured by concurrent agents you can run effectively.
  • Dylan Patel (Semianalysis) — adoption signal: Claude Code’s share of GitHub commits going 2%→4% in a month, with a broader estimate of roughly 10% of all code now AI-written.

🎬 WATCH & LISTEN

1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)

Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms.

2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)

Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused.

Editorial take: The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.

Summary
Coverage: Feb 26 at 7:00 AM – Feb 27 at 7:00 AM
Frequency: Daily
Published: Feb 27 at 8:09 AM
Reading time: 6 min
Research time: 4 hrs 11 min
Documents scanned: 155
Documents used: 29
Citations: 55
Sources monitored: 110 / 110
Source details
Source Docs Insights Status
Lukas Möller 0 0
Jediah Katz 0 0
Aman Karmani 0 0
Jacob Jackson 0 0
Cursor Blog | RSS Feed 0 0
Nicholas Moy 3 2
Mike Krieger 0 0
Sualeh Asif 0 0
Michael Truell 0 0
Google Antigravity 3 0
Aman Sanger 0 0
cat 0 0
Mark Chen 0 0
Greg Brockman 4 0
Tongzhou Wang 0 0
fouad 0 0
Calvin French-Owen 0 0
Hanson Wang 0 0
Ed Bayes 4 0
Alexander Embiricos 4 1
Tibo 3 2
Romain Huet 4 1
DHH 12 1
Jane Street Blog 0 0
Miguel Grinberg's Blog: AI 0 0
xxchan's Blog 0 0
<antirez> 0 0
Brendan Long 1 1
The Pragmatic Engineer 0 0
David Heinemeier Hansson 0 0
Armin Ronacher ⇌ 6 1
Mitchell Hashimoto 0 0
Armin Ronacher's Thoughts and Writings 0 0
Peter Steinberger 0 0
Theo - t3.gg 3 0
Sourcegraph 0 0
Anthropic 1 0
Cursor 0 0
LangChain 0 0
Anthropic 0 0
LangChain Blog 0 0
LangChain 3 0
Cursor 2 1
Riley Brown 0 0
Riley Brown 3 0
Jason Zhou 0 0
Boris Cherny 4 0
Mckay Wrigley 4 0
geoff 17 1
Peter Steinberger 🦞 6 1
AI Jason 0 0
Alex Albert 0 0
Latent.Space 1 0
Logan Kilpatrick 2 0
Fireship 0 0
Fireship 1 0
Kent C. Dodds ⚡ 7 2
Practical AI 0 0
Practical AI Clips 0 0
Stories by Steve Yegge on Medium 0 0
Kent C. Dodds Blog 0 0
ThePrimeTime 0 0
Theo - t3.gg 1 0
ThePrimeagen 11 0
Ben Tossell 5 1
swyx 23 4
AI For Developers 0 0
Geoffrey Huntley 0 0
Addy Osmani 3 0
Andrej Karpathy 0 0
Simon Willison 6 1
Matthew Berman 0 0
Changelog 0 0
Simon Willison’s Newsletter 1 1
Agentic Coding Newsletter 0 0
Latent Space 1 1
Simon Willison's Weblog 2 2
Elevate 0 0
Lukas Möller 0 0
Jediah Katz 0 0
Sualeh Asif 0 0
Mike Krieger 0 0
Michael Truell 0 0
Cat Wu 0 0
Kevin Hou 0 0
Aman Sanger 0 0
Nicholas Moy 0 0
Andrey Mishchenko 0 0
Jerry Tworek 0 0
Romain Huet 0 0
Thibault Sottiaux 0 0
Alexander Embiricos 0 0
xxchan 0 0
Salvatore Sanfilippo 1 0
Armin Ronacher 0 0
David Heinemeier Hansson (DHH) 0 0
Alex Albert 0 0
Logan Kilpatrick 0 0
Shawn "swyx" Wang 2 1
Jason Zhou 0 0
Riley Brown 0 0
McKay Wrigley 0 0
Boris Cherny 0 0
Ben Tossell 0 0
Geoffrey Huntley 0 0
Peter Steinberger 0 0
Addy Osmani 1 1
Simon Willison 0 0
Andrej Karpathy 0 0
Harrison Chase 0 0