# Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars

*By Coding Agents Alpha Tracker • February 27, 2026*

Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

## 🔥 TOP SIGNAL
Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is **orchestrating a modest set of agents with control/traceability**, not running huge swarms [^1]. In practice, that shows up as *spec-first work*: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement [^2].

## 🛠️ TOOLS & MODELS
- **Codex CLI v0.105 (major QoL upgrade)** [^3]
  - New: **syntax highlighting**, dictate prompts by holding **spacebar**, better **multi-agent workflows**, improved **approval controls**, plus other QoL changes [^3].
  - Install/upgrade: `$ npm i -g @openai/codex@latest` [^3].
  - Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now” [^4].

- **Codex app (Windows) — first waitlist batch invited** [^5]
  - Team says they’ll “expand from there” as they iterate through feedback [^5].

- **Model preference + benchmarking signals (Codex 5.3)**
  - Mitchell Hashimoto: **Codex 5.3** felt “much more effective” than **Opus 4.6**; after switching back and forth, he hasn’t touched Opus for a week [^6].
  - Romain Huet: **GPT-5.3-Codex** hit **90% on IBench** at **xhigh reasoning**; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore” [^7].
  - Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months” [^8].

- **Cursor — Bugbot Autofix (PR issues → auto-fixes)** [^9]
  - Announcement: Bugbot can now automatically fix issues it finds in PRs [^9].
  - Details: http://cursor.com/blog/bugbot-autofix [^10].

- **Devin AI (real production debugging)**
  - swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix [^11].

- **FactoryAI Droids — “Missions” + terminal “Mission Control”**
  - “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work [^12][^13].
  - Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress [^14].
  - Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime [^13].

- **OpenClaw — new beta bits** [^15]
  - Adds: external secrets management (`openclaw secrets`) [^15][^16], ACP thread-bound agents [^15], WebSocket support for Codex [^15], and Codex/Claude Code as first-class subagents via ACP [^17].

- **Omarchy 3.4 — agent features shipped**
  - Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup [^18][^19].

- **Harbor framework — shared agent eval infra momentum**
  - Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice) [^20].
  - swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents [^21].

## 💡 WORKFLOWS & TRICKS
- **Spec-driven agent work (make the spec the artifact)**
  - Brendan Long’s repeatable loop for large vibe-coded apps:
    1) Write a GitHub issue [^2]
    2) If it’s complex, have an agent produce a plan and update the issue [^2]
    3) Have another agent read the issue and implement it [^2]
  - He claims a detailed enough issue is “**90% of the work**” and rewriting it is often what fixes problems [^2].
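
The three steps above can be sketched as a small control loop. This is only an illustration, not Brendan Long's tooling: `Issue`, `run_issue_loop`, and the stub agents are hypothetical names, and an "agent" is modeled as a plain function from text to text.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: an "agent" is any callable that takes text and returns text.
Agent = Callable[[str], str]


@dataclass
class Issue:
    title: str
    body: str
    complex: bool = False
    plan: str = ""


def run_issue_loop(issue: Issue, planner: Agent, implementer: Agent) -> str:
    """Issue -> optional plan -> implementation, mirroring steps 1 to 3."""
    if issue.complex:
        # Step 2: the planner reads the issue and the plan is written back into it.
        issue.plan = planner(issue.body)
        issue.body = f"{issue.body}\n\n## Plan\n{issue.plan}"
    # Step 3: a separate agent reads the (possibly updated) issue and implements it.
    return implementer(issue.body)


# Stub agents so the sketch runs without any model behind it.
def stub_planner(text: str) -> str:
    return "1. Add endpoint\n2. Add tests"


def stub_implementer(text: str) -> str:
    return f"implemented from {len(text.splitlines())} lines of spec"
```

The issue stays the single source of truth: if the result is wrong, the fix in this model is to edit `issue.body` and rerun, not to patch the output.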

- **Enterprise-grade orchestration guidance (modest fleets, strong controls)**
  - Addy Osmani’s concrete advice: spend **30–40%** of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns [^1].
  - For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable [^1].
  - He also flags the real bottleneck: “Not generation, but coordination” [^1].
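
One way to make that spec discipline checkable is to treat the spec as structured data and refuse to hand it to an agent while sections are empty. The sketch below assumes a hypothetical `Spec` class; the field names follow the items listed above (constraints, success criteria, stack/architecture, resources directory), but the class itself is not from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    goal: str
    constraints: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)
    architecture: str = ""
    resources_dir: str = ""  # hypothetical: folder of docs/examples given to the agent

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty."""
        checks = {
            "constraints": self.constraints,
            "success_criteria": self.success_criteria,
            "architecture": self.architecture,
            "resources_dir": self.resources_dir,
        }
        return [name for name, value in checks.items() if not value]

    def to_markdown(self) -> str:
        """Render the spec as a markdown document an agent can read."""
        def bullets(items: list[str]) -> list[str]:
            return [f"- {item}" for item in items] or ["- TODO"]

        lines = [
            f"# {self.goal}", "",
            "## Constraints", *bullets(self.constraints), "",
            "## Success criteria", *bullets(self.success_criteria), "",
            "## Stack and architecture", self.architecture or "TODO", "",
            "## Resources", self.resources_dir or "TODO",
        ]
        return "\n".join(lines)
```

A team could gate agent runs on `missing_sections()` returning an empty list, which turns "write the spec first" from advice into a precondition.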

- **Close the loop: isolate the runtime so agents can run it**
  - Kent C. Dodds: “**get your app running in an isolated environment** to close the agent loop” [^22].
  - He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier [^22].

- **Hoard working examples, then recombine (prompt with concrete known-good snippets)**
  - Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly [^23].
  - His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses [^23].
  - Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies `curl` explicitly to fetch **raw HTML** instead of a summarizing fetch tool [^23].
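
The same idea, getting the unmodified page source rather than a summary, can be sketched in Python with the standard library. This is an analogue of the `curl` tip, not what Willison runs; `fetch_raw` and the `User-Agent` string are illustrative names.

```python
from urllib.request import Request, urlopen


def fetch_raw(url: str, timeout: float = 10.0) -> str:
    """Return the response body as text, with no summarizing or HTML-to-text step."""
    request = Request(url, headers={"User-Agent": "raw-fetch-sketch"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string still contains every tag, inline script, and style block, which is the point when the goal is for an agent to reuse working code from an existing page.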

- **Tests aren’t a moat anymore (agents can recreate them fast)**
  - tldraw moved tests to a closed-source repo to prevent “Slop Forks” [^24].
  - Armin Ronacher’s counterpoint: agents can generate language/implementation-agnostic test suites quickly if there’s a reference implementation [^25].
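
Ronacher's point can be illustrated with a minimal sketch: if a reference implementation exists, recording its input/output pairs yields a test suite that any reimplementation in any language can be checked against. The helper names and the toy `reference_slugify` function are hypothetical stand-ins.

```python
import json
from typing import Callable, Iterable


def build_golden_cases(reference: Callable[[str], str], inputs: Iterable[str]) -> str:
    """Run the reference implementation and record input/output pairs as JSON."""
    cases = [{"input": value, "expected": reference(value)} for value in inputs]
    return json.dumps(cases, indent=2)


def check_against_golden(candidate: Callable[[str], str], golden_json: str) -> list[str]:
    """Return the inputs on which the candidate disagrees with the recorded outputs."""
    return [
        case["input"]
        for case in json.loads(golden_json)
        if candidate(case["input"]) != case["expected"]
    ]


# Hypothetical reference implementation standing in for the real project.
def reference_slugify(text: str) -> str:
    return "-".join(text.lower().split())
```

Because the recorded cases are plain JSON, they carry no dependency on the original test code, which is why hiding the tests offers limited protection once the behavior itself is observable.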

- **Security footnote from a vibe-coded app**
  - In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., `/next`, `/prev`), which he notes opens up CSRF vulnerabilities; he says that risk didn’t matter to him for that application [^26].
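
For an app where it does matter, a first step is to stop accepting GET for anything that changes state. The sketch below is a Python analogue (Present.app itself is SwiftUI), with hypothetical names `route` and `RemoteHandler`; only the `/next` and `/prev` paths come from the walkthrough.

```python
from http.server import BaseHTTPRequestHandler

# Paths that change state, mapped to the slide delta they apply.
STEPS = {"/next": 1, "/prev": -1}


def route(method: str, path: str, slide: int) -> tuple[int, int]:
    """Return (HTTP status, new slide index) for a request."""
    if path in STEPS:
        if method != "POST":
            # A GET can be fired by any web page (e.g. via an <img> tag), so reject it.
            return 405, slide
        return 200, slide + STEPS[path]
    if path == "/" and method == "GET":
        return 200, slide  # read-only status endpoint
    return 404, slide


class RemoteHandler(BaseHTTPRequestHandler):
    slide = 0  # class-level state, shared across requests in this sketch

    def _handle(self, method: str) -> None:
        status, RemoteHandler.slide = route(method, self.path, RemoteHandler.slide)
        body = f"slide {RemoteHandler.slide}\n".encode("utf-8")
        self.send_response(status)
        if status == 405:
            self.send_header("Allow", "POST")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        self._handle("GET")

    def do_POST(self) -> None:
        self._handle("POST")
```

Requiring POST removes the easiest attack (a link or image tag on another page), but it is not complete CSRF protection on its own: a cross-site form can still submit a POST, so a real deployment would also need a token or an `Origin` check.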

## 👤 PEOPLE TO WATCH
- **Addy Osmani (Google Cloud AI)** — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn [^1].
- **Simon Willison** — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts) [^26][^23].
- **Brendan Long** — practical decomposition: write issues like a system design interview, then let agents execute [^2].
- **Nicholas Moy (DeepMind)** — framing: “10x engineer” becomes “10 agent orchestrator,” measured by concurrent agents you can run effectively [^27].
- **Dylan Patel (Semianalysis)** — adoption signal: Claude Code’s share of GitHub commits went from 2% to 4% in a month, with a broader estimate that AI-written code totals around 10% [^28].

## 🎬 WATCH & LISTEN
### 1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)
Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms [^1].


[![Live with Tim O’Reilly: A Conversation with Google Cloud AI Director Addy Osmani](https://img.youtube.com/vi/CI-8mQKuKbQ/hqdefault.jpg)](https://youtube.com/watch?v=CI-8mQKuKbQ&t=1411)
*Live with Tim O’Reilly: A Conversation with Google Cloud AI Director Addy Osmani (23:31)*


### 2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)
Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused [^29].


[![Distillation & How Models Cheat | SAIL LIVE #6](https://img.youtube.com/vi/5VsoNE3iyhs/hqdefault.jpg)](https://youtube.com/watch?v=5VsoNE3iyhs&t=1789)
*Distillation & How Models Cheat | SAIL LIVE #6 (29:49)*


## 📊 PROJECTS & REPOS
- **Agentic Engineering Patterns (Simon Willison)** — a living guide of coding-agent practices and patterns (agentic engineering vs vibe coding framing) [^26]
  - https://simonwillison.net/guides/agentic-engineering-patterns/ [^26]

- **Present.app (Simon Willison)** — vibe-coded SwiftUI macOS presentation app where each “slide” is a URL; GitHub repo shared [^26]
  - https://github.com/simonw/present [^26]

- **OpenClaw releases + docs (beta features shipping)** [^15]
  - https://github.com/openclaw/openclaw/releases [^15]
  - Secrets docs: https://docs.openclaw.ai/cli/secrets [^16]
  - ACP agents docs: https://docs.openclaw.ai/tools/acp-agents [^17]

- **Cursor Bugbot Autofix announcement + writeup** [^10]
  - http://cursor.com/blog/bugbot-autofix [^10]

- **Omarchy 3.4 release** (61 contributors; agent features + tmux work) [^18]
  - https://github.com/basecamp/omarchy/releases/tag/v3.4.0 [^18]

- **tldraw tests move discussion** (tests closed-source) [^24]
  - https://github.com/tldraw/tldraw/issues/8082 [^24]

---
**Editorial take:** The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.

---

### Sources

[^1]: [Live with Tim O’Reilly: A Conversation with Google Cloud AI Director Addy Osmani](https://www.youtube.com/watch?v=CI-8mQKuKbQ)
[^2]: [Vibe Coding is a System Design Interview](https://www.brendanlong.com/vibe-coding-is-a-system-design-interview.html)
[^3]: [𝕏 post by @thsottiaux](https://x.com/thsottiaux/status/2027094489265807429)
[^4]: [𝕏 post by @iannuttall](https://x.com/iannuttall/status/2027063989750972826)
[^5]: [𝕏 post by @thsottiaux](https://x.com/thsottiaux/status/2027137924748259673)
[^6]: [𝕏 post by @mitchellh](https://x.com/mitchellh/status/2026669906893369760)
[^7]: [𝕏 post by @romainhuet](https://x.com/romainhuet/status/2027054225507705275)
[^8]: [𝕏 post by @adonis_singh](https://x.com/adonis_singh/status/2026692938751725655)
[^9]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2027079876948484200)
[^10]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2027079878584279379)
[^11]: [𝕏 post by @swyx](https://x.com/swyx/status/2027156931157368971)
[^12]: [𝕏 post by @FactoryAI](https://x.com/FactoryAI/status/2027104794289263104)
[^13]: [𝕏 post by @matanSF](https://x.com/matanSF/status/2027105643627454484)
[^14]: [𝕏 post by @FactoryAI](https://x.com/FactoryAI/status/2027104816405811694)
[^15]: [𝕏 post by @steipete](https://x.com/steipete/status/2027152375648035101)
[^16]: [𝕏 post by @steipete](https://x.com/steipete/status/2027161567519777227)
[^17]: [𝕏 post by @steipete](https://x.com/steipete/status/2027161793353683171)
[^18]: [𝕏 post by @dhh](https://x.com/dhh/status/2027095082919145488)
[^19]: [𝕏 post by @dhh](https://x.com/dhh/status/2027134422001070214)
[^20]: [𝕏 post by @LaudeInstitute](https://x.com/LaudeInstitute/status/2027101198529266171)
[^21]: [𝕏 post by @swyx](https://x.com/swyx/status/2027213347570188635)
[^22]: [𝕏 post by @kentcdodds](https://x.com/kentcdodds/status/2027180226615357620)
[^23]: [Hoard things you know how to do](https://simonwillison.net/guides/agentic-engineering-patterns/hoard-things-you-know-how-to-do)
[^24]: [𝕏 post by @cramforce](https://x.com/cramforce/status/2026782878609322317)
[^25]: [𝕏 post by @mitsuhiko](https://x.com/mitsuhiko/status/2027044543653093873)
[^26]: [Agentic Engineering Patterns](https://simonw.substack.com/p/agentic-engineering-patterns)
[^27]: [𝕏 post by @thenickmoy](https://x.com/thenickmoy/status/2027214852658495947)
[^28]: [Dylan Patel Explains the AI War While Cooking | In-Context Cooking](https://www.youtube.com/watch?v=UwnqWAYOjPU)
[^29]: [Distillation & How Models Cheat | SAIL LIVE #6](https://www.youtube.com/watch?v=5VsoNE3iyhs)