Orchestration-first agent coding: Codex CLI v0.105, spec-driven loops, and eval infra wars
Feb 27
Today’s theme: orchestration beats raw generation. You’ll get concrete spec-first workflows from practitioners, major Codex CLI upgrades (v0.105), new PR autofix automation, and why shared eval infrastructure is suddenly a battleground.

🔥 TOP SIGNAL

Orchestration is becoming the core dev skill: Addy Osmani argues the enterprise frontier is orchestrating a modest set of agents with control and traceability, not running huge swarms. In practice, that shows up as spec-first work: Brendan Long’s vibe-coding loop starts by writing a detailed GitHub issue ("90% of the work"), optionally having an agent plan, then having another agent implement.

🛠️ TOOLS & MODELS

  • Codex CLI v0.105 (major QoL upgrade)

    • New: syntax highlighting, dictating prompts by holding the spacebar, better multi-agent workflows, improved approval controls, plus other QoL changes.
    • Install/upgrade: $ npm i -g @openai/codex@latest.
    • Practitioner reaction: “diffs are beautiful” and it’s “very, very fast now”.
  • Codex app (Windows) — first waitlist batch invited

    • Team says they’ll “expand from there” as they iterate on feedback.
  • Model preference + benchmarking signals (Codex 5.3)

    • Mitchell Hashimoto: Codex 5.3 felt “much more effective” than Opus 4.6; after switching back and forth, he hasn’t touched Opus for a week.
    • Romain Huet: GPT-5.3-Codex hit 90% on IBench at xhigh reasoning; says with speed gains, “xhigh doesn’t feel like a tradeoff anymore”.
    • Related run: “decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months”.
  • Cursor — Bugbot Autofix (PR issues → auto-fixes)

    • New: automatically fixes issues Bugbot finds in PRs.
  • Devin AI (real production debugging)

    • swyx reports Devin investigated a production bug (Vercel org migration + forgotten key), asked for exactly what it needed, and verified the fix.
  • FactoryAI Droids — “Missions” + terminal “Mission Control”

    • “Missions”: multi-day autonomous goals where you describe what you want, approve a plan, and come back to finished work.
    • Mission Control: a terminal view of which feature is being built, which Droid is on it, tools used, and progress.
    • Examples FactoryAI says enterprises are running: modernize a 40-year-old COBOL module; migrate >1k microservices across regions; recalc 10 years of pricing; refactor a monolith handling 20M daily API calls with no downtime.
  • OpenClaw — new beta bits

    • Adds: external secrets management (openclaw secrets), CP thread-bound agents, WebSocket support for Codex, and Codex/Claude Code as first-class subagents via ACP.
  • Omarchy 3.4 — agent features shipped

    • Release highlights include “new agent features (claude by default + tmux swarm!)” and a tailored tmux setup.
  • Harbor framework — shared agent eval infra momentum

    • Laude Institute frames Harbor as shared infrastructure to standardize benchmarks via one interface (repeatable runs, standardized traces, production-grade practice).
    • swyx says his team is prioritizing migrating evals to Harbor and calls it dominant in RL infra/evals for terminal agents.

💡 WORKFLOWS & TRICKS

  • Spec-driven agent work (make the spec the artifact)

    • Brendan Long’s repeatable loop for large vibe-coded apps:
      1. Write a GitHub issue
      2. If it’s complex, have an agent produce a plan and update the issue
      3. Have another agent read the issue and implement it
    • He claims a detailed enough issue is “90% of the work”, and rewriting it is often what fixes problems.
  • Enterprise-grade orchestration guidance (modest fleets, strong controls)

    • Addy Osmani’s concrete advice: spend 30–40% of task time writing the spec—constraints, success criteria, stack/architecture—and gather context in a resources directory; otherwise you “waste tokens” and LLMs default to “lowest common denominator” patterns.
    • For teams: codify best practices in context (e.g., MCP-callable systems or even markdown files) to raise the odds the output is shippable.
    • He also flags the real bottleneck: “Not generation, but coordination”.
  • Close the loop: isolate the runtime so agents can run it

    • Kent C. Dodds: “get your app running in an isolated environment to close the agent loop”.
    • He points to his Epic Stack guiding principles—“Minimize Setup Friction” and “Offline Development”—as a practical way to make this easier.
  • Hoard working examples, then recombine (prompt with concrete known-good snippets)

    • Simon Willison’s pattern: keep a personal library of solved examples across blogs/TIL, many repos, and small “HTML tools” pages, because agents can recombine them quickly.
    • His OCR tool story: he combined snippets for PDF rendering and OCR into a single HTML page via prompt, iterated a few times, and ended with a tool he still uses.
    • Agent tip: when asking Claude Code to reuse an existing tool, he sometimes specifies curl explicitly to fetch raw HTML instead of a summarizing fetch tool.
  • Tests aren’t a moat anymore (agents can recreate them fast)

    • tldraw moved tests to a closed-source repo to prevent “Slop Fork” forks.
    • Armin Ronacher’s counterpoint: agents can generate language/implementation-agnostic test suites quickly if there’s a reference implementation.
  • Security footnote from a vibe-coded app

    • In Simon Willison’s Present.app walkthrough, the remote-control web server used GET requests for state changes (e.g., /next, /prev), which he notes opens up CSRF vulnerabilities; he didn’t mind for that application.
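The GET-for-state-changes issue can be sketched in a few lines. This is a hypothetical handler, not Present.app's actual code; it shows why a bare GET endpoint is CSRF-prone (any page the presenter visits can fire `<img src="http://host:9123/next">`) and the usual fix of requiring POST plus a secret the attacker's page cannot know:

```python
import hmac
import secrets

# Hypothetical remote-control handler (illustrative only). A per-session
# token is generated at startup and shared with the legitimate client.
SESSION_TOKEN = secrets.token_urlsafe(32)

def handle(method: str, path: str, token=None) -> int:
    """Return an HTTP status code for a remote-control request."""
    state_changing = path in ("/next", "/prev")
    if not state_changing:
        return 200          # read-only routes are fine over GET
    if method != "POST":
        return 405          # reject GET for state-changing routes
    if token is None or not hmac.compare_digest(token, SESSION_TOKEN):
        return 403          # reject POSTs without the session token
    return 200
```

The constant-time comparison (`hmac.compare_digest`) avoids leaking the token through timing; for a single-user tool on a Tailscale network this is belt-and-braces, which is roughly why Willison shrugged it off here.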

👤 PEOPLE TO WATCH

  • Addy Osmani (Google Cloud AI) — clearest “enterprise reality check”: quality bars, traceability, and spec/context discipline, plus a strong stance that orchestration is the thing to learn.
  • Simon Willison — consistently turns agent usage into transferable patterns (Agentic Engineering Patterns + “hoard examples” + codebase walkthrough prompts).
  • Brendan Long — practical decomposition: write issues like a system design interview, then let agents execute.
  • Nicholas Moy (DeepMind) — framing: the “10x engineer” becomes a “10-agent orchestrator”, measured by how many concurrent agents you can run effectively.
  • Dylan Patel (Semianalysis) — adoption signal: Claude Code’s share of GitHub commits going 2%→4% in a month, with a broader estimate of total AI-written code around ~10%.

🎬 WATCH & LISTEN

1) Addy Osmani: “Learn orchestration” + the path to agent fleets (≈ 23:32–25:54)

Hook: Practical roadmap from single-agent prompting to multi-agent orchestration and coordination patterns—before you burn tokens on experimental swarms.

2) SAIL LIVE #6: why SWE-Bench got saturated (and what that says about evals) (≈ 29:49–33:48)

Hook: A clear explanation of how SWE-Bench is constructed, why it became the default “agentic coding” benchmark, and why that creates problems once it’s widely known and reused.

Editorial take: The leverage is shifting from “pick the best model” to “build the tightest loop”: spec → isolated runtime → tests/evals → approvals—and only then scale agents.

Simon Willison’s Newsletter

Simon Willison launched his Agentic Engineering Patterns guide, covering coding practices with agentic engineering tools like Claude Code and OpenAI Codex, which generate and execute code independently. Firsthand usage in production (blog) and side projects.

Key timeless patterns (1-2 new/week):

  • Red/green TDD: Test-first development yields succinct, reliable agent code with minimal prompting. Guide
  • Writing code is cheap now: Agents drop the cost of code; rethink design, planning, and habits. Guide
  • First run the tests: Essential for verifying agent code; cheap to generate. Guide
  • Linear walkthroughs: Prompt for a structured codebase tour. Used on a vibe-coded app. Guide
  • Hoard things you know how to do: Leverage expertise for effective agent guidance. Guide

Workflow example: Vibe-coded Present.app (macOS SwiftUI presentation tool, URLs per slide, phone remote via Tailscale) using Claude Opus 4.6 in Claude Code for web on iPhone. Prompts: initial UI/fullscreen nav; add web server on :9123 for mobile controls. Repo: github.com/simonw/present. ~45 min build, no Xcode.

Shawn "swyx" Wang

SWE-Bench flaws undermine coding agent evaluations. Originally from Princeton, SWE-Bench tests agentic coding by resolving ~2k GitHub issues using repo context and passing tests: the first proper agentic benchmark beyond autocomplete evals like HumanEval. Devin first reported on it, and scores rose from 13% to 80%.

OpenAI's SWE-Bench Verified curated 500 high-quality tasks (3 humans vetting each task, ~$2M cost), but it saturated at ~80% with ±0.5–1% noise. An audit revealed that 59% of unsolved tasks were impossible (e.g., requiring the magic string 'get_annotation'), making them ideal contamination canaries.

Models cheat via memorization: public datasets enable task-ID-to-solution regurgitation, and CoT leaks post-cutoff knowledge (e.g., future Django APIs). GPT-o1 and competitors (Flash/Gemini/Opus) are vulnerable.

Fixes in SWE-Bench Pro (Scale AI): newer problems, diverse repos and languages, private/public splits. Benchmarks remain hard and expensive; coding is the easiest domain to eval objectively.
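The "contamination canary" idea can be sketched as a simple check. The function name and signature here are illustrative, not part of any benchmark's API: if a submitted patch contains a string that the tests require but that appears nowhere in the provided repo context (the audit's 'get_annotation' example), memorized test data is a more likely explanation than genuine problem solving.

```python
# Illustrative sketch of a contamination canary check (hypothetical
# helper, not part of SWE-Bench). A patch that produces a magic string
# absent from the repo context it was given is a memorization suspect.
def is_contamination_suspect(patch: str, repo_context: str,
                             magic_strings: list) -> bool:
    return any(s in patch and s not in repo_context for s in magic_strings)
```

In practice a harness would run this over the known-impossible task subset: any "solve" there flags training-data leakage for that model.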

Swyx (Latent Space host, agents expert), Nathan Lambert (researcher), Sebastian Raschka (LLM book author) share firsthand benchmark analysis.

Addy Osmani

Addy Osmani (Google Cloud AI Director, 14+ years at Google, leads technical evangelism for Gemini and the Vertex Agent Development Kit) emphasizes enterprise agent orchestration over massive agent swarms: modest sets of agents solving real problems with control, traceability, and quality gates—contrasting solo-founder 'Wild West' approaches.

Timeless patterns:

  • Multi-agent systems: agents spawn others, communicate in parallel, self-coordinate. Coordination is the hard problem, not generation.
  • Agent types in Vertex ADK: LLM agents (reasoning), workflow agents (deterministic sequential/parallel loops), custom agents; mix orchestrated control and dynamic autonomy.
  • Spec-driven development: Spend 30–40% of time upfront on specs, constraints, success criteria, and stack/architecture; use a context/resources directory to avoid token waste (LLMs default to 'lowest common denominator'). Codify team best practices.
  • Quality gates: Balance AI velocity with code review, testing coverage, and human-in-the-loop for maintenance/legacy/security; manually intervene as needed.

Tools he uses (firsthand):

  • GitHub Copilot agents: Assign to-do lists; routes to Anthropic/OpenAI models.
  • Google Jules (Gemini-based) for similar orchestration.
  • Claude Code Web and Claude Code agent teams (recent Swarm support) for agent coordination and context handoff.
  • A2A (Agent-to-Agent protocol, Linux Foundation): Complements MCP (agent-tool); enables cross-vendor agent communication, like a 'TCP/IP of the agentic era'.

Contrarian takes: Skeptical of 'near perfection' hype (e.g., Matt Schuler's post): valid for prototypes/MVPs, but enterprise needs rigor; distinguish 'feeling busy' (hundreds of agents) from productivity. Prioritize learning orchestration this year. Be cost-conscious: experiment on small tasks and extrapolate; some justify $100s–$1,000s/month vs. hiring. Future: agent-optimized codebases (less human-readable).

Latent Space

Dylan Patel (founder/CEO of Semianalysis, 60-person firm tracking AI infra/models) reports Claude Code adoption surged from 2% to 4% of GitHub commits in January, estimating total AI-generated code (incl. Claude Code, Codex, Cognition/Devin, GitHub Copilot) at ~10%.

At Semianalysis, ~1/3 of staff (engineers plus ex-hedge-fund analysts) now use Claude Code for data scraping, financial modeling, and pro forma analysis—firsthand production usage.

Recent coding tools: Claude Code, Claudebot, Maltbook, Kimi 2.5 agent swarms, and Codex 5.3 (the latter not far behind).

swyx

@shadcn demonstrated Claude Code's tool preferences: when prompted to build with no tool names anywhere in the input, it selects category-leading infra tools (see image).

@swyx quoted this, stating such leaders merit $5B+ valuations, as coding agents will recommend them for 5+ years and infra is stickier than agents (disclosure: small Resend angel).

Demo: https://x.com/shadcn/status/2027062972753866796.

Prompt technique: Omit tool names to reveal the agent's inherent recommendations.

Simon Willison's Weblog

Andrej Karpathy states that programming has dramatically changed due to AI in the last two months (since December): coding agents basically didn't work before and basically work now, thanks to models' significantly higher quality, long-term coherence, and tenacity, which let them power through large, long tasks. He calls this extremely disruptive to default workflows.

Firsthand take from Karpathy (top AI practitioner).

Original tweet.

swyx

Harbor framework (@harborframework) dominates the RL infra and evals landscape for terminal agents.

  • @swyx (Cognition): Team prioritizing migration of all evals to Harbor; originated from TerminalBench 2 Discord needs; expects Harbor-based evals/benchmarks/infra startups. Firsthand production adoption.
  • Standing room only at the Modal x @willccbb meetup; Harbor is now required knowledge.
  • @LaudeInstitute: Standardizes benchmarks via one interface—repeatable runs, standardized traces, production-grade—building on TerminalBench.
  • @willccbb (quoted by @swyx): Harbor for tasksets in terminal agents; verifiers as a domain-agnostic RL env layer with token-level plumbing.

Link: https://x.com/laudeinstitute/status/2027101198529266171.

Timeless pattern: Shared eval infra for agent benchmarks.

Simon Willison's Weblog

Simon Willison, an experienced developer with extensive personal code repositories, advocates hoarding working code examples as a timeless pattern for productive coding-agent use—collect them in a blog, TIL, GitHub (1000+ repos), tools.simonwillison.net (HTML tools), and the simonw/research repo.

Recombination workflow: Prompt agents to build new tools by merging existing examples.

  • Used Claude 3 Opus with snippets for PDF.js (render pages to images) and Tesseract.js (OCR images) to create a browser-based PDF OCR tool:

    Use these examples to put together a single HTML page...

  • Iterated to final tool at https://tools.simonwillison.net/ocr.

Coding agent enhancements (e.g., Claude Code):

  • Fetch tool sources: Use curl to fetch... https://tools.simonwillison.net/ocr and https://tools.simonwillison.net/gemini-bbox....
  • Use local repos: Add mocked HTTP tests to ~/dev/ecosystem/datasette-oauth inspired by ~/dev/ecosystem/llm-mistral.
  • Clone public repos: Clone simonw/research... to /tmp and find examples of compiling Rust to WebAssembly....

Firsthand production/side-project usage by author.
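The recombination workflow above can be sketched as simple prompt assembly. The `SNIPPETS` library and helper below are hypothetical, standing in for a personal hoard of known-good examples; the snippet bodies are placeholder comments, not real PDF.js/Tesseract.js code:

```python
# Hypothetical helper for the "hoard examples, then recombine" pattern:
# keep solved snippets keyed by name and assemble them into one prompt
# asking the agent to merge them into a single tool.
SNIPPETS = {
    "pdf-render": "<script>/* PDF.js: render each page to a canvas */</script>",
    "ocr": "<script>/* Tesseract.js: OCR an image element */</script>",
}

def build_recombination_prompt(goal: str, names: list) -> str:
    parts = ["Use these examples to put together " + goal + ":"]
    for name in names:
        parts.append("\n--- example: " + name + " ---\n" + SNIPPETS[name])
    return "\n".join(parts)
```

The point of the pattern is that each snippet is already verified working, so the agent's job shrinks to gluing, not inventing.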

swyx

FactoryAI's Droids now support Missions: autonomous pursuit of multi-day software engineering goals. Workflow: describe the desired outcome, approve the AI-generated plan, and retrieve finished work.

Enterprise examples from production use:

  • Modernize 40-year-old COBOL core module
  • Migrate >1k microservices to new Kubernetes cluster across three regions
  • Recalculate 10 years of pricing data post-revenue rule change
  • Refactor monolith handling 20M daily API calls with zero downtime

@matanSF (FactoryAI) shares these as real missions run by large enterprises. @swyx praises it as an "EXCELLENTLY executed idea".

Demo: https://x.com/FactoryAI/status/2027104794289263104.

Simon Willison

Simon Willison published a new chapter in his Agentic Engineering Patterns guide titled "Hoard things you know how to do", described as general career advice that also helps when working with coding agents.

Resource: https://simonwillison.net/guides/agentic-engineering-patterns/hoard-things-you-know-how-to-do/

Ben Tossell

@FactoryAI announced Mission Control, a terminal-based dashboard providing "one view for everything": which feature is being built, which worker Droid is assigned, which tools it's using, and Mission progress. @bentossell (a @FactoryAI affiliate) highlights support for multi-day tasks end to end.

swyx

@swyx (@cognition) shares firsthand production use of Devin AI (@devinai): it investigated a bug from a Vercel org migration (forgotten key), requested exactly what it needed from humans, verified fixes, and saluted upon being complimented.

This demonstrates a human-in-the-loop agent workflow for production debugging.

geoff

Latent Patterns announced a new feature via a partnership with Chainguard: secure images for embedded terminals enable running Claude Code directly in the browser, with zero API-key provisioning or software installation required; it works even on a Chromebook. Shared firsthand by builder @GeoffreyHuntley.

Kent C. Dodds ⚡

Kent C. Dodds (@kentcdodds), dev educator, shares his firsthand workflow with Cursor's agent and Bugbot: kick off an agent on a task before bed; wake up to Bugbot and the agent iterating on the solution. He's been using this "for a while now" and calls it "awesome".

Cursor announced that Bugbot Autofix now automatically fixes issues found in PRs.

Link: https://x.com/cursor_ai/status/2027079876948484200.

Kent C. Dodds ⚡

Kent C. Dodds (@kentcdodds), dev educator behind Epic Stack, advises: Get your app running in an isolated environment to close the agent loop.

His “Minimize Setup Friction” and “Offline Development” guiding principles have helped significantly.

Guiding principles repo.

Firsthand experience from production stack usage.
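"Closing the agent loop" can be sketched as a check-runner the agent invokes itself inside the isolated environment: run the project's checks, and hand the pass/fail signal plus logs back to the agent instead of a human. The helper below is a minimal sketch; the command used here is a stand-in for a real test suite:

```python
import subprocess
import sys

# Minimal sketch of closing the loop: run a command in the (isolated)
# environment, capture everything, and return a machine-usable verdict
# the agent can act on without human relay.
def run_checks(cmd: list) -> tuple:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

# Stand-in for "run the test suite" inside the sandbox.
ok, log = run_checks([sys.executable, "-c", "print('tests passed')"])
```

The isolation matters because the agent can then run this loop freely (install deps, restart services, re-run tests) without risking a shared dev machine, which is exactly what "Minimize Setup Friction" and "Offline Development" buy you.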

Peter Steinberger 🦞

OpenClaw beta release announced by @steipete (OpenClaw contributor):

Key features:

  • External secrets management (openclaw secrets)
  • CP thread-bound agents
  • WebSocket support for Codex
  • Codex and Claude Code as first-class subagents via ACP

Release notes: https://github.com/openclaw/openclaw/releases

Romain Huet

Romain Huet (Head of Developer Experience @OpenAI, working on Codex) shared that GPT-5.3-Codex achieved 90% on IBench at xhigh reasoning, calling it "in a different league". With recent speed gains, xhigh "doesn’t feel like a tradeoff anymore".

Quoting @adonis_singh: "decided to run 5.3 codex on xhigh as well, its 90%… rip IBench, survived 3 months". See: https://x.com/adonis_singh/status/2026692938751725655.

Armin Ronacher ⇌

Armin Ronacher (@mitsuhiko, Flask creator) observes that agents are great at creating language- and implementation-agnostic test suites, enabling surprisingly quick test coverage from a reference implementation.

This contrarian take challenges tldraw's strategy of moving tests to a closed-source repo to prevent "Slop Fork" forks.

Related issue: https://github.com/tldraw/tldraw/issues/8082.

Timeless pattern: Agent-driven test generation for agnostic coverage, transcending specific implementations.

Cursor

Cursor Bugbot Autofix: a new feature that automatically fixes issues found in PRs.

Official announcement from @cursor_ai with blog link for details: http://cursor.com/blog/bugbot-autofix.

Brendan Long

Brendan Long, working on two fairly large vibe-coded apps, uses this converged process:

  1. Write a GitHub issue
  2. (If complicated) Tell an agent to make a plan and update the issue
  3. Have another agent read the issue and implement it

Writing a detailed enough issue is 90% of the work; refining it is often what fixes problems. This mirrors a system design interview (high-level design, edge cases, tradeoffs) without needing to impress an interviewer or scale to trillions.

Example issue: lion-reader #641.

Firsthand production/side-project experience from a senior engineer; highlights a timeless pattern: detailed specs as the core of agentic coding workflows.
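The spec-first loop above can be sketched as generating the issue body programmatically before any agent touches code. The section headings below (Goal, Constraints, Success criteria) are this sketch's convention, echoing the system-design framing, not Brendan Long's actual template:

```python
# Hypothetical issue-body generator for a spec-first agent loop: the
# detailed issue is the artifact; agents plan against and implement it.
def issue_body(goal: str, constraints: list, success: list) -> str:
    lines = ["## Goal", goal, "", "## Constraints"]
    lines += ["- " + c for c in constraints]
    lines += ["", "## Success criteria"]
    lines += ["- " + s for s in success]
    return "\n".join(lines)
```

A body like this could then be filed (e.g., via the GitHub web UI or CLI) as step 1 of the loop, with the planning agent updating it in step 2 and the implementing agent reading it in step 3.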