🔥 TOP SIGNAL
OpenAI’s Codex product lead Alexander Embiricos says the meaningful workflow jump isn’t “better autocomplete,” it’s the shift from pairing to delegating: agree on a plan/spec, then let the agent run end-to-end (“let it cook”), with many engineers “basically not opening editors anymore.” He frames the next bottleneck as trust + quality control (code review and beyond), aiming for agents that can own a whole internal tool and close the full loop without human review.
🛠️ TOOLS & MODELS
OpenAI — Codex app (released last week)
- Built to be ergonomic for delegating to multiple agents at once (explicitly not a text editor): it’s centered on delegation, review, and “skills” (open standard) for non-coding work like task triage or deploy monitoring.
- Standards push: Agents.md as a vendor-neutral instruction file; OpenAI also pushed for a neutral Agents/ folder for skills/scripts (not “codex/”).
- Sandboxing: Embiricos describes “the most conservative sandboxing approach,” with sandboxing as OS-level controls over what an agent can do.
OpenAI — Codex performance (GPT-5.3 Codex)
- Embiricos says GPT-5.3 Codex is “significantly more efficient,” and OpenAI shipped serving speedups: API ~40% faster and Codex ~25% faster.
- He also teases news soon about an inference partnership (mentioned: Cerebras).
Codex integrations (practitioner hacks)
- Codex exposes an API via codex app-server.
- @SIGKITTEN says they built a native Codex iPhone app that can spawn/talk to Codex instances on their network—and even run locally on the iPhone.
- Andrew Mayne reports the Codex app can control an iPhone simulator to test an app, grab screenshots, and make adjustments—making automated tests easier to add.
LangChain — “harness engineering” (agent gains without model changes)
- LangChain says their coding agent jumped from Top 30 → Top 5 on Terminal Bench 2.0 by changing only the harness.
- Their definition: harness engineering is systems work to “mold” model behavior toward goals like task performance, token efficiency, and latency, via design choices like the system prompt, tool choice, and execution flow.
- They tease self-verification and tracing with LangSmith as high leverage.
- Read: https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
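A minimal sketch of the knobs LangChain names (system prompt, tool choice, execution flow), holding the model fixed. The Harness class and stub_model here are invented for illustration, not LangChain's code; a stub stands in for the LLM so the sketch runs offline:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """The 'knobs' that change agent behavior without touching the model."""
    system_prompt: str
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    max_steps: int = 8  # execution-flow knob: bound the loop

    def run(self, model: Callable[[list[dict]], dict], task: str) -> str:
        messages = [{"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": task}]
        for _ in range(self.max_steps):
            reply = model(messages)
            if reply.get("tool") in self.tools:
                # Tool-choice knob: which tools exist shapes what the agent does.
                result = self.tools[reply["tool"]](reply.get("args", ""))
                messages.append({"role": "tool", "content": result})
            else:
                return reply["content"]
        return "step budget exhausted"

# Stub model so the sketch runs without an API key:
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "echo", "args": "hello"}
    return {"content": "done: " + messages[-1]["content"]}

h = Harness(system_prompt="Be terse.", tools={"echo": lambda s: s.upper()})
print(h.run(stub_model, "demo task"))  # done: HELLO
```

Swapping the prompt, tool set, or step budget changes outcomes with the model untouched, which is the whole point of the framing.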
Gemini 3.1 Pro Preview — “benchmarks vs harness reality” (Theo’s take)
- Theo claims Gemini is hitting top benchmark numbers (e.g., “consistently hits 100%” on one benchmark), but in agent harnesses he sees tool-call instability and long-run confusion—especially in the Gemini CLI (loops, buggy behavior, supervision required).
- He contrasts this with harness-friendly tool calling in other models (e.g., “never see Haiku screw up the shape of a tool call”).
Google Antigravity — Gemini long-horizon demo
- Google Antigravity shared a demo: Gemini 3.1 Pro ingests a detailed paper and builds a functional local-first CRDT simulation with real-time sync visualization and connection toggling in one long-horizon task.
- Paper link they used: https://www.inkandswitch.com/essay/local-first/local-first.pdf
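For context on what the demo builds: a grow-only counter is one of the simplest CRDTs from the local-first literature. This toy version is an illustration of the concept, not code from the demo:

```python
# G-Counter CRDT: each replica tracks per-replica counts; merge takes the
# element-wise max, so replicas converge no matter the sync order.

def increment(state: dict, replica: str, n: int = 1) -> dict:
    out = dict(state)
    out[replica] = out.get(replica, 0) + n
    return out

def merge(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state: dict) -> int:
    return sum(state.values())

# Two replicas diverge while "offline", then sync:
a = increment({}, "A")        # replica A counts 1
b = increment({}, "B", 2)     # replica B counts 2
assert merge(a, b) == merge(b, a)  # merge is order-independent
assert value(merge(a, b)) == 3
```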
💡 WORKFLOWS & TRICKS
Delegation loop that matches how teams already work (plan → execute → review)
- Start with “plan mode”: agent proposes a detailed plan and asks questions/requests approval (framed like a new-hire RFC before starting work).
- Delegate execution once the plan/spec is agreed, then let the agent run without hands-on keyboard time.
- Add an explicit review pass: Codex reviewing its own PR/change is described as a common practice, and Embiricos says nearly all code at OpenAI is auto-reviewed by Codex on push.
Treat code review + quality as the real bottleneck (and invest there)
- Embiricos argues codegen is becoming “trivial,” and the underinvested bottleneck is knowing that the code is good and that you’re building the right thing; his north star is agents you trust to own full systems without human review.
“Make your repo easier for humans” often makes it easier for agents
- Example: test runners that dump everything are bad for humans and agents; filtering to only emit failed tests helps both.
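A sketch of that filtering idea; the failed_lines helper and the pytest-style sample output are assumptions here, so adjust the pattern to your runner:

```python
import re

# Filter a test runner's output down to just the failures -- less noise
# helps both humans and agents reading the result.

def failed_lines(output: str) -> list[str]:
    return [ln for ln in output.splitlines()
            if re.search(r"\bFAILED\b|\bERROR\b", ln)]

sample = """\
test_auth.py::test_login PASSED
test_auth.py::test_logout FAILED
test_db.py::test_migrate ERROR
test_db.py::test_query PASSED
"""
print("\n".join(failed_lines(sample)))
```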
Harness engineering (practical knobs to turn)
- If agent performance is spiky, treat the harness as the product: change system prompt, tooling, and execution flow to optimize for latency/token efficiency/performance—not just the underlying model.
- Add self-verification and instrument with tracing (LangChain calls out LangSmith as impactful here).
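One shape self-verification can take is a generate/verify/retry wrapper. This is a sketch under that assumption, not LangChain's implementation, and the function names are invented:

```python
# Wrap generation in a verify-and-retry loop so the agent checks its own
# work before returning it, feeding verifier feedback into the next attempt.

def with_self_verification(generate, verify, max_attempts: int = 3):
    def wrapped(task: str):
        feedback = None
        for _ in range(max_attempts):
            candidate = generate(task, feedback)
            ok, feedback = verify(candidate)
            if ok:
                return candidate
        raise RuntimeError(f"verification kept failing: {feedback}")
    return wrapped

# Toy example: the "code" must contain a return statement.
def generate(task, feedback):
    return "def f(): pass" if feedback is None else "def f(): return 42"

def verify(candidate):
    return ("return" in candidate, "missing a return statement")

solve = with_self_verification(generate, verify)
print(solve("write f"))  # def f(): return 42
```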
Agent observability → evaluations that actually regress-proof you (LangChain’s recipe)
- Instrument your agent in three primitives: runs (single LLM call), traces (full execution), threads (multi-turn sessions).
- When production breaks, turn traces into tests:
- User reports incorrect behavior
- Find the production trace
- Extract state at failure point
- Create a test case from that exact state
- Fix and validate
- Heuristic: start with trace-level evals (inputs are easy), add run-level evals when architecture stabilizes, and expect thread-level evals to be hardest/least common.
- Read: https://blog.langchain.com/agent-observability-powers-agent-evaluation
Minimal “agentic while-loop” harness pattern (Pi)
- Mario Zechner describes Pi as a minimal layer implementing the agent loop: send user input to an LLM, interpret whether to run a tool (he says ~4 core tools) or return a final answer; it’s extensible via plugins (even self-extensible).
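A rough sketch of that while-loop pattern (not Pi's actual code; the stub model and the read_file tool are placeholders):

```python
# Minimal agentic while-loop: send input to the model, run a tool if the
# model asks for one, otherwise return the final answer.

def agent_loop(model, tools: dict, user_input: str, max_steps: int = 10):
    history = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        decision = model(history)  # {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in decision:
            return decision["answer"]
        result = tools[decision["tool"]](decision["args"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("no final answer within step budget")

# Stub: call the read_file tool once, then answer with its result.
def stub_model(history):
    last = history[-1]
    if last["role"] == "tool":
        return {"answer": last["content"]}
    return {"tool": "read_file", "args": "notes.txt"}

tools = {"read_file": lambda path: f"<contents of {path}>"}
print(agent_loop(stub_model, tools, "what's in notes.txt?"))
```

Everything else (plugins, more tools, self-extension) layers on top of this loop.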
Non-programmers “programming” via natural language + spreadsheets (two concrete cases)
- Armin Ronacher recounts a lawyer paying for ChatGPT Pro because they “win more cases,” then using it to upload spreadsheets and have it output the rows that violate rules—his takeaway: non-programmers are starting to “indirectly program.”
- Mario Zechner helped his linguist wife use a terminal chat interface to ingest Excel/transcripts, transform data, run stats, and generate charts—turning “two months” of manual work into “two nights,” plus a deterministic pipeline.
👤 PEOPLE TO WATCH
- Alexander Embiricos (OpenAI Codex) — clearest articulation today of the shift to delegation + the coming bottleneck being review/trust, not codegen.
- LangChain team — practical, systems-first framing (“harness engineering”) + concrete eval/observability guidance that maps directly to real agent failures.
- Theo (t3.gg) — sharp, experience-based pressure test of Gemini-in-harnesses vs benchmark performance.
- Mario Zechner + Armin Ronacher — strong on-the-ground examples of non-programmers getting leverage (and the technical-debt caveat).
- Peter Steinberger (@steipete) — good reality check: agents accelerate work, but expectations rise too.
🎬 WATCH & LISTEN
1) OpenAI Codex lead — the “delegate, don’t pair” inflection (~17:18–19:17)
Hook: Embiricos describes the step-function shift from IDE-driven coding to plan/spec + delegation (“let it cook”), and claims most engineers he knows aren’t opening editors.
2) Mario Zechner — “manual coding is dead” (and what we lose) (~37:32–40:05)
Hook: A blunt take: the craft of writing code by hand is ending, but the scary part is whether new engineers develop the systems thinking needed to avoid runaway technical debt in large codebases.
📊 PROJECTS & REPOS
- Pi (minimal coding-agent harness) — a concrete “agentic while-loop” architecture you can replicate when building your own agent runner.
- LangChain’s harness engineering write-up — a playbook for getting large benchmark jumps by changing prompts/tools/flow, not models. https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
- LangChain’s agent observability → eval pipeline — turning production traces into regression tests. https://blog.langchain.com/agent-observability-powers-agent-evaluation
- Google Antigravity CRDT simulation demo (Gemini 3.1 Pro) + the referenced paper: https://www.inkandswitch.com/essay/local-first/local-first.pdf
Editorial take: The advantage is shifting from “can your model write code?” to “can your system reliably delegate and verify?” Plan-first loops, automated review, and trace-driven evals are quickly becoming the real moat.