Delegation-first agents: plan/review loops, harness engineering gains, and benchmark vs reality gaps
Feb 22
6 min read
64 docs
A clear signal that coding agents are moving from IDE pairing to full delegation loops: plan/spec, execute, then automated review. Plus: harness engineering wins (Top 30→Top 5 on Terminal Bench), trace-driven eval tactics, and sharp practitioner comparisons of Gemini’s benchmark strength vs harness reliability.

🔥 TOP SIGNAL

OpenAI’s Codex product lead Alexander Embiricos says the meaningful workflow jump isn’t “better autocomplete,” it’s the shift from pairing to delegating: agree on a plan/spec, then let the agent run end-to-end (“let it cook”), with many engineers “basically not opening editors anymore.” He frames the next bottleneck as trust + quality control (code review and beyond), aiming for agents that can own a whole internal tool and close the full loop without human review.

🛠️ TOOLS & MODELS

  • OpenAI — Codex app (released last week)

    • Built to be ergonomic for delegating to multiple agents at once (explicitly not a text editor): it’s centered on delegation, review, and “skills” (open standard) for non-coding work like task triage or deploy monitoring.
    • Standards push: Agents.md as a vendor-neutral instruction file; OpenAI also pushed for a neutral Agents/ folder for skills/scripts (not “codex/”).
    • Sandboxing: Embiricos describes taking “the most conservative sandboxing approach,” framing sandboxing as OS-level controls over what an agent can do.
  • OpenAI — Codex performance (GPT-5.3 Codex)

    • Embiricos says GPT-5.3 Codex is “significantly more efficient,” and OpenAI shipped serving speedups: API ~40% faster and Codex ~25% faster.
    • He also teases news soon about an inference partnership (mentioned: Cerebras).
  • Codex integrations (practitioner hacks)

    • Codex exposes an API via codex app-server.
    • @SIGKITTEN says they built a native Codex iPhone app that can spawn/talk to Codex instances on their network—and even run locally on the iPhone.
    • Andrew Mayne reports Codex app can control an iPhone simulator to test an app, grab screenshots, and make adjustments—making automated tests easier to add.
  • LangChain — “harness engineering” (agent gains without model changes)

    • LangChain says their coding agent jumped from Top 30 → Top 5 on Terminal Bench 2.0 by changing only the harness.
    • Their definition: harness engineering is systems work to “mold” model behavior toward goals like task performance, token efficiency, and latency through design choices such as the system prompt, tool choice, and execution flow.
    • They point to self-verification and tracing with LangSmith as especially high-leverage.
    • Read: https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
  • Gemini 3.1 Pro Preview — “benchmarks vs harness reality” (Theo’s take)

    • Theo claims Gemini is hitting top benchmark numbers (e.g., “consistently hits 100%” on one benchmark), but in agent harnesses he sees tool-call instability and long-run confusion—especially in the Gemini CLI (loops, buggy behavior, supervision required).
    • He contrasts this with harness-friendly tool calling in other models (e.g., “never see Haiku screw up the shape of a tool call”).
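For context, “the shape of a tool call” refers to the structured payload a harness expects back from the model. A generic illustration of well-formed vs. malformed shapes (not any vendor’s exact schema):

```python
# Illustrative only: a well-formed tool call vs. the kind of malformed shape
# that forces a harness to detect and repair it. Generic structure, not any
# vendor's exact schema.
well_formed = {
    "name": "read_file",
    "arguments": {"path": "src/app.py"},   # parsed args matching the declared schema
}

malformed = {
    "name": "read_file",
    "arguments": '{"path": "src/app.py"',  # truncated, stringified JSON the harness must recover from
}
```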
  • Google Antigravity — Gemini long-horizon demo

    • Google Antigravity shared a demo: Gemini 3.1 Pro ingests a detailed paper and builds a functional local-first CRDT simulation with real-time sync visualization and connection toggling in one long-horizon task.
    • Paper link they used: https://www.inkandswitch.com/essay/local-first/local-first.pdf

💡 WORKFLOWS & TRICKS

  • Delegation loop that matches how teams already work (plan → execute → review)

    1. Start with “plan mode”: agent proposes a detailed plan and asks questions/requests approval (framed like a new-hire RFC before starting work).
    2. Delegate execution once the plan/spec is agreed, then let the agent run without hands-on keyboard time.
    3. Add an explicit review pass: Codex reviewing its own PR/change is described as a common practice, and Embiricos says nearly all code at OpenAI is auto-reviewed by Codex on push.
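A minimal sketch of that plan → execute → review loop in Python, assuming a generic agent backend; `ask_agent` and `human_approves` are hypothetical placeholders, not Codex or OpenAI APIs:

```python
# Sketch of the plan -> execute -> review delegation loop described above.

def ask_agent(prompt: str) -> str:
    """Placeholder: send a prompt to your coding agent and return its reply."""
    raise NotImplementedError


def human_approves(plan: str) -> bool:
    """Placeholder approval gate: show the plan and wait for sign-off."""
    return input(f"{plan}\nApprove this plan? [y/N] ").strip().lower() == "y"


def delegate(task: str) -> str:
    # 1. Plan mode: the agent proposes a plan and surfaces open questions.
    plan = ask_agent(f"Propose a detailed plan for: {task}. List any open questions.")
    if not human_approves(plan):
        return "plan rejected"

    # 2. Execute: let the agent run end-to-end against the agreed plan.
    change = ask_agent(f"Implement this approved plan:\n{plan}")

    # 3. Review pass: a separate review prompt over the produced change.
    return ask_agent(f"Review this change against the plan and flag issues:\n{plan}\n---\n{change}")
```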
  • Treat code review + quality as the real bottleneck (and invest there)

    • Embiricos argues codegen is becoming “trivial”; the underinvested bottleneck is knowing that code quality is good and that you’re building the right thing. His north star: agents you trust to own full systems without human review.
  • “Make your repo easier for humans” often makes it easier for agents

    • Example: test runners that dump everything are bad for humans and agents; filtering to only emit failed tests helps both.
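A small illustration of the idea for a Python repo, as a sketch: wrap the test run and only surface output when something fails (the pytest flags are standard; the wrapper itself is just an example):

```python
# Sketch: a thin wrapper that runs the test suite but only re-emits output on
# failure, so both humans and agents get a short, relevant report.
import subprocess
import sys


def run_tests_quietly() -> int:
    # -q: quiet, --tb=short: compact tracebacks, -ra: summary of non-passing tests
    result = subprocess.run(
        ["pytest", "-q", "--tb=short", "-ra"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Only dump details when something actually failed.
        sys.stdout.write(result.stdout)
        sys.stderr.write(result.stderr)
    else:
        print("all tests passed")
    return result.returncode


if __name__ == "__main__":
    sys.exit(run_tests_quietly())
```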
  • Harness engineering (practical knobs to turn)

    • If agent performance is spiky, treat the harness as the product: change system prompt, tooling, and execution flow to optimize for latency/token efficiency/performance—not just the underlying model.
    • Add self-verification and instrument with tracing (LangChain calls out LangSmith as impactful here).
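A sketch of what treating the harness as the product can look like, with the knobs made explicit as configuration; the names below are illustrative and not LangChain’s API:

```python
# Sketch: the harness as a first-class object whose knobs (system prompt,
# tool set, execution limits, self-verification) are tuned independently of
# the underlying model. Illustrative names only.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    system_prompt: str
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)
    max_steps: int = 20            # execution-flow knob: cap agent iterations
    verify_after_run: bool = True  # self-verification pass before returning


# Two harness variants around the same model: only the knobs differ.
lean = Harness(system_prompt="Be terse. Prefer read-only tools first.",
               max_steps=10, verify_after_run=False)
careful = Harness(system_prompt="Plan, act, then verify with the test runner.",
                  max_steps=40, verify_after_run=True)
```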
  • Agent observability → evaluations that actually regress-proof you (LangChain’s recipe)

    • Instrument your agent in three primitives: runs (single LLM call), traces (full execution), threads (multi-turn sessions).
    • When production breaks, turn traces into tests:
      1. User reports incorrect behavior
      2. Find the production trace
      3. Extract state at failure point
      4. Create a test case from that exact state
      5. Fix and validate
    • Heuristic: start with trace-level evals (inputs are easy), add run-level evals when architecture stabilizes, and expect thread-level evals to be hardest/least common.
    • Read: https://blog.langchain.com/agent-observability-powers-agent-evaluation
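A sketch of step 4 of that recipe as a pytest-style regression test; the trace state and the `run_agent_from_state` entry point are hypothetical stand-ins, not LangSmith’s API:

```python
# Sketch: turning a production trace into a regression test.

# State extracted from the failing production trace (step 3 of the recipe):
FAILING_TRACE_STATE = {
    "messages": [
        {"role": "user", "content": "Cancel my order #1234"},
        {"role": "assistant", "content": "...the tool call that went wrong..."},
    ],
}


def run_agent_from_state(state: dict) -> dict:
    """Placeholder: replay your agent from an intermediate state."""
    raise NotImplementedError


def test_cancel_order_regression():
    # Step 4: replay from the exact failure point...
    result = run_agent_from_state(FAILING_TRACE_STATE)
    # Step 5: ...and assert the fixed behavior so this failure can't silently return.
    assert result["final_tool"] == "cancel_order"
```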
  • Minimal “agentic while-loop” harness pattern (Pi)

    • Mario Zechner describes Pi as a minimal layer implementing the agent loop: send user input to an LLM, then interpret the response to decide whether to run a tool (he says ~4 core tools) or return a final answer; it’s extensible via plugins (even self-extensible).
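A generic sketch of that while-loop in Python (not Pi’s actual code); `call_llm` and the tool registry are hypothetical stand-ins:

```python
# Sketch of the agentic while-loop: call the model, run the requested tool,
# feed the result back, repeat until the model returns a final answer.
import subprocess

TOOLS = {  # roughly "a handful of core tools"
    "read_file": lambda path: open(path).read(),
    "write_file": lambda path, text: open(path, "w").write(text),
    "run_shell": lambda cmd: subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout,
}


def call_llm(messages: list[dict]) -> dict:
    """Placeholder: returns either {'tool': name, 'args': {...}} or {'answer': text}."""
    raise NotImplementedError


def agent_loop(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        step = call_llm(messages)
        if "answer" in step:                           # model decided it is done
            return step["answer"]
        output = TOOLS[step["tool"]](**step["args"])   # run the requested tool
        messages.append({"role": "tool", "content": str(output)})
```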
  • Non-programmers “programming” via natural language + spreadsheets (two concrete cases)

    • Armin Ronacher recounts a lawyer paying for ChatGPT Pro because they “win more cases,” then using it to upload spreadsheets and have it output the rows that violate certain rules; his takeaway: non-programmers are starting to “indirectly program.”
    • Mario Zechner helped his linguist wife use a terminal chat interface to ingest Excel/transcripts, transform data, run stats, and generate charts—turning “two months” of manual work into “two nights,” plus a deterministic pipeline.
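For flavor, a sketch of the kind of deterministic pipeline that replaces this manual spreadsheet work, assuming pandas/matplotlib and made-up file and column names:

```python
# Sketch: ingest an Excel file, transform, compute simple stats, save a chart.
# File and column names are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt


def build_report(xlsx_path: str) -> None:
    df = pd.read_excel(xlsx_path)                      # ingest
    df["duration_s"] = df["end_s"] - df["start_s"]     # transform
    stats = df.groupby("speaker")["duration_s"].agg(["count", "mean", "sum"])
    print(stats)                                       # run stats
    stats["sum"].plot(kind="bar", title="Speaking time per speaker")
    plt.tight_layout()
    plt.savefig("speaking_time.png")                   # generate chart


if __name__ == "__main__":
    build_report("transcripts.xlsx")
```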

👤 PEOPLE TO WATCH

  • Alexander Embiricos (OpenAI Codex) — clearest articulation today of the shift to delegation + the coming bottleneck being review/trust, not codegen.
  • LangChain team — practical, systems-first framing (“harness engineering”) + concrete eval/observability guidance that maps directly to real agent failures.
  • Theo (t3.gg) — sharp, experience-based pressure test of Gemini-in-harnesses vs benchmark performance.
  • Mario Zechner + Armin Ronacher — strong on-the-ground examples of non-programmers getting leverage (and the technical-debt caveat).
  • Peter Steinberger (@steipete) — good reality check: agents accelerate work, but expectations rise too.

🎬 WATCH & LISTEN

1) OpenAI Codex lead — the “delegate, don’t pair” inflection (~17:18–19:17)

Hook: Embiricos describes the step-function shift from IDE-driven coding to plan/spec + delegation (“let it cook”), and claims most engineers he knows aren’t opening editors.

2) Mario Zechner — “manual coding is dead” (and what we lose) (~37:32–40:05)

Hook: A blunt take: the craft of writing code by hand is ending, but the scary part is whether new engineers develop the systems thinking needed to avoid runaway technical debt in large codebases.

📊 PROJECTS & REPOS


Editorial take: The advantage is shifting from “can your model write code?” to “can your system reliably delegate and verify?” Plan-first loops, automated review, and trace-driven evals are quickly becoming the real moat.
