Delegation-first agents: plan/review loops, harness engineering gains, and benchmark vs reality gaps
Feb 22
64 docs
A clear signal that coding agents are moving from IDE pairing to full delegation loops: plan/spec, execute, then automated review. Plus: harness engineering wins (Top 30→Top 5 on Terminal Bench), trace-driven eval tactics, and sharp practitioner comparisons of Gemini’s benchmark strength vs harness reliability.

🔥 TOP SIGNAL

OpenAI’s Codex product lead Alexander Embiricos says the meaningful workflow jump isn’t “better autocomplete,” it’s the shift from pairing to delegating: agree on a plan/spec, then let the agent run end-to-end (“let it cook”), with many engineers “basically not opening editors anymore.” He frames the next bottleneck as trust + quality control (code review and beyond), aiming for agents that can own a whole internal tool and close the full loop without human review.

🛠️ TOOLS & MODELS

  • OpenAI — Codex app (released last week)

    • Built to be ergonomic for delegating to multiple agents at once (explicitly not a text editor): it’s centered on delegation, review, and “skills” (open standard) for non-coding work like task triage or deploy monitoring.
    • Standards push: Agents.md as a vendor-neutral instruction file; OpenAI also pushed for a neutral Agents/ folder for skills/scripts (not “codex/”). A hypothetical sample file is sketched in the source notes further down.
    • Sandboxing: Embiricos describes “the most conservative sandboxing approach,” with sandboxing as OS-level controls over what an agent can do.
  • OpenAI — Codex performance (GPT-5.3 Codex)

    • Embiricos says GPT-5.3 Codex is “significantly more efficient,” and OpenAI shipped serving speedups: API ~40% faster and Codex ~25% faster.
    • He also teases news soon about an inference partnership (mentioned: Cerebras).
  • Codex integrations (practitioner hacks)

    • Codex exposes an API via codex app-server.
    • @SIGKITTEN says they built a native Codex iPhone app that can spawn/talk to Codex instances on their network—and even run locally on the iPhone.
    • Andrew Mayne reports Codex app can control an iPhone simulator to test an app, grab screenshots, and make adjustments—making automated tests easier to add.
  • LangChain — “harness engineering” (agent gains without model changes)

    • LangChain says their coding agent jumped from Top 30 → Top 5 on Terminal Bench 2.0 by only changing the harness.
    • Their definition: harness engineering is systems work that “molds” model behavior toward goals like task performance, token efficiency, and latency through design choices such as the system prompt, tool choice, and execution flow.
    • They tease self-verification and tracing with LangSmith as high leverage.
    • Read: https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
  • Gemini 3.1 Pro Preview — “benchmarks vs harness reality” (Theo’s take)

    • Theo claims Gemini is hitting top benchmark numbers (e.g., “consistently hits 100%” on one benchmark), but in agent harnesses he sees tool-call instability and long-run confusion—especially in the Gemini CLI (loops, buggy behavior, supervision required).
    • He contrasts this with harness-friendly tool calling in other models (e.g., “never see Haiku screw up the shape of a tool call”).
  • Google Antigravity — Gemini long-horizon demo

    • Google Antigravity shared a demo: Gemini 3.1 Pro ingests a detailed paper and builds a functional local-first CRDT simulation with real-time sync visualization and connection toggling in one long-horizon task.
    • Paper link they used: https://www.inkandswitch.com/essay/local-first/local-first.pdf

💡 WORKFLOWS & TRICKS

  • Delegation loop that matches how teams already work (plan → execute → review; a minimal code sketch appears at the end of this section)

    1. Start with “plan mode”: agent proposes a detailed plan and asks questions/requests approval (framed like a new-hire RFC before starting work).
    2. Delegate execution once the plan/spec is agreed, then let the agent run without hands-on keyboard time.
    3. Add an explicit review pass: Codex reviewing its own PR/change is described as a common practice, and Embiricos says nearly all code at OpenAI is auto-reviewed by Codex on push.
  • Treat code review + quality as the real bottleneck (and invest there)

    • Embiricos argues codegen is becoming “trivial,” and the underinvested bottleneck is knowing that code quality is good and that you’re building the right thing; his north star is agents you trust to own full systems without human review.
  • “Make your repo easier for humans” often makes it easier for agents

    • Example: test runners that dump everything are bad for humans and agents; filtering the output down to only the failed tests helps both (a toy wrapper is sketched at the end of this section).
  • Harness engineering (practical knobs to turn)

    • If agent performance is spiky, treat the harness as the product: change system prompt, tooling, and execution flow to optimize for latency/token efficiency/performance—not just the underlying model.
    • Add self-verification and instrument with tracing (LangChain calls out LangSmith as impactful here).
  • Agent observability → evaluations that actually regress-proof you (LangChain’s recipe)

    • Instrument your agent in three primitives: runs (single LLM call), traces (full execution), threads (multi-turn sessions).
    • When production breaks, turn traces into tests:
      1. User reports incorrect behavior
      2. Find the production trace
      3. Extract state at failure point
      4. Create a test case from that exact state
      5. Fix and validate
    • Heuristic: start with trace-level evals (inputs are easy), add run-level evals when architecture stabilizes, and expect thread-level evals to be hardest/least common.
    • Read: https://blog.langchain.com/agent-observability-powers-agent-evaluation
  • Minimal “agentic while-loop” harness pattern (Pi)

    • Mario Zechner describes Pi as a minimal layer implementing the agent loop: send user input to an LLM, interpret the response to decide whether to run a tool (he says ~4 core tools) or return a final answer; it’s extensible via plugins (even self-extensible).
  • Non-programmers “programming” via natural language + spreadsheets (two concrete cases)

    • Armin Ronacher recounts a lawyer paying for ChatGPT Pro because they “win more cases,” then using it to upload spreadsheets and output rows that violate rules—his takeaway: non-programmers are starting to “indirectly program.”
    • Mario Zechner helped his linguist wife use a terminal chat interface to ingest Excel/transcripts, transform data, run stats, and generate charts—turning “two months” of manual work into “two nights,” plus a deterministic pipeline.
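
Two quick sketches to make the patterns above concrete. First, the plan → execute → review delegation loop from the top of this section, assuming a generic `chat()` completion call; the prompts, function names, and approval flow are illustrative, not Codex’s actual interface:

```python
# Sketch of a plan -> execute -> review delegation loop (illustrative only).
# chat() stands in for any LLM completion client.

def chat(system: str, user: str) -> str:
    raise NotImplementedError("plug in your model client here")

def delegate(task: str, max_review_rounds: int = 2) -> str:
    # 1. Plan mode: the agent proposes a plan; a human approves before execution.
    plan = chat(
        system="You are a careful engineer. Propose a step-by-step plan and list open questions.",
        user=task,
    )
    if input(f"Proposed plan:\n{plan}\nApprove? [y/N] ").strip().lower() != "y":
        return "plan rejected"

    # 2. Execute: the agent runs end-to-end against the agreed plan ("let it cook").
    change = chat(system="Execute this plan. Output a unified diff.", user=plan)

    # 3. Review pass: the agent critiques its own change; iterate until clean.
    for _ in range(max_review_rounds):
        review = chat(
            system="Review this diff for bugs and deviations from the plan. Reply LGTM if clean.",
            user=f"PLAN:\n{plan}\n\nDIFF:\n{change}",
        )
        if "LGTM" in review:
            break
        change = chat(
            system="Revise the diff to address this review.",
            user=f"DIFF:\n{change}\n\nREVIEW:\n{review}",
        )
    return change
```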
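
Second, a toy version of the “only emit failed tests” idea: wrap the runner so failures are all that reach a human’s eyes or an agent’s context window. The pytest flags are real; the crude line filter is just one way to do it:

```python
# Run pytest but print only failure-related lines: less noise for humans,
# fewer wasted tokens for agents. Crude but illustrative.
import subprocess
import sys

def failed_tests_only() -> int:
    proc = subprocess.run(
        ["pytest", "-q", "--tb=line"],  # -q: terse output, --tb=line: one-line tracebacks
        capture_output=True,
        text=True,
    )
    for line in proc.stdout.splitlines():
        low = line.lower()
        if "fail" in low or "error" in low:  # keep failures and the summary line
            print(line)
    return proc.returncode

if __name__ == "__main__":
    sys.exit(failed_tests_only())
```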

👤 PEOPLE TO WATCH

  • Alexander Embiricos (OpenAI Codex) — clearest articulation today of the shift to delegation + the coming bottleneck being review/trust, not codegen.
  • LangChain team — practical, systems-first framing (“harness engineering”) + concrete eval/observability guidance that maps directly to real agent failures.
  • Theo (t3.gg) — sharp, experience-based pressure test of Gemini-in-harnesses vs benchmark performance.
  • Mario Zechner + Armin Ronacher — strong on-the-ground examples of non-programmers getting leverage (and the technical-debt caveat).
  • Peter Steinberger (@steipete) — good reality check: agents accelerate work, but expectations rise too.

🎬 WATCH & LISTEN

1) OpenAI Codex lead — the “delegate, don’t pair” inflection (~17:18–19:17)

Hook: Embiricos describes the step-function shift from IDE-driven coding to plan/spec + delegation (“let it cook”), and claims most engineers he knows aren’t opening editors.

2) Mario Zechner — “manual coding is dead” (and what we lose) (~37:32–40:05)

Hook: A blunt take: the craft of writing code by hand is ending, but the scary part is whether new engineers develop the systems thinking needed to avoid runaway technical debt in large codebases.

Editorial take: The advantage is shifting from “can your model write code?” to “can your system reliably delegate and verify?”—plan-first loops, automated review, and trace-driven evals are quickly becoming the real moat.

Armin Ronacher
Profile 1 doc

Pi (by Mario Zechner): minimal coding agent harness implementing the agentic while loop—user input goes to the LLM, which either requests a tool execution (4 core tools) or returns a final answer; extensible via plugins the LLM can write itself. Standalone alternative to Claude Code or Cursor; foundation for OpenClaw. A minimal loop in this style is sketched below.
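
The sketch, assuming a generic model client that returns either a tool request or a final answer; the client, tool set, and message shapes are stand-ins, not Pi’s actual code:

```python
# Pi-style agentic while loop: ask the model, run the tool it requests,
# feed the result back, stop at a final answer. Everything here is a
# stand-in; Pi reportedly uses ~4 core tools.
import json
import subprocess
from pathlib import Path

def _write_file(args: dict) -> str:
    Path(args["path"]).write_text(args["content"])
    return "ok"

TOOLS = {
    "read_file": lambda args: Path(args["path"]).read_text(),
    "write_file": _write_file,
    "bash": lambda args: subprocess.run(
        args["cmd"], shell=True, capture_output=True, text=True
    ).stdout,
}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call returning either
    {"tool": name, "args": {...}} or {"answer": text}."""
    raise NotImplementedError("plug in your model client here")

def agent_loop(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:  # the model decided it is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])  # run the requested tool
        messages.append({"role": "tool", "content": json.dumps(
            {"tool": reply["tool"], "result": result})})
    return "step budget exhausted"
```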

Firsthand workflows (Mario Zechner, experienced ML engineer using agents in side projects):

  • Cloud Code (early agent by Peter): navigates the filesystem autonomously vs. Cursor's single-file limit; built a multi-platform Whisper voice-transcription app (CMake build system, grunt work) overnight vs. weeks manually. Prompt: 'do it on machine and good luck, make no mistakes'.
  • Non-programmer use: his wife (a linguist) used a terminal chat interface in natural language: load Excel transcripts → dissect/transform → run stats and charts. Two months of manual work became two nights, plus a deterministic pipeline.

Contrarian take (Mario): manual code writing is 'dead'; agents enable 'throwaway code without writing it,' but large codebases need the systemic thinking honed by manual experience to avoid technical debt.

Armin Ronacher (Flask creator, ex-Sentry; firsthand experimenter): non-programmers (e.g., an air traffic controller) use agents for data analysis and programming via natural language (e.g., ChatGPT Pro on spreadsheets). Model preference: Claude 3.5 Sonnet ('Codex 5.3') for cheap tokens.

Theo - t3.gg
youtube 1 doc

Theo (t3.gg, full-stack TypeScript dev) shares firsthand experiences using Gemini 3.1 Pro Preview on side projects: porting old apps, migrating SQLite to Postgres, and UI redesigns.

Workflows & Issues:

  • Runs it in the Gemini CLI (buggy: model switching, no 3.1 on day one, tool-call failures, a 100-line file-read limit, loops requiring supervision) vs. the Cursor CLI (decent but slow reconnects and hidden thinking traces; its tuned prompts emphasize specialized tools over raw `cat`).
  • Struggles with long agentic tasks (poor on METR-style evals for >1 hr human-equivalent work), unlike Opus 4.6/GPT 5.2; needs constant unblocking.

Comparisons: Gemini tops benchmarks (skatebench 100%, Convex 95% with guidelines) but makes inconsistent tool calls (overuse, underuse, bad syntax); worse than Claude Haiku/Sonnet for reliability. Contrarian: "Insanely smart but sucks in harnesses—benchmaxing without RL for agentic use".

Browser agent demo (Claude 4.6 + Kernel): cloud browser (<30 ms spin-up), navigate to a site, screenshot it, and summarize a blog via code.

Alexander Embiricos
Profile 1 doc

Alexander Embiricos, Codex product lead at OpenAI, shares firsthand production usage insights.

Workflow shift at OpenAI: from pair programming/tab completion to full delegation—collaborate on a plan/spec, then "let it cook" without hands-on keyboard time; the vast majority of code is now AI-written, and most engineers avoid opening IDEs. The inflection came with GPT-5.2 Codex handling end-to-end tasks and context management. Engineers run Codex constantly (never close their laptops), even in meetings.

Plan mode: Agent proposes detailed plan, seeks approval/questions (like new hire RFC).

Code review: Codex auto-reviews its own PRs/changes (trained for high-signal, low-false-positive feedback); standard practice at OpenAI on every push to a Git repo.

Codex app (released last week): ergonomic for delegating to multiple agents; no text editing (the focus is delegation/review); skills (an open standard) cover non-coding work like task triage and deploy monitoring, and live in a neutral .agents/ folder (adopted widely, except by Claude). Agents.md: vendor-neutral instructions file (a convention OpenAI pioneered, now widely adopted).
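
Since Agents.md is just a markdown convention, a minimal file is easy to picture; the contents below are invented for illustration:

```markdown
# AGENTS.md: instructions for coding agents working in this repo

## Build & test
- Install with `pip install -e ".[dev]"`; run tests with `pytest -q`.
- Lint with `ruff check .` before committing.

## Conventions
- Keep changes small; split anything over ~300 lines into multiple PRs.
- Never hand-edit files under `generated/`.

## Review
- After any change, run the test suite and summarize failures in the PR description.
```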

Performance: GPT-5.3 Codex is more efficient; the API is ~40% faster and Codex ~25% faster; a Cerebras inference partnership is incoming. Growth: 20x since August (GPT-5 launch), doubled from December to now.

Comparisons:

  • Claude Code: genius move making an easy terminal app for experimentation/tinkering. Mistake: over-indexed on the CLI, which limits delegation beyond power users.
  • Cursor: meets devs in the IDE (e.g., Cmd+K inline edits); a low-friction switch from VS Code that ladders up to agentic workflows.

Patterns: build fluency and tools for individuals first (bottom-up), then productize for the enterprise (pure top-down risks under-leverage); sandboxing (OS-level controls) is critical for enterprise stickiness; agent and human interfaces overlap (e.g., concise test outputs help both). Future: cloud agents tightly integrated with local ones; full iterative loops (codegen/review/deploy/monitor) without human review.

Jason Zhou
x 2 docs

@jasonzhou1993 praises @damian_lisz's use of pinned context and flow iteration in SuperDesignDev.

@damian_lisz: "The prompt is the key, do not get discouraged with bad results at first, just refine your prompts" (with demo video).

Demo: https://x.com/damian_lisz/status/2024449738494394840.

LangChain Blog

LangChain team emphasizes agent observability primitives for debugging and evaluating coding agents: runs (single LLM steps with prompts/tools/context), traces (full executions with tool calls and nested structure), threads (multi-turn sessions preserving state evolution).

Evaluation granularities with coding examples:

  • Single-step (run-level): validate tool choice/args, e.g., derived from production failures; ~50% of agent test suites use these.
  • Full-turn (trace-level): check the trajectory (e.g., a coding bug fix: read_file → edit_file → run_tests), the final response, and state changes (verify written files contain correct code); sketched below.
  • Multi-turn (thread-level): test context persistence, e.g., the user prefers Python over JavaScript across turns (share preference → example → script).
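
A sketch of the full-turn (trace-level) check as a plain pytest test; the trace shape and loader below are invented for illustration, not LangSmith's schema:

```python
# Trace-level eval: assert that a bug-fix trajectory includes
# read_file -> edit_file -> run_tests, in that order. The trace format
# here is a made-up minimal shape, not LangSmith's actual schema.
import json
from pathlib import Path

def load_trace(path: str) -> dict:
    return json.loads(Path(path).read_text())

def tool_sequence(trace: dict) -> list[str]:
    return [step["tool"] for step in trace["steps"] if "tool" in step]

def test_bugfix_trajectory():
    seq = tool_sequence(load_trace("traces/bugfix_example.json"))
    required = ["read_file", "edit_file", "run_tests"]
    assert all(t in seq for t in required), f"missing tools in {seq}"
    positions = [seq.index(t) for t in required]
    assert positions == sorted(positions), f"out-of-order trajectory: {seq}"
```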

Production debugging insights (firsthand LangChain experience):

  • A coding agent erred after 10 good turns because an incorrect memory update in turn 6 compounded over time.
  • Inefficiency: repeated read_file calls on the same file; fixed via a prompt instruction to keep the file contents in context.

Heuristics: start with full-turn evals (inputs are easy to construct), add single-step evals once the architecture stabilizes; multi-turn evals are the least common. Production traces auto-build eval datasets; a trace-to-test sketch follows below.
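
The trace-to-test recipe (find the production trace, extract state at the failure point, pin the fix) can be as plain as the sketch below; the fixture path, replay helper, and memory example are illustrative:

```python
# Regression test built from a production trace: freeze the agent's state at
# the failure point into a fixture, replay the failing step, and pin the
# corrected behavior (here, the turn-6 memory-update bug from above).
import json
from pathlib import Path

def replay_step(state: dict) -> dict:
    raise NotImplementedError("wire this to your agent's single-step executor")

def test_memory_update_from_trace_turn6():
    # State extracted from the production trace at the exact failure point.
    state = json.loads(Path("fixtures/trace_1234_turn6.json").read_text())
    new_state = replay_step(state)
    # The user said they prefer Python; the memory update must preserve that.
    assert new_state["memory"]["preferred_language"] == "python"
```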

Tool: LangSmith for traces/evals (https://smith.langchain.com/).

Peter Steinberger 🦞
x 1 doc

@steipete (experienced AI practitioner) offers a contrarian take: using agents to build software accelerates the process but "removes all the work"; it also "upped expectations".

Greg Brockman
x 2 docs

Codex provides a nice API via codex app-server.

@SIGKITTEN's firsthand account (serious project): investigated Codex for a project and built a native Codex iPhone app—it can spawn and talk to Codex instances anywhere on their network, and even run Codex locally on the iPhone. Challenges competitors: "gl doing that, cc".

Endorsed by Greg Brockman (OpenAI President & Co-Founder).

Google Antigravity
x 2 docs

Gemini 3.1 Pro demonstrates building a functional local-first CRDT simulation by ingesting a detailed research paper, including real-time data-sync visualization and connection toggling, producing a debugged interactive app in one long-horizon task — shared as a firsthand demo by @antigravity (Google Antigravity platform).

Paper: https://www.inkandswitch.com/essay/local-first/local-first.pdf
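
Background for the demo rather than code from it: a CRDT is a replicated data type whose merge is commutative, associative, and idempotent, so replicas that edit offline and sync in any order still converge. A toy last-writer-wins map in Python:

```python
# Toy last-writer-wins (LWW) map: each key stores (timestamp, value) and
# merge keeps the newer write, so any two replicas converge regardless of
# sync order. Real CRDTs also need a deterministic tie-breaker; here ties
# fall back to comparing values.
import time

class LWWMap:
    def __init__(self) -> None:
        self.entries: dict[str, tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self.entries[key] = (time.time(), value)

    def get(self, key: str) -> str | None:
        entry = self.entries.get(key)
        return entry[1] if entry else None

    def merge(self, other: "LWWMap") -> None:
        for key, entry in other.entries.items():
            # Tuple comparison: newer timestamp wins; equal timestamps
            # fall back to the value, keeping the merge deterministic.
            if key not in self.entries or entry > self.entries[key]:
                self.entries[key] = entry

# Two replicas edit "offline", then sync in either order and converge:
a, b = LWWMap(), LWWMap()
a.set("title", "Local-first")
b.set("title", "Local-first apps")  # the later write wins after merging
a.merge(b)
b.merge(a)
assert a.get("title") == b.get("title") == "Local-first apps"
```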

LangChain
x 2 docs

LangChain's coding agent improved from Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness.

Harness engineering molds a model's spiky intelligence for specific tasks by building tooling around it, optimizing for task performance, token efficiency, and latency; key design decisions include the system prompt, tool choice, and execution flow (sketched below).

Teaser for improvements: self-verification and tracing with LangSmith help a lot.

From the LangChain team (makers of LangChain/LangSmith); firsthand experience with their production coding agent.

Read more: blog.langchain.com/improving-deep-agents-with-harness-engineering/.
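
What "only changing the harness" can look like in code: hold the model fixed and vary only the system prompt, tool subset, step budget, and a self-verification flag, then compare the variants on the same tasks. The structure is illustrative, not LangChain's implementation:

```python
# Harness engineering sketch: same model, different harness knobs, measured
# on identical tasks. run_agent() is a stand-in for your agent loop + grader.
from dataclasses import dataclass, field

@dataclass
class Harness:
    system_prompt: str
    tools: list[str] = field(default_factory=list)
    max_steps: int = 20
    self_verify: bool = False  # re-run tests before declaring success

BASELINE = Harness(
    system_prompt="You are a helpful coding agent.",
    tools=["bash", "read_file", "write_file", "edit_file", "grep"],
)

TUNED = Harness(
    system_prompt=(
        "Plan before acting. Prefer targeted file reads over broad searches. "
        "Verify every change by running the tests."
    ),
    tools=["read_file", "edit_file", "run_tests"],  # smaller, sharper toolset
    max_steps=40,
    self_verify=True,
)

def run_agent(task: str, harness: Harness) -> bool:
    raise NotImplementedError("wire this to your agent loop and a pass/fail grader")

def compare(tasks: list[str]) -> None:
    for name, harness in [("baseline", BASELINE), ("tuned", TUNED)]:
        solved = sum(run_agent(task, harness) for task in tasks)
        print(f"{name}: {solved}/{len(tasks)} solved")
```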

Greg Brockman
x 2 docs

The OpenAI Codex app can control an iPhone simulator for end-to-end app testing, grabbing screenshots and making adjustments. This simplifies adding automated tests.

@gdb highlights it for end-to-end dev workflows.

Firsthand from @AndrewMayne; link: https://x.com/andrewmayne/status/2025025783115514147.