ZeroNoise
SWE-bench Verified is deprecated; WebSockets land in Responses API; AgentMD skepticism goes mainstream
Feb 24
6 min read
185 docs
SWE-bench Verified is being retired as a frontier coding eval: OpenAI says it’s saturated, contaminated, and riddled with test-design issues—SWE-bench Pro is the new recommendation. Also: practical agent workflows (red/green TDD, conformance-test-driven ports), new tool updates (Responses API WebSockets, Codex CLI multi-agent), and a hard look at when AGENTS.md helps vs just adds cost.

🔥 TOP SIGNAL

OpenAI is stopping SWE-bench Verified reporting and recommending SWE-bench Pro, citing benchmark saturation, contamination (frontier models can regurgitate solutions/problem statements from the Task ID), and test-design issues that make a large chunk of remaining tasks effectively unsound to chase. If you’re using SWE-bench numbers to pick models or to market agent gains, this is a hard reset on what “good” looks like in coding evals.

🛠️ TOOLS & MODELS

  • OpenAI Responses API — WebSockets mode

    • New WebSockets support aimed at low-latency, long-running agents with heavy tool calls (explicitly positioned as good for coding agents).
    • Docs: http://developers.openai.com/api/docs/guides/websocket-mode.
    • Huet notes it was built to “keep up” with GPT-5.3-Codex-Spark.
  • Codex CLI — multi-agent mode

    • Enable multiple specialized agents in one session (each with its own role/model/behavior).
    • Setup:
      1. Open ~/.codex/config.toml
      2. Add [features] multi_agent = true
      3. Run /experimental → “Multi-agent mode is now on”
    • Comes with explorer / worker / general helper agents out of the box.
  • Agentic “full stack orchestration” demo — Antigravity

    • “Add GPay to your website” via one prompt: detects Angular, installs deps, edits frontend+backend, then verifies via an automated browser run.
  • OpenClaw — new beta

    • Steipete shipped a new OpenClaw beta focusing on security, bugfixes, and new features including the Kilo provider and Kimi vision + video support.
  • Practitioner model notes (Codex vs Claude, cost/latency)

    • Multiple practitioners are calling GPT-5.3-Codex + Codex app the best option “for getting software dev work done,” with strong instruction-following (trade-off: more “machine-like” personality). Brockman attributes this to heavy investment + model/harness co-design + rapid post-training iterations.
    • QuinnyPig reports Codex made Claude Code feel dramatically weaker after testing (starting from skepticism).
    • Claude Code pain points surfaced today:
      • “Opus 4.6 is thinking WAY TOO long” (annoying, not delivering value).
      • Primeagen tried “Claude fast 4.6” for high-stakes work and spent $100s in ~1 hour (but said it was fast).

💡 WORKFLOWS & TRICKS

  • New eval reality: stop optimizing for brittle tests

    • OpenAI’s critique: SWE-bench Verified became less meaningful at high scores—narrow tests can devolve into “guessing” exact names/implementation details rather than measuring coding ability.
    • What they say they want next: longer-term tasks, open-ended design decisions, code quality/maintainability, real-world product building, and human-intensive rubric evaluation.
  • Red/green TDD as an agent control surface (Willison)

    • Prompt pattern: write tests first → confirm they fail (“red”) → implement until they pass (“green”).
    • Why it works with agents: reduces the odds of shipping code that doesn’t work or that’s unnecessary, and leaves you with a regression suite.
    • Copy/paste starter prompt:
      • Build a Python function to extract headers from a markdown string. Use red/green TDD.
  • “Conformance suite + reference implementation” makes big agentic ports safer (Ladybird)

    • Andreas Kling ported LibJS to Rust using Claude Code and Codex, but emphasizes it was human-directed (he chose what to port, in what order, and how the Rust should look).
    • Guardrails that mattered:
      • Started with components that had strong test262 coverage.
      • Required byte-for-byte identical output vs the C++ pipeline; verified identical ASTs and bytecode; reported zero regressions.
    • Result: ~25,000 lines of Rust in ~two weeks (vs “multiple months” manually).
  • Context files (AGENTS.md / CLAUDE.md): when they help vs when they’re just tax

    • Theo cites a study on “context files” for GitHub issue resolution:
      • Dev-written context files: only +4% success vs omitting.
      • LLM-generated context files: -3% success.
      • More exploration/testing/reasoning → >20% higher costs.
      • Recommendation: omit LLM-generated context files; keep only minimal non-discoverable requirements like specific tooling.
    • Addy Osmani’s rule of thumb: auto-generated AGENTS.md duplicates what agents can discover and inflates cost; human-written files help mainly for non-discoverable gotchas/conventions/landmines. He suggests treating AGENTS.md as a living list of codebase smells (not permanent config).
    • Theo’s practical heuristics:
      • Don’t distract the model with irrelevant background—keep it focused on “the thing”.
      • If the info is in the codebase, it often doesn’t belong in AGENTS.md; models can usually find what they need (e.g., via package.json + repo search).
      • If you’re investing time, prioritize unit/integration tests, type checks, and feedback systems you can expose to the model over growing AGENTS.md files.
  • Agentic quality loops you can steal

    • Automated “review → fix → review” loop (Armin Ronacher): his /review extension for ralph loops between “review on an empty branch” and “go back and fix your shit” until P0/P1/P2 issues are resolved.
    • Unblock multi-step tasks (Theo): if step 2 keeps failing, ask the agent for step 3—he claims it often back-solves step 2 to get there.
    • Infra upgrade prompt that actually worked (Ronacher): “upgrade me to postgres 18. don’t make any mistakes”—shared as a successful approach for painful major version upgrades.
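Ronacher’s review → fix loop generalizes beyond his /review extension. Below is a hypothetical sketch of the control flow only — `run_agent` is a stand-in for shelling out to a coding agent, and none of this is his actual tool:

```python
# Hypothetical sketch of an automated review -> fix loop in the spirit of
# Ronacher's /review flow. run_agent stands in for invoking a coding agent.
def review_fix_loop(run_agent, max_rounds=5):
    """Alternate review and fix passes until the reviewer reports no
    P0/P1/P2 issues; return the number of review rounds used."""
    for round_no in range(1, max_rounds + 1):
        issues = run_agent("review the branch; list remaining P0/P1/P2 issues")
        if not issues:  # reviewer came back clean -> done
            return round_no
        run_agent(f"fix these issues: {issues}")
    return max_rounds

# Simulated agent: two rounds of findings, then a clean review.
findings = [["P0: crash on empty input"], ["P2: missing test"], []]
rounds = review_fix_loop(
    lambda prompt: findings.pop(0) if prompt.startswith("review") else None
)
assert rounds == 3
```

The `max_rounds` cap matters in practice: an agent that keeps “finding” new P2s would otherwise loop forever.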

👤 PEOPLE TO WATCH

  • Simon Willison — launched Agentic Engineering Patterns (written by him, not an LLM) and is turning scattered best practices into an evergreen “guide” format. First chapters: “writing code is cheap now” and “red/green TDD”.
  • Theo (t3.gg) — consistently practical on agent context management; argues many AGENTS.md/CLAUDE.md setups are counterproductive and measured as a cost/latency hit.
  • Addy Osmani — sharp framing: AGENTS.md should be about non-discoverable landmines, and a single root file won’t scale for complex repos (he argues for a hierarchy of scoped files).
  • Kent C. Dodds — evolving his reviews of agent code toward “is it actually wrong or just different,” focusing on principles over personal style; also calls out UI “taste” as a remaining bottleneck (CSS + knowing when UI looks bad).
  • Armin Ronacher — hands-on, blunt tool feedback: calls MCP architecture token-inefficient/resource-intensive and says it underperforms “skills” in his testing.

🎬 WATCH & LISTEN

1) Prompt/context hierarchy explained (and why “extra context” sneaks into every request) — Theo (≈ 7:10–10:28)

Hook: A concrete mental model for why AGENTS.md/CLAUDE.md “rules” are sticky: provider/system/developer/user layers, and everything above the user layer gets resent each turn—so context decisions directly impact cost and behavior.

2) What a “better coding benchmark” should measure — Latent Space + OpenAI Frontier Evals (≈ 14:04–15:51)

Hook: The team argues we’re moving beyond “solve a small GitHub issue” toward longer-running tasks and harder-to-measure signals like design taste, code quality, and maintainability.

Editorial take: “Writing code is cheap now,” but proving it’s good (tests, evals, reviews, and anti-contamination discipline) is where serious teams will win.

Latent.Space
latent 1 doc

OpenAI Frontier Evals team discontinues SWE-Bench Verified due to saturation (~80% frontier model scores), contamination (all frontier models regurgitate gold patches/problem statements from Task ID alone), and >60% unsolvable remaining tasks (49 narrow tests rejecting correct solutions, 26 requiring unspecified features).

Endorses SWE-Bench Pro (Scale AI): harder (1-4+ hour expert tasks, diverse repos/languages), less contaminated. OpenAI is not SOTA on it (Gemini 3 outperforms GPT-5.x).

Ideal coding evals (per Mia Glaese, VP Research/Codex/Human Data/Alignment; Olivia Watkins, Frontier Evals Researcher, Verified coauthors): longer-term tasks (hours/days), open-ended design decisions, code quality/maintainability, real-world product building, human-intensive rubric eval, real-world usage tracking. GPT-5.2 solved 31 contamination-hard problems.

Resources: OpenAI post, SWE-Bench Pro leaderboard.

swyx
x 6 docs

OpenAI deprecates SWE-Bench Verified as coding eval standard due to saturation from contamination (all frontier models, including OpenAI's, regurgitate solutions verbatim from Task ID alone) and test-design issues (≥16.4% of problems unsolvable per audits; ≥60% of remaining unsolved tasks unsolvable).

Recommends SWE-Bench Pro for frontier coding evals.

OpenAI audits: 3 independent reviews/problem in 2024; 6 software engineer reviews + verification in 2026.

swyx (Latent Space podcaster) notes eval fragility, questions other benchmarks. Theoretical ceiling: 87.5-95%.

Simon Willison's Weblog

Red/green TDD—write tests first, confirm they fail (red), then implement to pass (green)—is a succinct, effective pattern for coding agents to avoid non-working or unnecessary code while building a robust test suite.

Simon Willison (practitioner using coding agents) confirms good models understand the shorthand, protecting against regressions as projects grow.

Example prompt: Build a Python function to extract headers from a markdown string. Use red/green TDD.

Demonstrations (firsthand):

  • Claude
  • ChatGPT (append "Use your code environment")

He normally uses Claude Code or OpenAI Codex.

Timeless pattern: Test-driven agent workflows.
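The end state of that example prompt might look like this. The function name, regex, and tests are my own sketch of what an agent would plausibly produce, not Willison’s code:

```python
# Sketch of where the red/green loop lands for the markdown-headers prompt.
# The agent writes the asserts first, watches them fail ("red"), then
# implements extract_headers until they pass ("green").
import re

def extract_headers(markdown: str) -> list[str]:
    """Return the text of each ATX-style header (#, ##, ...) in order."""
    headers = []
    for line in markdown.splitlines():
        # Up to 3 leading spaces, 1-6 hashes, then the header text,
        # allowing optional trailing hashes.
        m = re.match(r"\s{0,3}(#{1,6})\s+(.*?)\s*#*\s*$", line)
        if m:
            headers.append(m.group(2))
    return headers

# The "red" tests, now green:
assert extract_headers("# Title\ntext\n## Section") == ["Title", "Section"]
assert extract_headers("no headers here") == []
assert extract_headers("### Trailing ###") == ["Trailing"]
```

The asserts double as the regression suite Willison describes: they stay in the repo and keep guarding the behavior as the project grows.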

Latent Space
youtube 1 doc

OpenAI's Frontier Evals team (Olivia) and VP of Research Mia (overseeing Codex, Simulator, Alignment teams) deprecate SWE-Bench Verified—a curated 500-task benchmark from real GitHub issues, graded by test passage—due to saturation (frontier models ~80-90%) and contamination (models regurgitating solutions or repo-specific details like unspec'd arguments).

Key issues (>50% of unsolved problems): overly narrow tests expecting unspecified details (e.g., exact function/arg names) or unmentioned features. Original creation involved ~100 engineers with 3x reviews per task.

Recommend SWE-Bench Pro (Scale AI): harder (1-4+ hr expert tasks), diverse repos/languages, minimal contamination per auditor agent.

Future evals needed: open-ended design (e.g., optimize code perf), code quality/maintainability, long-term tasks (hours/days), end-to-end products; proxies like time/complexity/dollars for capacity.

Contrarian: Benchmarks evolve/saturate quickly; test passage ≠ good code if the tests are unfair.

Simon Willison's Weblog

Simon Willison highlights that coding agents dramatically reduce code-writing costs, upending traditional planning, estimation, and micro-decision habits predicated on code's expense.

One engineer can now use parallel agents for simultaneous implementation, refactoring, testing, and documentation.

Good code remains expensive, requiring it to: work reliably; be verified; solve the right problem; handle errors gracefully; stay simple/minimal; include tests; have accurate docs; balance YAGNI with changeability; meet relevant ilities.

Agents assist, but developers must ensure quality.

Actionable habit: Second-guess "not worth the time" instincts by firing off asynchronous agent prompts—worst case is wasted tokens.

Willison, actively developing these practices firsthand, notes the industry is still figuring out agentic engineering best practices.

Simon Willison
x 3 docs

Simon Willison (@simonw, Django co-creator) published initial chapters of Agentic Engineering Patterns guide—coding practices for optimal results with agents like Claude Code and OpenAI Codex.

Chapter 1: “Writing code is cheap now” explores the agentic engineering challenge: near-zero cost to generate working code reshapes individual/team workflows.

Chapter 2: “Red/green TDD”—encourage agents to use test-first development (observe failing tests before implementing passing code) for much better results across most coding agents.

Simon Willison's Weblog

Simon Willison, experienced software engineer with 345 posts on AI-assisted programming, launched the Agentic Engineering Patterns guide—a collection of coding practices for building software with coding agents that generate and execute code, such as Claude Code and OpenAI Codex.

Agentic Engineering amplifies professional expertise, contrasting vibe coding.

First chapters:

  • Writing code is cheap now: Addresses challenges from near-zero cost of initial code generation
  • Red/green TDD: Test-first development enables agents to produce succinct, reliable code with minimal prompting

Plans 1-2 chapters weekly in an evergreen guide format.

Firsthand project by Willison (not LLM-generated).

Addy Osmani
x 2 docs

@addyosmani (@GoogleCloud AI Director) advises caution with /init in coding agents using AGENTS.md (e.g., Claude): treat it as a living list of codebase smells rather than permanent config.

Auto-generated files hurt agent performance and inflate costs by duplicating discoverable info; human-written ones help only for non-discoverable details like tooling gotchas, conventions, landmines.

Contrarian take: A single root AGENTS.md is insufficient for complex codebases; you need a hierarchy of AGENTS.md files at directory/module levels, automatically maintained for precisely scoped context.

References @theo's study arguing for deleting CLAUDE.md/AGENTS.md (https://x.com/theo/status/2025900730847232409).
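Taken to its conclusion, the advice above implies a context file like the following — a hypothetical minimum containing only non-discoverable facts (every command, path, and env var here is invented for illustration):

```markdown
# AGENTS.md — only non-discoverable landmines, nothing the agent can find itself
- Run `pnpm typecheck` after every change; CI rejects type errors.
- Never edit `src/generated/` — those files are overwritten by codegen.
- Integration tests require the `TEST_DB_URL` env var; unit tests do not.
```

Anything an agent could learn by reading package.json or searching the repo is deliberately absent.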

Simon Willison's Weblog

Andreas Kling (Ladybird browser engine lead) ported the LibJS JavaScript engine from C++ to Rust using Claude Code and Codex, in a human-directed process—not autonomous—with hundreds of small prompts specifying port order, targets, and Rust code structure.

Workflow details:

  • Targeted the self-contained lexer/parser/AST/bytecode generator with test262 coverage
  • Ensured byte-for-byte identical output vs the original C++

Quantitative results: ~25,000 lines of Rust in two weeks (vs multiple months manually); zero regressions verified against test262.

Timeless insight: High-quality conformance tests (e.g., test262) and reference-implementation comparisons make agentic engineering reliable for large projects.

Firsthand account from Kling via Ladybird post; Simon Willison highlights it as advanced coding-agent use for critical code.
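The byte-for-byte guardrail generalizes to any port. A minimal sketch of the idea (the toy pipelines below are placeholders, not LibJS):

```python
# Generic version of the Ladybird guardrail: feed the same inputs to the
# reference and the ported implementation and require identical bytes.
def divergences(reference, ported, inputs):
    """Return the inputs whose ported output differs byte-for-byte."""
    return [x for x in inputs if reference(x) != ported(x)]

# Toy stand-ins for the C++ and Rust pipelines:
ref = lambda src: src.strip().encode()
good_port = lambda src: src.strip().encode()
bad_port = lambda src: src.encode()  # forgets to strip -> diverges

assert divergences(ref, good_port, ["a", " b "]) == []
assert divergences(ref, bad_port, ["a", " b "]) == [" b "]
```

In the real port, `inputs` would be test262 cases and the compared bytes would be dumped ASTs and bytecode rather than toy strings.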

ThePrimeagen
x 1 doc

@ThePrimeagen, a prominent developer, tested Claude fast 4.6 for high-stakes programming, spending hundreds of dollars in ~1 hour but found it very fast.

Theo - t3.gg
youtube 1 doc

Theo, a full-stack TypeScript developer (t3.gg), shares firsthand experience using Claude Code and Cursor for repo-specific coding agents, emphasizing context management as a timeless pattern.

Study findings on AGENTS.md/CLAUDE.md files (developer-provided context files):

  • Marginal improvement (+4% avg. task completion) vs. omitting.
  • LLM-generated: -3% performance drop.
  • Increased exploration/testing/costs by >20% across models (e.g., Sonnet 3.5, o1-mini, Qwen). Recommendation: Omit LLM-generated files; keep dev-provided ones minimal (e.g., specific tools).

Context hierarchy (a developer message like AGENTS.md sits between system prompt and user):

  • Provider > System > Developer (AGENTS.md) > User/history.

Actionable workflows/tips (from Theo's T3 Chat/Lawn repos):

  • Minimize AGENTS.md: Only add entries for consistent errors (e.g., 'always type-check changes') after codebase fixes fail. Delete on new models; test without it.
  • Avoid /init: Generated files are redundant (models read package.json and rg files autonomously) and go stale fast.
  • Hack for diagnosis: Instruct agents to 'alert confusions in AGENTS.md' to surface issues, then fix the codebase (e.g., structure, tests).
  • Strategic 'lies': 'No users/data yet' or 'greenfield, change schema' to avoid over-caution.
  • Multi-step unblock: Ask for step 3 to force a self-fix on step 2.
  • Monitor speed: Slow on simple tasks → reduce context; fast = good setup.
  • Test: Without AGENTS.md: 1:11; with fresh /init: 1:29 (+~25%) on a video-pipeline query.

Contrarian take: AGENTS.md is a band-aid; prioritize agent-friendly architecture (tests, structure, feedback loops) over docs. Help, don't distract—models excel at codebase navigation.
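Theo’s hierarchy can be made concrete with a sketch of per-turn request assembly. Role names and message shapes here are illustrative, not any provider’s exact API:

```python
# Why AGENTS.md content is "sticky": it rides in the developer layer, which
# is resent with every request, so its token cost recurs on every turn.
def build_request(system, developer, history, user):
    return (
        [{"role": "system", "content": system},
         {"role": "developer", "content": developer}]  # AGENTS.md lands here
        + history
        + [{"role": "user", "content": user}]
    )

agents_md = "Always type-check changes."
turn1 = build_request("You are a coding agent.", agents_md, [], "Fix the bug.")
history = [{"role": "user", "content": "Fix the bug."},
           {"role": "assistant", "content": "Done."}]
turn2 = build_request("You are a coding agent.", agents_md, history, "Add tests.")

# The developer layer is present in both requests — paid on every turn.
assert any(m["content"] == agents_md for m in turn1)
assert any(m["content"] == agents_md for m in turn2)
```

This is the mechanical reason a bloated context file shows up as a per-request cost and latency hit rather than a one-time one.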

Kent C. Dodds ⚡
x 1 doc

Kent C. Dodds (@kentcdodds), a prominent dev educator, shares a firsthand bottleneck in using AI agents for coding: they are only marginally better at CSS than he is (which he rates as bad) and lack “taste” to recognize when a UI looks bad.

Peter Steinberger 🦞
x 1 doc

Peter Steinberger (@steipete), OpenClaw maintainer, announced a new OpenClaw beta focusing on security, bugfixes, and new features including Kilo provider and Kimi vision + video support.

Release notes: https://github.com/openclaw/openclaw/releases.

Kent C. Dodds ⚡
x 2 docs

Kent C. Dodds (@kentcdodds), a dev educator and MVP, shares his evolving code review workflow for agent-generated code from firsthand experience:

  • Shifting from directing the agent to match his style to assessing if the implementation is wrong or just different.
  • Focusing on principles over style/implementation.
  • Treats AI as a coworker (like human contributors) rather than a peer-programmer.

Theo - t3.gg
x 1 doc

Contrarian take from Theo (@theo): Delete your CLAUDE.md/AGENTS.md file, as "I have a study to prove it".

Author context: Full-time CEO @t3dotchat, part-time YouTuber, investor, and developer (firsthand account from practitioner).

Riley Brown
x 3 docs

Riley Brown (cofounder of @vibecodeapp) shares a workflow to vibecode viral videos using OpenClaw with the Remotion-Video-Skill (works with Claude Code).

Demo workflow steps (from video timestamps):

  • 00:55: 1st prompt
  • 02:11: Video generated
  • 03:29: Edit video with OpenClaw
  • 07:28: Add music with ElevenLabs skill
  • 09:22: Final video ✅

Skill location: http://superskills.vibecode.run

Key feature: Fetches brand data with Firecrawl.

Firsthand demo by platform cofounder.

Jason Zhou
x 2 docs

Codex CLI now supports multi-agent mode, enabling multiple specialized agents in one session, each with its own role, model, and behavior .

Setup steps (from @jasonzhou1993, learned from @hqmank):

  • Open ~/.codex/config.toml
  • Add [features] multi_agent = true
  • Run /experimental in CLI → "Multi-agent mode is now on"

Out-of-box agents: explorer, worker, general helper.

Source post: https://x.com/hqmank/status/2024114550170136828.
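The setup steps amount to a one-line config change. A sketch of the relevant fragment, assuming the standard ~/.codex/config.toml location:

```toml
# ~/.codex/config.toml — enable experimental multi-agent mode
[features]
multi_agent = true
```

After restarting the CLI, /experimental should confirm the mode is on.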

geoff
x 1 doc

@GeoffreyHuntley shares firsthand experience with Copilot for code review: it's "growing on me" and "has caught a couple of things," despite clunky GitHub UX.

Jason Zhou
x 1 doc

@jasonzhou1993, who builds with AI at @aibuilderclub_, reports that Claude Code Opus 4.6 is thinking WAY TOO long—to the point of being annoying and not delivering value—based on firsthand experience.

Theo - t3.gg
x 2 docs

After experimentation prompted by @theo, @QuinnyPig confirms OpenAI’s Codex vastly outperforms Claude Code, making it look "like rumpled laundry by comparison"—a firsthand account from initial skepticism.

@theo shares this endorsement.