SWE-bench Verified is deprecated; WebSockets land in Responses API; AgentMD skepticism goes mainstream
Feb 24
6 min read
185 docs
SWE-bench Verified is being retired as a frontier coding eval: OpenAI says it’s saturated, contaminated, and riddled with test-design issues—SWE-bench Pro is the new recommendation. Also: practical agent workflows (red/green TDD, conformance-test-driven ports), new tool updates (Responses API WebSockets, Codex CLI multi-agent), and a hard look at when AGENTS.md helps vs just adds cost.

🔥 TOP SIGNAL

OpenAI is stopping SWE-bench Verified reporting and recommending SWE-bench Pro instead, citing benchmark saturation, contamination (frontier models can regurgitate solutions or problem statements from the task ID alone), and test-design issues that render a large share of the remaining tasks effectively unsound. If you’re using SWE-bench numbers to pick models or to market agent gains, this is a hard reset on what “good” looks like in coding evals.

🛠️ TOOLS & MODELS

  • OpenAI Responses API — WebSockets mode

    • New WebSockets support aimed at low-latency, long-running agents with heavy tool calls (explicitly positioned as good for coding agents).
    • Docs: http://developers.openai.com/api/docs/guides/websocket-mode.
    • Huet notes it was built to “keep up” with GPT-5.3-Codex-Spark.
  • Codex CLI — multi-agent mode

    • Enable multiple specialized agents in one session (each with its own role/model/behavior).
    • Setup:
      1. Open ~/.codex/config.toml
      2. Add [features] multi_agent = true
      3. Run /experimental → “Multi-agent mode is now on”
    • Comes with explorer / worker / general helper agents out of the box.
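    • The three setup steps above boil down to one config stanza — a sketch assuming the standard ~/.codex location, not official docs:

```toml
# ~/.codex/config.toml — enables the experimental multi-agent mode
[features]
multi_agent = true   # then run /experimental in a session to toggle it on
```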
  • Agentic “full stack orchestration” demo — Antigravity

    • “Add GPay to your website” via one prompt: detects Angular, installs deps, edits frontend+backend, then verifies via an automated browser run.
  • OpenClaw — new beta

  • Practitioner model notes (Codex vs Claude, cost/latency)

    • Multiple practitioners are calling GPT-5.3-Codex + Codex app the best option “for getting software dev work done,” with strong instruction-following (trade-off: a more “machine-like” personality). Brockman attributes this to heavy investment + model/harness co-design + rapid post-training iterations.
    • QuinnyPig, starting from skepticism, reports that testing Codex made Claude Code feel dramatically weaker by comparison.
    • Claude Code pain points surfaced today:
      • “Opus 4.6 is thinking WAY TOO long” (annoying, not delivering value).
      • Primeagen tried “Claude fast 4.6” for high-stakes work and spent $100s in ~1 hour (but said it was fast).

💡 WORKFLOWS & TRICKS

  • New eval reality: stop optimizing for brittle tests

    • OpenAI’s critique: SWE-bench Verified became less meaningful at high scores—narrow tests can devolve into “guessing” exact names/implementation details rather than measuring coding ability.
    • What they say they want next: longer-term tasks, open-ended design decisions, code quality/maintainability, real-world product building, and human-intensive rubric evaluation.
  • Red/green TDD as an agent control surface (Willison)

    • Prompt pattern: write tests first → confirm they fail (“red”) → implement until they pass (“green”).
    • Why it works with agents: reduces the odds of shipping code that doesn’t work or that’s unnecessary, and leaves you with a regression suite.
    • Copy/paste starter prompt:
      • Build a Python function to extract headers from a markdown string. Use red/green TDD.
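    • What the starter prompt should produce — a sketch of the red/green artifact, assuming ATX-style (`#`) headers; the test is written first (red), then the implementation makes it pass (green):

```python
import re

def extract_headers(markdown: str) -> list[str]:
    """Return the text of ATX-style headers (lines starting with 1-6 '#')."""
    headers = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if m:
            headers.append(m.group(2))
    return headers

# The regression suite the "red" phase leaves behind:
def test_extract_headers():
    doc = "# Title\n\ntext\n\n## Section\n### Sub\nnot # a header"
    assert extract_headers(doc) == ["Title", "Section", "Sub"]

test_extract_headers()
```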
  • “Conformance suite + reference implementation” makes big agentic ports safer (Ladybird)

    • Andreas Kling ported LibJS to Rust using Claude Code and Codex, but emphasizes it was human-directed (he chose what to port, in what order, and how the Rust should look).
    • Guardrails that mattered:
      • Started with components that had strong test262 coverage.
      • Required byte-for-byte identical output vs the C++ pipeline; verified identical ASTs and bytecode; reported zero regressions.
    • Result: ~25,000 lines of Rust in ~two weeks (vs “multiple months” manually).
  • Context files (AGENTS.md / CLAUDE.md): when they help vs when they’re just tax

    • Theo cites a study on “context files” for GitHub issue resolution:
      • Dev-written context files: only +4% success vs omitting them.
      • LLM-generated context files: -3% success.
      • More exploration/testing/reasoning → >20% higher costs.
      • Recommendation: omit LLM-generated context files; keep only minimal, non-discoverable requirements like specific tooling.
    • Addy Osmani’s rule of thumb: auto-generated AGENTS.md duplicates what agents can discover and inflates cost; human-written files help mainly for non-discoverable gotchas, conventions, and landmines. He suggests treating AGENTS.md as a living list of codebase smells (not permanent config).
    • Theo’s practical heuristics:
      • Don’t distract the model with irrelevant background—keep it focused on “the thing”.
      • If the info is in the codebase, it often doesn’t belong in AgentMD; models can usually find what they need (e.g., via package.json + repo search).
      • If you’re investing time, prioritize unit/integration tests, type checks, and feedback systems you can expose to the model over growing AgentMD files.
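      • Under those heuristics, a minimal AGENTS.md carries only non-discoverable requirements — the entries below are invented examples of the category, not taken from the study:

```markdown
# AGENTS.md — only what the agent cannot discover from the repo
- Run tests with `pnpm test`, never `npm test` (npm breaks the workspace links).
- Do not edit files under `src/legacy/` — pinned for a client, excluded from CI.
```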
  • Agentic quality loops you can steal

    • Automated “review → fix → review” loop (Armin Ronacher): his /review extension for ralph loops between “review on an empty branch” and “go back and fix your shit” until P0/P1/P2 are resolved.
    • Unblock multi-step tasks (Theo): if step 2 keeps failing, ask the agent for step 3—he claims it often back-solves step 2 to get there.
    • Infra upgrade prompt that actually worked (Ronacher): “upgrade me to postgres 18. don’t make any mistakes”—shared as a successful approach for painful major version upgrades.
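    • The review→fix loop reduces to a few lines of control flow — a generic stubbed sketch, not Ronacher’s actual /review extension; `run_review` and `run_fix` are stand-ins for agent invocations:

```python
def review_fix_loop(run_review, run_fix, max_rounds=5):
    """Loop until the review returns no P0/P1/P2 findings, or give up."""
    for _ in range(max_rounds):
        findings = run_review()   # e.g. agent reviews the diff on a clean branch
        if not findings:
            return True           # clean review: nothing at P0/P1/P2
        run_fix(findings)         # send findings back: "go fix these"
    return False                  # gave up: escalate to a human
```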

👤 PEOPLE TO WATCH

  • Simon Willison — launched Agentic Engineering Patterns (written by him, not an LLM) and is turning scattered best practices into an evergreen “guide” format. First chapters: “writing code is cheap now” and “red/green TDD”.
  • Theo (t3.gg) — consistently practical on agent context management; argues many AGENTS.md/CLAUDE.md setups are counterproductive and measured as a cost/latency hit.
  • Addy Osmani — sharp framing: AGENTS.md should be about non-discoverable landmines, and a single root file won’t scale for complex repos (he argues for a hierarchy of scoped files).
  • Kent C. Dodds — evolving his reviews of agent code toward “is it actually wrong or just different,” focusing on principles over personal style; also calls out UI “taste” as a remaining bottleneck (CSS + knowing when UI looks bad).
  • Armin Ronacher — hands-on, blunt tool feedback: calls MCP architecture token-inefficient/resource-intensive and says it underperforms “skills” in his testing.

🎬 WATCH & LISTEN

1) Prompt/context hierarchy explained (and why “extra context” sneaks into every request) — Theo (≈ 7:10–10:28)

Hook: A concrete mental model for why AgentMD/ClaudeMD “rules” are sticky: provider/system/developer/user layers, and everything above gets sent each turn—so context decisions directly impact cost and behavior.
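The cost point can be made concrete with back-of-envelope arithmetic — all token counts below are invented for illustration: sticky layers (system prompt, AGENTS.md contents, etc.) are re-sent with every request, so their cost scales linearly with conversation length.

```python
def total_input_tokens(sticky, per_turn, n_turns):
    """Every turn pays for all sticky context again, plus that turn's input."""
    return sum(sticky + t for t in per_turn[:n_turns])

lean  = total_input_tokens(sticky=500,  per_turn=[200] * 20, n_turns=20)   # 14,000
heavy = total_input_tokens(sticky=5000, per_turn=[200] * 20, n_turns=20)  # 104,000
# the heavy context file costs 90,000 extra input tokens over 20 turns
```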

2) What a “better coding benchmark” should measure — Latent Space + OpenAI Frontier Evals (≈ 14:04–15:51)

Hook: The team argues we’re moving beyond “solve a small GitHub issue” toward longer-running tasks and harder-to-measure signals like design taste, code quality, and maintainability.

Editorial take: “Writing code is cheap now,” but proving it’s good (tests, evals, reviews, and anti-contamination discipline) is where serious teams will win.

Summary
Coverage start: Feb 23 at 7:00 AM
Coverage end: Feb 24 at 7:00 AM
Frequency: Daily
Published: Feb 24 at 8:11 AM
Reading time: 6 min
Research time: 2 hrs 23 min
Documents scanned: 185
Documents used: 31
Citations: 66
Sources monitored: 110 / 110
Source details
Source Docs Insights
Lukas Möller 0 0
Jediah Katz 2 0
Aman Karmani 3 0
Jacob Jackson 0 0
Cursor Blog | RSS Feed 0 0
Nicholas Moy 0 0
Mike Krieger 0 0
Sualeh Asif 0 0
Michael Truell 0 0
Google Antigravity 1 1
Aman Sanger 0 0
cat 0 0
Mark Chen 0 0
Greg Brockman 8 2
Tongzhou Wang 0 0
fouad 0 0
Calvin French-Owen 0 0
Hanson Wang 0 0
Ed Bayes 1 0
Alexander Embiricos 2 1
Tibo 3 1
Romain Huet 4 2
DHH 8 0
Jane Street Blog 0 0
Miguel Grinberg's Blog: AI 0 0
xxchan's Blog 0 0
<antirez> 0 0
Brendan Long 0 0
The Pragmatic Engineer 0 0
David Heinemeier Hansson 0 0
Armin Ronacher ⇌ 13 3
Mitchell Hashimoto 0 0
Armin Ronacher's Thoughts and Writings 0 0
Peter Steinberger 0 0
Theo - t3.gg 28 2
Sourcegraph 0 0
Anthropic 13 0
Cursor 0 0
LangChain 0 0
Anthropic 0 0
LangChain Blog 0 0
LangChain 3 0
Cursor 0 0
Riley Brown 0 0
Riley Brown 5 1
Jason Zhou 3 2
Boris Cherny 0 0
Mckay Wrigley 0 0
geoff 14 1
Peter Steinberger 🦞 7 1
AI Jason 0 0
Alex Albert 0 0
Latent.Space 1 1
Logan Kilpatrick 0 0
Fireship 0 0
Fireship 0 0
Kent C. Dodds ⚡ 6 2
Practical AI 0 0
Practical AI Clips 0 0
Stories by Steve Yegge on Medium 0 0
Kent C. Dodds Blog 0 0
ThePrimeTime 1 0
Theo - t3․gg 1 1
ThePrimeagen 10 1
Ben Tossell 0 0
swyx 27 1
AI For Developers 0 0
Geoffrey Huntley 0 0
Addy Osmani 2 1
Andrej Karpathy 1 0
Simon Willison 8 1
Matthew Berman 0 0
Changelog 0 0
Simon Willison’s Newsletter 0 0
Agentic Coding Newsletter 0 0
Latent Space 1 1
Simon Willison's Weblog 7 4
Elevate 0 0
Lukas Möller 0 0
Jediah Katz 0 0
Sualeh Asif 0 0
Mike Krieger 0 0
Michael Truell 0 0
Cat Wu 0 0
Kevin Hou 0 0
Aman Sanger 0 0
Nicholas Moy 0 0
Andrey Mishchenko 0 0
Jerry Tworek 0 0
Romain Huet 0 0
Thibault Sottiaux 0 0
Alexander Embiricos 0 0
xxchan 1 0
Salvatore Sanfilippo 1 0
Armin Ronacher 0 0
David Heinemeier Hansson (DHH) 0 0
Alex Albert 0 0
Logan Kilpatrick 0 0
Shawn "swyx" Wang 0 0
Jason Zhou 0 0
Riley Brown 0 0
McKay Wrigley 0 0
Boris Cherny 0 0
Ben Tossell 0 0
Geoffrey Huntley 0 0
Peter Steinberger 0 0
Addy Osmani 0 0
Simon Willison 0 0
Andrej Karpathy 0 0
Harrison Chase 0 0