SWE-bench Verified is deprecated; WebSockets land in Responses API; AgentMD skepticism goes mainstream
Feb 24
6 min read
185 docs
SWE-bench Verified is being retired as a frontier coding eval: OpenAI says it’s saturated, contaminated, and riddled with test-design issues—SWE-bench Pro is the new recommendation. Also: practical agent workflows (red/green TDD, conformance-test-driven ports), new tool updates (Responses API WebSockets, Codex CLI multi-agent), and a hard look at when AGENTS.md helps vs just adds cost.

🔥 TOP SIGNAL

OpenAI is stopping SWE-bench Verified reporting and recommending SWE-bench Pro instead, citing benchmark saturation, contamination (frontier models can regurgitate solutions or problem statements from the task ID alone), and test-design issues that render a large share of the remaining tasks effectively unsound. If you’re using SWE-bench numbers to pick models or to market agent gains, this is a hard reset on what “good” looks like in coding evals.

🛠️ TOOLS & MODELS

  • OpenAI Responses API — WebSockets mode

    • New WebSockets support aimed at low-latency, long-running agents with heavy tool calls (explicitly positioned as good for coding agents).
    • Docs: http://developers.openai.com/api/docs/guides/websocket-mode.
    • Huet notes it was built to “keep up” with GPT-5.3-Codex-Spark.
  • Codex CLI — multi-agent mode

    • Enable multiple specialized agents in one session (each with its own role/model/behavior).
    • Setup:
      1. Open ~/.codex/config.toml
      2. Add [features] multi_agent = true
      3. Run /experimental → “Multi-agent mode is now on”
    • Comes with explorer / worker / general helper agents out of the box.
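    • The three setup steps above boil down to one config stanza — a sketch assuming the standard ~/.codex location, not official docs:

```toml
# ~/.codex/config.toml — enables the experimental multi-agent mode
[features]
multi_agent = true   # then run /experimental in a session to toggle it on
```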
  • Agentic “full stack orchestration” demo — Antigravity

    • “Add GPay to your website” via one prompt: detects Angular, installs deps, edits frontend+backend, then verifies via an automated browser run.
  • OpenClaw — new beta

  • Practitioner model notes (Codex vs Claude, cost/latency)

    • Multiple practitioners are calling GPT-5.3-Codex + Codex app the best option “for getting software dev work done,” with strong instruction-following (trade-off: a more “machine-like” personality). Brockman attributes this to heavy investment + model/harness co-design + rapid post-training iterations.
    • QuinnyPig, starting from skepticism, reports that testing Codex made Claude Code feel dramatically weaker by comparison.
    • Claude Code pain points surfaced today:
      • “Opus 4.6 is thinking WAY TOO long” (annoying, not delivering value).
      • Primeagen tried “Claude fast 4.6” for high-stakes work and spent $100s in ~1 hour (but said it was fast).

💡 WORKFLOWS & TRICKS

  • New eval reality: stop optimizing for brittle tests

    • OpenAI’s critique: SWE-bench Verified became less meaningful at high scores—narrow tests can devolve into “guessing” exact names/implementation details rather than measuring coding ability.
    • What they say they want next: longer-term tasks, open-ended design decisions, code quality/maintainability, real-world product building, and human-intensive rubric evaluation.
  • Red/green TDD as an agent control surface (Willison)

    • Prompt pattern: write tests first → confirm they fail (“red”) → implement until they pass (“green”).
    • Why it works with agents: reduces the odds of shipping code that doesn’t work or that’s unnecessary, and leaves you with a regression suite.
    • Copy/paste starter prompt:
      • Build a Python function to extract headers from a markdown string. Use red/green TDD.
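    • What the starter prompt should produce — a sketch of the red/green artifact, assuming ATX-style (`#`) headers; the test is written first (red), then the implementation makes it pass (green):

```python
import re

def extract_headers(markdown: str) -> list[str]:
    """Return the text of ATX-style headers (lines starting with 1-6 '#')."""
    headers = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if m:
            headers.append(m.group(2))
    return headers

# The regression suite the "red" phase leaves behind:
def test_extract_headers():
    doc = "# Title\n\ntext\n\n## Section\n### Sub\nnot # a header"
    assert extract_headers(doc) == ["Title", "Section", "Sub"]

test_extract_headers()
```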
  • “Conformance suite + reference implementation” makes big agentic ports safer (Ladybird)

    • Andreas Kling ported LibJS to Rust using Claude Code and Codex, but emphasizes it was human-directed (he chose what to port, in what order, and how the Rust should look).
    • Guardrails that mattered:
      • Started with components that had strong test262 coverage.
      • Required byte-for-byte identical output vs the C++ pipeline; verified identical ASTs and bytecode; reported zero regressions.
    • Result: ~25,000 lines of Rust in ~two weeks (vs “multiple months” manually).
  • Context files (AGENTS.md / CLAUDE.md): when they help vs when they’re just tax

    • Theo cites a study on “context files” for GitHub issue resolution:
      • Dev-written context files: only +4% success vs omitting them.
      • LLM-generated context files: -3% success.
      • More exploration/testing/reasoning → >20% higher costs.
      • Recommendation: omit LLM-generated context files; keep only minimal, non-discoverable requirements like specific tooling.
    • Addy Osmani’s rule of thumb: auto-generated AGENTS.md duplicates what agents can discover and inflates cost; human-written files help mainly for non-discoverable gotchas, conventions, and landmines. He suggests treating AGENTS.md as a living list of codebase smells (not permanent config).
    • Theo’s practical heuristics:
      • Don’t distract the model with irrelevant background—keep it focused on “the thing”.
      • If the info is in the codebase, it often doesn’t belong in AgentMD; models can usually find what they need (e.g., via package.json + repo search).
      • If you’re investing time, prioritize unit/integration tests, type checks, and feedback systems you can expose to the model over growing AgentMD files.
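      • Under those heuristics, a minimal AGENTS.md carries only non-discoverable requirements — the entries below are invented examples of the category, not taken from the study:

```markdown
# AGENTS.md — only what the agent cannot discover from the repo
- Run tests with `pnpm test`, never `npm test` (npm breaks the workspace links).
- Do not edit files under `src/legacy/` — pinned for a client, excluded from CI.
```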
  • Agentic quality loops you can steal

    • Automated “review → fix → review” loop (Armin Ronacher): his /review extension for ralph loops between “review on an empty branch” and “go back and fix your shit” until P0/P1/P2 are resolved.
    • Unblock multi-step tasks (Theo): if step 2 keeps failing, ask the agent for step 3—he claims it often back-solves step 2 to get there.
    • Infra upgrade prompt that actually worked (Ronacher): “upgrade me to postgres 18. don’t make any mistakes”—shared as a successful approach for painful major version upgrades.
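    • The review→fix loop reduces to a few lines of control flow — a generic stubbed sketch, not Ronacher’s actual /review extension; `run_review` and `run_fix` are stand-ins for agent invocations:

```python
def review_fix_loop(run_review, run_fix, max_rounds=5):
    """Loop until the review returns no P0/P1/P2 findings, or give up."""
    for _ in range(max_rounds):
        findings = run_review()   # e.g. agent reviews the diff on a clean branch
        if not findings:
            return True           # clean review: nothing at P0/P1/P2
        run_fix(findings)         # send findings back: "go fix these"
    return False                  # gave up: escalate to a human
```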

👤 PEOPLE TO WATCH

  • Simon Willison — launched Agentic Engineering Patterns (written by him, not an LLM) and is turning scattered best practices into an evergreen “guide” format. First chapters: “writing code is cheap now” and “red/green TDD”.
  • Theo (t3.gg) — consistently practical on agent context management; argues many AGENTS.md/CLAUDE.md setups are counterproductive and measured as a cost/latency hit.
  • Addy Osmani — sharp framing: AGENTS.md should be about non-discoverable landmines, and a single root file won’t scale for complex repos (he argues for a hierarchy of scoped files).
  • Kent C. Dodds — evolving his reviews of agent code toward “is it actually wrong or just different,” focusing on principles over personal style; also calls out UI “taste” as a remaining bottleneck (CSS + knowing when UI looks bad).
  • Armin Ronacher — hands-on, blunt tool feedback: calls MCP architecture token-inefficient/resource-intensive and says it underperforms “skills” in his testing.

🎬 WATCH & LISTEN

1) Prompt/context hierarchy explained (and why “extra context” sneaks into every request) — Theo (≈ 7:10–10:28)

Hook: A concrete mental model for why AgentMD/ClaudeMD “rules” are sticky: provider/system/developer/user layers, and everything above gets sent each turn—so context decisions directly impact cost and behavior.
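The cost point can be made concrete with back-of-envelope arithmetic — all token counts below are invented for illustration: sticky layers (system prompt, AGENTS.md contents, etc.) are re-sent with every request, so their cost scales linearly with conversation length.

```python
def total_input_tokens(sticky, per_turn, n_turns):
    """Every turn pays for all sticky context again, plus that turn's input."""
    return sum(sticky + t for t in per_turn[:n_turns])

lean  = total_input_tokens(sticky=500,  per_turn=[200] * 20, n_turns=20)   # 14,000
heavy = total_input_tokens(sticky=5000, per_turn=[200] * 20, n_turns=20)  # 104,000
# the heavy context file costs 90,000 extra input tokens over 20 turns
```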

2) What a “better coding benchmark” should measure — Latent Space + OpenAI Frontier Evals (≈ 14:04–15:51)

Hook: The team argues we’re moving beyond “solve a small GitHub issue” toward longer-running tasks and harder-to-measure signals like design taste, code quality, and maintainability.

Editorial take: “Writing code is cheap now,” but proving it’s good (tests, evals, reviews, and anti-contamination discipline) is where serious teams will win.

Summary
Coverage start: Feb 23 at 7:00 AM
Coverage end: Feb 24 at 7:00 AM
Frequency: Daily
Published: Feb 24 at 8:11 AM
Reading time: 6 min
Research time: 2 hrs 23 min
Documents scanned: 185
Documents used: 31
Citations: 66
Sources monitored: 110 / 110
Source details
Source Docs Insights
Lukas Möller 0 0
Jediah Katz 2 0
Aman Karmani 3 0
Jacob Jackson 0 0
Cursor Blog | RSS Feed 0 0
Nicholas Moy 0 0
Mike Krieger 0 0
Sualeh Asif 0 0
Michael Truell 0 0
Google Antigravity 1 1
Aman Sanger 0 0
cat 0 0
Mark Chen 0 0
Greg Brockman 8 2
Tongzhou Wang 0 0
fouad 0 0
Calvin French-Owen 0 0
Hanson Wang 0 0
Ed Bayes 1 0
Alexander Embiricos 2 1
Tibo 3 1
Romain Huet 4 2
DHH 8 0
Jane Street Blog 0 0
Miguel Grinberg's Blog: AI 0 0
xxchan's Blog 0 0
<antirez> 0 0
Brendan Long 0 0
The Pragmatic Engineer 0 0
David Heinemeier Hansson 0 0
Armin Ronacher ⇌ 13 3
Mitchell Hashimoto 0 0
Armin Ronacher's Thoughts and Writings 0 0
Peter Steinberger 0 0
Theo - t3.gg 28 2
Sourcegraph 0 0
Anthropic 13 0
Cursor 0 0
LangChain 0 0
Anthropic 0 0
LangChain Blog 0 0
LangChain 3 0
Cursor 0 0
Riley Brown 0 0
Riley Brown 5 1
Jason Zhou 3 2
Boris Cherny 0 0
Mckay Wrigley 0 0
geoff 14 1
Peter Steinberger 🦞 7 1
AI Jason 0 0
Alex Albert 0 0
Latent.Space 1 1
Logan Kilpatrick 0 0
Fireship 0 0
Fireship 0 0
Kent C. Dodds ⚡ 6 2
Practical AI 0 0
Practical AI Clips 0 0
Stories by Steve Yegge on Medium 0 0
Kent C. Dodds Blog 0 0
ThePrimeTime 1 0
Theo - t3․gg 1 1
ThePrimeagen 10 1
Ben Tossell 0 0
swyx 27 1
AI For Developers 0 0
Geoffrey Huntley 0 0
Addy Osmani 2 1
Andrej Karpathy 1 0
Simon Willison 8 1
Matthew Berman 0 0
Changelog 0 0
Simon Willison’s Newsletter 0 0
Agentic Coding Newsletter 0 0
Latent Space 1 1
Simon Willison's Weblog 7 4
Elevate 0 0
Lukas Möller 0 0
Jediah Katz 0 0
Sualeh Asif 0 0
Mike Krieger 0 0
Michael Truell 0 0
Cat Wu 0 0
Kevin Hou 0 0
Aman Sanger 0 0
Nicholas Moy 0 0
Andrey Mishchenko 0 0
Jerry Tworek 0 0
Romain Huet 0 0
Thibault Sottiaux 0 0
Alexander Embiricos 0 0
xxchan 1 0
Salvatore Sanfilippo 1 0
Armin Ronacher 0 0
David Heinemeier Hansson (DHH) 0 0
Alex Albert 0 0
Logan Kilpatrick 0 0
Shawn "swyx" Wang 0 0
Jason Zhou 0 0
Riley Brown 0 0
McKay Wrigley 0 0
Boris Cherny 0 0
Ben Tossell 0 0
Geoffrey Huntley 0 0
Peter Steinberger 0 0
Addy Osmani 0 0
Simon Willison 0 0
Andrej Karpathy 0 0
Harrison Chase 0 0