🔥 TOP SIGNAL
OpenAI is stopping SWE-bench Verified reporting and recommending SWE-bench Pro, citing benchmark saturation, contamination (frontier models can regurgitate solutions/problem statements from the Task ID), and test-design issues that make a large chunk of the remaining tasks effectively unsound. If you’re using SWE-bench numbers to pick models or to market agent gains, this is a hard reset on what “good” looks like in coding evals.
🛠️ TOOLS & MODELS
OpenAI Responses API — WebSockets mode
- New WebSockets support aimed at low-latency, long-running agents with heavy tool calls (explicitly positioned as good for coding agents).
- Docs: http://developers.openai.com/api/docs/guides/websocket-mode.
- Huet notes it was built to “keep up” with GPT-5.3-Codex-Spark.
Codex CLI — multi-agent mode
- Enable multiple specialized agents in one session (each with its own role/model/behavior).
- Setup:
  - Open `~/.codex/config.toml`
  - Add `[features] multi_agent = true`
  - Run `/experimental` → “Multi-agent mode is now on”
- Comes with explorer / worker / general helper agents out of the box.
Agentic “full stack orchestration” demo — Antigravity
- “Add GPay to your website” via one prompt: detects Angular, installs deps, edits frontend + backend, then verifies via an automated browser run.
OpenClaw — new beta
- Beta focuses on security + bugfixes (and regression fixes), plus adds Kilo provider and Kimi vision + video support.
- Release notes: https://github.com/openclaw/openclaw/releases.
Practitioner model notes (Codex vs Claude, cost/latency)
- Multiple practitioners are calling GPT-5.3-Codex + the Codex app the best option “for getting software dev work done,” with strong instruction-following (trade-off: a more “machine-like” personality). Brockman attributes this to heavy investment, model/harness co-design, and rapid post-training iterations.
- QuinnyPig reports Codex made Claude Code feel dramatically weaker after testing (starting from skepticism).
- Claude Code pain points surfaced today:
  - “Opus 4.6 is thinking WAY TOO long” (annoying, not delivering value).
  - Primeagen tried “Claude fast 4.6” for high-stakes work and spent $100s in ~1 hour (but said it was fast).
💡 WORKFLOWS & TRICKS
New eval reality: stop optimizing for brittle tests
- OpenAI’s critique: SWE-bench Verified became less meaningful at high scores; narrow tests can devolve into “guessing” exact names/implementation details rather than measuring coding ability.
- What they say they want next: longer-term tasks, open-ended design decisions, code quality/maintainability, real-world product building, and human-intensive rubric evaluation.
Red/green TDD as an agent control surface (Willison)
- Prompt pattern: write tests first → confirm they fail (“red”) → implement until they pass (“green”).
- Why it works with agents: reduces the odds of shipping code that doesn’t work or that’s unnecessary, and leaves you with a regression suite.
- Copy/paste starter prompt: “Build a Python function to extract headers from a markdown string. Use red/green TDD.”
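The pattern behind that starter prompt is easy to see in plain Python. A minimal sketch (the function name `extract_headers` and the test are illustrative, not Willison's code):

```python
# Step 1 (red): write the test first, run it, and confirm it fails
# before any implementation exists.
def test_extract_headers():
    md = "# Title\nsome text\n## Section\n### Sub\nnot # a header"
    assert extract_headers(md) == ["Title", "Section", "Sub"]

# Step 2 (green): implement just enough to make the test pass.
def extract_headers(markdown: str) -> list[str]:
    """Return the text of every ATX-style header (#, ##, ...) in the string."""
    headers = []
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            # Drop the leading run of '#' and surrounding whitespace.
            headers.append(stripped.lstrip("#").strip())
    return headers

test_extract_headers()  # green: the test now passes
```

The payoff Willison points at: the failing test proves the agent's code actually does something, and the passing test stays behind as regression coverage.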
“Conformance suite + reference implementation” makes big agentic ports safer (Ladybird)
- Andreas Kling ported LibJS to Rust using Claude Code and Codex, but emphasizes it was human-directed (he chose what to port, in what order, and how the Rust should look).
- Guardrails that mattered:
  - Started with components that had strong test262 coverage.
  - Required byte-for-byte identical output vs the C++ pipeline; verified identical ASTs and bytecode; reported zero regressions.
- Result: ~25,000 lines of Rust in ~two weeks (vs “multiple months” manually).
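The byte-for-byte guardrail generalizes to any agentic port: run both pipelines over the same corpus and flag any divergence. A hedged sketch (the `run_reference`/`run_port` callables are hypothetical stand-ins for the C++ and Rust toolchains, not Ladybird's harness):

```python
import hashlib

def output_digest(data: bytes) -> str:
    """Hash pipeline output so large artifacts compare cheaply."""
    return hashlib.sha256(data).hexdigest()

def check_conformance(cases, run_reference, run_port) -> list[str]:
    """Return IDs of cases where the port's output diverges
    byte-for-byte from the reference implementation."""
    failures = []
    for case_id, source in cases:
        if output_digest(run_reference(source)) != output_digest(run_port(source)):
            failures.append(case_id)
    return failures

# Stub "pipelines" standing in for the real reference and port.
ref = lambda src: src.upper().encode()
port_ok = lambda src: src.upper().encode()
port_bad = lambda src: src.encode()

cases = [("t1", "let x = 1;"), ("t2", "x + y")]
assert check_conformance(cases, ref, port_ok) == []
assert check_conformance(cases, ref, port_bad) == ["t1", "t2"]
```

The design point: the agent never gets to argue its output is "close enough"; identical digests are the only passing state.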
Context files (AGENTS.md / CLAUDE.md): when they help vs when they’re just tax
- Theo cites a study on “context files” for GitHub issue resolution:
  - Dev-written context files: only +4% success vs omitting them.
  - LLM-generated context files: −3% success.
  - More exploration/testing/reasoning → >20% higher costs.
  - Recommendation: omit LLM-generated context files; keep only minimal non-discoverable requirements like specific tooling.
- Addy Osmani’s rule of thumb: auto-generated AGENTS.md duplicates what agents can discover and inflates cost; human-written files help mainly for non-discoverable gotchas/conventions/landmines. He suggests treating AGENTS.md as a living list of codebase smells (not permanent config).
- Theo’s practical heuristics:
  - Don’t distract the model with irrelevant background; keep it focused on “the thing”.
  - If the info is in the codebase, it often doesn’t belong in AGENTS.md; models can usually find what they need (e.g., via package.json + repo search).
  - If you’re investing time, prioritize unit/integration tests, type checks, and feedback systems you can expose to the model over growing AGENTS.md files.
Agentic quality loops you can steal
- Automated “review → fix → review” loop (Armin Ronacher): his /review extension for ralph loops between “review on an empty branch” and “go back and fix your shit” until P0/P1/P2 are resolved.
- Unblock multi-step tasks (Theo): if step 2 keeps failing, ask the agent for step 3; he claims it often back-solves step 2 to get there.
- Infra upgrade prompt that actually worked (Ronacher): “upgrade me to postgres 18. don’t make any mistakes”, shared as a successful approach for painful major version upgrades.
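The review-until-clean loop is easy to reproduce generically: alternate review and fix passes until no blocking-severity issues remain. A sketch with stub agent calls (the `review`/`fix` callables and severity labels are stand-ins, not Ronacher's /review extension):

```python
def review_fix_loop(review, fix, blocking=frozenset({"P0", "P1", "P2"}),
                    max_rounds=10):
    """Alternate review and fix passes until the review comes back
    with no blocking issues, or the round budget is exhausted."""
    for round_no in range(1, max_rounds + 1):
        issues = [i for i in review() if i["severity"] in blocking]
        if not issues:
            return round_no  # clean review: done
        fix(issues)
    raise RuntimeError("review loop did not converge")

# Stub agent: three issues, each resolved by one fix pass.
backlog = [{"id": n, "severity": "P1"} for n in range(3)]
def fake_review():
    return list(backlog)
def fake_fix(issues):
    backlog.pop()  # pretend the agent fixed one issue this pass

assert review_fix_loop(fake_review, fake_fix) == 4  # 3 fix rounds + 1 clean pass
```

The `max_rounds` cap matters in practice: an agent loop with no budget can burn tokens indefinitely on an issue it cannot actually fix.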
👤 PEOPLE TO WATCH
- Simon Willison — launched Agentic Engineering Patterns (written by him, not an LLM) and is turning scattered best practices into an evergreen “guide” format. First chapters: “writing code is cheap now” and “red/green TDD”.
- Theo (t3.gg) — consistently practical on agent context management; argues many AGENTS.md/CLAUDE.md setups are counterproductive and measured as a cost/latency hit.
- Addy Osmani — sharp framing: AGENTS.md should be about non-discoverable landmines, and a single root file won’t scale for complex repos (he argues for a hierarchy of scoped files).
- Kent C. Dodds — evolving his reviews of agent code toward “is it actually wrong or just different,” focusing on principles over personal style; also calls out UI “taste” as a remaining bottleneck (CSS + knowing when UI looks bad).
- Armin Ronacher — hands-on, blunt tool feedback: calls MCP architecture token-inefficient/resource-intensive and says it underperforms “skills” in his testing.
🎬 WATCH & LISTEN
1) Prompt/context hierarchy explained (and why “extra context” sneaks into every request) — Theo (≈ 7:10–10:28)
Hook: A concrete mental model for why AGENTS.md/CLAUDE.md “rules” are sticky: provider/system/developer/user layers, where everything above the user layer gets sent each turn, so context decisions directly impact cost and behavior.
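The layering can be made concrete: the fixed layers (including an AGENTS.md-style developer layer) are prepended to every request, ahead of the running conversation, so their size is a cost paid on each turn. A toy illustration (message shapes are illustrative, not any specific provider's API):

```python
def build_request(layers: dict, history: list, user_msg: str) -> list:
    """Assemble one chat turn: every fixed layer is prepended to
    every request, before the conversation history."""
    fixed = [
        {"role": "system", "content": layers["provider"]},
        {"role": "system", "content": layers["system"]},
        {"role": "developer", "content": layers["developer"]},  # e.g. AGENTS.md
    ]
    return fixed + history + [{"role": "user", "content": user_msg}]

layers = {"provider": "base rules", "system": "app rules",
          "developer": "AGENTS.md contents go here"}

history = []
for turn in ["fix the bug", "now add a test"]:
    request = build_request(layers, history, turn)
    history += [request[-1], {"role": "assistant", "content": "ok"}]

# The developer layer rides along on every single turn, so its
# token count is a recurring per-request cost, not a one-time one.
assert sum(1 for m in request if m["role"] == "developer") == 1
assert request[2]["content"].startswith("AGENTS.md")
```

This is why a bloated context file shows up as a cost/latency hit rather than a one-off setup expense.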
2) What a “better coding benchmark” should measure — Latent Space + OpenAI Frontier Evals (≈ 14:04–15:51)
Hook: The team argues we’re moving beyond “solve a small GitHub issue” toward longer-running tasks and harder-to-measure signals like design taste, code quality, and maintainability.
📊 PROJECTS & REPOS
- OpenClaw — beta release notes (security/bugfix focus): https://github.com/openclaw/openclaw/releases
- Agentic Engineering Patterns (Willison) — guide hub + first chapters:
- test262 (referenced as a key “unlock” for safe agentic work on language tooling): https://github.com/tc39/test262
Editorial take: “Writing code is cheap now,” but proving it’s good (tests, evals, reviews, and anti-contamination discipline) is where serious teams will win.