ZeroNoise Logo zeronoise

Coding Agents Alpha Tracker

Active
Public Daily at 8:00 AM Agent time: 8:00 AM GMT+00:00 – Europe / London

by avergin 110 sources

Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.

Harness Beats Hype: Test-First Agent Loops, Pi, and Monty
Mar 15
5 min read
84 docs
tobi lutke
Armin Ronacher ⇌
Armin Ronacher
+6
Simon Willison’s test-first agent playbook was the clearest signal today, while Pi and Monty showed where serious users are pushing the harness layer: tighter context control, typed execution, and better review loops. This brief pulls out the concrete workflows, model-routing patterns, and repos worth stealing from.

🔥 TOP SIGNAL

Simon Willison published the clearest public playbook today for making coding agents less magical and more repeatable: start every session with the exact test command, tell the agent to use red-green TDD, then force a manual curl pass after the tests because green suites still miss real bugs . The bigger cross-source takeaway: the wins are coming from harness discipline—tests, templates, rewinds, scoped workers, and sandboxes—not from giving one model unlimited rope .

"Tests are no longer even remotely optional."

🛠️ TOOLS & MODELS

  • Pi — minimal system prompt, top-five benchmark leaderboard performance with only basic file/bash tools, and strong context controls. The real signal is model routing: Haiku for question extraction, Sonnet 4.6 for well-scoped workers, Codex for review; Armin says that level of control matters because hidden harness changes and context injections kept breaking his Claude Code workflows
  • Monty + Pydantic AI — typed host functions, built-in TY type checking before execution, and in-process execution measured in ~800ns hot loops / single-digit microseconds. Samuel Colvin positions it as useful when a full sandbox is too slow or too awkward to self-host
  • Claude Code + Gemini CLI + Codex — Samuel mostly codes in Claude Code, uses Gemini CLI for fast whole-branch review reports, then points Claude Code at the report to implement fixes; Codex is a second reviewer when he wants a more agentic investigation
  • OpenClaw — next release adds /btw, a small but useful primitive: you can ask agents questions even while they are busy working. Docs are already up

💡 WORKFLOWS & TRICKS

  • Simon’s default session loop
    1. Tell the agent how to run tests (uv run pytest)
    2. Add: use red-green TDD
    3. After codegen, have it start the server in the background and exercise the API with curl
    4. If you want a readable audit trail, tell it to use Showboat so it writes a Markdown log of the manual test run
  • Conformance-first implementation — Simon’s Datasette file-upload trick: ask the agent to build a test suite that passes against multiple reference implementations, then implement your own version against that shared behavior
  • Seed the repo so agents copy the right things
    • Use templates with tests, README, and CI
    • Keep at least a couple tests in your preferred style
    • Agents are extremely consistent at following existing patterns, so good scaffolding compounds
  • Use sub-agents surgically, not as a feature factory
    • Pi users keep 40-60% of context free by planning first, breaking work into todos, sending defined tasks to Sonnet 4.6 workers, then rewinding to a warm parent context for polish
    • Armin’s caution: sub-agents help with exploration and parallel search, but if you still read most of the code, swarms can just hand you too much to review
  • Security hygiene that survives model churn
    • Avoid the “lethal trifecta”: private data + malicious instructions + an exfiltration path
    • Containerization protects the host, but Armin says it does not solve secret exfiltration; Simon prefers Claude Code on the web when he wants the work contained off his laptop
    • Do not clone prod data to local laptops; generate mock users and edge cases instead
  • Two small workflow unlocks
    • Armin now routinely lets agents write small Python scripts instead of JavaScript because uv run made dependency handling simple enough
    • git bisect gets much easier to drive through an agent loop

👤 PEOPLE TO WATCH

  • Simon Willison — dropped a quote-rich Pragmatic Summit fireside chat and notes; worth it for the TDD/manual validation/safety playbook and for his explicit rejection of “nobody reads code” workflows in security-sensitive contexts
  • Armin Ronacher — high-signal because he keeps surfacing small workflow changes that actually matter: uv run, agent-friendly git bisect, and real /autoresearch usage on MiniJinja
  • Samuel Colvin — strongest current voice on type safety, constrained host functions, and mixing models for review vs execution
  • Peter Steinberger — worth following for OpenClaw tooling, but also for the framing: this is “agentic engineering,” not sloppy vibe coding; you still need thinking, testing, debugging, and iteration
  • Dimitri — useful counterweight to autonomy hype: hands-off codegen currently tops out around a couple thousand lines of standard code, and enterprise rollouts are likely to force a review-heavy phase first

🎬 WATCH & LISTEN

  • 11:23-13:08 — Latent Space / Samuel Colvin: The cleanest explanation today of when coding agents jump from “a bit faster” to roughly 100x faster: known internals, known API, easy tests, and no bikeshedding about the interface .
  • 65:20-68:23 — Pi AMA: Armin’s take on memory for coding agents is worth hearing in full: the codebase is the source of truth, and agentic search beats hauling around stale summaries .
  • 27:45-29:56 — TheStandup / Dimitri: Useful reality check if your company is mandating AI use: the likely near-term outcome is a review-heavy workflow that many engineers will hate .

📊 PROJECTS & REPOS

  • Pi extension stack — Todos, Answer, screenshot/debug tooling, and patch-based multi-edit experiments are where the project feels differentiated right now
  • pi-autoresearch — now past the toy stage: Armin ran it overnight on MiniJinja, got many perf improvements, and is reviewing the resulting PRs one by one. Context: MiniJinja PR #884
  • Showboat — Simon’s new agent QA tool that turns manual test execution into a Markdown artifact you can actually inspect later
  • lossless-claw + qmd memory plugin — if OpenClaw’s stock memory is weak for your use case, steipete is explicitly pointing people to these alternatives

Editorial take: the durable edge right now is harness design, not raw model bravado—tests, context boundaries, and constrained execution keep showing up in every workflow that actually works.

1M Context Goes Default; Memory Agents and PR Filters Get Real
Mar 14
6 min read
87 docs
Mario Zechner
cat
Claude
+7
Claude Code's 1M rollout was the headline, but the sharper practitioner signal was what engineers layered on top: memory-specialized agent stacks, phone-to-laptop session spawning, shared-agent control planes, and better defenses against AI PR noise.

🔥 TOP SIGNAL

1M context just stopped being a special-case feature. Opus 4.6 and Sonnet 4.6 are now generally available at 1M context, Opus 4.6 1M is the default Claude Code model on Max, Team, and Enterprise, and the API dropped both the long-context premium and beta header requirement .

The higher-signal takeaway is what serious users do next: Boris Cherny says he has been using 1M context exclusively for months , while Charles Packer argues that bigger windows do not solve the deeper memory problem and recommends pairing a memory-specialized agent with Claude Code or Codex instead of relying on raw context alone .

🛠️ TOOLS & MODELS

  • Claude Opus 4.6 / Sonnet 4.6 — 1M GA. Opus 4.6 1M is now the default Claude Code model for Max, Team, and Enterprise; Boris says Pro and Sonnet users can opt in with /extra-usage. API-side, there is no long-context price increase, no beta header requirement, and support for up to 600 images per request . Simon Willison highlights that standard pricing now applies across the full 1M window—unlike GPT-5.4 above 272k tokens and Gemini 3.1 Pro above 200k . Docs: model config · announcement
  • Claude Code remote-control — mobile → laptop session spawning. Run claude remote-control on the laptop, then spawn a new local session from the mobile app . Rollout is for Max, Team, and Enterprise on >=2.1.74; mobile GitHub setup is still required for now .
  • Claw / OpenClaw — live browser control gets serious. The new beta adds live browser control from latest Chrome via chrome://inspect#remote-debugging plus a new user profile session . Steinberger says the MCP Chrome session feature gives full access to your browser and logged-in websites, with an extra alert to enable it . Parallel tool calling is also coming to OpenClaw, and Opus 1M has been enabled across providers .

💡 WORKFLOWS & TRICKS

  • Treat 1M context as something to steer, not just enable.

    1. If you are on Max, Team, or Enterprise, Opus 4.6 1M is already the default in Claude Code .
    2. If compaction behavior feels wrong, tune it with CLAUDE_CODE_AUTO_COMPACT_WINDOW.
    3. Boris says he has been on 1M context full-time for months, which is a decent daily-driver signal .
  • Three Claude Code shortcuts worth memorizing.

    • ! runs bash inline and injects the command plus output into context
    • Ctrl+S stashes your draft, lets you ask something else, then restores the original draft after submit
    • Ctrl+G opens the prompt or plan in $EDITOR for bigger edits
  • Phone → laptop handoff is now a real workflow.

    1. On the laptop, run claude remote-control.
    2. In the mobile app, spawn a new local session .
    3. Make sure you meet the plan/version requirements and have GitHub configured on mobile .
  • Use a memory agent as the control plane.

    • Letta's concrete pattern: run Claude Code, then use a hook to fire a Letta agent that curates memory into a CLAUDE.md file or a dedicated memory/context repo .
    • The more interesting inversion is to make the memory-specialized Letta agent your main interface, then let it dispatch to Claude Code or Codex for narrow execution .
    • The target is higher-level reflections, not mundane logs .
  • Use a shared channel as the control plane for multiple agents. Slack's internal pattern is a shared channel where tools like Linear, Cursor, and Claude Code can send notifications, read each other's messages, and operate with humans in the loop; the channel itself becomes a useful context boundary .

  • Fight AI PR flood with trust filters, not heroics.

    • Theo's setup uses vouch.md plus the Vouch workflow to label trusted PR authors; on T3 Code it cut the active review surface from 150 open PRs to 43 trusted ones .
    • His gold standard is still boring: small, explicit, issue-linked changes—often 1-5 lines.
    • Add PR Stats if you want merge-rate and history context per contributor .

"Please do not use clankers to add more noise to PRs. We’re working on a solution to this, and this is making my job harder."

  • If agent throughput is stressing CI, remove the obvious bottleneck first. Theo switched one GitHub Actions job from ubuntu-latest to Blacksmith's CPU runner and saw runtime drop from about 2.5 minutes to under 1 minute, while cost was cut in half; the dashboard also helped isolate flaky tests .

👤 PEOPLE TO WATCH

  • Boris Cherny — high signal because he is sharing operator-level Claude Code details, not just release notes: 1M default rollout, the compaction knob, and phone-launched laptop sessions .
  • Peter Steinberger (@steipete) — one of the best public follows for open coding-agent infrastructure right now: browser control, MCP permissions, parallel tool calls, and blunt maintainer feedback on PR noise .
  • Charles Packer — strongest memory-first counterweight to raw model hype today; directly useful if you are designing long-lived coding-agent scaffolding .
  • Theo — high-signal repo maintainer view on what breaks first when agents increase throughput: review queues, contributor triage, and CI economics .
  • @_catwu — small Claude Code operator tips that pay back immediately .

🎬 WATCH & LISTEN

  • 78:03-81:40 — Charles Packer on memory vs. model size. Best clip today if you are tempted to treat 1M context as the endgame. His argument: larger windows help, but durable personalization and specialization still need explicit memory structures .
  • 34:44-37:35 — Rob Seaman on shared-channel agent orchestration. Useful pattern for teams: put multiple agents in one Slack channel so they can notify each other and humans can supervise the whole loop from one place .

  • 20:42-23:17 — Theo on Vouch and what a 'golden PR' looks like. Worth your time if your repo is getting hit with AI-generated PR volume. He shows how Vouch narrowed the working set and why mergeable PRs still need to be tiny and obvious .

📊 PROJECTS & REPOS

  • Claw / OpenClaw — OpenClaw is at 200k GitHub stars, and the latest beta push is toward higher-agency browser use: live browser control via Chrome remote debugging, a new user profile session, full MCP browser access to logged-in sites, and parallel tool calling on the way .
  • T3 Code — public for about five days and already dealing with 150 open PRs despite not asking for contributions; Theo also called out a >10% fork/star ratio, meaning unusually high engagement .
  • Vouch — Mitchell Hashimoto's trust-management workflow is the most immediately useful OSS triage tool from today's scan: vouch.md, workflow automation, and a public proof point on T3 Code's backlog .
  • PR Stats — Reese's contributor scoring surface shows merge %, PR history, and work types; a useful companion to trust filters when AI lowers the cost of sending PRs .

Editorial take: 1M context is becoming table stakes; the edge is moving to memory curation, multi-agent control planes, and keeping agent-written code reviewable .

Benchmarked Agent Loops Hit Production: Liquid +53%, CursorBench, and Better Workspaces
Mar 13
5 min read
130 docs
Boris Cherny
Armin Ronacher
Salvatore Sanfilippo
+16
Tobias Lütke’s Liquid PR turned agentic optimization into a concrete playbook: benchmark script, strong tests, lots of experiments, measurable win. The rest of the day’s signal reinforced the same theme with CursorBench, OpenAI Automations, better subagent patterns, and new workspace designs for parallel coding.

🔥 TOP SIGNAL

A strong pattern crystallized today: the best agent wins are benchmarked, not vibes-based. Tobias Lütke used Pi plus pi-autoresearch to run around 120 semi-autonomous experiments against Liquid, landing 93 commits that made parse+render 53% faster and cut allocations 61% . Simon Willison’s reusable lesson is the setup: a benchmark script made make it faster actionable, and Liquid’s 974-test suite made aggressive agent experimentation safe .

🛠️ TOOLS & MODELS

  • OpenAI Automations — GA. You can now choose model and reasoning level, run in a worktree or existing branch, and reuse workflows via templates. OpenAI’s own examples are recurring repo jobs: daily briefings, issue triage, and PR comment follow-up .
  • CursorBench — new eval surface for coding agents. Cursor is publishing intelligence + efficiency scores for agentic coding, and says it combines offline benchmarks with online evals because public benchmarks are increasingly saturated . Jediah Katz frames this as a transparency push around real scores . cursor.com/blog/cursorbench
  • Cursor’s search stack is now a lot more legible in public. Via Turbopuffer, Cursor embeds the full codebase with a custom embedding model, uses semantic search plus GREP, and increasingly fans out parallel queries inside an agent turn . Turbopuffer says the migration cut Cursor’s costs 95% and fixed per-user economics .
  • OpenClaw 2026.3.11 — behavior change worth checking today. Cron now enforces a stricter cron-owned delivery contract in isolated runs; jobs using delivery.mode='none' while sending ad hoc messages may now go silent. Fix: run openclaw doctor --fix, then move to explicit announce or webhook delivery .
  • Gemini API spend caps. Simon Willison calls this immediately useful for CI and agent experiments where the main fear is an accidental bill spike .
  • Actual model routing from a daily driver. Theo says he still prefers Claude for a lot of UI work, uses Codex alongside it inside T3 code/terminal workflows, and will spin up Gemini CLI quickly for UI tasks he cannot do in Codex .

💡 WORKFLOWS & TRICKS

  • Run benchmarked autoresearch, not random refactors.
    1. Create a prompt file plus a script that runs tests and benchmarks .
    2. Let the agent propose many micro-optimizations and test them one by one; Tobi’s run hit around 120 experiments and 93 commits .
    3. Persist state in autoresearch.jsonl so the search can keep context across runs .
    4. Only do this on a repo with strong tests; Liquid had 974 unit tests .

Having a robust test suite is a massive unlock for working with coding agents.

  • Add a no-code-first planning mode.

    1. Give users a way to ask for step-by-step architecture/planning without immediate code generation .
    2. Anthropic’s implementation was basically one instruction: please don’t code.
    3. This matters because users were already trying to force that behavior through the chat UI by hand .
  • Use fresh-context subagents, but over-spec the handoff.

    1. Keep a main agent in the loop and spawn subagents with clean context windows for bugs or research tasks .
    2. Let several of them work in parallel when the problem is ambiguous .
    3. Force the final message to return actual findings, not just done; Harrison says bad communication is the failure mode here .
  • Rebuild your workspace around projects, not panes.

    1. Put each project in its own sidebar/workspace entry with quick hotkeys .
    2. Inside that project, keep an agent terminal, a dev server, and a git terminal together .
    3. Offload long-running agents to SSH/TMUX on another machine so they keep working when the laptop is closed .
    4. Theo and Armin/Ben are converging on the same idea: less context switching, more parallel threads under light supervision .

👤 PEOPLE TO WATCH

  • Tobias Lütke + Simon Willison — best public example today of agentic optimization with a measurable end state: benchmark script, test suite, lots of experiments, concrete win .
  • Boris Cherny — high signal because he is sharing actual Claude Code product patterns: plan mode, multi-agent Mama Claude, and a Bitter Lesson-style refusal to overbuild around current model limits .
  • Harrison Chase — still one of the clearest explainers of agent harness primitives: prompts, planning, subagents, filesystems, sandboxes, observability, and evals .
  • Salvatore Sanfilippo — useful reality check: benchmark passes do not guarantee code you would ship, and operator skill still determines whether AI is a weak assistant or a 10-100x multiplier .
  • ThePrimeagen — strongest contrarian take today: fast autocomplete plus skill may improve proficiency without the cognitive debt and codebase drift that full agents can cause .

🎬 WATCH & LISTEN

  • 8:59-9:53 — Boris Cherny on plan mode. Great short clip because it shows how a valuable agent feature can come from a tiny harness change: users wanted thinking-first, not a code dump, and Anthropic shipped that behavior fast .
  • 13:13-14:33 — Harrison Chase on subagents. Probably the cleanest explanation you’ll hear this week of why subagents help and why communication back to the parent agent is the real hard part .
  • 13:34-18:23 — Theo on Niri-style work hierarchies for agentic coding. Worth watching if terminal/IDE/browser context switching is frying your brain; this is a concrete sketch of a better project/task layout for Claude Code, Codex, dev servers, and git .

📊 PROJECTS & REPOS

  • pi-autoresearch — Pi plugin used in Tobi’s Liquid optimization. The signal is that it carried state via autoresearch.jsonl through around 120 experiments and 93 commits in a live performance PR .
  • Shopify/liquid PR #2056 — public playbook for benchmarked agent optimization. Read it for concrete wins like String#byteindex, byte-level tag parsing, and cached small-integer strings .
  • Seamux — open-source terminal built on LibGhostty that Theo says has already replaced Ghostty as his daily driver because the project/task hierarchy fits parallel agentic work better than TMUX alone .
  • OpenClaw — if you automate recurring agent jobs, the 2026.3.11 release is the kind of operational change you want to catch early: stricter cron delivery rules plus a maintainer-provided migration path via doctor --fix.

Editorial take: the edge is moving away from raw model worship and toward measurable objectives, clean context boundaries, and workspaces that let one human supervise many agent threads.

Self-Compacting Agents, Bigger IDEs, and Review-First Dev Workflows
Mar 12
6 min read
153 docs
Romain Huet
DHH
Andrej Karpathy
+9
LangChain’s autonomous context compression was the clearest practical release of the day, while Karpathy’s bigger-IDE thesis, LangSmith’s eval loops, and new Cursor/OpenClaw/Codex workflows showed where coding-agent leverage is actually moving. The common thread: better control planes around agents, not just better raw models.

🔥 TOP SIGNAL

LangChain's latest Deep Agents release adds autonomous context compression: the model can decide when to summarize older context instead of waiting for a fixed token threshold or a human /compact, while retaining the most recent 10% of messages and preserving full history in the virtual filesystem for recovery . The good trigger points are semantic, not token-based: new task boundaries, after extracting a result from a large context, before big reads or long drafts, before lengthy refactors, and when new requirements invalidate earlier context . Zoomed out, this matches Karpathy's bigger thesis: if the unit of programming is shifting from files to agents, the leverage moves into the control plane around those agents—memory, visibility, stats, and orchestration .

🛠️ TOOLS & MODELS

  • Deep Agents SDK/CLI — autonomous compaction, opt-in. In code, add create_summarization_tool_middleware(model, backend); in the CLI, the manual fallback is /compact. LangChain says the feature is tuned conservatively and keeps history recoverable after summarization .
  • OpenClaw v2026.3.11-beta.1. Adds Hunter Alpha (1M context), Healer Alpha via OpenRouter, improved reliability for GPT 5.4 and Kimi Coding, fixes for ACP/message handling, and opencode Go support. Practical bug note: maintainer Peter Steinberger traced a GPT 5.4 "yes I will do x" stall to a missing phase parameter in the WebSocket implementation; the release also fixes Kimi coding tool-call handling . Release notes
  • Cursor Marketplace — 30+ new plugins. The concrete standouts: Datadog for natural-language logs/metrics/traces/dashboards, Hugging Face for datasets and model training/eval jobs, Glean for company knowledge, PlanetScale for schema/query work, plus Atlassian, GitLab, and monday.com integrations . Jediah Katz's summary of the Datadog side: "Datadog + Cursor = Joy". This is what Karpathy's "bigger IDE" starts to look like in product form: the editor reaching into observability, data, knowledge, and project systems, not just files . Details
  • Codex review UX. Type /review, choose the branch to compare against, and get prioritized inline feedback before pushing. Romain Huet calls the flow "delightful" .
  • CodeRabbit as the review backstop. Theo says it is consistently the best code reviewer on his team, catches the small AI-written mistakes humans skip, adapts when you tell it to stop commenting on something, and prevented dozens of bugs in the last two weeks; it ships as a VS Code extension and CLI .
  • Model routing in the wild: Kimi K2.5 for the fast lane. DHH says it remains his daily driver for basic work where he wants speed, not "PhD-level intelligence," running at 200 tps via Fireworks inside opencode .

💡 WORKFLOWS & TRICKS

  • Semantic compaction loop

    1. Compact on task boundaries or completion acknowledgments, after extracting a result from lots of context, before a big read/draft/refactor, or when old assumptions are invalidated .
    2. In code, add create_summarization_tool_middleware(model, backend); in the CLI, keep /compact as the human override .
    3. Keep it conservative; LangChain preserves full history in a virtual filesystem so recovery is possible post-summarization .
  • Trace → eval → dataset → baseline

    1. Turn on tracing or instrument with OpenTelemetry .
    2. Run sampled online evals with an LLM judge on whole traces or just the guardrail/subagent you care about; use thread evals when the question is "did the user actually get unblocked?" .
    3. Pipe thumbs-downs or high-signal traces into annotation queues, then edit them into cleaner gold outputs .
    4. Keep a 50-100 example dataset with both easy and hard cases, and compare new prompts/models against a baseline while watching quality, latency, cost, and token counts side by side .
  • Bad trace → better prompt

    1. Pull the exact failing LLM call from a trace into Prompt Playground .
    2. Ask Polly to rewrite it using best practices; Victor's demo added XML tags, clearer context, and concrete examples .
    3. Add dynamic variables for runtime allowances or memory, then save the prompt into Prompt Hub with versioning .
  • Review-first agent coding

    1. In Codex, run /review, choose the comparison branch, and work through the prioritized inline feedback before push .
    2. For steady-state PR review, Theo's pattern is to let CodeRabbit catch the small mistakes humans won't spend time on, then tune its behavior by explicitly telling it what to stop flagging .
  • Ground the implementation, then ask a second model to be mean

    1. Tell the builder model to inspect the authoritative repo/docs, not just generate from memory.
    2. Simon did this by asking Claude to clone python/cpython and consult listsort.txt and listobject.c before adding Timsort .
    3. Then hand the result to another model for critique; GPT-5.4 Thinking said Claude's first pass was only a "simplified, Timsort-inspired adaptive mergesort".
    4. The whole prompt chain is public: full sequence of prompts

👤 PEOPLE TO WATCH

  • Andrej Karpathy — still the clearest public thinker on agent-native developer UX: bigger IDEs, agent command centers, and even "org code" that can be built, run, managed, and eventually forked .

"Expectation: the age of the IDE is over
Reality: we're going to need a bigger IDE ... the basic unit of interest is not one file but one agent. It's still programming."

  • Victor @ LangChain — if you build agents, today's LangSmith walkthrough is one of the better public demos of trace-driven improvement loops instead of blind prompt fiddling .
  • Peter Steinberger — high-signal follow for open agent tooling right now because he is debugging GPT 5.4/Kimi compatibility issues in public and shipping fixes quickly .
  • Simon Willison — still one of the best at publishing full transcripts and cross-model audits, which makes his experiments replayable instead of mystical .
  • Theo — good reality check from a team already living with coding agents daily: as agents write more code, AI review becomes more important, not less .

🎬 WATCH & LISTEN

  • 30:34–32:43 — model baseline comparison in LangSmith. Victor shows how to set a production baseline, compare alternatives side by side, and make the real tradeoff call: better scores vs more latency and higher cost .
  • 33:20–36:53 — fix a bad prompt from a real trace. Great short demo of pulling an LLM call into Prompt Playground, having Polly improve it with XML tags/examples, injecting dynamic vars, and saving the result into Prompt Hub .
  • 20:20–22:50 — let the system cluster your failure modes. Useful if you're drowning in raw traces: the Insights agent groups failures and usage patterns across thousands of traces and lets you compare shifts over time .

📊 PROJECTS & REPOS

  • OpenClawv2026.3.11-beta.1 release notes: Hunter Alpha (1M context), Healer Alpha, GPT 5.4/Kimi reliability work, ACP/message handling fixes, and opencode Go support .
  • Deep Agents — LangChain's open-source agent harness now includes agent-triggered context compaction. If you're designing your own harness, the linked system prompt is worth reading because it shows the exact scenarios they want the model to use .
  • ask-search — emerging self-hosted search layer being recommended for OpenClaw and Claude Code users who want better privacy and fewer scraping-rate-limit problems, instead of paid Brave/Google Custom Search or harder-to-set-up Bing .
  • Simon Willison's Sorting algorithms — the live Sorting algorithms artifact plus the full sequence of prompts is a compact public example of repo-grounded feature building and second-model review .

Editorial take: today's edge was not one magic model win; it was better scaffolding around agents — self-managed context, review loops, trace-driven evals, and editors that reach into the rest of the stack.

Parallel agent work hardens: Claude Code reviews PRs, Codex fans out tasks, Karpathy logs an 11% gain
Mar 10
6 min read
111 docs
Alex Albert
Claude
Yuchen Jin
+11
Parallelism—not just better raw models—was the clearest coding-agent signal today. Karpathy showed measurable gains from autonomous experiment loops, Anthropic shipped multi-agent PR review, and practitioners shared concrete fan-out, skills, and documentation patterns that make these systems reliable.

🔥 TOP SIGNAL

Parallelism is becoming the real lever. Karpathy's autoresearch loop ran ~700 autonomous experiments, found ~20 additive changes that transferred from smaller to larger nanochat models, and cut "Time to GPT-2" from 2.02h to 1.80h (~11%) . Anthropic productized the same pattern with Claude Code's new Code Review, which spawns a team of agents on every PR because internal code output per engineer is up 200% and review became the bottleneck . Francesco reports the practitioner-side version: switching to Codex and parallelizing more aggressively made February his most productive month ever, nearly 4x August .

🛠️ TOOLS & MODELS

  • Claude Code — Code Review: When a PR opens, Claude dispatches a team of agents to hunt for bugs . Anthropic says they built it for themselves first because code output per engineer is up 200% this year and review became the bottleneck; Boris Cherny says it catches bugs he would have missed, and Alex Albert says it has been a game changer internally .
  • Codex xhigh reasoning: Francesco's Typefully setup gets the first prompt right 95% of the time, and his output jumped nearly 4x once he switched to Codex and pushed more work in parallel .
  • Harness > raw model: Dylan Patel says the same Claude 4.6 model performs very differently in Claude Code vs Cursor agent mode, and his team mostly prefers Claude Code because of the harness . Simon Willison and Kent C. Dodds report that, with a good agent harness plus repo docs/examples, agents handle private or brand-new tools just fine, including Remix 3 .
  • Long-running loop reliability check: In a public autoresearch test, Claude Opus 4.6 (high) ran 12+ hours and completed 118 experiments, while GPT-5.4 xhigh stopped after 6 despite a LOOP FOREVER instruction . Karpathy says Codex currently does not work with autoresearch as configured and that he prefers interactive tmux sessions over headless loops .
  • Cloud-only dissent: Theo says T3 Code will not support local models because he does not think they can do meaningful engineering work, and because one of the product's advantages is running lots of work in parallel .

💡 WORKFLOWS & TRICKS

  • Copy Francesco's low-babysitting Codex loop

    1. Put each task in Linear.
    2. Use Git worktrees so agents stay off main.
    3. Open Ghostyy, paste a Linear task ID, then repeat for more tasks .
    4. Review PRs while other agents keep working .
    • His claim: Codex fits this parallel workflow better than Claude Code because it needs less steering and feedback .
  • Run cheap-to-expensive research loops

    1. Let agents explore on a smaller model first .
    2. Optimize for a metric you can evaluate cheaply, or for a smaller-network proxy .
    3. Promote only promising ideas to larger scales .
    4. Keep only changes that transfer additively; Karpathy's round 1 found ~20 that did .
    • He says autoresearch is best treated as a recipe/idea you hand to your agent, not something you use directly .
  • Teach the agent the stack inside the repo

    • Kent says agents had zero problem with Remix 3 once the repo had the right documentation .
    • Simon's trick is explicit: tell the agent to read --help output for unfamiliar tools before it starts solving the task .
    • Emerging pattern: projects are now shipping official skills repos to package this knowledge for agents .
  • Turn specialist knowledge into shared skills

    • Dylan Patel says his team keeps reusable skills in internal GitHub, so a specialist's workflow—like data-center permit analysis—can be reused by non-experts .
    • He also describes a non-programmer hedge-fund user teaching Claude Code a tone-analysis skill from books, then running it across earnings transcripts without writing code .
  • Auto-ship low-risk work; gate the risky stuff

    1. Edit inside the product's designer mode.
    2. Hit Launch Agent to ship via Cursor Cloud Agents and Workflow Automations .
    3. Stop for manual review only when the risk matrix says to—e.g. database schema migrations .
    • Geoffrey Huntley's framing is good: stay on the loop, not in the loop.
  • If you're building agents, evals first beats prompt-tweaking

    • LangChain starts by defining success scenarios, then runs rule-based checks plus an LLM judge in CI .
    • Every human action becomes training signal: send, edit, and cancel are logged against traces and reused later .

👤 PEOPLE TO WATCH

  • Andrej Karpathy — still the clearest public source on eval-driven agent loops. Today's reason: ~700 autonomous experiments, ~20 additive fixes, an ~11% nanochat speedup, plus blunt feedback on where headless loops break .
  • Dylan Patel — unusually concrete on production agent use: real spend numbers, same-model harness differences, shared skills, and non-programmer adoption inside his firm .
  • Francesco (Frank Dilo) / Romain Huet — strongest public Codex workflow today: nearly 4x output, 95% first-prompt hit rate, and a task fan-out system you can copy tomorrow .
  • Simon Willison + Kent C. Dodds — good antidote to the "agents only work on boring stacks" meme. Their shared point: docs, examples, and harness quality matter more than whether the framework was in the training data .
  • swyx — worth tracking if long sessions keep degrading. He keeps open-sourcing tooling around Claude compaction and session hygiene instead of just complaining about it .

🎬 WATCH & LISTEN

  • Dylan Patel on "coding tools" vs agent orchestration systems — 32:34-36:34. Best clip of the day if you still think Claude Code or Codex are just for programmers: he walks through reusable skills, non-programmer workflows, and why the category is bigger than code generation .
  • Dylan Patel on cost shock vs output — 4:20-5:46. A rare hard-numbers segment: one non-programmer at his firm spends $5k/day on Claude 4.6 fast 1M context, one engineer spent $8k in a single go, and the company still accepted the burn because the output justified it .

📊 PROJECTS & REPOS

  • autoresearch — Karpathy says this is a recipe/idea, not a turnkey app. The latest proof point is his nanochat round 1: ~700 autonomous experiments surfaced ~20 additive improvements and cut time-to-GPT-2 by ~11% .
  • nanochat round-1 commit — concrete patch set from that pass: QKnorm scaler, value-embedding regularization, less conservative banded attention, AdamW beta fixes, weight-decay tuning, and initialization tuning .
  • claude-compaction-viewer — swyx open-sourced this after repeated bad Claude Code compactions, and noted it could likely extend to Codex compactions too .
  • Official skills repos are now showing up from maintainers, not just users: Remotion, Supabase, Vercel, and Prisma.

Editorial take: the edge is moving from "one best model" to better control planes around models — parallel tasks, shared skills, explicit review, and eval loops are what keep showing up in the strongest practitioner reports.

Devin 2.2, hybrid memory, and the shell-first agent stack
Mar 9
6 min read
96 docs
Theo - t3.gg
swyx
Cole Brown
+8
Today’s strongest signal is that mature harnesses are finally cashing in on better models. This brief covers Devin 2.2 feedback, Cursor and LangSmith updates, hybrid memory patterns, agent-to-agent backchannels, and the security rules practitioners are using in production.

🔥 TOP SIGNAL

  • The biggest edge now looks like harness engineering compounding with better models. After trying every Devin release, @dtcb says version 2.2 finally feels simpler than a local workflow and is now where he wants to debug . swyx says that jump came from a process the team behind Devin has been running since late 2023: dozens of model groups, constant evals for routing, and full harness rewrites every few months .
  • Sam Altman’s framing fits the moment: build a company that benefits from the models getting better and better.

🛠️ TOOLS & MODELS

  • Devin 2.2 — strongest practitioner signal of the day. One experienced user says it is now simpler than his local workflow; swyx says the underlying system relies on a couple dozen model groups, heavy evals, and periodic harness rewrites .
  • Enterprise deployment check — Nvidia says Codex and Claude Code are already used by tens of thousands internally .
  • Cursor — GPT-5.4 Fast — enable via Settings > Models > GPT-5.4 Fast. Reported tradeoff: 50% faster for 2x the price.
  • LangSmith Skills + CLI — new terminal-native tooling so agents can debug traces, create datasets, and run experiments from the shell . Details
  • Super Memory plugins — Dhravya Shah says a Cursor plugin is launching today; plugins already exist for Claude, OpenClaw, and OpenCode . The OpenClaw integration switched from tool-triggered memory search to hook-based context injection under 2k tokens per turn, with contradiction handling, temporal reasoning, and a hybrid RAG fallback when memory misses .
  • Memory eval reality check — Shah argues LongMemEval over-rewards extracting everything and ignores cost or forgetfulness, while Locomo mostly tests retrieval and can be brute-forced by dumping context. His team open-sourced Memory Benchmark to compare providers on shared rules across quality, latency, cost, recall, and NDCG .
  • GPT-5.4 vision -> code — Romain Huet says GPT-5.4 is especially strong on dense documents, diagrams, and rough sketches, then suggests handing the result to Codex to turn it into software .

💡 WORKFLOWS & TRICKS

  • If you are building an agent harness, copy Devin’s routing pattern, not just its UI
    1. Maintain multiple model groups instead of betting on one model
    2. Eval every model before routing it into the harness
    3. Treat the harness as a living system and rewrite it periodically as models change
  • Use a private agent backchannel with an approval gate
    1. Run acpx inside Codex
    2. Connect over ACP to OpenClaw and a remote agent like Molty
    3. Let the agents discuss privately
    4. Send into the live destination only after the target session approves it
    • Repo: acpx
  • Terminal beats chat when the toolchain already exists
    • Nvidia engineers say coding agents outperform more general agents largely because shell access gives them compilers, tests, and every installed tool, so they can write, run, inspect errors, and fix in-loop
    • Concrete example: with an Outlook CLI installed, one engineer had Codex summarize a messy inbox, highlight escalations, move reply-worthy threads into a folder, and archive the rest
    • LangSmith is productizing the same pattern by exposing trace debugging, dataset creation, and experiments through a CLI
  • Memory that helps coding agents is hybrid, not just a folder of notes
    • File-based memory can work, but Shah says it depends on explicit remember-this behavior, gets slow to traverse, and lacks update logic
    • His replacement pattern: keep a tiny always-on user profile plus recent episodes, surface memories first, and fall back to raw RAG chunks when memory misses
  • Hard safety rule for powerful agents
    • Nvidia’s rule of thumb: agents can access files, the internet, or custom code execution — but you should usually grant only two of the three
    • If you need riskier setups, isolate them. Their example for OpenClaw is a Brev VM off the corporate network
  • Visual-to-code loop
    1. Feed GPT-5.4 the dense doc, diagram, or rough sketch for interpretation
    2. If the task is UI-heavy, connect a design surface like Paper to Claude Code or OpenClaw
    3. Riley Brown’s demo flow: install Paper -> connect Claude Code -> plan design -> generate designs -> iterate -> build the React app -> deploy
  • 100% agent-written code can still be disciplined
    • Kent C. Dodds says he already has agents writing 100% of his code, but still steers the work and can read all generated code manually. His point: that is not the same as hands-off vibe coding

👤 PEOPLE TO WATCH

  • swyx + @dtcb — best current read on why Devin suddenly feels good: same harness, better models, real user feedback
  • Dhravya Shah — rare mix of implementation detail and benchmark skepticism on agent memory; worth watching if you care about stateful agents more than leaderboard screenshots
  • Peter Steinberger — actively wiring Codex, OpenClaw, and ACP together in public; good source for multi-agent orchestration patterns, not just model takes
  • Andrej Karpathy — now pushing autoresearch toward agent communities coordinated through GitHub Discussions and PRs instead of a single linear branch
  • Theo — useful dissent. After hopping back into Claude Code for UI work, he says CLI agent UX is still awful compared with a real GUI

🎬 WATCH & LISTEN

  • Latent Space — 19:24–20:35: why user profiles beat literal retrieval. Good explanation of why an agent needs a tiny always-on profile plus recent episodes to answer questions like what monitor fits you, even if you never explicitly talked about monitors
  • Latent Space — 22:25–23:42: hybrid memory mode for OpenClaw. Memories surface first, RAG fills the gap when memory misses, and the system extracts that information in the background for future turns
  • NVIDIA on Latent Space — 1:08:21–1:09:41: why coding agents keep beating general agents. The argument is straightforward: the terminal gives agents access to compilers, tests, and every installed tool, so the feedback loop is tighter than pure chat

📊 PROJECTS & REPOS

  • acpx — bridge layer that lets Codex call OpenClaw over ACP and OpenClaw call Codex back. Steinberger is already using it for private agent-to-agent discussion with an approval gate before posting to Discord
  • Super Memory — open-source context infrastructure for stateful agents. Shah says the project reached 100k users on about $5/month of Cloudflare spend in its early consumer phase and hit 10k GitHub stars in a few weeks after open source
  • Memory Benchmark — open-source eval harness for memory systems across providers, benchmarks, and judges, with metrics for quality, latency, cost, top-K recall, and NDCG
  • Karpathy’s lightweight GitHub coordination pattern — use Discussions for agent-written run summaries and PRs for exact commits you might adopt without merging

Editorial take: the edge is shifting from choosing one best model to building the system around it — routing, memory, terminal access, and permission boundaries

Karpathy’s autoresearch, Claude Code /loop, and Codex app’s CLI crossover
Mar 8
5 min read
122 docs
Yam Peleg
Theo - t3.gg
Salvatore Sanfilippo
+9
Karpathy’s stripped-down `autoresearch` release was the clearest practical signal today: autonomous loops get real when the harness is small, eval-driven, and inspectable. The rest of the useful news was equally concrete — scheduled agent tasks, better Codex UX, and context-management patterns that directly improve agent reliability.

🔥 TOP SIGNAL

  • Karpathy’s autoresearch is the cleanest open-source template yet for autonomous research loops. He packaged a ~630-line single-GPU repo from nanochat where the human iterates on a prompt file, the agent iterates on the training script on a git feature branch, and every run gets the same 5-minute budget so progress is measured by validation loss . He is already running the larger cousin on nanochat/8xH100, and a recent production snapshot showed agents making 110 changes in ~12 hours while lowering val loss with no wall-clock penalty .
  • Repo: https://github.com/karpathy/autoresearch

🛠️ TOOLS & MODELS

  • Claude Code /loop — recurring scheduled tasks for up to 3 days. Best examples so far: PR babysitting that auto-fixes build breaks/comments, and a morning Slack MCP summary of posts you were tagged in . Docs: https://code.claude.com/docs/en/scheduled-tasks
  • Codex app — Peter Steinberger says the app now beats CLI for him because of speed, which means fewer windows. OpenAI’s Alexander Embiricos separately called the parallelism work a big unlock .
  • Codex usage update — OpenAI says there is no evidence of a widespread faster-drain issue beyond GPT-5.4’s advertised 30% higher token cost vs GPT-5.2 and GPT-5.3-Codex; Plus/Pro rate limits were reset while they investigate remaining reports. They also found a rare inconsistent-usage issue across sessions affecting <1% of users .
  • T3 Code — now available to everyone, fully open source, built on Codex CLI. Launch-day signal: 5k users, followed by a fast patch release fixing markdown bullets, unsupported-language crashes in code blocks, shell detection, non-git projects, and ~ path issues .
  • openclaw v2026.3.7-beta.1 — adds GPT 5.4 and Gemini Flash 3.1 support .
  • oracle 0.9.0 / Sweet Cookie 0.2.0oracle adds GPT-5.4 Pro support plus bug fixes; Sweet Cookie adds Brave cookie support, better Linux/GNOME logic, and explicit macOS Chromium targeting .

💡 WORKFLOWS & TRICKS

  • Copy Karpathy’s eval loop, not the branding.
    1. Human edits the prompt/spec file.
    2. Agent edits the training code on a feature branch.
    3. Give each experiment the same fixed runtime — Karpathy uses 5 minutes in autoresearch — and accept changes based on the metric you actually care about. In his nanochat setup, slower changes get rejected even if loss improves .
  • Schedule the boring follow-up work.
    • Use /loop for short-lived recurring chores: babysit PRs, auto-fix build issues when comments land, or post a daily Slack summary. The point is not one-shot generation; it is keeping an agent attached to a workflow for a few days .
  • Context hygiene is now a first-class skill.
    • Send the exact code block, not the whole file, when you can. Antigravity’s shortcut is Cmd+L to lift a selected block directly into the agent prompt as a context item .
    • Quarantine stale docs. One GPT-5.4 user found outdated .md sections and moved them so other agents would stop treating them as truth .
    • Keep a durable session-memory file with only the fundamentals. Sanfilippo writes the actually relevant lessons to cloud.md so future sessions do not relearn the same mistakes .
  • Let the agent build the tooling around your bottleneck.
    • Sanfilippo profiled his C program, asked Cloud Code to turn the macOS Sample output into a reusable Python script, and surfaced that compute_diff was 94.2% of runtime before changing the algorithm .
    • From there, he pushed the agent toward local diff computation on 8x8 blocks, kept the 2x2 kernel that mattered for dithering, and iterated with no-SDL/status output when the implementation stalled .
  • Use clean-room sessions when provenance matters.
    • Sanfilippo first had an agent gather web specs for Z80, ZX Spectrum, and CP/M, then cleared context and started a fresh implementation session, followed by originality checks against existing emulators .
  • Kent C. Dodds has the most useful clarification on 100% agent-written code.
    • He does not mean zero oversight. He means telling agents what to change instead of editing in an IDE himself, and that workflow is already spanning kentcdodds.com, the Epic Workshop app ecosystem, and multiple product/tooling repos .

👤 PEOPLE TO WATCH

  • Andrej Karpathy — still the clearest source for eval-driven agent loops in ML. autoresearch matters because it reduces his nanochat setup to something you can actually inspect and run .
  • Boris Cherny — Anthropic is shipping lightweight agent orchestration features like /loop, with examples that map directly to real dev chores instead of vague demos .
  • Salvatore Sanfilippo — rare combination of systems depth and live agent usage: profiling, context files, clean-room implementation, and rapid C-level iteration all in public .
  • Kent C. Dodds — useful because he is explicit about what agent-native coding does and does not mean, while pointing to real multi-repo output you can inspect on GitHub .
  • Peter Steinberger (@steipete) — publicly flipped from Codex CLI to Codex app, pointed people to a model benchmark for OpenClaw, and pushed new openclaw / oracle releases .

🎬 WATCH & LISTEN

  • Salvatore Sanfilippo — 14:08–19:06. Good example of using an agent to instrument the work, not just write app code: he feeds profiler output to Cloud Code, gets a reusable Python analysis script, and identifies the real hot path before optimizing
  • Ilya Polosukhin on IronClaw — 61:44–64:42. If you are building agent harnesses, this is the security-first counterpoint to just wiring tools into an LLM: WebAssembly-isolated tools, prompt-injection detection, data-exfiltration checks, and policy-gated credential use
  • Vinay Pernetti on Augment — 36:41–41:21. Less about raw coding speed, more about team design: why Augment paused hiring to rethink what good engineering looks like when agents handle more implementation and engineers have to think more like managers of outcomes

📊 PROJECTS & REPOS

Editorial take: the real edge today is tighter control loops — clean context, recurring automation, and hard evals — not louder claims about AI coding.

Verification-first coding agents: Composer 1.5, spec-driven dev, and why proof beats diffs
Mar 7
6 min read
142 docs
Mark Chen
OpenAI Developers
Sualeh Asif
+12
Today’s signal across Cursor and Augment is clear: as agents generate bigger diffs, the winning teams shift from “review code” to “verify outcomes” with agent-run tests, spec-driven loops, and guardrails. Plus: Cursor’s Composer 1.5 details, Codex Security/OSS programs, and a concrete cloud-agent PR that shipped in 15 minutes.

🔥 TOP SIGNAL

Verification is becoming the core product feature of coding agents—not an afterthought. Cursor argues cloud agents won’t scale until the model can test its own code and prove it works (otherwise you return to humans a giant diff they can’t trust) . Augment is converging on the same idea via spec-driven development + a dedicated verification agent + robust CI/CD.

🛠️ TOOLS & MODELS

  • Cursor — Composer 1.5 model release

    • Cursor describes Composer 1.5 as between Sonnet 4.5 and Opus 4.5 in capability, trained “almost entirely” with lots of RL .
    • Design goal: fast, engaging usage—not “press Enter and go to sleep” .
    • Integrated capabilities Cursor wants inside the model: better GREP, strong semantic search for large codebases (finding the right place in 1–3 queries vs tens) and training toward recursive subagents to resolve most queries in <2–3 minutes.
  • Cursor — Cloud agents need product step-changes, not UI polish

    • Cursor says cloud agents today feel worse than local (slow setup/boot, hard to see changes) and highlights the core failure mode: you come back to a 1000-line diff and it’s still your job to determine mergeability/correctness .
    • Reported adoption signal: when the agent can test its own code and prove correctness, they’ve seen cloud agent usage jump by 10×.
    • Cursor’s mental model: cloud-agent compute is ~1% of local today; getting to 90% implies 1000× growth, which likely requires step-function capability changes .
  • OpenAI — Codex Security (research preview)

    • OpenAI introduced Codex Security, an application security agent that finds vulnerabilities, validates them, and proposes fixes for you to review and patch .
    • Positioning: helps teams focus on “vulnerabilities that matter” and ship faster .
    • Link: https://openai.com/index/codex-security-now-in-research-preview/
  • OpenAI — Codex for Open Source

  • Codex usage/cost notes (from @thsottiaux)

    • /fast mode: 1.5× inference speed at 2× token usage.
    • GPT-5.4 token cost is advertised as 30% higher than GPT‑5.2 and GPT‑5.3‑Codex; they say they’re not seeing evidence of additional excess usage beyond that .
    • Investigating reports of unexpected higher drain when WebSockets are enabled .
  • GPT-5.4 capability anecdotes worth calibrating against your own evals

    • Mark Chen: giving GPT‑5.4 a raw dump of GPT‑2 weights and asking for a <5000 byte C program to run inference succeeded in under 15 minutes; a similar exercise in a previous paper took days.
    • QuixiAI (shared by Greg Brockman): GPT‑5.4 showed a boost in “understanding and ability to solve problems quickly and completely,” including building a compiler where Claude Code was “pretty much stumped” .
    • Hanson Wang: GPT‑5.4 and GPT‑5.3‑Codex perform strongly on Terminal-Bench, with GPT‑5.4 solving a previously-unsolved hard task (“gpt2-codegolf”) .
  • Language targeting anecdote (Claude/Opus)

    • DHH: in a language shoot-out for Claude code generation, Opus + Ruby produced the best output (fewest tokens, fewest LOCs, fastest completion) .

💡 WORKFLOWS & TRICKS

  • Pattern: “Make the agent prove it” (cloud agents + CI)

    • Cursor’s critique of today’s cloud agents: they hand you a huge diff and you still have to decide correctness—Cursor says that feels “fundamentally wrong” .
    • Cursor’s proposed step change: have the model test its code and prove it did the thing correctly .
    • Practical implication for teams: invest in developer experience so agents can act like a new engineer who doesn’t know tribal knowledge (e.g., service boot order) .
  • Spec-driven + verification agent + robust release machinery (Augment’s production loop)

    • Augment describes going fully spec-driven, with humans aligning across a hierarchy of specs, then having agents refine toward implementation specs .
    • They pair this with a dedicated verification agent plus CI/CD stages (unit/system tests, feature flags, canaries) and treat a robust pipeline as non-optional .
    • Code review scaling idea: shift to agents reviewing most changes and escalating a smaller slice to humans (they describe aiming for agents to review ~80% and flag ~10–20% for humans, potentially shrinking further) .
  • Agentic manual testing (new chapter from Simon Willison)

  • Infra footgun reminder: don’t let agents free-fire Terraform

    • A production incident report: Claude Code ran a Terraform command that wiped a production database, taking down the DataTalksClub course platform and deleting 2.5 years of submissions; automated snapshots were also gone .
    • Recovery note (via @simonw): “Thankfully… the full recovery took about 24 hours.
    • Full timeline + prevention changes (author): https://alexeyondata.substack.com/p/how-i-dropped-our-production-database
  • Concrete “cloud agent shipped it” example (Cursor)

    • Kent C. Dodds: Cursor cloud agents implemented a diff-view upgrade (line diffs → character-level highlights) by migrating to diffs.com .
    • He reports: initial prompt + 7 follow-ups, “robots” reviewed/iterated, and he merged—15 minutes of his time.
    • PR: https://github.com/epicweb-dev/epicshop/pull/577
  • A practical “build loop” doc you can copy-paste (Ben Tossell)

    • Minimal process: create /spec/ folder, name specs (00_spec1), track progress in progress.md, enforce a test gate, dogfood in an agent-browser before handing you a URL, “debug until green” .

👤 PEOPLE TO WATCH

  • Sualeh Asif (Cursor, “Lessons from Building Cursor”) — unusually specific on what gets trained into the model (GREP/semantic search/subagents) and why cloud agents need proof, not diffs .

  • Vinay (Augment) — concrete production patterns for agent-first teams: spec hierarchies, verification agents, and treating CI/CD as the real safety net .

  • Simon Willison — keeps the conversation grounded in what actually catches bugs: agent-assisted manual testing as a complement to automated suites .

  • Kent C. Dodds — high-signal “minutes-to-merge” cloud agent workflow, with a real PR you can inspect .

  • @thsottiaux (Codex) — practical cost/speed tradeoffs and ongoing investigation notes for usage drain with WebSockets enabled .

🎬 WATCH & LISTEN

1) Cursor: why cloud agents are stuck until they can test + prove correctness (05:42–10:13)

Hook: the “1000-line diff” problem, why it’s backwards to make humans certify correctness, and why agent-run testing is the step-change.

2) Cursor: infra for long-running agents (minutes → days) + why Temporal-like systems matter (10:26–12:37)

Hook: agents break the old RPC mental model; monitoring and deploys get weird when tasks run for hours.

3) Augment: spec-driven development + integrated verification loops (25:50–28:00)

Hook: how they structure specs so humans align first, agents implement next, and verification runs continuously (not “later”).

📊 PROJECTS & REPOS

  • T3 Code (open source, Codex-CLI-based) — released publicly by Theo; designed for running many agents in parallel, and explicitly motivated by CLI scaling limits .

    • Try: http://t3.codes or npx t3@alpha
    • Claude support via Agent SDK is planned; PR is ready but waiting on approval .
    • Adoption signal: “Nearing 2,000 users in 1 hour.
  • OpenAI: Harness Engineering write-up — “steering Codex” to open/merge 1,500 PRs with zero manual coding for a product used by hundreds of internal users .

  • Agentic manual testing (guide chapter) — a reusable pattern, not a product launch: https://simonwillison.net/guides/agentic-engineering-patterns/agentic-manual-testing/


Editorial take: Output is cheap now; the real differentiator is proof—verification loops, repo devex, and hard guardrails around what agents are allowed to break.

Cursor Cloud Agents go video-first + test-first, while GPT-5.4 upgrades Codex and always-on automations spread
Mar 6
7 min read
147 docs
OpenAI
swyx
Salvatore Sanfilippo
+16
Cursor’s Cloud Agents show what “agentic IDE design” looks like in practice: dedicated VMs, end-to-end testing, demo videos, and Slack-first collaboration. Plus: GPT‑5.4’s Codex upgrades (/fast mode, Playwright skill, 1M context status), always-on Cursor Automations, and hard lessons on evaluation, manual testing, and CI prompt-injection security.

🔥 TOP SIGNAL

Cursor’s latest Cloud Agents push is a concrete “agentic IDE” redesign: agents run in dedicated VMs, test changes end-to-end, and return a demo video + a tested PR, with remote desktop/terminal access for quick human iteration . Cursor says this flow exists because reviewing code becomes the bottleneck once agents can generate large diffs—video is an easier first review surface (but not a code-review replacement) .

🛠️ TOOLS & MODELS

  • OpenAI — GPT-5.4 rollout (Thinking + Pro), unified frontier model

    • Rolling out in ChatGPT, and also available in the API and Codex.
    • OpenAI describes it as bringing advances in reasoning, coding, and agentic workflows into one model .
    • Practitioner note: Hanson Wang says Codex and Thinking models are now unified.
  • Codex — /fast mode (GPT-5.4)

    • Claimed 1.5x faster with “the same intelligence and reasoning” .
    • Tradeoff called out by the Codex team: 1.5x speed for 2x cost.
  • Codex — Playwright skill + frontend improvements (GPT-5.4 era)

    • Romain Huet says complex frontend work looks “noticeably better,” and calls out a new Playwright skill that lets Codex visually debug and test apps while it builds.
  • Cursor — GPT-5.4 support + 1M context status

    • Cursor says GPT-5.4 is now available and is “more natural and assertive,” leading on their internal benchmarks .
    • Cursor’s Jediah Katz reported an issue with 1M context in GPT-5.4 and said they were fixing it ASAP .
    • Follow-up: Katz says 1M context is now available for GPT-5.4 if you toggle Max Mode on (enterprise legacy pricing: coming behind a separate gpt-5.4-1m slug ).
  • Cursor — Automations (always-on agents)

    • Cursor announced Automations: “continuously monitor and improve your codebase,” running on triggers and instructions you define.
    • Cursor CEO Michael Truell says Automations already run thousands of times per day internally, powering self-healing CI, auto-approving PR flows, compute-intensive security review, and a team-wide memory system.
    • Jediah Katz highlights they can trigger on any event/webhook, run in the cloud (not dependent on one laptop), and are team-owned.
  • Local agents (privacy-driven) — Qwen 3.5 as “good enough” for some tasks

    • Salvatore Sanfilippo says Qwen 3.5 is the first time he feels local agents can work for simpler programming tasks on your own machine (not state of the art, but effective) .
    • He compares the 27B dense model (more stable, good for GPU) and 35B MoE (3B active) (faster iteration, maybe better in practice) .
  • Augment — “Intent” UI for large workloads

    • Theo describes Intent as a shift from chat/autocomplete toward a UI for planning and managing large agentic coding workloads.
    • He also highlights pulling context from Linear, Sentry, GitHub issues, or PRs to keep workstreams compatible .

💡 WORKFLOWS & TRICKS

1) Cursor’s “Cloud Agent” loop (test-first + video-first + HITL)

A replicable loop Cursor describes for cloud-agent work:

  • Kick off an agent in cursor.com/agents; it works longer because it tests end-to-end (starts dev servers, iterates) and aims to return a tested PR.
  • First review pass: watch the demo video (a faster entry point than reviewing a huge diff) .
  • If needed: use remote desktop (VNC-style) + terminal access to interactively verify behavior and iterate .
  • Testing controls:
    • Default behavior is calibrated testing: don’t test “very simple copy changes,” but test complex ones; configurable via agents.md.
    • Use /notest to force skipping tests .

2) Bugfixes that ship faster: /repro before/after videos

Cursor’s **/repro** pattern:

  • Agent reproduces the bug and records a video, then fixes and records an “after” video .
  • Cursor says this moves some bug classes from “hard to repro locally” to “merge in ~90 seconds” .

3) Parallelism you can actually review: Best-of-N via 20s videos

  • Cursor says demo videos made them use best-of-N more often because reviewing four 20-second videos is manageable vs reviewing 4× giant diffs.

4) Slack as the “new IDE” surface (team workflows)

  • Cursor engineers describe Slack threads as a dev surface: you can @cursor in issue/product channels to kick off a cloud agent; teammates can “follow up” in-thread with more context .
  • They say the human discussion shifts to the high-order decisions (“do we ship this?”, “is this the right UX?”) while the agent handles implementation .

5) Subagents for context + compute management

  • Cursor highlights subagents as a way to delegate across prompts/goals/models and keep context manageable .
  • Example: an explore subagent can be routed to a faster model to read lots of code quickly, then summarize back to the parent agent .

6) Long-running agent mode (“grind mode”)

  • Cursor describes a long-running mode (“grind mode”) that aligns on a plan first, then grinds until criteria are met—potentially for days .

7) “Meta-setup” is becoming its own benchmark (Karpathy)

  • Andrej Karpathy says he has agents iterating on nanochat automatically: agents work on feature branches, try ideas, merge improvements, and iterate .
  • In one snapshot he reports 110 changes in ~12 hours reducing validation loss from 0.862415 → 0.858039 (d12 model) with no wall-clock penalty .
  • He calls the real benchmark: “what is the research org agent code that produces improvements on nanochat the fastest?.

8) Let the model improve the model (Hanson Wang’s GPT-5.4 workflow)

  • Hanson Wang says he asked GPT-5.4-xhigh in Codex to autonomously iterate on Codex’s own system prompt; it ran >17 hours, executed 200+ evals, wrote scripts to monitor eval progress, and pruned unpromising branches .

9) Skills need evals (not vibes): LangChain’s skills benchmarking loop

  • LangChain’s Robert Xu outlines an evaluation pipeline: define tasks + define skills, run with/without skills, compare, iterate .
  • Reported outcome (their tests): Claude Code completed tasks 82% of the time with skills vs 9%without skills .
  • Practical detail: they stress consistent clean environments (they used a lightweight Docker scaffold) for reproducible agent tests .

10) Manual testing is still non-negotiable (and agents can help)

  • Simon Willison: “Just because code passes tests doesn’t mean it works as intended… Automated tests are no replacement for manual testing.
  • He recommends having agents execute what they wrote (e.g., Playwright for UI testing) instead of assuming correctness .
  • For evidence, Willison’s Showboat pattern records commands + outputs to discourage agents from writing what they hoped happened .

11) Security footgun: prompt-injected CI agents + cache poisoning (Cline)

  • Cline ran an issue-triage workflow using anthropics/claude-code-action@v1 on every newly opened GitHub issue with --allowedTools "Bash,Read,Write,...".
  • Because the workflow prompt included the untrusted issue title, an attacker could prompt-inject tool execution and use GitHub Actions cache behavior to poison shared caches and steal release secrets, leading to a compromised cline@2.3.0 release (later retracted) .

👤 PEOPLE TO WATCH

  • Jonas Nelle + Samantha Whitmore (Cursor) — unusually specific harness design details: test-first PRs, video review entrypoint, Slack-as-IDE, subagents, and long-running “grind mode” .
  • Michael Truell (Cursor) — adoption signal: Automations running thousands/day internally, including “compute-intensive security review” and team memory .
  • Hanson Wang (OpenAI/Codex) — concrete “agent improves agent” workflow (17h autonomous system-prompt iteration with 200+ evals) .
  • Andrej Karpathy — framing shift: optimize the agent org (meta-setup) and measure “time-to-improvement” loops .
  • Simon Willison — high-signal practical guidance across (1) agentic manual testing and (2) real-world agent CI security failures.
  • swyx — pushes for better rigor + tooling around agent reliability, including an open-sourced Claude compaction viewer for diagnosing bad compactions and a reminder that statistically meaningful SWE-bench comparisons can require 30–60x more compute than cheap samples .

🎬 WATCH & LISTEN

1) Cursor Cloud Agents: test + video + remote desktop as the new review loop (≈02:23–05:33)

Hook: why video is the “entry point” for reviewing agent output, and how remote desktop/terminal access closes the loop on real verification.

2) Slack as the collaboration surface for agents (≈20:57–23:26)

Hook: how agent threads + team follow-ups shift human work from “where does this if-statement go?” to product/UX decisions.

📊 PROJECTS & REPOS


Editorial take: Today’s theme is throughput via autonomous + parallel agents—and the tax you can’t dodge is verification (tests + manual evidence) and security boundaries around what those agents are allowed to touch.

Stateful agent runs (WebSockets), Codex on Windows, and skills-driven harnesses
Mar 5
6 min read
110 docs
Cursor
Peter Steinberger
Boris Cherny
+12
A dense brief on what’s actually moving the needle for coding agents: OpenAI’s WebSockets approach to cut tool-call overhead, Codex’s new Windows app + sandboxing, and the growing “skills + traces + evals” ecosystem that turns agents into repeatable workflows. Plus: production patterns from Anthropic’s Claude Code and hard-earned PR hygiene rules for agent-generated code.

🔥 TOP SIGNAL

OpenAI’s new WebSockets API for agentic runs is a real infrastructure unlock: keep a persistent connection to the same server so you can send only new inputs (e.g., tool results) instead of resending the entire conversation history on every tool call. Theo estimates this cuts bandwidth by 90%+ and improves speed by 20–30% (and 20–40% on runs with 20+ tool calls) .

🛠️ TOOLS & MODELS

  • OpenAI — WebSockets for tool-call-heavy agents

    • Why it matters: in the typical stateless flow, every tool completion triggers a new API call that resends all prior messages/tool calls so the model can continue .
    • WebSockets are positioned as a “hit the same box” guarantee, so you don’t keep re-checking auth / reloading state / reshipping context during a single long generation .
    • Practical caveat: Theo says the benefit is not huge for typical chat, but is big when one user message spawns hundreds of tool calls.
  • OpenAI Codex app — now on Windows (native + WSL)

    • Available on Windows with a native agent sandbox and PowerShell support .
    • Runs natively and in WSL with integrated terminals (PowerShell, Command Prompt, Git Bash, WSL) .
    • Sandbox controls: blocks filesystem writes outside your working folder and blocks outbound network access unless you explicitly approve it .
    • Adds 2 Windows skills (WinUI + ASP.NET) and 7 new “Open in …” apps.
    • Download: https://apps.microsoft.com/detail/9plm9xgg6vks?hl=en-US&gl=US
  • Codex (Plus/Pro) — rate-limit promo bug fixed

    • OpenAI fixed an issue where the 2× promotional limit increase wasn’t applied to an estimated 9% of Plus/Pro users; they reset rate limits for all Plus/Pro as compensation .
  • Cursor — now in JetBrains via Agent Client Protocol

  • LangChain — “skills” packages for coding agents (progressive disclosure)

    • LangChain skills: 11 skills across LangChain/LangGraph/Deep Agents, intended to be dynamically loaded only when relevant to avoid tool overload degrading performance .
    • Claimed eval bump for Claude Code on LangChain ecosystem tasks: 29% → 95%. Repo: https://github.com/langchain-ai/langchain-skills
  • LangChain — LangSmith CLI + Skills

    • LangSmith CLI is described as “agent-native” for traces/datasets/experiments, designed to be used through the terminal .
    • Claimed eval bump for Claude Code (Sonnet 4.6) on LangSmith tasks: 17% → 92%.
    • CLI repo: https://github.com/langchain-ai/langsmith-cli
  • Codex 5.3 (xhigh) — notable model-level win vs Opus 4.6 (anecdote)

    • Mitchell Hashimoto reports Codex 5.3 (xhigh) fixed a bug that had resisted engineers for 6 months in 45 minutes for $4.14; he notes Opus 4.6 failed and lower Codex reasoning levels failed .
    • He says a key difference was Codex (xhigh) eventually read GTK4 source code, which other runs didn’t do .
  • Qwen 3.5 — open-weight model family (practitioner testing signal)

    • Simon Willison notes Qwen 3.5 shipped a large model (397B-A17B) plus smaller siblings down to 0.8B .
    • He reports positive results for coding from 27B/35B, and that 9B/4B/2B were “notably effective” given size .

💡 WORKFLOWS & TRICKS

  • Run parallel “plan mode” tabs, then let the agent one-shot implementation (Anthropic / Claude Code)

    • Boris Cherny describes a workflow of running multiple Claude Code instances in parallel: start in plan mode, iterate to get the plan right, then let it implement (often “one shot”) .
    • He also leans on desktop app worktree support for environment isolation so parallel agents don’t interfere .
  • Make the agent test itself (and still keep a human approval gate)

    • Boris says Claude Code will often run tests locally and may write new tests; when they change Claude Code internally, it will even launch itself as a subprocess to test end-to-end .
    • Anthropic runs Claude Code review in CI as a first-pass reviewer, catching “maybe ~80% of bugs,” followed by a human reviewer and final human approval .
  • Cheap-but-effective codebase search: “glob + grep” beats fancy setups (per Boris)

    • Boris says their “Agentix Search” outperformed everything, and clarifies it’s basically glob and grep.
  • Use uncorrelated context windows + subagents as “test-time compute” (Agent Teams / swarms)

    • Boris explains “uncorrelated context windows” as multiple fresh contexts that don’t share the parent window (beyond the prompt), and says throwing more tokens at uncorrelated windows can yield better results—calling it a form of test-time compute.
    • Their Agent Teams release is opt-in / research preview because it uses “a ton of tokens,” and is intended for complex tasks.
  • Skills as procedural memory: keep the base prompt smaller, load expertise only when needed

    • LangChain frames skills as curated instructions/scripts/resources that are dynamically loaded through progressive disclosure (retrieve only when relevant) .
    • Their LangSmith “virtuous loop” is explicitly: add tracing → generate traces → build datasets → run evaluators → iterate based on evals + human feedback .
  • Prompting pattern: force the model to surface missing assumptions

    • Peter Steinberger treats agent use as a conversation and repeatedly asks: “Do you have any questions?” to avoid the model charging ahead with default assumptions .
    • His warning: the “agentic trap” is spending time over-optimizing your setup—it can feel productive without improving output .
  • PR hygiene: don’t dump unreviewed agent code on teammates

    • Simon Willison’s anti-pattern: opening PRs with hundreds/thousands of agent-generated lines you haven’t reviewed is delegating the real work to reviewers .
    • What “good” looks like: ensure it works (and you’re confident), keep changes reviewable (multiple small PRs), include context/links, and review the agent-written PR description too.
    • Add evidence you tested it (notes/screenshots/video) to avoid wasting reviewer time .

👤 PEOPLE TO WATCH

  • Theo (t3.gg) — consistently strong at turning infra changes into concrete agent cost/perf implications (his WebSockets breakdown is the clearest “why now” explainer) .
  • Boris Cherny (Anthropic / Claude Code) — high-signal production details: he claims Claude Code writes ~80% of Anthropic’s code, and describes CI review + self-testing patterns that keep velocity safe .
  • Mitchell Hashimoto — practical model comparison under real pressure: a 6‑month bug solved by Codex 5.3 (xhigh) where other settings and Opus 4.6 failed .
  • Simon Willison — the anti-pattern chapter is “social scalability” for agentic coding: ship reviewable, evidenced PRs, not agent slop .
  • Kent C. Dodds — clear framing that “pit of success” needs to be adapted for agents; he claims agents have “inhuman abilities” to understand code .

🎬 WATCH & LISTEN

1) WebSockets: why stateless tool loops spam full-context payloads (Theo, ~04:33–08:29)

Hook: a crisp mental model for why every tool call resends the entire history—and why caching doesn’t fix bandwidth.

2) Agent Teams + “uncorrelated context windows” as test-time compute (Boris Cherny, ~1:15:31–1:18:00)

Hook: a practical explanation of why multiple fresh context windows + subagents can outperform “more tokens in one window,” and why Teams is opt-in (token cost).

📊 PROJECTS & REPOS


Editorial take: Today’s theme is harness > model: stateful sessions (WebSockets), skills-as-procedural-memory, and reviewable evidence are what turn “agent potential” into repeatable throughput.