ZeroNoise Logo zeronoise

Coding Agents Alpha Tracker

Active
Public Daily at 8:00 AM Agent time: 8:00 AM GMT+00:00 – Europe / London

by avergin 110 sources

Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.

Cloudflare Pushes Code-Writing Agents as Claude Code and Cursor Tighten the Loop
Mar 25
6 min read
158 docs
Claude
Figma
Riley Brown
+12
Cloudflare's Dynamic Workers and Code Mode signal a more important shift than another model release: coding agents are moving from brittle tool menus toward fast, disposable code execution. Also inside: Claude Code auto mode, Cursor's Composer 2 report, Riley Brown's Codex/Xcode workflow, and production multi-agent patterns worth stealing.

🔥 TOP SIGNAL

The sharpest infra shift today is Cloudflare's push toward code-writing agents. Dynamic Workers are now in Open Beta with secure sandboxes that start ~100x faster than containers and use 1/10 the memory, and Kent C. Dodds says he's already using them to power his own AI assistant with CodeMode . The companion Code Mode thesis matters just as much: expose tools as an SDK and let the model write code to select and compose them, instead of stuffing more JSON-style tools into the context window .

"Agents should interact with the world by writing code, not tool calls."

🛠️ TOOLS & MODELS

  • Claude Code — auto mode. Claude can now decide file-write and bash permissions on the user's behalf, with safeguards checking each action before it runs. That replaces the old choice between constant approvals and --dangerously-skip-permissions.
  • Cursor Composer 2 — training report out. Cursor says Composer 2 was built with three main efforts: continued pretraining, reinforcement learning, and benchmark development to emulate the Cursor environment. The practical takeaways: continued pretraining kept improving downstream coding performance, the RL phase was critical enough that simple approaches worked best broadly, and CursorBench is their internal benchmark for more realistic software engineering problems. The report also covers distributed training / RL infrastructure and kernels they open-sourced. Read: Composer 2 report.
  • Cloudflare Dynamic Workers + Code Mode. Dynamic Workers entered Open Beta for paid Workers users, with disposable sandboxes that start ~100x faster than containers and use 1/10 the memory. The adjacent Code Mode posts surfaced three agent-facing building blocks: codemode for chewing through multiple tool calls in one/few shots, shell for a local filesystem backed by DO/D1/SQL/R2, and worker-bundler for packaging npm deps and serving full apps. Read: Dynamic Workers and Code Mode.
  • Cursor + Figma. Cursor can now create new components and frontends in Figma using your team's design system, including variables, tokens, and naming conventions. This landed alongside Figma's open beta for AI agents designing directly on canvas via the use_figma MCP tool .
  • Devin as reviewer, not just author. The clearest practitioner comparison today came from Gauri Gupta and swyx: both say Devin code review catches bugs Claude and Codex miss, and swyx says the review pass saves him 3-8x/day .

💡 WORKFLOWS & TRICKS

  • Replicable mobile-agent build loop from Riley Brown

    1. Create the app shell in Xcode, then open the top-level project folder in Codex and start a dedicated thread .
    2. Paste the full product spec and hit Shift+Tab for Plan mode .
    3. Run design in parallel in Paper.design via MCP while Codex builds .
    4. Steer inline while the thread is running, then paste API keys, Xcode logs, and screenshots back into the chat when builds fail .
    5. Test on a physical iPhone, and keep the generated app preview open so you can keep iterating without leaving the app .
    • Bonus: Brown adds Whisper-based voice input so he can keep refining the preview without opening the keyboard .
    • Timeless pattern: split planning, design, implementation, and debugging across parallel agent threads instead of forcing one giant chat .
  • Turn prompts into loops, not one-shots

    • Karpathy's Auto Research runs on a git feature branch: propose experiment, write code, run it, analyze results, commit improvements, repeat .
    • Matthew Berman says he applies the same shape to overnight fine-tuning of small open models against an Opus 4.6 baseline until a local model wins .
    • Minimax's M2.7 harness adds skills, memory, guardrails, and eval infra; it handled 30-50% of the workflow and reported a 30% performance gain on evals .
    • Anthropic says similar Claude loops now power almost all of its major agent loops, with humans reviewing solutions before final refinements .
  • Production pattern worth stealing from Moda

    • Use a cheap Haiku triage node to classify the job and preload task-specific Markdown skills .
    • Inject those skills as human messages, with caching breakpoints after the system prompt and after the skills block .
    • Keep only 12-15 core tools always in context; activate ~30 more on demand with RequestToolActivation.
    • For big projects, pass a high-level summary and let the agent pull details through tools; then use traces to tune prompt/tool changes, cost, cache hits, and failures .
  • Add a critic, not just another worker. swyx says he won't merge without a Devin review now, because it catches bugs Claude and Codex miss and acts like fresh eyes on a PR. Related design note: he points to Windsurf's "smart friend" pattern as a model for subagents that are more critical instead of weaker .

👤 PEOPLE TO WATCH

  • Kent C. Dodds — strongest practitioner confirmation on today's Cloudflare story: he's been using Dynamic Workers for a week to power his own AI assistant, and separately backs the "tools as SDK + agent-written code" pattern as the more effective direction .
  • Riley Brown — one of the most concrete end-to-end Codex workflows in the wild right now: Xcode + Codex + Claude Agent SDK + Vibe Code CLI + Paper.design, with screenshots and logs fed back into the loop .
  • swyx — short posts, strong signal: Devin review outperforming Claude/Codex in his day-to-day, plus the "smart friend" subagent pattern callout .
  • Ravi Parikh / Moda — rare production detail instead of benchmark chatter: triage, dynamic tool loading, prompt caching, context scaling, and tracing from a real multi-agent design product .
  • Anthropic Engineering — new post on a multi-agent harness for frontend design and long-running autonomous software engineering. Worth reading because it's from the team running the stack, not a third-party teardown: harness-design-long-running-apps.

🎬 WATCH & LISTEN

  • 1:56-4:28 — M2.7's harness loop. Best compact explanation today of the human/harness/agent split: human steers direction, the harness supplies skills/memory/guardrails/evals, and the agent writes, runs, analyzes, and improves experiments .
  • 3:36-5:43 — Riley Brown's Codex setup for "Jerry." Watch this if you want a replicable "agent builds the app that builds apps" flow: create the Xcode project, point Codex at the right folder, then fan out into parallel threads .
  • 25:16-26:17 — In-app preview + voice iteration. Shorter clip, but high leverage: Riley tweaks the generated app while the preview is open and adds Whisper-based voice input so the iteration loop never leaves the phone UI .

📊 PROJECTS & REPOS

  • karpathy/auto-research — GitHub repo for autonomous research loops on a git feature branch. Practical signal: Berman frames it as bringing frontier-style experiment loops to solo developers, and points to an overnight run that reached the fastest reported time to train one of these models .
  • langchain-ai/deepagents — better adoption signal than most new repos because Moda is already using Deep Agents in production for its Research Agent and Brand Kit Agent, with the Design Agent migration under evaluation. Repo: Deep Agents.

Editorial take: today's edge is loop design — faster execution sandboxes, smaller always-on tool sets, explicit review passes, and tighter human feedback beat one giant prompt.

Claude Moves to the Desktop as T3 Code, Cursor, and LangSmith Sharpen the Loop
Mar 24
5 min read
100 docs
Alex Albert
Claude
LangChain
+9
Anthropic's Claude computer-use preview is the headline, but the sharper practitioner signal is the support stack around it: official CLI-based clients, faster repo search, and webhook-driven handoff for long-running agents. This brief also covers CodexBar 0.19.0, OpenClaw's latest beta, and the concrete workflows worth copying.

🔥 TOP SIGNAL

Anthropic pushed Claude past the repo window: the official Claude account says the new macOS research preview can open apps, navigate browsers, and fill spreadsheets in Claude Cowork and Claude Code, while Boris Cherny said Anthropic Labs is releasing full computer use in Cowork and Dispatch .

Elsewhere, teams attacked the adjacent bottlenecks: T3 Code used the official Claude CLI, community contributors added browser control to its open-source UI, Cursor cut search latency across huge codebases, and LangSmith showed a webhook flow for long-running agents .

"The future where I never have to open up my laptop to get work done is becoming real very fast"

🛠️ TOOLS & MODELS

  • Claude computer use (research preview) — Claude can now use your Mac to open apps, drive the browser, and fill spreadsheets. Officially this is a research preview in Claude Cowork and Claude Code on macOS; Boris Cherny said the release marks full computer use in Cowork and Dispatch, and noted the early Sonnet 3.6 prototypes were clunky but already showed the use cases .
  • T3 Code + Claude Code subscriptions — If Claude Code is already installed locally, Theo says you can just run npx t3 or use the T3 Code app; it talks to the local Claude Code CLI through Anthropic's Agent SDK, with no extra auth screen or API-key setup inside T3 Code . Theo contrasts that with OpenCode's dropped Claude Max plugin, which he says relied on its own harness, custom auth flow, and faked headers . He also calls out the economics: the Claude Code subscription is $200/month for up to $5,000 of compute .
  • Cursor Instant Grep — Cursor says it can search millions of files and return results in milliseconds, which directly speeds up agent task completion. They also published a build writeup covering the algorithms and tradeoffs; Jediah Katz called it singular technical work and said this is why alternatives feel slow . Writeup: cursor.com/blog/fast-regex-search.
  • CodexBar 0.19.0 — New release adds Alibaba Coding Plan support, subscription history charts, Cursor Total/Auto/API dashboard alignment, Codex code-review reset times, and a broader Claude stability/refactor pass . Release notes: v0.19.0.

💡 WORKFLOWS & TRICKS

  • Async completion alerts for long-running agents — Hari's LangGraph/LangSmith flow is clean and reusable:
    1. Clone the Deep Research example from LangChain's Deep Agents repo .
    2. Create webhook.py with a FastAPI route that receives the LangSmith payload, reads payload.values.messages[-1].content, and POSTs that final AI message to a Slack webhook .
    3. Register the FastAPI app under the HTTP app field in langgraph.json, then run langgraph dev.
    4. Create a background run with your thread ID, assistant ID set to research, an input message, and the webhook URL; the result is a Slack summary plus the full report in LangSmith tracing . Timeless pattern: don't poll long jobs—ship a webhook and move on . Docs: LangSmith webhooks.
  • Route models by task, not by brand loyalty — Theo says he uses 54 for most coding, then opens a new thread and switches to Claude for UI passes, quick tidy-ups, and small changes . The constraint matters: once you pick Claude Code for a thread, he says you can't switch harnesses mid-thread because the thread state, compaction, and related data are tied to that thread in the cloud . Practical takeaway: treat thread boundaries as routing boundaries.
  • Use Codex review as triage, not final judgment — Peter Steinberger's PR loop is blunt: let Codex find issues, ask whether the issue is actually clear, ask whether the proposed fix is the best possible one, then continue the tradeoff discussion and usually rewrite the PR . His warning is the timeless part: overly local fixes make the codebase unmaintainable .

👤 PEOPLE TO WATCH

  • Boris Cherny — high signal because he is speaking from the Anthropic Labs shipping team. He says that team shipped MCP, Skills, Claude Desktop, and Claude Code, and is now rolling out full computer use .
  • Theo — worth tracking because he is both shipping T3 Code and publishing the integration details: official CLI vs custom harnesses, subscription economics, and how he routes models across threads in daily use .
  • Peter Steinberger — useful today for three separate practitioner signals: CodexBar 0.19.0, a concrete Codex PR-review loop, and OpenClaw plugin/release activity .
  • Jediah Katz — short post, strong signal from someone building Cursor's agent: Instant Grep is why other tools feel slow .
  • Hari from LangChain — useful if you care about deployment mechanics, not just model chatter. Today's video walks through a full webhook-driven completion flow end to end .

🎬 WATCH & LISTEN

  • 2:00-4:29 — Build the Slack webhook handler. Hari shows the exact FastAPI route, the payload shape, and the one field that matters most: the final message at values.messages[-1].content.
  • 5:24-7:11 — Kick off a background run with a webhook URL. This is the concrete API/docs walkthrough: create a thread, call background run creation, pass the webhook endpoint, and wait for the Slack ping instead of babysitting the job .
  • 12:47-13:15 — Why T3 Code built a harness abstraction. Theo explains the real integration problem: every CLI exposes events differently, so supporting multiple providers means normalizing their weirdness instead of pretending the harness layer doesn't matter .

📊 PROJECTS & REPOS

  • T3 Code — The open-source UI keeps picking up contributions: a community contributor added browser integration, terminal support is next, and the main app now supports Claude Code subscriptions through the local CLI path .
  • OpenClaw — New beta v2026.3.22-beta.1 is out. Separately, Harold connected Codex App Server to OpenClaw via plugins, and steipete highlighted that as a plugins story worth watching . Release notes: v2026.3.22-beta.1.
  • Deep Agents repo — LangChain's webhook demo uses the Deep Research example from this repo; if you want to copy the same background-run pattern, it's the repo Hari recommends cloning locally .

Editorial take: today's edge wasn't a benchmark bump; it was better plumbing—desktop control, faster search, official harnesses, and async completion hooks that make agents usable in real workflows.

Agent-First Coding Spreads as Teams Tighten Tests, Context, and Permissions
Mar 23
5 min read
61 docs
Tibo
swyx
David Heinemeier Hansson (DHH)
+6
Practitioners are moving from coding-agent experiments to agent-first workflows: DHH starts new customer work with agents, Tibo says Codex is helping refactor Codex, and the strongest tactics today were faster harnesses, explicit Skills for fresh releases, and safer enterprise rollout patterns. Also inside: Cursor Composer 2 pricing and speed details, GPT-5.4’s frontend gap, and the clips worth watching.

🔥 TOP SIGNAL

The strongest signal today: coding agents are becoming the default starting point for real work, not just a side tool. DHH says all new customer work now starts with agents that he steers and calibrates, while OpenAI engineer Tibo Sottiaux says the Codex team is using Codex to help refactor an end-to-end systems rethink that would otherwise take months .

The practical edge is shifting away from raw model IQ and toward loop design: faster harnesses, fresher context, and safer deployment controls .

🛠️ TOOLS & MODELS

  • Cursor Composer 2 — Cursor’s new code-only model is built from open-weight Kimi 2.5 and then heavily post-trained/RL-tuned on Cursor’s own data and harnesses . Theo says it beats Opus on multiple coding evals, including Terminal Bench 2 in the 4.5-4.6 range, while running at 80-100 tokens/sec and pricing at $0.50/M input and $2.50/M output. Cursor is going through Fireworks for inference, and Theo says that path handles the attribution/license requirements for large-scale commercial use of the Kimi base .
  • GPT-5.4 — Theo’s field report is blunt: it’s an incredible model for coding, but still a generation behind on frontend design. His read on OpenAI’s frontend best-practices post: useful advice, but not proof the design gap is solved .
  • Claude Skills / Claude Code for web — Simon Willison used Claude Skills to teach Claude the minor breaking changes in Starlette 1.0, because the model wasn’t familiar with the release yet . His caveat: Claude chat has an “add to skills” flow, but Claude Code for web apparently does not .
  • Devin — swyx says Devin usage has grown >50% MoM every month this year. More important than the growth chart: his deployment note that enterprise rollouts need permission models that won’t terrify compliance and IT teams across 10,000s of engineers.

💡 WORKFLOWS & TRICKS

  • Let the agent write the first draft. Keep yourself on steering. DHH says he writes no fresh customer code himself now; new work starts with agents, and he handles direction and calibration .
  • Patch fresh-release blind spots with explicit context. If the framework version is newer than the model’s knowledge, write the breaking changes into a Skill or context file before asking for edits. Simon did this for Starlette 1.0. Read: Simon’s writeup.
  • Use agents as ops translators, not just coders. DHH says he injects agents into Linux systems and uses them constantly to decode obscure error messages. If you know “some Linux” but not enough to debug quickly, this is a high-ROI use case .
  • Speed up the harness before you scale the loop. Peter Steinberger focused on tests and cut OpenClaw’s harness runtime from roughly 10 minutes to 2 minutes. Faster evals mean more agent iterations per day and less dead time between runs.
  • Split logic and tests into separate files/domains before you unleash automation. Geoffrey Huntley hit 50 open PRs from automation and says merge conflicts become a major source of waste if logic and tests are entangled .
  • Give non-engineers agent access — but only with safe permissions. swyx argues designers should get direct access to coding agents, and extends that to PMs and analytics via Slack-style workflows . Pair that with enterprise-safe permission controls, not shortcut flags, if you expect the setup to survive real IT review .

"Give your designer access to your coding agent. It is imperative..."

👤 PEOPLE TO WATCH

  • David Heinemeier Hansson — high-signal because this is long-time operator commentary, not bench-racing: agent-first for new customer code, heavy use in Linux ops, and experiments with hold-to-talk voice input in his Linux setup .
  • Tibo Sottiaux — short post, big signal: the Codex team is using Codex to help refactor its own system during an end-to-end scalability rethink .
  • Simon Willison — still one of the best examples of practical context management. Today’s lesson: when model knowledge lags a fresh OSS release, teach the model explicitly before trusting edits .
  • Theo — worth tracking because he separates “strong coding model” from “strong frontend model,” and his Composer 2 breakdown added concrete cost/speed details instead of generic hype .
  • swyx — useful pulse on where coding agents are spreading inside orgs: designers, PMs, analytics teams, and enterprise deployment staff — not just core engineers .

🎬 WATCH & LISTEN

  • 15:25-15:57 — DHH on using agents as Linux translators. Short clip, practical point: this is the cleanest real-world case today for using an agent to decode obscure infra errors when you’re not a deep Linux expert .
  • 49:54-50:20 — DHH on hold-to-talk voice prompting. He describes a voice-to-model flow inside his Linux setup where a button press turns dictation into clean text. Worth watching if your next bottleneck is input speed, not model quality .

📊 PROJECTS & REPOS

  • OpenClaw — the strongest repo signal today was operational, not social: Peter Steinberger got the harness from ~10 min to ~2 min by focusing on tests . Separate signal: another user used Claude Code plus Google’s live browser control to interact with the OpenClaw web dashboard for debugging .
  • Starlette 1.0 — not an agent project, but a useful OSS release case for agent users: Simon had to explicitly teach Claude the 1.0 breaking changes because the model lagged the release . Expect this pattern on newly shipped framework versions.

Editorial take: the edge is moving from “which model is smartest?” to “who has the tightest loop” — fast tests, explicit fresh context, safe permissions, and agents in more hands .

Git-First Agent Workflows and Harder Test Prompts Take the Lead
Mar 22
4 min read
70 docs
Theo - t3.gg
Yuchen Jin
Salvatore Sanfilippo
+5
The sharpest signal today came from practitioners tightening the loop around coding agents with tests, Git context, and clearer ownership. Also inside: Claude Code web's repo limit, Codex vs. Claude commit attribution, and the clips worth your time.

🔥 TOP SIGNAL

The strongest practical signal today: agent performance is still mostly a scaffolding problem. Simon Willison says tests, docs, CI/CD, and clean code make agents work better—and his own loop starts with uv run pytest; Salvatore Sanfilippo says generic "write tests" prompts miss the hard stuff, and recommends explicitly asking for edge cases, fragile implementation details, and random testing against a simpler reference implementation . Willison's follow-on warning matters just as much: code review is now the bottleneck, while cognitive debt remains unsolved .

🛠️ TOOLS & MODELS

  • Claude Code for web — current repo-auth ceiling: Simon says one session can't check out two private repos at once because Git operations go through a local proxy that only authenticates the repo attached to the session. He also says the docs don't mention this .
  • Claude Code vs Codex — commit metadata means adoption signals can lie: Claude Code auto-adds itself as a co-author on every commit; Codex doesn't. OpenAI engineer Tibo Sottiaux says Codex is designed so the user remains the owner and accountable party, even though that makes repo-level usage harder to observe .

"it exists to help you and it’s important that you remain the owner and accountable for your work without AI taking credit."

  • T3 Code vs Claude Code CLI — creator-posted RAM snapshot: Theo says T3 Code used 350.9 MB vs 635.5 MB for Claude Code CLI in his screenshot, and framed that as roughly 2x better efficiency .
  • Routing pattern worth copying: Matthew Berman describes a 3-tier stack—frontier models for exploratory work, Sonnet-class models for most execution, and local/fine-tuned models once a narrow workflow is ready for production. His own example was using Opus for front-end/HTML work; Jaden Clark described using a cheaper/default model for small personal tools where speed and cost matter more than max capability .

💡 WORKFLOWS & TRICKS

  • Bootstrap a session in 3 moves:(1) run uv run pytest, (2) ask for "recent changes" or "last three commits" so the agent runs git log, (3) only then split into 2-3 parallel sessions.
  • Use Git as an agent power tool, not just a backup: Ask for git status when the repo is messy—Willison says he uses that prompt surprisingly often—then let the agent work through conflicts with tests. For archaeology, have it search the reflog or other branches for lost code, or run git bisect; for cleanup, ask it to rewrite history with git reset --soft HEAD~1, split/combine commits, or extract a library into a new repo while preserving history .
  • Ask for adversarial tests: Tell the model to stress limit conditions and fragile implementation details, and to use random testing plus a simpler in-test reference implementation to check invariants. Sanfilippo says even a small wording change can strongly steer the model, and the resulting tests become guardrails for both AI-written changes and future refactors .
  • Assume review is the scarce resource: Faster generation just moves the pain to review. Willison's warning is blunt: code review is now the biggest slowdown, and "cognitive debt" is still unsolved .

👤 PEOPLE TO WATCH

  • Simon Willison — published the first draft of Using Git with coding agents. Why it matters: it turns Git from a safety net into an active agent workflow for context loading, debugging, conflict recovery, bisecting, and history rewriting .
  • Salvatore Sanfilippo — Redis creator; today's high-signal contribution was a prompt pattern for stronger tests that targets brittle implementation details instead of shallow happy-path coverage .
  • Tibo Sottiaux — useful because he's surfacing product philosophy from inside Codex: ownership and accountability over brand visibility in commit history .
  • Theo — worth tracking if you care about coding-agent UX tradeoffs; he keeps posting blunt first-party comparisons while shipping T3 Code .

🎬 WATCH & LISTEN

  • 14:39-17:35 — Hard-test prompting that actually changes model behavior. Sanfilippo explains why "write tests" is too generic, and shows how to request edge-case stress plus random testing against a simpler reference implementation .
  • 1:13:24-1:16:46 — The sim-to-real warning for local/fine-tuned agents. Shaw Walters says harness-specific data can improve narrow tasks quickly, but may not transfer back to broader benchmarks and can even narrow the model's capability space .

📊 PROJECTS & REPOS

  • ELIZA OS — worth watching for routing and safety questions. Walters describes it as an open-source framework for building agents, games, and applications, with deployments ranging from an 8B quantized model up through Sonnet and Opus; he also says security is still the blocker for unsupervised browser + shell agents . Adoption signal: the show introduced it as "the most widely used open source framework for building autonomous agents" .
  • Sentient Arena / EVO Skill — still pre-results, but the setup is concrete: the first arena uses Office QA for enterprise-style reading, calculation, and document analysis, and the first cohort closes in the first week of April. The notable mechanic is multi-proposal skill evolution from eval feedback; the team says that setup currently does much better with Opus + Claude Code-style workflows than with open harnesses/open models .

Editorial take: today's real edge was not a flashy new model—it was stronger guardrails around the ones we already have: tests first, Git history in context, and clear human ownership of the output .

Karpathy Stops Typing Code as Orchestration Becomes the New IDE
Mar 21
5 min read
104 docs
Andrej Karpathy
Sarah Guo
Addy Osmani
+16
Karpathy's near-100% delegation was the clearest workflow signal today, and multiple practitioners now agree the developer workspace is shifting from one editor window to orchestration surfaces built for parallel agents. Also inside: Cursor Composer 2's disclosed training stack, honest Codex vs. Opus field notes, and the most copyable workflow patterns from people shipping with agents.

🔥 TOP SIGNAL

Andrej Karpathy says his day-to-day has already crossed from AI pair programming to operating a small fleet: he hasn't typed code since December, now delegates non-interfering features to parallel agents, and thinks in "macro actions" over repos instead of line edits . The bigger pattern is showing up from multiple angles: Addy Osmani argues the IDE is being de-centered into orchestration surfaces, and Theo says current editors break down because agentic work spans multiple projects, terminals, browsers, and worktrees at once .

🛠️ TOOLS & MODELS

  • Cursor Composer 2: built on Kimi k2.5, which Cursor says was the strongest base on its perplexity-based evals. Cursor then did continued pretraining plus a 4x high-compute RL scale-up on top, using Fireworks for RL and inference; Aman Sanger says only about 1/4 of final-model compute came from the base and full pretraining is planned later . Cursor also says it missed crediting Kimi in the initial blog and will fix that next time .
  • The control plane is becoming the product: Osmani's current stack includes Conductor, Claude Code Web/Desktop, GitHub Copilot Agent, Jules, Vibe Kanban, and cmux; his framing is that the editor is still critical, but no longer the front door . He also flags Claude Code's new Swarm/agent-teams direction and notes that developer reaction to Cursor Glass was basically "this feels more like an agent orchestrator than an IDE" .
  • Task-level model notes, not universal benchmarks: Theo says Opus spent over an hour on a new feature and still got the implementation entirely wrong; Codex did the same feature correctly in 15 minutes . Karpathy, meanwhile, says Claude's coding agent has a better teammate-like personality while Codex feels dry, but his latest gripe is broader than model choice: agents still bloat abstractions, copy-paste, and ignore AGENTS.md style instructions .

💡 WORKFLOWS & TRICKS

  • Run repo work in macro-actions, not prompt-by-prompt

    1. Split work into non-interfering feature chunks.
    2. Hand separate chunks to parallel agents across checked-out repos/workspaces.
    3. Use other agents for planning and research in parallel.
    4. Review output proportionally to how much you care about that path.
      Karpathy points to Peter Steinberger's setup with roughly 10 Codex agents as the visual form of this pattern; each high-effort task runs about 20 minutes, then you top them up and keep moving .
  • Treat unused quota as lost throughput: Karpathy says if one tool/provider hits quota, switch to another; his default when agents fail is not "the capability isn't there" but "bad instructions, memory, or tooling" .

  • Set objective metrics and boundaries, then get out of the way: Karpathy's AutoResearch loop improved a nanoGPT repo overnight by finding weight-decay/value-embedding and Adam-beta interactions he had missed; his Program.md is just a markdown attempt to describe how the autoresearcher should search .

  • Design for async review: the stable loop across Osmani, Theo, and Copilot-style tooling is isolated workspaces/worktrees, task-state UIs, background execution, and attention routing so humans only re-enter when an agent actually needs them .

"specify intent → delegate → observe → review diffs → merge"

  • Use model progress to change product process: @_catwu's team now plans in short sprints, builds demos/evals instead of docs, revisits "too hard" features after each model release, and removes scaffolding once new models make it unnecessary. Also: keep agentic systems as simple as possible because failures compound with complexity .

  • Make self-checks cheap: Dreamer's coding loop does plan → build → test → fix, and David Singleton says TypeScript works especially well because compile-time errors give the agent loop immediate feedback on mistakes . Theo's Kernel demo shows the same philosophy on browser auth: one cloud-browser sign-in flow, including 2FA, can then be reused across agent instances for private GitHub access .

👤 PEOPLE TO WATCH

  • Andrej Karpathy — still the highest-signal operator feed in public: near-total delegation on real repos, strong views on memory/personality, and zero sugarcoating when code quality is bad .
  • Addy Osmani — best current synthesis of the orchestration shift, because it's grounded in the actual tools he uses daily instead of a generic future-of-IDEs take .
  • Theo — worth tracking for honest task-level comparisons and for pushing the "bigger IDE" framing from complaint into product experiments like T3 Code and Kernel demos .
  • @_catwu — useful if your bottleneck is deciding what to ship in a world where model capability changes every release cycle .

🎬 WATCH & LISTEN

  • 4:03-4:54 — Karpathy on 10-agent macro-actions. Best quick mental model for parallel feature delegation across multiple repos .
  • 16:35-19:18 — Karpathy on AutoResearch. Watch this if you want the cleanest explanation of objective-metric loops and why the human becomes the bottleneck .
  • 2:20-2:56 — Addy on orchestration as the skill to learn. Fast distillation of the move from one-agent chats to fleets, coordination, and context handoff .
  • 12:38-15:14 — Theo's bigger-IDE thesis. Good segment if your workflow collapses the moment you run multiple agents across multiple projects .

📊 PROJECTS & REPOS

  • OpenClaw / ClawHub — maintainer @magicseth says ClawHub now supports 1M weekly active users on Convex; Peter Steinberger says the next push is making plugins great. Notable adoption signal for an agent platform .
  • lat.md — early agent integration for keeping spec files synced with implementation. Armin Ronacher finds it interesting, but explicitly wants proof on larger codebases before getting excited .
  • Arena + EVO Skill — Sentient's new open competition for agent harnesses is using Office QA as its benchmark and aims to generate open feedback/data about where open harnesses still lag Claude Code. EVO Skill generates multiple candidate skills from eval feedback and keeps the best .
  • Dreamer — not open source, but a project worth watching because its build loop is unusually explicit: Sidekick plans tools/data, builds, tests, exposes code/prompt internals, and exports via SDK/CLI. The platform also pays tool builders by usage and has a $10k prize for the best tool added by mid-April .

Editorial take: today's real edge wasn't a new chatbot tab — it was running more work in parallel, with cleaner isolation, explicit success metrics, and a higher skepticism level about agent-written code .

Composer 2 Arrives as Cross-Agent and Test-Hardened Workflows Mature
Mar 20
7 min read
156 docs
Keycard
Simon Willison
Salvatore Sanfilippo
+16
Cursor’s Composer 2 and Glass launch drove the release chatter, but the strongest practitioner signal was elsewhere: cross-tool agent orchestration, contained optimization loops with brutal tests, safer shell sandboxes, and honest task-by-task model comparisons.

🔥 TOP SIGNAL

Today's highest-signal workflow came from Redis creator Salvatore Sanfilippo: use LLMs to take on self-contained optimizations, not architectural sprawl . His rule is simple: first make the test suite brutally hard to pass, then use the model on a contained data-structure or algorithm change that keeps the external API stable but can materially improve speed or memory use . He argues that's where AI is genuinely changing programming economics .

🛠️ TOOLS & MODELS

  • Cursor — Composer 2: now live in Cursor, priced at $0.50/M input + $2.50/M output on standard and $1.50/M input + $7.50/M output on fast . Cursor says its first continued pretraining run improved quality and lowered cost to serve, giving it a stronger base for RL . Founders frame it as frontier-level and explicitly coding-only after a year of model-training effort . More: Composer 2 blog
  • Early Composer 2 read from practitioners: Kent C. Dodds says it is not quite as good as GPT 5.4, but it is much faster and cheaper. Theo says it is already very good , while @koylanai says it is especially strong at long, grounded, tool-mediated research/context work and beat Opus 4.6 and GPT 5.4 on a transcript-to-reading-list task . Jediah Katz adds one sleeper feature: ask Cursor about your past conversations.
  • Cursor — Glass alpha: Cursor also opened an early alpha of Glass, its simplified interface . Kent says it feels like a marriage between the web portal and the local IDE and is likely where most agentic coding tools are heading . Theo agrees the UI reset was overdue and likes the ACP support . More: Glass alpha
  • Claude surfaces are spreading: T3 Code now supports the Claude Code CLI for users who already have it installed and signed in . Anthropic also released Claude Code channels for controlling sessions through Telegram and Discord, including from your phone . At the same time, opencode 1.3.0 stopped autoloading its Claude Max plugin after Anthropic legal pressure, and the plugin was removed from GitHub / deprecated on npm .
  • Hard-debugging signal: in a real Ghostty/GTK case, Codex 53 extra high solved a bug Mitchell Hashimoto's team had struggled with for over six months from a vague prompt, while lower Codex reasoning levels and Opus 46 failed; the Opus run reportedly cost $4.14 and took 45 minutes.
  • Do not over-generalize from one model story: Simon Willison says Opus 4.5 earned his trust on familiar tasks like JSON APIs, and that Opus 4.6 / Codex 5.3 feel close to one-shot reliable for many routine jobs . Theo, meanwhile, reports letting Opus run for over an hour on a new feature only to learn 20 minutes later that the whole implementation was wrong .

💡 WORKFLOWS & TRICKS

  • Cross-agent handoff: Kent built a personal assistant agent that works across ChatGPT, Claude, Cursor, and any MCP-compatible interface . His demo workflow was practical: ask Claude in the browser to create a GitHub issue, then have it fire off a Cursor cloud agent to solve it . If you are punting work for later, he also recommends dumping all current context into the GitHub issue so resumption is trivial . Under the hood, his setup uses Cloudflare's Dynamic Worker Loader so the agent can write code, plus capability search and reusable skills .
  • Teach through repo files: Kent says linking testing-principles.md from agents.md was enough to get his agent using Symbol.asyncDispose correctly for test setup . Simon's version of the same idea is structural: start from cookiecutter templates with tests, CI, and README in place so the agent copies the right patterns from the first commit .
  • Contained optimization loop: Salvatore's playbook is worth stealing for hot paths: (1) harden the test suite until wrong code is brutally hard to sneak through, (2) let the model handle a contained algorithm/data-structure change, and (3) only pay added complexity when the win is material and the subsystem API stays stable .
  • Human 20% still matters: the Mitchell Hashimoto GTK story is a clean pattern for ugly bugs. Let the agent do the tedious repo archaeology across issues, patches, and source trees, then do the targeted code review, failure-mode questions, and cleanup yourself .

"Poor quality code from an agent is a choice that you make."

  • Never hand the agent real Bash if a fake shell will do: Just Bash gives agents a Bash-like environment in TypeScript with an in-memory filesystem inside a JavaScript VM because agents are good at shell interactions but real shell access is risky and expensive . Its defense-in-depth disables dangerous JS execution paths and checks for prototype-pollution-style escapes; the broader rule is simple: put agents in a sandboxed runtime, not your host OS .
  • Long context may be less broken than the discourse suggests: Kent says he does not see the often-repeated failure in the last 40% of context when using Cursor mostly with GPT 5.4 or long ChatGPT threads, and credits Cursor's compaction for holding up . He also notes he does not use Claude Code or Open Code much, so his exposure may be narrower .

👤 PEOPLE TO WATCH

  • Kent C. Dodds — one of the clearest operator feeds right now: cross-tool MCP orchestration, GitHub issues as context handoff, repo-guided agent behavior, and a useful counterpoint on long-context reliability .
  • Simon Willison — still the best mix of daily-driver pragmatism and security realism. He says he now writes more code on his phone than on his laptop , trusts Opus on familiar tasks , and keeps hammering on prompt injection, the lethal trifecta, and sandboxing .
  • Theo — worth following because he ships tools and does not hide the misses: positive on Glass and T3 Code's Claude support, bluntly negative when a model wastes an hour, and generally honest about where the UI is headed .
  • Salvatore Sanfilippo — the most thoughtful systems-programming take of the day. He is not talking about toy app scaffolding; he is talking about when LLMs make complex data-structure work worth attempting in production code .
  • swyx — useful security signal: he argues identity-based authorization is the key way to break the binary between HITL everything and dangerously skip permissions, and points to Keycard plus similar work from WorkOS/Auth0/Cloudflare .

🎬 WATCH & LISTEN

  • 1:30-4:35 — Codex on a six-month GTK bug: best proof today for AI as a research mule. The agent works through the issue, patches, and finally the GTK4 source before proposing the fix the other runs missed .
  • 9:11-11:35 — Salvatore on self-contained optimization: if you work near hot paths, watch this. He lays out when added complexity is worth paying now that LLMs can help shoulder implementation and corner-case load .
  • 1:45-2:28 — Simon's tiny benchmark prompt: one short prompt — run a benchmark and then figure out the best options for making it faster — got his Python WebAssembly engine a 45-49% Fibonacci speedup.

📊 PROJECTS & REPOS

  • Just Bash / Cloudflare Shell — the strongest open-project signal today. Vercel's Just Bash gives agents a Bash-compatible environment in TypeScript with an in-memory filesystem . Cloudflare's Sunil Pye praised it, Cloudflare forked it into Cloudflare Shell, and Dane says he is already using it for an internal CTO agent .
  • Showboat — Simon Willison's new tool is only about 48 hours old at recording, but the use case is excellent: agents can run manual API checks with curl and produce a Markdown log of each step and output .
  • Keycard for Coding Agents — worth watching because it targets a real failure mode: coding agents inherit your credentials and many identity systems cannot distinguish you from the agent acting in your name . swyx says Keycard now supports all coding agents and frames identity-based authz as the most important security direction here .
  • uv / ruff / ty — not new, but increasingly relevant agent tooling. Simon says fast linting and type-checking resonate with coding agents, and he has made uv run an essential part of his workflow; he is skeptical that these tools need to live inside the agent as opposed to being called by it .

Editorial take: the durable edge today was not a single model release — it was tighter loops: hard tests, contained complexity, safer sandboxes, and agents that can hand work to each other.

Codex Wins Hard Bugs as Context and Sandbox Patterns Harden
Mar 19
6 min read
122 docs
Greg Brockman
Garry Tan
Salvatore Sanfilippo
+9
The clearest signal today was a detailed Codex-over-Claude Code comparison on Redis-scale debugging. Around that, the practical edge came from thread-level context control, subagent delegation, model-specific prompt files, open runtimes, and a strong reminder that sandbox policy beats command allow-lists.

🔥 TOP SIGNAL

Today’s strongest practitioner signal: on genuinely hard debugging and optimization work, Codex is earning the more credible field reports. Salvatore Sanfilippo says that after roughly 20 hard comparisons over two months, Codex has been stronger than Claude Code; in a concrete Radix-3 optimization on a core Redis data structure, Claude Code spent an hour and still produced crashing code, while Codex identified the bitmap-shift bug immediately, fixed it, and found extra optimizations . Garry Tan separately said Codex is “GOAT at finding bugs and finding plan errors,” and Greg Brockman added that “codex has gotten very good” .

🛠️ TOOLS & MODELS

  • Codex vs. Claude Code — Sanfilippo’s tests use Claude Opus at max thinking depth versus GPT-5.4/5.3 in x-high mode. His practical conclusion: Claude Code feels faster and more agile with tools, but Codex is better when the task is actually hard .
  • Codex CLI is open source — Romain Huet highlighted that the CLI can be modified to fit your workflow, including by asking Codex to change itself. A real example: Matias/@0xmts added /progress [verbose | quiet] for 1-2 line updates that refresh in place instead of flooding the terminal .
  • OpenClaw’s model routing is highly task-specific — Matthew Berman uses Sonnet 4.6 or Opus 4.6 for main planning/orchestration, GPT-5.4 for coding fallback, Grok for search, Gemini 3.1 Pro / Gemini Deep Research Pro for video and research, and Qwen 3.5 for local models. OpenClaw can also assign different models to different threads, so frontier models stay reserved for harder work .
  • Narrow-task local replacement — Berman says a fine-tuned Qwen 3.5 9B model now handles his email labeling as well as Opus 4.6 for that use case, turning a recurring frontier-model job into a local one .
  • New open stack: Nemotron 3 + Open Shell + DeepAgents — LangChain’s demo pairs Nvidia’s Nemotron 3 supermodel with Nvidia’s newly released Open Shell runtime and the DeepAgents open-source harness. The meaningful changes are runtime policies, persistent sandboxes, GPU-oriented execution, skills/subagents, and mutable memory that lives outside the sandbox .

💡 WORKFLOWS & TRICKS

  • Split conversations by topic

    1. Create separate Telegram/Discord/WhatsApp threads for major workstreams.
    2. Keep one topic per thread so only relevant history hits the context window.
    3. Put harder coding threads on frontier models and cheaper Q&A threads on smaller ones.
      Berman says this is the main reason he avoids the memory problems other users report, and it lines up with Sourcegraph’s broader “context first” argument that teams shipping lots of AI PRs win on context, not just model choice .
  • Delegate early; keep the main agent unblocked

    1. Use the main model for planning and orchestration.
    2. Delegate coding, searches, API calls, data processing, file ops, calendar/email, and anything that will take more than ~10 seconds to subagents or harnesses.
    3. Use faster or cheaper models for simpler subagents.
      Berman’s rule is blunt: if it takes more than 10 seconds, delegate it .
  • Keep separate prompt files per model

    1. Download each lab’s prompt best-practices docs.
    2. Maintain one prompt tree per model family, such as root for Opus and a separate /gpt directory for GPT-5.4.
    3. Document the routing strategy in memory or PRD docs.
    4. Run a nightly cron to keep the prompt sets aligned on facts while still model-specific in style.
      This is Berman’s answer to the “one prompt fits all models” trap, and Kent C. Dodds’ add-on is useful: docs help agents understand high-level intent and can themselves be kept up to date by the agent .
  • Build in a code-native tool; use chat as the control plane
    Berman says he may use Telegram to operate OpenClaw, but prefers a code-native environment like Cursor to actually build and modify it because those systems are easier to read and iterate in. When he is away from a laptop, Telegram voice memos become a fast way to issue tasks and prompts without typing long messages .

  • Verification still beats vibes
    Have the agent write tests, keep code snapshotted in Git/GitHub for rollback, and back up non-code artifacts separately. Armin Ronacher’s warning is the right counterweight: agents are hard to resist, and they can put regrettable code into a codebase very quickly .

"I don’t think you can vibe yourself back to sanity with better models."

  • Sandbox the runtime; don’t trust command allow-lists
    The Snowflake Cortex exploit chain worked because cat was treated as safe even though process substitution let the agent fetch and execute attacker code from a malicious README. Simon Willison’s takeaway is to assume the agent can do anything its process can do and enforce safety with deterministic sandboxes outside the agent; Open Shell’s policy-governed runtime is a concrete implementation of that idea .

👤 PEOPLE TO WATCH

  • Salvatore Sanfilippo — high-signal because this was not toy-code benchmarking. He compared agents on a low-level Radix-3 optimization in a data structure used heavily in Redis, then explained the exact failure mode and fix .
  • Matthew Berman — dropped one of the denser public operator playbooks lately: 200+ hours of OpenClaw usage distilled into thread-level context control, multi-model routing, subagents, prompt-file sync, testing, backups, and mobile voice input .
  • Simon Willison — worth tracking for both offense and defense: he surfaced a real prompt-injection escape in Snowflake Cortex, and separately highlighted a Claude Code-driven autoresearch workflow that ran 90 experiments and produced working systems code .
  • Romain Huet — useful because he highlighted a practical behavior change, not a benchmark: Codex users are already customizing the open-source CLI to fit their own workflow .
  • Armin Ronacher — still one of the clearest anti-hype voices in the room. His point is short and important: better models don’t automatically undo bad agent-driven code decisions .

🎬 WATCH & LISTEN

  • 0:32-3:04 — Threads fix more than “memory”: Berman shows why one giant chat window interleaves topics, bloats context, and makes both the human and the agent worse. Easy habit to steal tomorrow .
  • 11:19-14:01 — The “>10s = delegate” rule: Best segment today for people building agent workflows. Berman spells out what belongs in the main planner and what should be handed to subagents or external harnesses .
  • 8:44-10:44 — Claude Code speed, Codex correctness: Sanfilippo explains the false-confidence trap directly from a real debugging session: Claude Code felt agile, but Codex found the actual bitmap bug and kept going past the fix .
  • 5:03-7:58 — Fixed system prompt, mutable memory, composite backend: LangChain’s DeepAgents walkthrough is the cleanest short architecture segment in the batch if you care how these stacks are actually wired .

📊 PROJECTS & REPOS

  • openai/codex — the open-source Codex CLI is gaining the kind of traction that matters: people are modifying it to fit their own terminal workflow, including quieter progress reporting .
  • danveloper/flash-moe — strong evidence that agentic “autoresearch” can produce real systems work. Dan Woods used Claude Code to run 90 experiments, generate MLX Objective-C and Metal code, and ship a custom Qwen3.5-397B-A17B implementation plus a paper .
  • DeepAgents + Open Shell — worth watching as an open stack because the seams are visible: model, runtime, harness, sandbox, memory, and tool loop are all explicit rather than hidden behind one product shell .
  • OpenClaw — the adoption signal here is operator time: Berman says he’s put 200+ hours and billions of tokens into the setup, and the resulting playbook is concrete enough to copy rather than admire .

Editorial take: today’s durable edge was not “pick one magic model” — it was pairing stronger models on the hard bugs with tighter context boundaries, explicit orchestration, and runtime-enforced safety.

Sandboxes, Self-Summarization, and TDD Loops Tighten the Coding-Agent Stack
Mar 18
5 min read
102 docs
Logan Kilpatrick
David Heinemeier Hansson (DHH)
Logan Kilpatrick
+11
The useful signal today was harness quality, not just model churn. New sandboxed execution layers, better long-horizon context handling, and concrete test/manual-test habits from experienced practitioners point to what actually improves coding-agent reliability.

🔥 TOP SIGNAL

Today’s clearest pattern: the harness is becoming the product. LangChain launched LangSmith Sandboxes and Open SWE around isolated execution, persistent sandboxes, curated toolsets, and workflow-native triggers, while Cursor said RL-based self-summarization cut compaction error by 50% on coding tasks that require hundreds of actions.

The practical takeaway is straightforward: safer execution plus better context compression is where reliability is improving right now—not just raw model swaps .

🛠️ TOOLS & MODELS

  • GPT-5.4 mini — now available in ChatGPT, Codex, and the API. OpenAI says it is optimized for coding, computer use, multimodal understanding, and subagents, and is 2x faster than GPT-5 mini .
  • Cursor Composer — now trained to self-summarize via RL instead of a prompt. Cursor says this cuts compaction error by 50% and improves success on long coding tasks with hundreds of actions.
  • LangSmith Sandboxes — now in private preview. Key pieces: MicroVM isolation, an auth proxy so secrets never touch the runtime, persistent long-running sessions, state carryover, tunnels, and direct integrations with Deep Agents and Open SWE.
  • Open SWE — new open-source framework for internal coding agents built on Deep Agents and LangGraph. It packages patterns LangChain says it observed across Stripe, Ramp, and Coinbase: isolated sandboxes, curated tools, Slack/Linear/GitHub invocation, AGENTS.md startup context, subagents, and middleware safety nets .
  • Operator comparison: Codex vs. Claude Code — Theo said GPT-5.4 in Codex/T3 Code quickly diagnosed mixed TanStack versions and fixed a Vite+ migration, while his Claude Code run sat for 15+ minutes without changing code .

💡 WORKFLOWS & TRICKS

  • Simon Willison’s low-drama loop: start every session by telling the agent how to run the tests, then add “use red-green TDD.” After tests pass, make it boot the server and hit the API with curl, because green tests still miss runtime failures. If you want an artifact, Showboat turns the manual test into a markdown log with commands and outputs .

"Tests are no longer even remotely optional."

  • Conformance-first implementation: have the agent build a test suite from multiple working implementations, then code against that suite. Simon used behavior from Go, Node.js, Django, and Starlette to generate multipart upload tests first, then implemented the feature in Datasette .
  • Keep AGENTS.md lean: Open SWE injects a root AGENTS.md into the system prompt for conventions, testing rules, and team patterns. Theo’s live Vite+ run shows the failure mode: bloated agent files packed with scaffold commands and irrelevant noise hurt the model; move bulky details to docs or skills instead .
  • Async bug-fix fanout: Felix Rieseberg’s internal Cowork loop is copyable:
    1. Point the agent at the crash dashboard.
    2. Have it separate fixable bugs from OS/kernel noise.
    3. Write one markdown prompt per fixable bug.
    4. Launch a remote Claude Code task for each prompt and let them run while you’re in meetings .
  • Sandbox rule of thumb: isolate first, then allow full permissions inside the boundary. Open SWE and LangSmith both follow this pattern, and LangSmith adds proxy-based access so credentials stay off the sandbox entirely .

👤 PEOPLE TO WATCH

  • Simon Willison — shared concrete operator playbooks today: Pragmatic Summit highlights plus new chapters on how coding agents work and subagents. Useful because they include reusable prompts, TDD/manual-test loops, and context tactics .
  • Felix Rieseberg — useful voice on VM-based agent harnesses. The Cowork interviews connect VM isolation, markdown skills, Chrome integration, and internal bug-triage orchestration in one coherent workflow model .
  • Theo — worth watching when you want an unpolished tool comparison instead of a vendor benchmark. Today he showed both a practical Codex/GPT-5.4 win and a sharp critique of noisy AGENTS.md files .
  • Logan Kilpatrick — strong big-company signal: better models and harnesses let him get back into shipping production code at Google, but humans still own review, prioritization, and the “what should we build?” decision .
  • DHH — notable because he was publicly skeptical for a long time. His shift from using AI as a better search/pairing tool to daily agent use is meaningful, and his framing is useful: agents amplify output without reducing the programmer to a project manager .

🎬 WATCH & LISTEN

  • 2:39-3:37 — LangSmith Sandboxes as a tool: a short demo of the pattern. A deployed agent spins up a sandbox, generates HTML, renders it with a headless browser, and sends back a screenshot .
  • 15:35-17:25 — Felix’s async bug-fix loop: Cowork reads a crash dashboard, filters fixable issues, writes per-bug markdown prompts, and fans out remote Claude Code runs .
  • 44:29-46:40 — DHH on the flip: worth the segment for the mental-model update. He explains why late-2025 agents stopped feeling like bad autocomplete and started feeling like parallel cognitive leverage .

"It is more like I've grown 18 arms and seven more brains."

📊 PROJECTS & REPOS

  • Open SWE — new open-source foundation for internal coding agents. The adoption signal here is architecture: LangChain says it packages the same core patterns seen in Stripe’s Minions, Ramp’s Inspect, and Coinbase’s Cloudbot .
  • pi-autoresearch — worth watching because it was used in Shopify’s Liquid optimization run. That effort produced 93 commits from around 120 automated experiments and landed a 53% parse+render improvement on Liquid .
  • Shopify/liquid PR #2056 — a strong proof artifact for autonomous optimization: the PR headline claims 53% faster parse+render and 61% fewer allocations after agent-driven micro-optimization work .
  • multipart-form-data-conformance — small repo, clear pattern. It shows how to turn multiple existing implementations into a conformance suite the agent can target for a new implementation .

Editorial take: the durable edge right now is not one model release; it’s the harness—sandboxed execution, lean context, and ruthless verification.

Codex Subagents Go GA as Specs and Review Become the Real Constraints
Mar 17
6 min read
122 docs
cat
Omid Mogasemi
Addy Osmani
+12
Codex subagents were the clearest release today, but the bigger pattern came from practitioners across tools: better specs, cleaner context boundaries, and tighter review loops are what actually unlock agent leverage. This brief covers the tools, workflows, clips, and projects worth stealing from right now.

🔥 TOP SIGNAL

OpenAI shipping subagents in Codex is the biggest practical release today: specialized workers let you keep the parent context clean, split work in parallel, and steer results as they come back. Simon Willison’s follow-up makes the broader point—subagents are now GA in Codex, custom agents live in ~/.codex/agents/ as TOML, and the same interface pattern is already surfacing across Claude Code, Cursor, VS Code, and Gemini CLI.

"a glimpse of a future where agents orchestrate agents"

🛠️ TOOLS & MODELS

  • Codex subagents / custom agents — now GA after preview; default subagents include explorer, worker, and default, while custom agents can be defined in ~/.codex/agents/ and pinned to models like gpt-5.3-codex-spark. OpenAI’s practical pitch is straightforward: cleaner parent context, parallel tasking, and live steering.
  • Fast-subagent tip for Codex Pro — Alexander Embiricos says you can explicitly ask Codex to spawn subagents, and Pro users can use Spark for faster ones.
  • Remote env setup is getting first-class support — Claude Code now supports custom environments for remote runs via http://claude.ai/code, desktop, and mobile, plus setup scripts for dependencies, settings, and configs before launch. Kent C. Dodds says Cursor agents already offer a full Linux VM with browser plus screenshot/demo-video support and custom startup setup.
  • LangGraph Deploy CLIlanggraph deploy is now the one-step path to LangSmith Deployment: the CLI builds a Docker image, provisions Postgres and Redis, fits CI/CD, and adds list, logs, and delete management commands. First-party templates now include deep-agent-template and simple-agent-template; quick start is uvx --from langgraph-cli langgraph deploy.
  • openClaw is pushing more logic into plugins — Steinberger says “everything can be a plugin now”; lots of code moved out of core, with faster performance and lower memory use overall, plus Claude/Codex/Cursor plugin bundle support. It still needs another day or two to stabilize.
  • Mistral Small 4 — new Apache 2 licensed 119B MoE model with 6B active parameters; Mistral positions it as one model spanning reasoning, multimodal, and Devstral-style agentic coding. It supports reasoning_effort="none" or "high", and Simon Willison tested it via llm-mistral.

💡 WORKFLOWS & TRICKS

  • Subagent orchestration recipe

    1. Define narrow specialists as custom agents in ~/.codex/agents/.
    2. Give each one a job, not a vague mandate — Simon’s doc example uses browser_debugger to reproduce, code_mapper to trace the path, and ui_fixer to ship the smallest fix.
    3. Keep the parent agent focused on coordination while workers handle exploration in parallel.
    4. Steer individual agents as evidence comes back instead of dumping everything into one growing thread.
  • Spec pack before prompt

    1. Spend 30-40% of the task writing the spec: requirements, constraints, success criteria, stack, libraries, and UI components.
    2. Put supporting docs in a context or resources directory.
    3. Encode architecture and team best practices in markdown or via MCP so the model doesn’t default to generic patterns.
    4. State the goal, not just the task — Theo’s chess-engine example failed because the agent inferred the wrong objective.
  • Local-to-prod LangGraph loop

    1. Install the CLI: uv tool install langgraph-cli.
    2. Scaffold with langgraph new and pick the DeepAgent template if you want a fuller harness.
    3. Set LangSmith and model-provider keys in .env.
    4. Run uv sync and langgraph dev to test locally in LangSmith Studio with traces and hot reload.
    5. Deploy with langgraph deploy, then manage with logs, list, and delete.
  • Simon Willison’s data-analysis pattern is reusable outside journalism

    1. Work in Python + SQLite, optionally with Datasette.
    2. Use agents for database Q&A, exploration, cleaning, visualization, and scraping — his workshop handout breaks the flow into those modules.
    3. For UI work, serve a Datasette viz/ folder and have Claude Code write interactive visualizations straight into it.
    4. If you’re onboarding a team, his workshop setup used GitHub Codespaces plus a budget-restricted Codex key; attendees consumed $23 in tokens.
  • Set a merge policy now

    • Logan Kilpatrick’s blunt read: the bottleneck has already shifted from generation to code review.
    • Addy Osmani’s rule of thumb: merge AI-generated changes when they’re small/compartmentalized or backed by enough tests, and keep humans in the loop for harder maintenance work.

👤 PEOPLE TO WATCH

  • Simon Willison — dropped two operator resources in one day: a NICAR workshop handout on using Claude Code and Codex for data work, and a fresh chapter explaining coding agents as LLM + system prompt + tools in a loop. Good if you want both hands-on workflow and mental model.
  • Addy Osmani — best practical framing today on spec-driven development for agent workflows; useful because he pairs the spec advice with an explicit quality bar for merges and maintenance.
  • Theo — worth watching for showing both the upside of multi-agent orchestration on a large repo merge and the failure mode when an agent optimizes for the wrong implied goal.
  • Logan Kilpatrick — a short post, but probably the cleanest organizational warning of the day: your process is likely underprepared for AI-heavy review load.
  • Kent C. Dodds — credible firsthand signal on remote agents because he names the concrete features he actually uses in Cursor, and he discloses that he gets free usage rather than pretending it’s a neutral review.

🎬 WATCH & LISTEN

  • 1:28-3:49 — LangGraph local iteration loop: Best short demo today if you want to see how langgraph dev turns an agent into a local server, surfaces traces in Studio, and hot-reloads prompt changes before deploy.
  • 25:09-26:15 — Theo on goal vs. task drift: A very real failure case: the agent “succeeds” by satisfying the literal prompt while missing the intended goal. Useful calibration for anyone over-trusting long-running agents.
  • 0:38-1:15 — Addy’s spec checklist: Fastest clip in the batch for improving agent outputs tomorrow morning — constraints, success criteria, stack, libraries, and UI components, up front.

📊 PROJECTS & REPOS

  • deep-agent-template — official first-party LangGraph starter for heavier agent workflows; adoption signal is that LangChain used it in the Deploy CLI walkthrough and paired it with one-command deployment.
  • simple-agent-template — smaller starting point for the same langgraph deploy path.
  • Trees heatmap gist — concrete artifact from Simon Willison’s workshop: Claude Code generated an interactive Leaflet.heat visualization inside a Datasette viz/ folder over a large tree dataset.
  • Cursor security agents — not open source, but high-signal production usage: Cursor says it runs a fleet of security agents continuously on its own codebase and published automation templates for others.
  • openClaw plugin bundles — watch this framework if you care about tool extensibility: Claude/Codex/Cursor bundle support plus a slimmer core means the project is moving toward a more modular agent surface.

Editorial take: the stack is converging on the same playbook — write a better spec, fan work out to specialists, and spend the saved time on review instead of pretending raw generation is still the bottleneck.

Spec Loops and Small-Task Discipline Reset the Coding-Agent Playbook
Mar 16
5 min read
65 docs
Armin Ronacher ⇌
Peter Steinberger 🦞
DHH
+3
Simon Willison's new framing of agentic engineering was the key signal today, and the best supporting evidence came from practitioners showing what disciplined loops look like in practice: Geoffrey Huntley's spec-first porting workflow, Armin Ronacher's small-task model comparison, and ThePrimeTime's warning about agent-driven work sprawl. Also included: CodexBar 0.18, Omarchy's npm wrapper move, and three clips worth watching.

🔥 TOP SIGNAL

Simon Willison's new "What is agentic engineering?" chapter is the clearest practical reset today: coding agents matter when they can write and execute code in a tool loop toward a goal, not when they just autocomplete text. The actionable part is his operating model—give the agent the right tools, describe the task at the right level, verify the result, then update instructions and the harness as you learn because the model will not learn from yesterday's mistakes on its own . Geoffrey Huntley's citation-driven porting loop and ThePrimeTime's side-project experience point the same way: harness design beats raw output .

"LLMs don't learn from their past mistakes, but coding agents can, provided we deliberately update our instructions and tool harnesses to account for what we learn along the way."

🛠️ TOOLS & MODELS

  • CodexBar 0.18 — new providers (Kilo, Ollama, OpenRouter), Codex historical pace + risk forecasting + backfill, a merged-menu Overview tab, fewer Claude keychain prompt annoyances, and lower CPU/energy use with faster JSONL scanning . Release notes
  • Opus vs Codex on small diffs — Armin Ronacher says that once changes are sufficiently small, there is little to no difference in how Opus and Codex behave. Good reminder that task decomposition can matter more than model tribalism .
  • OpenClaw direction — Peter Steinberger says the plugin system is being pushed toward a leaner core plus more powerful plugins, with support for Claude Code/Codex plugin bundles planned .
  • Omarchy's packaging move — DHH is moving AI tooling out of regular repos and onto npm behind an always-updated npx wrapper because opencode is shipping about 7 releases per day.

💡 WORKFLOWS & TRICKS

  • Spec-first porting loop
    1. Compress tests/* into /specs/*.md with separate subagents, linking implementation as citations .
    2. Do the same for src/*, again linking implementation into the specs .
    3. Run another Ralph loop to create a TODO, then execute classic Ralph doing one thing and the most important thing per loop.
    4. Configure the target language for strict compilation.
    5. Keep citations in the specs so the agent can study the original implementation during execution while stages 1-2 stay decoupled from the source language .
  • Use task size as a quality lever — Armin's way to fight "slop creep": make the change smaller. His takeaway was that for sufficiently small edits, Opus and Codex behaved nearly the same .
  • Treat harness updates as part of the job — Simon's durable checklist: give agents the tools they need, specify the problem at the right level of detail, verify the result, and then change instructions/tooling based on what failed .
  • Don't let cheap MVPs multiply bad work — ThePrimeTime's warning is operational: faster prompting makes it easy to spin up multiple rough ideas, but each one creates more waiting, babysitting, and cleanup. More code output did not mean better code or better problem selection .
  • Repo-triage heuristic — if someone says they "solved" a problem but the GitHub history is only about 48 hours old, Armin says assume it has not been properly evaluated yet .
  • Packaging trick for fast-moving agent deps — if tool churn is too high to vendor comfortably, split AI tooling out of the main repo and lazy-load the latest version via npm/npx .

👤 PEOPLE TO WATCH

  • Simon Willison — published a foundational chapter defining agentic engineering as software development with agents that write and execute code, and says the guide will keep evolving as patterns mature .
  • Geoffrey Huntley — shared a concrete, citation-driven language-porting workflow instead of a vague "just use agents" take .
  • Armin Ronacher — high signal today for both operator insight (small-task Opus/Codex parity) and ecosystem skepticism (too many flashy products, too little real evaluation) .
  • Peter Steinberger — actively shipping in the tooling layer: CodexBar 0.18 is out, and OpenClaw plugin bundles for Claude Code/Codex are on deck .
  • ThePrimeTime — worth watching for a blunt firsthand report on where agent speed helps, where it hurts, and how easily the work can sprawl past the point of usefulness .

🎬 WATCH & LISTEN

  • 7:49-8:29 — The "Faustian bargain" of fast MVPs: Best clip today if you're over-spawning agent jobs. ThePrimeTime explains how easy first drafts turn into longer prompt/wait cycles and constant babysitting once multiple experiments are running .
  • 9:00-9:32 — Output is not the bottleneck: The punchline is sharp: generating more code did not mean better code, satisfaction, or the right product. The real bottleneck became choosing the right problem .
  • 11:30-11:49 — Keep the tool in its place: Short corrective on work/life balance. One more feature is not worth crowding out actual life .

📊 PROJECTS & REPOS

  • CodexBar v0.18 — adds provider breadth, Codex pace/risk forecasting, backfill, a new overview surface, and lower resource use .
  • Omarchy AI-tooling commit — practical repo-maintenance pattern: keep volatile AI tooling out of the main repo and fetch it on demand. The adoption signal is upstream churn: opencode is releasing about seven times per day .
  • OpenClaw plugin ecosystem — watch this if you care about pluginized agent surfaces: steipete is trying to make the core leaner while expanding what plugins and bundled integrations can do .

Editorial take: today's edge is not more agent output; it's tighter loop design—specs with citations, smaller task slices, and explicit verification.