ZeroNoise Logo zeronoise

Coding Agents Alpha Tracker

Live Daily at 7:00 AM Agent time: 8:00 AM GMT+01:00 – Europe / London

by avergin 110 sources

Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.

Claude Tag Moves Coding Agents Into Slack; LangSmith Engine Closes the Repair Loop
Jun 24
4 min read
135 docs
LangChain
ClaudeDevs
cat
+10
Anthropic's Claude Tag is the strongest workflow signal today: multiple insiders describe a Slack-native coding agent that lives in threads, handles incidents, and writes a large share of code. Also worth attention: LangSmith Engine's trace-to-PR loop, Cursor's new team skill surfaces, and a few OSS agent-infra projects getting real discussion.

🔥 TOP SIGNAL

Anthropic's Claude Tag is the clearest workflow shift today: Boris Cherny says the Slack-native agent launched today, and internally it's already used to write PRs, address user feedback, investigate incidents, run data analyses, and answer company-knowledge questions; he says 65% of the product team's new code is created by Anthropic's internal version . Anthropic engineer @_catwu says the internal system merges 65% of product PRs, and Andrej Karpathy's reaction is that this is not Slack-wrapped RAG but a deeply integrated, multiplayer coding-agent product that changes how teams work .

"it’s not some LLM Q&A with RAG over Slack... it’s a different way of working entirely... I work from Slack now."

⚡ TRY THIS

  • Run incident response in the same thread the humans use. @_catwu's flow: when the page lands, tag Claude in the incident thread; it pulls graphs, diffs the deploy, identifies root cause, and tags the author. The team approves in-thread, then Claude opens the fix, lands it, watches the metric recover, and resolves the page .

  • Use a per-thread agent as both search and executor. Boris Cherny's setup starts with @.Claude in a Slack channel. Each thread gets its own sandbox, memory, and permissions; Claude can clone repos, write/test/compile code, answer questions like What’s the status on X? or Who owns this service?, proactively watch channels, draft PRs, and react with / when a thread is resolved .

  • Steal LangChain's closed-loop improvement pattern. The durable sequence is simple: (1) run the agent and mine weaknesses, (2) propose harness improvements, (3) confirm the changes help without regressions, then loop . LangSmith Engine turns that into an operational workflow: build the agent, test with datasets/evals, deploy + monitor traces/online evals, then patch, regress-test, redeploy; when traces show errors, eval failures, negative feedback, or new bad behaviors, Engine clusters them into issues and drafts targeted PRs plus new tests/evals for review .

  • For messy browser/API exploration, have the agent build the harness first. Simon Willison used Claude Code for web to build a small playground UI for testing OPFS + Pyodide behavior across browsers—a good pattern when you need a disposable test bench before committing to product code .

📡 WHAT SHIPPED

  • Claude Tag (Slack launch) — Launched today; tag Claude in-channel and each thread gets its own sandbox, memory, and permissions. Cherny says it can proactively monitor channels, and Anthropic built security layers around model training, classifiers/auto mode, access controls, and channel/workspace boundaries. More: introducing-claude-tag

  • LangSmith Engine (available now) — Connect a tracing project and Engine will inspect traces, cluster repeated failures into issues, draft targeted PRs, and propose new examples/evals for the test suite. LangChain says teams like Vanta, Campfire, and Cogent are already catching regressions earlier and cutting triage time .

  • Cursor customize updates — Plugins can now ship prebuilt canvases (example: Atlassian canvas for live issues/projects/docs); Cursor also added a team leaderboard for popular plugins/skills/MCPs with one-click add, and extended team marketplaces to GitLab, Bitbucket, and Azure DevOps alongside local repos. Changelog: cursor.com/changelog/customize

  • Google Interactions API GA — One API for Gemini models and agents, with dedicated coding-agent skills, the Antigravity Agent remote Linux sandbox, multimodal tool use, and background=True for async long-running interactions .

  • Notable comparison: GLM 5.2 vs Opus 4.8 — Jason Zhou shared a same-prompt, same-reference-image frontend test where both models produced a working Three.js logistics dashboard. He also cites GLM 5.2 pricing at $1.40 / 1M input and $4.40 / 1M output, with Opus about 5x more expensive.

  • OSS agent infra worth a look — In Matthew Berman's roundup, Deer Flow (~74k GitHub stars) stood out as a long-horizon harness built around sub-agents, memory, sandboxes, and skills; Codebase Memory MCP (~12k) claims Linux-kernel-scale indexing in 3 minutes with sub-millisecond structural queries and 120x fewer tokens; Skill Specter (<10k) scans skills for 65 vulnerability patterns before install .

🎬 GO DEEPER

  • 1:36–2:19 — LangSmith Engine's trace → issue → PR loop. Best short walkthrough today of a real agent-improvement pipeline: watch traces, cluster patterns, draft the fix, add evals, then keep monitoring after merge .
  • 6:20–7:26 — Codebase Memory MCP on huge repos. If your agents keep burning tokens just to understand code structure, this is the pitch to examine: Linux kernel indexed in 3 minutes, structural queries in under 1 ms, 158 languages, 11 harnesses .
  • Study Deep Agents alongside LangChain's loop engineering write-up. The repo matters, but the bigger takeaway is the pattern: weakness mining, harness edits, regression checks, repeat .

Editorial take: the serious setups are starting to look the same—put the agent inside the real workstream, give it bounded tools and memory, then close the loop with traces, tests, and human approval.

Stateful Agent Workspaces, Cursor 3, and Simon Willison’s Claude Code Playbook
Jun 23
5 min read
115 docs
Addy Osmani
Thibault Sottiaux
Michael Truell
+13
Today's brief centers on the shift from chat-based coding assistants to stateful agent workspaces. It includes Simon Willison's copyable Claude Code workflow, reusable memory patterns, and the most relevant releases from Cursor, Google, and LangChain.

🔥 TOP SIGNAL

Today's clearest pattern: good coding-agent work is getting less chatty and more infrastructural. Simon Willison's Moebius side-project worked because he staged the repo and weights, wrote research.md, made Claude Code maintain notes.md and plan.md, and iterated against real browser errors instead of treating the agent like a one-shot chat . The same pattern showed up in releases: Cursor's cloud agents now get dedicated dev environments, Google is pushing Managed Agents plus Skills Registry, and LangChain is pushing stateful sandboxes/runtime execution so intermediate work stays out of model context and work can resume mid-session .

⚡ TRY THIS

  • Front-load context, then force durable notes (Simon Willison). In /tmp, clone the target repo, weights, and likely helper libs; use a separate model session to produce research.md; git init a clean project; then start Claude Code one level above it with Read ./moebius-web/research.md... plus a follow-up telling it to commit early/often and maintain notes.md + plan.md. Ask early for the URL I can visit in my own browser, paste errors/screenshots back in, and only then publish via hf CLI + GitHub Pages . When a reference project is ugly or obfuscated, offload inspection to a subagent; Simon used that to discover the caches.open('transformers-cache') pattern for ~1.3 GB browser caching .

  • Make each good run compound. Jason Zhou's loop template keeps a shared artifact/knowledge layer plus logging and verification so the next run starts from more than chat history . Simon uses notes.md for the same reason , and Kent C. Dodds' cleanup prompt is excellent: ask what was learned that would be repeatable and useful for future agents, then have the agent update documentation before you end the session .

  • Have the agent recombine working systems instead of inventing from scratch. Simon Willison says he gets the best results by asking coding agents to combine existing projects rather than build from a blank prompt . In practice, he runs Claude Code against a real repo, lets its explore agents inspect the README and adjacent plugins, then asks it to generate the new plugin and run tests .

  • Define the win condition and the verification method up front. Thibault Sottiaux says long-horizon agents make bad assumptions when you only say 'optimize this'; tell the agent whether you want something like a ~20% gain or a 14x gain, and specify how it should verify the result against production-like workloads or log replay . For riskier automation, pair autonomy with a second agent: Codex's auto-review/Guardian checks each action against the original intent, prompts on medium risk, and blocks high-risk actions .

📡 WHAT SHIPPED

  • Cursor 3: Michael Truell says 95%+ of users now use Cursor primarily as an agent; agents are used ~5x more than assistive features on a request basis and far more by lines of code . New agent-first surfaces include design mode, recursive subagents, and cloud agents with dedicated dev environments for tests, screenshots, and interactive review . Cloud agents are now 3x faster with 99.9% reliability; Automations has already seen 6M+ runs, including Amplitude's background migration of 20k React components to Tailwind .

  • Cursor Mobile + Origin: iOS beta lets you kick off/review agents, inspect artifacts/screenshots, annotate issues, and remote-control local agents . Origin is Cursor's agent-native Git layer: it can resolve merge conflicts, fix CI failures, handle PR comments, and claims review-time reductions of more than 50% .

  • Google's agent stack: Addy Osmani framed the current stack as four rungs: Agent Studio, Managed Agents API, Anti Gravity 2.0, and ADK 2.0 GA . Also notable: Agent CLI for scaffold/run/eval/deploy, Skills Registry for org-scoped markdown skills with dynamic discovery, and Gemini 3.5 Flash as the new default for long-horizon agent work .

  • LangChain: Deep Agents v0.6 adds a code interpreter so tools can run inside the runtime, intermediate results stay out of model context, and only relevant output comes back — fewer round trips, less token waste . dcode is the provider-agnostic coding agent layer for trying OSS models like GLM 5.2; docs: deepagents code overview. LangChain is also explicitly pushing stateful agent computers; see LangSmith Sandboxes.

  • Open-source artifacts worth cloning: Jason Zhou open-sourced the loop-engineer-template with shared artifacts, logging, verification, and a harness for compounding work across runs . Simon Willison shipped a live Moebius browser demo after using Claude Code to port the model to ONNX for in-browser use .

  • Comparison worth noting: Peter Steinberger says a complex three.js Rocket League-style task burned through a 5-hour usage allowance in under one prompt and still needed 7-8 fix rounds; he says GPT 5.5 handled the same task without follow-ups, which left him skeptical of multi-model routing .

🎬 GO DEEPER

  • 5:51-7:12 — Thibault Sottiaux on the difference between 'better ChatGPT' and a real engineering agent. The practical point: connect the agent to Slack, Datadog, logs, and company systems, or you are leaving most of the value on the table .
  • 23:16-25:17 — Michael Truell on where Cursor's next model is aimed. Good framing if you think code generation is no longer the bottleneck: the target is broader engineering work like tool use, planning, testing, and UI around showing what changed .
  • 29:58-34:20 — Thibault Sottiaux on trust after line-by-line human review stops scaling. Strong segment on replacing blanket review with tests, log replay, end-to-end checks, and better observability .
  • Study Simon Willison's working files, not just the finished demo. Start with research.md, notes.md, understanding.md, and the full Claude Code transcript. It is rare to get the full prompt -> plan -> notes -> deploy trail in public .

  • Study the loop-engineer-template. If you want a minimal skeleton for shared artifacts, logging, and verification, this is one of the cleaner public starting points from today .

Editorial take: the edge is moving from clever prompting to better agent infrastructure — stateful workspaces, durable memory, tool connections, and explicit verification are what make runs compound instead of reset.

Verifiable Loops Win: Codex Testing, GLM 5.2, and Claude Artifacts
Jun 22
4 min read
64 docs
Theo - t3.gg
Riley Brown
Romain Huet
+10
The strongest signal today is that coding-agent loops only become useful when the task is verifiable. This brief covers copyable Codex workflows, Riley Brown's GLM 5.2 and Record & Replay demos, Claude Code Artifacts, and a same-day hype check on Sakana Fugu.

🔥 TOP SIGNAL

Today's clearest pattern: agent loops only get reliable when the task is verifiable. Romain Huet says coding is the right proving ground because long tasks can be checked with tests , ThePrimeagen's checklist for successful loops is defined inputs/outputs, clear success/failure, repeatability, and observability , and Armin Ronacher says that without that structure loops still mostly hold up for review/research rather than medium-sized implementation . Tom Osman and Greg Brockman's Codex workflow is the practical template: generate canonical user stories for every feature, test them, log errors, fix them, then retest—with a human still reviewing PRs before merge .

⚡ TRY THIS

  • Run Codex as a full feature-coverage loop (Tom Osman via Greg Brockman). Point it at an existing app and give it an explicit end state:

    /goal go over every single feature in this app create a user story with expected behaviour based on the code keep a single canonical spreadsheet tracking the features status- when done switch loop to testing every user story and documenting all errors- when done fix every logistical error or ux error- test every user behaviour again post fix

    Greg Brockman highlighted this as Codex for testing every feature, and Tom says it can work through hundreds of user stories automatically . It also fits ThePrimeagen's loop criteria: defined outcome, clear success/failure, repeatable, observable .

  • Force a second opinion after API design (Theo). Add this to your Codex first loop:

    When you are done designing the API, get a second opinion from Opus with 'claude -p'

    Theo says this has significantly improved the code quality he gets from OpenAI models . Good default whenever the agent is making architectural calls.

  • Turn a manual browser task into a reusable skill (Riley Brown). In Codex, use the Record and Replay plugin, say Please make a skill called [Name], perform the workflow on screen, stop recording, and Codex turns it into a slash command you can invoke later like /manual tweet draft. Riley's demo shows recordings up to 30 minutes, which makes this useful for real UI chores, not just toy clicks .

  • Use an agent as a backlog analyst, not just a coder (Geoffrey Huntley). Give the agent gh cli access and ask it to generate a markdown report of the top unresolved issues, with columns for problem description, platform, upvotes, and age, and a linked LLM summary plus proposed resolution for each row . Huntley's concrete example targets the top 250 unresolved NixOS/nix issues in a file called nixos-nix.md.

📡 WHAT SHIPPED

  • GLM 5.2 is now a serious Cursor candidate via OpenRouter. Riley Brown's setup: in Cursor go to Settings → Models → API keys, enable custom, override the OpenAI base URL with the OpenRouter endpoint, then add z-ai/glm-5.2 as a custom model . In Riley's own tests, GLM 5.2 one-shotted a Trello-style app with DB/auth via Convex, built and ran a landing page locally, and handled Notion/Slack agent tasks comparably to Opus 4.8; he also says it feels close to GPT 5.5 / Opus 4.8 overall .

  • Claude Code Artifacts. Claude Code can now generate shareable interactive mini-apps/artifacts with their own links, giving teams something concrete to review and pass around .

  • Codex stack openness got clearer. Romain Huet says the Codex CLI, full harness, and server are open source on GitHub; the Codex app can also run open-source models, and he says OpenAI uses Codex across the company, including non-engineers .

  • Temporary deploys for AI-built apps are practical now. Simon Willison had GPT-5.5 xhigh in Codex Desktop build cloudflare-redirect-resolver, then deployed it with npx wrangler deploy --temporary; Cloudflare kept the ephemeral Workers project live for 60 minutes, and Simon says the temporary deployment worked as advertised .

  • Sakana Fugu launched with an immediate reality check. Sakana introduced Fugu as a full multi-agent orchestration system behind a single model API and says Fugu Ultra matches Fable and Mythos; it is available at sakana.ai/fugu. Riley Brown's first design-task test did not finish before daily limits kicked in .

🎬 GO DEEPER

  • 8:41–12:11 — Riley Brown on Record & Replay → Codex skill. The most copyable walkthrough in today's batch: record a real browser workflow, stop capture, then call it later as a slash command .
  • 3:34–4:29 — Romain Huet on why coding is the first real agent harness. Short clip, big idea: long-running agents improve fastest where work can be verified by tests and tools .
  • Repo study — Simon Willison's cloudflare-redirect-resolver and build gist. Small, concrete, and deployed: a good example of using Codex Desktop to build a utility app and ship it to a temporary environment for real validation .

Editorial take: the durable edge right now is not 'more agents'—it is better harnesses: explicit goals, clear success criteria, repeatable environments, verifiable tests, and human review at merge time .

Production Fan-Out Patterns and a 280kLOC WebKit PR
Jun 21
3 min read
70 docs
Jared Zoneraich
Armin Ronacher ⇌
Today's brief focuses on a concrete subagent workflow from Cognition: decompose work, keep contexts small, front-load clarifications, and let agents write prompts and sanity tests. It also flags a 280kLOC AI-generated WebKit PR as a useful case study in reviewing large agent-generated changes.

🔥 TOP SIGNAL

The most practical signal today is agent fan-out from inside Cognition: a lead Devin breaks a problem into independent chunks, spins up 5-100 child Devins in parallel, then combines the results . The rationale is simple and portable: agents do better when both the task and the context are small, and separate VMs make the parallelism real rather than cosmetic . This is already used by Cognition's model research and product teams, not just as a demo pattern .

⚡ TRY THIS

  • Use a coordinator/worker split for migrations and large refactors. Ask the parent agent to decompose the job into independent workstreams, spawn one child per workstream, and merge centrally at the end. Concrete example: one Cognition workflow split a React Native-to-Swift migration into 6 pieces and ran them in parallel .

  • Make the parent agent write the child prompts. Instead of hand-writing every worker brief, have the main agent generate prompts for its own subagents . Practical flow: define the top-level goal, ask for decomposition, then have the parent draft the child prompts before launch .

  • Front-load clarifications before you fan out. Tell the agent to ask every ambiguity-filling question up front, then give it all required context so the run does not stop every few minutes for missing details . This pairs directly with the small-context rule behind fan-out .

  • Require self-tests, then manage the fleet instead of one chat. Have the agent generate its own integration sanity tests as part of the run . In the same workflow, the human role shifts toward supervising many active agents rather than micromanaging one session .

📡 WHAT SHIPPED

  • Artifact worth studying: Armin Ronacher surfaced a 280kLOC AI-generated pull request against WebKit and said it is a reminder that "loops are coming for core infrastructure" .

  • Adoption signal: the fan-out pattern above is already being used inside Cognition's model research team to spin up 100 Devins on eval logs, and by product teams to run 5 child Devins against 5 alternative implementations of the same idea .

🎬 GO DEEPER

  • Study the original workflow writeup:imjaredz on Devin fan-out. It is a compact but unusually actionable thread on subagent orchestration: decomposition, child prompt generation, up-front clarification, parallel VMs, and agent-written sanity tests .

  • Read the maintainer reaction alongside the code:Armin Ronacher's note + WebKit PR #249. Useful if you are thinking about how established projects review and absorb very large AI-generated changes .

"Seeing a 280kLOC AI generated pull request against WebKit is a good reminder that loops are coming for core infrastructure. It’s both exciting and confusing. I wouldn’t know how to run an established project and make that change."

Editorial take: the edge is moving from better one-shot prompts to better decomposition—small contexts, parallel workers, and review processes that can handle much larger AI-generated diffs .

Persistent Coding Agents, Goal Loops, and Codex Handoffs
Jun 20
5 min read
102 docs
Addy Osmani
Boris Cherny
Thibault Sottiaux
+14
The strongest signal today is persistence: coding agents that keep reviewing, fixing, and shipping while you step away. This brief covers copyable loop prompts, Codex local-to-remote handoffs, Anthropic’s production benchmarks, and a practical Sourcegraph update.

🔥 TOP SIGNAL

The biggest practical shift today: coding agents are turning into persistent background workers, not just chat sessions. Boris Cherny says roughly 30% of his code is now written by loops that handle code review and turn user feedback into PRs every 5–10 minutes, while Codex now hands threads between local and remote hosts and users are already running nearly 300 subagents for more than a day . Matthew Berman and Addy Osmani show the repeatable pattern behind that shift: define a clear goal, let the agent self-correct for hours or days, and keep humans out of the hot path until review time .

⚡ TRY THIS

  • Start with a deterministic loop, not an open-ended feature request. Matthew Berman’s template is: choose a trigger (manual, scheduled, or action-based), write a goal the agent can verify, paste the prompt into Codex or Claude Code, then append /goal so it runs until the condition is met .

    "Continue optimizing the code for speed after each significant change. Measure page load performance across every page under the same repeatable test conditions. Continue until every page loads in under 50 milliseconds."

    Avoid vague "build X" loops at first; Berman says loops get brittle when the model has to judge taste, and they can get expensive fast .

  • Put repo maintenance on a timer. Boris Cherny’s production pattern: run a loop for code review, or poll user feedback every 5–10 minutes and open PRs for fixes . Good starter jobs from his own setup: scan for flaky or useless tests, find duplicated abstractions, and keep improving architecture in the background . If you want a nightly variant, Berman’s “Production Error sweep” is equally copyable: review logs, trace root cause, fix, verify, open PR, then ping Slack with the result . If agents are the main reader, Robert C. Martin’s rule of thumb is slightly larger functions and more comments; Kent C. Dodds surfaced that as explicit “refactor to agent standards” advice .

  • Make long runs portable instead of babysitting them. In Codex, start locally, hand the thread to a remote host before closing your laptop, then pull it back later; the handoff can be orchestrated automatically . On the Claude Code side, Boris says auto mode routes permission prompts to a model, which is what made multi-hour and multi-day runs practical for him .

  • Route models by job shape, not brand loyalty. Geoffrey Huntley’s pattern for high-precision work: use Gemini or another gap-filling model to generate the prompt, then feed that prompt into GLM for the actual precise task; his variation is to register the secondary model as a tool inside GLM itself for prompt generation and other quality-of-life help . The underlying rule is timeless: let the creative model specify the work, and the precision model execute it .

📡 WHAT SHIPPED

  • Codex handoff — local↔remote thread handoff is now live. Start on laptop, send to a remote box, resume later; Mark Chen called it a “game changer.” Demo: https://x.com/guinnesschen/status/2068062280345162047.

  • Codex is escaping the terminal fast — Thibault Sottiaux says the app is already on macOS and Windows, works even on the free ChatGPT plan, and equivalent agent capability is coming to mobile and web ChatGPT; he also says Codex now writes the majority of code at OpenAI . Separately, one user reported nearly 300 subagents running for more than a day via lazycodex, and Greg Brockman’s summary was blunt: “Codex app is very good” .

  • Anthropic’s production benchmark is getting harder to ignore — Boris Cherny says 100% of his code since Opus 4.5 has been written by Claude Code, Anthropic has seen an 8x increase in code per engineer this year, Claude Code Review catches and fixes roughly 98–99% of bugs before human review, Claude Security runs weekly autonomous scans/fixes, and one dynamic workflow run produced four PRs that cut CI time by 50%.

  • Sourcegraph Deep Search — auto-compaction for longer uninterrupted conversations, a new Finder subagent for token-efficient file discovery, and smart hover summaries are now GA. Watch: https://www.youtube.com/watch?v=yJU01Y_LtDI.

  • Notable speed/quality tradeoff — Theo used Cursor Agent + Composer 2.5 while Claude Code ran in parallel; the dumber/cheaper/faster model deployed 10 apps from scratch in ~8 minutes, while Claude Code was slower. The apps included real-time sync and one-click Google sign-in without manual glue work .

🎬 GO DEEPER

  • 10:24–11:33 — Boris Cherny on “agents prompting agents.” This is the cleanest short clip in today’s batch on the shift from manual code review to looping reviewers and feedback readers that open PRs every 5–10 minutes .
  • 3:04–3:58 and 4:34–5:00 — Matthew Berman on loop anatomy. Watch this if you want the most copyable starter pattern: concrete goal, repeatable measurement, then /goal until done .

  • 4:02–10:19 — Addy Osmani’s 3D video-store build. Worth studying because the agent had to survive a full chain of dependent failures: Draco export, GLTF compression, texture resizing, lighting/material fixes, and browser lazy loading from a 156 MB Blender starting point .

  • Study the handoff demohttps://x.com/guinnesschen/status/2068062280345162047. This is the clearest short demo in today’s feed of local→remote thread migration without rolling your own orchestration layer .

  • Watch the Sourcegraph changeloghttps://www.youtube.com/watch?v=yJU01Y_LtDI. Useful if you’re comparing agent IDEs on search ergonomics and context handling, not just model brand .

Editorial take: the winning pattern right now is persistence — agents that survive time, device changes, and verification loops are compounding faster than agents that only shine in one-shot demos.

Loop Engineering, Record & Replay, and New Automation Primitives
Jun 19
6 min read
149 docs
Peter Steinberger 🦞
Addy Osmani
Geoffrey Huntley
+17
The strongest coding-agent signal today is the shift from manual prompting to durable loops. This brief covers the concrete workflows behind self-driving PRs, shared-state agent harnesses, and the latest releases from Codex, Cursor, Claude Code, LangSmith, and Datasette.

🔥 TOP SIGNAL

The clearest shift today is from manual prompting to loop design. Theo showed Codex clearing stale PRs overnight and waking up to four stacked PRs reviewed and merged , Jason Zhou described support and SEO loops already running in production on 30-minute and daily cadences , and Steve Yegge’s write-up of Ezra Savard’s Netflix study treats single-agent and multi-agent use as distinct literacy jumps with dedicated training for each . The common pattern across Addy Osmani and Geoffrey Huntley: the advantage is a harness that can sleep, checkpoint state, recycle context, and use a separate evaluator—not a better one-shot prompt .

⚡ TRY THIS

  • Run a repo-maintainer loop instead of a cleanup sprint. Steipete’s exact pattern is: tell Codex to maintain your repos, wake every 5 minutes, and direct work to threads; back it with an orchestrator plus triage, autoreview, and computer-use skills . Theo’s concrete use: let the loop close useless stale PRs, revive the worthwhile ones, then give each revived PR one build thread and one review thread; if you’re pushing a big migration, he also bumped Codex subagent parallelism from 3 to 20 and set a sharply defined goal . Study the exact skill docs here: maintainer-orchestrator and github-project-triage.

  • Move PR review handling off your keyboard. Theo’s next step was giving a PR its own worktree on another machine, then telling the agent to watch for comments, address them, and keep going; one run kept working for 6+ hours . After the code lands, have the agent run the dev server, verify behavior, commit, push the PR, fetch review comments itself, and even spin up reviewer threads; his dynamic loop created PRs, re-reviewed each new SHA, merged, and triggered the next PR automatically . Watch token burn on bad branches: Theo saw one feedback loop chew through 3M+ tokens on a small set of comments .

  • Turn a good one-off run into a shared-state loop. Jason Zhou’s setup flow is practical: manually run the task once, calibrate the behavior, then ask the agent to create a README contract with the goal, workflow, timeline, and schema before wiring a recurring trigger . Put outputs into shared folders for artifacts, signals, and tasks so other loops can read/write the same state, and add a global worklog.md so each agent reads the last 5-10 entries before starting . Triggers can be cron jobs, webhooks, or other agents .

  • Split planner / builder / reviewer at both the agent and model layers. Addy Osmani’s minimum bar for long-running agents is true sleep via events, durable checkpoints on every transition, and a separate evaluator because self-review overrates quality . Matthew Berman’s concrete implementation is model routing as a skill: plan with Fable, write with Composer, then review with GPT-5.5 . Geoffrey Huntley’s simpler orchestrator constraint is also worth stealing: allow one task only, recycle the context window after each task, and progress state with git commits plus a todo list .

📡 WHAT SHIPPED

  • Codex — Record & Replay. OpenAI shipped a new primitive for teaching Codex by demonstration: record a recurring task once, stop recording when you want, and Codex turns the session into an inspectable, editable skill . Greg Brockman framed it as teaching Codex by demonstration, and Nick Baumann says he’s already using it for calendar formatting, PR-to-Slack posting, and onboarding-flow testing .
  • Cursor — /automate + new triggers. Cursor added a plain-language /automate skill that configures triggers, instructions, and tools for you, plus Slack emoji triggers, GitHub triggers for issues/reviews/workflow runs, and computer use for cloud agents . Changelog: cursor.com/changelog/06-18-26.
  • Claude Code — Artifacts (beta). Team and Enterprise users can turn a session into an interactive page like a PR walkthrough or living project dashboard, then share it via private link . Boris Cherny says he’s using it for visual explanations of tricky code, system diagrams, animation previews, and shared dashboards; Mike Krieger’s tip is to ask Claude to diagram its work as tasks get deeper and more independent; @_catwu says teams are already using it to share architecture changes, analyses, and prototypes .
  • LangSmith — LLM Gateway. LangChain launched a gateway positioned as a budget guardrail against agents burning through large LLM bills overnight . Link: Introducing LLM Gateway. Timely context: Theo said his Codex loops drove more than $20,000 in inference over 48 hours .
  • Datasette Agent / Datasette Apps. Simon Willison’s latest write-up shows a coding-agent workflow that’s unusually clean: describe an app in chat, let the agent call describe_table, then app_create, and generate a single-file HTML app against a constrained API . His build stack is also a useful comparison point: Claude Opus 4.6 for the first plugin, Codex Desktop + GPT-5.5 for planning, and Claude Fable 5 for security review—which caught a real CSP privilege-escalation issue .
  • GLM-5.2. Simon notes the 753B MoE model has a 1M context window, open weights under MIT, ranks #2 on the Code Arena WebDev leaderboard behind only Claude Fable 5, and is listed on OpenRouter around $1.40 / $4.40 per million tokens input/output . In his testing it did especially well on animated SVG output, though one more complex illustration regressed versus GLM-5.1 .

🎬 GO DEEPER

  • 12:28-13:26 — Theo on loops that create more loops. Short demo of the agentic endgame: one thread makes the PR, another reviews each new SHA, fixes get re-reviewed, then the PR merges and the next one starts .
  • 18:24-19:29 — AI Jason on the handoff from manual run to production loop. He shows the exact move most people skip: test the workflow once, then make the agent write a README contract and wire the recurring trigger around it .
  • 1:03-3:17 — Addy Osmani on why long-running agents fail. Compact explanation of the three requirements: event-driven sleep, durable checkpoints, and a separate evaluator instead of self-grading .
  • 1:33-2:29 — Geoffrey Huntley on Ralph loops. Good antidote to the while true meme: single-task constraint, context recycling, and state progression via git commit + todo list .
  • Read Steve Yegge’s Netflix training note:The Flat Curve Society. Useful if you’re rolling agents out to a team: 0M / 4M / 12-15M qualified-day token cohorts, team-based training, and the shift from raw spend metrics to waste reduction and pocket evals .
  • Study the exact skills behind the maintainer loop:maintainer-orchestrator and github-project-triage. These are the concrete skill docs steipete says he combines with triage, autoreview, and computer use so work can land autonomously .
  • Study Datasette Agent + the Datasette Apps article. It’s a strong example of an agent with explicit tools, constrained APIs, and a copyable prompt template that other models can reuse .

Editorial take: the winners are starting to look less like prompt whisperers and more like workflow engineers with budgets, checkpoints, and reusable state .

Reusable Skills, Cloud Handoffs, and Verifier Loops
Jun 18
4 min read
110 docs
Andrew Ng
Harrison Chase
Riley Brown
+12
Today's strongest coding-agent pattern is turning successful runs into reusable systems: skills, automations, workflow code, and cloud jobs. Practical highlights span Cursor's new cloud subagents, Theo's Claude Code orchestration tricks, Codex automation and iOS workflows, Context Hub, and new open-model/sandbox tooling.

🔥 TOP SIGNAL

  • The clearest shift today: strong practitioners are productizing successful agent runs instead of re-prompting from scratch. Riley Brown turns plain-English Codex sessions into reusable skills and timed automations , Theo has Claude Code load precomputed context and write JS workflows for staged sub-agents , and Boris Cherny reduces the durable pattern to one line: agent + advanced model + verifier in a loop .
  • The common idea is timeless: capture good behavior once, then reuse it with guardrails instead of leaning on prompt hacks every time .

"run Claude Code + an advanced model + a verifier in a loop"

⚡ TRY THIS

  • Turn a good run into a skill. Riley Brown's workflow: ask the agent to do the task, push until the output is right, then say turn this into a skill called [Name]; when you learn a better format later, request the change and say update the skill. His broader point: stop micromanaging files or using act as hacks — describe the change clearly in natural language .

  • Schedule the boring stuff. Brown uses natural-language automations in Codex for both recurring and one-off tasks. Two prompt shapes that worked: do research every morning at 9am and send me a hook outline and set an automation in 35 minutes to upload this video to YouTube as a draft, send me a text when you do it. Pattern: if a task is useful once, ask whether it should recur or trigger later.

  • Precompute context inside the skill. Theo's Claude Code trick: make the skill execute a script on load so the model starts with facts already computed. His Repo Explorer skill keeps a ~/Explore Repos cache, lists current repos first, then clones only if the target is missing — useful when you want the agent to inspect real source without cluttering the active workspace .

  • Make the agent show its orchestration code before you pay for it. Theo's prompt is worth stealing: I want to audit the open PRs on this project... I want to use a workflow to break up all of this work. Before you run the workflow, please output the code you're going to use to run it so that we can read through it together. That yields staged audit/rule/verify phases instead of blind tool-call flooding , and it pairs cleanly with Boris Cherny's verifier-loop advice; just watch the burn rate, since Theo saw about $100 every 10 minutes with eight parallel agents on Fable .

📡 WHAT SHIPPED

  • Cursor cloud agents/in-cloud spins up an isolated cloud VM for long-running or parallel work; environments save as snapshots for faster future startups and verified testing; local agents can be moved to cloud so work continues with the laptop closed. Changelog: cursor.com/changelog/cloud-in-agents-window.
  • Codex “Build iOS Apps” plugin — runs the app in an in-app browser, opens SwiftUI previews, and hot-reloads edits without leaving Codex. Greg Brockman called it "a much better way to build iOS apps" because it removes the copy-paste-build-screenshot loop .
  • LangSmith sandboxes in Harbor — LangSmith is now a first-class Harbor environment alongside Daytona, E2B, and Modal; Harbor supports Dockerfile snapshots, SDK profiles, and a full exec/upload/download lifecycle. Docs: docs.langchain.com/langsmith/sandbox-harbor.
  • Orca — Jason Zhou says Orca is now his favorite IDE, pointing to built-in file/diff review, a setup script, agent session discovery, and native mobile support. Repo: github.com/stablyai/orca.
  • Context Hub — Andrew Ng and Rohit Prasad are building a "stack overflow for AI agents" so agents can fetch the latest API/SDK docs and contribute feedback back into the docs; Ng says it helps his agents make accurate calls to newer APIs and has accelerated his own coding work .
  • Open-model options are widening — Codex App, CLI, and SDK can use any open-source model via OSS-mode providers, per Tibo's config note . Riley Brown is also testing GLM-5.2 both through Cursor's custom-model path and in ZCode, which he describes as an "exact replica of Codex" with Telegram/Discord bot channels and access via a Coding Plan API key .

🎬 GO DEEPER

  • 3:34–5:35 — Theo on Repo Explorer. Best demo of Claude Code's "skills can execute scripts on load" advantage. You see the exact pattern: keep a repo cache outside the workspace, list what's already there, clone only when needed, and feed the result back into the run .
  • 6:10–8:09 — Riley Brown on turning a one-off task into a skill. Shows the full do task -> improve it -> turn this into a skill -> update the skill loop .
  • 8:35–9:14 — Andrew Ng on Context Hub. Worth watching if your agents keep failing on recent SDKs or annoying API syntax; the point is simple: load fresher docs into the run .

Editorial take: the durable edge is shifting from clever prompts to reusable agent infrastructure — skills, automations, fresh docs, cloud runs, and verifier-backed loops.

Review Gates, Intent Files, and Cursor's Agent-Native Git
Jun 17
4 min read
134 docs
LangChain
Riley Brown
Michael Truell
+9
Review guardrails dominated today: Addy Osmani's data-rich case for risk-tiered agent PR review was the clearest signal. Also worth your time: Kent C. Dodds's INTENT.md pattern, Cursor's Origin launch, LangSmith Sandboxes, and a few concrete Cursor workflows.

🔥 TOP SIGNAL

  • Addy Osmani's Agentic Code Review is the clearest practical read of the day: GitClear data says daily AI users generate about 4x the code for only about 12% more delivered value, while incidents-to-PR rose 242.7%, per-developer defects went from 9% to 54%, and median review duration rose 441.5%. His answer is not to back away from agents, but to review differently: batch-triage PRs with Claude Code or Codex, use heterogeneous reviewers, tier by blast radius, and keep the merge decision human-owned .

"Treat CI as the wall that does not move."

⚡ TRY THIS

  • Turn review into a gated pipeline, not a vibe check. Addy Osmani's playbook is straightforward: 1) point Claude Code or Codex at a batch of PRs and bucket them into safe to merge, needs work, and high-risk; 2) run two different AI reviewers on risky diffs; 3) tier depth by blast radius; 4) refuse review without an intent statement, test output, and a small diff; 5) read rewritten tests first, keep deterministic CI strict, and let a human own merge .

  • Shift human effort into the plan, then automate the line-by-line gate. Kun Chen's solo workflow: write a detailed plan up front, run 20-30 agents in parallel for hours, stay on escalation for stuck agents, and gate merges through an automated No Mistakes review step. The transferable pattern is simple: human-owned intent before execution, automated verification after execution .

  • Add an INTENT.md contract to long-lived packages. Kent C. Dodds has Kody create and maintain an INTENT.md file describing package goals, then compare every proposed change against it. If the goal itself changed, the agent should only update INTENT.md when the user explicitly wants that change .

  • Build internal tools your agent can operate, not just generate. Riley Brown's Cursor demo prompt was essentially make a to do app for me as a creator with a full database... be able to write to this database... make it look like a simple version of Notion, but dark mode. He used Convex for the DB, let the agent add tasks by natural language, and deployed it with @vercel put this on the internet.

📡 WHAT SHIPPED

  • Cursor Origin — Cursor is launching code storage and git hosting so teams and agents can host, review, and collaborate on code; swyx/Tomas Reimers highlighted agent-specific features: scalability for agent workloads, API/MCP extensibility, built-in merge conflict resolution, and CI/CD failure resolution. Available this fall; waitlist.

  • Cursor at Compile — Michael Truell said >95% of Cursor users now use it primarily as an agent, and agent requests are used about 5x more than assistive features. He also described Cursor 3 capabilities around gesture-based design edits, recursive sub-agents, days-long remote project handoffs, and broader SDK/CLI/plugin extensibility .

  • AI reviewer comparison got sharper — CodeRabbit topped the Martian benchmark on F1; Greptile was cited at about 82% bug-catch versus CodeRabbit's 44% in one benchmark; Anthropic said its internal Code Review had <1% incorrect findings and raised substantive reviews from 16% to 54%. The operational takeaway from Addy's roundup: reviewer diversity matters, because in one 146-PR test 93.4% of flagged locations were unique to a single tool .

  • LangSmith Sandboxes — LangChain positioned this as the right layer when an agent needs to do something: verify generated code runs before responding, operate on real files, persist state across tool calls, scale bursty parallel evals/RL, or safely handle user input that may be executed. Blog.

  • GLM 5.2 in Cursor via OpenRouter — Riley Brown shared the exact setup: paste an OpenRouter key into Cursor's OpenAI API override, set the base URL to https://openrouter.ai/api/v1, then add custom model z-ai/glm-5.2. Context from Kalo: people he trusts were reporting strong results from GLM 5.2 .

🎬 GO DEEPER

  • 12:00-14:30 — Riley Brown's agent-writable internal app demo. Good clip if you want a concrete pattern instead of a slogan: prompt the app into existence, attach a database, let the agent write into it, then verify the state persists .
  • 16:39-18:52 — Codex/Claude -> Cursor skills handoff. Watch this if tool-switching friction is your blocker: Riley exports skills and memory into a Codex Import folder with a README and Needed Keys, then asks Cursor to import it globally .
  • 8:13-9:11 — Michael Truell on the next agent handoff shape. Short but high-signal: the target state is not three local agents for 30 minutes, but handing out whole projects and getting back completed, tested work days later .
  • Repo/file to study: llama.cpp's .pi/gg/SYSTEM.md. Georgi Gerganov's local setup is intentionally tiny—pi -nc --offline plus a short system prompt. Start with the SYSTEM.md and the ggml-org Assisted-by commit trail if you want a minimal maintainer-grade local-agent workflow .

Editorial take: more code is already cheap; the leverage has moved to intent control, review gates, and agent-native infrastructure.

Test-to-Green Loops, Architecture Reviews, and Spend Guardrails
Jun 16
4 min read
112 docs
Theo - t3.gg
Aron Prins
Addy Osmani
+9
Today's strongest coding-agent pattern is practical autonomy with explicit checks. Copyable loops from Claude Code/Cursor, context and approval tricks, and new tooling for PR review, spend control, and longer-running agent sessions.

🔥 TOP SIGNAL

  • The strongest pattern today: autonomous coding loops are getting practical when they have hard guardrails. Jason Zhou highlighted a copy-paste "Ship PR Until Green" workflow: paste a feature spec into Claude Code or Cursor, let the agent run tests, read failures, fix them, and keep looping until the exit condition passes or an iteration cap hits . Addy Osmani's interview clip and Simon Willison's datasette-agent release land the same lesson from the reliability side: velocity is not enough without human verification, clear quality bars, or explicit approval before write actions .

⚡ TRY THIS

  • Run a test-to-green loop (Jason Zhou / AI Builder Club).

    1. Paste the feature spec into Claude Code or Cursor.
    2. Let the agent run tests.
    3. Let it read failures and patch them.
    4. Stop only when tests pass or the iteration cap is hit.

    Reported outcome from one run: a green PR after ten iterations with "no hand-holding" .

  • Use slash commands as context hygiene (official Antigravity CLI thread).

    1. /help to see available shortcuts.
    2. /context when the session gets big and you want to visualize the token window.
    3. /diff before review or commit to inspect uncommitted changes.
    4. /btw for side questions so the main task stays on track.
    5. /artifacts to manage the implementation plan .
  • When MCP can't write, drop to the API (Simon Willison).

    1. Ask Claude Code to derive the exact rule you need.
    2. Verify the final expression before applying it.
    3. If the MCP cannot edit the target resource, switch the agent to the provider API.

    Simon used this to land a Cloudflare WAF rule for search URLs containing &: (http.request.uri.path wildcard r"/search/*" and http.request.uri.query contains "&").

  • Add approval-gated write tools (Simon Willison / datasette-agent).

    1. Give the agent a write tool that explicitly asks for approval.
    2. Keep permissions intact.
    3. Only use broad flags like --yes or --unsafe when you intentionally want faster, less constrained execution.

    execute_write_sql now does this, and datasette agent chat content.db -m gpt-5.5 --unsafe can directly modify a database via prompts like "create a notes table" or "add a note about X" .

📡 WHAT SHIPPED

  • Archlet open-sourced — Jason Zhou says AI-written code caused architecture drift against the team's mental model; Archlet reviews PR diffs as a graph so you can see architecture impact at a glance. Claimed internal effect: 10x PR review speed and quality. Repo: superdesigndev/archlet.
  • datasette-agent 0.3a0 — adds execute_write_sql with approval gating, support for approval-requiring tools in datasette agent chat, new --root, --yes, --unsafe flags, and plain-text tool output for CLI. Notes: datasette-agent 0.3a0.
  • LangSmith LLM Gateway — LangChain says one dev on coding agents can burn thousands of dollars a week before anyone notices; they built this after hitting the issue internally and say it now makes spend more predictable. Reads: How we made coding agent spend predictable, Introducing LLM Gateway.
  • Antigravity CLI slash-command pass — official thread documenting /help, /context, /diff, /btw, /config//settings, and /artifacts for agentic workflows .
  • Sourcegraph Cloud — longer Deep Search conversations now use automatic compaction to keep growing threads manageable .
  • Claude Agent SDK billing reversal — the planned credit change is paused, and Theo says T3 Code users can keep using Claude Code with their existing subscriptions .
  • Emerging project: clawsweeper — Peter Steinberger says new issues on their open-source projects get checked against VISION.md, then the agent can create and autoreview a PR if the issue fits. Example: openclaw/gogcli#816.
  • Practitioner comparison: Anthropic Ultracode — swyx says the subagent model feels like "subroutines but intelligent" and can apply beyond coding, but warns the fanout only pays off if the repo is set up for parallelization; otherwise it is "scarily good at burning tokens" .

🎬 GO DEEPER

  • 0:45–1:42 — verification beats raw velocity. Addy Osmani's guest lays out the boring but critical part of agentic engineering: tests, visual regression, or a crisp definition of "good" still have to exist if you're going to trust the output .
  • 3:44–4:27 — the "cognitive surrender" warning. Short clip on the failure mode where you stop thinking critically and just ship whatever the agent emits .
  • Repo to study: Archlet. If AI-generated PRs are getting harder to reason about, this is the most concrete repo in today's sources aimed at architecture-level review rather than line-level diff skimming. Repo: github.com/superdesigndev/archlet.

  • Workflow artifact to study: openclaw/gogcli#816. This PR is a live example of Peter Steinberger's issue -> VISION.md check -> agent-created -> autoreviewed loop. Start here if you want a concrete automation trace, not just a description: openclaw/gogcli#816.

Editorial take: the useful frontier is not more autonomy by itself — it's autonomous loops with explicit stop conditions, review surfaces, spend controls, and human verification.

Codex Workflows: Self-Written Goals and Instant App Context
Jun 15
2 min read
45 docs
Tibo
Pietro Schirano
Riley Brown
Two practical Codex patterns stood out today: let the agent write its own /goal, and use app-shots to hand it the full context of whatever app you're in. Light news day, but both are immediately testable in a real dev workflow.

🔥 TOP SIGNAL

  • The practical shift in today's Codex chatter: let the agent set more of its own task. skirano says he "basically never" writes his own /goal anymore; he asks Codex to write one for itself and one for each agent it spawns. Tibo's framing is the timeless part: because Codex can see and set its own /goal, this turns meta-prompting into "give the agent your intent, then let it derive the task structure"

"Codex can see and set its own /goal. Everything we build, we build also as a tool for the agent. This is a generalization of meta prompting, where you let the agent set its own task based on your intent."

⚡ TRY THIS

  • Let Codex author the /goal. Replicable workflow from skirano:

    1. State the outcome you want.
    2. Ask Codex to write its own /goal.
    3. Ask it to write a /goal for each agent it spawns. This is the cleanest concrete example in today's sources of offloading task decomposition to the agent instead of hand-authoring it yourself
  • Use app-shots for in-place drafting. Riley Brown's shortcut flow:

    1. Open the app you're already working in.
    2. Press both Command keys at the same time.
    3. Give Codex a terse instruction like "Finish this." Riley says this works with Notion and email because it gives Codex the context of what you're doing immediately
  • Combine the two when the hard part is context + decomposition. First send the live app context to Codex with the Command-key shortcut; then have Codex turn that context into its own /goal and sub-agent goals. Today's sources support each step directly, and together they form a useful pattern: context capture first, agent-defined execution second

📡 WHAT SHIPPED

  • Codex app-shots is the concrete feature getting real practitioner praise today. Riley Brown calls it "the most delightful feature" he's used; the behavior is simple and specific: in any app, press both Command keys to send the current context to Codex immediately, then issue a tiny command. Demo

🎬 GO DEEPER

  • skirano's short demo of self-authored goalswatch on X. Best quick example in today's sources of asking Codex to write one /goal for itself and one for each agent it spawns

  • Riley Brown's app-shots demo + exact shortcut follow-updemo and follow-up. Watch this if you want the clearest picture of the interaction loop: capture the current app, then hand Codex a minimal instruction like "Finish this"

Editorial take: today's edge is tighter intent translation — capture the live context, then let the agent do more of the task-setting.