ZeroNoise

Coding Agents Alpha Tracker

Live daily at 7:00 AM · agent time: 8:00 AM GMT+01:00 (Europe/London)

by avergin · 110 sources

Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.

Bounded Goal Loops, Better Review Bots, and New Computer-Use Primitives
May 10
4 min read
74 docs
Romain Huet
Omar Shahine
DHH
+3
Today’s practical edge is operational discipline: Codex and Hermes goal loops only become useful when you pin them to explicit validation and stop rules. Also worth your attention: Crabbox for disposable debug loops, Peekaboo 3.0 for macOS computer use, Copilot review’s jump in usefulness, and strong Codex iOS build feedback.

🔥 TOP SIGNAL

  • Bounded goal loops are the real unlock. After three days with Codex/Hermes goal-based agents, Jason Zhou says most people use them wrong: the loop only works when you define the objective, constraints, validation method, and explicit stop conditions up front — not when you say "keep fixing stuff." His deeper Codex walkthrough shows why: go replaces dumb programmatic looping with an LLM judge, which works well for hours-long migrations, refactors, and optimization tasks, but breaks down on multi-week work without fast, verifiable feedback.
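
  • What a bounded loop looks like in code. A minimal Python sketch of the pattern Jason describes, assuming hypothetical run_agent_step and judge stand-ins (not Codex/Hermes APIs): the objective, constraints, validation command, and stop rules are all explicit, and an LLM judge grades the validated result instead of a dumb counter.

    import subprocess, time

    # Hypothetical stand-ins for the agent call and the LLM judge.
    def run_agent_step(goal: str, constraints: list[str], feedback: str) -> None: ...
    def judge(goal: str, validation_output: str) -> bool: ...

    def bounded_goal_loop(goal, constraints, validation_cmd,
                          max_iterations=20, max_seconds=4 * 3600):
        """Drive an agent toward a goal with explicit validation and stop rules."""
        start, feedback = time.time(), ""
        for i in range(max_iterations):                    # stop rule 1: iteration cap
            if time.time() - start > max_seconds:          # stop rule 2: wall-clock budget
                return "stopped: time budget exhausted"
            run_agent_step(goal, constraints, feedback)    # agent edits the codebase
            result = subprocess.run(validation_cmd, shell=True,
                                    capture_output=True, text=True)
            if result.returncode == 0 and judge(goal, result.stdout):
                return f"done after {i + 1} iterations"    # stop rule 3: verified success
            feedback = result.stdout + result.stderr       # feed failures into the next pass
        return "stopped: iteration cap reached"

    bounded_goal_loop(
        goal="migrate the codebase from JS to TS",
        constraints=["screens must stay visually identical (Playwright check)"],
        validation_cmd="npx playwright test",
    )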

⚡ TRY THIS

  • Turn one-off prompts into a bounded go run (Jason Zhou).

    1. Enable the feature: run codex features list, then codex features enable go.
    2. Give it a verifiable brief, e.g. go "migrate codebase from JS to TS, verify screens stay exactly the same visually with Playwright".
    3. Check status with go; interrupt with go pause or go clear; branch a side investigation with side.
  • Do an alignment interview before the agent writes code (Jason Zhou; Vincent from OpenClaw). First dump the context: what the project is, what the user cares about, what "bad" looks like, what you already tried, and the bugs it keeps missing; then let the agent ask questions before it starts. Also quantify done — e.g. "find 20 discrete new issues, propose fixes, push fixes to a branch, log results" — because fuzzy goals like "keep fixing" make the model stop early or wander.

  • Move the contract into files with Go Buddy. Run mpx go buddy, then goprep to generate a go.md with the request, constraints, stop rules, and loop details plus a state.yaml task file. Then run go @go.md so every loop re-reads the same contract instead of relying on chat memory; Jason shows this taking a vague game idea to a functional game with generated assets.

  • Debug in disposable sandboxes, not your local machine (Peter Steinberger). His loop is simple: ask Codex to recreate the exact failing state in an ephemeral Crabbox, verify the bug, fix it, then verify the fix. The upside: no polluted local environment and enough isolation to run 10+ sessions in parallel without slowdown. Crabbox
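
  • The same reproduce → fix → verify loop works with any disposable environment. A rough sketch using plain Docker as a generic stand-in for the ephemeral box (the image, repo URL, and branch name are illustrative assumptions, not Steinberger's actual setup):

    import subprocess

    def run_in_fresh_box(image: str, commands: list[str]) -> subprocess.CompletedProcess:
        """Run commands in a throwaway container; --rm means nothing persists."""
        return subprocess.run(
            ["docker", "run", "--rm", image, "bash", "-lc", " && ".join(commands)],
            capture_output=True, text=True,
        )

    IMAGE = "node:22"  # assumed environment, not the real failing stack
    REPRO = ["git clone https://example.com/repo.git app", "cd app", "npm ci", "npm test"]

    # 1. Reproduce the bug on a clean box (the test suite should fail here).
    before = run_in_fresh_box(IMAGE, REPRO)
    assert before.returncode != 0, "bug did not reproduce on a fresh environment"

    # 2. Let the agent produce a fix on a branch, then verify it on another fresh box.
    FIXED = ["git clone -b bugfix https://example.com/repo.git app", "cd app", "npm ci", "npm test"]
    after = run_in_fresh_box(IMAGE, FIXED)
    print("fix verified on a clean machine" if after.returncode == 0 else "fix did not hold")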

📡 WHAT SHIPPED

  • Peekaboo 3.0 — Peter Steinberger says this is the biggest release since 2.0: action-first macOS computer use, unified screenshot + UI detection, cleaner JSON across CLI + MCP, and better snapshots. His framing is the interesting part: he started it last year, but says the models weren’t good enough then; now they are. peekaboo.sh

  • Crabbox Windows terminal handling — strong enough that Steinberger says Codex could E2E-fix gifgrep’s animated GIF terminal rendering. Better terminal substrates clearly matter for what agents can validate. crabbox.sh · gifgrep.com

  • Codex/Hermes goal-based agents — Jason Zhou says both now ship a /goal feature, and his field report is clear: most users are holding it wrong. Adoption signal: Jason ran a nine-hour migration with Codex, and Vincent from OpenClaw ran it for three days across 30 rounds.

  • GitHub Copilot review feels materially better — DHH says the review feature, not the local CLI, went from roughly a 1/10 to 7/10 hit rate on real issue finding. Caveat: it still re-raises concerns that were already dismissed with a 👎.

  • Codex is getting strong iOS build reviews — Romain Huet says it can design screens, write Swift with GPT-5.5, run the app in Simulator without opening Xcode, and click around with computer use to test it. Omar Shahine says a single-shot app built with goals got him about 95% of the way there and felt much better than Claude Code.

  • OpenClaw loop speedups — Steinberger says caching work is making Telegram loops in OpenClaw 5-100x faster.

🎬 GO DEEPER

  • 5:36-6:55 — How to write a go prompt that actually terminates. Short, high-signal clip on the prompt anatomy: objective, constraints, validation, and stop conditions, with examples for migrations, prototypes, and eval-driven optimization.
  • 9:59-12:08 — Where go stops working, and how MISSION.md can take over. Jason draws the boundary cleanly: go is for hours-long coding loops; multi-week goals need scheduled reruns, stored summaries, explicit metrics, and human-in-the-loop escalation.
  • Study Crabbox. The durable pattern is exact-state, disposable execution environments for reproduce → fix → verify loops, plus enough isolation to fan out lots of sessions in parallel.

  • Study peekaboo.sh. Peekaboo 3.0 is a practical reference for action-first macOS computer use and a cleaner CLI/MCP data model.

Editorial take: the best coding-agent setups are getting less magical and more operational — explicit contracts, state files, disposable environments, and hard verification beat "just let the agent cook."

Disposable Agent Code, Local DS4, and Open Models Passing the Swap Test
May 9
5 min read
117 docs
Dillon Mulroy
Shawn "swyx" Wang
Salvatore Sanfilippo
+12
The practical signal today is boundary-setting: use agents aggressively on cheap-to-regenerate code, keep context surfaces tight, and pay attention as local and open-model stacks become viable for serious coding workflows.

🔥 TOP SIGNAL

  • The highest-signal shift today: treat agent-written code as disposable scaffolding when change is cheap. Mitchell Hashimoto says AI "slop" is useful because it enables fast parallel experimentation; he used agent loops in Ralph overnight to generate dozens of low-quality plugins so he could test a full GUI and plugin ecosystem, then regenerate them whenever the API changed because the cost of change was just tokens.
  • Kent C. Dodds is using the same boundary in practice. His MCP-powered assistant Kody has produced 160k+ lines of code he has not read, which is acceptable for proving the idea works—but not for a finished product, where he says he would rewrite more intentionally from scratch. Simon Willison's version is narrower but aligned: he already trusts Claude Code for routine production tasks, while warning that repeated success can create "normalization of deviance" and that real usage matters more than AI-generated tests/docs alone.

⚡ TRY THIS

  • Create a prototype-only lane for unstable surfaces. Mitchell's pattern is concrete: keep core internals high-quality, but let agents generate the GUI, plugins, or provider layer while the API/SDK is still moving; run loops overnight; regenerate on API changes; only ship that slop transparently to testers; then rewrite once the concept is proven.

  • Ask for HTML, not Markdown, when you need an explanation you can inspect. Thariq Shihipar's argument: HTML lets Claude produce SVG diagrams, interactive widgets, and in-page navigation for code explanations and reviews. Simon Willison highlighted this exact PR-review prompt:

Help me review this PR by creating an HTML artifact that describes it. I'm not very familiar with the streaming/backpressure logic so focus on that. Render the actual diff with inline margin annotations, color-code findings by severity and whatever else might be needed to convey the concept well.

Start with that pattern for gnarly diffs, then browse examples at thariqs.github.io/html-effectiveness.

  • Stop auto-loading every skill/tool into context. Dillon Mulroy says he almost never wants skills auto-invoked, so he built custom tooling to toggle them on and off to keep them out of the context window unless needed; Armin Ronacher's terse alternative: prompt templates. Practical takeaway: keep reusable workflows off by default, enable them only for the current task, and use templates for repeat jobs.

  • For code search, start simple: agentic retrieval first, parallel fan-out second, embeddings later. swyx says simple agentic RAG is good enough for many codebases—especially homogeneous ones—until you're dealing with something on the order of 10B-1T tokens; for search itself, fan out in parallel, e.g. four rounds of eight searches, instead of one-search-at-a-time contexting. Then add semantic indexing for larger codebases, where tools like Cursor's embedding flow start to matter more.
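
  • A minimal sketch of that fan-out pattern with the standard library, where search_codebase is a hypothetical stand-in for whatever search tool your agent exposes (ripgrep, an index, an agentic search call):

    from concurrent.futures import ThreadPoolExecutor

    def search_codebase(query: str) -> list[str]:
        return []  # hypothetical stand-in: call ripgrep / your search index here

    def fan_out_search(queries: list[str], rounds: int = 4, per_round: int = 8) -> list[str]:
        """Run several diverse searches per round in parallel, e.g. four rounds of eight."""
        results: list[str] = []
        with ThreadPoolExecutor(max_workers=per_round) as pool:
            for r in range(rounds):
                batch = queries[r * per_round:(r + 1) * per_round]
                for hits in pool.map(search_codebase, batch):
                    results.extend(hits)
                # In a real loop the agent would read `results` here and
                # propose the next round's queries based on what it found.
        return results

    hits = fan_out_search([
        "retry logic", "backoff", "timeout handling", "circuit breaker",
        "rate limit", "idempotency key", "dead letter", "error budget",
    ])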

📡 WHAT SHIPPED

  • DS4 for DeepSeek Flash V4 — Salvatore Sanfilippo released DS4, an open-source inference engine for DeepSeek Flash V4 and his first major OSS project built primarily with AI-written code under human architectural control. He says DeepSeek Flash V4 gives him a usable 1M-token local context window on 198GB RAM. Benchmarks he cited: ~470 t/s prefill + 35-36 t/s generation on M3 Ultra, ~250/25 on M3 Max.

  • DS4's practical feature set looks agent-ready — HTTP API server, streaming, tool streaming via PR, logging, tracing, and a disk-persisted KV cache treated as a first-class object. The setup flow is unusually short: clone the repo, run make, download the model, start the server, then use it with pyagent / cloud code / open code; experienced pyagent collaborators told him that workflow felt like "product mode" rather than a toy.

  • Open-model swap test passed in production — Caspar Brun says his org changed Fleet's internal model from Sonnet 4.6 to Kimi K2.6 and he "didn't even notice"; his claim is that open models are already good enough for most tasks outside the hardest coding work, at 5-10x lower cost. LangChain's framing: this is the year of open-source LLMs in agents.

  • Fleet added per-agent tracing control — You can now enable or disable tracing at the individual agent level in Fleet, which Brace Sproul called a "big unlock" for getting full trace details only where you need them. Docs: langchain.com/langsmith/fleet.

  • Codex migration got a direct path — ChatGPT now exposes a switch-to-Codex flow; Tibo's practitioner summary was simple: "You can just migrate things."

  • Current model chatter from an agent lab — swyx's coding-model shortlist right now: Claude 4.6 and GPT-5.3 Codex.

🎬 GO DEEPER

  • 10:18-12:34 — swyx on when simple agentic RAG is enough. Good calibration if you're overbuilding retrieval: for many codebases, plain agentic search works fine until the corpus gets truly large or heterogeneous.
  • 15:44-16:45 — swyx on parallel search. Watch this if your agent still does naive sequential retrieval; the useful bit is the fan-out pattern—multiple searches per round plus diversity across them.
  • 10:02-11:34 — antirez on the DS4 setup loop. The benchmark numbers are nice, but the real hook is the workflow: clone, make, download model, run server, connect your coding agent stack.
  • Study simonw/tools. It contains the kind of narrow, useful artifacts that Claude Code is good at building quickly, including the Redis Array Playground and other small utilities.

  • Study inaturalist-clumper + simonw/inaturalist-clumps. This is a clean end-to-end pattern worth copying: small Python CLI → git-scraped JSON → HTML frontend generated from a precise prompt against real data.

Editorial take: today's edge is boundary-setting—disposable prototype code, tighter context windows, and better search strategy beat blindly giving agents more rope.

Parallel Browser Agents, Recursive Orchestrators, and Review Gates
May 8
4 min read
188 docs
swyx
Harrison Chase
Riley Brown
+13
OpenAI pushed Codex into Chrome, Cursor doubled down on recursive and parallel workers, and practitioners shared tighter loops for specs, debugging, and pre-merge review. The common thread: coding agents get more useful when you treat them like a small team with explicit handoffs.

🔥 TOP SIGNAL

Parallelism escaped the editor today. OpenAI shipped Codex’s Chrome extension so the agent can work on logged-in sites, gather context across tabs, use DevTools in parallel, and stay out of your way while you keep using the browser. Cursor shipped the matching code-side primitives — recursive /orchestrate, Build in Parallel, and diff-splitting PR workflows — which is a strong signal that the winning agent UX is no longer one chat per task, but coordinated workers across browser, codebase, and review surfaces. Embiricos summed up the delta cleanly: older Codex browser control meant one tab at a time; this unlocks multiple agents/subagents and multiple tabs.
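
The planner → worker → verifier shape behind this kind of recursion is easy to sketch. A minimal illustration with hypothetical plan, do_work, and verify stand-ins rather than Cursor's actual SDK surface:

    # Hypothetical stand-ins; Cursor's real /orchestrate plugin is not shown here.
    def plan(task: str) -> list[str]: ...                   # planner splits a task into subtasks
    def do_work(subtask: str) -> str: ...                   # worker produces a diff or artifact
    def verify(subtask: str, artifact: str) -> bool: ...    # verifier checks the result

    def orchestrate(task: str, depth: int = 0, max_depth: int = 3) -> dict[str, str]:
        """Planner spawns workers and verifiers; failed verification spawns another worker."""
        if depth >= max_depth:                               # hard stop keeps recursion bounded
            return {}
        artifacts: dict[str, str] = {}
        for subtask in plan(task) or []:
            for attempt in range(3):                         # re-spawn a worker on failure
                artifact = do_work(subtask)
                if verify(subtask, artifact):
                    artifacts[subtask] = artifact
                    break
            else:
                # Still failing after retries: treat the subtask as its own task and recurse.
                artifacts.update(orchestrate(subtask, depth + 1, max_depth))
        return artifacts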

⚡ TRY THIS

  • Put the spec in the repo, not the scrollback. Riley Brown stores the app brief in my-idea.md so Codex can keep revisiting it; Matt Pocock does the same with a ubiquitous-language markdown doc, keeps it open while prompting, and references it from AGENTS.md. Practical loop: 1) create my-idea.md with the exact feature brief, 2) create a domain-language doc for important terms, 3) keep both files open during prompting, 4) point your agent rules file at them so fresh sessions can rediscover the context.

  • Debug with artifacts, not vibes. Riley’s loop is concrete: open the app in an external browser, reproduce the bug, copy the console output and status codes from Inspect, add a screenshot when visual state matters, then paste both back to the agent for the next turn. He used this pattern to fix missing permissions, storage-rule failures, metadata rendering, and layout issues — much higher signal than saying it broke.

  • Use Codex as a background browser worker for logged-in flows. Install the Chrome plugin inside Codex, then hand it tasks that used to require babysitting: auth flows, dashboard checks, cross-tab state, and webapp testing. OpenAI says it can gather context across tabs and use DevTools in parallel without taking over your browser, and dkundel’s demo shows the same setup combined with subagents for multiplayer-style testing.

  • Add a hard review gate before the agent can say it is done. Theo’s pattern is simple: explicitly tell the agent to run coderabbit CLI before it reports completion, so it gets a code-review pass with org-wide context instead of only the current repo. The operational loop is clean: code → tests → coderabbit review → only then mark complete.

📡 WHAT SHIPPED

  • Codex Chrome extension — Codex now works directly in Chrome on macOS and Windows; it can test web apps, gather context across tabs, use DevTools in parallel, work on logged-in sites in the background, and avoid hijacking the browser. Install from the Codex app. Announcement

  • Cursor /orchestrate — New Cursor SDK skill that recursively spawns agents. Architecture: planners create workers and verifiers; if verification fails, the planner spawns another worker. Cursor says it already used this internally to cut token use by 20% on skill auto-research and reduce backend cold starts by 80%. Plugin: cursor.com/marketplace/cursor/orchestrate

  • Cursor 3 PR and multitasking surface — New integrated PR review, Build in Parallel async subagents, Create PRs to split diffs into smaller mergeable slices, and quick-action skill pills. Changelog: cursor.com/changelog/05-07-26

  • DeepAgents sandboxes — LangChain’s OSS DeepAgents now supports multiple sandbox backends including Daytona, Modal, Runloop, and LangSmith; no backend means no execute tool. The practical security addition is the auth-proxy pattern: keep credentials in workspace secrets and inject them on outbound requests so they never land inside the sandbox (a rough sketch of that shape follows at the end of this list). Docs: docs.langchain.com/oss/python/deepagents/sandboxes

  • Oracle Agent Memory — Oracle released a Python package for agent memory aimed at long-horizon tasks like software debugging and coding. In Oracle’s benchmark, engineered memory kept token consumption relatively stable over 100 turns while an LLM judge preferred the engineered responses over naive append-everything memory. Code and notebooks: Oracle AI Developer Hub
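
  • The DeepAgents auth-proxy idea above generalizes: the sandbox talks to a proxy you control, and the proxy attaches credentials on the way out, so the secret never exists inside the sandbox. A rough standard-library sketch of that shape; the header name, port, and upstream URL are assumptions, not DeepAgents' implementation:

    import os
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    UPSTREAM = "https://api.example.com"       # assumed upstream the agent needs to call
    SECRET = os.environ["WORKSPACE_API_KEY"]   # lives on the proxy host, never in the sandbox

    class AuthInjectingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # Forward the sandboxed agent's request, adding credentials on the way out.
            req = urllib.request.Request(UPSTREAM + self.path,
                                         headers={"Authorization": f"Bearer {SECRET}"})
            with urllib.request.urlopen(req) as upstream:
                body = upstream.read()
                self.send_response(upstream.status)
                self.send_header("Content-Type",
                                 upstream.headers.get("Content-Type", "application/json"))
                self.end_headers()
                self.wfile.write(body)

    if __name__ == "__main__":
        # The sandbox is pointed at this proxy instead of the real API.
        HTTPServer(("0.0.0.0", 8080), AuthInjectingProxy).serve_forever()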

🎬 GO DEEPER

  • 43:22-48:08 — Riley Brown on turning one web app into a real desktop app. Good watch if you want the concrete multi-surface pattern: same project, same backend, new Electron app, then side-by-side verification against the web version.
  • 48:45-50:31 — Riley Brown on the screenshot-to-fix loop for iOS. Short and practical: run the app in Simulator, hit an auth error, screenshot it, throw it back at the agent, rerun, and verify the fix.
  • 23:41-24:56 — Alex Shevchenko on Ramp’s self-monitoring coding agent. This is the clip to watch if you care about post-merge agent loops: Inspect wakes up on new PRs or a nightly cron, proposes Datadog monitors in shadow mode, and a second agent prunes noisy ones before anything starts pinging engineers.

  • Study the docs, not just the tweets. Cursor’s /orchestrate plugin is the cleanest public artifact of recursive planner/worker/verifier orchestration shipping today. LangChain’s DeepAgents sandbox docs are worth reading before you give any agent code execution, especially the auth-proxy section.

  • Study Oracle’s AI Developer Hub for memory patterns you can actually port. The useful part is not the branding — it is the implementation detail around context compaction, long-term storage, and keeping long debugging/coding sessions from turning into token sludge.

Editorial take: the sharpest agent workflows now look like small-team ops — explicit context files, parallel workers across browser and repo, fresh execution environments, and one forced review step before you trust the result.

Default-Author Coding Agents, Harness Tuning, and Agent-Era Git
May 7
5 min read
128 docs
Harrison Chase
Boris Cherny
Salvatore Sanfilippo
+14
Firsthand reports from Boris Cherny, Simon Willison, and Riley Brown show coding agents moving from assistant to default author. The practical edge is shifting to evals, context control, sandboxing, model routing, and the new tooling shipping around agent-speed development.

🔥 TOP SIGNAL

The clearest shift today: for power users, the agent is becoming the default code author, not a sidekick. Boris Cherny says Claude Code now writes all the code he used to write by hand and that he often has anywhere from a few to thousands of agents running; Simon Willison says he already trusts Claude Code for bounded production tasks without line-by-line review, and Riley Brown got GPT 5.5 to turn a Firebase web app into desktop + Swift iOS clients in a 41-minute run.

The bottleneck has moved up the stack: plans, evals, context hygiene, and review boundaries now matter more than typing speed, and Simon Willison's normalization of deviance warning is the right counterweight as trust rises.

⚡ TRY THIS

  • Plan first, then force a persistent integration loop. Riley Brown's GPT 5.5 flow: build the first web app, ask the model to draft a plan, then reply with:

    "Okay please take a deep look at the plan you made… By the time you are done I want to be able to use all three of these applications together… Don’t stop until you are done."

    He used that to convert a Firebase-backed web app into desktop + Swift iOS apps; the run lasted 41 minutes and finished with working auth on both apps. Timeless pattern: specify the finished state, not the next step.

  • Black-box only the boring path — and require tests + docs. Simon Willison's current bar for bounded agent work is tasks like: build a JSON API endpoint that runs a SQL query, outputs JSON, and adds automated tests and documentation. He treats that output as a semi-black box until bugs or performance problems show up, then inspects internals — but explicitly warns that repeated success can create normalization of deviance and over-trust.

  • Start eval-driven development with 5-10 cases, not 1,000. Harrison Chase says you can begin with five or ten realistic scenarios, define what a good response and bad response look like, and run every prompt/tool change against that set. Clay's production pattern is to use deterministic checks where possible, LLM-as-judge when needed, and then keep expanding the dataset with real production behavior (a minimal sketch of this loop follows at the end of this list).

  • Sandbox agent internet and shell access — especially for local models. ThePrimeTime's blunt advice: letting agents roam the internet from your local machine is the easiest way to shoot yourself in the foot, while Salvatore Sanfilippo says lack of sandboxing is a major concern when giving less-aligned local models agent-style powers. If you need browser automation, the safer pattern is isolated cloud/browser infra rather than full local permissions.
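
  • A minimal sketch of the 5-10 case eval set referenced above, mixing deterministic checks with an optional LLM judge; run_agent and llm_judge are hypothetical stand-ins for your own harness:

    from dataclasses import dataclass
    from typing import Callable, Optional

    def run_agent(prompt: str) -> str:
        return ""        # hypothetical stand-in: call your real agent here

    def llm_judge(prompt: str, output: str, rubric: str) -> bool:
        return False     # hypothetical stand-in: grade with a model here

    @dataclass
    class Case:
        prompt: str
        check: Callable[[str], bool]          # deterministic check where possible
        rubric: Optional[str] = None          # LLM-as-judge rubric where it is not

    CASES = [
        Case("Add a /health endpoint returning 200", lambda out: "/health" in out),
        Case("Explain the retry policy to a new hire", lambda out: len(out) > 200,
             rubric="Mentions exponential backoff and the max-attempts cap"),
        # ...keep expanding this set from real production behavior
    ]

    def run_evals() -> None:
        failures = []
        for case in CASES:
            out = run_agent(case.prompt)
            ok = case.check(out) or (case.rubric and llm_judge(case.prompt, out, case.rubric))
            if not ok:
                failures.append(case.prompt)
        print(f"{len(CASES) - len(failures)}/{len(CASES)} passed", failures or "")

    run_evals()   # rerun on every prompt or tool change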

📡 WHAT SHIPPED

  • Claude Managed Agents (via Simon Willison's live blog of Anthropic's event): multi-agent orchestration and Outcomes are in public beta; Dreaming is a research preview. Outcomes let you define success criteria so Claude can iterate toward them, and Dreaming inspects past sessions to create new memories such as descent-playbook.md. Managed Agents overview.

  • Claude Code's surface area expanded: Anthropic showed Code Review, CI auto-fix, Remote Agents, and CLI/IDE/Desktop surfaces built on the Claude Agent SDK; Simon notes Code Review is already used by every team at Anthropic.

  • Claude Code limits moved immediately: 5-hour limits doubled for Pro, Max, Team, and seat-based Enterprise; peak-hour reductions were removed for Pro/Max; Opus API limits were raised. Boris Cherny also said Anthropic's SpaceX partnership adds 300+ MW of capacity and 220K NVIDIA GPUs within the month. Announcement.

  • Anthropic showed a practical model-routing pattern: an advisor strategy where Opus advises Sonnet on demand. Simon reports Anthropic said one customer, eve, got frontier-model quality at 5x lower cost with this setup.

  • LangChain Deep Agents added Harness Profiles: model-specific overrides for system prompts, tool names/implementations, and middleware. LangChain says its own testing saw 10-20 point gains on a tau2-bench subset versus the default harness; profiles ship for OpenAI, Anthropic, and Google models, with open-weight profiles coming. Tuning blog.

  • Cursor 3.3 added context accounting: you can now see where agent context is going and use the breakdown to debug problems across rules, skills, MCPs, and subagents.

  • Cursor's Composer training stack is self-bootstrapping: Cursor says older Composer generations now set up dev environments for RL training via autoinstall, so newer generations can spend more time on harder tasks. Writeup.

  • Agent-era Git infra is getting real: Theo argues GitHub is straining under agent-generated repo/PR/commit volume; he highlights Pierre / Code Storage for high-throughput repo creation, Entire CLI for preserving the why behind agent changes, and says Forgejo/Codeberg is the best open Gen-2 alternative today because its Actions are largely YAML-compatible with GitHub.

🎬 GO DEEPER

  • 10:58-11:23 — Boris Cherny on the new coding loop. A clean, concise description of the workflow shift: prompt the agent, let it build and test the feature, then approve or request changes.
  • 10:16-11:27 — Harrison Chase on eval-driven development. The practical takeaway: start with 5-10 scenarios, define good and bad outputs, and let that small eval set govern prompt and tool changes over time.
  • 18:45-22:31 — Harrison Chase on async subagents, proactive agents, memory, and identity. Useful framing for where coding-agent UX is headed once background runs get long: you talk to the orchestrator while long-running workers do the coding in the background.
  • Read/listen: Simon Willison on where vibe coding bleeds into agentic engineering. Start with Heavybit Ep. #9, then skim his transcript extract. This is the source set behind his comments on trust, production use, and why the categories are blurring.

  • Study Kody. Kent C. Dodds says the repo hit 160k lines, 428 commits, and 323 PRs since March 18, primarily using cloud agents. It is a live artifact of what agent-heavy OSS development looks like at repo scale.

  • Study robobun. Simon Willison notes Bun's GitHub bot has made more contributions to Bun than Jarred Sumner, which makes this repo worth watching for bot-heavy contribution flow at project scale.

Editorial take: getting code out of an agent is no longer the hard part; keeping trust, context, and repo infrastructure intact at agent speed is.

Harness Design, Review Gates, and Always-On CI Agents
May 6
4 min read
131 docs
Harrison Chase
Armin Ronacher
Salvatore Sanfilippo
+9
Practitioners converged on a durable lesson: the edge in coding agents is shifting from model choice to harness design—shared specs, context trimming, portable interfaces, and explicit review gates. Also in this brief: Cursor's CI autofix, fs-safe, CodexBar 0.24, and the clips worth stealing from.

🔥 TOP SIGNAL

The durable edge is moving from model picking to harness design. PI maintainers say tool-call and system-prompt work can move a model's score by ~30-40%, LangChain is pushing ACP so the same agent can survive CLI/TUI/IDE changes, and Harrison Chase argues that the state wrapped around the model—not the model itself—is now the bigger lock-in risk.

Simon Willison's day-to-day workflow is the operational version of that thesis: agents can be black-box reliable for routine tasks, but humans still own security-adjacent review and higher-order judgment.

"the model is yours to pick. the interface is yours to pick. the harness shouldn’t be the thing that locks you in."

⚡ TRY THIS

  • Black-box the boring path; hand-review the risky path. Give the agent a bounded task like: "build a JSON API endpoint that runs a SQL query and outputs the results as JSON; add automated tests and documentation." Simon Willison says that class of work is now reliable enough to treat as a semi-black box—but he still manually reviews anything security-adjacent.

  • Run parallel spikes, not parallel production merges. Simon's current workflow: fire off a Claude Code web task for one spike, run a second spike in Codex, keep doing other work, then come back and review the prototypes. He says this only became practical once reliability improved enough to reduce review overhead.

  • Use one shared spec, many fresh subagents, and aggressive context trimming. Max's PI setup starts each subagent from a fresh session with a common-ground plan/spec and a manager session id; the main session surfaces blockers, and Reduce strips tool calls/thinking so the active context keeps only user + assistant finals (a sketch of that trimming step follows this list).

  • If the code is wrong, rewrite the spec—not just the prompt. Salvatore Sanfilippo's Redis-arrays loop: write the spec in Markdown, improve the spec with GPT, generate an implementation, go back to the spec if tests are unsatisfying, then do a manual line-by-line review of the core code.
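
  • The trimming step in Max's setup (referenced above) is a pattern any harness can copy: before each turn, drop tool calls, tool results, and thinking blocks, keeping only user messages and final assistant answers. A minimal sketch; the message shape is illustrative, not PI's actual format:

    def trim_context(messages: list[dict]) -> list[dict]:
        """Keep user messages and final assistant text; drop tool calls, results, thinking."""
        kept = []
        for m in messages:
            if m["role"] == "user":
                kept.append(m)
            elif m["role"] == "assistant" and m.get("type") == "final":
                kept.append(m)
            # tool_call / tool_result / thinking entries are intentionally dropped
        return kept

    history = [
        {"role": "user", "content": "Refactor the billing module per spec.md"},
        {"role": "assistant", "type": "thinking", "content": "...long scratchpad..."},
        {"role": "assistant", "type": "tool_call", "content": "grep('billing')"},
        {"role": "tool", "type": "tool_result", "content": "...5,000 lines of matches..."},
        {"role": "assistant", "type": "final", "content": "Done. Changed 4 files; tests pass."},
    ]
    print(trim_context(history))   # only the request and the final answer survive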

📡 WHAT SHIPPED

  • Cursor CI autofix — Cursor now offers always-on agents that monitor GitHub, investigate CI root causes, and open PRs with fixes. Setup template: cursor.com/marketplace/automations/ci-autofix

  • openclaw/fs-safe — Peter Steinberger shipped a reusable filesystem safety primitive extracted from OpenClaw. The guidance is practical: if your Node app accepts paths from agents, plugins, uploads, configs, or users, use a root handle instead of treating string normalization as the security boundary. Docs: fs-safe.io

  • CodexBar 0.24 — New Windsurf, Codebuff, and DeepSeek providers; Copilot multi-account switching; opt-in local storage breakdowns; fixes for hung Codex RPC and redraw battery drain. Release: github.com/steipete/CodexBar/releases/tag/v0.24

  • Deep Agents + ACP — LangChain says deepagents-acp can serve any agent and deepagents-cli --acp exposes the same harness over ACP, with working frontends like toad and JetBrains IDE integration via this blog post.

  • Current model/tool preference snapshot — Simon Willison says Codex has replaced Claude Code for most of his daily use because the latest version is "outstanding" and Claude Code pricing is a trust issue for him. His current favorite local model runs in about 20GB RAM on a laptop and feels roughly like frontier capability from 6-12 months ago. Harrison Chase adds that GLM5 feels close enough to Sonnet/Opus for a lot of prototyping that product taste now matters more than squeezing out the absolute best model.

🎬 GO DEEPER

  • 10:26-11:55 — Simon Willison on vibe coding vs agentic engineering. Best short reset on where agents belong in real software work: personal tools are one thing; production systems touching other people's data need a stricter bar. Watch: YouTube
  • 12:35-13:50 — PI's shared-spec + blocker handoff pattern. Best concrete demo in today's pile of a main session steering fresh-context subagents: every worker reads the same plan/spec, and the main session surfaces blockers so the human can drop straight into the right subagent. Watch: YouTube

  • 8:16-9:34 — Harrison Chase on memory as the real lock-in. If your stack is quietly starting to depend on provider-managed memory, this is the clip to watch before that hardens into architecture. Watch: YouTube

  • Study these artifacts, not just the takes. Cursor's CI autofix template is the most copyable always-on GitHub agent setup from today. fs-safe.io is the cleaner reference if any part of your stack lets agents touch the filesystem through generated or user-supplied paths.

Editorial take: model choice still matters, but today's durable edge is harness design—portable interfaces, owned memory, trimmed context, and explicit review gates.

Background Loops, Agent Docs, and Multi-Model Routing
May 5
4 min read
113 docs
Vercel Developers
Alexander Embiricos
Boris Cherny
+14
Today’s best signals are operational, not aspirational: top practitioners are turning coding agents into scheduled background workers. The practical edge is in loops, future-aware prompts, on-distribution stacks, repo-specific docs, and the infrastructure shipping around them.

🔥 TOP SIGNAL

Today's clearest signal: background agents are becoming the real coding workflow. Boris Cherny says he runs dozens of Claude Code /loops to babysit PRs, fix flaky CI, and cluster feedback every 30 minutes, with new server-side routines keeping jobs alive when the laptop is closed. Alexander Embiricos says Codex already supports the same time-based pattern for unresolved discussions, launch bugs, and flaky tests — and Riley Brown's warning is the useful counter-signal: cronjobs, memory, re-auth, and file-placement reliability are still where agent power users lose time.

⚡ TRY THIS

  • Steal Boris Cherny's first three loops. Set up recurring jobs for PR babysitting (auto-rebase + fix CI), CI health (catch/fix flaky tests), and feedback triage (cluster feedback every 30 minutes). Run them on a cron-style repeat via /loop; if your tool has server-side execution, move long runners there so they keep working offline.

  • Use future-oriented prompts as lightweight automation. Embiricos says he uses this pattern in Codex all the time and that it's powerful but non-obvious.

    "tomorrow, check in on this discussion and ping me if it isn't resolved"

    "let me know if this bug isn't fixed by the day before launch"

    "bug me if this flaky test doesn't go green after retry"

  • If you want max agent throughput, bias toward boring, on-distribution tech. Cherny says Claude Code's codebase is simple TypeScript + React, originally chosen because that combo was very on-distribution for the model; that helped them reach 100% model-written code early. If you're starting greenfield and expect heavy agent involvement, this is the pragmatic default.

  • Write the migration doc before the port. Bun appears to be exploring a Zig→Rust port with a dedicated docs/PORTING.md aimed at coding agents. Steal the pattern: if agents are handling a big refactor or language move, give them a repo-local playbook first.

📡 WHAT SHIPPED

  • Bun showed two strong agent-native signals. Armin Ronacher reported a bug and says a coding agent fixed it and pushed PR #30257 within five minutes; later, agents were debating on the PR itself. Simon Willison separately spotted Bun's agent-specific PORTING.md as the project explores a Zig→Rust port.

  • Vercel deepsec. New open-source coding security harness: CLI-first, sandbox-based scaling, pluggable coding agents, large-repo focus, and support for AI Gateway or your own subscription. Vercel says it followed months of internal use and tests on some of the largest open-source codebases; blog: introducing deepsec.

  • deepagents-cli + Profiles API. LangChain is pushing it as a model-agnostic harness for open-weight coding agents. Recent CLI features: /agents, /model, headless --json + --max-turns, --acp, /skill:name, and MCP with OAuth; docs: overview.

  • LangSmith Fleet multi-model routing. Sub-agents can now use different models, with the stated goal of pushing simple work to fast/cheap models and saving stronger models for the hard parts; page: Fleet.

  • Gemini API infrastructure updates. Logan Kilpatrick says webhooks are live for long-running tasks including agents, and the Interactions API now returns more human- and agent-readable error messages.

  • Codex adoption looks real; Copilot economics look strained. @linuz90 called Codex his favorite coding app and says it now handles 90%+ of his work despite earlier terminal/lock-in hesitation. Theo says one Copilot message burned through 60M+ tokens/$30, and 15 messages totaled $221 of tokens under a flat-message plan he thinks GitHub cannot sustain.

  • Model preference is still split. Cherny says Claude Code reached 100% agent-written code on a simple TypeScript/React codebase, with each Anthropic release from Opus 4 through 4.7 improving the curve. Theo, by contrast, says Anthropic is still meaningfully worse than OpenAI for most code outside frontend, even though many enterprise developers use Claude/Opus via Bedrock, Cursor, or Copilot in existing cloud setups.

🎬 GO DEEPER

  • 7:35-8:49 — Boris Cherny's /loop playbook. Best short walkthrough today of a practical background-agent setup: PR babysitting, CI repair, feedback clustering, and why server-side routines matter once jobs need to survive laptop sleep.
  • 19:50-20:37 — When the model starts the loop for you. Cherny says Claude 4.7 increasingly notices time-varying work on its own, starts a loop, and offers 30-minute Slack reports.
  • Study Bun's live artifacts, not the discourse. PR #30257 is a report→fix→PR example that landed minutes after a bug report, and Bun's docs/PORTING.md shows what agent-facing migration guidance can look like in a real repo.

  • Study Simon Willison's narrow-tool workflow. His Redis Array Playground and PR #277 show Claude Code for web being used for a focused dev utility around one new Redis feature, not a giant monolith ask. More context: blog post.

Editorial take: the edge is shifting from single-shot codegen to reliable background workflows—loops, timers, sub-agent routing, and repo-specific guidance.

Auto-Review, Maintainer Loops, and Ephemeral Agent Machines
May 4
4 min read
62 docs
Maja Trebacz
Tibo
Salvatore Sanfilippo
+5
The strongest signal today is operational: coding agents are taking over the glue work around development—permission approvals, maintainer triage, fresh test environments, and long-context recovery. This brief pulls out the workflows, releases, and clips that are actually useful to practitioners.

🔥 TOP SIGNAL

The highest-alpha move today is taking humans out of the tiny, repetitive interrupts while keeping them at the real review boundary. OpenAI engineer Tibo says Codex Auto-Review is now the default within OpenAI and cuts approval prompts by ~200x, while OpenClaw’s ClawSweeper 0.2.0 applies the same idea to OSS maintenance with a conservative issue → fix/build → guarded PR → review → repair → re-review → automerge loop.

"Clicking the “Approve permission” button is difficult. We show that agents can do that for you."

⚡ TRY THIS

  • Steal the maintainer loop, not just the bot. Peter Steinberger’s ClawSweeper template is explicit: issue → @clawsweeper fix/build → guarded PR → review → repair → re-review → automerge. The timeless pattern is conservative autonomy with hard review gates; if you maintain important OSS infra, Steinberger also points to OpenAI’s Codex for OSS program for free accounts.

  • Use fresh machines when the bug smells environment-specific. Steinberger used Codex to validate a macOS-only launchd issue that would not reliably reproduce on a non-fresh install, and Crabbox 0.4.0 exists specifically to spin up fast ephemeral macOS/Linux/Windows machines for agent workflows via AWS spot, Hetzner, or Blacksmith. Practical playbook: reproduce on a clean box, let the agent test there, then discard the machine.

  • When your local agent starts free-styling tool syntax, clamp it. In his OpenCode + DeepSeek v4 flash workflow, Salvatore Sanfilippo sets the sampler to temperature=0 the moment the model emits a tool-call tag, then restores the default afterward. In the same session, the agent spawned sub-agents, edited files, ran tests, fixed failures, and could be pushed into a read-heavy path with direct prompts like check pico.c for security bugs.

  • Persist long-context state instead of reprocessing everything. Sanfilippo caches common system prompts up to 30k tokens and writes evicted KV cache entries to disk; in his DeepSeek setup, 128k cached tokens = ~390MB, writes take 125ms, and an 11k-token hit reloads in 35ms. If you are building local agent infra, the reusable pattern is prompt-hash lookup → reload shared context → reprocess only the delta.
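
  • The reusable idea behind that last point (prompt-hash lookup → reload shared context → reprocess only the delta) does not require Sanfilippo's engine. A toy sketch of the lookup layer, where the cached value stands in for whatever expensive state your stack produces; real KV-cache persistence lives inside the inference engine, not in Python:

    import hashlib
    import pickle
    from pathlib import Path

    CACHE_DIR = Path("kv-cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def prefix_key(prompt_prefix: str) -> Path:
        return CACHE_DIR / (hashlib.sha256(prompt_prefix.encode()).hexdigest() + ".pkl")

    def load_or_build(prompt_prefix: str, build_state):
        """Reload expensive shared-context state if this prefix was seen before."""
        path = prefix_key(prompt_prefix)
        if path.exists():
            return pickle.loads(path.read_bytes())   # hit: skip reprocessing the prefix
        state = build_state(prompt_prefix)            # miss: do the expensive work once
        path.write_bytes(pickle.dumps(state))
        return state

    SYSTEM_PROMPT = "You are a careful coding agent..."   # shared multi-thousand-token prefix
    state = load_or_build(SYSTEM_PROMPT, build_state=lambda p: {"tokens": len(p.split())})
    # After a hit, only the new turn's delta still needs processing.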

📡 WHAT SHIPPED

  • Codex Auto-Review — released last week; now default within OpenAI; reduces approvals by ~200x; core trick is letting agents handle the permission-approval click. Blog: alignment.openai.com/auto-review.
  • ClawSweeper 0.2.0 — OpenClaw’s open-source maintenance bot running on Codex; automates issue → fix/build → guarded PR → review → repair → re-review → automerge. Steinberger says it can be forked for any repo and is aimed at OSS maintainers drowning in issues and PRs. Repo: clawsweeper.bot.
  • Crabbox 0.4.0 — fast ephemeral machines for agents across macOS, Linux, and Windows using AWS spot instances, Hetzner, or Blacksmith. Positioning is very practical: recreate cross-platform conditions fast, with “infinite codex + tests.” Site: crabbox.sh.
  • Codex /goal — a goal-driven loop that tests, self-corrects, and repeats until the mission is done or budget runs out, instead of forcing constant context resets. Jason Zhou calls it a stateful Ralph-loop and notes Crewlet has explored similar setups. Thread: x.com/aibuilderclub_/status/2050930564870635855.
  • DeepSeek v4 flash custom engine + OpenCode workflow — not a public release yet, but a serious practitioner demo: Sanfilippo used his own 2-bit-quantized inference engine in a real Tcl-interpreter workflow with sub-agents, tool calls, tests, disk-backed KV cache, ~14-15 tok/s generation at 31k context, and a server configured for 250k context.

🎬 GO DEEPER

  • 4:48-9:15 — Disk KV cache stops being a toy. Salvatore shows why DeepSeek’s 1:128 KV compression changes the tradeoff: 128k tokens take about 390MB, can write in about 125ms, and make disk-backed recovery realistic for long agent sessions.
  • 11:20-14:45 — Prompt caching + forced file reads in a real OpenCode session. This section is worth watching for two practical moves: cache common prompts up to 30k tokens, then use explicit prompts like check pico.c for security bugs when you want the agent to read rather than freestyle.
  • Study ClawSweeper. If you want a maintainer-friendly agent loop instead of full autonomy theater, the pattern to steal is the guarded PR → review → repair → re-review structure.

  • Study Crabbox. Useful if your agent workflows routinely need fresh OS state, cross-platform reproduction, or disposable test boxes before you trust a fix.

Editorial take: the real progress today is not “better codegen” in the abstract; it’s agents swallowing the glue work around coding — approvals, fresh machines, maintainer queues, and context recovery — without removing the final review gate.

Ticket Queues Become the New Agent UI
May 3
5 min read
67 docs
AI Engineer
Riley Brown
Salvatore Sanfilippo
+11
The strongest signal today is architectural: top practitioners are moving from chatty sessions to ticket queues, narrow agent roles, and repo-native SOPs. This brief covers the best practical setups from Symphony, OpenClaw, Cursor SDK, Codex, and local DeepSeek workflows.

🔥 TOP SIGNAL

OpenAI’s Symphony / “Symfony” is the clearest sign that coding agents are moving from session management to deliverable management: a background scheduler polls tickets, opens an isolated workspace per ticket, updates ticket state, and raises a PR when work reaches Merging. Jason Zhou says Symphony plus a good codebase harness improved his coding-agent outcomes by 5x, and the same pattern shows up elsewhere today: Ross Mike gets better results with one orchestrator agent delegating narrow jobs, while swyx frames the role shift as plan and review rather than hand-writing every implementation. The edge is increasingly in the state machine around the model—repo-local SOPs, narrow scopes, and explicit review gates—not in babysitting one more chat window.
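
Stripped of product detail, that loop is a small state machine: poll the tracker, claim a ticket, work it in an isolated workspace, and raise a PR when it reaches Merging. A hedged sketch of the shape; the ticket states match the ones above, but the helper functions are assumptions, not Symphony's actual API:

    import time

    # Hypothetical stand-ins for the tracker, workspace, agent, and PR helpers.
    def fetch_tickets(state: str) -> list[dict]: ...
    def set_state(ticket: dict, state: str) -> None: ...
    def open_workspace(ticket: dict) -> str: ...       # returns an isolated checkout path
    def run_agent(workspace: str, sop: str, ticket: dict) -> None: ...
    def open_pr(ticket: dict) -> None: ...

    SOP = "contents of workflow.md"                    # in practice, read from the repo

    def scheduler_loop(poll_seconds: int = 30) -> None:
        while True:
            for ticket in fetch_tickets("To Do") or []:
                set_state(ticket, "In Progress")
                workspace = open_workspace(ticket)     # one workspace per ticket
                run_agent(workspace, SOP, ticket)
                set_state(ticket, "Human Review")      # human gate before anything merges
            for ticket in fetch_tickets("Merging") or []:
                open_pr(ticket)                        # PR is raised only once work hits Merging
            time.sleep(poll_seconds)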

⚡ TRY THIS

  • Set up a ticket-native loop in an afternoon. Jason Zhou’s setup is explicit: clone Symphony; if you need custom tooling or language support, point another coding agent at spec.md; generate a repo-local workflow.md; create a Linear project and save a personal API key with linear api-token save; define To Do, In Progress, Human Review, and Merging; then run symphony --workflow path/to/workflow.md --daemon. The workflow.md frontmatter controls ticket filters, polling, hooks, parallelism, and agent settings, while the markdown body is the SOP the agent follows every turn. Add Playwright CLI, a boot skill, and indexed docs if you want autonomous verification instead of partially autonomous implementation.

  • Keep the stack narrow; talk to one orchestrator. Ross Mike says OpenClaw worked best when one main agent held the full context and delegated bounded jobs to sub-agents; he does not talk to the sub-agents directly. He pairs that with a narrow skill surface—few goals, few connectors, domain-specific skills—because broad “do everything” agents with 15-30 skills/connectors made “none of it work”. Good default: one main agent, specialized subs, and human review only at the last consequential step.

  • Convert good runs into skills; stop carrying junk context. Riley Brown and Ross Mike describe the same loop: get one output you actually like, then reverse-engineer that run into a reusable skill with exact structure, examples, and domain rules; that was the difference between garbage reports and one-shot usable output in their demos. Keep only necessary context too—don’t tell the model what it can already infer from the repo, and don’t expect slangy prompts to produce precise work.

"The value of good instructions has never been higher."

Tibo also calls /goal one of Codex’s most consequential releases so far, which fits the same pattern: instruction quality is compounding, not getting commoditized.

  • If you run local frontier-ish models, compact aggressively. Salvatore Sanfilippo’s ds4.c demo on a 128GB M3 Max shows the hidden tax isn’t just model size: 32K context adds ~1GB RAM, 250K adds ~7GB, and big tool outputs plus huge system prompts crush real-world latency; a 5K-token tool response after an 11K-token system prompt took 86 seconds to reprocess. His fix is straightforward: after long runs, compact the conversation into a short summary instead of dragging the full transcript forward.
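
  • Compaction itself is mechanically simple: once the transcript crosses a token budget, replace everything except the last few turns with a short model-written summary. A minimal sketch, with summarize as a hypothetical stand-in for a cheap model call and a crude four-characters-per-token estimate:

    def summarize(text: str) -> str:
        return "Summary: " + text[:200]       # hypothetical stand-in for a cheap model call

    def estimate_tokens(messages: list[dict]) -> int:
        return sum(len(m["content"]) for m in messages) // 4   # rough 4-chars-per-token guess

    def compact(messages: list[dict], budget: int = 30_000, keep_last: int = 4) -> list[dict]:
        """Replace old turns with a short summary once the transcript exceeds the budget."""
        if estimate_tokens(messages) <= budget:
            return messages
        old, recent = messages[:-keep_last], messages[-keep_last:]
        digest = summarize("\n".join(m["content"] for m in old))
        return [{"role": "system", "content": digest}] + recent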

📡 WHAT SHIPPED

  • OpenAI Symphony / “Symfony” — open-source ticket-driven coding-agent orchestrator. Core pieces: a background scheduler plus repo-local workflow.md that acts as config + SOP; default flow polls Linear every 30s, creates one workspace per ticket, and auto-PRs on Merging. Jason Zhou’s field report: Playwright CLI + boot skill + WORKFLOW.md + good harness = 5x better outcomes.

  • Cursor SDK — official SDK for building agents with the same runtime, harness, and models as Cursor. Use cases: CI/CD jobs, end-to-end automations, and embedding agents in products. Announcement. Composer 2 is 50% off in the SDK this weekend.

  • Petdex — RaillyHugo’s public gallery for discovering, sharing, and installing Codex pets “with one curl”; Greg Brockman amplified it and submissions are open. Link.

  • ds4.c / ds4 server — Salvatore Sanfilippo’s mostly GPT-5.5-built local inference engine for DeepSeek v4 Flash 270B in 2-bit GGUF (~81GB), with an OpenAI-compatible server for coding agents like OpenCode. Current state: working locally on a MacBook Pro M3 Max 128GB, not on GitHub yet.

  • Codex vs Anthropic: the practitioner split is getting sharper. Riley Brown and Ross Mike argue Codex is winning the “super app” lane because coding + knowledge work live behind one interface toggle, while Anthropic’s Cowork / Code / Dispatch / Remote stack still feels fragmented. Their parallel claim is useful: a great coding model is increasingly a great general-purpose knowledge-work model because everything reduces to files, tools, and GUI wrappers.

  • Claude Code for web, real build — Simon Willison shipped iNaturalist syndication from his phone and wired it into homepage, archives, and search. Good proof that browser-native coding agents are now viable for real feature work, not just edits. PR.

🎬 GO DEEPER

  • 1:45-3:38 — Symphony’s mental model. Jason Zhou explains the scheduler + workflow.md split: frontmatter handles polling, workspace hooks, and agent settings; the markdown body holds the SOP. This is the shortest clean explanation of why the ticket tracker becomes the state machine.
  • 23:53-25:04 — A proactive agent loop that actually saves time. Ross Mike walks through an OpenClaw workflow that vets sponsor emails, researches companies, sends the first pricing email, tracks negotiations in Notion, and hands off only the finals. Even if you don’t do sponsorships, the reusable pattern is heartbeat + database + human-at-the-end.

  • 15:09-20:58 — Local frontier model, real latency pain. Salvatore’s ds4/OpenCode demo is worth watching because it shows the practical bottleneck nobody advertises: tool output size and system-prompt bulk, not just raw tok/s. You’ll leave with a much better feel for when local agents are viable and when compaction is mandatory.

  • 25:26-33:24 — Steering vs queueing in Codex. Riley Brown’s beginner guide is one of the better current overviews of projects, plugins, custom skills, and automations; skip straight to the steering/queueing section if you already know the UI. Tutorial.

  • Study this PR, not just the screenshot. Simon Willison’s PR #668 is a clean example of shipping a real feature with Claude Code for web, while the Codebase Context Specification is still a useful artifact if you want a shared language for persistent context layout.

Editorial take: the durable edge is moving out of the model picker and into ticket queues, repo-native SOPs, narrow agent roles, and ruthless context hygiene.

Persistent Coding Agent Workflows and the Infra Forming Around Them
May 2
4 min read
136 docs
Alexander Embiricos
OpenAI Developers
Cognition
+11
The practical shift today is from one-shot coding help to persistent agent systems: daily loops, agent-native install prompts, portable context in Codex, and new infra for heavy agent workflows. This brief pulls out the concrete patterns and releases you can actually use.

🔥 TOP SIGNAL

Today’s real edge is persistent agent workflow design, not another model leaderboard. Alexander Embiricos says OpenAI growth teammate Sahil Punamia’s internal "Lord Bottleneck" started as separate Codex-assisted steps and became a daily loop that reviews past experiments, proposes new ones, generates code/config after the team picks, and runs again. Karpathy makes the matching interface point from the tooling side: specify the outcome, let the agent adapt to the local machine, and let it debug setup in the loop.

⚡ TRY THIS

  • Turn repeated work into a morning agent loop. Embiricos’ "Lord Bottleneck" pattern is straightforward: start by using Codex on each subtask separately—data analysis, experiment ideation, code generation, running the experiment, results analysis, deck writing—then stitch those steps into one reusable skill, then tell it to run every morning. The durable pattern is the important part: don’t start with full automation; chain together the steps that already work.

  • Write install docs as a prompt, not a shell script. Karpathy’s OpenClaw example: instead of shipping a giant cross-platform installer, publish a copy-paste prompt that tells the agent the desired outcome and available tools; the agent can inspect the environment, handle platform differences, and debug setup itself. For Here Now, the whole install flow was effectively:

    "I'd like you to set up here now the web hosting and cloud storage service for agents install as a skill if I have npm and if not, do this instead."

    If you build dev tools, Karpathy’s broader complaint is worth taking literally: docs should answer "what is the thing I should copy paste to my agent?"

  • Prototype in two stages: data pipeline first, UI prompt second. Simon Willison built an iNaturalist viewer entirely on his phone with Claude Code for web: first he created a small Python CLI to fetch and "clump" observations; then he ran that in a git-scraping repo to emit clumps.json; only then did he prompt for the frontend. His exact UI prompt was:

    Build inat-sightings.html - an app that does a fetch() against https://raw.githubusercontent.com/simonw/inaturalist-clumps/refs/heads/main/clumps.json and then displays all of the observations on one page using the https://static.inaturalist.org/photos/538073008/small.jpg small.jpg URLs for the thumbnails - with loading=lazy - but when a thumbnail is clicked showing the large.jpg in an HTML modal. Both small and large should include the common species names if available

  • If you script Claude Code, watch your recent commit messages. Theo highlighted Claude Code’s programmatic -p prompt mode and appended system prompts for automation, but warned that recent commit history mentioning tools like OpenClaw or Hermes MD could trigger refusals or extra billing, even in an empty repo, based on his demo. His practical warning was simple: be careful what you put in commit messages when using Claude Code.

📡 WHAT SHIPPED

  • Crabbox 0.1.0 — Peter Steinberger’s answer to "too many agents, too many test suites": remote Linux test boxes on AWS/Hetzner, dirty checkout sync, warm boxes with friendly slugs, and idle auto-free. Install with brew install openclaw/tap/crabbox. Site: crabbox.sh
  • Codex import flow — OpenAI added migration from other agents: import projects, settings, plugins, agents, and project configuration "in just a few clicks," with Romain Huet framing it as a seamless move to Codex.
  • Codex pets — New /pet feature for persistent context. Tibo says it makes him more productive because the context follows him while multitasking, and Riley Brown says his agent could still write to a pet notebook created four days earlier.
  • Devin inside the shell — Cognition added shell integration: run devin shell setup, hit Ctrl+G, and Devin can see the current terminal screen to help in place.
  • llm-openai-via-codex 0.1a0 — Simon Willison released a plugin that reuses Codex CLI credentials for API calls via llm. Release: llm-openai-via-codex 0.1a0
  • GPT-5.5 migration guidance for coding agents — OpenAI says Codex and the main model line are unified, suggests treating GPT-5.5 as a fresh model family, and provides a Codex-side migration path with $openai-docs migrate this project to gpt-5.5; for multi-step tasks, the agent should send a short user-visible update before tool calls.

🎬 GO DEEPER

  • 21:35-23:27 — Karpathy on "vibe coding" vs "agentic engineering." Best short framing in today’s source set: vibe coding raises the floor, agentic engineering raises the ceiling, and the job becomes coordinating fallible agents without dropping the quality bar.
  • 20:51-24:14 — Theo on why commit history can leak into Claude Code behavior. Worth watching if you wrap Claude Code in harnesses or scripts: his demo argues recent git history gets surfaced in a way that can affect behavior and billing.

Editorial take: the teams pulling ahead are not just picking better models; they’re turning agents into repeatable loops with durable context and infrastructure that can survive real work.

Ralph Loops Go Mainstream, Security Agents Ship, and Codex Speeds Up
May 1
5 min read
172 docs
Andrej Karpathy
Patrick Collison
Logan Kilpatrick
+20
Goal-persistent agent loops are turning into mainstream product features, while security agents, faster Codex workflows, and sharper harness tactics are giving developers more practical leverage. Today’s brief focuses on the patterns practitioners can actually copy: external state, model routing, long-running loops, and small configuration changes that materially improve results.

🔥 TOP SIGNAL

The strongest signal today: goal-persistent agent loops are moving from practitioner hack to product feature. OpenAI shipped /goal in Codex CLI 0.128.0—its take on the Ralph loop—while Addy Osmani’s long-running agents writeup lays out the durable recipe underneath: external state, explicit done-conditions, separate evaluator roles, and append-only logs for recovery. Cursor engineer Jediah Katz makes the same point from the harness side: orchestration, context, routing, transport, state, and execution all matter, and a weak layer can tank agent quality.

🛠️ TOOLS & MODELS

  • Codex CLI 0.128.0 — /goal keeps a goal alive across turns until completion or token-budget exhaustion. Embiricos says it has shipped to CLI and is coming to the app for all users; Simon Willison notes the behavior is largely driven by the goals/continuation.md and goals/budget_limit.md prompts.
  • Codex app update — Dynamic task-specific UI, browser/artifact/code annotation, and faster computer use are the main upgrades. OpenAI team posts cite 20% faster computer/browser use, a 42% faster Computer Use benchmark on one workflow, a new device toolbar for responsive testing, and additional browser speed plus Windows fixes.
  • OpenClaw v2026.4.29 — Better group chats, follow-up commitments from context, safer exec/pairing/owner controls, NVIDIA provider + model catalogs, faster startup, and plugin/channel fixes. Peter Steinberger says the new group chat finally feels agent-native.
  • Security agents are becoming default product surface:
    • Claude Security public beta — Built into Claude Code on the web; point it at a repo, get validated vulnerability findings, and fix them in the same place.
    • Cursor Security Review — Adds always-on Security Reviewer for PRs and Vulnerability Scanner for scheduled codebase scans, with configurable triggers, instructions, tooling, and output sharing.
  • Model/tool comparison from active use — Theo says GPT-5.5 is faster and more likely to unblock him, but can get stuck and choke on context; Opus 4.7 has better intent/taste but sometimes takes bizarre paths and ignores obvious answers. He also says Codex feels much faster than Claude Code on TTFT, TPS, token usage, and tool efficiency.
  • LangChain DeepAgents deploy — Simple cloud deployment for an agent harness via deepagents.toml, split into agent, sandbox, auth, and frontend sections.

💡 WORKFLOWS & TRICKS

  • The long-running loop recipe, stripped to essentials (a compact sketch follows at the end of this list):
    1. Write a task file with explicit completion criteria (prd.json / feature list) before the run starts.
    2. For each cycle: pick the next task, build the prompt with relevant context and persistent notes, call the agent, run tests/checks, append to progress.txt, update task status, repeat.
    3. Keep state outside the model; use append-only logs for recovery/debugging, and split planner/worker/judge or generator/evaluator roles so the model is not grading its own homework.
    4. For overnight jobs, run in a worktree, surface lint/typecheck failures back to the agent, and commit progress at meaningful milestones.
  • Budget control is now harness design — Teams in production are seeing token spend rise fast, so the practical playbook is: use cheaper defaults for simple tasks, cap or pool spend for expensive models, and measure spend vs. outcomes monthly. One team cut cost 30% by changing default model routing; another is actively blocking/managing the most expensive Cursor models and moving to pooled spend. Counterpoint: at least one team refuses anything below Opus 4.7 for coding because cheaper errors in prod can cost more than the token bill.
  • OpenClaw tuning that sounds small but matters — If group chats felt messy before, retry with visible replies enabled and switch from GPT to the codex harness plugin. @steipete says that combo materially improved results.
  • If you build developer tools, design for the agent as the user — Patrick Collison says agents are even hungrier for good DX than developers, and Romain Huet puts it more bluntly: the primary developer on your API is an agent like Codex. Stripe’s concrete demo bar is high: Claude Code was pointed at https://github.com/stripe/link-cli and used secure single-use tokens to make a purchase on Gumroad.
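
  • The compact sketch promised above: a long-running loop with a task file, append-only progress log, and a separate evaluator role. call_agent and evaluate are hypothetical stand-ins for your own worker and judge, and prd.json / progress.txt follow the file names from the recipe:

    import json
    import subprocess
    from pathlib import Path

    def call_agent(prompt: str) -> None: ...                 # worker: edits the repo (stub)
    def evaluate(task: dict, test_output: str) -> bool: ...  # judge: a separate evaluator role

    TASKS = Path("prd.json")         # task file with explicit completion criteria
    PROGRESS = Path("progress.txt")  # append-only log for recovery and debugging

    def run_loop() -> None:
        tasks = json.loads(TASKS.read_text())
        for task in tasks:
            if task["status"] == "done":
                continue
            notes = PROGRESS.read_text() if PROGRESS.exists() else ""
            call_agent(f"Task: {task['title']}\nDone when: {task['done_when']}\n"
                       f"Notes so far:\n{notes}")
            tests = subprocess.run(task["check_cmd"], shell=True,
                                   capture_output=True, text=True)
            passed = tests.returncode == 0 and evaluate(task, tests.stdout)
            task["status"] = "done" if passed else "open"
            with PROGRESS.open("a") as log:                  # append-only: never rewrite history
                log.write(f"{task['title']}: {task['status']}\n")
            TASKS.write_text(json.dumps(tasks, indent=2))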

👤 PEOPLE TO WATCH

  • Addy Osmani — Strongest practical synthesis in today’s notes on long-running coding agents: external files, explicit done-conditions, and append-only logs.
  • Jediah Katz — Cursor builder with a useful corrective: first-party lab harnesses do not automatically win, and a good agent stack has at least six layers to tune.
  • Theo — High-signal for current model ergonomics because he compares failure modes, not just wins.
  • Andrej Karpathy — Worth tracking for the framing shift from vibe coding to agentic engineering, plus his emphasis on LLM-legible systems and the skill set around them.
  • swyx — Useful operator signal that a tiny team can lean hard on agents in real operations: he says ai.engineer serves ~1m unique developers monthly, and his stack includes OpenClaw personally plus Devin and TownAI on the team side.

🎬 WATCH & LISTEN

  • 1:11-1:48 — Starter template to working app. Fast demo of a useful loop: click I'm feeling lucky, let the model plan the logic, then shape the result with multi-turn prompts.
  • 5:17-5:50 — Self-correcting loop + inline code feedback. Voice ideas become code, the system fixes its own runtime bugs, and the live API suggests more semantic HTML.
  • Full talk to queue — AIE EU closing note: swyx on using agents to run ai.engineer as a tiny team serving ~1m monthly developers.

📊 PROJECTS & REPOS

  • snarktank/ralph — Still the clearest inspectable reference for the long-running loop: task list, prompt build, agent call, tests, progress log, repeat. The important signal today is that major products are scaling this pattern rather than replacing it.
  • snarktank/compound-product — Extends Ralph into chained analysis/planning/execution loops; a good repo to study if you want multiple agent roles without burying the orchestration.
  • Codex’s /goal prompt files — goals/continuation.md and goals/budget_limit.md are worth reading because they show goal persistence implemented through inspectable prompt files.
  • OpenClaw v2026.4.29 — Fast-moving open-source agent surface with a meaningful release this week: better group chat, follow-up commitments, safer exec controls, NVIDIA provider support, and startup/plugin fixes.

Editorial take: the durable edge is moving above the model—persistent goals, external state, verification, routing, and recovery are what separate agents that demo well from agents that actually finish the job.