ZeroNoise
Cursor’s “demos, not diffs” makes async agents mergeable
Feb 25
5 min read
185 docs
Cursor’s “demos, not diffs” push is turning async agents into mergeable teammates by having them self-test and return video proof—multiple practitioners report it breaks the review bottleneck. Also: Claude Code remote control, worktrees, Slack context plumbing, and two evergreen agent patterns from Simon Willison (run tests first + generate codebase walkthroughs).

🔥 TOP SIGNAL

Cursor’s big unlock this week is “demos, not diffs”: cloud agents can run the software they just built, test it end-to-end, and send you a video artifact as proof. Practitioners say this flips async agents from “fun but hard to trust” to “mergeable”—Jediah Katz reports that in the last two months >50% of his PRs were written by cloud agents once they could self-test and send videos.

🛠️ TOOLS & MODELS

  • Cursor Cloud Agents — “computer use” + video demo artifacts (shipping)

    • Agents can onboard to your repo, use a cloud computer/remote desktop, and return video demos of the finished change.
    • Cursor: “A third of the PRs we merge now come from agents running in cloud sandboxes.”
    • Cursor CEO Michael Truell: “Over a third of our PRs are now created autonomously with this feature.”
    • Internal example: Cursor agents modifying Cursor (e.g., adding secret redaction to model tool calls) and returning a multi-chapter demo video after E2E verification.
    • Try/read: http://cursor.com/onboard · http://cursor.com/blog/agent-computer-use
  • Claude Code — Remote Control (rolled out to all Max users)

    • /remote-control lets you start a local terminal session, then continue it from your phone.
    • Boris Cherny says he’s been using it daily.
  • Claude Code — Slack plugin (context + updates)

    • Install with /plugin install slack to connect Slack for search, messaging, doc creation, and pulling work context into Claude Code.
  • Claude Code — built-in git worktrees + tmux flags

    • New flags: -w, --worktree [name] and --tmux; each session runs in its own worktree to avoid branch-switching chaos.
  • Claude Code — notable performance datapoint

    • Reported: p99 memory usage dropped 40× in the last two weeks, and 6× since January, while shipping new features.
  • Devin (Cognition) — enterprise-first PMF story + self-serve UX catch-up

    • Scott (via @swyx): Devin didn’t have internal PMF at launch; first enterprise adoption took ~6 months; “async agents are the final boss of agent UX.”
    • Claimed growth: usage doubled every 2 months in 2025 per enterprise after landing; accelerated to every 6 weeks so far this year; internal usage now 4× 2025 peak.
    • Devin 2.2: sprint to pay down self-serve UX debt; omnibox; tighter “close the loop” integration with Devin Review.
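The worktree-per-session isolation described for Claude Code above can be reproduced with plain git; a minimal sketch (not Claude Code's actual implementation—just the underlying mechanic its -w/--worktree flag automates):

```python
import os, subprocess, tempfile

# Give every agent session its own worktree (and branch) so parallel
# sessions never fight over the current checkout.
def git(*args, cwd=None):
    return subprocess.run(("git",) + args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

base = tempfile.mkdtemp()
repo = os.path.join(base, "repo")
git("init", "-q", repo)
git("-c", "user.email=demo@example.com", "-c", "user.name=demo",
    "commit", "--allow-empty", "-m", "init", cwd=repo)

for session in ("agent-1", "agent-2"):   # one worktree per agent session
    git("worktree", "add", "-q", os.path.join(base, session),
        "-b", session, cwd=repo)

print(git("worktree", "list", cwd=repo))
```

Each listed worktree is an independent checkout on its own branch, which is exactly what avoids the “branch-switching chaos” of running several agents in one directory.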

💡 WORKFLOWS & TRICKS

  • Close the agent loop with “proof artifacts,” not trust

    • Jediah Katz’s bottleneck framing: review/testing was the limiter (“you’re responsible… to deliver code you have proven to work”); video demos from agents shift what he can confidently merge without local checkout.
    • Kent C. Dodds calls this “closing the agent loop” and credits Cursor’s computer-equipped cloud agents as a major step change for shipping from his phone.
  • “First run the tests” as your session opener (Simon Willison)

    • Prompt: “First run the tests” to force test-suite discovery and put the agent into a testing mindset.
    • Willison’s claim: automated tests are no longer optional when working with coding agents; if code hasn’t been executed, it’s luck if it works in production.
    • If you use uv in Python, he prompts: Run "uv run pytest".
  • Generate a “linear walkthrough” doc for any repo (also Simon Willison)

    • Use an agent to read the source and produce a structured walkthrough—especially helpful if you “prompted the whole thing into existence” and now need to understand it.
    • Willison’s implementation detail: use Showboat so the agent includes code snippets by running commands (showboat exec + sed|grep|cat) instead of manual copy/paste (reduces hallucination risk).
    • Example prompt (verbatim):

"Read the source and then plan a linear walkthrough of the code that explains how it all works in detail"

  • Peter Steinberger’s “conversational agent” habit: always ask for questions

    • He treats coding with agents as a conversation and repeatedly asks “Do you have any questions?” to surface hidden assumptions (models otherwise default to assumptions).
  • PR review as intent review (not code review)

    • Steinberger’s PR loop: first ask the model if it understands the intent of the PR and whether it’s the optimal solution; often the right fix is architectural/systemic.
  • Rubric separation to reduce “context rot” and bias (Doug O’Laughlin)

    • He keeps task and rubric prompts separate because combining them can commingle information and increase bias/susceptibility; he also calls out sycophancy as a practical failure mode.

👤 PEOPLE TO WATCH

  • Jediah Katz (Cursor) — concrete practitioner stat: >50% of PRs written by cloud agents once agents could self-test and send video proof.
  • Michael Truell (Cursor CEO) — production signal: >⅓ of Cursor PRs now created autonomously with demos.
  • Boris Cherny (Anthropic) — on-the-record: Claude Code does 100% of his coding; he “doesn’t write any of it anymore”.
  • Simon Willison — turning agent work into repeatable patterns: “First run the tests” + agent-generated linear walkthroughs.
  • Andrej Karpathy — pushing “build for agents”: CLI + Skills/MCP + exportable Markdown docs; argues CLIs are uniquely agent-friendly.

🎬 WATCH & LISTEN

1) Cursor: “A computer for every agent” (video artifacts as proof) (≈ 0:10–0:35)

Hook: Cursor shows agents testing their changes on a real desktop and returning a video artifact that demonstrates the feature works—not just a diff.

2) Cursor demo: “paste GitHub issue → agent works → browser proof” (≈ 0:47–1:05)

Hook: A concrete flow: paste an issue link; the agent works ~40 minutes and returns an artifact showing it navigated to the locally running app and verified the result in-browser.

3) Claude Code (Boris Cherny): what changed at Opus 4.5 (≈ 8:02–8:52)

Hook: The shift from “agent does first pass, human fixes” to “agent runs tests, opens the browser, clicks around, and fixes UI issues”—so he no longer opens a text editor.

Editorial take: The day’s theme is verification as a first-class artifact—agents that can run, test, and demo their own work are the ones that actually scale async development.

Simon Willison's Weblog

Simon Willison shares a firsthand workflow for using coding agents to generate detailed linear walkthroughs of codebases, helping understand vibe-coded projects like his SwiftUI slide app.

Tools used: Claude Code (web) with frontier model Opus 4.6; Showboat (his tool for agents to build demo docs via showboat note for Markdown and showboat exec for shell outputs).

Repo: https://github.com/simonw/present.

Workflow steps:

  • Point Claude Code at repo and prompt to read source, plan linear walkthrough, run uvx showboat --help, use Showboat to create walkthrough.md with showboat note for commentary and showboat exec plus sed/grep/cat for code snippets (avoids hallucinations).

Exact prompt:

Read the source and then plan a linear walkthrough of the code that explains how it all works in detail
Then run "uvx showboat --help" to learn showboat - use showboat to create a walkthrough.md file in the repo and build the walkthrough in there, using showboat note for commentary and showboat exec plus sed or grep or cat or whatever you need to include snippets of code you are talking about

Result: Detailed walkthrough.md covering all six .swift files, accelerating learning of SwiftUI structure and Swift language (~40 min project).

Walkthrough: https://github.com/simonw/present/blob/main/walkthrough.md.
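The exec-instead-of-paste trick that Showboat enables can be mimicked in a few lines; a toy sketch with hypothetical helper names (`note`, `exec_block`)—not Showboat's real interface:

```python
import subprocess

FENCE = "`" * 3  # markdown code fence, built up to avoid nesting issues

def note(md, text):
    """Append freeform commentary to the walkthrough."""
    md.append(text + "\n")

def exec_block(md, cmd):
    """Append a snippet by RUNNING a command and capturing its real output,
    so the agent can't hallucinate the code it claims to be quoting."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    md.append(f"{FENCE}\n$ {cmd}\n{out}{FENCE}\n")

doc = []
note(doc, "## How greeting works")
exec_block(doc, "printf 'hello\\n'")  # real output, verbatim
print("".join(doc))
```

The design point mirrors Willison's: snippets produced by sed/grep/cat are ground truth from the repo, whereas snippets typed from the model's memory may silently drift from the actual source.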

Timeless pattern: Agent-generated linear walkthroughs for codebase comprehension.

Contrarian take: LLMs enable faster skill acquisition via such patterns, countering concerns they slow learning.

Andrej Karpathy
x 2 docs

Andrej Karpathy (ex-Director of AI @ Tesla, OpenAI founding team, Stanford PhD) highlights CLIs as ideal for AI agents due to their legacy status, enabling native terminal use and combination.

Practical workflow: Prompt Claude or Codex agent to install Polymarket CLI (new Rust-based tool for querying markets, trading, pulling data) and build dashboards/interfaces/logic; pair with GitHub CLI for repo navigation/issues/PRs/code viewing.

Example: Claude generated terminal dashboard of highest-volume polymarkets' 24hr changes in ~3 minutes; extensible to web apps or pipeline modules.

Timeless pattern: Design products for agents via CLI/MCP, exportable markdown docs, or Skills.

"Build. For. Agents." (2026)

Announcement: https://x.com/SuhailKakar/status/2026305257257775524

Peter Steinberger
Profile 3 docs

Peter Steinberger (OpenClaw creator, ex-PSPDFKit founder, 90k GitHub contribs across 120 projects in 1 year via agentic coding) shares firsthand workflows:

Kickoff workflow: Dump codebase as 1.5MB Markdown into Claude 3.5 Sonnet for spec; drag spec to Cursor (Claude Code), run build in background; test iteratively with Playwright MCPs (e.g., login flows). Achieved "100% production ready" app in ~1hr despite rough early tools.

Daily Cursor workflow (highest trust tool; quantum leap post-GPT-4o/3.5):

  • Conversational prompting: State intent, always ask "Do you have any questions?" to surface assumptions (models default to solving w/ old-code biases; new sessions lack codebase context).
  • Simple Git: No worktrees/branches; checkout 1-10 for parallel tasks.
  • Review code stream/#changed files for anomalies; ship unread "boring" data transforms (focus architecture). Optimize repos for agents, not humans.

PRs as 'prompt requests': Ask Cursor "Do you understand intent? Optimal solution?"; voice-chat explores systemic fixes (e.g., message handling across WhatsApp/Signal); /land PR slash-command credits submitter post-refactor.

OpenClaw patterns (self-built via above; 400k+ LoC):

  • Soul.md: Prologue w/ values/personality (resists injections).
  • Hyper-aware agent: Knows harness/models; emits no-reply tokens; self-modifies via prompts.
  • Models: Opus (loop/personality), Sonnet 3.6+ (chat), avoid Haiku (injection-vulnerable).

Contrarian: "Vibe coding"=slur; agentic coding=learnable skill (like guitar); you'll be replaced by AI-users. Playful starts beat optimization traps.

Boris Cherny
Profile 1 doc

Claude Code, created by Boris Cherny (lead at Anthropic, uses it for 100% of his coding in production), enables fully agentic workflows where users no longer write or manually edit code.

Model progression & productivity (firsthand from Cherny):

  • Sonnet 3.5 (Feb 2025 launch): ~10% of Cherny's code.
  • Sonnet 4/Opus 4 (May): ~30%.
  • Opus 4.5 (Nov): sudden jump to 100%; agent self-tests code, runs tests, uses browser to verify UI/fix pixels—no manual intervention needed.

Concrete workflows (Cherny's production use at Anthropic):

  • Proactive bug fixing: Scans feedback Slack/GitHub, autonomously fixes issues (20-30% of Claude Code's own codebase).
  • Project mgmt: Analyzes team spreadsheet, pings Slack laggards (signs as Cowork bot via .claude.md persistent instructions file).

Timeless agent patterns (Cherny):

  • Core agent: LLM + tools.
  • Orchestration progression: Code writing → tool use (e.g., search codebase/Slack/history for context) → computer use (browser for non-API sites).
  • Highly customizable; self-configures via prompts (e.g., change theme).

Cowork (same Claude Code agent SDK, for non-devs): Ships VM/deletion protection; used for data analysis, busywork (e.g., organize screenshots, pay tickets/taxes—with self-check). Immediate hit vs. Claude Code's slow start.

Scale: ~4% global commits (study; higher w/ private code); used by Spotify/Shopify/Netflix/etc., incl. non-engineers (PMs/data sci).

swyx
x 2 docs

Devin (Cognition's coding agent) initially lacked internal PMF at 2024 launch; took 6 months for first enterprise adoption as models weren't ready and team experimented with agent patterns. However, async agent form factor was right—async agents are the final boss of agent UX.

Quantitative growth (per Scott from Cognition, shared by @swyx affiliated with @cognition): Usage doubled every 2 months across 2025 enterprises after landing; rate accelerated to every 6 weeks in 2026; internal usage now 4x 2025 peak.

Self-serve UX lagged due to repo setup neglect (irrelevant for enterprises using FDEs).

Devin 2.2 updates: Hired first designer; all-hands sprint addressed self-serve UX debt, added omnibox, integrated Devin Review with main to close the loop. Team battle-tested background agents in largest enterprises before trends; now reworking UX for broader use. Swyx's designer reports senior engineer implementing Figma visions fluidly.

Upcoming: Devin 3.0.

Source: Scott's comments (secondhand via @swyx).

Latent Space
youtube 1 doc

Doug O'Laughlin (SemiAnalysis founder, semiconductor expert using AI in production research workflows) shares firsthand Claude Code (Claude 3.5 Sonnet/Opus 4.5-4.6) usage starting Dec 2024, one-shotting MVPs, dashboards, and financial models (e.g., input positions/notes; output portfolio, risk analysis, investment frameworks with rubrics scoring X/Y out of 10).

Workflow tips:

  • Separate task/rubric prompts to reduce sycophancy/bias; fresh 1M context windows minimize rot.
  • Pre-load visualization skills (summarize 70 books into 90-token style guide for SemiAnalysis charts).
  • Build iterative research agents (e.g., fine-tune time-series models on NAND/DRAM prices via APIs/search).
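A minimal sketch of the task/rubric separation: the grader runs in a fresh context that sees only the rubric and the artifact, never the task history that produced it. `call_model` is a hypothetical stand-in for a real provider API.

```python
def grade(work, rubric, call_model):
    # Fresh context: rubric only, no task framing or chat history,
    # so the grader can't be biased toward (or sycophantic about)
    # the conversation that produced the work.
    fresh_context = [
        {"role": "system", "content": rubric},  # the scoring criteria
        {"role": "user", "content": work},      # the artifact to score
    ]
    return call_model(fresh_context)

# Stand-in model so the sketch runs without an API key:
fake_model = lambda msgs: f"graded {len(msgs)} messages, rubric-first"
print(grade("portfolio risk memo", "Score rigor and clarity out of 10.",
            fake_model))
```

Keeping the two prompts in separate sessions is also what keeps each context window fresh, addressing the "context rot" concern in the same breath.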

Quantitative: Claude commits now 5% of GitHub (chart auto-generated via commit scraping); outperforms hiring case studies.

Comparisons: Better than Codex 5.2 (less seamless/agentic); Kimi 2.5 agent swarms boost perf (vs. Claude's experimental agent teams lacking RL). Excel via Claude+Python > native Excel.

Patterns: Hygiene/review as expert oversight (AI = junior analyst sans meta-learning); subagents for clean task handoff. Surprising: All info work (not just code) automatable; IDEs/Excel obsolete.

Romain Huet
Profile 1 doc

Peter Steinberger, creator of OpenClaw (viral open-source personal AI agent; Clawcon drew 1000 attendees; WSJ coverage), attributes his 90k GitHub contributions across 120+ projects in the past year to AI coding agents, especially post-Oct/Nov switch to Codex (highest trust among tools tried; quantum leap with o1-preview).

Firsthand workflows (serious side projects/production-like scale):

  • Dump codebase as huge markdown file (~1.5MB), prompt Gemini 1.5 for spec, feed to Claude/Codex to build/test (e.g., Playwright for login).
  • Conversational prompting: Describe goal, always ask "Do you have any questions?" to clarify assumptions; use voice for efficiency; reflect if slow.
  • Simple setup: No branches/worktrees initially; focus on problem.
  • Ship unread code: Mentally model output; optimize codebase for agents, not humans.
  • PR review as 'prompt requests': Ask Codex intent, rebuild optimally (often systemic/architectural).

Contrarian takes: 'Vibe coding' is a slur—treat as skill like guitar; you'll be replaced by AI users. Advice: Playfully build desired projects.

OpenClaw highlights: Built iteratively with Codex (self-modifying); agentic problem-solving (e.g., auto-transcribed voice via FFMPEG/curl/OpenAI).

swyx
x 6 docs

Claude Code (Anthropic coding tool) one-year anniversary:

  • New /remote-control feature enables continuing local sessions from phone; rolled out to all Max users [@_catwu, Anthropic].
  • Used by developers for weekend projects, production apps at world's largest companies, and Mars rover planning [@bcherny, Anthropic].
  • Writes 4% of all GitHub code, projected 25-50% by 2027.

Retrospective podcast by @swyx (@latentspacepod host, AI practitioner) with superuser @fabknowledge (@SemiAnalysis_): Claude Code for finance/semiconductor analysis workflows. https://latent.space/p/valuemule

Kevin Hou
Profile 1 doc

Antigravity is an agentic dev tool from Google that handles infrastructure, coding, research, and building via Agent Manager, a first-class agent UI separate from the editor.

Firsthand workflows from builders (Kevin Hou, Head of Product Engineering; Andy Zhang, engineer, both use it daily building Antigravity):

  • Iterative planning & verification: Prompt agent (e.g., Gemini 1.5 Flash) for Olympics medal app; review/update implementation plan & verification plan via comments on tables/artifacts; approve to build/test.
  • Browser agent orchestration: Main agent delegates browser tasks to sub-agent using computer use model for clicking/scrolling/screenshotting/verification; user approves URLs/actions; annotate screenshots for changes (e.g., "make header super blue").
  • Parallel iteration: Run multiple agents/workspaces; comment/highlight on images/text for context; revert per-message diffs.
  • Personal hacks: Review plans before code changes (80% of time on plans for large codebases); delegate repetitive tests (e.g., flows taking 2 min ×10 → prompt once); understand codebase at abstraction levels.

Timeless patterns: Human-in-loop via artifact comments/approvals; agentic loops (plan→implement→verify→iterate); multi-agent delegation; sandboxed file access/terminal approvals; default planning mode.

Supports OpenAI/Anthropic models; import VS Code workspaces; Git revert.
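The plan→implement→verify→iterate loop with a human approval gate can be sketched as a small driver; all names here are hypothetical stand-ins (real systems put an LLM behind plan/implement and a browser or test runner behind verify):

```python
def run_agent(task, plan, implement, verify, approve, max_rounds=3):
    for _ in range(max_rounds):
        steps = plan(task)
        if not approve(steps):       # human-in-the-loop gate on the plan
            continue                 # (comment/revise, then re-plan)
        artifact = implement(steps)
        if verify(artifact):         # verification plan, not just a diff
            return artifact
    return None                      # gave up after max_rounds iterations

artifact = run_agent(
    task="medal table app",
    plan=lambda t: [f"build {t}", "add tests"],
    implement=lambda steps: {"files": steps},
    verify=lambda a: len(a["files"]) == 2,
    approve=lambda steps: True,      # auto-approve for the sketch
)
print(artifact)
```

The gate sits on the plan rather than the diff, matching the "80% of time on plans" habit above: review happens where it is cheapest.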

Simon Willison's Weblog

"First run the tests" Pattern

Start coding agent sessions with "First run the tests" to activate test suite discovery, instill a testing mindset, and gauge project scale.

For Python projects using uv, the prompt is: Run "uv run pytest" (set up via pyproject.toml dependency groups).

Specific benefits:

  • Forces agent to run tests, ensuring future usage
  • Reveals codebase size/complexity via test count
  • Encourages agent-written tests

Tests are vital for validating AI-generated code and onboarding agents to codebases (e.g., Claude Code reads them), countering old excuses about test maintenance—agents write them quickly. Agents are biased toward testing, and existing suites amplify that bias.
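A session opener in the spirit of this pattern can be as small as the sketch below (stdlib unittest standing in for a discovered suite; a real agent would invoke the project's own runner, e.g. pytest):

```python
import unittest

# "First run the tests": before any edits, load the suite and report how
# many tests exist (a rough proxy for project scale) and whether they pass.
class Placeholder(unittest.TestCase):   # stands in for a discovered suite
    def test_smoke(self):
        self.assertEqual(1 + 1, 2)

def first_run_the_tests():
    suite = unittest.TestLoader().loadTestsFromTestCase(Placeholder)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.testsRun, result.wasSuccessful()

count, ok = first_run_the_tests()
print(f"{count} tests discovered, passing={ok}")
```

The test count is the "reveals codebase size" benefit above; the pass/fail status establishes a baseline before the agent changes anything.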

Timeless pattern akin to "Use red-green TDD"; see guide.

Firsthand from Simon Willison, who uses this on his own projects.

Latent.Space
latent 1 doc

Doug O'Laughlin (SemiAnalysis) uses Claude Code for firm work, treating it as a junior analyst that performs tedious information gathering: "This crap makes mistakes all the time... after doing this enough times, there’s a meta level thinking... I don’t think that meta level learning is there yet." It amplifies experts by offloading painful tasks but requires oversight to filter slop.

Doug calculated ~4% of GitHub is written by Claude Code: https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point

Episode covers Claude Code workflow setup (39:44).

Addy Osmani
x 3 docs

Addy Osmani (Director, Google Cloud AI) advises caution with /init and AGENTS.md files in coding agents: treat them as a living list of codebase smells to fix rather than permanent config.

Auto-generated files hurt agent performance and inflate costs by duplicating discoverable info; human-written ones help only for non-discoverable information like tooling gotchas, non-obvious conventions, and landmines.

Contrarian take: A single root AGENTS.md is insufficient for complex codebases; you need a hierarchy of AGENTS.md files at directory/module levels, automatically maintained for precisely scoped context.

References @theo's study arguing for deleting CLAUDE.md/AGENTS.md: https://x.com/theo/status/2025900730847232409 (video).

Long-form blog: https://addyosmani.com/blog/agents-md/.

Theo - t3․gg
youtube 1 doc

Theo (full-stack TypeScript dev, t3.gg, T3 Chat creator using models in production) shares firsthand agent orchestration insights:

  • Cursor legitimately distills Claude (Opus/Sonnet) API data—user prompts/outputs—to train cheaper composer models, subsidizing costs vs. retail API pricing.

Agent tool call loop example (update homepage pricing):

  • Initial analysis → tool call (grep pricing)
  • Review results → batch read files
  • Apply edits

Each tool response ends an exchange; the full history is resent for the next, yielding 10s–100s of exchanges per prompt (e.g., deep analysis).
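A toy version of that loop (message shapes are hypothetical; real provider APIs differ). The key point is that every exchange resends the full history, so one user prompt fans out into many model calls:

```python
# Fake model: returns a tool call until it has seen two tool results,
# then a final answer. A real agent would call an LLM API here.
def fake_model(history):
    tool_results = sum(1 for m in history if m["role"] == "tool")
    if tool_results == 0:
        return {"tool": "grep", "args": "pricing"}
    if tool_results == 1:
        return {"tool": "read_files", "args": ["Pricing.tsx"]}
    return {"final": "edits applied"}

history = [{"role": "user", "content": "update homepage pricing"}]
exchanges = 0
while True:
    reply = fake_model(history)   # full history goes over the wire again
    exchanges += 1
    if "final" in reply:
        break
    history.append({"role": "assistant", "content": str(reply)})
    history.append({"role": "tool", "content": f"<output of {reply['tool']}>"})

print(f"{exchanges} exchanges for one prompt")
```

Three exchanges for a trivial edit; scale that to a deep-analysis prompt and the 10s–100s figure follows directly.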

T3 Chat multi-search: 3-4 exchanges/request.

SWE-bench (2,294 Python tasks): ~50 tool calls/task → ~115k exchanges/run.

Theo used Claude Code (custom endpoints) for JS coding; distillation from his inputs/outputs could yield a JS-specialized model.

Anthropic accuses labs (DeepSeek etc.) of targeting Claude's agentic reasoning/tool use/coding via 150k–13M exchanges; Theo counters that this volume is trivial (~T3 Chat daily).

Kent C. Dodds ⚡
x 5 docs

Kent C. Dodds (@kentcdodds), dev educator and MVP, uses Cursor AI cloud agents from his phone to ship code.

Closing the agent loop is now essential; Cursor excels by giving cloud agents computers.

New Cursor feature: Agents run built software and send demo videos (not diffs).

Firsthand usage at production scale for shipping; even better on desktop. Free workshop on web dev with Cursor.

geoff
x 1 doc

@GeoffreyHuntley built business cards with Cursor AI using specs from the https://latentpatterns.com design system.

Technique: Treat the design system as specs for generation in Cursor AI (firsthand example from the practitioner building latentpatterns.com).

Kent C. Dodds ⚡
x 6 docs

Kent C. Dodds (@kentcdodds), dev educator who ships extensively from his phone, credits Cursor AI's cloud agents—with computers enabling them to use the built software and send video demos instead of diffs—for allowing him to "build and ship so many things from my phone over the last month". Combined with GPT 5.3, this is the "biggest step change" in his software building in the last year. He highlights closing the agent loop as key, praising Cursor's implementation. Demo workflow shown in video. Firsthand production use.

Kent C. Dodds ⚡
x 1 doc

Kent C. Dodds (@kentcdodds), a prominent dev educator, shared a firsthand experience using Cursor AI's demo feature to preview web page updates remotely without pulling code locally—discovering it had zoomed in on his nose on the /credits page: "lol, really glad I had @cursor_ai’s demo feature. Otherwise I would have had to pull this down locally to find out it updated my /credits page to zoom in on my nose 👃 😂".

Kent C. Dodds ⚡
x 2 docs

Kent C. Dodds (@kentcdodds), a Dev Educator and MVP, states that closing the agent loop is now essential, praising @cursor_ai for equipping their cloud agents with computers.

He is presenting on this topic plus web development with Cursor in a free workshop: https://luma.com/pyi2sdlo.

geoff
x 1 doc

@GeoffreyHuntley discovered a new technique for self-optimising agents and self-optimising tool calls for agents, planning to package and release it.