ZeroNoise Logo zeronoise
Post
Cursor’s “demos, not diffs” makes async agents mergeable
Feb 25
5 min read
185 docs
Cursor’s “demos, not diffs” push is turning async agents into mergeable teammates by having them self-test and return video proof—multiple practitioners report it breaks the review bottleneck. Also: Claude Code remote control, worktrees, Slack context plumbing, and two evergreen agent patterns from Simon Willison (run tests first + generate codebase walkthroughs).

🔥 TOP SIGNAL

Cursor’s big unlock this week is “demos, not diffs”: cloud agents can run the software they just built, test it end-to-end, and send you a video artifact as proof . Practitioners are saying this flips async agents from “fun but hard to trust” to “mergeable”—Jediah Katz reports that in the last two months >50% of his PRs were written by cloud agents once they could self-test and send videos .

🛠️ TOOLS & MODELS

  • Cursor Cloud Agents — “computer use” + video demo artifacts (shipping)

    • Agents can onboard to your repo, use a cloud computer/remote desktop, and return video demos of the finished change .
    • Cursor: “A third of the PRs we merge now come from agents running in cloud sandboxes.”
    • Cursor CEO Michael Truell: “Over a third of our PRs are now created autonomously with this feature.”
    • Internal example: Cursor agents modifying Cursor (e.g., adding secret redaction to model tool calls) and returning a multi-chapter demo video after E2E verification .
    • Try/read: http://cursor.com/onboard · http://cursor.com/blog/agent-computer-use
  • Claude Code — Remote Control (rolled out to all Max users)

    • /remote-control lets you start a local terminal session, then continue it from your phone.
    • Boris Cherny says he’s been using it daily .
  • Claude Code — Slack plugin (context + updates)

    • Install with /plugin install slack to connect Slack for search, messaging, doc creation, and pulling work context into Claude Code .
  • Claude Code — built-in git worktrees + tmux flags

    • New flags: -w, --worktree [name] and --tmux; each session runs in its own worktree to avoid branch-switching chaos .
  • Claude Code — notable performance datapoint

    • Reported: p99 memory usage dropped 40× in the last two weeks, and 6× since January, while shipping new features .
  • Devin (Cognition) — enterprise-first PMF story + self-serve UX catch-up

    • Scott (via @swyx): Devin didn’t have internal PMF at launch; first enterprise adoption took ~6 months; “async agents are the final boss of agent UX.
    • Claimed growth: usage doubled every 2 months in 2025 per enterprise after landing; accelerated to every 6 weeks so far this year; internal usage now 4× 2025 peak.
    • Devin 2.2: sprint to pay down self-serve UX debt; omnibox; tighter “close the loop” integration with Devin Review .

💡 WORKFLOWS & TRICKS

  • Close the agent loop with “proof artifacts,” not trust

    • Jediah Katz’s bottleneck framing: review/testing was the limiter (“you’re responsible… to deliver code you have proven to work”); video demos from agents shift what he can confidently merge without local checkout .
    • Kent C. Dodds calls this “closing the agent loop” and credits Cursor’s computer-equipped cloud agents as a major step change for shipping from his phone .
  • “First run the tests” as your session opener (Simon Willison)

    • Prompt: “First run the tests” to force test-suite discovery and put the agent into a testing mindset .
    • Willison’s claim: automated tests are no longer optional when working with coding agents; if code hasn’t been executed, it’s luck if it works in production .
    • If you use uv in Python, he prompts: Run "uv run pytest".
  • Generate a “linear walkthrough” doc for any repo (also Simon Willison)

    • Use an agent to read the source and produce a structured walkthrough—especially helpful if you “prompted the whole thing into existence” and now need to understand it .
    • Willison’s implementation detail: use Showboat so the agent includes code snippets by running commands (showboat exec + sed|grep|cat) instead of manual copy/paste (reduces hallucination risk) .
    • Example prompt (verbatim):

"Read the source and then plan a linear walkthrough of the code that explains how it all works in detail"

  • Peter Steinberger’s “conversational agent” habit: always ask for questions

    • He treats coding with agents as a conversation and repeatedly asks: “Do you have any questions?” to surface hidden assumptions (models otherwise default to assumptions) .
  • PR review as intent review (not code review)

    • Steinberger’s PR loop: first ask the model if it understands the intent of the PR and whether it’s the optimal solution; often the right fix is architectural/systemic .
  • Rubric separation to reduce “context rot” and bias (Doug O’Laughlin)

    • He keeps task and rubric prompts separate because combining them can commingle information and increase bias/susceptibility; he also calls out sycophancy as a practical failure mode .

👤 PEOPLE TO WATCH

  • Jediah Katz (Cursor) — concrete practitioner stat: >50% of PRs written by cloud agents once agents could self-test and send video proof .
  • Michael Truell (Cursor CEO) — production signal: >⅓ of Cursor PRs now created autonomously with demos .
  • Boris Cherny (Anthropic) — on-the-record: Claude Code does 100% of his coding; he “doesn’t write any of it anymore” .
  • Simon Willison — turning agent work into repeatable patterns: “First run the tests” + agent-generated linear walkthroughs.
  • Andrej Karpathy — pushing “build for agents”: CLI + Skills/MCP + exportable Markdown docs; argues CLIs are uniquely agent-friendly .

🎬 WATCH & LISTEN

1) Cursor: “A computer for every agent” (video artifacts as proof) (≈ 0:10–0:35)

Hook: Cursor shows agents testing their changes on a real desktop and returning a video artifact that demonstrates the feature works—not just a diff .

2) Cursor demo: “paste GitHub issue → agent works → browser proof” (≈ 0:47–1:05)

Hook: A concrete flow: paste an issue link; agent works ~40 minutes; returns an artifact showing it navigated to the locally running app and verified the result in-browser .

3) Claude Code (Boris Cherny): what changed at Opus 4.5 (≈ 8:02–8:52)

Hook: The shift from “agent does first pass, human fixes” to “agent runs tests, opens the browser, clicks around, and fixes UI issues”—so he no longer opens a text editor .

📊 PROJECTS & REPOS


Editorial take: The day’s theme is verification as a first-class artifact—agents that can run, test, and demo their own work are the ones that actually scale async development.

Cursor’s “demos, not diffs” makes async agents mergeable
Back to details
Skipped contexts (57)
Simon Willison
x 2 docs
Riley Brown
Profile 1 doc
Andrej Karpathy
x 2 docs
Simon Willison's Weblog
Simon Willison's Weblog
swyx
x 1 doc
swyx
x 2 docs
swyx
x 2 docs
swyx
x 6 docs
Ben Tossell
x 1 doc
swyx
x 4 docs
swyx
x 4 docs
swyx
x 3 docs
swyx
x 2 docs
swyx
x 2 docs
ThePrimeagen
x 1 doc
ThePrimeagen
x 3 docs
Ben Tossell
x 1 doc
ThePrimeagen
x 2 docs
Kent C. Dodds ⚡
x 1 doc