ZeroNoise
Cursor’s “demos, not diffs” makes async agents mergeable
Feb 25
185 docs
Cursor’s “demos, not diffs” push is turning async agents into mergeable teammates by having them self-test and return video proof—multiple practitioners report it breaks the review bottleneck. Also: Claude Code remote control, worktrees, Slack context plumbing, and two evergreen agent patterns from Simon Willison (run tests first + generate codebase walkthroughs).

🔥 TOP SIGNAL

Cursor’s big unlock this week is “demos, not diffs”: cloud agents can run the software they just built, test it end-to-end, and send you a video artifact as proof. Practitioners say this flips async agents from “fun but hard to trust” to “mergeable”: Jediah Katz reports that over the last two months more than 50% of his PRs were written by cloud agents, once they could self-test and send videos.

🛠️ TOOLS & MODELS

  • Cursor Cloud Agents — “computer use” + video demo artifacts (shipping)

    • Agents can onboard to your repo, use a cloud computer/remote desktop, and return video demos of the finished change.
    • Cursor: “A third of the PRs we merge now come from agents running in cloud sandboxes.”
    • Cursor CEO Michael Truell: “Over a third of our PRs are now created autonomously with this feature.”
    • Internal example: Cursor agents modify Cursor itself (e.g., adding secret redaction to model tool calls) and return a multi-chapter demo video after E2E verification.
    • Try/read: http://cursor.com/onboard · http://cursor.com/blog/agent-computer-use
  • Claude Code — Remote Control (rolled out to all Max users)

    • /remote-control lets you start a local terminal session, then continue it from your phone.
    • Boris Cherny says he’s been using it daily.
  • Claude Code — Slack plugin (context + updates)

    • Install with /plugin install slack to connect Slack for search, messaging, doc creation, and pulling work context into Claude Code.
  • Claude Code — built-in git worktrees + tmux flags

    • New flags: -w, --worktree [name] and --tmux; each session runs in its own worktree to avoid branch-switching chaos.
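
As a rough illustration of what per-session worktrees buy you, the sketch below builds the plain git worktree add command that gives a session its own checkout and branch. The helper name worktree_add_command and the sibling-directory layout are assumptions made for illustration; this is not a description of how Claude Code lays out worktrees internally.

```python
from pathlib import PurePosixPath


def worktree_add_command(repo: str, name: str) -> list[str]:
    """Build a `git worktree add` command for an isolated session checkout.

    Assumption for illustration: the worktree lives in a sibling directory
    named <repo>-<name>, on a new branch called <name>.
    """
    repo_path = PurePosixPath(repo)
    target = repo_path.parent / f"{repo_path.name}-{name}"
    # -C runs git inside the main repo; -b creates the session's own branch.
    return ["git", "-C", str(repo_path), "worktree", "add", "-b", name, str(target)]
```

Two sessions set up this way can edit, build, and commit in parallel without either one switching branches in a shared checkout.
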
  • Claude Code — notable performance datapoint

    • Reported: p99 memory usage dropped 40× in the last two weeks, and 6× since January, while shipping new features.
  • Devin (Cognition) — enterprise-first PMF story + self-serve UX catch-up

    • Scott (via @swyx): Devin didn’t have internal PMF at launch; first enterprise adoption took ~6 months; “async agents are the final boss of agent UX.”
    • Claimed growth: after landing, per-enterprise usage doubled every 2 months in 2025 and has accelerated to doubling every 6 weeks so far this year; internal usage is now 4× its 2025 peak.
    • Devin 2.2: sprint to pay down self-serve UX debt; omnibox; tighter “close the loop” integration with Devin Review.

💡 WORKFLOWS & TRICKS

  • Close the agent loop with “proof artifacts,” not trust

    • Jediah Katz’s bottleneck framing: review/testing was the limiter (“you’re responsible… to deliver code you have proven to work”); video demos from agents change what he can confidently merge without a local checkout.
    • Kent C. Dodds calls this “closing the agent loop” and credits Cursor’s computer-equipped cloud agents as a major step change for shipping from his phone.
  • “First run the tests” as your session opener (Simon Willison)

    • Prompt: “First run the tests” to force test-suite discovery and put the agent into a testing mindset.
    • Willison’s claim: automated tests are no longer optional when working with coding agents; if code hasn’t been executed, it’s luck if it works in production.
    • If you use uv in Python, he prompts: Run "uv run pytest".
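
The same "run the tests before touching anything" habit can be applied outside a prompt. Below is a minimal sketch, assuming you want a baseline pass/fail record before an agent session starts; the function name run_tests_first and the shape of the returned dict are hypothetical, not something Willison describes.

```python
import subprocess


def run_tests_first(command: list[str]) -> dict:
    """Run the test command once and summarize the result as a baseline."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": " ".join(command),
        # Test runners such as pytest exit with 0 only when the run succeeded.
        "passed": proc.returncode == 0,
        # Keep the last few lines, which usually hold the summary.
        "tail": proc.stdout.strip().splitlines()[-5:],
    }
```

With uv installed, the command from the prompt above would be passed as run_tests_first(["uv", "run", "pytest"]). If the baseline already fails, you learn that before attributing the failure to the agent's changes.
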
  • Generate a “linear walkthrough” doc for any repo (also Simon Willison)

    • Use an agent to read the source and produce a structured walkthrough, which is especially helpful if you “prompted the whole thing into existence” and now need to understand it.
    • Willison’s implementation detail: use Showboat so the agent includes code snippets by running commands (showboat exec + sed|grep|cat) instead of manual copy/paste, which reduces hallucination risk.
    • Example prompt (verbatim):

"Read the source and then plan a linear walkthrough of the code that explains how it all works in detail"

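The idea behind quoting code by running commands can be sketched without Showboat itself. The example below is a standard-library approximation under that assumption: it does not call Showboat or mirror its interface, and the helper name snippet_from_command is hypothetical. It only shows why executed output is more trustworthy than retyped code: the snippet text comes from the command, not from the model.

```python
import shlex
import subprocess


def snippet_from_command(command: list[str]) -> str:
    """Run a command and return a walkthrough entry holding its real output."""
    # check=True makes a failing command raise instead of yielding an empty snippet.
    proc = subprocess.run(command, capture_output=True, text=True, check=True)
    # Record the exact command so a reader can reproduce the snippet.
    return f"$ {shlex.join(command)}\n{proc.stdout.rstrip()}\n"
```

For instance, snippet_from_command(["sed", "-n", "1,20p", "app.py"]) would embed the first twenty lines of a hypothetical app.py exactly as they exist on disk.
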
  • Peter Steinberger’s “conversational agent” habit: always ask for questions

    • He treats coding with agents as a conversation and repeatedly asks: “Do you have any questions?” to surface hidden assumptions (models otherwise default to assumptions).
  • PR review as intent review (not code review)

    • Steinberger’s PR loop: first ask the model if it understands the intent of the PR and whether it’s the optimal solution; often the right fix is architectural/systemic.
  • Rubric separation to reduce “context rot” and bias (Doug O’Laughlin)

    • He keeps task and rubric prompts separate because combining them can commingle information and increase bias/susceptibility; he also calls out sycophancy as a practical failure mode.
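
One possible reading of the rubric-separation pattern, sketched under assumptions: the worker call sees only the task, and a second, fresh call sees the rubric plus the answer. The Model type and the solve_then_grade helper are hypothetical stand-ins for whatever model client you use; nothing here is taken from O’Laughlin’s own setup.

```python
from typing import Callable

# Hypothetical model interface: a prompt string in, a text reply out.
Model = Callable[[str], str]


def solve_then_grade(task: str, rubric: str, model: Model) -> tuple[str, str]:
    """Run the task and the rubric as two separate prompts."""
    # Call 1: the worker sees only the task, never the grading criteria.
    answer = model(task)
    # Call 2: the grader gets the rubric and the answer in a fresh prompt.
    grading_prompt = f"Rubric:\n{rubric}\n\nAnswer to grade:\n{answer}"
    verdict = model(grading_prompt)
    return answer, verdict
```

Keeping the two prompts apart means the worker cannot tailor its answer to the rubric wording, and the grader is not carrying the worker's accumulated context.
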

👤 PEOPLE TO WATCH

  • Jediah Katz (Cursor) — concrete practitioner stat: >50% of PRs written by cloud agents once agents could self-test and send video proof.
  • Michael Truell (Cursor CEO) — production signal: >⅓ of Cursor PRs now created autonomously with demos.
  • Boris Cherny (Anthropic) — on-the-record: Claude Code does 100% of his coding; he “doesn’t write any of it anymore”.
  • Simon Willison — turning agent work into repeatable patterns: “First run the tests” + agent-generated linear walkthroughs.
  • Andrej Karpathy — pushing “build for agents”: CLI + Skills/MCP + exportable Markdown docs; argues CLIs are uniquely agent-friendly.

🎬 WATCH & LISTEN

1) Cursor: “A computer for every agent” (video artifacts as proof) (≈ 0:10–0:35)

Hook: Cursor shows agents testing their changes on a real desktop and returning a video artifact that demonstrates the feature works, not just a diff.

2) Cursor demo: “paste GitHub issue → agent works → browser proof” (≈ 0:47–1:05)

Hook: A concrete flow: paste an issue link; the agent works for ~40 minutes, then returns an artifact showing it navigated to the locally running app and verified the result in-browser.

3) Claude Code (Boris Cherny): what changed at Opus 4.5 (≈ 8:02–8:52)

Hook: The shift from “agent does first pass, human fixes” to “agent runs tests, opens the browser, clicks around, and fixes UI issues”, so he no longer opens a text editor.

📊 PROJECTS & REPOS


Editorial take: The day’s theme is verification as a first-class artifact—agents that can run, test, and demo their own work are the ones that actually scale async development.
