Cursor’s “demos, not diffs” makes async agents mergeable

Coding Agents Alpha Tracker

Feb 25 • 5 min read • 185 docs

Cursor’s “demos, not diffs” push is turning async agents into mergeable teammates by having them self-test and return video proof—multiple practitioners report it breaks the review bottleneck. Also: Claude Code remote control, worktrees, Slack context plumbing, and two evergreen agent patterns from Simon Willison (run tests first + generate codebase walkthroughs).

🔥 TOP SIGNAL

Cursor’s big unlock this week is “demos, not diffs”: cloud agents can run the software they just built, test it end-to-end, and send you a video artifact as proof. Practitioners are saying this flips async agents from “fun but hard to trust” to “mergeable.” Jediah Katz reports that in the last two months >50% of his PRs were written by cloud agents once they could self-test and send videos.

🛠️ TOOLS & MODELS

Cursor Cloud Agents — “computer use” + video demo artifacts (shipping)
- Agents can onboard to your repo, use a cloud computer/remote desktop, and return video demos of the finished change.
- Cursor: “A third of the PRs we merge now come from agents running in cloud sandboxes.”
- Cursor CEO Michael Truell: “Over a third of our PRs are now created autonomously with this feature.”
- Internal example: Cursor agents modifying Cursor (e.g., adding secret redaction to model tool calls) and returning a multi-chapter demo video after E2E verification.
- Try/read: http://cursor.com/onboard · http://cursor.com/blog/agent-computer-use
Claude Code — Remote Control (rolled out to all Max users)
- /remote-control lets you start a local terminal session, then continue it from your phone.
- Boris Cherny says he’s been using it daily.
Claude Code — Slack plugin (context + updates)
- Install with /plugin install slack to connect Slack for search, messaging, doc creation, and pulling work context into Claude Code.
Claude Code — built-in git worktrees + tmux flags
- New flags: -w, --worktree [name] and --tmux; each session runs in its own worktree to avoid branch-switching chaos.
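
The isolation idea behind per-session worktrees can be sketched with plain git. The snippet below is a minimal illustration, not Claude Code's implementation: the helper name `worktree_command`, the `../worktrees` directory layout, and the `agent/` branch prefix are all hypothetical. It only builds the `git worktree add` command and leaves running it to the caller.

```python
from pathlib import Path


def worktree_command(session_name: str, base_dir: str = "../worktrees") -> list[str]:
    """Build a `git worktree add` command for one agent session.

    The directory layout and branch prefix are assumptions made for this
    sketch; they are not taken from Claude Code.
    """
    path = Path(base_dir) / session_name
    branch = f"agent/{session_name}"
    # `git worktree add -b <branch> <path>` creates a new branch and checks
    # it out in its own directory, leaving the main checkout untouched.
    return ["git", "worktree", "add", "-b", branch, path.as_posix()]


if __name__ == "__main__":
    print(" ".join(worktree_command("fix-login")))
```

Because each session gets its own directory and branch, two agents can edit the same repository at once without switching branches underneath each other.
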
Claude Code — notable performance datapoint
- Reported: p99 memory usage dropped 40× in the last two weeks, and 6× since January, while shipping new features.
Devin (Cognition) — enterprise-first PMF story + self-serve UX catch-up
- Scott (via @swyx): Devin didn’t have internal PMF at launch; first enterprise adoption took ~6 months; “async agents are the final boss of agent UX”.
- Claimed growth: per-enterprise usage doubled every 2 months in 2025 after landing, has accelerated to doubling every 6 weeks so far this year, and internal usage is now 4× the 2025 peak.
- Devin 2.2: sprint to pay down self-serve UX debt; omnibox; tighter “close the loop” integration with Devin Review.

💡 WORKFLOWS & TRICKS

Close the agent loop with “proof artifacts,” not trust
- Jediah Katz’s bottleneck framing: review/testing was the limiter (“you’re responsible… to deliver code you have proven to work”); video demos from agents shift what he can confidently merge without a local checkout.
- Kent C. Dodds calls this “closing the agent loop” and credits Cursor’s computer-equipped cloud agents as a major step change for shipping from his phone.
“First run the tests” as your session opener (Simon Willison)
- Prompt: “First run the tests” to force test-suite discovery and put the agent into a testing mindset.
- Willison’s claim: automated tests are no longer optional when working with coding agents; if code hasn’t been executed, it’s luck if it works in production.
- If you use uv in Python, he prompts: Run "uv run pytest".
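
One way to make the opener a habit is to prepend it to every task programmatically. The helper below is a minimal sketch under stated assumptions: `session_opener` and the sample task text are hypothetical, and only the two prompt strings ("First run the tests" and `Run "uv run pytest"`) come from the pattern described above.

```python
from typing import Optional


def session_opener(task: str, test_command: Optional[str] = None) -> str:
    """Prefix a task with a "run the tests first" instruction."""
    opener = "First run the tests."
    if test_command:
        # Name the exact command when the project has a known test runner.
        opener = f'First run the tests. Run "{test_command}".'
    return f"{opener}\n\n{task}"


if __name__ == "__main__":
    # Hypothetical task text, used only to show the resulting prompt shape.
    print(session_opener("Add pagination to the user list.", "uv run pytest"))
```
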
Generate a “linear walkthrough” doc for any repo (also Simon Willison)
- Use an agent to read the source and produce a structured walkthrough, especially helpful if you “prompted the whole thing into existence” and now need to understand it.
- Willison’s implementation detail: use Showboat so the agent includes code snippets by running commands (showboat exec + sed|grep|cat) instead of manual copy/paste (reduces hallucination risk).
- Example prompt (verbatim):

"Read the source and then plan a linear walkthrough of the code that explains how it all works in detail"

Peter Steinberger’s “conversational agent” habit: always ask for questions
- He treats coding with agents as a conversation and repeatedly asks: “Do you have any questions?” to surface hidden assumptions (models otherwise default to assumptions).
PR review as intent review (not code review)
- Steinberger’s PR loop: first ask the model if it understands the intent of the PR and whether it’s the optimal solution; often the right fix is architectural/systemic.
Rubric separation to reduce “context rot” and bias (Doug O’Laughlin)
- He keeps task and rubric prompts separate because combining them can commingle information and increase bias/susceptibility; he also calls out sycophancy as a practical failure mode.
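
Rubric separation can be sketched as two independent model calls. This is one possible reading of the pattern, not O’Laughlin’s actual setup: `Model` is a hypothetical stand-in for any prompt-in, text-out client, and the function names and prompt layout are assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for any chat/completion client: prompt in, text out.
Model = Callable[[str], str]


def run_task(model: Model, task: str) -> str:
    """Call 1: the model sees only the task, never the rubric."""
    return model(f"Task:\n{task}")


def grade(model: Model, rubric: str, output: str) -> str:
    """Call 2: a fresh prompt containing only the rubric and the output."""
    return model(f"Rubric:\n{rubric}\n\nOutput to grade:\n{output}")


def task_then_grade(model: Model, task: str, rubric: str) -> Tuple[str, str]:
    """Run the task and the grading as separate prompts and return both results."""
    output = run_task(model, task)
    verdict = grade(model, rubric, output)
    return output, verdict
```

In this sketch the worker prompt never contains the rubric, so the output cannot be tailored to the scoring criteria, and the grader prompt never contains the task instructions, so it judges only what was produced.
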

👤 PEOPLE TO WATCH

Jediah Katz (Cursor) — concrete practitioner stat: >50% of PRs written by cloud agents once agents could self-test and send video proof.
Michael Truell (Cursor CEO) — production signal: >⅓ of Cursor PRs now created autonomously with demos.
Boris Cherny (Anthropic) — on-the-record: Claude Code does 100% of his coding; he “doesn’t write any of it anymore”.
Simon Willison — turning agent work into repeatable patterns: “First run the tests” + agent-generated linear walkthroughs.
Andrej Karpathy — pushing “build for agents”: CLI + Skills/MCP + exportable Markdown docs; argues CLIs are uniquely agent-friendly.

🎬 WATCH & LISTEN

1) Cursor: “A computer for every agent” (video artifacts as proof) (≈ 0:10–0:35)

Hook: Cursor shows agents testing their changes on a real desktop and returning a video artifact that demonstrates the feature works, not just a diff.

2) Cursor demo: “paste GitHub issue → agent works → browser proof” (≈ 0:47–1:05)

Hook: A concrete flow: paste an issue link; agent works ~40 minutes; returns an artifact showing it navigated to the locally running app and verified the result in-browser.

3) Claude Code (Boris Cherny): what changed at Opus 4.5 (≈ 8:02–8:52)

Hook: The shift from “agent does first pass, human fixes” to “agent runs tests, opens the browser, clicks around, and fixes UI issues”, so he no longer opens a text editor.

📊 PROJECTS & REPOS

Showboat (Simon Willison) — a tool designed so agents can build trustworthy walkthrough documents using executed commands + captured output (instead of pasted snippets): https://github.com/simonw/showboat
“present” (Simon Willison’s SwiftUI app repo) + generated walkthrough
- Repo: https://github.com/simonw/present
- Walkthrough doc: https://github.com/simonw/present/blob/main/walkthrough.md
Polymarket CLI — positioned as a terminal interface agents can use to query markets/place trades/pull data.

Editorial take: The day’s theme is verification as a first-class artifact—agents that can run, test, and demo their own work are the ones that actually scale async development.