4-day autonomous agents, Cursor MCP Apps, and the push from code review to evidence
Mar 4
4 min read
111 docs
Cursor claims a 4-day fully autonomous agent run produced a stronger-than-official solution to a math research challenge—suggesting coordination techniques may generalize beyond coding. Also: Cursor’s MCP Apps (interactive UIs in-chat), model/tool value debates (Codex vs others), and concrete execution patterns like Implementation Plans, 95%-confidence autopilot loops, and async checkpoints.

🔥 TOP SIGNAL

Cursor’s CEO says their agent harness ran fully autonomously for 4 days (no nudges/hints) and produced what they believe is a novel, stronger solution to Problem Six of the First Proof math research challenge—an early signal that “scaling agent coordination” may generalize beyond coding tasks. The claimed improvements include using the Marcus–Spielman–Srivastava interlacing polynomial method, improving a constant from c=0.03 → 0.13, and partitioning the entire vertex set into light components (vs a subset).

🛠️ TOOLS & MODELS

  • Cursor — MCP Apps support (new): Cursor now supports MCP Apps, so agents can render interactive UIs inside conversations.

  • OpenAI Codex — “most agentic coding per dollar” (practitioner claim): Romain Huet says Codex is currently the best option by far for agentic coding value.

  • Antigravity (agentic coding platform) — “Implementation Plan” + screenshot-to-Flutter UI

    • Recommended flow: request an “Implementation Plan” artifact first, review/edit the markdown architecture/logic, then approve execution—explicitly warning “don’t let AI write code blindly”.
    • “Screenshot → functional Flutter UI” demo: drop a screenshot and ask to rebuild as Flutter UI; described as powered by Gemini 3 Flash and launching on-device.
  • Claude Opus 4.5 / 4.6 (Copilot workflow) — quality jump (firsthand): Burke Holland describes Opus as a practical inflection point for building tools quickly, contrasting it with Sonnet 3.5 output he calls “spaghetti code” and “willy nilly” changes.

💡 WORKFLOWS & TRICKS

  • Steal this: “Implementation Plan → approve → execute” as your default safety rail (Antigravity)

    1. Ask the agent for an Implementation Plan artifact first.
    2. Review and edit the architecture + markdown logic yourself.
    3. Only then approve execution (the explicit goal: control the outcome vs blind codegen).
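The three steps above boil down to a human approval gate between planning and code generation. A minimal sketch, where `ask_agent_for_plan` and `execute_plan` are hypothetical stand-ins for real agent calls:

```python
# Sketch of the "Implementation Plan -> approve -> execute" safety rail.
# ask_agent_for_plan / execute_plan are hypothetical stand-ins for agent calls;
# the point is the mandatory human review gate between them.

def ask_agent_for_plan(task):
    # Stand-in: a real agent would return a markdown Implementation Plan.
    return f"# Implementation Plan\n\n- Outline architecture for: {task}\n"

def execute_plan(plan):
    # Stand-in: a real agent would generate code from the approved plan.
    return f"executed:\n{plan}"

def plan_approve_execute(task, approve):
    """Never let the agent write code until a human edits/approves the plan."""
    plan = ask_agent_for_plan(task)
    edited = approve(plan)       # human reviews; may edit, or reject (None)
    if edited is None:
        return None              # rejected: no code is ever generated
    return execute_plan(edited)

# Demo: auto-approve the plan unchanged.
result = plan_approve_execute("add OAuth login", approve=lambda p: p)
```

The `approve` callback is where the "review and edit the markdown" step lives; rejecting the plan short-circuits execution entirely.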
  • Plan mode isn’t about the plan—it’s about flushing missing constraints (Burke Holland)

    • Start in “plan mode” and do 4–6 loops where the agent proposes what you forgot to specify + multiple options, before you let it implement.
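The constraint-flushing loop can be sketched as repeated "what did I forget?" rounds before any implementation. `propose_gaps` is a hypothetical stand-in for the agent's proposal step:

```python
# Sketch of plan mode as a constraint-flushing loop: before implementing,
# repeatedly ask the agent what the spec forgot. propose_gaps is a stand-in
# for a real agent call that lists unstated constraints.

def propose_gaps(spec):
    # Stand-in: return up to two plausible constraints the spec omits.
    known = ["auth", "error handling", "pagination", "rate limits", "i18n"]
    return [g for g in known if g not in spec][:2]

def plan_mode(spec, loops=5):
    for _ in range(loops):           # 4-6 loops per the described workflow
        gaps = propose_gaps(spec)
        if not gaps:
            break                    # nothing left unspecified: ready to implement
        spec += " | " + " | ".join(gaps)   # human accepts/edits the proposals
    return spec

final = plan_mode("build a REST API with auth")
```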
  • Autopilot / loop-until-confidence (Burke Holland)

    • Run the agent in a loop that feeds its output back into itself, but change the stop condition from “until it’s done” to “until you have 95% confidence it’s done”.
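The change is purely in the loop's exit test: stop on a self-assessed confidence score, not on "done". A minimal sketch, with `improve` and `self_assess` as hypothetical stand-ins for agent calls:

```python
# Sketch of the "loop until 95% confident" autopilot. improve/self_assess
# stand in for real agent calls: each pass feeds the previous output back in,
# and the loop exits on confidence, not on the agent declaring itself done.

def improve(draft):
    return draft + " +pass"          # stand-in for one self-revision pass

def self_assess(draft):
    # Stand-in: a real agent would be asked "how confident are you this is
    # complete and correct?" Here confidence grows with revision passes.
    return min(1.0, 0.5 + 0.12 * draft.count("+pass"))

def autopilot(task, threshold=0.95, max_iters=20):
    draft = task
    confidence = self_assess(draft)
    iters = 0
    while confidence < threshold and iters < max_iters:
        draft = improve(draft)       # feed output back into the agent
        confidence = self_assess(draft)
        iters += 1
    return draft, confidence

draft, conf = autopilot("refactor auth module")
```

The `max_iters` cap matters in practice: a confidence threshold alone can loop forever if the self-assessment plateaus.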
  • Task classification + model routing + sub-agent fanout (multi-model orchestration) (Burke Holland)

    • Use a “front” agent to classify tasks as easy/medium/hard and change the workflow accordingly (hard tasks: plan + sub-agents + farm-out work).
    • In the described Copilot setup, different models can be used in one run (example routing: Gemini for design, other models for refactoring) and scaled up to many sub-agents—but the workflow must still output something verifiable.
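A routing skeleton for the pattern above might look like this. The classifier heuristic and model names (`model-a`, `model-b`) are illustrative assumptions, not the described Copilot setup:

```python
# Sketch of "front agent classifies, workflow changes accordingly": easy tasks
# go straight to one model; hard tasks get a plan, sub-agent fanout, and a
# final verification step so the run outputs something checkable.

def classify(task):
    # Stand-in for a front agent; a real one would ask a cheap model.
    words = len(task.split())
    if words <= 3:
        return "easy"
    return "hard" if "migrate" in task or words > 8 else "medium"

def route(task):
    difficulty = classify(task)
    if difficulty == "easy":
        return {"difficulty": "easy", "steps": [("model-a", task)]}
    if difficulty == "medium":
        return {"difficulty": "medium",
                "steps": [("model-b", "plan"), ("model-b", task)]}
    # hard: plan first, farm subtasks out to sub-agents, then verify
    subtasks = [f"{task} (part {i})" for i in range(1, 4)]
    steps = [("model-b", "plan")] + [("sub-agent", s) for s in subtasks]
    steps.append(("verifier", "produce evidence"))  # output must be verifiable
    return {"difficulty": "hard", "steps": steps}

plan = route("migrate billing service to the new schema")
```

The verifier step at the end reflects the constraint in the text: scaling to many sub-agents is only safe if the workflow still emits evidence a human (or checker) can inspect.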
  • Async agent + human checkpoints (Burke Holland)

    • Pattern: give the CLI a big job, walk away, and have it message you (example: Telegram) with progress + a “what next?” checkpoint so you can approve/deny and let it continue.
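The checkpoint loop can be sketched as follows. The Telegram Bot API `sendMessage` endpoint is real; the `send` and `wait_for_reply` callbacks are injectable here so the flow itself runs without network access, and the stage names are hypothetical:

```python
# Sketch of the async checkpoint pattern: a long-running job pings a human
# after each stage and blocks until an approve/deny verdict arrives.
import json
import urllib.request

def telegram_send(token, chat_id, text):
    # Real call shape for the Telegram Bot API sendMessage method.
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = json.dumps({"chat_id": chat_id, "text": text}).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def run_with_checkpoints(stages, send, wait_for_reply):
    """Run stages; after each, message progress + 'what next?' and await a verdict."""
    done = []
    for stage in stages:
        done.append(f"finished {stage}")
        send(f"Progress: {done[-1]}. What next: continue with remaining stages?")
        if wait_for_reply() != "approve":  # human replied 'deny' (or timed out)
            break
    return done

# Demo: capture messages locally instead of hitting the network.
sent = []
log = run_with_checkpoints(
    ["scaffold", "implement", "test"],
    send=sent.append,
    wait_for_reply=lambda: "approve",
)
```

In a real setup, `send` would be `telegram_send` with your bot token, and `wait_for_reply` would poll `getUpdates` (or a webhook) for the human's response.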
  • Reality check: “polish” is still synchronous (Kent C. Dodds)

    • Kent calls out that with cloud agents, polish requires real-time back-and-forth while you try outputs and iterate—hard to do asynchronously from phone/desktop today.

👤 PEOPLE TO WATCH

  • Michael Truell (Cursor) — concrete evidence of long-horizon autonomy: same harness previously used to “build a browser from scratch,” now used for a 4-day autonomous run on a math research problem.

  • Burke Holland (GitHub Copilot DevRel) — unusually replicable patterns for “agent experience”: plan-mode loops, 95% confidence autopilot loops, and multi-model orchestration with evidence requirements.

  • Simon Willison — frames the core bottleneck as security review at scale: treat coding agents like “teams of mixed ability engineers” shipping under deadline pressure; security issues are “directly harmful” vs survivable code quality issues.

  • swyx (+ Ankitxg) — continued push to remove review bottlenecks: calls “killing the Code Review” the “Final Boss of Agentic Engineering,” pointing to a layered playbook and “Dark Factory” anecdotes (no human code and no human review).

🎬 WATCH & LISTEN

1) Changelog — “Plan mode” loops that prevent bad prompts (≈20:55–22:04)

Hook: plan mode as a structured way to surface what you forgot to ask for, plus multiple implementation options before execution.

2) Changelog — Autopilot: loop until 95% confidence (≈22:16–23:03)

Hook: changing the stopping condition (“until it’s done” → “until 95% confident”) to force deeper self-checking iterations.

📊 PROJECTS & REPOS


Editorial take: The frontier is shifting from “write code” to run loops + produce evidence—and the hardest unsolved piece is how you scale review (especially security) without slowing agents back down.

Summary
Coverage start
Mar 3 at 7:00 AM
Coverage end
Mar 4 at 7:00 AM
Frequency
Daily
Published
Mar 4 at 8:10 AM
Reading time
4 min
Research time
1 hr 41 min
Documents scanned
111
Documents used
19
Citations
30
Sources monitored
110 / 110
Insights
Skipped contexts
Source details
Source Docs Insights Status
Lukas Möller 0 0
Jediah Katz 2 1
Aman Karmani 2 0
Jacob Jackson 0 0
Cursor Blog | RSS Feed 0 0
Nicholas Moy 0 0
Mike Krieger 0 0
Sualeh Asif 0 0
Michael Truell 6 1
Google Antigravity 7 1
Aman Sanger 0 0
cat 0 0
Mark Chen 0 0
Greg Brockman 0 0
Tongzhou Wang 0 0
fouad 0 0
Calvin French-Owen 0 0
Hanson Wang 0 0
Ed Bayes 0 0
Alexander Embiricos 0 0
Tibo 4 0
Romain Huet 2 1
DHH 6 0
Jane Street Blog 0 0
Miguel Grinberg's Blog: AI 0 0
xxchan's Blog 0 0
<antirez> 0 0
Brendan Long 0 0
The Pragmatic Engineer 0 0
David Heinemeier Hansson 0 0
Armin Ronacher ⇌ 9 0
Mitchell Hashimoto 0 0
Armin Ronacher's Thoughts and Writings 0 0
Peter Steinberger 0 0
Theo - t3.gg 2 0
Sourcegraph 0 0
Anthropic 0 0
Cursor 0 0
LangChain 0 0
Anthropic 0 0
LangChain Blog 0 0
LangChain 4 0
Cursor 3 1
Riley Brown 0 0
Riley Brown 4 0
Jason Zhou 0 0
Boris Cherny 0 0
Mckay Wrigley 0 0
geoff 7 0
Peter Steinberger 🦞 6 0
AI Jason 0 0
Alex Albert 0 0
Latent.Space 0 0
Logan Kilpatrick 2 0
Fireship 0 0
Fireship 0 0
Kent C. Dodds ⚡ 7 1
Practical AI 0 0
Practical AI Clips 0 0
Stories by Steve Yegge on Medium 0 0
Kent C. Dodds Blog 0 0
ThePrimeTime 1 0
Theo - t3․gg 0 0
ThePrimeagen 15 0
Ben Tossell 0 0
swyx 11 1
AI For Developers 0 0
Geoffrey Huntley 0 0
Addy Osmani 3 0
Andrej Karpathy 0 0
Simon Willison 5 1
Matthew Berman 0 0
Changelog 1 1
Simon Willison’s Newsletter 0 0
Agentic Coding Newsletter 0 0
Latent Space 0 0
Simon Willison's Weblog 2 0
Elevate 0 0
Lukas Möller 0 0
Jediah Katz 0 0
Sualeh Asif 0 0
Mike Krieger 0 0
Michael Truell 0 0
Cat Wu 0 0
Kevin Hou 0 0
Aman Sanger 0 0
Nicholas Moy 0 0
Andrey Mishchenko 0 0
Jerry Tworek 0 0
Romain Huet 0 0
Thibault Sottiaux 0 0
Alexander Embiricos 0 0
xxchan 0 0
Salvatore Sanfilippo 0 0
Armin Ronacher 0 0
David Heinemeier Hansson (DHH) 0 0
Alex Albert 0 0
Logan Kilpatrick 0 0
Shawn "swyx" Wang 0 0
Jason Zhou 0 0
Riley Brown 0 0
McKay Wrigley 0 0
Boris Cherny 0 0
Ben Tossell 0 0
Geoffrey Huntley 0 0
Peter Steinberger 0 0
Addy Osmani 0 0
Simon Willison 0 0
Andrej Karpathy 0 0
Harrison Chase 0 0