4-day autonomous agents, Cursor MCP Apps, and the push from code review to evidence
Mar 4
Cursor claims a 4-day fully autonomous agent run produced a stronger-than-official solution to a math research challenge—suggesting coordination techniques may generalize beyond coding. Also: Cursor’s MCP Apps (interactive UIs in-chat), model/tool value debates (Codex vs others), and concrete execution patterns like Implementation Plans, 95%-confidence autopilot loops, and async checkpoints.

🔥 TOP SIGNAL

Cursor’s CEO says their agent harness ran fully autonomously for 4 days (no nudges/hints) and produced what they believe is a novel, stronger solution to Problem Six of the First Proof math research challenge—an early signal that “scaling agent coordination” may generalize beyond coding tasks. The claimed improvements include using the Marcus–Spielman–Srivastava interlacing polynomial method, improving a constant from c=0.03 → 0.13, and partitioning the entire vertex set into light components (vs a subset).

🛠️ TOOLS & MODELS

  • Cursor — MCP Apps support (new): Cursor now supports MCP Apps, so agents can render interactive UIs inside conversations.

  • OpenAI Codex — “most agentic coding per dollar” (practitioner claim): Romain Huet says Codex is currently the best option by far for agentic coding value.

  • Antigravity (agentic coding platform) — “Implementation Plan” + screenshot-to-Flutter UI

    • Recommended flow: request an “Implementation Plan” artifact first, review/edit the markdown architecture/logic, then approve execution—explicitly warning “don’t let AI write code blindly”.
    • “Screenshot → functional Flutter UI” demo: drop a screenshot and ask to rebuild as Flutter UI; described as powered by Gemini 3 Flash and launching on-device.
  • Claude Opus 4.5 / 4.6 (Copilot workflow) — quality jump (firsthand): Burke Holland describes Opus as a practical inflection point for building tools quickly, contrasting it with Sonnet 3.5 output he calls “spaghetti code” and “willy nilly” changes.

💡 WORKFLOWS & TRICKS

  • Steal this: “Implementation Plan → approve → execute” as your default safety rail (Antigravity)

    1. Ask the agent for an Implementation Plan artifact first.
    2. Review and edit the architecture + markdown logic yourself.
    3. Only then approve execution (the explicit goal: control the outcome vs blind codegen).
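The three steps above reduce to a simple approval gate. Here is a minimal sketch; `request_plan`, `execute`, and the agent object are hypothetical stand-ins, not Antigravity's actual API:

```python
# Sketch of the plan -> review -> approve gate. All method names are
# hypothetical stand-ins for whatever agent API you use.

def plan_then_execute(agent, task, review_fn):
    """Only execute after a human-reviewed plan is approved."""
    plan = agent.request_plan(task)          # 1. ask for an Implementation Plan artifact
    approved, edited_plan = review_fn(plan)  # 2. human reviews/edits the markdown plan
    if not approved:
        return None                          # never let the agent code blindly
    return agent.execute(edited_plan)        # 3. approve execution only after review
```

The point of the structure is that the human edit happens on the plan artifact, before any code exists.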
  • Plan mode isn’t about the plan—it’s about flushing missing constraints (Burke Holland)

    • Start in “plan mode” and do 4–6 loops where the agent proposes what you forgot to specify + multiple options, before you let it implement.
  • Autopilot / loop-until-confidence (Burke Holland)

    • Run the agent in a loop that feeds its output back into itself, but change the stop condition from “until it’s done” to “until you have 95% confidence it’s done”.
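The key change is in the loop's stop condition. A minimal sketch of the pattern, where `run_step` and `estimate_confidence` are hypothetical callables (e.g. an agent call and a verifier prompt returning 0.0–1.0):

```python
# Sketch of the "loop until 95% confidence" autopilot pattern.
# run_step feeds the previous output back to the agent;
# estimate_confidence asks the agent (or a verifier) how sure it is
# the task is actually done, as a 0.0-1.0 score.

def autopilot(run_step, estimate_confidence, threshold=0.95, max_iters=20):
    output = None
    for _ in range(max_iters):
        output = run_step(output)                     # feed the agent its own output
        if estimate_confidence(output) >= threshold:  # stop on confidence, not "done"
            return output
    return output                                     # safety cap against infinite loops
```

Swapping "done?" for "how confident are you it's done?" forces extra self-checking iterations the agent would otherwise skip.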
  • Task classification + model routing + sub-agent fanout (multi-model orchestration) (Burke Holland)

    • Use a “front” agent to classify tasks as easy/medium/hard and change the workflow accordingly (hard tasks: plan + sub-agents + farm-out work).
    • In the described Copilot setup, different models can be used in one run (example routing: Gemini for design, other models for refactoring) and scaled up to many sub-agents—but the workflow must still output something verifiable.
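The routing described above can be sketched as a small dispatcher. Everything here is a hypothetical stand-in (`classify`, the worker names, `fan_out`); the one non-negotiable piece from the source is that the output must be verifiable:

```python
# Sketch of front-agent task routing: classify the task, then route to a
# single model or fan out to sub-agents. All callables are hypothetical.

def route_task(task, classify, workers, fan_out, verify):
    difficulty = classify(task)              # "easy" | "medium" | "hard"
    if difficulty == "hard":
        plan = workers["planner"](task)      # hard tasks: plan first
        result = fan_out(plan)               # then farm work to sub-agents
    else:
        result = workers[difficulty](task)   # easy/medium: one model handles it
    if not verify(result):                   # workflow must output something verifiable
        raise ValueError("unverifiable result; do not ship")
    return result
```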
  • Async agent + human checkpoints (Burke Holland)

    • Pattern: give the CLI a big job, walk away, and have it message you (example: Telegram) with progress + a “what next?” checkpoint so you can approve/deny and let it continue.
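The checkpoint loop might look like the following sketch. `notify` is any messaging hook (e.g. a Telegram bot's send-message call) and `await_decision` blocks until the human replies approve/deny; both are hypothetical callables here, as is splitting the job into `stages`:

```python
# Sketch of the async-checkpoint pattern: do a chunk of work, message the
# human with progress and a "what next?" prompt, continue only on approval.

def run_with_checkpoints(stages, notify, await_decision):
    """Run a long job stage by stage, pausing for human approval between stages."""
    results = []
    for stage in stages:
        result = stage()                        # do one chunk of the big job
        results.append(result)
        notify(f"Done: {result}. What next?")   # progress + checkpoint message
        if not await_decision():                # human approves or denies
            notify("Stopped by human.")
            break
    return results
```

The design choice: the agent keeps long-horizon autonomy, but every stage boundary is a cheap point for the human to redirect or halt.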
  • Reality check: “polish” is still synchronous (Kent C. Dodds)

    • Kent calls out that with cloud agents, polish requires real-time back-and-forth while you try outputs and iterate—hard to do asynchronously from phone/desktop today.

👤 PEOPLE TO WATCH

  • Michael Truell (Cursor) — concrete evidence of long-horizon autonomy: same harness previously used to “build a browser from scratch,” now used for a 4-day autonomous run on a math research problem.

  • Burke Holland (GitHub Copilot DevRel) — unusually replicable patterns for “agent experience”: plan-mode loops, 95% confidence autopilot loops, and multi-model orchestration with evidence requirements.

  • Simon Willison — frames the core bottleneck as security review at scale: treat coding agents like “teams of mixed ability engineers” shipping under deadline pressure; security issues are “directly harmful” vs survivable code quality issues.

  • swyx (+ Ankitxg) — continued push to remove review bottlenecks: calls “killing the Code Review” the “Final Boss of Agentic Engineering,” pointing to a layered playbook and “Dark Factory” anecdotes (no human code and no human review).

🎬 WATCH & LISTEN

1) Changelog — “Plan mode” loops that prevent bad prompts (≈20:55–22:04)

Hook: plan mode as a structured way to surface what you forgot to ask for, plus multiple implementation options before execution.

2) Changelog — Autopilot: loop until 95% confidence (≈22:16–23:03)

Hook: changing the stopping condition (“until it’s done” → “until 95% confident”) to force deeper self-checking iterations.

Editorial take: The frontier is shifting from “write code” to run loops + produce evidence—and the hardest unsolved piece is how you scale review (especially security) without slowing agents back down.