🔥 TOP SIGNAL
Cursor’s CEO says their agent harness ran fully autonomously for four days (no nudges or hints) and produced what they believe is a novel, stronger solution to Problem Six of the First Proof math research challenge, an early signal that “scaling agent coordination” may generalize beyond coding tasks. The claimed improvements: applying the Marcus–Spielman–Srivastava interlacing-polynomial method, improving a constant from c = 0.03 to c = 0.13, and partitioning the entire vertex set into light components (rather than only a subset).
🛠️ TOOLS & MODELS
Cursor — MCP Apps support (new): Cursor now supports MCP Apps, so agents can render interactive UIs inside conversations.
- Also added: private plugins you can create/share via team marketplaces.
- Changelog: https://cursor.com/changelog/2-6
OpenAI Codex — “most agentic coding per dollar” (practitioner claim): Romain Huet says Codex is currently the best option by far for agentic coding value.
Antigravity (agentic coding platform) — “Implementation Plan” + screenshot-to-Flutter UI
- Recommended flow: request an “Implementation Plan” artifact first, review/edit the markdown architecture and logic, then approve execution, with the explicit warning “don’t let AI write code blindly”.
- “Screenshot → functional Flutter UI” demo: drop in a screenshot and ask the agent to rebuild it as a Flutter UI; described as powered by Gemini 3 Flash and launching on-device.
Claude Opus 4.5 / 4.6 (Copilot workflow) — quality jump (firsthand): Burke Holland describes Opus as a practical inflection point for building tools quickly, contrasting it with Sonnet 3.5 output he calls “spaghetti code” and “willy nilly” changes.
💡 WORKFLOWS & TRICKS
Steal this: “Implementation Plan → approve → execute” as your default safety rail (Antigravity)
- Ask the agent for an Implementation Plan artifact first.
- Review and edit the architecture + markdown logic yourself.
- Only then approve execution (the explicit goal: control the outcome vs blind codegen).
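The three steps above can be sketched as a simple gate. This is an illustrative stand-in, not Antigravity's actual API: `request_plan` and `execute_plan` are hypothetical placeholders for the agent calls, and the human review step is modeled as a callback.

```python
# Sketch of the "plan → approve → execute" safety rail.
# request_plan / execute_plan are hypothetical stand-ins for agent calls.

def request_plan(task: str) -> str:
    """Stand-in: the agent would return a markdown Implementation Plan."""
    return f"# Implementation Plan\n\n- Step 1: analyze '{task}'\n- Step 2: implement\n"

def execute_plan(plan: str) -> str:
    """Stand-in: the agent would execute the approved plan."""
    return "done"

def run_gated(task: str, approve) -> str:
    plan = request_plan(task)   # 1. ask for the plan artifact first
    plan = approve(plan)        # 2. human reviews/edits the markdown
    if plan is None:            # rejected: nothing gets executed
        return "aborted"
    return execute_plan(plan)   # 3. only now let the agent write code
```

The point of the gate is that rejecting the plan (returning `None` from the review callback) means no code is ever generated, which is exactly the “don’t let AI write code blindly” control the workflow is after.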
Plan mode isn’t about the plan—it’s about flushing missing constraints (Burke Holland)
- Start in “plan mode” and do 4–6 loops where the agent proposes what you forgot to specify, plus multiple options, before you let it implement.
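In loop form, the pattern is: ask the agent what's missing, fold answers back into the spec, and only implement once no gaps remain. A minimal sketch, where `ask_agent` is a hypothetical keyword-based stand-in for the real agent's gap-finding:

```python
# Sketch of the plan-mode "flush missing constraints" loop.
# ask_agent is a toy stand-in for the agent proposing forgotten constraints.

def ask_agent(spec: str) -> list[str]:
    """Stand-in: returns questions the agent thinks you forgot to answer."""
    gaps = []
    if "errors" not in spec:
        gaps.append("What should happen on errors?")
    if "storage" not in spec:
        gaps.append("Where does state live (storage)?")
    return gaps

def flush_constraints(spec: str, max_loops: int = 6) -> str:
    for _ in range(max_loops):
        gaps = ask_agent(spec)
        if not gaps:          # nothing missing: now it's safe to implement
            break
        # In practice you answer each question interactively; here we
        # just fold placeholder answers back into the spec.
        spec += " | " + "; ".join(gaps)
    return spec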
Autopilot / loop-until-confidence (Burke Holland)
- Run the agent in a loop that feeds its output back into itself, but change the stop condition from “until it’s done” to “until you have 95% confidence it’s done”.
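The change is only in the stopping condition: stop on a self-assessed confidence threshold rather than on the agent declaring itself finished. A minimal sketch, with `improve` and `self_assess` as hypothetical stand-ins for the agent calls:

```python
# Sketch of the "loop until 95% confidence" autopilot.
# improve / self_assess are toy stand-ins for agent calls.

def improve(work: str) -> str:
    """Stand-in: feed the agent's output back to itself for another pass."""
    return work + "+"

def self_assess(work: str) -> float:
    """Stand-in: agent rates its own confidence that the task is done."""
    return min(0.99, 0.80 + 0.05 * work.count("+"))

def autopilot(task: str, threshold: float = 0.95, max_iters: int = 20):
    work = task
    conf = self_assess(work)
    for _ in range(max_iters):
        if conf >= threshold:   # stop on confidence, not on "I'm done"
            break
        work = improve(work)
        conf = self_assess(work)
    return work, conf
```

The `max_iters` bound matters: a confidence threshold with no iteration cap can loop indefinitely if the agent never converges.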
Task classification + model routing + sub-agent fanout (multi-model orchestration) (Burke Holland)
- Use a “front” agent to classify tasks as easy/medium/hard and change the workflow accordingly (hard tasks: plan + sub-agents + farm-out work).
- In the described Copilot setup, different models can be used in one run (example routing: Gemini for design, other models for refactoring) and scaled up to many sub-agents—but the workflow must still output something verifiable.
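The classify-then-route shape of this orchestration can be sketched as a lookup table. Everything here is illustrative: the model names, the word-count classifier, and the route table are hypothetical, not the real Copilot setup.

```python
# Sketch of task classification + model routing + sub-agent fanout.
# Classifier, model names, and routes are all illustrative stand-ins.

def classify(task: str) -> str:
    """Stand-in front agent: bucket the task by difficulty."""
    words = len(task.split())
    if words <= 3:
        return "easy"
    if words <= 8:
        return "medium"
    return "hard"

ROUTES = {
    "easy":   {"models": ["small-model"], "plan": False, "subagents": 1},
    "medium": {"models": ["mid-model"],   "plan": True,  "subagents": 1},
    # Hard tasks: plan first, then fan work out to several sub-agents,
    # possibly on different models (e.g. one for design, one for refactoring).
    "hard":   {"models": ["design-model", "refactor-model"],
               "plan": True, "subagents": 4},
}

def route(task: str) -> dict:
    plan = ROUTES[classify(task)]
    # Whatever the route, the run must still end with something verifiable.
    return {**plan, "verify": "tests + diff review"}
```

Note the invariant from the source: the route can vary, but every workflow output must carry its own verification step.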
Async agent + human checkpoints (Burke Holland)
- Pattern: give the CLI a big job, walk away, and have it message you (example: Telegram) with progress + a “what next?” checkpoint so you can approve/deny and let it continue.
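The checkpoint pattern reduces to: do a step, push a progress message, and block on a human reply before continuing. A minimal sketch; `notify` and the reply callback stand in for a real messaging integration (Telegram in the talk's example), which this is not:

```python
# Sketch of the async-agent-with-human-checkpoints loop.
# notify / await_reply stand in for a real messaging integration.

def notify(message: str) -> None:
    """Stand-in: push a progress message to your phone."""
    print(message)

def run_with_checkpoints(steps: list[str], await_reply) -> list[str]:
    completed = []
    for step in steps:
        completed.append(step)                     # do the work
        notify(f"finished: {step}. continue?")     # progress + "what next?"
        if await_reply() != "approve":             # human approve/deny gate
            break
    return completed

# Example: approve the first checkpoint, then stop the run.
replies = iter(["approve", "stop"])
done = run_with_checkpoints(["scaffold", "implement", "polish"],
                            lambda: next(replies))
```

Denying at a checkpoint halts the run but keeps the completed work, which is what makes the pattern safe to leave unattended.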
Reality check: “polish” is still synchronous (Kent C. Dodds)
- Kent calls out that with cloud agents, polish requires real-time back-and-forth while you try outputs and iterate—hard to do asynchronously from phone/desktop today.
👤 PEOPLE TO WATCH
Michael Truell (Cursor) — concrete evidence of long-horizon autonomy: the same harness previously used to “build a browser from scratch” was now used for a 4-day autonomous run on a math research problem.
Burke Holland (GitHub Copilot DevRel) — unusually replicable patterns for “agent experience”: plan-mode loops, 95%-confidence autopilot loops, and multi-model orchestration with evidence requirements.
Simon Willison — frames the core bottleneck as security review at scale: treat coding agents like “teams of mixed ability engineers” shipping under deadline pressure; security issues are “directly harmful” vs survivable code quality issues.
swyx (+ Ankitxg) — continued push to remove review bottlenecks: calls “killing the Code Review” the “Final Boss of Agentic Engineering,” pointing to a layered playbook and “Dark Factory” anecdotes (no human code and no human review).
🎬 WATCH & LISTEN
1) Changelog — “Plan mode” loops that prevent bad prompts (≈20:55–22:04)
Hook: plan mode as a structured way to surface what you forgot to ask for, plus multiple implementation options before execution.
2) Changelog — Autopilot: loop until 95% confidence (≈22:16–23:03)
Hook: changing the stopping condition (“until it’s done” → “until 95% confident”) to force deeper self-checking iterations.
📊 PROJECTS & REPOS
- Cursor: “Scaling agents” harness write-up: http://cursor.com/blog/scaling-agents
- First Proof challenge site: https://1stproof.org/
- Cursor’s full Problem Six solution (doc link): https://drive.google.com/file/d/1wqNqUoRmuaBaP2Y0mxI_OfAkb1cTar5m/view?usp=sharing
- Summit Scout (built in Antigravity, shared demo): https://summitscout-five.vercel.app/
- “Reviews dead” post (as linked): https://latent.space/p/reviews-dead
- Cursor v2.6 changelog: https://cursor.com/changelog/2-6
Editorial take: The frontier is shifting from “write code” to “run loops + produce evidence,” and the hardest unsolved piece is how you scale review (especially security review) without slowing agents back down.