4-day autonomous agents, Cursor MCP Apps, and the push from code review to evidence
Mar 4
4 min read
111 docs
Cursor claims a 4-day fully autonomous agent run produced a stronger-than-official solution to a math research challenge—suggesting coordination techniques may generalize beyond coding. Also: Cursor’s MCP Apps (interactive UIs in-chat), model/tool value debates (Codex vs others), and concrete execution patterns like Implementation Plans, 95%-confidence autopilot loops, and async checkpoints.

🔥 TOP SIGNAL

Cursor’s CEO says their agent harness ran fully autonomously for 4 days (no nudges/hints) and produced what they believe is a novel, stronger solution to Problem Six of the First Proof math research challenge—an early signal that “scaling agent coordination” may generalize beyond coding tasks. The claimed improvements include using the Marcus–Spielman–Srivastava interlacing polynomial method, improving a constant from c=0.03 → 0.13, and partitioning the entire vertex set into light components (vs a subset).

🛠️ TOOLS & MODELS

  • Cursor — MCP Apps support (new): Cursor now supports MCP Apps, so agents can render interactive UIs inside conversations.

  • OpenAI Codex — “most agentic coding per dollar” (practitioner claim): Romain Huet says Codex is currently the best option by far for agentic coding value.

  • Antigravity (agentic coding platform) — “Implementation Plan” + screenshot-to-Flutter UI

    • Recommended flow: request an “Implementation Plan” artifact first, review/edit the markdown architecture/logic, then approve execution—explicitly warning “don’t let AI write code blindly”.
    • “Screenshot → functional Flutter UI” demo: drop a screenshot and ask to rebuild as Flutter UI; described as powered by Gemini 3 Flash and launching on-device.
  • Claude Opus 4.5 / 4.6 (Copilot workflow) — quality jump (firsthand): Burke Holland describes Opus as a practical inflection point for building tools quickly, contrasting it with Sonnet 3.5 output he calls “spaghetti code” and “willy nilly” changes.

💡 WORKFLOWS & TRICKS

  • Steal this: “Implementation Plan → approve → execute” as your default safety rail (Antigravity)

    1. Ask the agent for an Implementation Plan artifact first.
    2. Review and edit the architecture + markdown logic yourself.
    3. Only then approve execution (the explicit goal: control the outcome vs blind codegen).
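The three steps above boil down to a human approval gate between planning and code generation. A minimal sketch, where `ask_agent_for_plan` and `execute_plan` are hypothetical stand-ins for real agent calls:

```python
# Sketch of the "Implementation Plan -> approve -> execute" safety rail.
# ask_agent_for_plan / execute_plan are hypothetical stand-ins for agent calls;
# the point is the mandatory human review gate between them.

def ask_agent_for_plan(task):
    # Stand-in: a real agent would return a markdown Implementation Plan.
    return f"# Implementation Plan\n\n- Outline architecture for: {task}\n"

def execute_plan(plan):
    # Stand-in: a real agent would generate code from the approved plan.
    return f"executed:\n{plan}"

def plan_approve_execute(task, approve):
    """Never let the agent write code until a human edits/approves the plan."""
    plan = ask_agent_for_plan(task)
    edited = approve(plan)       # human reviews; may edit, or reject (None)
    if edited is None:
        return None              # rejected: no code is ever generated
    return execute_plan(edited)

# Demo: auto-approve the plan unchanged.
result = plan_approve_execute("add OAuth login", approve=lambda p: p)
```

The `approve` callback is where the "review and edit the markdown" step lives; rejecting the plan short-circuits execution entirely.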
  • Plan mode isn’t about the plan—it’s about flushing missing constraints (Burke Holland)

    • Start in “plan mode” and do 4–6 loops where the agent proposes what you forgot to specify + multiple options, before you let it implement.
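The constraint-flushing loop can be sketched as repeated "what did I forget?" rounds before any implementation. `propose_gaps` is a hypothetical stand-in for the agent's proposal step:

```python
# Sketch of plan mode as a constraint-flushing loop: before implementing,
# repeatedly ask the agent what the spec forgot. propose_gaps is a stand-in
# for a real agent call that lists unstated constraints.

def propose_gaps(spec):
    # Stand-in: return up to two plausible constraints the spec omits.
    known = ["auth", "error handling", "pagination", "rate limits", "i18n"]
    return [g for g in known if g not in spec][:2]

def plan_mode(spec, loops=5):
    for _ in range(loops):           # 4-6 loops per the described workflow
        gaps = propose_gaps(spec)
        if not gaps:
            break                    # nothing left unspecified: ready to implement
        spec += " | " + " | ".join(gaps)   # human accepts/edits the proposals
    return spec

final = plan_mode("build a REST API with auth")
```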
  • Autopilot / loop-until-confidence (Burke Holland)

    • Run the agent in a loop that feeds its output back into itself, but change the stop condition from “until it’s done” to “until you have 95% confidence it’s done”.
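The change is purely in the loop's exit test: stop on a self-assessed confidence score, not on "done". A minimal sketch, with `improve` and `self_assess` as hypothetical stand-ins for agent calls:

```python
# Sketch of the "loop until 95% confident" autopilot. improve/self_assess
# stand in for real agent calls: each pass feeds the previous output back in,
# and the loop exits on confidence, not on the agent declaring itself done.

def improve(draft):
    return draft + " +pass"          # stand-in for one self-revision pass

def self_assess(draft):
    # Stand-in: a real agent would be asked "how confident are you this is
    # complete and correct?" Here confidence grows with revision passes.
    return min(1.0, 0.5 + 0.12 * draft.count("+pass"))

def autopilot(task, threshold=0.95, max_iters=20):
    draft = task
    confidence = self_assess(draft)
    iters = 0
    while confidence < threshold and iters < max_iters:
        draft = improve(draft)       # feed output back into the agent
        confidence = self_assess(draft)
        iters += 1
    return draft, confidence

draft, conf = autopilot("refactor auth module")
```

The `max_iters` cap matters in practice: a confidence threshold alone can loop forever if the self-assessment plateaus.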
  • Task classification + model routing + sub-agent fanout (multi-model orchestration) (Burke Holland)

    • Use a “front” agent to classify tasks as easy/medium/hard and change the workflow accordingly (hard tasks: plan + sub-agents + farm-out work).
    • In the described Copilot setup, different models can be used in one run (example routing: Gemini for design, other models for refactoring) and scaled up to many sub-agents—but the workflow must still output something verifiable.
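A routing skeleton for the pattern above might look like this. The classifier heuristic and model names (`model-a`, `model-b`) are illustrative assumptions, not the described Copilot setup:

```python
# Sketch of "front agent classifies, workflow changes accordingly": easy tasks
# go straight to one model; hard tasks get a plan, sub-agent fanout, and a
# final verification step so the run outputs something checkable.

def classify(task):
    # Stand-in for a front agent; a real one would ask a cheap model.
    words = len(task.split())
    if words <= 3:
        return "easy"
    return "hard" if "migrate" in task or words > 8 else "medium"

def route(task):
    difficulty = classify(task)
    if difficulty == "easy":
        return {"difficulty": "easy", "steps": [("model-a", task)]}
    if difficulty == "medium":
        return {"difficulty": "medium",
                "steps": [("model-b", "plan"), ("model-b", task)]}
    # hard: plan first, farm subtasks out to sub-agents, then verify
    subtasks = [f"{task} (part {i})" for i in range(1, 4)]
    steps = [("model-b", "plan")] + [("sub-agent", s) for s in subtasks]
    steps.append(("verifier", "produce evidence"))  # output must be verifiable
    return {"difficulty": "hard", "steps": steps}

plan = route("migrate billing service to the new schema")
```

The verifier step at the end reflects the constraint in the text: scaling to many sub-agents is only safe if the workflow still emits evidence a human (or checker) can inspect.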
  • Async agent + human checkpoints (Burke Holland)

    • Pattern: give the CLI a big job, walk away, and have it message you (example: Telegram) with progress + a “what next?” checkpoint so you can approve/deny and let it continue.
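The checkpoint loop can be sketched as follows. The Telegram Bot API `sendMessage` endpoint is real; the `send` and `wait_for_reply` callbacks are injectable here so the flow itself runs without network access, and the stage names are hypothetical:

```python
# Sketch of the async checkpoint pattern: a long-running job pings a human
# after each stage and blocks until an approve/deny verdict arrives.
import json
import urllib.request

def telegram_send(token, chat_id, text):
    # Real call shape for the Telegram Bot API sendMessage method.
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = json.dumps({"chat_id": chat_id, "text": text}).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def run_with_checkpoints(stages, send, wait_for_reply):
    """Run stages; after each, message progress + 'what next?' and await a verdict."""
    done = []
    for stage in stages:
        done.append(f"finished {stage}")
        send(f"Progress: {done[-1]}. What next: continue with remaining stages?")
        if wait_for_reply() != "approve":  # human replied 'deny' (or timed out)
            break
    return done

# Demo: capture messages locally instead of hitting the network.
sent = []
log = run_with_checkpoints(
    ["scaffold", "implement", "test"],
    send=sent.append,
    wait_for_reply=lambda: "approve",
)
```

In a real setup, `send` would be `telegram_send` with your bot token, and `wait_for_reply` would poll `getUpdates` (or a webhook) for the human's response.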
  • Reality check: “polish” is still synchronous (Kent C. Dodds)

    • Kent calls out that with cloud agents, polish requires real-time back-and-forth while you try outputs and iterate—hard to do asynchronously from phone/desktop today.

👤 PEOPLE TO WATCH

  • Michael Truell (Cursor) — concrete evidence of long-horizon autonomy: same harness previously used to “build a browser from scratch,” now used for a 4-day autonomous run on a math research problem.

  • Burke Holland (GitHub Copilot DevRel) — unusually replicable patterns for “agent experience”: plan-mode loops, 95% confidence autopilot loops, and multi-model orchestration with evidence requirements.

  • Simon Willison — frames the core bottleneck as security review at scale: treat coding agents like “teams of mixed ability engineers” shipping under deadline pressure; security issues are “directly harmful” vs survivable code quality issues.

  • swyx (+ Ankitxg) — continued push to remove review bottlenecks: calls “killing the Code Review” the “Final Boss of Agentic Engineering,” pointing to a layered playbook and “Dark Factory” anecdotes (no human code and no human review).

🎬 WATCH & LISTEN

1) Changelog — “Plan mode” loops that prevent bad prompts (≈20:55–22:04)

Hook: plan mode as a structured way to surface what you forgot to ask for, plus multiple implementation options before execution.

2) Changelog — Autopilot: loop until 95% confidence (≈22:16–23:03)

Hook: changing the stopping condition (“until it’s done” → “until 95% confident”) to force deeper self-checking iterations.

📊 PROJECTS & REPOS


Editorial take: The frontier is shifting from “write code” to run loops + produce evidence—and the hardest unsolved piece is how you scale review (especially security) without slowing agents back down.

Summary
Coverage start
Mar 3 at 7:00 AM
Coverage end
Mar 4 at 7:00 AM
Frequency
Daily
Published
Mar 4 at 8:10 AM
Reading time
4 min
Research time
1 hr 41 min
Documents scanned
111
Documents used
19
Citations
30
Sources monitored
110 / 110
Insights
Skipped contexts
Source details
Source Docs Insights Status
Lukas Möller 0 0
Jediah Katz 2 1
Aman Karmani 2 0
Jacob Jackson 0 0
Cursor Blog | RSS Feed 0 0
Nicholas Moy 0 0
Mike Krieger 0 0
Sualeh Asif 0 0
Michael Truell 6 1
Google Antigravity 7 1
Aman Sanger 0 0
cat 0 0
Mark Chen 0 0
Greg Brockman 0 0
Tongzhou Wang 0 0
fouad 0 0
Calvin French-Owen 0 0
Hanson Wang 0 0
Ed Bayes 0 0
Alexander Embiricos 0 0
Tibo 4 0
Romain Huet 2 1
DHH 6 0
Jane Street Blog 0 0
Miguel Grinberg's Blog: AI 0 0
xxchan's Blog 0 0
<antirez> 0 0
Brendan Long 0 0
The Pragmatic Engineer 0 0
David Heinemeier Hansson 0 0
Armin Ronacher ⇌ 9 0
Mitchell Hashimoto 0 0
Armin Ronacher's Thoughts and Writings 0 0
Peter Steinberger 0 0
Theo - t3.gg 2 0
Sourcegraph 0 0
Anthropic 0 0
Cursor 0 0
LangChain 0 0
Anthropic 0 0
LangChain Blog 0 0
LangChain 4 0
Cursor 3 1
Riley Brown 0 0
Riley Brown 4 0
Jason Zhou 0 0
Boris Cherny 0 0
Mckay Wrigley 0 0
geoff 7 0
Peter Steinberger 🦞 6 0
AI Jason 0 0
Alex Albert 0 0
Latent.Space 0 0
Logan Kilpatrick 2 0
Fireship 0 0
Fireship 0 0
Kent C. Dodds ⚡ 7 1
Practical AI 0 0
Practical AI Clips 0 0
Stories by Steve Yegge on Medium 0 0
Kent C. Dodds Blog 0 0
ThePrimeTime 1 0
Theo - t3․gg 0 0
ThePrimeagen 15 0
Ben Tossell 0 0
swyx 11 1
AI For Developers 0 0
Geoffrey Huntley 0 0
Addy Osmani 3 0
Andrej Karpathy 0 0
Simon Willison 5 1
Matthew Berman 0 0
Changelog 1 1
Simon Willison’s Newsletter 0 0
Agentic Coding Newsletter 0 0
Latent Space 0 0
Simon Willison's Weblog 2 0
Elevate 0 0
Lukas Möller 0 0
Jediah Katz 0 0
Sualeh Asif 0 0
Mike Krieger 0 0
Michael Truell 0 0
Cat Wu 0 0
Kevin Hou 0 0
Aman Sanger 0 0
Nicholas Moy 0 0
Andrey Mishchenko 0 0
Jerry Tworek 0 0
Romain Huet 0 0
Thibault Sottiaux 0 0
Alexander Embiricos 0 0
xxchan 0 0
Salvatore Sanfilippo 0 0
Armin Ronacher 0 0
David Heinemeier Hansson (DHH) 0 0
Alex Albert 0 0
Logan Kilpatrick 0 0
Shawn "swyx" Wang 0 0
Jason Zhou 0 0
Riley Brown 0 0
McKay Wrigley 0 0
Boris Cherny 0 0
Ben Tossell 0 0
Geoffrey Huntley 0 0
Peter Steinberger 0 0
Addy Osmani 0 0
Simon Willison 0 0
Andrej Karpathy 0 0
Harrison Chase 0 0