ZeroNoise
4-day autonomous agents, Cursor MCP Apps, and the push from code review to evidence
Mar 4
4 min read
111 docs
Cursor claims a 4-day fully autonomous agent run produced a stronger-than-official solution to a math research challenge—suggesting coordination techniques may generalize beyond coding. Also: Cursor’s MCP Apps (interactive UIs in-chat), model/tool value debates (Codex vs others), and concrete execution patterns like Implementation Plans, 95%-confidence autopilot loops, and async checkpoints.

🔥 TOP SIGNAL

Cursor’s CEO says their agent harness ran fully autonomously for 4 days (no nudges/hints) and produced what they believe is a novel, stronger solution to Problem Six of the First Proof math research challenge—an early signal that “scaling agent coordination” may generalize beyond coding tasks. The claimed improvements include using the Marcus–Spielman–Srivastava interlacing polynomial method, improving a constant from c = 0.03 to c = 0.13, and partitioning the entire vertex set into light components (vs a subset).

🛠️ TOOLS & MODELS

  • Cursor — MCP Apps support (new): Cursor now supports MCP Apps, so agents can render interactive UIs inside conversations.

  • OpenAI Codex — “most agentic coding per dollar” (practitioner claim): Romain Huet says Codex is currently the best option by far for agentic coding value.

  • Antigravity (agentic coding platform) — “Implementation Plan” + screenshot-to-Flutter UI

    • Recommended flow: request an “Implementation Plan” artifact first, review/edit the markdown architecture/logic, then approve execution—explicitly warning “don’t let AI write code blindly.”
    • “Screenshot → functional Flutter UI” demo: drop a screenshot and ask to rebuild it as a Flutter UI; described as powered by Gemini 3 Flash and launching on-device.
  • Claude Opus 4.5 / 4.6 (Copilot workflow) — quality jump (firsthand): Burke Holland describes Opus as a practical inflection point for building tools quickly, contrasting it with Sonnet 3.5 output he calls “spaghetti code” and “willy nilly” changes.

💡 WORKFLOWS & TRICKS

  • Steal this: “Implementation Plan → approve → execute” as your default safety rail (Antigravity)

    1. Ask the agent for an Implementation Plan artifact first.
    2. Review and edit the architecture + markdown logic yourself.
    3. Only then approve execution (the explicit goal: control the outcome vs blind codegen).
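The three steps above can be sketched as a hard gate in code. This is a minimal illustration, not Antigravity's actual API: `request_plan`, `Plan`, and `review` are hypothetical names standing in for whatever call produces the plan artifact and whatever channel the human uses to edit it.

```python
# Sketch of the "plan -> approve -> execute" gate. All names here are
# illustrative stand-ins, not a real agent platform's interface.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Plan:
    markdown: str  # architecture + logic, editable by the human reviewer

def request_plan(task: str) -> Plan:
    # Placeholder: in a real setup the agent drafts this artifact.
    return Plan(markdown=f"# Implementation Plan\n\nTask: {task}\n")

def run(task: str, review: Callable[[str], Optional[str]]) -> str:
    plan = request_plan(task)
    edited = review(plan.markdown)  # human edits the markdown; None rejects it
    if edited is None:
        return "execution blocked: plan rejected"
    plan.markdown = edited
    # Only after explicit approval does any code generation start.
    return "executing against approved plan"
```

The point of the shape is that codegen is unreachable without a reviewed, possibly edited plan passing the gate.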
  • Plan mode isn’t about the plan—it’s about flushing missing constraints (Burke Holland)

    • Start in “plan mode” and do 4–6 loops where the agent proposes what you forgot to specify, plus multiple options, before you let it implement.
  • Autopilot / loop-until-confidence (Burke Holland)

    • Run the agent in a loop that feeds its output back into itself, but change the stop condition from “until it’s done” to “until you have 95% confidence it’s done.”
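The changed stop condition can be sketched in a few lines. This is a toy skeleton under assumptions: `step` is a hypothetical stand-in for one agent iteration that returns a new output plus a self-reported confidence in [0, 1].

```python
# Sketch of the "loop until 95% confidence" autopilot. `step` is a
# hypothetical callable: previous output in, (new output, confidence) out.

def autopilot(step, seed="", threshold=0.95, max_iters=20):
    output, confidence = seed, 0.0
    for i in range(1, max_iters + 1):
        output, confidence = step(output)  # feed the output back into itself
        if confidence >= threshold:        # stop on confidence, not on "done"
            return output, confidence, i
    return output, confidence, max_iters   # safety cap so the loop terminates
```

A toy `step` whose confidence climbs by 0.25 per pass stops after four iterations; the interesting design choice is the `max_iters` cap, which keeps a never-confident agent from looping forever.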
  • Task classification + model routing + sub-agent fanout (multi-model orchestration) (Burke Holland)

    • Use a “front” agent to classify tasks as easy/medium/hard and change the workflow accordingly (hard tasks: plan + sub-agents + farm-out work).
    • In the described Copilot setup, different models can be used in one run (example routing: Gemini for design, other models for refactoring) and scaled up to many sub-agents—but the workflow must still output something verifiable.
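The classify-then-route shape can be sketched as a small dispatch table. Everything here is illustrative, not Copilot's actual routing: the word-count classifier is a toy stand-in for a cheap front model's judgment, and the model names are made up.

```python
# Sketch of "front agent classifies, workflow routes" orchestration.
# The classifier heuristic and model names are hypothetical placeholders.

def classify(task: str) -> str:
    # Toy stand-in for the front agent's easy/medium/hard judgment.
    n = len(task.split())
    return "easy" if n < 5 else "medium" if n < 15 else "hard"

ROUTES = {
    "easy":   {"models": ["small-model"],                "plan_first": False},
    "medium": {"models": ["mid-model"],                  "plan_first": True},
    "hard":   {"models": ["design-model", "code-model"], "plan_first": True},
}

def route(task: str) -> dict:
    tier = classify(task)
    # Every tier, even "easy", must still produce something verifiable.
    return {"tier": tier, **ROUTES[tier], "evidence_required": True}
```

Note that `evidence_required` is unconditional: the fan-out varies by tier, but the requirement to output something verifiable does not.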
  • Async agent + human checkpoints (Burke Holland)

    • Pattern: give the CLI a big job, walk away, and have it message you (example: Telegram) with progress + a “what next?” checkpoint so you can approve/deny and let it continue.
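The checkpoint pattern reduces to a loop with an approval gate after each stage. A minimal sketch, with the messaging channel injected as plain callables (for Telegram that would be the Bot API's `sendMessage` endpoint, but nothing here is tied to it); the function names are hypothetical, not a specific CLI's interface.

```python
# Sketch of "big job + human checkpoints". `send` pushes a progress message
# to the human; `ask` blocks until the human answers the "what next?" prompt.
# Both are injected so any channel (Telegram, email, etc.) fits.

def run_with_checkpoints(stages, send, ask):
    """Run named stages in order; after each, report progress and await approval."""
    done = []
    for name, work in stages:
        done.append(work())
        send(f"finished {name} ({len(done)}/{len(stages)}); continue?")
        if ask() != "continue":      # the human can approve or deny each gate
            send("stopping on request")
            return done
    send("all stages complete")
    return done
```

Because `ask` blocks between stages, the agent can run unattended for hours while the human approves or denies from a phone.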
  • Reality check: “polish” is still synchronous (Kent C. Dodds)

    • Kent calls out that with cloud agents, polish requires real-time back-and-forth while you try outputs and iterate—hard to do asynchronously from phone/desktop today.

👤 PEOPLE TO WATCH

  • Michael Truell (Cursor) — concrete evidence of long-horizon autonomy: the same harness previously used to “build a browser from scratch,” now used for a 4-day autonomous run on a math research problem.

  • Burke Holland (GitHub Copilot DevRel) — unusually replicable patterns for “agent experience”: plan-mode loops, 95%-confidence autopilot loops, and multi-model orchestration with evidence requirements.

  • Simon Willison — frames the core bottleneck as security review at scale: treat coding agents like “teams of mixed ability engineers” shipping under deadline pressure; security issues are “directly harmful” vs survivable code-quality issues.

  • swyx (+ Ankitxg) — continued push to remove review bottlenecks: calls “killing the Code Review” the “Final Boss of Agentic Engineering,” pointing to a layered playbook and “Dark Factory” anecdotes (no human code and no human review).

🎬 WATCH & LISTEN

1) Changelog — “Plan mode” loops that prevent bad prompts (≈20:55–22:04)

Hook: plan mode as a structured way to surface what you forgot to ask for, plus multiple implementation options before execution.

2) Changelog — Autopilot: loop until 95% confidence (≈22:16–23:03)

Hook: changing the stopping condition (“until it’s done” → “until 95% confident”) to force deeper self-checking iterations.


Editorial take: The frontier is shifting from “write code” to “run loops + produce evidence”—and the hardest unsolved piece is how you scale review (especially security) without slowing agents back down.

swyx
x 2 docs

@ankitxg shares a 5-step layered playbook for eliminating code reviews in agentic workflows.

Full post: https://latent.space/p/reviews-dead.

Example: StrongDM’s “Dark Factory” generates and deploys code with no human code or review.

@swyx calls killing code review the “Final Boss of Agentic Engineering,” the last human bottleneck as agents approach full productivity. Multiple practitioners are mapping a fully agent-driven SDLC; swyx isn’t there yet but expects it within 3–6 months.

Secondhand reporting on emerging practitioner strategies.

Changelog
youtube 1 doc

Burke Holland (GitHub Copilot DevRel, tests models firsthand) reports Claude Opus 4.5 as an inflection point: it one-shot a native Windows resize tool using WinUI, a screen-to-GIF capture tool, and a roughly Snagit-like video editor in hours, producing structured code versus prior Sonnet 3.5 spaghetti. Work that previously took a year-plus with AI now takes an afternoon: an iOS app for his wife's business (Gemini image captions with few-shot prior accepts), replacing a paid SaaS routing app—the personal-software era.

Workflow (greenfield MVP in Copilot CLI/VS Code):

  • Plan mode (Opus 4.6): 4–6 iteration loops to surface omissions/alternatives before prompting.
  • Autopilot (ReAct/RLPH loop): self-loops until 95% confidence.
  • Custom agent Anvil: classifies tasks as easy/medium/hard; routes accordingly (e.g., Gemini for design, Sonnet 5.3 for refactoring); orchestrates sub-agents across models; requires verifiable evidence (browser skills/unit tests alone inadequate).

Async/team-lead vision: the CLI runs large jobs, with Telegram checkpoints keeping a human in the loop.

Patterns: agentic work means tools/terminal/internet access; concepts matter more than syntax (e.g., UNIX sockets in Go); learning is accelerated; production remains hard (architecture/security/deploy); not all software is equal (vibe/prototype vs shippable).

Comparisons: Copilot is the best value (request-based pricing, roughly a billion tokens per $200); Sonnet is eager but sloppier. Contrarian take: AI expands concepts and knowledge rather than dumbing developers down; juniors still need abstractions; high-performing devs now direct algorithms.

Simon Willison
x 5 docs

Simon Willison (@simonw, Django co-creator, Datasette creator) views security review as the critical lens for agent-generated code:

  • Treat coding agents like "teams of mixed ability engineers working under aggressive deadlines" shipping varying-quality code—a common large-company challenge.
  • Security problems are "directly harmful" to organizations, unlike survivable issues like poor performance or technical debt.
  • Seeks security teams' strategies for securing systems amid constant shipping by inexperienced engineers, against the trend of eliminating human code review—the "Final Boss of Agentic Engineering."
  • Requests essays/books/talks on robust security review at scale (e.g., DEF CON, Black Hat, CCC).

Timeless pattern: Analogize agents to junior/mixed-skill teams, prioritize scaled security review over traditional code review.

Google Antigravity
x 7 docs

Antigravity agentic coding platform:

Firsthand user example: @vamsibatchuk built a full ‘Summit Scout’ app for 63 national parks—scoring remoteness, scenery, elevation, and biodiversity; including logistics, directions, camping, ratings, and community intel—in under 2 hours. Live demo: https://summitscout-five.vercel.app/.

Implementation Plan workflow (official demo): request an “Implementation Plan” artifact first; review/edit the markdown architecture/logic; approve before execution to avoid blind codegen and control outcomes.

Screenshot-to-UI workflow: drop a screenshot and prompt a rebuild as a Flutter UI, using Gemini 3 Flash for a high-fidelity reproduction that launches on-device.

Michael Truell
x 6 docs

Cursor agents autonomously solved First Proof Problem Six, a math research challenge approximating Stanford/MIT/Berkeley academic work, yielding stronger results than the human-written solution via the Marcus–Spielman–Srivastava method (constant c = 0.13 vs 0.03, full vertex-set partitioning).

  • Workflow: same harness as the browser-from-scratch build; ran fully autonomously for 4 days without nudges/hints.
  • Validation: likely correct per spectral graph expert Yang Liu and Stanford mathematician Jan Vondrák.
  • Insight: scaling agent coordination generalizes beyond coding.

Firsthand from Michael Truell (@mntruell), Cursor CEO, production agent use.

Kent C. Dodds ⚡
x 1 doc

@kentcdodds, a dev educator, highlights a key limitation of cloud agents: polish requires synchronous back-and-forth iteration from phone/desktop, as the user tries the agent's output and iterates. He anticipates models improving their intuition for this.

Firsthand workflow insight: Emphasizes need for real-time human-in-the-loop during polish phase.

Cursor
x 3 docs

Cursor now supports MCP Apps, enabling agents to render interactive UIs in conversations.

New features include:

  • Create and share private plugins with team marketplaces

Full changelog: https://cursor.com/changelog/2-6

Official announcement from @cursor_ai (Cursor team).

Romain Huet
x 2 docs

OpenAI Codex provides the most agentic coding per dollar, currently the best option by far.

Endorsed by Romain Huet (Head of Developer Experience @OpenAI, working on Codex).

Quoted post: https://x.com/tristanbob/status/2028888924131885144.

Jediah Katz
x 2 docs

Cursor now supports MCP Apps, enabling agents to render interactive UIs in conversations, as shown in a demo video.

Shared by @jediahkatz (building the Cursor AI agent).