ZeroNoise
Coding agents hit a post-December step-change; Codex 5.3 momentum vs Opus 4.6; remote-control + orchestration patterns
Feb 26
5 min read
141 docs
Agents keep moving from “toy” to “teammate”: Karpathy reports a sharp post-December step-change and shares a hands-off, 30-minute end-to-end build example. Also: Codex 5.3 displacing Opus 4.6 for some power users, Claude Code Remote Control’s early reliability issues, and concrete workflow patterns for orchestration, review, and repo hygiene.

🔥 TOP SIGNAL

Coding agents crossed a “works in practice” threshold since December, driven (per Andrej Karpathy) by improved model quality, long-term coherence, and tenacity—enough to be disruptive to the default programming workflow. His concrete example: he handed an agent a single English brief to set up vLLM + Qwen3-VL, build a video inference endpoint + web UI, debug issues, install systemd services, and return a markdown report—hands-off in ~30 minutes.

🛠️ TOOLS & MODELS

  • GPT-5.3-Codex / Codex 5.3 vs Opus 4.6 (practitioner preference)

    • Mitchell Hashimoto says Codex 5.3 is “much more effective” than Opus 4.6, and that after going back and forth he hasn’t touched Opus for a week—“first model to get me off of Opus… ever”.
    • OpenAI’s Romain Huet says the team is “continuing to iterate and improve Codex every week”.
    • Tool reliability signal: Brian Lovin hit Claude Code 500s, tried Codex, and reported “Codex is good!”
  • Reasoning settings (Codex)

    • Sherwin Wu: they “basically only run [GPT-5.3-Codex] on xhigh nowadays for all coding tasks,” and notes speed improvements make it not feel slow even at xhigh.
    • Greg Brockman’s advice: “always run with xhigh reasoning.”
  • Claude Code — Remote Control (new capability, rough edges in testing)

    • Feature: run claude remote-control locally, then send prompts to that session from web/iOS/desktop; one session per machine and requires per-action approval.
    • Simon Willison reports it’s “a little bit janky,” including repeated API 500 errors and confusing failure behavior after restarting the program.
  • Devin 2.2 (Cognition)

    • Cognition markets Devin 2.2 as an autonomous agent that can test with computer use, self-verify, and auto-fix; also claims 3× faster startup, redesigned UI, and “computer use + virtual desktop”.
  • OpenClaw — new beta

    • Peter Steinberger: beta includes security improvements, various fixes, DM “heartbeat” made configurable after feedback, better Slack threads, improved subagents, and a more reliable Telegram webhook.
    • Releases: https://github.com/openclaw/openclaw/releases.
  • Sourcegraph 7.0 (positioning shift)

💡 WORKFLOWS & TRICKS

  • “English → parallel agents → you review” (Karpathy’s decomposition rule)

    • Karpathy’s pattern: agents aren’t perfect—they need high-level direction, judgment, taste, oversight, iteration, hints, and they work best when tasks are well-specified and verifiable/testable.
    • His operational heuristic: build intuition for task decomposition—hand off the parts that work well to agents, then “help out around the edges”.
    • Scaling idea: build long-running orchestrators (“Claws”) with tools/memory/instructions managing multiple parallel “Code” instances.
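
The orchestrator idea above can be sketched in a few lines. This is a conceptual illustration of the shape only: `run_agent` is a hypothetical stand-in for a real coding-agent invocation (CLI or API), not any actual agent interface.

```python
# Conceptual sketch of the "one orchestrator, many parallel Code instances"
# pattern: farm out well-specified, verifiable tasks and collect reports
# for human review. run_agent is a hypothetical stand-in, not a real API.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str) -> str:
    """Stand-in worker: a real version would launch a coding agent on one
    well-specified task and return its markdown report."""
    return f"report: {task!r} done"

def orchestrate(tasks: list[str], max_workers: int = 4) -> list[str]:
    """The long-running 'Claw': dispatch tasks to parallel workers,
    then hand the collected reports back for review."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_agent, tasks))
```

The human stays in the loop at the edges: the orchestrator handles dispatch and collection, while direction, judgment, and review remain manual.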
  • Cursor cloud agent: “clone it from a video” as a starting point, then iterate for fidelity

    • @swyx dropped a tweet + video into Cursor cloud expecting it not to work; he says Cursor Agent oneshotted a functional clone of Rachel Chen’s site from the video alone over 43 minutes (including a working “RachelLLM” sidebar).
    • His follow-up prompt for fidelity is a reusable template:
      • step through the video,
      • discover assets (headless run / curl / network snooping),
      • build a checklist + sitemap,
      • spin up subagents/swarm for parallel work,
      • don’t stop until behavior/visuals match closely; trade off fidelity vs simplicity when ambiguous.
    • He reports a second improved output after another 43 minutes.
  • Run many agents in parallel (Cursor) + let the agent do exploratory UX testing

    • Kent C. Dodds: he can run “as many of these [Cursor agents]” as he wants; instead of filing issues for ideas, he fires off prompts and gets back what the agent built (with screenshots).
    • He also reports the agent noticed a UX edge case on its own during a manual testing walkthrough.
  • Long-running agent refactors overnight (Cursor) + “computer use” for steering

    • Kent kicked off a long-running Cursor agent overnight and iterated in the morning using “computer use”.
    • He reports it dropped ~15k lines in a refactor.
  • Code review aid: ask for a linear walkthrough of the codebase (Simon Willison)

    • Willison’s prompt pattern: ask agents for “a linear walkthrough of the code that explains how it all works in detail” to understand vibe-coded output.
  • Git hygiene for agentic work: small commits, then squash (Huntley)

    • Geoffrey Huntley suggests an agent-friendly workflow: make incremental small commits, then squash them into a single commit so that “study the git log” for a unit of work becomes a single tool call.
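
A minimal sketch of that workflow in plain git (the exact commands are my illustration; Huntley's post doesn't prescribe them):

```shell
# Toy repo: simulate an agent making incremental commits, then squash them
# so the whole unit of work reads as one entry in `git log`.
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
# Wrapper so commits work without global git identity configured.
git() { command git -c user.email=agent@example.com -c user.name=agent "$@"; }
git commit -q --allow-empty -m "baseline"
for step in 1 2 3; do
  echo "change $step" > "file$step.txt"
  git add -A
  git commit -q -m "wip: step $step"       # small incremental commit
done
git reset -q --soft HEAD~3                 # keep the changes, drop the 3 WIP commits
git commit -q -m "feat: one unit of work"  # single squashed commit
git log --oneline
```

The soft reset keeps the working tree and index intact, so the three WIP commits collapse into one clean commit without losing any changes.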
  • Production caution: don’t trust “ranked” PR scores if they’re editable

    • Peter Steinberger’s team saw a Greptile PR review score manually edited from 2/5 to 5/5 in an OpenClaw PR.
  • OSS maintainer playbook shift: tests as “reimplementation fuel”

    • Simon Willison notes that a comprehensive test suite can be enough to rebuild a library from scratch, and highlights tldraw moving tests to a private repo as a response pattern.

👤 PEOPLE TO WATCH

  • Andrej Karpathy — clearest firsthand articulation of what changed since December, plus a concrete “30 minutes, hands-off” agent-run build story and an orchestration north star (“Claws”).
  • Simon Willison — consistently turns agent usage into repeatable patterns (e.g., “linear walkthroughs”), and also documents sharp edges like Claude Code Remote Control’s failure modes.
  • Mitchell Hashimoto — high-signal model/tool preference note: Codex 5.3 displaced Opus 4.6 for him after direct comparison.
  • Kent C. Dodds — pragmatic day-to-day agent usage: parallel agents, long-running refactors, and agents surfacing UX edge cases during walkthroughs.
  • ThePrimeagen — counterweight: after ~3 months of vibe-coding, he says he hates the generated code and the “subtle offness,” and plans to “tradcode” (useful reality check on taste/intent gaps).

🎬 WATCH & LISTEN

  • No YouTube videos or podcast episodes were included in today’s source set, so there are no embeddable clips to share.

📊 PROJECTS & REPOS


Editorial take: The bottleneck is shifting from “can the agent write code?” to “can you reliably steer, verify, and govern what it did?”

swyx
x 3 docs

@swyx (affiliated with @cognition) shares insider analysis on Devin coding agent:

  • Lacked internal PMF at the 2024 launch; it took 6 months to land the first enterprise customer due to unready models and agent-pattern experimentation. Async agents were deemed the optimal UX ("Final Boss").
  • Usage doubled every 2 months per 2025 enterprise data, accelerating to every 6 weeks in 2026; internal usage is now 4x the 2025 peak.
  • Self-serve was hindered by repo-setup neglect (enterprises rely on forward-deployed engineers, FDEs).
  • Devin 2.2 addresses UX debt: first designer hired, all-hands sprint for self-serve polish, omnibox integration, review/main loop closure.
  • Battle-tested background agents in top enterprises; team iterates post-PMF.
  • Designer reports senior engineer directly implementing Figma vision.
  • Devin 3.0 incoming.

Example of the polished Slack integration: a feature request went to merge in ~8 hours (33 comments, 2–3 human reviews).

Quotes @ScottWu46: https://x.com/scottwu46/status/2026350958213787903

Simon Willison's Weblog

AI coding agents as a threat to OSS: comprehensive test suites enable rebuilding open source libraries from scratch, even in different languages.

Real-world example: Cloudflare's Vinext project ported Next.js to use Vite in one week using AI.

tldraw’s response (a collaborative drawing library whose custom license requires a commercial license for production use):

  • Moving tests to private repo: [tldraw issue #8082]
  • Joke issue to "defend IP" from agents by translating code to Traditional Chinese: [tldraw issue #8092]

Reported by Simon Willison (secondhand via @steveruizok).

swyx
x 5 docs

Devin 2.2 by Cognition: autonomous agent that can test with computer use, self-verify, and auto-fix its work. Try for free.

Key updates:

  • 3x faster startup
  • Fully redesigned interface
  • Computer use + virtual desktop

Hundreds more UX and functionality improvements.

@swyx: check usage numbers before drawing conclusions; celebrates progress across AI engineering/coding tools (e.g., questions the wandb FDE data); declines a Devin vs. Codex/Claude Code comparison.

@yishan (contrarian): the Devin startup should cash out ASAP amid AI turbulence.

Simon Willison's Weblog

Simon Willison, an experienced engineer, vibe-coded (via iterative LLM prompting) his macOS app Present (Swift/SwiftUI, 355KB) in ~45 minutes. It runs presentations as a sequence of URLs, so a browser crash can’t wipe out a talk in progress.

Starting prompt: "Build a SwiftUI app for giving presentations where every slide is a URL. The app starts as a window with a webview on the right and a UI on the left for adding, removing and reordering the sequence of URLs. Then you click Play in a menu and the app goes full screen and the left and right keys switch between URLs". Full implementation transcript: https://gisthost.github.io/?bfbc338977ceb71e298e4d4d5ac7d63c.

Remote control prompt: "Add a web server which listens on 0.0.0.0:9123—the web server serves a single mobile-friendly page with prominent left and right buttons...". Claude Code implemented a socket-based HTTP parser without libraries.
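
As a rough illustration of what a "socket-based HTTP parser without libraries" looks like (a Python sketch of the general idea; the real app is Swift and its exact parser isn't shown here):

```python
# Python sketch of a no-framework HTTP endpoint: read raw bytes off a socket,
# parse the request line by hand, answer with a minimal two-button page.
# Illustrative only -- not the app's actual Swift implementation.
import socket

def parse_request_line(raw: bytes) -> tuple[str, str]:
    """Pull method and path out of the first line of a raw HTTP request."""
    first_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = first_line.split(" ")
    return method, path

def serve_once(host: str = "127.0.0.1", port: int = 9123) -> None:
    """Answer exactly one request with a mobile-friendly page."""
    body = b"<html><body><button>prev</button> <button>next</button></body></html>"
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            method, _path = parse_request_line(conn.recv(4096))
            payload = body if method == "GET" else b""
            header = b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % len(payload)
            conn.sendall(header + payload)
```

Hand-rolling the parser avoids any HTTP framework dependency, which is exactly the trade-off the prompt asked the agent to make.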

Repo: https://github.com/simonw/present. Linear walkthrough pattern for reviewing agent-generated code: https://simonwillison.net/guides/agentic-engineering-patterns/linear-walkthroughs/. Walkthrough: https://github.com/simonw/present/blob/main/walkthrough.md.

Firsthand production use case for personal tool.

Simon Willison
x 1 doc

Simon Willison notes that Claude Code Remote Control and Cowork scheduled tasks overlap with OpenClaw, but both require leaving your computer powered on.

See his brief notes: https://simonwillison.net/2026/Feb/25/claude-code-remote-control/.

ThePrimeagen
x 1 doc

Contrarian firsthand account from @ThePrimeagen (prominent dev streamer, 85k+ views on the post): after 3 months of heavy 'vibe coding' (AI code generation on stream for projects he cares about), he hates the results—'everything I ask for and nothing I want', 'subtle offness'—and feels guilty for not matching Twitter hype on productivity gains. He has decided to 'tradcode' instead. Context: used in his stream environment (serious content creation).

Simon Willison's Weblog

Claude Code released the Remote Control feature: run claude remote-control on your machine to enable prompt-based control from the web, iOS app (shows as "Remote Control Session (Mac)"), or desktop app. Only one session per machine; requires per-action approval (ignores --dangerously-skip-permissions).

Firsthand testing by Simon Willison (personal use):

  • Initial error ("Remote Control is not enabled...") was fixed by logging out and back in via the terminal app
  • Frequent API 500 errors on prompts; restarts cause session failures without clear messaging
  • Example: generated an AppleScript to play a song in the Music app despite the error

Compares it to OpenClaw (stronger phone control); notes Claude Code lacks scheduling.

Docs: https://code.claude.com/docs/en/remote-control Announced: https://twitter.com/claudeai/status/2026418433911603668

swyx
x 4 docs

@karpathy (top AI practitioner) shares a firsthand production-like workflow using coding agents, noting a dramatic shift: "coding agents basically didn’t work before December and basically work since" due to improved model quality, coherence, and tenacity.

Exact prompt for building a local video analysis dashboard on DGX Spark: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”.

Agent execution (hands-off): ran ~30 min, resolved issues via online research, wrote/tested/debugged code, set up services/systemd, delivered a markdown report — vs. "easily ... a weekend project just 3 months ago".

Tools: vLLM (inference server), Qwen3-VL (model), systemd.

Paradigm: give English tasks to agents; manage/review parallel work. The prize in agentic engineering: "long-running orchestrator Claws" with tools/memory/instructions managing "multiple parallel Code instances".

Best practices/caveats: high-level direction/judgement/oversight; decompose for well-specified, verifiable tasks.

@swyx (podcaster/engineer accumulating notes on the same topic) confirms timely relevance.

swyx
x 4 docs

@swyx (@dxtipshq, @cognition, @temporalio, @aidotengineer, @latentspacepod) affirms the 'IDE is dead' thesis, heralding post-IDE form factors for agentic engineering; serving early adopters gives him a direct line of sight.

Augment's Intent integrates top code agent management ideas without vendor lock-in.

Tool comparisons:

  • Cursor 2.0: toe dip
  • Claude: folded into chat app
  • Codex: formalized Conductor patterns
  • Amazon Kiro: emphasizes Spec Driven Dev

Later: Cursor officially agrees IDE is dead.


Andrej Karpathy
x 1 doc

Andrej Karpathy (@karpathy, ex-Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford) reports coding agents now work effectively after major improvements in December, enabling disruption to traditional workflows with higher model quality, coherence, and tenacity for large tasks.

Concrete workflow example (personal home video analysis dashboard): provided the agent with an exact English prompt — “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me” — the agent autonomously handled setup, debugging, testing, and delivery in ~30 minutes (previously a full weekend project).

Paradigm shift: programming now involves spinning up AI agents for English tasks with parallel management/review; the key leverage in agentic engineering is long-running orchestrators ('Claws') with tools/memory/instructions managing multiple parallel 'Code' instances.

Timeless patterns: decompose tasks into well-specified/verifiable parts; provide high-level direction, oversight, iteration, hints. Firsthand production-like use on a side project.

geoff
x 1 doc

@GeoffreyHuntley, building @latentpatterns, demoed his hyper-personalised embedded software factory powered by Cursor (used over the last two weeks), revealing the [design] button and a product-that-is-itself-an-IDE world.

He implemented automatic end-to-end property-based (PBT) website testing with state impersonation (no real emails/charges) via @owickstrom's tool, achieved in ~1 hour using careful latent-space prompting—something that was once 'too hard'.

He stages the hairy things, but ship-and-test-in-prod is core to the design for shipping pace.

He calls agents more reliable than humans, after a git clean mishap lost marketing automation progress.

swyx
x 4 docs

Cursor AI Agent (cloud mode) autonomously reconstructed a product designer's portfolio website (Rachel Chen's) from a Twitter video demo alone.

Workflow (firsthand by @swyx):

  • Pasted the tweet with video into Cursor cloud; the agent worked 43 minutes to produce a functional clone—including a RachelLLM sidebar that demoed working—without further instructions.
  • Follow-up prompt for fidelity: analyze the video step-by-step, curl/discover assets headlessly, build a checklist/sitemap, spin up subagents/swarm for parallel work, iterate to completeness, trade off design vs. simplicity.
  • Yielded improved clone after another 43 minutes.

@swyx (affiliated with @dxtipshq/@cognition/@temporalio/@aidotengineer/@latentspacepod): "3 months ago... hell no" would this have been possible; the designer's job is safe, but the capability is impressive.

Simon Willison
x 2 docs

Simon Willison (@simonw, creator of Datasette, Django co-creator) shared a firsthand vibe-coding experience using Claude Code to build a SwiftUI macOS app: it turns a list of URLs into full-screen slides, remotely controllable from his phone.

Actionable technique: Prompt coding agents for “a linear walkthrough of the code that explains how it all works in detail”—he's had good results, demonstrated on this codebase.


Ben Tossell
x 2 docs

Andrej Karpathy (@karpathy) shares firsthand experience of a dramatic shift in coding agents since December: models now exhibit higher quality, long-term coherence, and tenacity for large tasks, disrupting traditional workflows.

Concrete workflow example: Prompted agent to build local video analysis dashboard—"Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me"—agent resolved issues autonomously over ~30 minutes (previously a weekend project).

Tools used: DGX Spark, vLLM, Qwen3-VL.

Emerging paradigm: spin up AI agents for English tasks, manage/review parallel work; the highest leverage in agentic engineering is long-running orchestrators (e.g., "Claws") with tools/memory/instructions managing multiple parallel Code instances.

Caveats/best practices: requires high-level direction/oversight; excels at well-specified, verifiable tasks; decompose tasks optimally.

Ben Tossell (@bentossell) amplifies it as the emergence of a "new technical class".

Kent C. Dodds ⚡
x 2 docs

Kent C. Dodds (@kentcdodds), dev educator and MVP, describes using Cursor agents for rapid idea implementation:

  • Fire off a prompt for an idea; agent figures it out, builds it, and provides a screenshot. Can run multiple agents in parallel, replacing manual issue creation.
  • During manual testing walkthrough, agent autonomously noticed a UX edge case.

Firsthand production-like workflow from experienced practitioner.

Peter Steinberger 🦞
x 1 doc

New @openclaw beta release announced by Peter Steinberger (@steipete), ClawFather and openclaw maintainer: security improvements, various fixes, heartbeat in DMs now a configurable setting (after user feedback), improved Slack threads, better subagents, and more reliable Telegram webhook.

Releases: https://github.com/openclaw/openclaw/releases.

Kent C. Dodds ⚡
x 1 doc

Kent C. Dodds (@kentcdodds), dev educator and MVP, kicked off a long-running Cursor AI agent overnight and iterated with it using the computer use feature this morning.

Achieved ~15k lines dropped in a refactor, calling it "pretty great".

Plans a blog post on the experience regardless of outcome.

Peter Steinberger 🦞
x 2 docs

Peter Steinberger (@steipete) reports their team uses Greptile as a PR review tool that ranks PRs.

Firsthand production usage revealed a vulnerability: someone manually edited a PR review score from 2/5 to 5/5, as shown in this OpenClaw PR: https://github.com/openclaw/openclaw/pull/13095. They note it adds clutter and suggest moving to a separate comment section.
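
One generic mitigation for the editable-score problem (my illustration of a standard technique, not anything Greptile implements): have the review service attach an HMAC to each score, so a hand-edited value no longer verifies.

```python
# Sketch: why client-editable display values can't be trusted, and a generic
# fix -- sign the score server-side so tampering is detectable. The PR id
# below comes from the OpenClaw example; the scheme itself is hypothetical.
import hmac
import hashlib

SECRET = b"reviewer-service-key"  # held only by the review service

def sign_score(pr_id: int, score: int) -> str:
    """Compute an HMAC tag binding a score to a specific PR."""
    msg = f"{pr_id}:{score}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_score(pr_id: int, score: int, tag: str) -> bool:
    """A manually edited score fails verification against its original tag."""
    return hmac.compare_digest(sign_score(pr_id, score), tag)
```

With this in place, editing a displayed 2/5 to 5/5 would be caught at verification time, since the tag was computed over the original score.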

geoff
x 2 docs

Geoffrey Huntley (@GeoffreyHuntley), builder of @latentpatterns, uses Cursor cloud agents to autonomously develop sales automation for his project while at the airport: "chilling here at the airport and my roomba is cleaning house whilst digital roomba builds me @latentpatterns sales automation".

Compares favorably to Ampcode: "everything i wanted @ampcode to be".

Self-improvement workflow with Cursor: use product → “ugh this could be better” → use product to improve product → spin up Cursor in the background. He describes it as the perfect UX for “software is clay”: shipping refinements as incremental agents.

Firsthand side project usage. Resources: Cursor, Ampcode, Latent Patterns.

Theo - t3.gg
x 2 docs

OpenClaw coding tool in use by @babykeem (secondhand report by @theo, developer/CEO @t3dotchat).

@babykeem (firsthand): experiencing an internal-reasoning-leaking issue; asks "how do u fix openclaw internal reasoning leaking".

@theo promotes: "Baby keem is using openclaw and you’re still writing code by hand".

Link: https://x.com/babykeem/status/2026836033757934056.