ZeroNoise
Coding agents hit a post-December step-change; Codex 5.3 momentum vs Opus 4.6; remote-control + orchestration patterns
Feb 26
5 min read
141 docs
Agents keep moving from “toy” to “teammate”: Karpathy reports a sharp post-December step-change and shares a hands-off, 30-minute end-to-end build example. Also: Codex 5.3 displacing Opus 4.6 for some power users, Claude Code Remote Control’s early reliability issues, and concrete workflow patterns for orchestration, review, and repo hygiene.

🔥 TOP SIGNAL

Coding agents crossed a “works in practice” threshold since December, driven (per Andrej Karpathy) by improved model quality, long-term coherence, and tenacity—enough to be disruptive to the default programming workflow. His concrete example: he handed an agent a single English brief to set up vLLM + Qwen3-VL, build a video inference endpoint + web UI, debug issues, install systemd services, and return a markdown report—hands-off in ~30 minutes.

🛠️ TOOLS & MODELS

  • GPT-5.3-Codex / Codex 5.3 vs Opus 4.6 (practitioner preference)

    • Mitchell Hashimoto says Codex 5.3 is “much more effective” than Opus 4.6, and that after going back and forth he hasn’t touched Opus for a week—“first model to get me off of Opus… ever”.
    • OpenAI’s Romain Huet says the team is “continuing to iterate and improve Codex every week”.
    • Tool reliability signal: Brian Lovin hit Claude Code 500s, tried Codex, and reported “Codex is good!”
  • Reasoning settings (Codex)

    • Sherwin Wu: they “basically only run [GPT-5.3-Codex] on xhigh nowadays for all coding tasks,” and notes speed improvements make it not feel slow even at xhigh.
    • Greg Brockman’s advice: “always run with xhigh reasoning.”
  • Claude Code — Remote Control (new capability, rough edges in testing)

    • Feature: run claude remote-control locally, then send prompts to that session from web/iOS/desktop; one session per machine and requires per-action approval.
    • Simon Willison reports it’s “a little bit janky,” including repeated API 500 errors and confusing failure behavior after restarting the program.
  • Devin 2.2 (Cognition)

    • Cognition markets Devin 2.2 as an autonomous agent that can test with computer use, self-verify, and auto-fix; also claims 3× faster startup, redesigned UI, and “computer use + virtual desktop”.
  • OpenClaw — new beta

    • Peter Steinberger: beta includes security improvements, various fixes, DM “heartbeat” made configurable after feedback, better Slack threads, improved subagents, and a more reliable Telegram webhook.
    • Releases: https://github.com/openclaw/openclaw/releases.
  • Sourcegraph 7.0 (positioning shift)

💡 WORKFLOWS & TRICKS

  • “English → parallel agents → you review” (Karpathy’s decomposition rule)

    • Karpathy’s pattern: agents aren’t perfect—they need high-level direction, judgment, taste, oversight, iteration, hints, and they work best when tasks are well-specified and verifiable/testable.
    • His operational heuristic: build intuition for task decomposition—hand off the parts that work well to agents, then “help out around the edges”.
    • Scaling idea: build long-running orchestrators (“Claws”) with tools/memory/instructions managing multiple parallel “Code” instances.
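
The orchestrator idea above can be sketched in a few lines. This is a conceptual illustration of the shape only: `run_agent` is a hypothetical stand-in for a real coding-agent invocation (CLI or API), not any actual agent interface.

```python
# Conceptual sketch of the "one orchestrator, many parallel Code instances"
# pattern: farm out well-specified, verifiable tasks and collect reports
# for human review. run_agent is a hypothetical stand-in, not a real API.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str) -> str:
    """Stand-in worker: a real version would launch a coding agent on one
    well-specified task and return its markdown report."""
    return f"report: {task!r} done"

def orchestrate(tasks: list[str], max_workers: int = 4) -> list[str]:
    """The long-running 'Claw': dispatch tasks to parallel workers,
    then hand the collected reports back for review."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_agent, tasks))
```

The human stays in the loop at the edges: the orchestrator handles dispatch and collection, while direction, judgment, and review remain manual.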
  • Cursor cloud agent: “clone it from a video” as a starting point, then iterate for fidelity

    • @swyx dropped a tweet + video into Cursor cloud expecting it not to work; he says Cursor Agent oneshotted a functional clone of Rachel Chen’s site from the video alone over 43 minutes (including a working “RachelLLM” sidebar).
    • His follow-up prompt for fidelity is a reusable template:
      • step through the video,
      • discover assets (headless run / curl / network snooping),
      • build a checklist + sitemap,
      • spin up subagents/swarm for parallel work,
      • don’t stop until behavior/visuals match closely; trade off fidelity vs simplicity when ambiguous.
    • He reports a second improved output after another 43 minutes.
  • Run many agents in parallel (Cursor) + let the agent do exploratory UX testing

    • Kent C. Dodds: he can run “as many of these [Cursor agents]” as he wants; instead of filing issues for ideas, he fires off prompts and gets back what the agent built (with screenshots).
    • He also reports the agent noticed a UX edge case on its own during a manual testing walkthrough.
  • Long-running agent refactors overnight (Cursor) + “computer use” for steering

    • Kent kicked off a long-running Cursor agent overnight and iterated in the morning using “computer use”.
    • He reports it dropped ~15k lines in a refactor.
  • Code review aid: ask for a linear walkthrough of the codebase (Simon Willison)

    • Willison’s prompt pattern: ask agents for “a linear walkthrough of the code that explains how it all works in detail” to understand vibe-coded output.
  • Git hygiene for agentic work: small commits, then squash (Huntley)

    • Geoffrey Huntley suggests an agent-friendly workflow: make incremental small commits, then squash them into a single commit so that “study the git log” for a unit of work becomes a single tool call.
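
A minimal sketch of that workflow in plain git (the exact commands are my illustration; Huntley's post doesn't prescribe them):

```shell
# Toy repo: simulate an agent making incremental commits, then squash them
# so the whole unit of work reads as one entry in `git log`.
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
# Wrapper so commits work without global git identity configured.
git() { command git -c user.email=agent@example.com -c user.name=agent "$@"; }
git commit -q --allow-empty -m "baseline"
for step in 1 2 3; do
  echo "change $step" > "file$step.txt"
  git add -A
  git commit -q -m "wip: step $step"       # small incremental commit
done
git reset -q --soft HEAD~3                 # keep the changes, drop the 3 WIP commits
git commit -q -m "feat: one unit of work"  # single squashed commit
git log --oneline
```

The soft reset keeps the working tree and index intact, so the three WIP commits collapse into one clean commit without losing any changes.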
  • Production caution: don’t trust “ranked” PR scores if they’re editable

    • Peter Steinberger’s team saw a Greptile PR review score manually edited from 2/5 to 5/5 in an OpenClaw PR.
  • OSS maintainer playbook shift: tests as “reimplementation fuel”

    • Simon Willison notes that a comprehensive test suite can be enough to rebuild a library from scratch, and highlights tldraw moving tests to a private repo as a response pattern.

👤 PEOPLE TO WATCH

  • Andrej Karpathy — clearest firsthand articulation of what changed since December, plus a concrete “30 minutes, hands-off” agent-run build story and an orchestration north star (“Claws”).
  • Simon Willison — consistently turns agent usage into repeatable patterns (e.g., “linear walkthroughs”), and also documents sharp edges like Claude Code Remote Control’s failure modes.
  • Mitchell Hashimoto — high-signal model/tool preference note: Codex 5.3 displaced Opus 4.6 for him after direct comparison.
  • Kent C. Dodds — pragmatic day-to-day agent usage: parallel agents, long-running refactors, and agents surfacing UX edge cases during walkthroughs.
  • ThePrimeagen — counterweight: after ~3 months of vibe-coding, he says he hates the generated code and the “subtle offness,” and plans to “tradcode” (useful reality check on taste/intent gaps).

🎬 WATCH & LISTEN

  • No YouTube videos or podcast episodes were included in today’s source set, so there are no embeddable clips to share.

📊 PROJECTS & REPOS


Editorial take: The bottleneck is shifting from “can the agent write code?” to “can you reliably steer, verify, and govern what it did?”

swyx
x 3 docs

@swyx (affiliated with @cognition) shares insider analysis on Devin coding agent:

  • Lacked internal PMF at the 2024 launch; it took 6 months to land the first enterprise customer due to unready models and agent-pattern experimentation. Async agents were deemed the optimal UX ("Final Boss").
  • Usage doubled every 2 months per 2025 enterprise data, accelerating to every 6 weeks in 2026; internal usage is now 4x the 2025 peak.
  • Self-serve was hindered by repo-setup neglect (enterprises rely on forward-deployed engineers, FDEs).
  • Devin 2.2 addresses UX debt: first designer hired, all-hands sprint for self-serve polish, omnibox integration, review/main loop closure.
  • Battle-tested background agents in top enterprises; team iterates post-PMF.
  • Designer reports senior engineer directly implementing Figma vision.
  • Devin 3.0 incoming.

Example of the polished Slack integration: a feature request went to merge in ~8 hours (33 comments, 2–3 human reviews).

Quotes @ScottWu46: https://x.com/scottwu46/status/2026350958213787903

Simon Willison's Weblog

AI coding agents as a threat to OSS: comprehensive test suites enable rebuilding open source libraries from scratch, even in different languages.

Real-world example: Cloudflare's Vinext project ported Next.js to use Vite in one week using AI.

tldraw’s response (a collaborative drawing library whose custom license requires a commercial license for production use):

  • Moving tests to private repo: [tldraw issue #8082]
  • Joke issue to "defend IP" from agents by translating code to Traditional Chinese: [tldraw issue #8092]

Reported by Simon Willison (secondhand via @steveruizok).

swyx
x 5 docs

Devin 2.2 by Cognition: autonomous agent that can test with computer use, self-verify, and auto-fix its work. Try for free.

Key updates:

  • 3x faster startup
  • Fully redesigned interface
  • Computer use + virtual desktop

Hundreds more UX and functionality improvements.

@swyx: check usage numbers before drawing conclusions; celebrates progress across AI engineering/coding tools (e.g., questions the wandb FDE data); declines a Devin vs. Codex/Claude Code comparison.

@yishan (contrarian): the Devin startup should cash out ASAP amid AI turbulence.

Simon Willison's Weblog

Simon Willison, an experienced engineer, vibe-coded (via iterative LLM prompting) his macOS app Present (Swift/SwiftUI, 355KB) in ~45 minutes. It runs presentations as a sequence of URLs, so a browser crash can’t wipe out a talk in progress.

Starting prompt: "Build a SwiftUI app for giving presentations where every slide is a URL. The app starts as a window with a webview on the right and a UI on the left for adding, removing and reordering the sequence of URLs. Then you click Play in a menu and the app goes full screen and the left and right keys switch between URLs". Full implementation transcript: https://gisthost.github.io/?bfbc338977ceb71e298e4d4d5ac7d63c.

Remote control prompt: "Add a web server which listens on 0.0.0.0:9123—the web server serves a single mobile-friendly page with prominent left and right buttons...". Claude Code implemented a socket-based HTTP parser without libraries.
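
As a rough illustration of what a "socket-based HTTP parser without libraries" looks like (a Python sketch of the general idea; the real app is Swift and its exact parser isn't shown here):

```python
# Python sketch of a no-framework HTTP endpoint: read raw bytes off a socket,
# parse the request line by hand, answer with a minimal two-button page.
# Illustrative only -- not the app's actual Swift implementation.
import socket

def parse_request_line(raw: bytes) -> tuple[str, str]:
    """Pull method and path out of the first line of a raw HTTP request."""
    first_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = first_line.split(" ")
    return method, path

def serve_once(host: str = "127.0.0.1", port: int = 9123) -> None:
    """Answer exactly one request with a mobile-friendly page."""
    body = b"<html><body><button>prev</button> <button>next</button></body></html>"
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            method, _path = parse_request_line(conn.recv(4096))
            payload = body if method == "GET" else b""
            header = b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % len(payload)
            conn.sendall(header + payload)
```

Hand-rolling the parser avoids any HTTP framework dependency, which is exactly the trade-off the prompt asked the agent to make.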

Repo: https://github.com/simonw/present. Linear walkthrough pattern for reviewing agent-generated code: https://simonwillison.net/guides/agentic-engineering-patterns/linear-walkthroughs/. Walkthrough: https://github.com/simonw/present/blob/main/walkthrough.md.

Firsthand production use case for personal tool.

Simon Willison
x 1 doc

Simon Willison notes that Claude Code Remote Control and Cowork scheduled tasks overlap with OpenClaw, but both require leaving your computer powered on.

See his brief notes: https://simonwillison.net/2026/Feb/25/claude-code-remote-control/.

ThePrimeagen
x 1 doc

Contrarian firsthand account from @ThePrimeagen (prominent dev streamer, 85k+ views on the post): after 3 months of heavy 'vibe coding' (AI code generation on stream for projects he cares about), he hates the results—'everything I ask for and nothing I want', 'subtle offness'—and feels guilty for not matching Twitter hype on productivity gains. He has decided to 'tradcode' instead. Context: used in his stream environment (serious content creation).

Simon Willison's Weblog

Claude Code released the Remote Control feature: run claude remote-control on your machine to enable prompt-based control from the web, iOS app (shows as "Remote Control Session (Mac)"), or desktop app. Only one session per machine; requires per-action approval (ignores --dangerously-skip-permissions).

Firsthand testing by Simon Willison (personal use):

  • Initial error ("Remote Control is not enabled...") was fixed by logging out and back in via the terminal app
  • Frequent API 500 errors on prompts; restarts cause session failures without clear messaging
  • Example: generated an AppleScript to play a song in the Music app despite the error

Compares it to OpenClaw (stronger phone control); notes Claude Code lacks scheduling.

Docs: https://code.claude.com/docs/en/remote-control Announced: https://twitter.com/claudeai/status/2026418433911603668

swyx
x 4 docs

@karpathy (top AI practitioner) shares a firsthand production-like workflow using coding agents, noting a dramatic shift: "coding agents basically didn’t work before December and basically work since" due to improved model quality, coherence, and tenacity.

Exact prompt for building a local video analysis dashboard on DGX Spark: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”.

Agent execution (hands-off): ran ~30 min, resolved issues via online research, wrote/tested/debugged code, set up services/systemd, delivered a markdown report — vs. "easily ... a weekend project just 3 months ago".

Tools: vLLM (inference server), Qwen3-VL (model), systemd.

Paradigm: give English tasks to agents; manage/review parallel work. The prize in agentic engineering: "long-running orchestrator Claws" with tools/memory/instructions managing "multiple parallel Code instances".

Best practices/caveats: high-level direction/judgement/oversight; decompose for well-specified, verifiable tasks.

@swyx (podcaster/engineer accumulating notes on the same topic) confirms timely relevance.

swyx
x 4 docs

@swyx (@dxtipshq, @cognition, @temporalio, @aidotengineer, @latentspacepod) affirms the 'IDE is dead' thesis, heralding post-IDE form factors for agentic engineering; serving early adopters gives him a direct line of sight.

Augment's Intent integrates top code agent management ideas without vendor lock-in.

Tool comparisons:

  • Cursor 2.0: toe dip
  • Claude: folded into chat app
  • Codex: formalized Conductor patterns
  • Amazon Kiro: emphasizes Spec Driven Dev

Later: Cursor officially agrees IDE is dead.


Andrej Karpathy
x 1 doc

Andrej Karpathy (@karpathy, ex-Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford) reports coding agents now work effectively after major improvements in December, enabling disruption to traditional workflows with higher model quality, coherence, and tenacity for large tasks.

Concrete workflow example (personal home video analysis dashboard): provided the agent with an exact English prompt — “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me” — the agent autonomously handled setup, debugging, testing, and delivery in ~30 minutes (previously a full weekend project).

Paradigm shift: programming now involves spinning up AI agents for English tasks with parallel management/review; the key leverage in agentic engineering is long-running orchestrators ('Claws') with tools/memory/instructions managing multiple parallel 'Code' instances.

Timeless patterns: decompose tasks into well-specified/verifiable parts; provide high-level direction, oversight, iteration, hints. Firsthand production-like use on a side project.

geoff
x 1 doc

@GeoffreyHuntley, building @latentpatterns, demoed his hyper-personalised embedded software factory powered by Cursor (used over the last two weeks), revealing the [design] button and a product-that-is-itself-an-IDE world.

He implemented automatic end-to-end property-based (PBT) website testing with state impersonation (no real emails/charges) via @owickstrom's tool, achieved in ~1 hour using careful latent-space prompting—something that was once 'too hard'.

He stages the hairy things, but ship-and-test-in-prod is core to the design for shipping pace.

He calls agents more reliable than humans, after a git clean mishap lost marketing automation progress.

swyx
x 4 docs

Cursor AI Agent (cloud mode) autonomously reconstructed a product designer's portfolio website (Rachel Chen's) from a Twitter video demo alone.

Workflow (firsthand by @swyx):

  • Pasted the tweet with video into Cursor cloud; the agent worked 43 minutes to produce a functional clone—including a RachelLLM sidebar that demoed working—without further instructions.
  • Follow-up prompt for fidelity: analyze the video step-by-step, curl/discover assets headlessly, build a checklist/sitemap, spin up subagents/swarm for parallel work, iterate to completeness, trade off design vs. simplicity.
  • Yielded improved clone after another 43 minutes.

@swyx (affiliated with @dxtipshq/@cognition/@temporalio/@aidotengineer/@latentspacepod): "3 months ago... hell no" would this have been possible; the designer's job is safe, but the capability is impressive.

Simon Willison
x 2 docs

Simon Willison (@simonw, creator of Datasette, Django co-creator) shared a firsthand vibe-coding experience using Claude Code to build a SwiftUI macOS app: it turns a list of URLs into full-screen slides, remotely controllable from his phone.

Actionable technique: Prompt coding agents for “a linear walkthrough of the code that explains how it all works in detail”—he's had good results, demonstrated on this codebase.


Ben Tossell
x 2 docs

Andrej Karpathy (@karpathy) shares firsthand experience of a dramatic shift in coding agents since December: models now exhibit higher quality, long-term coherence, and tenacity for large tasks, disrupting traditional workflows.

Concrete workflow example: Prompted agent to build local video analysis dashboard—"Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me"—agent resolved issues autonomously over ~30 minutes (previously a weekend project).

Tools used: DGX Spark, vLLM, Qwen3-VL.

Emerging paradigm: spin up AI agents for English tasks, manage/review parallel work; the highest leverage in agentic engineering is long-running orchestrators (e.g., "Claws") with tools/memory/instructions managing multiple parallel Code instances.

Caveats/best practices: requires high-level direction/oversight; excels at well-specified, verifiable tasks; decompose tasks optimally.

Ben Tossell (@bentossell) amplifies it as the emergence of a "new technical class".

Kent C. Dodds ⚡
x 2 docs

Kent C. Dodds (@kentcdodds), dev educator and MVP, describes using Cursor agents for rapid idea implementation:

  • Fire off a prompt for an idea; agent figures it out, builds it, and provides a screenshot. Can run multiple agents in parallel, replacing manual issue creation.
  • During manual testing walkthrough, agent autonomously noticed a UX edge case.

Firsthand production-like workflow from experienced practitioner.

Peter Steinberger 🦞
x 1 doc

New @openclaw beta release announced by Peter Steinberger (@steipete), ClawFather and openclaw maintainer: security improvements, various fixes, heartbeat in DMs now a configurable setting (after user feedback), improved Slack threads, better subagents, and more reliable Telegram webhook.

Releases: https://github.com/openclaw/openclaw/releases.

Kent C. Dodds ⚡
x 1 doc

Kent C. Dodds (@kentcdodds), dev educator and MVP, kicked off a long-running Cursor AI agent overnight and iterated with it using the computer use feature this morning.

Achieved ~15k lines dropped in a refactor, calling it "pretty great".

Plans a blog post on the experience regardless of outcome.

Peter Steinberger 🦞
x 2 docs

Peter Steinberger (@steipete) reports their team uses Greptile as a PR review tool that ranks PRs.

Firsthand production usage revealed a vulnerability: someone manually edited a PR review score from 2/5 to 5/5, as shown in this OpenClaw PR: https://github.com/openclaw/openclaw/pull/13095. They note it adds clutter and suggest moving to a separate comment section.
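
One generic mitigation for the editable-score problem (my illustration of a standard technique, not anything Greptile implements): have the review service attach an HMAC to each score, so a hand-edited value no longer verifies.

```python
# Sketch: why client-editable display values can't be trusted, and a generic
# fix -- sign the score server-side so tampering is detectable. The PR id
# below comes from the OpenClaw example; the scheme itself is hypothetical.
import hmac
import hashlib

SECRET = b"reviewer-service-key"  # held only by the review service

def sign_score(pr_id: int, score: int) -> str:
    """Compute an HMAC tag binding a score to a specific PR."""
    msg = f"{pr_id}:{score}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_score(pr_id: int, score: int, tag: str) -> bool:
    """A manually edited score fails verification against its original tag."""
    return hmac.compare_digest(sign_score(pr_id, score), tag)
```

With this in place, editing a displayed 2/5 to 5/5 would be caught at verification time, since the tag was computed over the original score.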

geoff
x 2 docs

Geoffrey Huntley (@GeoffreyHuntley), builder of @latentpatterns, uses Cursor cloud agents to autonomously develop sales automation for his project while at the airport: "chilling here at the airport and my roomba is cleaning house whilst digital roomba builds me @latentpatterns sales automation".

Compares favorably to Ampcode: "everything i wanted @ampcode to be".

Self-improvement workflow with Cursor: use product → “ugh this could be better” → use product to improve product → spin up Cursor in the background. He describes it as the perfect UX for “software is clay”: shipping refinements as incremental agents.

Firsthand side project usage. Resources: Cursor, Ampcode, Latent Patterns.

Theo - t3.gg
x 2 docs

OpenClaw coding tool in use by @babykeem (secondhand report by @theo, developer/CEO @t3dotchat).

@babykeem (firsthand): experiencing an internal-reasoning-leaking issue; asks "how do u fix openclaw internal reasoning leaking".

@theo promotes: "Baby keem is using openclaw and you’re still writing code by hand".

Link: https://x.com/babykeem/status/2026836033757934056.