ZeroNoise
Coding agents hit a post-December step-change; Codex 5.3 momentum vs Opus 4.6; remote-control + orchestration patterns
Feb 26
5 min read
141 docs
Agents keep moving from “toy” to “teammate”: Karpathy reports a sharp post-December step-change and shares a hands-off, 30-minute end-to-end build example. Also: Codex 5.3 displacing Opus 4.6 for some power users, Claude Code Remote Control’s early reliability issues, and concrete workflow patterns for orchestration, review, and repo hygiene.

🔥 TOP SIGNAL

Coding agents crossed a “works in practice” threshold since December, driven (per Andrej Karpathy) by improved model quality, long-term coherence, and tenacity—enough to be disruptive to the default programming workflow. His concrete example: he handed an agent a single English brief to set up vLLM + Qwen3-VL, build a video inference endpoint + web UI, debug issues, install systemd services, and return a markdown report—hands-off in ~30 minutes.
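The systemd step in that run is representative of the artifacts such a build leaves behind. As a rough illustration only (the unit name, model id, and flags below are assumptions, not details from Karpathy's report), a vLLM endpoint service might be installed like this:

```shell
# Illustrative sketch: unit name, model id, and flags are assumptions,
# not details from Karpathy's actual run.
cat > vllm-qwen.service <<'EOF'
[Unit]
Description=vLLM video-inference endpoint (Qwen3-VL)
After=network-online.target

[Service]
ExecStart=/usr/bin/env vllm serve Qwen/Qwen3-VL --host 0.0.0.0 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
# Install and start (requires root):
#   sudo mv vllm-qwen.service /etc/systemd/system/
#   sudo systemctl daemon-reload && sudo systemctl enable --now vllm-qwen
```

The point is less the specific flags than that an agent can now emit and wire up this whole layer unattended.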

🛠️ TOOLS & MODELS

  • GPT-5.3-Codex / Codex 5.3 vs Opus 4.6 (practitioner preference)

    • Mitchell Hashimoto says Codex 5.3 is “much more effective” than Opus 4.6, and that after going back and forth he hasn’t touched Opus for a week—“first model to get me off of Opus… ever”.
    • OpenAI’s Romain Huet says the team is “continuing to iterate and improve Codex every week”.
    • Tool reliability signal: Brian Lovin hit Claude Code 500s, tried Codex, and reported “Codex is good!”.
  • Reasoning settings (Codex)

    • Sherwin Wu: they “basically only run [GPT-5.3-Codex] on xhigh nowadays for all coding tasks,” and he notes that speed improvements mean it doesn’t feel slow even at xhigh.
    • Greg Brockman’s advice: “always run with xhigh reasoning.”
  • Claude Code — Remote Control (new capability, rough edges in testing)

    • Feature: run claude remote-control locally, then send prompts to that session from the web, iOS, or desktop apps; it supports one session per machine and requires per-action approval.
    • Simon Willison reports it’s “a little bit janky,” including repeated API 500 errors and confusing failure behavior after restarting the program.
  • Devin 2.2 (Cognition)

    • Cognition markets Devin 2.2 as an autonomous agent that can test with computer use, self-verify, and auto-fix; it also claims 3× faster startup, a redesigned UI, and “computer use + virtual desktop”.
  • OpenClaw — new beta

    • Peter Steinberger: beta includes security improvements, various fixes, DM “heartbeat” made configurable after feedback, better Slack threads, improved subagents, and a more reliable Telegram webhook.
    • Releases: https://github.com/openclaw/openclaw/releases.
  • Sourcegraph 7.0 (positioning shift)

💡 WORKFLOWS & TRICKS

  • “English → parallel agents → you review” (Karpathy’s decomposition rule)

    • Karpathy’s pattern: agents aren’t perfect—they need high-level direction, judgment, taste, oversight, iteration, hints, and they work best when tasks are well-specified and verifiable/testable.
    • His operational heuristic: build intuition for task decomposition—hand off the parts that work well to agents, then “help out around the edges”.
    • Scaling idea: build long-running orchestrators (“Claws”) with tools/memory/instructions managing multiple parallel “Code” instances.
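That orchestration shape (fan tasks out, wait, then review) is easy to sketch in shell. Here run_agent is a hypothetical stub standing in for a real Code instance; no actual agent CLI is invoked, and the task names are placeholders:

```shell
# Sketch of an orchestrator fanning tasks out to parallel agent sessions.
# run_agent is a stub; in practice it would invoke a real coding-agent CLI.
run_agent() {
  echo "agent picked up: $1"   # placeholder for the agent's actual work
}

tasks="refactor-auth add-tests update-docs"
for t in $tasks; do
  run_agent "$t" > "agent-$t.log" 2>&1 &   # one background session per task
done
wait                                        # orchestrator blocks until all finish
cat agent-*.log                             # review step: inspect every result
```

The review step at the end is the part Karpathy stresses stays human: the orchestrator collects, you judge.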
  • Cursor cloud agent: “clone it from a video” as a starting point, then iterate for fidelity

    • @swyx dropped a tweet + video into Cursor cloud expecting it not to work; instead, he says, Cursor Agent oneshotted a functional clone of Rachel Chen’s site from the video alone in 43 minutes, including a working “RachelLLM” sidebar.
    • His follow-up prompt for fidelity is a reusable template:
      • step through the video,
      • discover assets (headless run / curl / network snooping),
      • build a checklist + sitemap,
      • spin up subagents/swarm for parallel work,
      • don’t stop until behavior/visuals match closely; trade off fidelity vs. simplicity when ambiguous.
    • He reports a second improved output after another 43 minutes.
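The asset-discovery step in that template can start as plain curl-and-grep. A minimal sketch, using a stand-in page instead of a live fetch (a real run would curl the target site first):

```shell
# Stand-in page; a real run would fetch it, e.g.: curl -s "$SITE_URL" > page.html
cat > page.html <<'EOF'
<html><body>
  <img src="/img/hero.png">
  <script src="/js/app.js"></script>
  <link href="/css/site.css" rel="stylesheet">
</body></html>
EOF

# Pull every src/href reference into a deduplicated checklist of assets to clone.
grep -oE '(src|href)="[^"]+"' page.html | cut -d'"' -f2 | sort -u > assets.txt
cat assets.txt
```

Network snooping from a headless browser run catches assets loaded at runtime that static grep misses, which is why the template lists both.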
  • Run many agents in parallel (Cursor) + let the agent do exploratory UX testing

    • Kent C. Dodds: he can run “as many of these [Cursor agents]” as he wants; instead of filing issues for ideas, he fires off prompts and gets back what the agent built (with screenshots).
    • He also reports that, during manual testing, the agent “noticed one UX edge case during walkthrough”.
  • Long-running agent refactors overnight (Cursor) + “computer use” for steering

    • Kent kicked off a long-running Cursor agent overnight and iterated in the morning using “computer use”.
    • He reports it dropped ~15k lines in a refactor.
  • Code review aid: ask for a linear walkthrough of the codebase (Simon Willison)

    • Willison’s prompt pattern: ask agents for “a linear walkthrough of the code that explains how it all works in detail” to understand vibe-coded output.
  • Git hygiene for agentic work: small commits, then squash (Huntley)

    • Geoffrey Huntley suggests an agent-friendly workflow: make incremental small commits, then squash them into a single commit so that a “study git log” pass over a unit of work takes just one tool call.
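A minimal, self-contained sketch of that commit-then-squash loop, run in a throwaway repo (branch name and commit messages are illustrative; --allow-empty stands in for real edits):

```shell
# Throwaway repo so the demo is self-contained; in real work, use your branch.
git init -q -b main demo && cd demo
git config user.email agent@example.com
git config user.name  demo
git commit -q --allow-empty -m "base"
git checkout -q -b agent/fix-parser

# 1) small incremental commits while the agent iterates
git commit -q --allow-empty -m "wip: failing regression test"
git commit -q --allow-empty -m "wip: fix off-by-one"
git commit -q --allow-empty -m "wip: cleanup"

# 2) squash the whole unit of work into one commit, so a later
#    "study git log" is a single tool call
git reset -q --soft "$(git merge-base main HEAD)"
git commit -q --allow-empty -m "fix parser off-by-one"
```

`git reset --soft` keeps the combined changes staged while rewinding history to the branch point, which is what makes the final commit a clean squash.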
  • Production caution: don’t trust “ranked” PR scores if they’re editable

  • OSS maintainer playbook shift: tests as “reimplementation fuel”

    • Simon Willison notes that a comprehensive test suite can be enough to rebuild a library from scratch, and highlights tldraw moving tests to a private repo as a response pattern.

👤 PEOPLE TO WATCH

  • Andrej Karpathy — clearest firsthand articulation of what changed since December, plus a concrete “30 minutes, hands-off” agent-run build story and an orchestration north star (“Claws”).
  • Simon Willison — consistently turns agent usage into repeatable patterns (e.g., “linear walkthroughs”), and also documents sharp edges like Claude Code Remote Control’s failure modes.
  • Mitchell Hashimoto — high-signal model/tool preference note: Codex 5.3 displaced Opus 4.6 for him after direct comparison.
  • Kent C. Dodds — pragmatic day-to-day agent usage: parallel agents, long-running refactors, and agents surfacing UX edge cases during walkthroughs.
  • ThePrimeagen — counterweight: after ~3 months of vibe-coding, he says he hates the generated code and the “subtle offness,” and plans to “tradcode” (useful reality check on taste/intent gaps).

🎬 WATCH & LISTEN

  • No YouTube videos or podcast episodes were included in today’s source set, so there are no embeddable clips to share.

Editorial take: The bottleneck is shifting from “can the agent write code?” to “can you reliably steer, verify, and govern what it did?”