🔥 TOP SIGNAL
Coding agents crossed a “works in practice” threshold since December, driven (per Andrej Karpathy) by improved model quality, long-term coherence, and tenacity—enough to be disruptive to the default programming workflow. His concrete example: he handed an agent a single English brief to set up vLLM + Qwen3-VL, build a video inference endpoint + web UI, debug issues, install systemd services, and return a markdown report—hands-off in ~30 minutes.
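For flavor, here is a minimal sketch of calling the kind of endpoint that brief would produce: vLLM serving Qwen3-VL behind its OpenAI-compatible API. The model id, port, and video-part schema are illustrative assumptions, not details from Karpathy's report.

```python
# Hypothetical client for a locally served Qwen3-VL model, assuming
# something like `vllm serve Qwen/Qwen3-VL-8B-Instruct` is running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            # vLLM accepts video inputs as a "video_url" content part for
            # video-capable models; the exact schema can vary by version.
            {"type": "video_url", "video_url": {"url": "file:///tmp/clip.mp4"}},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }],
)
print(response.choices[0].message.content)
```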
🛠️ TOOLS & MODELS
GPT-5.3-Codex / Codex 5.3 vs Opus 4.6 (practitioner preference)
- Mitchell Hashimoto says Codex 5.3 is “much more effective” than Opus 4.6, and that after going back and forth he hasn’t touched Opus for a week—“first model to get me off of Opus… ever”.
- OpenAI’s Romain Huet says the team is “continuing to iterate and improve Codex every week”.
- Tool reliability signal: Brian Lovin hit Claude Code 500s, tried Codex, and reported “Codex is good!”
Reasoning settings (Codex)
- Sherwin Wu: they “basically only run [GPT-5.3-Codex] on `xhigh` nowadays for all coding tasks,” and notes speed improvements make it not feel slow even at `xhigh`.
- Greg Brockman’s advice: “always run with xhigh reasoning”.
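In API terms, `xhigh` corresponds to a reasoning-effort parameter. A hedged sketch using the OpenAI Responses API follows; the model id is taken from the discussion above, and whether the API accepts `xhigh` for it (rather than only the CLI config) is an assumption.

```python
# Sketch only: requesting maximum reasoning effort on a Codex model.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.3-codex",          # model id as discussed above (assumed)
    reasoning={"effort": "xhigh"},  # Brockman: "always run with xhigh reasoning"
    input="Refactor this recursive parser into an iterative one: ...",
)
print(response.output_text)
```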
Claude Code — Remote Control (new capability, rough edges in testing)
- Feature: run `claude remote-control` locally, then send prompts to that session from web/iOS/desktop; one session per machine, and each action requires approval.
- Simon Willison reports it’s “a little bit janky,” including repeated API 500 errors and confusing failure behavior after restarting the program.
Devin 2.2 (Cognition)
- Cognition markets Devin 2.2 as an autonomous agent that can test with computer use, self-verify, and auto-fix; it also claims 3× faster startup, a redesigned UI, and “computer use + virtual desktop”.
OpenClaw — new beta
- Peter Steinberger: beta includes security improvements, various fixes, DM “heartbeat” made configurable after feedback, better Slack threads, improved subagents, and a more reliable Telegram webhook.
- Releases: https://github.com/openclaw/openclaw/releases.
Sourcegraph 7.0 (positioning shift)
- Sourcegraph says 7.0 marks a new chapter: doubling down on being an “intelligence layer” that developers and AI agents rely on to navigate, understand, and operate on large codebases.
- Details: https://sourcegraph.com/blog/a-new-era-for-sourcegraph-the-intelligence-layer-for-ai-coding-agents-and-developers.
💡 WORKFLOWS & TRICKS
“English → parallel agents → you review” (Karpathy’s decomposition rule)
- Karpathy’s pattern: agents aren’t perfect—they need high-level direction, judgment, taste, oversight, iteration, hints, and they work best when tasks are well-specified and verifiable/testable.
- His operational heuristic: build intuition for task decomposition—hand off the parts that work well to agents, then “help out around the edges”.
- Scaling idea: build long-running orchestrators (“Claws”) with tools/memory/instructions managing multiple parallel “Code” instances; a minimal sketch of that shape follows this list.
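A minimal sketch of the orchestrator idea, assuming Claude Code's headless print mode (`claude -p`); the task list and per-task git worktrees are illustrative choices, not part of Karpathy's proposal.

```python
# Hypothetical "Claw": one long-running loop fanning tasks out to
# parallel headless agent runs, then collecting their reports.
import asyncio

TASKS = [
    "Add input validation to the upload endpoint",
    "Write tests for the retry logic in client.py",
    "Update the README quickstart for the new CLI flags",
]

async def run_agent(task: str, workdir: str) -> str:
    # `claude -p` runs one non-interactive turn and prints the result.
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", task,
        cwd=workdir,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode()

async def main() -> None:
    # One pre-created git worktree per agent keeps parallel edits
    # from colliding in a shared checkout.
    results = await asyncio.gather(
        *(run_agent(t, f"worktrees/task-{i}") for i, t in enumerate(TASKS))
    )
    for task, report in zip(TASKS, results):
        print(f"=== {task} ===\n{report}\n")

asyncio.run(main())
```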
Cursor cloud agent: “clone it from a video” as a starting point, then iterate for fidelity
- @swyx dropped a tweet + video into Cursor cloud expecting it not to work; he says Cursor Agent oneshotted a functional clone of Rachel Chen’s site from the video alone over 43 minutes (including a working “RachelLLM” sidebar).
- His follow-up prompt for fidelity is a reusable template (paraphrased as a snippet after this list):
- step through the video,
- discover assets (headless run / curl / network snooping),
- build a checklist + sitemap,
- spin up subagents/swarm for parallel work,
- don’t stop until behavior/visuals match closely; trade off fidelity vs. simplicity when ambiguous.
- He reports a second improved output after another 43 minutes.
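Packaged as a constant for reuse; this is a paraphrase assembled from the bullet summary above, not swyx's verbatim prompt.

```python
# Hedged paraphrase of the follow-up prompt, kept as a reusable template.
FIDELITY_PROMPT = """\
Step through the attached video carefully.
Discover the site's real assets: run it headless, curl pages, snoop network requests.
Build a checklist and sitemap covering every page and interaction.
Spin up subagents and work the checklist in parallel.
Do not stop until behavior and visuals closely match the video.
Where the video is ambiguous, trade fidelity off against simplicity.
"""
```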
Run many agents in parallel (Cursor) + let the agent do exploratory UX testing
- Kent C. Dodds: he can run “as many of these [Cursor agents]” as he wants; instead of filing issues for ideas, he fires off prompts and gets back what it built (with screenshots).
- He also reports the agent “noticed one UX edge case during walkthrough” while he was doing manual testing.
Long-running agent refactors overnight (Cursor) + “computer use” for steering
- Kent kicked off a long-running Cursor agent overnight and iterated in the morning using “computer use”.
- He reports it dropped ~15k lines in a refactor.
Code review aid: ask for a linear walkthrough of the codebase (Simon Willison)
- Willison’s prompt pattern: ask agents for “a linear walkthrough of the code that explains how it all works in detail” to understand vibe-coded output.
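One hedged way to script the pattern, assuming Claude Code's `claude -p` headless mode (any agent CLI with a non-interactive mode would do); the output filename mirrors the walkthrough.md in Willison's repo listed below.

```python
# Sketch: run the walkthrough prompt headlessly and save the result.
import subprocess

walkthrough = subprocess.run(
    ["claude", "-p",
     "Give me a linear walkthrough of the code that explains "
     "how it all works in detail"],
    capture_output=True, text=True, check=True,
).stdout

with open("walkthrough.md", "w") as f:
    f.write(walkthrough)
```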
Git hygiene for agentic work: small commits, then squash (Huntley)
- Geoffrey Huntley suggests an agent-friendly workflow: make incremental small commits, then squash them to a single commit so that “study git log” for a unit of work can be a single tool call (a sketch follows below).
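A minimal sketch of the squash step, assuming the incremental commits live on a branch ahead of some base ref; the wrapper function is illustrative, while the underlying git commands are standard.

```python
# Collapse all incremental agent commits since `base` into one commit,
# so "study git log" for the unit of work is a single entry.
import subprocess

def squash_unit_of_work(base: str, message: str) -> None:
    # Soft reset moves the branch pointer back to `base` while keeping
    # the working tree and index exactly as they are.
    subprocess.run(["git", "reset", "--soft", base], check=True)
    # Re-commit the accumulated changes as one squashed commit.
    subprocess.run(["git", "commit", "-m", message], check=True)

squash_unit_of_work("main", "feat: add retry logic with tests")
```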
Production caution: don’t trust “ranked” PR scores if they’re editable
- Steinberger says they use Greptile to rank PRs, but observed that someone had manually edited a PR review score from 2/5 to 5/5.
- Example PR: https://github.com/openclaw/openclaw/pull/13095.
OSS maintainer playbook shift: tests as “reimplementation fuel”
- Simon Willison notes that a comprehensive test suite can be enough to rebuild a library from scratch, and highlights tldraw moving tests to a private repo as a response pattern.
👤 PEOPLE TO WATCH
- Andrej Karpathy — clearest firsthand articulation of what changed since December, plus a concrete “30 minutes, hands-off” agent-run build story and an orchestration north star (“Claws”).
- Simon Willison — consistently turns agent usage into repeatable patterns (e.g., “linear walkthroughs”), and also documents sharp edges like Claude Code Remote Control’s failure modes.
- Mitchell Hashimoto — high-signal model/tool preference note: Codex 5.3 displaced Opus 4.6 for him after direct comparison.
- Kent C. Dodds — pragmatic day-to-day agent usage: parallel agents, long-running refactors, and agents surfacing UX edge cases during walkthroughs.
- ThePrimeagen — counterweight: after ~3 months of vibe-coding, he says he hates the generated code and the “subtle offness,” and plans to “tradcode” (a useful reality check on taste/intent gaps).
🎬 WATCH & LISTEN
- No YouTube videos or podcast episodes were included in today’s source set, so there are no embeddable clips to share.
📊 PROJECTS & REPOS
Simon Willison — “Present” (SwiftUI macOS presentation app) repo + walkthrough
- Repo: https://github.com/simonw/present
- Walkthrough doc: https://github.com/simonw/present/blob/main/walkthrough.md
OpenClaw — releases + active PR example
- Releases: https://github.com/openclaw/openclaw/releases
- PR referenced in Greptile score-editing report: https://github.com/openclaw/openclaw/pull/13095
tldraw — tests moving closed-source (issue)
Editorial take: The bottleneck is shifting from “can the agent write code?” to “can you reliably steer, verify, and govern what it did?”