ZeroNoise
Coding agents hit a post-December step-change; Codex 5.3 momentum vs Opus 4.6; remote-control + orchestration patterns
Feb 26
5 min read
141 docs
Agents keep moving from “toy” to “teammate”: Karpathy reports a sharp post-December step-change and shares a hands-off, 30-minute end-to-end build example. Also: Codex 5.3 displacing Opus 4.6 for some power users, Claude Code Remote Control’s early reliability issues, and concrete workflow patterns for orchestration, review, and repo hygiene.

🔥 TOP SIGNAL

Coding agents crossed a “works in practice” threshold since December, driven (per Andrej Karpathy) by improved model quality, long-term coherence, and tenacity—enough to be disruptive to the default programming workflow. His concrete example: he handed an agent a single English brief to set up vLLM + Qwen3-VL, build a video inference endpoint + web UI, debug issues, install systemd services, and return a markdown report—hands-off in ~30 minutes.
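The systemd step in that run is representative of the artifacts such a build leaves behind. As a rough illustration only (the unit name, model id, and flags below are assumptions, not details from Karpathy's report), a vLLM endpoint service might be installed like this:

```shell
# Illustrative sketch: unit name, model id, and flags are assumptions,
# not details from Karpathy's actual run.
cat > vllm-qwen.service <<'EOF'
[Unit]
Description=vLLM video-inference endpoint (Qwen3-VL)
After=network-online.target

[Service]
ExecStart=/usr/bin/env vllm serve Qwen/Qwen3-VL --host 0.0.0.0 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
# Install and start (requires root):
#   sudo mv vllm-qwen.service /etc/systemd/system/
#   sudo systemctl daemon-reload && sudo systemctl enable --now vllm-qwen
```

The point is less the specific flags than that an agent can now emit and wire up this whole layer unattended.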

🛠️ TOOLS & MODELS

  • GPT-5.3-Codex / Codex 5.3 vs Opus 4.6 (practitioner preference)

    • Mitchell Hashimoto says Codex 5.3 is “much more effective” than Opus 4.6, and that after going back and forth he hasn’t touched Opus for a week—“first model to get me off of Opus… ever”.
    • OpenAI’s Romain Huet says the team is “continuing to iterate and improve Codex every week”.
    • Tool reliability signal: Brian Lovin hit Claude Code 500s, tried Codex, and reported “Codex is good!”.
  • Reasoning settings (Codex)

    • Sherwin Wu: they “basically only run [GPT-5.3-Codex] on xhigh nowadays for all coding tasks,” and he notes that speed improvements mean it doesn’t feel slow even at xhigh.
    • Greg Brockman’s advice: “always run with xhigh reasoning.”
  • Claude Code — Remote Control (new capability, rough edges in testing)

    • Feature: run claude remote-control locally, then send prompts to that session from the web, iOS, or desktop apps; it supports one session per machine and requires per-action approval.
    • Simon Willison reports it’s “a little bit janky,” including repeated API 500 errors and confusing failure behavior after restarting the program.
  • Devin 2.2 (Cognition)

    • Cognition markets Devin 2.2 as an autonomous agent that can test with computer use, self-verify, and auto-fix; it also claims 3× faster startup, a redesigned UI, and “computer use + virtual desktop”.
  • OpenClaw — new beta

    • Peter Steinberger: beta includes security improvements, various fixes, DM “heartbeat” made configurable after feedback, better Slack threads, improved subagents, and a more reliable Telegram webhook.
    • Releases: https://github.com/openclaw/openclaw/releases.
  • Sourcegraph 7.0 (positioning shift)

💡 WORKFLOWS & TRICKS

  • “English → parallel agents → you review” (Karpathy’s decomposition rule)

    • Karpathy’s pattern: agents aren’t perfect—they need high-level direction, judgment, taste, oversight, iteration, hints, and they work best when tasks are well-specified and verifiable/testable.
    • His operational heuristic: build intuition for task decomposition—hand off the parts that work well to agents, then “help out around the edges”.
    • Scaling idea: build long-running orchestrators (“Claws”) with tools/memory/instructions managing multiple parallel “Code” instances.
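That orchestration shape (fan tasks out, wait, then review) is easy to sketch in shell. Here run_agent is a hypothetical stub standing in for a real Code instance; no actual agent CLI is invoked, and the task names are placeholders:

```shell
# Sketch of an orchestrator fanning tasks out to parallel agent sessions.
# run_agent is a stub; in practice it would invoke a real coding-agent CLI.
run_agent() {
  echo "agent picked up: $1"   # placeholder for the agent's actual work
}

tasks="refactor-auth add-tests update-docs"
for t in $tasks; do
  run_agent "$t" > "agent-$t.log" 2>&1 &   # one background session per task
done
wait                                        # orchestrator blocks until all finish
cat agent-*.log                             # review step: inspect every result
```

The review step at the end is the part Karpathy stresses stays human: the orchestrator collects, you judge.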
  • Cursor cloud agent: “clone it from a video” as a starting point, then iterate for fidelity

    • @swyx dropped a tweet + video into Cursor cloud expecting it not to work; instead, he says, Cursor Agent oneshotted a functional clone of Rachel Chen’s site from the video alone in 43 minutes, including a working “RachelLLM” sidebar.
    • His follow-up prompt for fidelity is a reusable template:
      • step through the video,
      • discover assets (headless run / curl / network snooping),
      • build a checklist + sitemap,
      • spin up subagents/swarm for parallel work,
      • don’t stop until behavior/visuals match closely; trade off fidelity vs. simplicity when ambiguous.
    • He reports a second improved output after another 43 minutes.
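The asset-discovery step in that template can start as plain curl-and-grep. A minimal sketch, using a stand-in page instead of a live fetch (a real run would curl the target site first):

```shell
# Stand-in page; a real run would fetch it, e.g.: curl -s "$SITE_URL" > page.html
cat > page.html <<'EOF'
<html><body>
  <img src="/img/hero.png">
  <script src="/js/app.js"></script>
  <link href="/css/site.css" rel="stylesheet">
</body></html>
EOF

# Pull every src/href reference into a deduplicated checklist of assets to clone.
grep -oE '(src|href)="[^"]+"' page.html | cut -d'"' -f2 | sort -u > assets.txt
cat assets.txt
```

Network snooping from a headless browser run catches assets loaded at runtime that static grep misses, which is why the template lists both.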
  • Run many agents in parallel (Cursor) + let the agent do exploratory UX testing

    • Kent C. Dodds: he can run “as many of these [Cursor agents]” as he wants; instead of filing issues for ideas, he fires off prompts and gets back what the agent built (with screenshots).
    • He also reports that, during manual testing, the agent “noticed one UX edge case during walkthrough”.
  • Long-running agent refactors overnight (Cursor) + “computer use” for steering

    • Kent kicked off a long-running Cursor agent overnight and iterated in the morning using “computer use”.
    • He reports it dropped ~15k lines in a refactor.
  • Code review aid: ask for a linear walkthrough of the codebase (Simon Willison)

    • Willison’s prompt pattern: ask agents for “a linear walkthrough of the code that explains how it all works in detail” to understand vibe-coded output.
  • Git hygiene for agentic work: small commits, then squash (Huntley)

    • Geoffrey Huntley suggests an agent-friendly workflow: make incremental small commits, then squash them into a single commit so that a “study git log” pass over a unit of work takes just one tool call.
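A minimal, self-contained sketch of that commit-then-squash loop, run in a throwaway repo (branch name and commit messages are illustrative; --allow-empty stands in for real edits):

```shell
# Throwaway repo so the demo is self-contained; in real work, use your branch.
git init -q -b main demo && cd demo
git config user.email agent@example.com
git config user.name  demo
git commit -q --allow-empty -m "base"
git checkout -q -b agent/fix-parser

# 1) small incremental commits while the agent iterates
git commit -q --allow-empty -m "wip: failing regression test"
git commit -q --allow-empty -m "wip: fix off-by-one"
git commit -q --allow-empty -m "wip: cleanup"

# 2) squash the whole unit of work into one commit, so a later
#    "study git log" is a single tool call
git reset -q --soft "$(git merge-base main HEAD)"
git commit -q --allow-empty -m "fix parser off-by-one"
```

`git reset --soft` keeps the combined changes staged while rewinding history to the branch point, which is what makes the final commit a clean squash.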
  • Production caution: don’t trust “ranked” PR scores if they’re editable

  • OSS maintainer playbook shift: tests as “reimplementation fuel”

    • Simon Willison notes that a comprehensive test suite can be enough to rebuild a library from scratch, and highlights tldraw moving tests to a private repo as a response pattern.

👤 PEOPLE TO WATCH

  • Andrej Karpathy — clearest firsthand articulation of what changed since December, plus a concrete “30 minutes, hands-off” agent-run build story and an orchestration north star (“Claws”).
  • Simon Willison — consistently turns agent usage into repeatable patterns (e.g., “linear walkthroughs”), and also documents sharp edges like Claude Code Remote Control’s failure modes.
  • Mitchell Hashimoto — high-signal model/tool preference note: Codex 5.3 displaced Opus 4.6 for him after direct comparison.
  • Kent C. Dodds — pragmatic day-to-day agent usage: parallel agents, long-running refactors, and agents surfacing UX edge cases during walkthroughs.
  • ThePrimeagen — counterweight: after ~3 months of vibe-coding, he says he hates the generated code and the “subtle offness,” and plans to “tradcode” (useful reality check on taste/intent gaps).

🎬 WATCH & LISTEN

  • No YouTube videos or podcast episodes were included in today’s source set, so there are no embeddable clips to share.

Editorial take: The bottleneck is shifting from “can the agent write code?” to “can you reliably steer, verify, and govern what it did?”