Coding Agents Alpha Tracker
by avergin · 89 sources
Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.
🔥 TOP SIGNAL
Architecture is beating model-chasing right now: Praetorian says token usage explains ~80% of performance variance in agent tasks—so context management + deterministic enforcement matter more than “smarter models”. The same theme shows up in real model bake-offs: in nanochat optimization, Opus 4.6 won largely because the 1M context window mattered, while Codex hit context limits and quality suffered.
🛠️ TOOLS & MODELS
Opus 4.6 vs Codex 5.3 in a real “AI engineer” task (nanochat GPT‑2 speedrun)
- Task difficulty baseline: nanochat speedrun is already heavily optimized; leaderboard #1 hits 57.5% MFU on 8×H100.
- Both agents acted like “real AI engineers” (read code, run mini-benchmarks, write plans, kick off full training overnight).
- Opus 4.6 (reported by @Yuchenj_UW):
  - torch compile “max-autotune-no-cudagraphs”: +1.3% speed
  - Muon optimizer ns_steps=3: +0.3%
  - BF16 softcap + skipping the .float() cast: -1GB VRAM
  - Total training time: 174.42m → 171.40m
- Codex‑5.3‑xhigh: “interesting ideas” and higher MFU, but hurt final quality; suspected context constraints (hit 0% context).
- Karpathy’s caution: micro-optimizations can hide big tradeoffs (compile-flag “engineering” adds +30min compile time; ns_steps=3 carries quality risk; removing .float() demands controlled validation-loss checks).
Opus 4.6 agentic autonomy in Cursor (single-prompt i18n)
- In Cursor, one dev reports Opus 4.6 took a high-level requirement and autonomously delivered full i18n (EN/FR/ES) + global location infrastructure: switched to Plan Mode, wrote architecture, installed packages, implemented logic, translated the site.
Codex 5.3 in day-to-day shipping (and its constraints)
- Theo reports building shoe.dev (auth/OAuth tooling) almost entirely with Codex 5.3, hosted on Railway; “almost every single line of code… was written by 5.3”.
- Theo’s frustration: benchmark claims are hard to trust/evaluate without API access; he’s upset when labs publish numbers he can’t verify via an API (calls out Mistral; says OpenAI is now doing this too).
Safe execution for agents: Monty (Pydantic)
- Monty is a Rust-based Python subset designed to run LLM-written code without host access, with startup time in single-digit microseconds.
- Repo: https://github.com/pydantic/monty
- Willison’s take: a constrained subset can work because agents iterate well against error messages (they rewrite code to fit the limitations); a sketch of that loop follows.
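The loop Willison describes is easy to picture in code. Below is a minimal sketch of it; the sandbox runner and model-rewrite step are passed in as functions and are assumptions, not Monty’s actual interface.

```typescript
// Minimal sketch of the "iterate against error messages" loop that makes a
// constrained runtime like Monty workable. The sandbox and fixer functions
// are hypothetical stand-ins, not Monty's real API.
type RunResult = { ok: true; output: string } | { ok: false; error: string };

async function runWithRetries(
  code: string,
  runInSandbox: (code: string) => Promise<RunResult>, // e.g. a Monty-backed runner
  askModelToFix: (code: string, error: string) => Promise<string>, // LLM rewrite step
  maxAttempts = 5,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await runInSandbox(code);
    if (result.ok) return result.output;
    // The key dynamic: agents rewrite code to fit the subset's limitations
    // when the error message says exactly what was rejected.
    code = await askModelToFix(code, result.error);
  }
  throw new Error(`sandboxed code did not converge after ${maxAttempts} attempts`);
}
```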
Claude Code: small but real QoL upgrades
- New behavior: when you /rewind (or hit ESC twice), Claude can summarize the rewound segment—useful for branching paths while preserving learnings.
- C# devs: a practitioner reports csharp-ls now has fixes “specifically for Claude Code usage”; enable with ENABLE_LSP_TOOL=1 and the plugin.
💡 WORKFLOWS & TRICKS
1) “Thin agent, fat platform” (determinism + token hygiene) — production pattern
Praetorian’s architecture shifts are a strong template for teams hitting agent chaos:
- Stateless, ephemeral agents (<150 LOC) so you can swap models per task without history contamination.
- Deterministic hooks over prompts: lifecycle scripts enforce gates the LLM can’t override (tests before exit, dirty-bit tracking, compaction gates).
- Coordinator vs executor permissions: planners can’t edit; coders can’t spawn sub-agents—enables cheap coordinator + expensive executor safely.
- MCP wrapper token win: raw MCP startup cost was 71,800 tokens (36% of context) across five servers; replaced with on-demand TypeScript wrappers → zero startup tokens.
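The wrapper pattern is simple to replicate: instead of registering MCP servers whose tool schemas are injected at session start, expose each tool as a script the agent only invokes (and pays tokens for) when needed. A minimal sketch, assuming a GitHub-search tool that previously lived behind a resident MCP server; the `gh` CLI call is illustrative, not Praetorian’s code.

```typescript
// Sketch of an on-demand wrapper that replaces an always-loaded MCP server.
// The tool and CLI call are assumptions for illustration. The point: no tool
// schema enters the context window until the agent actually runs this script.
import { execFileSync } from "node:child_process";

// The agent calls `npx tsx search-issues.ts <query>` instead of keeping a
// GitHub MCP server (and its schema) resident in context every session.
function searchIssues(query: string): string {
  return execFileSync("gh", ["search", "issues", query, "--json", "title,url"], {
    encoding: "utf8",
  });
}

const query = process.argv[2];
if (!query) {
  console.error("usage: tsx search-issues.ts <query>");
  process.exit(1);
}
console.log(searchIssues(query));
```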
2) Avoid “optimization slop”: require controlled experiments
If you let models chase +0.5–1% speed, force the discipline yourself:
- Watch for torch compile flag games: small gains can hide big compile-time costs.
- Any speed win that touches precision (e.g., skipping .float()) must come with validation-loss verification in a controlled run; a sketch of such a gate follows.
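One concrete way to enforce that discipline is a CI gate that rejects any patch whose validation loss regresses past noise. A minimal sketch, assuming each run writes a `metrics.json` with a `val_loss` field (the file layout and tolerance are assumptions, not from the nanochat repo):

```typescript
// Sketch of a validation-loss gate for precision-touching "optimizations".
// Assumes each run writes JSON like {"val_loss": 3.2871}; format and
// tolerance are assumptions.
import { readFileSync } from "node:fs";

function valLoss(path: string): number {
  return JSON.parse(readFileSync(path, "utf8")).val_loss;
}

const baseline = valLoss("runs/baseline/metrics.json");
const candidate = valLoss("runs/candidate/metrics.json");
const tolerance = 0.001; // accept only noise-level variation; tune per task

if (candidate > baseline + tolerance) {
  console.error(
    `REJECT: val_loss ${candidate.toFixed(4)} vs baseline ${baseline.toFixed(4)}`,
  );
  process.exit(1); // fail CI so a "+0.3% speed" patch can't merge silently
}
console.log("OK: speed win does not regress validation loss");
```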
3) Guardrails for local tooling: “cleanup” prompts can nuke your stack
A real footgun with Claude + MCP + local servers:
- One user triggered taskkill /F /IM node.exe via a “state testing / environment cleanup” prompt; it force-killed all Node processes, taking down multiple MCP servers and dev servers.
- Mitigation: explicitly forbid global kills/resets; scope cleanup to specific apps only (see the hook sketch below).
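One way to make that mitigation deterministic rather than prompt-based: a Claude Code PreToolUse hook on the Bash tool that rejects global kill commands. The sketch assumes the documented hook contract (the tool call arrives as JSON on stdin; exit code 2 blocks the call and surfaces stderr to the model); the deny list is illustrative, so extend it for your stack.

```typescript
// Sketch of a Claude Code PreToolUse hook (registered for the Bash tool in
// .claude/settings.json) that blocks process-wide kills.
import { readFileSync } from "node:fs";

const input = JSON.parse(readFileSync(0, "utf8")); // hook payload arrives on stdin
const command: string = input.tool_input?.command ?? "";

const globalKills = [
  /taskkill\s+\/F\s+\/IM/i, // Windows: force-kill every matching process
  /pkill\s+(-9\s+)?node/i,  // POSIX: kill all node processes at once
  /killall\s+/i,
];

if (globalKills.some((re) => re.test(command))) {
  // Exit code 2 blocks the tool call; stderr is shown to the model so it
  // can pick a narrower cleanup (one PID or app, not the whole runtime).
  console.error("Blocked: global process kills are forbidden; target a specific PID/app.");
  process.exit(2);
}
process.exit(0);
```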
4) Cost circuit breakers for long-running agents
- A user burned 15% of weekly tokens in ≤15 minutes on an unresolvable task (the typo they were chasing on the debug server was actually on the live server).
- Practical mitigations:
  - Put environment truth in CLAUDE.md (prod vs dev; never touch prod without explicit confirmation).
  - Keep default permissions (avoid dangerously-skip-permissions) so approval prompts act as a circuit breaker.
  - Break tasks into smaller scopes to catch misdirection earlier.
5) Close the loop with tests (autonomy multiplier)
- Kent C. Dodds: “Closing the loop on agents” (e.g., via tests) has a “huge impact” on autonomous efficiency and success.
- Related framing: if you’re shipping huge volumes of agent-written code, most of it “better be tests”.
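A minimal sketch of what “closing the loop” means mechanically: the agent’s turn only ends when the suite is green, and the raw failure output (not a human paraphrase) drives each retry. `fixWithAgent` is a hypothetical call into whichever agent you use.

```typescript
// Sketch of a test-gated agent loop: green tests end the turn, red tests
// feed their real output back to the agent. fixWithAgent is hypothetical.
import { spawnSync } from "node:child_process";

async function testGatedLoop(
  fixWithAgent: (failureOutput: string) => Promise<void>,
  maxIterations = 10,
): Promise<boolean> {
  for (let i = 0; i < maxIterations; i++) {
    const run = spawnSync("npm", ["test"], { encoding: "utf8" });
    if (run.status === 0) return true; // loop is closed: passing tests end the turn
    // Feed the actual failure output back instead of a human paraphrase.
    await fixWithAgent(run.stdout + run.stderr);
  }
  return false; // escalate to a human after too many red iterations
}
```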
6) “No copy/paste errors” ergonomics: pipe logs into agents
- Devtap bridges build/dev stdout+stderr to agents via MCP so the model calls get_build_errors() automatically—no manual paste loop.
- Adds an auto-loop stop hook that blocks completion while build errors remain (configurable retries).
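The stop-hook idea can also be reproduced with plain Claude Code hooks: a Stop hook that re-runs the build and blocks completion while it is red. A conceptual sketch (this mirrors devtap’s described behavior, not its code, and assumes exit code 2 blocks stopping per the hook contract):

```typescript
// Sketch of an auto-loop Stop hook: block the agent from finishing while
// the build is still red. Conceptual only; not devtap's implementation.
import { spawnSync } from "node:child_process";

const build = spawnSync("npm", ["run", "build"], { encoding: "utf8" });
if (build.status !== 0) {
  // Exit code 2 blocks stopping; stderr tells the model what still fails.
  console.error(`Build still failing:\n${(build.stdout + build.stderr).slice(0, 2000)}`);
  process.exit(2);
}
process.exit(0);
```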
7) Give agents interactive terminals (debuggers, SSH, TUIs)
- term-cli provides a “real terminal” for agents (lldb/gdb/pdb, SSH, editors).
- In a real ffmpeg/x264 segfault chase, an agent used lldb interactively to reproduce, backtrace, inspect frames/registers/disassembly, then produced verified patches for both repos.
👤 PEOPLE TO WATCH
- Andrej Karpathy: unusually concrete on where agents still fail (basic correctness, instruction-following, misreporting experiment results) while still being net-useful with oversight.
- Nathan Sportsman (Praetorian): repeatedly argues the bottleneck is architectural determinism + context management (not model IQ) and backs it with production patterns.
- Simon Willison: tracking “safe execution” as a first-class agent primitive via Monty, plus browser/WASM experiments to make sandboxes usable in new environments.
- Theo (t3.gg): strong “shipper” perspective on Codex vs Opus UX, refusal behavior, and why API availability matters more than leaderboard claims.
🎬 WATCH & LISTEN
Codex 5.3 built a real product end-to-end (Theo)
Theo says shoe.dev (auth/OAuth) is largely authored by Codex 5.3, and that 5.3 made the build “pleasant and thorough” for a project with lots of moving parts.
Multi-model orchestration loop + 30× token savings via local embeddings (Forward Future Live)
Tim Davis describes “Compound Loop”: have multiple models propose plans, review/merge plans, then implement+critique+merge—using local embeddings to avoid repeatedly uploading full repos and cutting token usage “probably 30×”.
Opus 4.6 vs ChatGPT 5.3 on a hard build (JSX transformer) (ThePrimeagen)
Primeagen runs a matched-prompt test building a Rust JSX→JS transformer for a Bun-rendered terminal UI: he reports ChatGPT produced a working JSX parser in 520 LOC of Rust while Opus “cheated” on JSX but got HMR working.
📊 PROJECTS & REPOS
- Deterministic multi-agent orchestration (Praetorian) — full paper: https://www.praetorian.com/blog/deterministic-ai-orchestration-a-platform-architecture-for-autonomous-development/
- Monty (Pydantic) — Rust sandboxed Python subset: https://github.com/pydantic/monty
- Devtap (MCP build output bridge) — https://github.com/killme2008/devtap
- term-cli (interactive terminal for agents) — https://github.com/EliasOenal/term-cli
- TimeCop (TUI diff/timeline scrubber for agent PRs) — https://github.com/kamilmac/timecop
- agent-security-scanner-mcp (real-time vuln + hallucinated package detection) — npm: https://www.npmjs.com/package/agent-security-scanner-mcp
- Agent Audit (MCP “god mode” / exposure linter) — https://github.com/HeadyZhang/agent-audit
- planning-with-teams (Claude Agent Teams coordination files + commands) — https://github.com/OthmanAdi/planning-with-teams
- KitTools (Claude Code plugin: structured docs + hooks for session memory) — https://github.com/WashingBearLabs/KitTools
- EzyCopy (clean web extraction to cut token bloat ~10k→~4k) — install script: https://raw.githubusercontent.com/gupsammy/EzyCopy/main/install.sh
- SETA (1,376 validated terminal environments for agent evals) — https://github.com/camel-ai/seta-env
Editorial take: Today’s highest leverage isn’t picking Opus vs Codex—it’s building deterministic loops + ruthless context hygiene so whichever model you use stays on the rails.
🔥 TOP SIGNAL
Claude Code Agent Teams (research preview) are the first widely discussed step from “parallel workers” to actual collaboration: teammates can message each other, share a task list with dependencies, and challenge findings—vs classic subagents’ hub-and-spoke reporting.
The immediate follow-on problem practitioners are flagging: once agents coordinate via shared state, you need traceability (who wrote what, who read what, when, and what changed) or debugging becomes guesswork.
🛠️ TOOLS & MODELS
Claude Code: Agent Teams (a.k.a. “swarms/teams”)
- What it is: multiple agents that coordinate autonomously, communicate peer-to-peer, and work in parallel on tasks that can be split up.
- Key distinction vs subagents: teams self-coordinate via a shared task list + dependencies and direct messaging; subagents mostly just report back to the lead.
- How to enable: CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 (docs: https://code.claude.com/docs/en/agent-teams).
- Token warning: teams are experimental and can “use a lot of tokens”.
Claude Opus 4.6 (model)
- Core claims used by practitioners: better planning, longer agentic task stamina, more reliable in massive codebases, catches its own mistakes; 1M context in beta.
- Large-output unlock: 128k output tokens (vs 64k), reducing multi-request workflows for large outputs.
- Reality check on 1M context: in Claude Code it’s still commonly reported as 200k context, while 1M is API-only behind higher usage tiers / premium pricing above 200k.
- Effort control in Claude Code: /model then arrow left/right to tune effort (less = faster, more = longer thinking/better results).
- Cursor availability: Opus 4.6 is now available in Cursor; called “highly effective at long-running tasks and reviewing code”.
Codex 5.3 (and “run both” patterns)
- Availability: GPT-5.3-Codex is available via paid ChatGPT plans in app/CLI/IDE extension/web; API access “soon”.
- Why people pair models: a Swift concurrency head-to-head found both Claude Opus 4.6 and GPT-5.3 Codex traced architecture correctly; differences showed up in depth vs speed and unique findings.
Tooling around agent reliability + containment
- agentrial: “pytest for agents” that runs tests multiple times and reports pass rates with confidence intervals, failing-step breakdowns, and API costs (see the sketch below).
- AgentVault: hardened Docker wrapper for Python agents with network allowlisting, sandboxed execution, and structured JSON audit logs.
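The statistical core of the agentrial idea fits in a few lines: run the task N times and report a pass rate with a Wilson 95% interval rather than a single anecdotal run. A sketch of the pattern (not agentrial’s actual API):

```typescript
// Multi-trial agent reliability measurement: pass rate + Wilson 95% CI.
// Mirrors the idea behind agentrial; names and layout are illustrative.
function wilsonInterval(passes: number, trials: number, z = 1.96) {
  const p = passes / trials;
  const denom = 1 + (z * z) / trials;
  const center = (p + (z * z) / (2 * trials)) / denom;
  const margin =
    (z * Math.sqrt((p * (1 - p)) / trials + (z * z) / (4 * trials * trials))) / denom;
  return { low: center - margin, high: center + margin };
}

async function measure(task: () => Promise<boolean>, trials = 20): Promise<void> {
  let passes = 0;
  for (let i = 0; i < trials; i++) {
    if (await task()) passes++; // each trial is a full agent run against a fixture
  }
  const { low, high } = wilsonInterval(passes, trials);
  console.log(
    `pass rate ${(passes / trials).toFixed(2)} ` +
      `(95% CI ${(low * 100).toFixed(0)}–${(high * 100).toFixed(0)}%)`,
  );
}
```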
💡 WORKFLOWS & TRICKS
1) Spec-first gating (fastest way to cut rework)
- “No code until there’s a short PRD” that answers what it does, I/O, edge cases, and what done looks like.
- Reported 1-month outcome: features rewritten from scratch dropped to 0 (was ~2/week), and 23 PRDs written (~8 minutes each).
- Add “review personas” (e.g., security) to interrogate the spec—caught issues the author would’ve shipped.
2) Parallelism without chaos: plan → delegate → validate
- Pattern from daily users: plan first, then hand off chunks to subagents; keep the main model for validation/testing/fixes.
- If you’re running parallel agents with shared memory, treat it like distributed systems: add agent IDs on writes + read logs + version history (roadmap explicitly called out; see the sketch below).
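What that distributed-systems hygiene can look like in practice: every write carries an agent ID and version, and reads are logged, so “who wrote/read what, and when” is answerable after the fact. The schema below is illustrative, not from any specific tool:

```typescript
// Sketch of traced shared memory for parallel agents: agent IDs on writes,
// a read log, and full version history per key. Schema is illustrative.
interface MemoryEntry {
  key: string;
  value: string;
  version: number;   // monotonically increasing per key
  writtenBy: string; // agent ID: who wrote what
  writtenAt: string; // ISO timestamp: when
}

interface ReadLogEntry { key: string; readBy: string; readAt: string; version: number }

class TracedMemory {
  private entries = new Map<string, MemoryEntry[]>(); // version history per key
  readonly readLog: ReadLogEntry[] = [];

  write(agentId: string, key: string, value: string): void {
    const history = this.entries.get(key) ?? [];
    history.push({
      key, value,
      version: history.length + 1,
      writtenBy: agentId,
      writtenAt: new Date().toISOString(),
    });
    this.entries.set(key, history);
  }

  read(agentId: string, key: string): string | undefined {
    const history = this.entries.get(key);
    if (!history) return undefined;
    const latest = history[history.length - 1];
    this.readLog.push({
      key, readBy: agentId, readAt: new Date().toISOString(), version: latest.version,
    });
    return latest.value; // readLog + history answer "who read/wrote what, when"
  }
}
```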
3) Multi-model “second opinion” via MCP (Claude ↔ Codex)
Drop Codex behind Claude Code as an MCP server and have them debate until they converge:
"mcpServers": {
"codex": {
"type": "stdio",
"command": "codex",
"args": [
"mcp-server",
"-c", "model=gpt-5.3-codex",
"-c", "reasoning_effort=high"
],
"env": {}
}
}4) Don’t lose the mental map: force “codebase mapping” before coding
Instead of “do X,” ask the agent to map connections + code flow first, then verify the mapping, then implement—explicitly framed as a way to avoid “entropy” from over-delegation.
5) Persistent context that survives chat resets
- Put durable state in the repo: a KNOWLEDGE/ folder (MD docs) + a ROADMAP file, then ask the agent to read those at the start of each session.
- Claude Code CLI tools: claude --continue / --resume, plus run /compact around 70–75% context usage.
- Plan persistence hack: write plan files into .claude/plans/ and keep CURRENT_PLAN.md pointing to the latest plan, so Claude can reload it when it forgets (loader sketch below).
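A session-start loader makes the durable-context pattern one command. A minimal sketch assuming the KNOWLEDGE/ and .claude/plans/CURRENT_PLAN.md layout described above:

```typescript
// Sketch of a session-start loader for durable context. The KNOWLEDGE/ and
// .claude/plans/ layout follows the workflow above; file names are assumptions.
import { existsSync, readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

function loadDurableContext(repoRoot: string): string {
  const parts: string[] = [];
  const knowledgeDir = join(repoRoot, "KNOWLEDGE");
  if (existsSync(knowledgeDir)) {
    for (const file of readdirSync(knowledgeDir).filter((f) => f.endsWith(".md"))) {
      parts.push(readFileSync(join(knowledgeDir, file), "utf8"));
    }
  }
  // CURRENT_PLAN.md points at the latest plan so a fresh session (or a model
  // that "forgot") can reload exactly where the last one stopped.
  const planPointer = join(repoRoot, ".claude", "plans", "CURRENT_PLAN.md");
  if (existsSync(planPointer)) parts.push(readFileSync(planPointer, "utf8"));
  return parts.join("\n\n---\n\n"); // paste this at the top of a new session
}
```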
6) “Extend, don’t rewrite” to reduce AI regression risk
Use Open-Closed Principle tactics so the model adds code (plugins/strategies/DI) instead of regenerating core modules. This keeps the blast radius down—but commenters emphasize you still need decision traces when extensions interact in production.
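A tiny example of what that looks like in code: the core stays closed to modification, and agent-written extensions register themselves in new files. Names are illustrative:

```typescript
// Sketch of the Open-Closed tactic for agents: the core pipeline is closed
// to edits; new behavior arrives as a registered strategy in a new file.
export interface PricingStrategy {
  name: string;
  applies(order: { country: string }): boolean;
  price(base: number): number;
}

const strategies: PricingStrategy[] = [];
export function registerStrategy(s: PricingStrategy): void {
  strategies.push(s); // new behavior = a new registration, not a core rewrite
}

// Core module: the agent is told never to regenerate this function.
export function priceOrder(order: { country: string }, base: number): number {
  const strategy = strategies.find((s) => s.applies(order));
  return strategy ? strategy.price(base) : base;
}

// An agent-added extension lives in its own file and only calls register:
registerStrategy({
  name: "eu-vat",
  applies: (o) => o.country === "DE",
  price: (base) => base * 1.19,
});
```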
👤 PEOPLE TO WATCH
- Mitchell Hashimoto: “Do the work twice” to learn agents—manual first, then force the agent to reproduce the same result; plus “end-of-day agents” (kick off work for the last 30 minutes).
- Karel D’Oosterlinck: Codex workflow for unfamiliar codebases—due diligence across Slack threads/branches + linked notes, then wire experiments and make hyperparameter calls using those notes.
- ThePrimeagen: agentic debugging loop—run integration tests, add logging iteratively, and surface the next issue with logs for human decisions; found a bug in 2 minutes that would’ve taken ~30 minutes manually.
- Simon Willison: useful skepticism—both Opus 4.6 and Codex 5.3 feel “really good,” but he’s still hunting for tasks they ace that predecessors couldn’t.
- swyx: running model evals in Windsurf Arena Mode and reporting Opus 4.6 “beats pretty consistently” with >60% winrate in his setup.
🎬 WATCH & LISTEN
Claude Code “skills” as reusable context (Theo)
Hook: a concrete teardown of why a single Markdown “front-end design skill” can radically change UI quality—steering away from template-y output and “AI slop” aesthetics.
Agent Teams vs subagents (Matthew Berman)
Hook: clear explanation of when teams help (parallel research/review, debugging competing hypotheses, cross-layer coordination) and why they cost more tokens than subagents.
Latent Space (Goodfire): steering models while doing code tasks
Hook: Mark Bissell demos real-time steering on a ~1T parameter Kimi model during a codebase debugging prompt—showing steering can alter “demeanor” without breaking tool usage.
📊 PROJECTS & REPOS
- Nemp-memory: shared local memory store (.nemp/memories.json) enabling parallel Claude Code sub-agents to recall only what they need via /nemp:context—reported “zero context repetition” within a run. Repo: https://github.com/SukinShetty/Nemp-memory
- clnode: Claude Code hooks + DuckDB “shared memory layer” to avoid flooding the leader; author reports running a 9-agent dev team with strict file scopes + handoffs via hooks. Repo: https://github.com/SierraDevsec/clnode
- MIE: persistent, shared memory as a knowledge graph via MCP—so Claude Code/Cursor/etc. can query/write the same context (facts, decisions, entities, events). Repo: http://github.com/kraklabs/mie
- AgentVault: hardened container for Python agents (network allowlisting, timeouts, audit logs) built after a reported breach scenario. Repo: https://github.com/Ben-aoun-1/AgentVault
- agentrial: reliability measurement harness for agents (multi-trial tests + confidence intervals + step-level failure attribution). Repo: https://github.com/alepot55/agentrial
- Termoil: Rust TUI for multi-agent terminal orchestration—9-pane grid + “needs attention” detection to prevent silent hangs on prompts like [Y/n]. Repo: https://github.com/fantom845/termoil
- Relai: Chrome extension to shuttle full conversation context between Claude/ChatGPT/Gemini/Perplexity; stores locally in IndexedDB, opens a new tab and pastes via content script (no external server). Repo: https://github.com/kirillpolevoy/relai
One-line take: Parallel agents are getting real—but the next competitive edge is debuggability (traces) + durable context, not just more “swarm” knobs.