# Harness Tuning, Hybrid Routing, and Safer Sandboxes Move Coding Agents Forward

*By Coding Agents Alpha Tracker • April 14, 2026*

Harness quality emerged as today’s real edge: Theo unpacked the agent loop, Cursor confirmed live harness A/B testing, and Cloudflare shipped new primitives for safer, stateful agent sandboxes. Also inside: Cursor 3.1 upgrades, a practical local-vs-cloud routing playbook, and reproducible repo experiments from Simon Willison.

## 🔥 TOP SIGNAL

Today’s clearest signal: **harness engineering is becoming a first-class performance lever**, not a footnote to model choice. Theo’s breakdown defines the harness as the tool/runtime loop around the model, cites an independent benchmark where Opus went from **77% in Claude Code to 93% in Cursor**, and Cursor CEO Michael Truell separately says Cursor A/B tests the harness itself on live traffic [^1][^2].

Practical takeaway: stop evaluating models in isolation—the **tool descriptions, permissions, context bootstrap, and retry loop** are part of the product, and Theo shows even small description changes can materially alter tool behavior [^1].

## 🛠️ TOOLS & MODELS

- **Cursor 3.1:** split agents for multitasking, pick the branch for a cloud agent, better voice input with `Ctrl-M` hold-to-talk, jump from a diff to the exact file line, workspace search include/exclude filters, and an **87% reduction in dropped frames** for large file edits. Full changelog: [http://cursor.com/changelog/3-1](http://cursor.com/changelog/3-1) [^3][^4][^5][^6][^7][^8]
- **Cursor’s team is tuning more than the model:** Truell says Cursor A/B tests **model checkpoints, UX, and the agent harness**, including sending **<1% of traffic** to compare how Claude behaves under the Claude Code harness versus Cursor’s default harness [^2]
- **Cloudflare Durable Object Facets:** sandboxed Dynamic Workers can now access **SQLite** through standard Durable Object implementations with fast synchronous reads/writes; a supervisor Durable Object can create attached databases and pass specific ones into workers. Kent C. Dodds says he is integrating this into Kody immediately and expects a significant capability boost. Blog: [https://blog.cloudflare.com/durable-object-facets-dynamic-workers/](https://blog.cloudflare.com/durable-object-facets-dynamic-workers/) [^9][^10][^11][^9]
- **Cloudflare outbound Workers for Sandboxes:** credential injection, egress logging, and zero-trust policies at the **network layer** for agent sandboxes. Dodds notes Kody previously had to solve the same basic secret-injection problem earlier at the **template layer** because this feature did not exist yet. Announcement: [https://cfl.re/4tfSt1G](https://cfl.re/4tfSt1G) [^12][^13][^14][^12]
- **Practical model routing from OpenClaw:** Berman keeps **Opus 4.6 / GPT 5.4** for coding, planning, and orchestration, then offloads embeddings, transcription, voice, PDF extraction, classification, and some chat to local models like **Qwen 3.5**, **Nemotron**, and **Gemma** via LM Studio. His hardware heuristic: ~**30B** models are the sweet spot for many consumer GPUs [^15]

## 💡 WORKFLOWS & TRICKS

- **DIY harness in a weekend:** Theo’s minimal version is small enough to build yourself. Core loop: define a few tools like `read_file`, `list_files`, and `edit_file` (or just `bash`), list them in the system prompt, let the model emit `tool: name {json}`, execute the tool, append the output to history, repeat [^1]
- **Tune tool descriptions per model, not once:** Theo demos that changing only a tool description can change which tool the model reaches for. His broader point: models only see the descriptions/context you give them, and different models react differently to the same wording [^1]
- **Keep upfront context short; let tools do the exploration:** use `.claude.md` or `.agent.md` for the highest-value bootstrap context, then let the model search/read its way to the rest. Theo’s case against repo stuffing is blunt: large contexts make models worse, tool-based exploration beat Repomix-style packing, and staying in one thread preserves useful history [^1]
- **Three-stage local-model rollout:** Berman’s pattern is clean: **(1)** experiment with frontier models only, **(2)** productionize and identify sub-tasks already working on weaker models, **(3)** move repeated, lower-complexity work local after edge-case testing. His examples: notification classification, company-news relevance, CRM context extraction, and knowledge-base summarization [^15]
- **Concrete way to wire a local model into an agent stack:** run LM Studio on the target GPU machine, load a model like **Qwen 3.5 35B**, ask Cursor to SSH in and add it to OpenClaw’s routing config, then smoke-test it in Telegram with `/status` and a quick prompt. Berman reports about **65 tok/sec** on DGX Spark and faster simple chat round trips than Sonnet in his setup [^15]
- **Rule-first prompting is emerging as a sane default:** ThePrimeagen says he is codifying his own programming rules, applying them through several stages, and keeping the scope to small changes while staying accountable for every line instead of letting agents dump code over the wall [^16]

## 👤 PEOPLE TO WATCH

- **Theo** — Best demystifier today. He turns harnesses from buzzword into a concrete loop, then shows why tool descriptions, prompts, and context loading materially change outcomes [^17][^1]
- **Michael Truell** — Rare firsthand confirmation that Cursor is testing the harness itself on real traffic, not just swapping models behind the scenes [^2]
- **Addy Osmani** — Strong firsthand signal from inside Google: **40K+ SWEs** use agentic coding weekly, with internal custom CLIs, MCPs, orchestrators, agent loops, and virtual SWE teams in daily use [^18]
- **Matthew Berman** — Shared the clearest frontier-to-local routing playbook of the day: use the best cloud models for code and planning, then offload repeatable sub-tasks locally once you’ve validated the workflow [^15]
- **Simon Willison** — Still the best source for bounded, reproducible agent experiments: this time he had Claude Code explore the new `servo` crate, build a working screenshot CLI, and publish both the repo and the task PR [^19]

## 🎬 WATCH & LISTEN

- **Theo — 15:30-19:17:** Best short explainer on why stuffing an entire repo into context is the wrong instinct. He walks through why tool-driven context building beats Repomix-style packing, and why bigger context can make models worse [^1]

[![How does Claude Code *actually* work?](https://img.youtube.com/vi/I82j7AzMU80/hqdefault.jpg)](https://youtube.com/watch?v=I82j7AzMU80&t=930)
*How does Claude Code *actually* work? (15:30)*


- **Theo — 20:37-23:05:** The minimal harness primer. Three tools, a system prompt, and a loop. Watch this before you over-engineer your own agent runtime [^1]

[![How does Claude Code *actually* work?](https://img.youtube.com/vi/I82j7AzMU80/hqdefault.jpg)](https://youtube.com/watch?v=I82j7AzMU80&t=1237)
*How does Claude Code *actually* work? (20:37)*


- **Latent Space — 42:54-46:30:** Sharp management clip on the new failure mode: engineers juggling many agents all day get fatigued, then still have to review critical PRs. The takeaway is simple—AI increases the need for serious human review, not less [^20]

[![⚡️ The best engineers don't write the most code. They delete the most code. — Stay Sassy](https://img.youtube.com/vi/5KnCKadxSPY/hqdefault.jpg)](https://youtube.com/watch?v=5KnCKadxSPY&t=2574)
*⚡️ The best engineers don't write the most code. They delete the most code. — Stay Sassy (42:54)*


## 📊 PROJECTS & REPOS

- **`servo-shot` / Simon’s `servo` exploration repo:** Claude Code explored `servo` v0.1.0, identified the API surface, and built a headless CLI that renders URLs or HTML to PNG. The replication steps, repo, and task PR are all public: [https://github.com/simonw/research/tree/main/servo-crate-exploration#readme](https://github.com/simonw/research/tree/main/servo-crate-exploration#readme) · [https://github.com/simonw/research/pull/108](https://github.com/simonw/research/pull/108) [^19]
- **`html5ever` / `markup5ever_rcdom` WASM playground:** compiling Servo itself to WebAssembly was not feasible, but Claude still produced a narrower useful artifact: a browser playground for turning HTML fragments into a parse tree [^19]
- **`pi-tutorial`:** clever repo pattern from Armin Ronacher—package onboarding as an interactive agent experience instead of docs. Repo: [http://github.com/earendil-works/pi-tutorial](http://github.com/earendil-works/pi-tutorial) [^21]

*Editorial take: the real edge right now is not one magic model—it’s better harnesses, tighter context, and safer orchestration around the model* [^1][^2][^1][^9]

---

### Sources

[^1]: [How does Claude Code *actually* work?](https://www.youtube.com/watch?v=I82j7AzMU80)
[^2]: [𝕏 post by @mntruell](https://x.com/mntruell/status/2043574966168555717)
[^3]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2043798784367546707)
[^4]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2043798789493055574)
[^5]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2043798785881702441)
[^6]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2043798787605573891)
[^7]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2043798791116206453)
[^8]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2043798792672293357)
[^9]: [𝕏 post by @KentonVarda](https://x.com/KentonVarda/status/2043684025454170438)
[^10]: [𝕏 post by @kentcdodds](https://x.com/kentcdodds/status/2043696172703879621)
[^11]: [𝕏 post by @kentcdodds](https://x.com/kentcdodds/status/2043744805344424256)
[^12]: [𝕏 post by @Cloudflare](https://x.com/Cloudflare/status/2043692614445133933)
[^13]: [𝕏 post by @kentcdodds](https://x.com/kentcdodds/status/2043808783290569007)
[^14]: [𝕏 post by @kentcdodds](https://x.com/kentcdodds/status/2043904902100070821)
[^15]: ["But OpenClaw is expensive..."](https://www.youtube.com/watch?v=nt7dWOEFUB4)
[^16]: [𝕏 post by @ThePrimeagen](https://x.com/ThePrimeagen/status/2043861800819761382)
[^17]: [𝕏 post by @theo](https://x.com/theo/status/2043611205856837680)
[^18]: [𝕏 post by @addyosmani](https://x.com/addyosmani/status/2043812343508021460)
[^19]: [Exploring the new `servo` crate](https://simonwillison.net/2026/Apr/13/servo-crate-exploration)
[^20]: [⚡️ The best engineers don't write the most code. They delete the most code. — Stay Sassy](https://www.youtube.com/watch?v=5KnCKadxSPY)
[^21]: [𝕏 post by @mitsuhiko](https://x.com/mitsuhiko/status/2043820824168116345)