ZeroNoise
Tests + harnesses become the interface for deep agents (plus narrow-agent teams and one-prompt infra)
Mar 2
6 min read
97 docs
Today’s theme: tests and harnesses are becoming the real control plane for deep agents—enabling rapid rewrites, repeatable behavior, and realistic evals. Plus: a one-prompt Windows VM inside a Cursor cloud agent, narrow-agent team patterns, and practical remote-control alternatives to ecosystem lock-in.

🔥 TOP SIGNAL

Test suites + harnesses are becoming the real “interface” for deep agents: Cloudflare used an AI agent (Paramigen) plus Next.js tests to recreate a Next.js-compatible framework in a week, while LangChain says deep agents require multi-layer evals (single-step → full-turn → multi-turn) in clean, reproducible environments. Geoffrey Huntley’s framing clicks: the model is the “reasoning engine,” but the agent harness is the control plane that makes behavior safe and repeatable in production.

🛠️ TOOLS & MODELS

  • “Too-hard-for-LLM” coding tasks are drying up (gpt-5.3-codex vs opus 4.6): Theo is offering $500/problem for locally verifiable repos that current top models can’t solve, but says almost every verifiable problem sent so far gets solved by 5.3 Codex on the first try. If you’re sharing benchmarks, he wants a git clone-from-scratch command sequence to reproduce the exact state.

  • Codex vs Claude (practitioner comparison): Peter Steinberger’s take: Codex “will read much more of your code and usually find a better solution,” while Claude is “more pleasant” but may claim it’s “100% production ready” and then bug out.

  • Augment’s codebase indexing → better retrieval inside Codex: Theo demos using Augment’s CLI to index a codebase, then switching to Codex where the model immediately uses Augment’s retrieval tool; he claims it finds “exactly what you need and almost nothing else” and returns results in <20 seconds vs 5–10 minutes previously.

  • Remote control, minus lock-in: Claude Code’s new “Remote Control” is described as Max/Pro-only and Claude-Code-only in Jason Zhou’s thread; his alternative is Tailscale + SSH for a private network workflow: “Phone terminal → SSH → dev machine,” with “no public IP / no port forwarding / no exposed services.”

  • Cursor cloud agent: desktop Windows VM from a single prompt: Jediah Katz says he got a Cursor cloud agent to run a Windows VM with full desktop display support “with just one prompt”; the run took ~1.5 hours on a long-running harness, and he can then snapshot the result as a reusable Windows base.

💡 WORKFLOWS & TRICKS

  • Deep-agent evaluation stack (LangChain’s production learnings)

    • Write bespoke success criteria per datapoint (not one generic rubric).
    • Run single-step evals to catch regressions at specific decision points.
    • Add full-turn evals to validate end-to-end behavior.
    • Add multi-turn evals that simulate realistic user interactions.
    • Keep clean, reproducible test environments so runs are comparable.
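    The layered structure above can be sketched in a few lines of Python. This is a minimal illustration, not LangChain’s actual API: `Datapoint`, the eval functions, and their signatures are all hypothetical names chosen for clarity.

    ```python
    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Datapoint:
        """One eval case carrying its own bespoke success criterion."""
        inputs: dict
        check: Callable[[Any], bool]  # per-datapoint criterion, not a generic rubric

    def single_step_eval(decide, datapoints):
        """Layer 1: call a single decision point in isolation to catch regressions there."""
        return [dp.check(decide(dp.inputs)) for dp in datapoints]

    def full_turn_eval(run_turn, datapoints):
        """Layer 2: run one complete request -> response turn end to end."""
        return [dp.check(run_turn(dp.inputs)) for dp in datapoints]

    def multi_turn_eval(run_turn, simulated_user, n_turns, check):
        """Layer 3: simulate a realistic back-and-forth, then judge the transcript."""
        transcript = []
        msg = simulated_user(transcript)
        for _ in range(n_turns):
            reply = run_turn({"message": msg, "history": list(transcript)})
            transcript.append((msg, reply))
            msg = simulated_user(transcript)
        return check(transcript)
    ```

    Each layer reuses the same datapoints-with-criteria shape, so a regression flagged at the single-step layer can be traced without rerunning the expensive multi-turn simulations.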
  • Treat your harness as the product (Huntley’s definition you can operationalize)

    • The “agent harness” is the orchestration layer that manages prompts, tool execution, policy checks, guardrails, and loop control (continue/stop).
    • Handy reference: https://latentpatterns.com/glossary/agent-harness.
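    A harness in that sense can be sketched as a small loop. This is a hypothetical minimal sketch (the `finish`/`tool` action shapes are invented for illustration), but it shows the key property: the harness, not the model, decides what executes and when the loop stops.

    ```python
    def run_harness(model, tools, guardrails, max_steps=8):
        """Minimal agent-harness loop: prompts in, policy-checked tool calls out."""
        messages = [{"role": "system", "content": "You are a coding agent."}]
        for _ in range(max_steps):
            action = model(messages)            # reasoning engine proposes the next step
            if action["type"] == "finish":      # loop control: model asks to stop
                return action["result"]
            for rule in guardrails:             # policy checks before any side effect
                if not rule(action):
                    messages.append({"role": "system",
                                     "content": f"blocked: {action['tool']}"})
                    break
            else:
                output = tools[action["tool"]](**action["args"])  # tool execution
                messages.append({"role": "tool", "content": str(output)})
        return None  # step budget exhausted: harness forces a stop
    ```

    Swapping the model leaves the guardrails and stop conditions intact, which is exactly why the harness, rather than the model, is the production control plane.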
  • Narrow-agent teams beat “mega agents” (Riley Brown’s OpenClaw pattern)

    • He reports that as he added more skills, agent dependability dropped: skill use timing got worse, context got “clouded,” and integrations/personalities got “jumbled.”
    • His proposed sweet spot: teams of narrow agents with ~7–10 skills each (vs ~30+).
    • Concrete coordination pattern: a journal agent in Telegram pings him ~every 30 minutes (sometimes skipping if nothing’s needed), logs useful context into Notion, and other agents read that shared journal (e.g., a newsletter agent drafting for a 300,000-person email list with its own conversion goals).
    • Autonomy lever: narrow agents can run predictable cron-job loops (“three tasks every day”) because they’re optimizing for a small set of goals.
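    The shared-journal pattern can be sketched with an in-memory stand-in for the Notion journal. All names here are hypothetical illustrations (the real setup uses Telegram + Notion, not this code): the point is that each narrow agent’s tick is a small, schedulable function reading and writing one shared log.

    ```python
    import datetime

    class SharedJournal:
        """Stand-in for the Notion journal: one agent appends, others read."""
        def __init__(self):
            self.entries = []

        def log(self, author, text):
            self.entries.append({"author": author,
                                 "at": datetime.datetime.now(),
                                 "text": text})

        def read(self, author=None):
            return [e for e in self.entries
                    if author is None or e["author"] == author]

    def journal_agent_tick(journal, gather_context):
        """Runs on a ~30-minute schedule; skips the ping if nothing's needed."""
        note = gather_context()
        if note:
            journal.log("journal", note)

    def newsletter_agent_tick(journal, draft):
        """A second narrow agent reads the shared journal and drafts from it."""
        context = [e["text"] for e in journal.read(author="journal")]
        return draft(context)
    ```

    Because each tick optimizes for one small goal, the functions are trivially cron-able, which is the autonomy lever described above.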
  • Agentic engineering tactics you can copy (Peter Steinberger)

    • Treat it like a discussion, not a one-liner; guide it explicitly (“look here, look there”) and assume it starts with zero project context.
    • Ask it to propose refactors/pain points when a change touches many parts of a codebase.
    • After shipping a feature, ask: “Now that you built it, what would you do differently?” to surface what it learned during implementation.
    • Speed tricks: provide images as context when that’s faster than writing; use voice input for throughput.
  • Codebase hygiene under agent acceleration (Theo’s rules of thumb)

    • “Tolerate nothing”: if a bad pattern makes it in, it multiplies—so delete it aggressively.
    • Spend more time in “plan mode”: go back and forth until you have a markdown plan you can review, then tell the model to build.
    • When an agent goes wrong, interrogate the path: ask what it’s doing and why, then eliminate bad examples if they’re coming from your own codebase/docs.
  • One-prompt infra bootstrap: Windows VM inside an agent box (Cursor cloud agent)

    1. Give a prompt that explicitly asks for a full Windows VM with desktop display support (not just CLI).
    2. Let it run under a long-running harness (reported ~1.5 hours).
    3. Snapshot the resulting VM as a reusable “Windows base.”

👤 PEOPLE TO WATCH

  • Jediah Katz — practical proof of long-horizon agent setup work: a full desktop Windows VM inside a cloud agent, plus snapshotting for reuse.
  • Theo (t3.gg) — running public, verifiable “too hard for LLM” challenges and documenting what actually breaks modern coding models (increasingly little).
  • Peter Steinberger (OpenClaw) — high-signal “agentic engineering” habits + grounded tool comparison based on daily use.
  • Riley Brown (vibecodeapp) — concrete multi-agent “team” design: narrow agents, shared memory via Notion, cron-based loops.
  • LangChain team — pragmatic eval guidance from building/testing 4 production agents.

🎬 WATCH & LISTEN

1) Riley Brown — why “too many skills” makes agents worse (≈02:33–05:42)

He explains the failure mode (context clouding + jumbled integrations) and the practical alternative: 7–10 skills per agent, then build a team.

2) Peter Steinberger — OpenClaw project update + why he resists one-click installs (≈08:03–12:04)

He describes working to add maintainers and set up a foundation for donations/hiring, and argues that making installs too easy can hide real risks (he calls out prompt injection as unsolved).

3) Theo — codebase inertia + “slop” compounding under agent acceleration (≈18:17–20:02)

A concrete mental model: codebase quality peaks early, bad patterns spread faster than good ones, and “the models accelerate this.”

📊 PROJECTS & REPOS

  • LangChain: “Evaluating deep agents — our learnings” (built + tested 4 production agents) — https://www.blog.langchain.com/evaluating-deep-agents-our-learnings/
  • Agent harness glossary (Latent Patterns) — crisp definition of the orchestration layer that constructs context, executes tool calls, enforces guardrails, and controls loop continuation — https://latentpatterns.com/glossary/agent-harness
  • Cloudflare’s ViNext (Next.js recreation via Paramigen + tests): reported one-week build, 1700 Vitest + 380 Playwright E2E tests, and a partial test-coverage breakdown (13% dev / 20% E2E / 10% production out of 13,708 cases).

Editorial take: The leverage is shifting from “better prompts” to better harnesses + better tests—they’re what make agents reliable, repeatable, and (increasingly) portable across codebases.