ZeroNoise
Tests + harnesses become the interface for deep agents (plus narrow-agent teams and one-prompt infra)
Mar 2
6 min read
97 docs
Today’s theme: tests and harnesses are becoming the real control plane for deep agents—enabling rapid rewrites, repeatable behavior, and realistic evals. Plus: a one-prompt Windows VM inside a Cursor cloud agent, narrow-agent team patterns, and practical remote-control alternatives to ecosystem lock-in.

🔥 TOP SIGNAL

Test suites + harnesses are becoming the real “interface” for deep agents: Cloudflare used an AI agent (Paramigen) plus Next.js tests to recreate a Next.js-compatible framework in a week, while LangChain says deep agents require multi-layer evals (single-step → full-turn → multi-turn) in clean, reproducible environments. Geoffrey Huntley’s framing clicks: the model is the “reasoning engine,” but the agent harness is the control plane that makes behavior safe and repeatable in production.

🛠️ TOOLS & MODELS

  • “Too-hard-for-LLM” coding tasks are drying up (gpt-5.3-codex vs opus 4.6): Theo is offering $500/problem for locally verifiable repos that current top models can’t solve, but says almost every verifiable problem sent so far gets solved by 5.3 Codex first try. If you’re sharing benchmarks, he wants a git clone-from-scratch command sequence to reproduce the exact state.

  • Codex vs Claude (practitioner comparison): Peter Steinberger’s take: Codex “will read much more of your code and usually find a better solution,” while Claude is “more pleasant” but may claim it’s “100% production ready” and then bug out.

  • Augment’s codebase indexing → better retrieval inside Codex: Theo demos using Augment’s CLI to index a codebase, then switching to Codex, where the model immediately uses Augment’s retrieval tool; he claims it finds “exactly what you need and almost nothing else” and returns results in <20 seconds vs 5–10 minutes previously.

  • Remote control, minus lock-in: Claude Code’s new “Remote Control” is described as Max/Pro-only and Claude-Code-only in Jason Zhou’s thread; his alternative is Tailscale + SSH for a private-network workflow: “Phone terminal → SSH → dev machine,” with “no public IP / no port forwarding / no exposed services”.

  • Cursor cloud agent: desktop Windows VM from a single prompt: Jediah Katz says he got a Cursor cloud agent to run a Windows VM with full desktop display support “with just one prompt,” taking ~1.5 hours on a long-running harness, and then he can snapshot it as a reusable Windows base.

💡 WORKFLOWS & TRICKS

  • Deep-agent evaluation stack (LangChain’s production learnings)

    • Write bespoke success criteria per datapoint (not one generic rubric).
    • Run single-step evals to catch regressions at specific decision points.
    • Add full-turn evals to validate end-to-end behavior.
    • Add multi-turn evals that simulate realistic user interactions.
    • Keep clean, reproducible test environments so runs are comparable.
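In code, the layered stack above reduces to running the same datapoints through per-layer checks. A minimal sketch, with made-up agent stubs and criteria names (not LangChain's actual API):

```python
# Minimal sketch of a layered eval stack: each datapoint carries its own
# bespoke success criterion instead of one generic rubric. All names here
# are illustrative stand-ins, not real LangChain tooling.

def fake_agent_step(state):
    # Stand-in for one agent decision (e.g., which tool to call next).
    return "search" if "?" in state["input"] else "respond"

def fake_agent_run(state):
    # Stand-in for a full turn: decide, then produce a final answer.
    tool = fake_agent_step(state)
    return f"[{tool}] answer for: {state['input']}"

# Bespoke criteria: one callable per datapoint, not a shared rubric.
datapoints = [
    {"input": "What is the capital of France?",
     "step_ok": lambda decision: decision == "search",
     "turn_ok": lambda out: "capital of France" in out},
    {"input": "Summarize this paragraph.",
     "step_ok": lambda decision: decision == "respond",
     "turn_ok": lambda out: out.startswith("[respond]")},
]

def run_evals(datapoints):
    results = []
    for dp in datapoints:
        state = {"input": dp["input"]}
        step_pass = dp["step_ok"](fake_agent_step(state))   # single-step eval
        turn_pass = dp["turn_ok"](fake_agent_run(state))    # full-turn eval
        results.append({"input": dp["input"], "step": step_pass, "turn": turn_pass})
    return results

for r in run_evals(datapoints):
    print(r)
```

A multi-turn layer would wrap `run_evals` in a simulated conversation loop; the key point is that each layer reuses the same fixed, reproducible datapoints so runs stay comparable.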
  • Treat your harness as the product (Huntley’s definition you can operationalize)

    • The “agent harness” is the orchestration layer that manages prompts, tool execution, policy checks, guardrails, and loop control (continue/stop).
    • Handy reference: https://latentpatterns.com/glossary/agent-harness.
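A toy version of that control plane, purely illustrative (the model stub, tool names, policy set, and step budget are invented, not from the glossary):

```python
# Toy agent harness: the orchestration loop that constructs context,
# executes tool calls, enforces guardrails, and decides continue/stop.
# Real harnesses wrap an actual LLM; this is a sketch of the shape.

BLOCKED_TOOLS = {"delete_repo"}          # guardrail: policy check on tool use
MAX_STEPS = 5                            # loop control: hard stop

def fake_model(context):
    # Stand-in for the "reasoning engine": returns a tool request
    # or a final answer once enough context has accumulated.
    if len(context) < 3:
        return {"tool": "search", "args": context[-1]}
    return {"final": "done: " + context[-1]}

def run_tool(name, args):
    return f"{name}({args}) -> result"

def harness(user_prompt):
    context = [user_prompt]                       # context construction
    for _ in range(MAX_STEPS):
        action = fake_model(context)
        if "final" in action:                     # loop control: model stops
            return action["final"]
        if action["tool"] in BLOCKED_TOOLS:       # guardrail enforcement
            context.append("tool blocked by policy")
            continue
        context.append(run_tool(action["tool"], action["args"]))  # tool execution
    return "stopped: step budget exhausted"       # loop control: harness stops

print(harness("triage the failing build"))
```

Everything the Huntley/glossary definition names lives in `harness`: the model only proposes; the harness decides what actually runs and when the loop ends.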
  • Narrow-agent teams beat “mega agents” (Riley Brown’s OpenClaw pattern)

    • He reports that as he added more skills, agent dependability dropped: skill use timing got worse, context got “clouded,” and integrations/personalities got “jumbled”.
    • His proposed sweet spot: teams of narrow agents with ~7–10 skills each (vs ~30+).
    • Concrete coordination pattern: a journal agent in Telegram pings him ~every 30 minutes (sometimes skipping if nothing’s needed), logs useful context into Notion, and other agents read that shared journal (e.g., newsletter agent drafting for a 300,000-person email list with its own conversion goals).
    • Autonomy lever: narrow agents can run predictable cron-job loops (“three tasks every day”) because they’re optimizing for a small set of goals.
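The journal-plus-cron coordination above can be sketched in a few lines; the in-memory `journal` list and agent names here are stand-ins for the real Notion/Telegram setup:

```python
# Sketch of the shared-journal pattern: one narrow agent runs on a timer,
# writes to a shared journal only when there's something useful, and other
# narrow agents read from it. All names are illustrative.

journal = []   # stand-in for the shared Notion journal

def journal_agent(observations):
    # Fires ~every 30 minutes via cron; skips writing when nothing is useful.
    useful = [o for o in observations if o.get("useful")]
    for o in useful:
        journal.append({"source": "journal_agent", "note": o["note"]})
    return len(useful)

def newsletter_agent():
    # A second narrow agent reads the shared journal for draft material.
    notes = [e["note"] for e in journal]
    if not notes:
        return None   # nothing to draft this cycle
    return "Draft: " + "; ".join(notes)

# One simulated 30-minute cycle:
journal_agent([{"useful": False, "note": "idle"},
               {"useful": True, "note": "shipped onboarding fix"}])
print(newsletter_agent())
```

The point of the pattern: agents never call each other directly; the journal is the only coupling, which is why individual agents stay small enough to run on predictable cron loops.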
  • Agentic engineering tactics you can copy (Peter Steinberger)

    • Treat it like a discussion, not a one-liner; guide it explicitly (“look here, look there”) and assume it starts with zero project context.
    • Ask it to propose refactors/pain points when a change touches many parts of a codebase.
    • After shipping a feature, ask: “Now that you built it, what would you do different?” to surface what it learned during implementation.
    • Speed tricks: provide images as context when that’s faster than writing; use voice input for throughput.
  • Codebase hygiene under agent acceleration (Theo’s rules of thumb)

    • “Tolerate nothing”: if a bad pattern makes it in, it multiplies—so delete it aggressively.
    • Spend more time in “plan mode”: go back-and-forth until you have a markdown plan you can review, then tell the model to build.
    • When an agent goes wrong, interrogate the path: ask what it’s doing and why, then eliminate bad examples if they’re coming from your own codebase/docs.
  • One-prompt infra bootstrap: Windows VM inside an agent box (Cursor cloud agent)

    1. Give a prompt that explicitly asks for a full Windows VM with desktop display support (not just CLI).
    2. Let it run under a long-running harness (reported ~1.5 hours).
    3. Snapshot the resulting VM as a reusable “Windows base”.

👤 PEOPLE TO WATCH

  • Jediah Katz — practical proof of long-horizon agent setup work: a full desktop Windows VM inside a cloud agent, plus snapshotting for reuse.
  • Theo (t3.gg) — running public, verifiable “too hard for LLM” challenges and documenting what actually breaks modern coding models (increasingly little).
  • Peter Steinberger (OpenClaw) — high-signal “agentic engineering” habits + grounded tool comparison based on daily use.
  • Riley Brown (vibecodeapp) — concrete multi-agent “team” design: narrow agents, shared memory via Notion, cron-based loops.
  • LangChain team — pragmatic eval guidance from building/testing 4 production agents.

🎬 WATCH & LISTEN

1) Riley Brown — why “too many skills” makes agents worse (≈02:33–05:42)

He explains the failure mode (context clouding + jumbled integrations) and the practical alternative: 7–10 skills per agent, then build a team.

2) Peter Steinberger — OpenClaw project update + why he resists one-click installs (≈08:03–12:04)

He describes working to add maintainers and set up a foundation for donations/hiring, and argues that making installs too easy can hide real risks (he calls out prompt injection as unsolved).

3) Theo — codebase inertia + “slop” compounding under agent acceleration (≈18:17–20:02)

A concrete mental model: codebase quality peaks early, bad patterns spread faster than good ones, and “the models accelerate this.”

📊 PROJECTS & REPOS

  • LangChain: “Evaluating deep agents — our learnings” (built + tested 4 production agents) — https://www.blog.langchain.com/evaluating-deep-agents-our-learnings/
  • Agent harness glossary (Latent Patterns) — crisp definition of the orchestration layer that constructs context, executes tool calls, enforces guardrails, and controls loop continuation — https://latentpatterns.com/glossary/agent-harness
  • Cloudflare’s ViNext (Next.js recreation via Paramigen + tests): reported one-week build, 1700 Vitest + 380 Playwright E2E tests, and partial test coverage breakdown (13% dev / 20% E2E / 10% production out of 13,708 cases).

Editorial take: The leverage is shifting from “better prompts” to better harnesses + better tests—they’re what make agents reliable, repeatable, and (increasingly) portable across codebases.

Riley Brown
Profile 1 doc

Riley Brown (cofounder of @vibecodeapp, running agents for vibecode.dev's growth division) shares firsthand experience building hundreds of OpenClaw agent workflows after 2 weeks of testing OpenClaw, Manus, Claude Code, and Perplexity Computer.

Contrarian take: Narrow, specialized agents (7–10 skills each) outperform general-purpose agents or per-task cloud computers (Manus/Perplexity), as adding >10 skills reduces dependability, clouds context, and jumbles integrations/personalities. He plans 15 narrow OpenClaw agents in a superteam for growth.

Why narrow agents:

  • Tie skills directly to specific goals/KPIs for easy validation (e.g., YouTube agent optimizes subs/views/conversions with SERP/Supadata APIs for research, NanoBanana thumbnails, Notion scripts).
  • Easy to duplicate/remix/share (e.g., journal agent duplicated to cofounder in 5 min).
  • Simple to understand (few markdown skill files).
  • Reviewable via pass/fail on narrow KPIs.
  • Enables autonomous loops via cron jobs.

Orchestration pattern: Narrow agents share memory/context via a central Notion journal (journal agent analyzes activities every ~30 min, writes entries; newsletter agent reads for content ideas, optimizes open/CTR/conversions). Future: inter-agent communication, cloud scaling, team sharing.

OpenClaw advantages: persistent computer (Mac Mini), structured markdown skills, multi-channel gateway (Telegram/Slack/Discord).

Riley Brown
youtube 1 doc

Riley Brown, founder of vibecode.dev, shares firsthand experience building hundreds of AI agent workflows over two weeks for his company's growth division, primarily using OpenClaw.

Tools tested and compared:

  • OpenClaw: Persistent agent on one computer (e.g., Mac Mini) with structured skills (markdown files), long-term memory, and chat gateway (Telegram, Discord, Slack). Preferred for narrow focus.
  • Manus: Spins up a cloud computer per task; acts as a command center for general agents. Less ideal due to generality.
  • Claude Code: Tested; limited details.
  • Perplexity Computer (new release): ChatGPT-like with a cloud sandbox for file creation/editing per task, similar to Manus.

Key insight: Narrow agents (7–10 skills each) outperform general ones—more dependable, clearer context, easier intent assignment. Plans 15 narrow OpenClaw agents in a team, sharing context via a Notion journal.

Narrow agent benefits:

  • Skills tied to specific goals/KPIs (e.g., pass/fail reviewable).
  • Easy duplication/sharing (e.g., journal agent remixed for co-founder in 5 min).
  • Simple autonomous loops via cron jobs.
  • Inter-agent coordination (shared Notion; future memory sharing).

Example workflows:

  • YouTube agent: Optimizes subs/views/conversions. Skills: YouTube research (SERP API, Supadata API for transcripts), thumbnails (Nano Banana + photos context), Notion scripts.
  • Journal agent: Analyzes daily activities/meetings/videos, writes Notion entries every 30 min if needed; informs other agents.
  • Newsletter agent: Reads journal, drafts emails optimizing opens/CTR/conversions.

Contrarian take: Shift from prompts to agent intents/purpose; general agents fail at proactivity/surprise/useful suggestions. Future: Scale narrow OpenClaw agents to cloud (e.g., 200/team).

Jason Zhou
x 2 docs

Tailscale + SSH enables remote control for any coding agent (Claude Code, Codex, etc.) via a private network, avoiding ecosystem lock-in.

Claude Code's new Remote Control is limited: Max/Pro users only, Claude Code only—a hard limit if you're running multiple agents.

Workflow: Phone terminal → SSH → dev machine. No public IP, port forwarding, or exposed services.

Demo video in @hqmank's thread (quoted by @jasonzhou1993).

Peter Steinberger
Profile 1 doc

Peter Steinberger (creator of OpenClaw, steipete blog), a practitioner using coding agents daily, shares agentic engineering tips from firsthand experience:

Practical workflows and techniques:

  • Treat agents as discussion partners: Guide iteratively ("look here, look there"), avoid one-liner prompts; imagine the agent's perspective with no prior project knowledge.
  • Instruct it to Google best practices for ideas.
  • Provide images for quick context.
  • Use voice input (e.g., Whisper Flow) for highest throughput.
  • After building a feature: Ask the agent "what would you do different?" for insights.
  • Leverage for architecture: Identify pain points/refactors in large codebases.

Tool comparisons: Codex (OpenAI) is the best coding agent—reads more code, finds better solutions despite a drier personality; Claude is more pleasant but overclaims production readiness and bugs out.

OpenClaw update (his open-source project): Not dead; adding maintainers and a foundation for donations/hiring; emphasizes hackable installs over one-click due to risks like prompt injection. He views it as an infinite playground like Factorio, and the agent loop as the "hello world" of the AI age.

Quantitative insight: ~4% of GitHub comments are now by agents, and rising rapidly.

Contrarian take: Agents amplify skilled humans like a rocket; he learned faster last year than ever before; don't fear laziness if you use them thoughtfully.

Riley Brown
x 2 docs

Riley Brown (@rileybrown), cofounder of @vibecodeapp, spent 200 hours testing OpenClaw and recommends keeping agents focused (narrow) rather than overloading them with skills, and building teams of them.

Video covers:

  • Perplexity Computer and Manus
  • OpenClaw
  • Issues with too many skills in first agent
  • Preference for agents with intent like employee
  • Narrow agent example: YouTube Agent
  • Team of narrow agents rationale

Plans to build an agent team for @vibecodeapp, with a video added per agent.

Firsthand experience from vibecodeapp development.

Theo - t3.gg
youtube 1 doc

Theo (full-stack TypeScript dev, T3.gg creator, Cursor investor, ex-Twitch engineer, using agents in production like T3 chat) critiques coding agents/IDEs (Cursor, Claude Code, Codex) for poor UX, inconsistency, and performance due to 'vibe coding' with early models like Sonnet 3.5/3.7, leading to slop codebases.

Codebase inertia pattern: Quality peaks at 3–6 months, then degrades as bad patterns exponentially multiply via copying; agents accelerate this by replicating slop.

Actionable practices:

  • Tolerate no slop: Immediately delete bad patterns; use 'sledgehammer development'—agents make rewriting cheap (e.g., replace 5k LOC in hours).
  • Extended planning: Go back and forth with the model to spec in Markdown before building; review the plan.
  • Use the latest models (e.g., Opus 4, Codex 5) over outdated ones.
  • Question agent decisions: Ask 'why this path?' to trace and eliminate bad examples from your codebase/docs.
  • Proliferate codebases: Spin up new repos/services trivially; avoid bloating the main repo with one-offs.
  • Dual codebases: A slop version (vibe-coded) for prototyping/iteration; port refined features to a clean production version (e.g., his T3 chat PRs, Vampire Survivors Phaser→C).

Tool tip: Augment CLI indexes large codebases for fast retrieval (<20s vs. 5–10 min) in Codex; outperforms grep by finding related info.

Upcoming: T3 code for stable agent interaction.

ThePrimeTime
youtube 1 doc

Cloudflare used the Paramigen AI agent to rebuild Next.js (called ViNext) in one week by passing 1700 Vitest tests and 380 Playwright end-to-end tests, achieving 94% of the Next.js 16 API surface but only 13% dev, 20% E2E, and 10% production test coverage out of 13,708 total cases.

Quantitative gains: 4x faster builds (using Rollup vs. Turbopack) and 57% smaller client bundles. Now running in production, including on CIO.gov beta sites by the National Design Studio team.

New feature: Experimental Traffic-Aware Pre-rendering uses Cloudflare's reverse-proxy traffic data to pre-render only high-traffic pages (e.g., 184 pages cover 90% of traffic across 12,000 unique paths), avoiding linear build scaling.
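The selection behind that claim is essentially a cumulative-traffic cutoff over ranked paths. A sketch on synthetic Zipf-like data (not Cloudflare's implementation; the real numbers come from proxy logs):

```python
# Pick the smallest set of pages whose combined traffic share reaches a
# threshold, instead of pre-rendering every path. Traffic numbers below
# are synthetic: a few pages dominate, as in real web traffic.
paths = {f"/page/{i}": 10_000_000 // (i + 1) ** 2 + 1 for i in range(2000)}

def pages_to_prerender(traffic, coverage=0.90):
    """Greedily take the highest-traffic paths until `coverage` is reached."""
    total = sum(traffic.values())
    chosen, covered = [], 0
    for path, hits in sorted(traffic.items(), key=lambda kv: -kv[1]):
        chosen.append(path)
        covered += hits
        if covered / total >= coverage:
            break
    return chosen

selected = pages_to_prerender(paths)
print(f"pre-render {len(selected)} of {len(paths)} paths for 90% of traffic")
```

Because traffic is heavy-tailed, the chosen set stays tiny relative to the path count, which is what breaks the linear relationship between site size and build time.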

Test-driven agent pattern: Open test suites enable rapid recreation of complex frameworks, raising open-source implications (e.g., competitors forking via AI). The speaker (an experienced engineer who questioned Cloudflare staff) is skeptical the bundle-size claim will hold as features are added.

Secondhand report on Cloudflare's firsthand production usage.

Riley Brown
x 2 docs

Riley Brown (@rileybrown), cofounder of @vibecodeapp, shares his key insight after 200 hours testing OpenClaw: "Keep your agents focused, and build a team."

Video overview covers:

  • Perplexity Computer and Manus
  • OpenClaw setup
  • Problems with agents having too many skills
  • Preference for agents with intent (narrow focus)
  • Testing narrow AI agents
  • YouTube Agent as narrow example
  • Team of narrow agents benefits

Related YouTube video.

Timeless pattern: Narrow, specialized agents orchestrated in teams outperform broad multi-skill agents. Firsthand practitioner experience.

geoff
x 1 doc

@GeoffreyHuntley defines Agent Harness as the orchestration layer around a language model/agent that manages prompts, tool execution, policy checks, and loop control for autonomous behavior.

Key functions:

  • Constructs context
  • Executes tool calls
  • Enforces guardrails
  • Decides loop iterations (continue/stop)

Analogy: Model as “reasoning engine”; harness as operating system and control plane for useful, safe, repeatable production use.

Glossary: https://latentpatterns.com/glossary/agent-harness

LangChain
x 2 docs

LangChain team shares evaluation patterns for deep agents (complex multi-step agents) after building and testing 4 production agents—distinct from simple LLM tasks.

Key practices:

  • Bespoke test logic per datapoint with custom success criteria
  • Single-step evals to catch regressions at decision points
  • Full turn evals for end-to-end behavior
  • Multi-turn evals simulating realistic user interactions
  • Clean, reproducible test environments

Full blog: https://www.blog.langchain.com/evaluating-deep-agents-our-learnings/

Firsthand production experience from LangChain engineers.

Theo - t3.gg
x 4 docs

Theo (@theo, CEO t3.gg, YouTuber, developer) shares firsthand testing of coding agents on verifiable code problems.

  • gpt-5.3-codex and opus 4.6 are the current top models; he's running out of tasks too hard for LLMs and is willing to pay $500 per locally testable problem they can't solve.
  • Almost every verifiable problem sent so far is solved by 5.3 Codex on the first try; submitters must verify this before DMing.
  • Ideal submission: commands from git clone to set up the exact repo state for the agent.

Surprising take: Production-ready coding agents now handle most verifiable tasks, shifting focus to edge cases.

Theo - t3.gg
x 2 docs

Theo Browne (@theo, CEO @t3dotchat, YouTuber/developer) seeks programmatically verifiable code problems unsolvable by gpt-5.3-codex or opus 4.6, with the repo's 'before' state and a solving example — offers up to $500 per valid problem.

Firsthand insight: Most submitted verifiable problems are solved by 5.3 Codex; submitters must verify failure first.

This indicates advanced coding LLMs handle nearly all crowd-sourced verifiable tasks Theo receives.

Jediah Katz
x 2 docs

@jediahkatz (building @cursor_ai agent) used Cursor cloud agent to set up a full Windows VM with desktop display support inside its environment via one detailed prompt.

Prompt:

“I want you to set up a full windows VM inside your box. It should not just be a cmd interface, it should actually have full display support so I can see the windows desktop. Would be awesome to run windows XP so we can have the classic hill background, but if thats hard a more modern one is totally fine. This will be a significant challenge with many roadblocks and might require you to change your approach at various points. Please don’t stop until you’ve successfully got the windows desktop set up.”

Workflow details: Took 1.5 hours with a long-running harness; enables snapshotting for a persistent Windows base.

Firsthand account from Cursor AI builder demonstrating agent handling complex, iterative infrastructure tasks autonomously.