ZeroNoise
Tests + harnesses become the interface for deep agents (plus narrow-agent teams and one-prompt infra)
Mar 2
6 min read
97 docs
Today’s theme: tests and harnesses are becoming the real control plane for deep agents—enabling rapid rewrites, repeatable behavior, and realistic evals. Plus: a one-prompt Windows VM inside a Cursor cloud agent, narrow-agent team patterns, and practical remote-control alternatives to ecosystem lock-in.

🔥 TOP SIGNAL

Test suites + harnesses are becoming the real “interface” for deep agents: Cloudflare used an AI agent (Paramigen) plus Next.js tests to recreate a Next.js-compatible framework in a week, while LangChain says deep agents require multi-layer evals (single-step → full-turn → multi-turn) in clean, reproducible environments. Geoffrey Huntley’s framing clicks: the model is the “reasoning engine,” but the agent harness is the control plane that makes behavior safe and repeatable in production.

🛠️ TOOLS & MODELS

  • “Too-hard-for-LLM” coding tasks are drying up (gpt-5.3-codex vs opus 4.6): Theo is offering $500/problem for locally verifiable repos that current top models can’t solve, but says almost every verifiable problem sent so far gets solved by 5.3 Codex first try. If you’re sharing benchmarks, he wants a git clone-from-scratch command sequence to reproduce the exact state.

  • Codex vs Claude (practitioner comparison): Peter Steinberger’s take: Codex “will read much more of your code and usually find a better solution,” while Claude is “more pleasant” but may claim it’s “100% production ready” and then bug out.

  • Augment’s codebase indexing → better retrieval inside Codex: Theo demos using Augment’s CLI to index a codebase, then switching to Codex, where the model immediately uses Augment’s retrieval tool; he claims it finds “exactly what you need and almost nothing else” and returns results in <20 seconds vs 5–10 minutes previously.

  • Remote control, minus lock-in: Claude Code’s new “Remote Control” is described as Max/Pro-only and Claude-Code-only in Jason Zhou’s thread; his alternative is Tailscale + SSH for a private-network workflow: “Phone terminal → SSH → dev machine,” with “no public IP / no port forwarding / no exposed services”.

  • Cursor cloud agent: desktop Windows VM from a single prompt: Jediah Katz says he got a Cursor cloud agent to run a Windows VM with full desktop display support “with just one prompt,” taking ~1.5 hours on a long-running harness, and then he can snapshot it as a reusable Windows base.

💡 WORKFLOWS & TRICKS

  • Deep-agent evaluation stack (LangChain’s production learnings)

    • Write bespoke success criteria per datapoint (not one generic rubric).
    • Run single-step evals to catch regressions at specific decision points.
    • Add full-turn evals to validate end-to-end behavior.
    • Add multi-turn evals that simulate realistic user interactions.
    • Keep clean, reproducible test environments so runs are comparable.
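In code, the layered stack above reduces to running the same datapoints through per-layer checks. A minimal sketch, with made-up agent stubs and criteria names (not LangChain's actual API):

```python
# Minimal sketch of a layered eval stack: each datapoint carries its own
# bespoke success criterion instead of one generic rubric. All names here
# are illustrative stand-ins, not real LangChain tooling.

def fake_agent_step(state):
    # Stand-in for one agent decision (e.g., which tool to call next).
    return "search" if "?" in state["input"] else "respond"

def fake_agent_run(state):
    # Stand-in for a full turn: decide, then produce a final answer.
    tool = fake_agent_step(state)
    return f"[{tool}] answer for: {state['input']}"

# Bespoke criteria: one callable per datapoint, not a shared rubric.
datapoints = [
    {"input": "What is the capital of France?",
     "step_ok": lambda decision: decision == "search",
     "turn_ok": lambda out: "capital of France" in out},
    {"input": "Summarize this paragraph.",
     "step_ok": lambda decision: decision == "respond",
     "turn_ok": lambda out: out.startswith("[respond]")},
]

def run_evals(datapoints):
    results = []
    for dp in datapoints:
        state = {"input": dp["input"]}
        step_pass = dp["step_ok"](fake_agent_step(state))   # single-step eval
        turn_pass = dp["turn_ok"](fake_agent_run(state))    # full-turn eval
        results.append({"input": dp["input"], "step": step_pass, "turn": turn_pass})
    return results

for r in run_evals(datapoints):
    print(r)
```

A multi-turn layer would wrap `run_evals` in a simulated conversation loop; the key point is that each layer reuses the same fixed, reproducible datapoints so runs stay comparable.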
  • Treat your harness as the product (Huntley’s definition you can operationalize)

    • The “agent harness” is the orchestration layer that manages prompts, tool execution, policy checks, guardrails, and loop control (continue/stop).
    • Handy reference: https://latentpatterns.com/glossary/agent-harness.
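A toy version of that control plane, purely illustrative (the model stub, tool names, policy set, and step budget are invented, not from the glossary):

```python
# Toy agent harness: the orchestration loop that constructs context,
# executes tool calls, enforces guardrails, and decides continue/stop.
# Real harnesses wrap an actual LLM; this is a sketch of the shape.

BLOCKED_TOOLS = {"delete_repo"}          # guardrail: policy check on tool use
MAX_STEPS = 5                            # loop control: hard stop

def fake_model(context):
    # Stand-in for the "reasoning engine": returns a tool request
    # or a final answer once enough context has accumulated.
    if len(context) < 3:
        return {"tool": "search", "args": context[-1]}
    return {"final": "done: " + context[-1]}

def run_tool(name, args):
    return f"{name}({args}) -> result"

def harness(user_prompt):
    context = [user_prompt]                       # context construction
    for _ in range(MAX_STEPS):
        action = fake_model(context)
        if "final" in action:                     # loop control: model stops
            return action["final"]
        if action["tool"] in BLOCKED_TOOLS:       # guardrail enforcement
            context.append("tool blocked by policy")
            continue
        context.append(run_tool(action["tool"], action["args"]))  # tool execution
    return "stopped: step budget exhausted"       # loop control: harness stops

print(harness("triage the failing build"))
```

Everything the Huntley/glossary definition names lives in `harness`: the model only proposes; the harness decides what actually runs and when the loop ends.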
  • Narrow-agent teams beat “mega agents” (Riley Brown’s OpenClaw pattern)

    • He reports that as he added more skills, agent dependability dropped: skill use timing got worse, context got “clouded,” and integrations/personalities got “jumbled”.
    • His proposed sweet spot: teams of narrow agents with ~7–10 skills each (vs ~30+).
    • Concrete coordination pattern: a journal agent in Telegram pings him ~every 30 minutes (sometimes skipping if nothing’s needed), logs useful context into Notion, and other agents read that shared journal (e.g., newsletter agent drafting for a 300,000-person email list with its own conversion goals).
    • Autonomy lever: narrow agents can run predictable cron-job loops (“three tasks every day”) because they’re optimizing for a small set of goals.
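The journal-plus-cron coordination above can be sketched in a few lines; the in-memory `journal` list and agent names here are stand-ins for the real Notion/Telegram setup:

```python
# Sketch of the shared-journal pattern: one narrow agent runs on a timer,
# writes to a shared journal only when there's something useful, and other
# narrow agents read from it. All names are illustrative.

journal = []   # stand-in for the shared Notion journal

def journal_agent(observations):
    # Fires ~every 30 minutes via cron; skips writing when nothing is useful.
    useful = [o for o in observations if o.get("useful")]
    for o in useful:
        journal.append({"source": "journal_agent", "note": o["note"]})
    return len(useful)

def newsletter_agent():
    # A second narrow agent reads the shared journal for draft material.
    notes = [e["note"] for e in journal]
    if not notes:
        return None   # nothing to draft this cycle
    return "Draft: " + "; ".join(notes)

# One simulated 30-minute cycle:
journal_agent([{"useful": False, "note": "idle"},
               {"useful": True, "note": "shipped onboarding fix"}])
print(newsletter_agent())
```

The point of the pattern: agents never call each other directly; the journal is the only coupling, which is why individual agents stay small enough to run on predictable cron loops.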
  • Agentic engineering tactics you can copy (Peter Steinberger)

    • Treat it like a discussion, not a one-liner; guide it explicitly (“look here, look there”) and assume it starts with zero project context.
    • Ask it to propose refactors/pain points when a change touches many parts of a codebase.
    • After shipping a feature, ask: “Now that you built it, what would you do different?” to surface what it learned during implementation.
    • Speed tricks: provide images as context when that’s faster than writing; use voice input for throughput.
  • Codebase hygiene under agent acceleration (Theo’s rules of thumb)

    • “Tolerate nothing”: if a bad pattern makes it in, it multiplies—so delete it aggressively.
    • Spend more time in “plan mode”: go back-and-forth until you have a markdown plan you can review, then tell the model to build.
    • When an agent goes wrong, interrogate the path: ask what it’s doing and why, then eliminate bad examples if they’re coming from your own codebase/docs.
  • One-prompt infra bootstrap: Windows VM inside an agent box (Cursor cloud agent)

    1. Give a prompt that explicitly asks for a full Windows VM with desktop display support (not just CLI).
    2. Let it run under a long-running harness (reported ~1.5 hours).
    3. Snapshot the resulting VM as a reusable “Windows base”.

👤 PEOPLE TO WATCH

  • Jediah Katz — practical proof of long-horizon agent setup work: a full desktop Windows VM inside a cloud agent, plus snapshotting for reuse.
  • Theo (t3.gg) — running public, verifiable “too hard for LLM” challenges and documenting what actually breaks modern coding models (increasingly little).
  • Peter Steinberger (OpenClaw) — high-signal “agentic engineering” habits + grounded tool comparison based on daily use.
  • Riley Brown (vibecodeapp) — concrete multi-agent “team” design: narrow agents, shared memory via Notion, cron-based loops.
  • LangChain team — pragmatic eval guidance from building/testing 4 production agents.

🎬 WATCH & LISTEN

1) Riley Brown — why “too many skills” makes agents worse (≈02:33–05:42)

He explains the failure mode (context clouding + jumbled integrations) and the practical alternative: 7–10 skills per agent, then build a team.

2) Peter Steinberger — OpenClaw project update + why he resists one-click installs (≈08:03–12:04)

He describes working to add maintainers and set up a foundation for donations/hiring, and argues that making installs too easy can hide real risks (he calls out prompt injection as unsolved).

3) Theo — codebase inertia + “slop” compounding under agent acceleration (≈18:17–20:02)

A concrete mental model: codebase quality peaks early, bad patterns spread faster than good ones, and “the models accelerate this.”

📊 PROJECTS & REPOS

  • LangChain: “Evaluating deep agents — our learnings” (built + tested 4 production agents) — https://www.blog.langchain.com/evaluating-deep-agents-our-learnings/
  • Agent harness glossary (Latent Patterns) — crisp definition of the orchestration layer that constructs context, executes tool calls, enforces guardrails, and controls loop continuation — https://latentpatterns.com/glossary/agent-harness
  • Cloudflare’s ViNext (Next.js recreation via Paramigen + tests): reported one-week build, 1700 Vitest + 380 Playwright E2E tests, and partial test coverage breakdown (13% dev / 20% E2E / 10% production out of 13,708 cases).

Editorial take: The leverage is shifting from “better prompts” to better harnesses + better tests—they’re what make agents reliable, repeatable, and (increasingly) portable across codebases.

Riley Brown
Profile 1 doc

Riley Brown (cofounder of @vibecodeapp, running agents for vibecode.dev's growth division) shares firsthand experience building hundreds of OpenClaw agent workflows after 2 weeks of testing OpenClaw, Manus, Claude Code, and Perplexity Computer.

Contrarian take: Narrow, specialized agents (7–10 skills each) outperform general-purpose agents or per-task cloud computers (Manus/Perplexity), as adding >10 skills reduces dependability, clouds context, and jumbles integrations/personalities. He plans 15 narrow OpenClaw agents in a superteam for growth.

Why narrow agents:

  • Tie skills directly to specific goals/KPIs for easy validation (e.g., YouTube agent optimizes subs/views/conversions with SERP/Supadata APIs for research, NanoBanana thumbnails, Notion scripts).
  • Easy to duplicate/remix/share (e.g., journal agent duplicated to cofounder in 5 min).
  • Simple to understand (few markdown skill files).
  • Reviewable via pass/fail on narrow KPIs.
  • Enables autonomous loops via cron jobs.

Orchestration pattern: Narrow agents share memory/context via a central Notion journal (journal agent analyzes activities every ~30 min, writes entries; newsletter agent reads for content ideas, optimizes open/CTR/conversions). Future: inter-agent communication, cloud scaling, team sharing.

OpenClaw advantages: persistent computer (Mac Mini), structured markdown skills, multi-channel gateway (Telegram/Slack/Discord).

Riley Brown
youtube 1 doc

Riley Brown, founder of vibecode.dev, shares firsthand experience building hundreds of AI agent workflows over two weeks for his company's growth division, primarily using OpenClaw.

Tools tested and compared:

  • OpenClaw: Persistent agent on one computer (e.g., Mac Mini) with structured skills (markdown files), long-term memory, and chat gateway (Telegram, Discord, Slack). Preferred for narrow focus.
  • Manus: Spins up a cloud computer per task; acts as a command center for general agents. Less ideal due to generality.
  • Claude Code: Tested; limited details.
  • Perplexity Computer (new release): ChatGPT-like with a cloud sandbox for file creation/editing per task, similar to Manus.

Key insight: Narrow agents (7–10 skills each) outperform general ones—more dependable, clearer context, easier intent assignment. Plans 15 narrow OpenClaw agents in a team, sharing context via a Notion journal.

Narrow agent benefits:

  • Skills tied to specific goals/KPIs (e.g., pass/fail reviewable).
  • Easy duplication/sharing (e.g., journal agent remixed for co-founder in 5 min).
  • Simple autonomous loops via cron jobs.
  • Inter-agent coordination (shared Notion; future memory sharing).

Example workflows:

  • YouTube agent: Optimizes subs/views/conversions. Skills: YouTube research (SERP API, Supadata API for transcripts), thumbnails (Nano Banana + photos context), Notion scripts.
  • Journal agent: Analyzes daily activities/meetings/videos, writes Notion entries every 30 min if needed; informs other agents.
  • Newsletter agent: Reads journal, drafts emails optimizing opens/CTR/conversions.

Contrarian take: Shift from prompts to agent intents/purpose; general agents fail at proactivity/surprise/useful suggestions. Future: Scale narrow OpenClaw agents to cloud (e.g., 200/team).

Jason Zhou
x 2 docs

Tailscale + SSH enables remote control for any coding agent (Claude Code, Codex, etc.) via a private network, avoiding ecosystem lock-in.

Claude Code's new Remote Control is limited: Max/Pro users only, Claude Code only—a hard limit if you're running multiple agents.

Workflow: Phone terminal → SSH → dev machine. No public IP, port forwarding, or exposed services.

Demo video in @hqmank's thread (quoted by @jasonzhou1993).

Peter Steinberger
Profile 1 doc

Peter Steinberger (creator of OpenClaw, steipete blog), a practitioner using coding agents daily, shares agentic engineering tips from firsthand experience:

Practical workflows and techniques:

  • Treat agents as discussion partners: Guide iteratively ("look here, look there"), avoid one-liner prompts; imagine the agent's perspective with no prior project knowledge.
  • Instruct it to Google best practices for ideas.
  • Provide images for quick context.
  • Use voice input (e.g., Whisper Flow) for highest throughput.
  • After building a feature: Ask the agent "what would you do different?" for insights.
  • Leverage for architecture: Identify pain points/refactors in large codebases.

Tool comparisons: Codex (OpenAI) is the best coding agent—reads more code, finds better solutions despite a drier personality; Claude is more pleasant but overclaims production readiness and bugs out.

OpenClaw update (his open-source project): Not dead; adding maintainers and a foundation for donations/hiring; emphasizes hackable installs over one-click due to risks like prompt injection. He views it as an infinite playground like Factorio, and the agent loop as the "hello world" of the AI age.

Quantitative insight: ~4% of GitHub comments are now by agents, and rising rapidly.

Contrarian take: Agents amplify skilled humans like a rocket; he learned faster last year than ever before; don't fear laziness if you use them thoughtfully.

Riley Brown
x 2 docs

Riley Brown (@rileybrown), cofounder of @vibecodeapp, spent 200 hours testing OpenClaw and recommends keeping agents focused (narrow) rather than overloading them with skills, and building teams of them.

Video covers:

  • Perplexity Computer and Manus
  • OpenClaw
  • Issues with too many skills in first agent
  • Preference for agents with intent like employee
  • Narrow agent example: YouTube Agent
  • Team of narrow agents rationale

Plans to build an agent team for @vibecodeapp, with a video added per agent.

Firsthand experience from vibecodeapp development.

Theo - t3.gg
youtube 1 doc

Theo (full-stack TypeScript dev, T3.gg creator, Cursor investor, ex-Twitch engineer, using agents in production like T3 chat) critiques coding agents/IDEs (Cursor, Claude Code, Codex) for poor UX, inconsistency, and performance due to 'vibe coding' with early models like Sonnet 3.5/3.7, leading to slop codebases.

Codebase inertia pattern: Quality peaks at 3–6 months, then degrades as bad patterns exponentially multiply via copying; agents accelerate this by replicating slop.

Actionable practices:

  • Tolerate no slop: Immediately delete bad patterns; use 'sledgehammer development'—agents make rewriting cheap (e.g., replace 5k LOC in hours).
  • Extended planning: Go back and forth with the model to spec in Markdown before building; review the plan.
  • Use the latest models (e.g., Opus 4, Codex 5) over outdated ones.
  • Question agent decisions: Ask 'why this path?' to trace and eliminate bad examples from your codebase/docs.
  • Proliferate codebases: Spin up new repos/services trivially; avoid bloating the main repo with one-offs.
  • Dual codebases: A slop version (vibe-coded) for prototyping/iteration; port refined features to a clean production version (e.g., his T3 chat PRs, Vampire Survivors Phaser→C).

Tool tip: Augment CLI indexes large codebases for fast retrieval (<20s vs. 5–10 min) in Codex; outperforms grep by finding related info.

Upcoming: T3 code for stable agent interaction.

ThePrimeTime
youtube 1 doc

Cloudflare used the Paramigen AI agent to rebuild Next.js (called ViNext) in one week by passing 1700 Vitest tests and 380 Playwright end-to-end tests, achieving 94% of the Next.js 16 API surface but only 13% dev, 20% E2E, and 10% production test coverage out of 13,708 total cases.

Quantitative gains: 4x faster builds (using Rollup vs. Turbopack) and 57% smaller client bundles. Now running in production, including on CIO.gov beta sites by the National Design Studio team.

New feature: Experimental Traffic-Aware Pre-rendering uses Cloudflare's reverse-proxy traffic data to pre-render only high-traffic pages (e.g., 184 pages cover 90% of traffic across 12,000 unique paths), avoiding linear build scaling.
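The selection behind that claim is essentially a cumulative-traffic cutoff over ranked paths. A sketch on synthetic Zipf-like data (not Cloudflare's implementation; the real numbers come from proxy logs):

```python
# Pick the smallest set of pages whose combined traffic share reaches a
# threshold, instead of pre-rendering every path. Traffic numbers below
# are synthetic: a few pages dominate, as in real web traffic.
paths = {f"/page/{i}": 10_000_000 // (i + 1) ** 2 + 1 for i in range(2000)}

def pages_to_prerender(traffic, coverage=0.90):
    """Greedily take the highest-traffic paths until `coverage` is reached."""
    total = sum(traffic.values())
    chosen, covered = [], 0
    for path, hits in sorted(traffic.items(), key=lambda kv: -kv[1]):
        chosen.append(path)
        covered += hits
        if covered / total >= coverage:
            break
    return chosen

selected = pages_to_prerender(paths)
print(f"pre-render {len(selected)} of {len(paths)} paths for 90% of traffic")
```

Because traffic is heavy-tailed, the chosen set stays tiny relative to the path count, which is what breaks the linear relationship between site size and build time.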

Test-driven agent pattern: Open test suites enable rapid recreation of complex frameworks, raising open-source implications (e.g., competitors forking via AI). The speaker (an experienced engineer who questioned Cloudflare staff) is skeptical the bundle-size claim will hold as features are added.

Secondhand report on Cloudflare's firsthand production usage.

Riley Brown
x 2 docs

Riley Brown (@rileybrown), cofounder of @vibecodeapp, shares his key insight after 200 hours testing OpenClaw: "Keep your agents focused, and build a team."

Video overview covers:

  • Perplexity Computer and Manus
  • OpenClaw setup
  • Problems with agents having too many skills
  • Preference for agents with intent (narrow focus)
  • Testing narrow AI agents
  • YouTube Agent as narrow example
  • Team of narrow agents benefits

Related YouTube video.

Timeless pattern: Narrow, specialized agents orchestrated in teams outperform broad multi-skill agents. Firsthand practitioner experience.

geoff
x 1 doc

@GeoffreyHuntley defines Agent Harness as the orchestration layer around a language model/agent that manages prompts, tool execution, policy checks, and loop control for autonomous behavior.

Key functions:

  • Constructs context
  • Executes tool calls
  • Enforces guardrails
  • Decides loop iterations (continue/stop)

Analogy: Model as “reasoning engine”; harness as operating system and control plane for useful, safe, repeatable production use.

Glossary: https://latentpatterns.com/glossary/agent-harness

LangChain
x 2 docs

LangChain team shares evaluation patterns for deep agents (complex multi-step agents) after building and testing 4 production agents—distinct from simple LLM tasks.

Key practices:

  • Bespoke test logic per datapoint with custom success criteria
  • Single-step evals to catch regressions at decision points
  • Full turn evals for end-to-end behavior
  • Multi-turn evals simulating realistic user interactions
  • Clean, reproducible test environments

Full blog: https://www.blog.langchain.com/evaluating-deep-agents-our-learnings/

Firsthand production experience from LangChain engineers.

Theo - t3.gg
x 4 docs

Theo (@theo, CEO t3.gg, YouTuber, developer) shares firsthand testing of coding agents on verifiable code problems.

  • gpt-5.3-codex and opus 4.6 are the current top models; he's running out of tasks too hard for LLMs and is willing to pay $500 per locally testable problem they can't solve.
  • Almost every verifiable problem sent so far is solved by 5.3 Codex on the first try; submitters must verify this before DMing.
  • Ideal submission: commands from git clone to set up the exact repo state for the agent.

Surprising take: Production-ready coding agents now handle most verifiable tasks, shifting focus to edge cases.

Theo - t3.gg
x 2 docs

Theo Browne (@theo, CEO @t3dotchat, YouTuber/developer) seeks programmatically verifiable code problems unsolvable by gpt-5.3-codex or opus 4.6, with the repo's 'before' state and a solving example — offers up to $500 per valid problem.

Firsthand insight: Most submitted verifiable problems are solved by 5.3 Codex; submitters must verify failure first.

This indicates advanced coding LLMs handle nearly all crowd-sourced verifiable tasks Theo receives.

Jediah Katz
x 2 docs

@jediahkatz (building @cursor_ai agent) used Cursor cloud agent to set up a full Windows VM with desktop display support inside its environment via one detailed prompt.

Prompt:

“I want you to set up a full windows VM inside your box. It should not just be a cmd interface, it should actually have full display support so I can see the windows desktop. Would be awesome to run windows XP so we can have the classic hill background, but if thats hard a more modern one is totally fine. This will be a significant challenge with many roadblocks and might require you to change your approach at various points. Please don’t stop until you’ve successfully got the windows desktop set up.”

Workflow details: Took 1.5 hours with a long-running harness; enables snapshotting for a persistent Windows base.

Firsthand account from Cursor AI builder demonstrating agent handling complex, iterative infrastructure tasks autonomously.