Hours of research in one daily brief, on your terms.

Tell us what you need to stay on top of. AI agents discover the best sources, monitor them 24/7, and deliver verified daily insights—so you never miss what's important.

Set up your daily brief agent
Discovering relevant sources...
Syncing sources 0/180...
Extracting information
Generating brief

Recent briefs

Harnesses Become the Real Lever as Codex Lands in Claude Code
Mar 31
6 min read
103 docs
Claude
LangChain
Jason Zhou
+17
The best signal today is that coding-agent performance is increasingly a harness problem, not just a base-model problem. Also inside: the open-source Codex bridge into Claude Code, practical workflows for local models and secure orchestration, and the clips worth watching.

🔥 TOP SIGNAL

The strongest signal today: the harness is now a first-order performance variable. Georgi Gerganov says most local coding-agent failures come from the harness, chat template, prompt construction, and inference chain, not just the model. Matt Maher’s 100-feature PRD benchmark found Cursor improved frontier-model results by 11% on average, with Opus scoring 20% higher there than in Claude Code. And the open-source Meta Harness paper summary says changing the harness around a fixed model can create a 6x gap.

For builders, benchmarking the base model alone is increasingly the wrong abstraction; routing, review, retrieval, debugging visibility, and context handling are where a lot of the practical edge is moving.

🛠️ TOOLS & MODELS

  • Codex plugin for Claude Code. OpenAI shipped openai/codex-plugin-cc so Claude Code users can delegate tasks to Codex or have Codex review changes with a ChatGPT subscription. Huet says the pattern they already saw in the wild was Codex for review and GPT-5.4 for more complex tasks. Commands: /codex:review, /codex:adversarial-review, /codex:rescue.
  • Open Codex substrate. Huet says Codex CLI and Codex app server are open source so the same ChatGPT subscription can be used in the app, terminal, JetBrains, Xcode, OpenCode, Pi, and Claude Code. The new Claude Code plugin is built on that same open-source app server + harness, including the same models, parallel tasking, and review flow.
  • Codex got a context upgrade. Mark Chen says Codex now has auto compaction, and an early user report says it remembers tiny details across multiple rounds of compaction.
  • Harness comparison that matters. In Matt Maher’s benchmark of frontier models implementing a 100-feature PRD, Cursor improved results from Gemini 52→57, GPT-5.4 82→88, and Opus 77→93; Theo highlighted Opus being 20% higher there than in Claude Code.
  • Local-model family to test now: Qwen3.5. Georgi Gerganov calls it a step change across device sizes. His tested local coding/chat/MCP set included gpt-oss-120b, Qwen3-Coder-30B, GLM-4.7-Flash, MiniMax-M2.5, and Qwen3.5-35B-A3B, mostly in Q8_0 variants. He says tool-calling quality still depends on both model intelligence and chat-template parsing in llama.cpp.
  • Claude Code widened enterprise support. GitHub Enterprise Server now works across Claude Code on the web, iOS, Android, and Code Review, so self-hosted repos no longer need to move to github.com for async workflows. Docs: code.claude.com/docs/en/github-enterprise-server.
  • Claude Code added computer use, but early cost reports are rough. Anthropic says Claude can open apps, click through UI, and test what it built from the CLI in research preview on Pro/Max plans. Theo’s firsthand reaction: it used up his rate limits in 2 minutes despite a $200/month plan.
  • LangSmith Experiments got a more useful failure view. LangChain rebuilt the detail view to cut clutter and show better traces, clearer evaluator reasoning, and easier comparisons when debugging agent failures. Try it at smith.langchain.com.

💡 WORKFLOWS & TRICKS

  • Cross-model review loop from Claude Code
    1. Install with /plugin marketplace add openai/codex-plugin-cc or from the repo.
    2. Use /codex:review for a standard read-only pass, /codex:adversarial-review when you want a challenge pass, or /codex:rescue to hand a task off.
    3. Keep Claude Code as your front-end if you like, but route review to Codex and heavier tasks to GPT-5.4, the pattern Huet says users were already doing manually.
  • Local-model bring-up: don’t benchmark a broken stack
    1. Start with the highest-quality model that fits your hardware.
    2. Use your own harness, or llama-server’s webui with MCP, so you know what the stack is actually doing.
    3. Only then optimize with quantization or community parameter tuning.
    4. If results still look bad, inspect the whole chain: harness, chat template, prompt construction, and inference bugs.
  • Claude Code can build real artifacts end-to-end. Simon Willison’s flow: clone nanochat, pull model weights, use the Space demo source to fill in the inference script, then have Claude Code read the LLM plugin tutorial and finish the plugin. The output repo is public, and Simon says it was his first full model-plugin build this way and it worked really well.
  • Keep secrets out of context; let the agent do the plumbing. Kent C. Dodds says Claude Desktop cancelled a scheduled Cursor cloud agent, asked for a Cloudflare API token securely so it never entered context, generated an EC P-256 keypair, deployed a Worker, and updated Cloudflare routing to finish a Tesla integration. Reusable pattern: human-mediated auth, agent-executed infra steps, MCP as the handoff surface.
  • If Claude Code usage suddenly spikes, test the CLI path. Theo, relaying a reverse-engineered report he says he did not independently confirm, points to a standalone Bun-binary cache bug with a workaround of npx @anthropic-ai/claude-code, plus a separate --resume issue that may still break cache. He says uncached tokens can be 10x-20x more expensive.
  • Push human effort into plan mode. Jason Zhou says his strongest engineers now spend their time giving context and making technical decisions while the agent executes across multiple sessions.
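The secret-handling boundary in Kent C. Dodds’ item reduces to a simple rule: the real token lives only in the executor’s environment, and anything fed back into model context is redacted first. A minimal sketch of that boundary; the command, variable, and placeholder names here are illustrative, not from the source:

```python
import os

PLACEHOLDER = "<CF_API_TOKEN>"  # the only form of the secret the model ever sees

def fill_secret(command_template: str) -> str:
    """Substitute the real token at execution time, outside model context."""
    return command_template.replace(PLACEHOLDER, os.environ["CF_API_TOKEN"])

def redact(text: str) -> str:
    """Scrub the token before any output flows back to the agent."""
    return text.replace(os.environ["CF_API_TOKEN"], PLACEHOLDER)

# The human supplies the token out-of-band (env var, secure prompt, etc.)
os.environ["CF_API_TOKEN"] = "s3cret-value"
real_cmd = fill_secret(f"deploy-worker --token {PLACEHOLDER}")  # hypothetical CLI
safe_log = redact(real_cmd)
print(safe_log)  # the agent's transcript never contains the raw token
```

The point of the pattern is that redaction is structural, not a prompt instruction: the model literally cannot leak a value it never received.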

👤 PEOPLE TO WATCH

  • Georgi Gerganov — probably the clearest current explainer of why local coding agents disappoint: harness, chat templates, prompt construction, inference bugs, and a practical bring-up order. Simon Willison says this matches his own experiments.
  • Romain Huet — high signal because he’s shipping actual workflow glue, not just demos: the open-source Codex plugin for Claude Code, concrete commands, and the open-source Codex app server/CLI underneath.
  • Simon Willison — published a full transcript of Claude Code building a real model plugin end-to-end; good benchmark for what a successful serious use case looks like.
  • Kent C. Dodds — worth following for real MCP + infra orchestration patterns, especially his clear secret-handling boundary where tokens stay out of model context.
  • Jason Zhou — useful on where coding agents meet product/design: he trained non-technical designers on Cursor + GitHub, and now ships a /superdesign skill that scans the codebase before designing in context.

🎬 WATCH & LISTEN

  • 14:15-16:39 — Meta Harness’s self-improvement loop. Fastest clean explanation of the pattern: store source, scores, and traces on disk; let a coding-agent proposer inspect prior failures; iterate the harness instead of stuffing everything into one prompt.
  • 37:25-38:28 — Jason Zhou’s /superdesign flow. Nice crossover example: the agent first scans the codebase and component system, then opens the browser and designs with actual product context instead of guessing from scratch.
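The disk-backed loop in the first clip can be made concrete in a few lines. This is a toy illustration of the pattern (persist source, scores, and traces; let a proposer read prior runs), not the Meta Harness implementation; all names and values are invented:

```python
import json
import pathlib
import tempfile

log_dir = pathlib.Path(tempfile.mkdtemp())

def record_run(harness_src: str, score: float, trace: str) -> None:
    """Persist harness source, score, and trace to disk for later inspection."""
    run_id = len(list(log_dir.glob("run-*.json")))
    (log_dir / f"run-{run_id}.json").write_text(
        json.dumps({"harness": harness_src, "score": score, "trace": trace})
    )

def prior_failures(threshold: float = 0.5) -> list:
    """What a coding-agent proposer would read before proposing the next harness."""
    runs = [json.loads(p.read_text()) for p in sorted(log_dir.glob("run-*.json"))]
    return [r for r in runs if r["score"] < threshold]

def best_harness() -> str:
    runs = [json.loads(p.read_text()) for p in log_dir.glob("run-*.json")]
    return max(runs, key=lambda r: r["score"])["harness"]

record_run("v1: single mega-prompt", 0.41, "timed out on retries")
record_run("v2: adds retry policy + context pruning", 0.58, "passed")
print(best_harness())  # the proposer iterates on the winner, not on one prompt
```

The key design choice is that the loop’s state lives on disk, so each proposal can inspect every prior failure instead of relying on whatever fits in a single context window.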

📊 PROJECTS & REPOS

  • openai/codex-plugin-cc — open-source bridge letting Claude Code call Codex for task delegation or code review via ChatGPT subscription. Signal that it solves a real behavior: Embiricos says enough developers were already using Codex to review Claude outputs that OpenAI decided to lean into it.
  • Meta Harness — new open-source project from Stanford, MIT, and Krafton for end-to-end optimization of model harnesses; paper and code are already out. The core design is a coding-agent proposer with filesystem access that iterates on prior harnesses instead of relying on a fixed scaffold.
  • simonw/llm-mrchatterbox — a useful reference repo for the Claude Code-built-plugin pattern; Simon says it was his first full model-plugin build this way and he expects to use the method again.
  • Karpathy’s autoresearch — Matthew Berman cites it as a close cousin to Meta Harness, with 61k stars and a self-improving loop that runs experiments and learns from prior results.

Editorial take: the fastest-moving edge in coding agents is no longer just model choice; it’s the harness around the model: memory, routing, review, and debugging visibility.

Extreme Ownership Leads a Founder Reading List on Accountability, Policy, and AI Safety
Mar 31
3 min read
143 docs
The Verge
Boaz Barak
Marc Andreessen
+3
Marc Andreessen's recommendation of Jocko Willink's Extreme Ownership stands out because he ties it to a clear operating framework rather than offering generic praise. Other authentic picks include the Draghi Report, Tony Fadell's Apple II history article, and Sam Altman's nod to a Boaz Barak AI safety post.

Most compelling recommendation: Extreme Ownership

Marc Andreessen's endorsement of Extreme Ownership is the clearest signal in today's set because he explains exactly what he took from the book and how he uses it. He says the idea of assuming fault first helps him focus on self-improvement, relieve stress, drain resentment, and lean on intrinsic rather than extrinsic motivation; he also says he fell in love with the book when it came out and wanted to share it with founders.

  • Content type: Book
  • Author/creator: Jocko Willink
  • Link/URL: Not provided in the source material
  • Who recommended it: Marc Andreessen
  • Key takeaway: Treat "extreme ownership" as an operating rule: if something goes wrong, start by assuming it is your fault, then improve what you can control
  • Why it matters: This stands out because the recommendation comes with a concrete framework, not just praise. Andreessen ties it to a repeatable psychology he says is useful precisely when external rewards are not enough

"Life just gets a lot simpler if you just assume everything is your own fault."

Other clearly organic picks

A pattern in today's set: the highest-signal recommendations came with explicit operating rules, while the lighter picks still help readers track what prominent tech leaders found worth sharing.

Draghi Report

  • Content type: Report
  • Author/creator: Mario Draghi
  • Link/URL: Not provided in the source material
  • Who recommended it: Marc Andreessen
  • Key takeaway: Andreessen says readers should "just read that report" because "everything's in there," and frames it as a set of prescriptions for building stronger tech ecosystems in Europe
  • Why it matters: He presents it as an execution playbook rather than a diagnosis alone; the gap, in his telling, is implementation

Apple II Forever!

  • Content type: Article
  • Author/creator: The Verge
  • Link/URL: https://www.theverge.com/tech/900677/apple-ii-personal-computer
  • Who recommended it: Tony Fadell
  • Key takeaway: Fadell says the Apple II was his first computer; he saved up by caddying while his grandfather matched his earnings, then read MacWorld and other computer magazines while dreaming of joining the Macintosh team
  • Why it matters: The recommendation is anchored in a formative personal story about how early computing shaped a future hardware builder's ambitions

the state of AI safety in four fake graphs

  • Content type: Blog post
  • Author/creator: Boaz Barak
  • Link/URL: Announcement post — https://x.com/boazbaraktcs/status/2038606572046172443
  • Who recommended it: Sam Altman
  • Key takeaway: Altman called it "a very good post"
  • Why it matters: The source material does not include a fuller summary of the post's thesis, but the endorsement itself is a strong signal for readers tracking what prominent tech leaders are flagging in AI safety discussions

Claude Code Expands, Qwen3.5-Omni Ships, and Harness Engineering Takes Center Stage
Mar 31
9 min read
643 docs
Stephanie Palazzolo
elvis
Jason Weston
+50
The biggest developments were a more capable Claude Code, Alibaba's Qwen3.5-Omni release, and a growing body of evidence that harness design is becoming a core performance lever. This brief also covers measurable enterprise ROI, faster local AI stacks, new research papers, funding and strategy moves, and governance-related updates.

Top Stories

Why it matters: This cycle's biggest signals were about agent execution: models are getting better at acting on software, multimodal systems are widening the interface, and performance is increasingly coming from the harness around the model as much as the model itself.

Claude Code moved closer to a full software-testing loop

Anthropic added computer use to Claude Code, letting Claude open apps, click through interfaces, and test what it built directly from the CLI; the feature is in research preview on Pro and Max plans. At the same time, Claude Code and Code Review added GitHub Enterprise Server support for async workflows on self-hosted repos. Anthropic staff also said they open sourced a plugin so Claude Code users can call Codex from a ChatGPT subscription for reviews, adversarial reviews, and rescue flows.

Impact: this is a step from code generation toward a tighter write-build-run-verify loop, and it makes Claude Code easier to use inside enterprise GitHub setups.

Qwen3.5-Omni pushed multimodal interaction further into the product layer

Alibaba released Qwen3.5-Omni, a model for text, image, audio, and video understanding with real-time interaction features including semantic interruption, built-in web search, and complex function calling. Alibaba highlighted script-level captioning, support for up to 10 hours of audio or 400 seconds of 720p video, 113 speech-recognition languages, and 36 output languages, plus an "Audio-Visual Vibe Coding" workflow that turns camera-described ideas into a website or game. The company also said the model is open access via Hugging Face, with the caveat that "omni" here refers to interpreting image and voice, not generating them.

Impact: Alibaba is packaging multimodal reasoning, voice interaction, and tool use into a surface that looks closer to a general-purpose AI application platform.

Harness engineering is turning into a primary performance lever

Several results this cycle pointed in the same direction: the system around the model matters more than many teams assumed. Meta Harness said prompt/tool/retry/context choices alone can create a 6x performance gap on the same model, and that harness deltas are now wider than frontier-model deltas. In Matt Maher's 100-feature PRD benchmark, a post said Cursor improved model performance by 11% on average, including Opus from 77% to 93%. CMU's CAID paper reported +26.7 points on PaperBench and +14.3 points on Commit0 over single-agent baselines by coordinating isolated git worktrees and explicit integration via git.

"The delta between harness implementations on the same model is not. That's where the leverage is."

Impact: performance gains are increasingly coming from coordination, evaluation loops, and tool design, not only from bigger base models.

Enterprise deployments are producing measurable ROI

Two deployment examples stood out for hard numbers. Novo Nordisk is using AI agents built on Anthropic and OpenAI models to detect trial risks, automate site selection, and flag process redundancies, shaving weeks to months off clinical trials, an acceleration in time-to-market potentially worth hundreds of millions of dollars. Separately, a Shopify case study said the company cut annual AI deployment costs from $5.5M to $73K by decomposing business logic, modeling intent with DSPy, and optimizing a smaller model while maintaining performance; the cited scale-up estimate cut coverage of 150,000 shops from $41M to $73K.

"The juice is clearly worth the squeeze."

Impact: the strongest enterprise signal in the notes was not hype but faster trials, lower operating cost, and maintained performance.

Local AI stacks got faster and more usable

Ollama said it now runs fastest on Apple silicon through MLX, Apple's machine-learning framework. Its preview release also added NVFP4 support, cache reuse across conversations, intelligent checkpoints, and smarter eviction, with a Mac-oriented acceleration path for Qwen3.5-35B-A3B on systems with more than 32GB of unified memory. In parallel, llama.cpp reached 100k GitHub stars, and its creator said local agentic workflows are now practical because tool calling and local models have improved enough to support tasks like search, email, summarization, and home automation.

Impact: the local AI stack is getting closer to real everyday agent use on consumer hardware, especially on Macs.

Research & Innovation

Why it matters: Research this cycle focused less on raw scale and more on leverage: better long-context handling, stronger multimodal designs, cheaper training, and harder benchmarks.

  • Massive-context agents without giant context windows: one paper places very large text corpora into directory structures and lets off-the-shelf coding agents navigate them with shell commands and Python instead of stuffing everything into the context window. The reported results were 88.5% on BrowseComp-Plus versus 80% best published, 33.7% on Oolong-Real versus 24.1%, and operation up to 3 trillion tokens. Paper: https://arxiv.org/abs/2603.20432.

  • LongCat-Next: a new multimodal model was presented as "lexicalizing modalities as discrete tokens," with claims that it matches or beats SOTA across multimodal benchmarks, delivers SOTA audio on both recognition and TTS accuracy, and adds vision/audio without hurting core language performance. Resources: paper, GitHub, Hugging Face.

  • daVinci-LLM: this pretraining paper was summarized as matching larger-model performance with half the size, adding 23 points on MATH, and arguing that data quality can matter more than dataset scale. Resources: paper, repo.

  • Reasoning and optimization: ParaGator trains candidate generation and aggregation end-to-end for parallel reasoning, using pass@k for generation and pass@1 for aggregation, with the stated goal of avoiding mode collapse and improving math/scientific reasoning. On the systems side, Gram Newton-Schulz was introduced as a drop-in replacement for Newton-Schulz in Muon, with up to 2x faster performance while preserving validation perplexity within 0.01.

  • Benchmarks remain hard: PRBench introduced 30 expert-curated paper-reproduction tasks across 11 physics subfields, and the cited result was stark: all agents showed zero end-to-end success. Tau Bench added a banking domain with 698 documents across 21 product categories; best models were cited at 25% task success and under 10% on pass@4.
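The filesystem-as-context idea in the first research item reduces to something very simple: shard the corpus into directories and let the agent search it with ordinary tools instead of loading it into the prompt. A toy sketch, with an invented sharding layout and file names:

```python
import pathlib
import tempfile

def build_corpus(root: pathlib.Path, docs: dict) -> None:
    """Shard documents into subdirectories by first letter (a toy layout)."""
    for name, text in docs.items():
        shard = root / name[0]
        shard.mkdir(exist_ok=True)
        (shard / f"{name}.txt").write_text(text)

def search(root: pathlib.Path, needle: str) -> list:
    """A grep-like pass the agent runs on demand, touching only matching files."""
    return sorted(p.name for p in root.rglob("*.txt") if needle in p.read_text())

root = pathlib.Path(tempfile.mkdtemp())
build_corpus(root, {
    "alpha": "harness design notes",
    "beta": "context window limits",
    "bravo": "harness gap measurements",
})
hits = search(root, "harness")
print(hits)  # ['alpha.txt', 'bravo.txt']
```

Because the agent only ever reads the files a query touches, the corpus can grow far past any context window; the paper's trillion-token claim is the same idea at scale, with real retrieval tooling in place of this grep.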

Products & Launches

Why it matters: Product work moved toward usable systems: better voice models, more local tooling, and clearer paths from research models to daily workflows.

  • Voice products improved at both ends of the stack. OpenAI said gpt-realtime-1.5 improves instruction following, tool calling, and multilingual accuracy in the Realtime API, while a new OpenAI developer post summarized Perplexity's lessons from running voice agents in production around context, audio pipelines, and turn-taking. Separately, Cohere Transcribe launched as a 2B-parameter open-weights speech-to-text model with 4.7% AA-WER, roughly 60x real-time transcription, training from scratch on 14 languages, and availability both through Cohere's API and on Hugging Face under Apache 2.0.

  • Local agent tooling kept expanding. ARC (Agent Remote Control) introduced a browser-based remote monitor for local agents, with real-time tool-call visibility, approvals, messaging, native Hermes Agent integration, open source distribution, and end-to-end encryption. AutoClaw launched as a way to run OpenClaw locally with no API key, support for any model or GLM-5-Turbo, and fully local data handling. litesearch packaged a fully local document-ingestion and retrieval stack for agents like Claude Code, using LiteParse, local embeddings, local Qdrant storage, and CLI-native search.

  • Security-conscious agent wrappers are becoming their own category. PokeeClaw positioned itself as an enterprise-secure alternative to OpenClaw, with a secure sandbox architecture, isolated environments, approval workflows, role-based access control, audit trails, and lower token usage.

  • Composable agent skills are spreading. Base44 added 130+ built-in "Superagent Skills" across marketing, operations, data analysis, design, content, coding, and research, with custom skills created from natural-language descriptions and reusable across workflows.

Industry Moves

Why it matters: Corporate signals this cycle were about who owns the agent operating layer, who controls deployment, and where new capital is going.

  • SycamoreLabs launched as a "trusted agent OS for the enterprise" with a $65M seed led by Coatue and Lightspeed, alongside AbstractVC, Dell Technologies Capital, 8VC, Fellows Fund, e14 Fund, and angel investors.

  • Figure AI described its breakup with OpenAI in unusually direct terms. CEO Brett Adcock said Figure got "no value" from the relationship beyond early fundraising, said Figure's internal team outperformed OpenAI's daily, and said the real break came when OpenAI planned to restart robotics, which would have turned Figure's work into competitor training. Figure has since built its own vision-language-action model, Helix, and the cited post said the company is valued at $39B.

  • Anthropic's growth is creating infrastructure strain. A cited report described the company's success as sparking a server crunch.

  • Hugging Face is explicitly pushing a builder strategy. Clement Delangue said the goal is to help "millions" build AI themselves rather than remain API users, and pointed to hf-autoresearch as an example of agent collaboration around checkpoints, datasets, papers, and Hub workflows.

  • Internal agent deployments are becoming business functions. A post about LangChain said its internal GTM agent drove 250% more lead conversions, using Deep Agents for orchestration, multiple data sources for context, and Slack for approvals. A separate build log said a similar agent was rebuilt on DeeplineCLI + Deep Agents in under an hour with roughly 200 lines of config.

Policy & Regulation

Why it matters: The notes were light on formal government action, but governance questions around data consent, auditing, and safety evaluation were prominent.

  • GitHub Copilot training consent: a widely shared warning said GitHub had opted users into training its models on their code by default, including paying customers, and pointed users to Settings > Privacy to disable it.

  • Governance proposals are getting more concrete: Will MacAskill and Fin Moorhouse proposed eight projects aimed at improving the transition to superintelligence, including independent evaluation of AI character traits, benchmarking strategic and philosophical reasoning, auditing models for sabotage and backdoors, and building monitoring and verification tools for collective coordination.

  • Safety debate stayed active: Boaz Barak published a new post titled the state of AI safety in four fake graphs, which Sam Altman publicly endorsed as "a very good post".

Quick Takes

Why it matters: These smaller items help fill in the operating picture around models, agent frameworks, and supporting infrastructure.

  • Qwen 3.6 Plus Preview went live on OpenRouter for a limited free period; Alibaba asked for feedback and noted prompts/completions may be collected during the preview.
  • Codex auto compaction was reported to improve long-session coherence, with one user saying Codex remembers tiny details across multiple rounds of compaction.
  • Hermes Agent added Multi Agent Profiles, giving independent bots separate memory, gateway connections, skills, and chat histories.
  • A new BOOT.md hook in Hermes lets agents save state before restarts and resume with what one post described as zero context loss.
  • OpenAI's Codex App Server is fully open source, includes sign in with ChatGPT, and powers Codex integrations in products like the Codex app and external tools such as JetBrains and T3 Code.
  • PixVerse V6 launched on fal.ai with text-to-video, image-to-video, transition, and extend endpoints, while PixVerse separately promoted V6 as offering more control, better performance, and 15-second 1080p audiovisual generation.
  • LisanBench launched a live benchmark site with leaderboard visualizations, and its creator said a meta leaderboard is next.
  • Triton-Ascend is now public, giving Huawei Ascend hardware a Triton kernel programming model that commenters said could help frameworks like sglang and vLLM run on Ascend without learning AscendC.
  • Gemini Live is now powered by Gemini 3.1 Flash Live.

Copilot Goes Multi-Model as Open Voice and Local AI Accelerate
Mar 31
4 min read
155 docs
Import AI
clem 🤗
Ben Thompson
+6
Microsoft rolled out multi-model research features in M365 Copilot, while Mistral and other open-model builders pushed audio, speech, and multilingual releases forward. Local AI also crossed a symbolic milestone with llama.cpp at 100k stars, as enterprise competition around OpenAI and Anthropic sharpened.

A few shifts stood out today

Microsoft is turning model plurality into a product, open releases are getting stronger across audio and speech, and local AI keeps looking more deployable. The market feels a bit less centered on one flagship model and more on orchestration, efficiency, and where systems actually run.

Microsoft brings multi-model workflows into Copilot

Microsoft introduced Critique in M365 Copilot, a multi-model deep research system that uses multiple models together to generate responses and reports; Satya Nadella said Microsoft's benchmarks show "best-in-class deep research." It also launched Council, which lets users run multiple models on the same prompt at once to compare alignment, divergence, and unique contributions. Both are available now in Frontier.

Why it matters: This is a notable product signal from a major platform vendor: instead of hiding model plurality behind one answer, Microsoft is exposing model collaboration and disagreement as a feature.
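The Council pattern, one prompt fanned out to several models with agreements and divergences surfaced, is easy to prototype against any set of model callables. Stub lambdas stand in for real model clients below; nothing here is Microsoft's implementation:

```python
from collections import defaultdict

def council(prompt: str, models: dict) -> tuple:
    """Fan one prompt out to every model, then group identical answers to
    surface alignment, divergence, and unique contributions."""
    answers = {name: fn(prompt) for name, fn in models.items()}
    groups = defaultdict(list)
    for name, ans in answers.items():
        groups[ans].append(name)
    # The answer backed by the most models is the working consensus.
    consensus = max(groups, key=lambda a: len(groups[a]))
    return answers, consensus

models = {  # stand-ins for real model APIs
    "model-a": lambda p: "42",
    "model-b": lambda p: "42",
    "model-c": lambda p: "41",
}
answers, consensus = council("What is 6 * 7?", models)
print(consensus)  # '42'; model-c's divergent answer is preserved for review
```

The interesting product decision is the one the sketch makes visible: the divergent answer is not discarded, it is shown, which is exactly what distinguishes Council from a router that hides plurality behind one response.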

Open models broaden beyond text

Mistral’s Voxtral TTS is a notable open-audio release

Mistral launched Voxtral TTS, an open-weight multilingual text-to-speech model that supports nine languages and targets real-time streaming for voice agents. Latent Space said the model posted a 68.4% win rate against ElevenLabs Flash v2.5, while Mistral speakers described it as state-of-the-art quality at a fraction of proprietary costs.

Its architecture mixes autoregressive semantic speech tokens with flow matching for acoustic tokens, backed by an in-house neural audio codec at 12.5 Hz; the team also said the setup can extend to long generations via larger context windows.

Why it matters: Open voice models are getting closer to the quality, latency, and cost targets that matter for real-time products.

The broader open-model pipeline was unusually diverse

Interconnects highlighted an unusually broad set of open releases: NVIDIA's Nemotron-3-Super-120B-A12B-NVFP4 with a 1M context window, multilingual support, NVFP4 pre-training, and open pre-/post-training datasets; Cohere's cohere-transcribe-03-2026 speech model with 14 languages under Apache 2.0; Sarvam's 105B and 30B models with strong Indic-language positioning; and Mistral-Small-4-119B-2603 as a hybrid reasoning model with coding abilities. Interconnects argued this kind of domain-specific, cheaper model development is becoming an important complement to the strongest closed agents.

Why it matters: The open ecosystem is spreading across speech, multilingual, regional, and reasoning workloads instead of clustering around one general chatbot race.

Local AI looks more like infrastructure

llama.cpp reached 100k stars, and the stack around it keeps firming up

llama.cpp crossed 100k GitHub stars, has 1,500+ contributors, and Hugging Face said it is bringing Georgi Gerganov and ggml into the team behind what it called the most widely used open-source runtime for local AI. Gerganov said useful local agentic workflows became feasible as models improved tool calling on everyday devices, and Clement Delangue argued that many disappointments with smaller local models are really failures of scaffolding, chat templates, prompt construction, or fine-tuning rather than raw model capability.

Gerganov also described Qwen3.5 as a "step change" across device sizes, while Delangue urged open-source agent tools to rely primarily on open models rather than closed APIs that send data to the cloud.

"The technology is too important to be vendor-locked. It has to be developed in the open, by the community, together with the independent hardware vendors."

Why it matters: This is starting to look less like enthusiast momentum and more like a real deployment path for private, on-device, and cross-platform AI.

Strategy watch

Ben Thompson sees OpenAI's enterprise focus as a competitive necessity

Ben Thompson argued that reports of OpenAI cutting side projects should not be overread as an exit from consumer; instead, he sees a rational shift of resources toward enterprise, where customers pay for productivity gains and Codex has been especially strong. He framed the urgency around Anthropic's enterprise growth, described as moving from a $14B to $19B run rate, and the risk that OpenAI gets shut out if large customers standardize elsewhere.

He also noted OpenAI has pushed back on startup-skewed Ramp chart interpretations and may be stronger in the Fortune 500 than those charts suggest, while arguing that ChatGPT's massive consumer scale creates a harder monetization path because ads are difficult and compute is already heavily committed.

Why it matters: The center of gravity in AI competition may be shifting from consumer reach to enterprise distribution, pricing, and workflow lock-in.

Research watch

Self-improving agent scaffolds advanced, but frontier math remained hard

Import AI highlighted Hyperagents, a self-referential scaffold that lets LLM systems iteratively modify their own prompts and tools. In reported results, the setup improved Polyglot coding performance from 14% to 34%, paper review from 0% to 71%, and robotics reward design from 6% to 37%.

The same roundup pointed to HorizonMath, a benchmark of 100 predominantly unsolved math problems with automated verification, where the top model scored only 7% overall and 50% on the easiest subset.

Why it matters: The capability story remains mixed: better scaffolds are producing real gains on structured tasks, while benchmarks aimed at genuine mathematical discovery are still extremely hard.

PM Operating Systems, Product Builders, and Pricing Architecture
Mar 31
10 min read
81 docs
Aakash Gupta
Sachin Rekhi
Teresa Torres
+5
This issue covers three shifts reshaping product management: persistent AI operating systems for PM work, the rise of the cross-functional product builder, and monetization architecture that lets pricing change in hours instead of quarters. It also includes execution lessons on testing handoffs, engineering trust, career positioning, and practical tools to try.

Big Ideas

1) Claude Code is moving from assistant to PM operating system

Aakash Gupta’s core argument: the best Claude Code users are not relying on one-off chats. They build persistent file-based operating systems with skills, sub-agents, hooks, workflows, and markdown knowledge that improve every future prompt. He positions this as the operating-system layer for people spending 8-10 hours a day in the tool, with the potential to move from roughly 80/100 to 95/100 proficiency.

That is what an operating system is. Not a folder full of files. A system where every interaction makes the next one better.

Why it matters: PM work is highly contextual. A persistent workspace lets stakeholder context, project history, goals, and prior fixes survive beyond one chat window.

How to apply:

  • Start with CLAUDE.md and GOALS.md; the source says those two files deliver 80% of the value on day one.
  • Keep CLAUDE.md current weekly so Claude inherits your role, tools, priorities, and recurring instructions in every message.
  • Add persistent people files and project folders so meeting notes, stakeholder preferences, PRDs, research, and launch results compound over time.
  • Use sub-agents for research and CLIs instead of MCPs to protect context: one example dropped a research task from about 10% of the main context window to 0.5%.

2) The product trio is compressing into product builders

Teresa Torres argues product management, design, and engineering are not dead, but the classic PM-design-engineering trio is collapsing toward a broader product-builder foundation with specialties layered on top. In her framing, AI now gives people a base level of programming, design, product management, and business-context capability, so 1-2 product builders can handle much of the routine 80% of feature work while specialists focus on the harder 20%.

Why it matters: This changes team design and individual expectations. Torres expects smaller, more cross-functional teams, while still arguing that human strengths in alignment, trade-off decisions, organizational context, and innovation remain important.

How to apply:

  • Build horizontal AI skills alongside your core craft; Torres describes this as a modern T-shaped product-builder foundation.
  • Learn to specify what you want and plan with an agent; she says that base foundation no longer requires direct exposure to code for many common web-app tasks.
  • Keep investing in your specialty. Her argument is not that expertise disappears, but that expertise is increasingly paired with AI fluency inside the function itself.
  • If you lead teams, start thinking about safety infrastructure now, including security, accessibility, and code-review agents, because broader participation in building raises review demands.

3) Pricing architecture is becoming core PM territory

The Product Compass makes a blunt case: as AI compresses time spent on Jira, PRDs, and standups, PMs are increasingly responsible for business outcomes, and pricing sits near the center of that shift. Its thesis is simple:

Pricing should live in config, not code.

The article contrasts companies that can change pricing in hours with teams that still need quarters. It cites Vercel shipping 5-6 pricing changes per month, while many companies make 1-2 changes per year and consume a quarter of engineering time for each.

Why it matters: If plans, entitlements, usage limits, and experiments are hardcoded, pricing becomes an engineering bottleneck rather than a product lever.

How to apply: Use the four-pillar test for monetization agility:

  • Unified product catalog: one schema for plans, features, entitlements, and prices.
  • Decoupled entitlements: central runtime rules instead of scattered if (plan == ...) checks.
  • Real-time metering: usage visibility for customers, sales, and finance before the invoice surprise.
  • Control plane: a dashboard where non-engineers can run pricing experiments and adjust limits without code deploys.
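The "decoupled entitlements" pillar can be sketched in a few lines. Everything below (the catalog shape, plan names, and limits) is a hypothetical illustration, not any vendor's actual schema:

```python
# Hypothetical illustration of "decoupled entitlements": feature code
# does a central runtime lookup instead of scattered `if plan == ...`
# checks. Plan names and limits are made up.

CATALOG = {
    "free":       {"seats": 1,   "api_calls_per_day": 100,       "sso": False},
    "pro":        {"seats": 5,   "api_calls_per_day": 10_000,    "sso": False},
    "enterprise": {"seats": 500, "api_calls_per_day": 1_000_000, "sso": True},
}

def entitlement(plan: str, key: str):
    """Feature code asks what is allowed, never which plan it is on."""
    return CATALOG[plan][key]

def can_use_api(plan: str, calls_today: int) -> bool:
    # Central limit check; changing pricing means editing CATALOG
    # (config), not this function.
    return calls_today < entitlement(plan, "api_calls_per_day")
```

Pointing CATALOG at a config service instead of a literal dict is what turns a pricing change into an edit rather than a deploy.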

Tactical Playbook

1) Stand up a lightweight PM operating system in Claude Code

  1. Create CLAUDE.md with your role, work style, installed tools, current priorities, and references to your skills.
  2. Add GOALS.md for quarterly priorities; the source recommends starting here before building more structure.
  3. Set up knowledge/people/ and update it after meetings so stakeholder preferences and recent context are reusable in future communication.
  4. Create one folder per active project, then archive completed projects for reuse on similar work later.
  5. Monitor /status line and /context, and push research to sub-agents instead of the main session when context starts climbing.
  6. Use Jupyter notebooks for CSV analysis when you need transparent, reviewable methodology, and use the ask-user-questions tool when requirements or decision criteria are still fuzzy.

Why this matters: The operating model turns scattered PM work into reusable context and lowers the cost of repeating research, analysis, meeting prep, and writing from scratch.
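As a rough sketch, the first four steps amount to a small file tree. CLAUDE.md, GOALS.md, and knowledge/people/ are named in the source; the remaining folder names and the placeholder contents are illustrative assumptions:

```python
from pathlib import Path

# Illustrative layout only: CLAUDE.md, GOALS.md, and knowledge/people/
# are named in the source; everything else here is an assumption.
root = Path("pm-os")
(root / "knowledge" / "people").mkdir(parents=True, exist_ok=True)
(root / "projects" / "archive").mkdir(parents=True, exist_ok=True)

# Step 1: role, work style, tools, priorities, skill references.
(root / "CLAUDE.md").write_text(
    "# Role, work style, installed tools, current priorities, skills\n")
# Step 2: quarterly priorities.
(root / "GOALS.md").write_text("# Quarterly priorities\n")
# Step 4: one folder per active project; move to archive/ when done.
(root / "projects" / "example-active-project").mkdir(exist_ok=True)
```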

2) Close the gap between acceptance criteria and actual testing

A Reddit post surfaced a familiar failure mode: a PM wrote the checkout flow step by step in the PRD, but QA backlog, outdated scripts after a UI change, and mutual assumptions meant the flow still shipped broken. The PM’s takeaway was that knowing the flow well was not enough because the knowledge never became an executable test.

How to apply:

  1. Identify flows where a broken handoff would create visible customer damage, such as checkout or onboarding.
  2. Convert plain-English acceptance criteria into something that runs against the actual product, not just a documentation artifact.
  3. Review screenshots or pass/fail evidence before sprint review, rather than assuming regression coverage exists.
  4. If QA ownership is fragmented, treat PM participation in testing as a temporary control, not an exception.
  5. Do not rely on documentation alone to solve the problem; one community response argued the handoff gap still comes back to direct communication.
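A minimal sketch of step 2. The checkout function below is a stub standing in for the real product (a real setup would drive the actual app, for example through browser automation); the point is the shape: plain-English criteria bound to executable checks that return pass/fail evidence.

```python
# Illustrative only: acceptance criteria as executable checks rather
# than prose in a PRD. fake_checkout is a stub for the real product.

def fake_checkout(cart, coupon=None):
    total = sum(cart.values())
    if coupon == "SAVE10":
        total *= 0.9
    return {"total": round(total, 2), "confirmed": total > 0}

CRITERIA = [
    ("order confirms for a non-empty cart",
     lambda: fake_checkout({"book": 20.0})["confirmed"]),
    ("SAVE10 coupon applies 10% off",
     lambda: fake_checkout({"book": 20.0}, "SAVE10")["total"] == 18.0),
    ("empty cart does not confirm",
     lambda: not fake_checkout({})["confirmed"]),
]

def run_criteria():
    # Each criterion yields pass/fail evidence a PM can review
    # before sprint review.
    return [(name, bool(check())) for name, check in CRITERIA]
```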

3) Put pricing on a monthly operating cadence

The Product Compass suggests a two-hour monthly pricing meeting with four blocks: customer data, learnings scan, product-to-pricing roadmap sync, and decisions/actions.

How to apply:

  1. Review usage, billing, and approaching-limit customers to spot expansion candidates and churn risk.
  2. Add cross-functional input from sales, CS, finance, marketing, and growth on win/loss patterns and pricing friction.
  3. For every feature shipping in the next 30-90 days, decide its monetization stance up front; the rule proposed is that no feature ships without one.
  4. Leave with 1-3 local experiments, each with an owner, hypothesis, timeline, and expected impact.

Why this matters: It separates infrequent global pricing changes from continuous local experiments, giving PMs a repeatable way to connect product roadmap and revenue decisions.

4) When engineering relationships are political, build trust before trying to redirect the roadmap

Community advice in a discussion about resistant developers was consistent on one point: trust comes before leverage. The recommended pattern was to listen first, find the influential developer, make small suggestions once you are situated, and avoid upending a team’s plan immediately as a newcomer.

How to apply:

  1. Treat developers as partners, not order takers; commenters framed weak PM-engineering trust as the root problem in these scenarios.
  2. Build credibility by representing the existing roadmap before advocating major changes.
  3. If your manager reassigns you or inserts themselves into the work, ask directly what pattern they are seeing and what feedback you need to hear.

Case Studies & Lessons

1) Monetization architecture changed shipping speed at Zep, Plotly, and Automox

  • Zep: modeled plans and entitlements, went from trial start to production in 4 days, and later used limit enforcement to improve free-to-paid conversion while giving sales earlier visibility into usage.
  • Plotly: launched two AI products two quarters faster because catalog and entitlements were already modeled centrally.
  • Automox: after years of hardcoded monetization logic across two billing systems, it cut time-to-launch for new pricing tiers by 75% and freed two full-time engineers from maintenance work.

Lesson: Pricing agility is not only a packaging problem. It is an architectural capability that determines how quickly PMs can test monetization ideas.

2) A broken checkout flow showed that a PRD is not a test plan

One PM’s postmortem described a flow that was written clearly in a Notion PRD, demoed repeatedly, and still shipped with a production bug because no one converted that knowledge into an updated test. After adopting a plain-English testing tool that ran on real devices and returned screenshots plus step-level pass/fail, the PM says they caught two production-bound issues in the first week.

Lesson: The verification loop breaks when documentation, QA scripts, and ownership drift apart. The fix is executable validation, not better prose alone.

3) Horizontal expansion can damage the core product

Teresa Torres says Zapier’s expansion into adjacent products has coincided with degradation in the core automation experience, citing repeated failures where zaps did not trigger. Her workaround has been to ask Claude to build custom webhook listeners because she finds the resulting code more reliable and easier to control for error handling. She adds that she is slowly moving off both Zapier and Airtable because of persistent quality issues.

Lesson: New surface area can hide declining reliability in the core workflow. PMs expanding horizontally need to watch quality metrics on the original product, not just adoption of the new bets.

Career Corner

1) The safest career move right now is becoming a stronger product builder

Torres’ career advice is direct: build horizontal AI skills while continuing to deepen your functional expertise. She argues that if you do not learn how to use AI inside your function, you will no longer be seen as an expert in that function, and she notes that job descriptions and interview processes are already changing.

How to apply: Practice two skills now: specifying what you want clearly and planning work with agents, then pair that with deeper expertise in your primary craft.

2) Early-career PMs should optimize for signal, not resume mythology

Advice to an APM with informal startup experience was straightforward: include the work on the resume, but focus on what you did, the problems you solved, and your responsibilities, not on ownership structure or proprietorship details. The same commenter suggested staying in the APM role for at least 1-2 years to build clearer, more relevant product experience before making the next move.

3) Domain switches are harder in an oversupplied market

A PM with about four years in data and analytics product management said they were reaching final rounds for customer-facing roles but losing out to candidates with more direct domain experience, despite feedback that their core PM skills were transferable. They also pointed to candidate oversupply as part of the problem.

Takeaway: In the current market, transferable PM skill is still valuable, but it may not beat direct domain familiarity when employers have many candidates to choose from.

Tools & Resources

1) PM OS starter repos

Why explore them: Both are meant to reduce setup friction and give PMs a concrete file structure, skills layout, and workflow starting point.

2) Jupyter notebooks for auditable analysis

The recommendation here is to ask Claude to analyze data in a Jupyter notebook so every query, output, and chart is preserved as code cells and rendered results.

Use it when: you need analysis that a manager or data scientist can verify step by step, rather than a black-box summary.

3) The ask-user-questions tool

Claude can generate a terminal UI with checkboxes and input fields to gather requirements, fill context gaps, or support decisions instead of guessing.

Use it when: assumptions are the main failure mode in discovery or planning.

4) A prompt-optimization loop for recurring agent workflows

Aakash Gupta describes a Karpathy-style loop for prompts: pick the prompt to improve, use 2-3 realistic test inputs and 3-6 binary quality checks, run repeated evaluations, mutate one variable at a time, keep winners with version control, and revert losers. He cites a pace of about 12 experiments per hour and roughly 100 overnight.

Use it when: you have a prompt or system instruction that is already good enough, but not yet reliable, in workflows like support, internal automations, extraction, or code review.
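The loop itself is simple enough to sketch. The scoring and mutations below are stubs (real binary checks would run the prompt against a model on each test input); only the keep-winners, one-mutation-at-a-time structure reflects the described workflow:

```python
import random

# Stubbed sketch of the loop: fixed test inputs, binary checks, one
# mutation per iteration, keep winners. score() is a stand-in; real
# checks would call the model on each entry in TEST_INPUTS.

TEST_INPUTS = ["refund request", "angry escalation", "feature question"]

def score(prompt: str) -> int:
    """Binary quality checks (stubbed): count how many pass."""
    checks = [
        "tone" in prompt,    # e.g. "responds in a neutral tone"
        "steps" in prompt,   # e.g. "answers in numbered steps"
        len(prompt) > 40,    # e.g. "prompt gives enough guidance"
    ]
    return sum(checks)

MUTATIONS = [                # one variable changed at a time
    lambda p: p + " Keep a neutral tone.",
    lambda p: p + " Answer in numbered steps.",
    lambda p: p + " Cite the relevant policy.",
]

def optimize(prompt: str, iterations: int = 20, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = prompt, score(prompt)
    history = [(best, best_score)]       # keep winners under version control
    for _ in range(iterations):
        candidate = rng.choice(MUTATIONS)(best)
        s = score(candidate)
        if s > best_score:               # revert losers by never keeping them
            best, best_score = candidate, s
            history.append((best, best_score))
    return best, best_score, history
```

The history list plays the role of version control here: every accepted variant is kept, so a regression can be rolled back to any earlier winner.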

5) Reforge AI Productivity

Sachin Rekhi says the updated live sessions are focused on what has become most actionable for PMs over the last six months: automating PM workflows with Claude Code, the AI prototyping mastery ladder, AI-powered customer discovery, and AI-enhanced product strategy and execution.

USDA Acres Risk, Record Biofuel Demand, and Brazil’s Rising Input Stress
Mar 31
8 min read
153 docs
Market Minute LLC
GrainStats 🌾
Tarım Editörü
+10
U.S. grains head into the USDA acres-and-stocks report with strong corn export pace, weaker soybean exports, and fresh support from record RFS volumes. Brazil remains a major supply anchor, but fertilizer, freight, weather, and logistics are tightening farm economics and shipment flow.

Market Movers

  • United States — row crops: Grain trade is rotating from crude-oil-driven momentum toward Tuesday's USDA acres and quarterly stocks reports. Pro Farmer's survey put corn at 96 million acres and soybeans at 84.25 million acres, while other pre-report references centered closer to 94.4-94.5 million corn acres. Multiple sources noted the survey window predated the latest war- and fertilizer-driven disruption, so the report may not fully capture the most recent acreage recalculation. Quarterly corn stocks are also expected to run more than 1 billion bushels above last year, while funds recently added about 50,000 corn contracts and remain heavily committed to the soybean complex.

  • United States — exports: Weekly export inspections were 70.4 million bushels for corn, 21.5 million for soybeans, and 13.4 million for wheat. Marketing-year corn inspections are 298 million bushels ahead of the pace needed to hit USDA's target, wheat is 53 million bushels ahead, but soybeans remain 112 million bushels behind. GrainStats separately noted that corn exports do not depend on China the way soybeans do; last week's inspections to China were 9.9 million bushels of soybeans and zero corn and wheat.

  • United States — biofuels and soy complex: EPA finalized the highest RFS volumes in program history, keeping conventional biofuels at 15 billion gallons for 2026 and lifting biomass-based diesel to roughly 5.5-5.7 billion gallons. Reallocated small-refinery waivers add another 200-250 million gallons per year to 2026-2027 volumes. Analysts tied the policy to stronger bean-oil-led soybean trade and better long-run domestic use for soybeans.

  • United States/Australia — wheat: Chicago wheat futures rose on concern that the Iran war could lift farmer energy and fertilizer costs, while Plains drought widened the Kansas City-Chicago spread to the largest hard red winter premium since August. Australia was also cited for drought and fuel shortages, even as one analyst noted global wheat stocks-to-use remains high overall.

  • United States — livestock: Cattle futures broke to new highs after a technical breakout, firmer cash trade, and tighter beef supplies. Beef cold storage fell 5% year over year to about 413 million pounds. Hogs closed lower on end-of-month profit taking even though the March 1 inventory of 74.3 million head came in below pre-report expectations; pork bellies in cold storage were down about 6-7% from a year ago.

Innovation Spotlight

  • Brazil — nematode biotech: BASF said its Soja NRS trait is aimed at cyst and Pratylenchus nematodes that cover about 90% of Brazilian territory. The company tied nematodes to roughly R$35 billion in annual production losses, including R$16-17 billion in soybeans alone, and said the trait could add R$15-18 billion to the value chain. Because the control is embedded in the soybean plant, BASF expects lower nematode populations to carry productivity benefits into second-crop corn or cotton. Commercial introduction in Brazil was described as a 2-3 year timeline after 10-12 years of development.

  • Brazil — lower-emission fertilizers in potatoes: In Paraná, six growers covering about 130 hectares are using Iara's Climate Choice portfolio with 4C management and GHG-protocol tracking. The program estimates up to 40% lower carbon footprint, while participants reported higher productivity, more dry matter, and better tuber quality that earns factory bonuses. PepsiCo is financing the price difference versus conventional fertilizer as part of its 2030 emissions target.

  • United States — machinery and fuel systems: John Deere's E98 ethanol-powered 8R pairs 350 hp with no DEF and a reported fuel-cost advantage of roughly $1/gallon ethanol versus $3/gallon diesel, even though volumetric fuel use is higher. Deere and RDO also described the new high-horsepower 8R/8RX package as an ultimate planting tractor that can power electric row units through a single cord, run up to 110 gpm hydraulics, and plant up to 1,200 acres/day with large 24-row-plus planters in some setups.

Regional Developments

  • United States — planting and weather: Southern corn planting is already moving, with progress reported at 76% in Louisiana, 53% in Texas, 50% in Mississippi, 23% in Arkansas, 2% in Kansas, and just above zero in Illinois. At the same time, substantial mid-week rainfall is expected across Iowa, Illinois, Missouri, Minnesota, Michigan, and Indiana near early planting dates, while hard red winter wheat areas farther west may miss needed rain.

  • Brazil — soybean and corn supply: AG Rural raised Brazil's 2025/26 soybean forecast to 178.4 million tons, with harvest at 75%. Center-South first-crop corn harvest reached 59.4%, versus 56.4% a year ago, while safrinha planting is nearly 98% complete and already 100% done across Mato Grosso's projected 7.3 million hectares. Total corn output was trimmed to 135.7 million tons.

  • Brazil — April weather: Forecasts point to a hotter, drier first half of April followed by stronger second-half rainfall. That pattern was described as supportive for later-planted safrinha corn in Mato Grosso, Goiás, and Minas Gerais because rain is not expected to cut off early, even though late-stage crops face some stress in the hotter opening weeks.

  • Brazil — Mato Grosso logistics: Road conditions on MT-240 in Paranatinga are interrupting soybean movement. One farmer with 1,800 hectares of soybeans still had 800 hectares unharvested and said at least 500 hectares could be lost if truck delays continue. Another producer reported about 15% crop loss where trucks were crossing cropped ground used as a bypass.

  • Brazil — pork exports: Brazil has consolidated as the third-largest global pork exporter. Asia absorbs about 70% of shipments, led by the Philippines, with Vietnam and Japan also important. Slaughter capacity was described as up roughly 3% to meet export demand, although domestic live-hog prices have weakened since February, making export flow more important for balance.

Best Practices

  • Soybeans — variable populations by field: In soybean fields with iron deficiency chlorosis risk, light soils, late planting, or low fertility, Ag PhD recommended raising populations to 160,000-180,000 plants. The main payoff cited was faster canopy closure for better weed control. Lower populations were described as better suited to very high-fertility fields or areas with high white mold risk, and variable-rate population by field was encouraged to balance productivity and seed cost.

  • Soybeans — Southeast U.S. stand management: A separate Southeast system targeted an 80,000 final stand to drive branching and reduce lodging risk. After a pounding rain crusted 15-inch rows, pivot irrigation was used to soften the soil surface and help emergence rather than to add moisture.

  • Weed control — start residual programs early: Brownfield's herbicide guidance emphasized keeping fields clean early in spring, when weed competition can lock in yield loss. Resicore Rev was cited with up to 8 weeks of residual control, three modes of action, and compatibility with UAN and ATS as well as other pre-emerge mixes.

  • Small grains — manage nitrogen and lodging together: Tissue sampling at jointing (GS5/6) was used to map variable-rate nitrogen needs, while the same operation targeted about 80 heads/sq ft for 80+ bushel grain and planned a growth regulator to reduce lodging risk.

  • Soil systems — conserve structure, then manage pressure: Brazil's no-till system now covers nearly 90% of cultivated land, and a separate U.S. field example showed strip-till absorbing 2.5-3 inches of rain and reducing runoff. The trade-off in tropical systems is higher year-round pest, weed, and disease pressure because of the permanent green bridge.

  • Dairy — benchmark economics first: Espírito Santo's new dairy sustainability framework starts with 103 indicators, including 78 economic measures focused on genetics, feeding, productivity, and milk quality. The rollout begins with 400 farms and shared technical assistance from public extension, Senar, and cooperatives.

Input Markets

  • United States — fertilizer availability: Fertilizer remains the sharpest input risk. Producers reported booking anywhere from 50% to 85% of needs, but some still faced delivery problems even with contracts in hand. Forward contracting was emphasized, while crude oil also moved above $100 for the first time since 2022.

  • Brazil — policy and freight pressure: Brazilian producers face a 2% fertilizer cost increase from new PIS/COFINS rules starting April 1, while the minimum freight table removes cheaper return freight from ports and diesel hikes add more cost. Local commentary also said grain prices have not kept pace with fertilizer inflation, worsening the input-output trade ratio.

  • Global — sulfur bottleneck: Brazilian industry said fertilizer production is being reduced worldwide because sulfur flows from the Arabian Gulf have been disrupted. China was cited as having temporarily banned fertilizer exports because of sulfur scarcity, and one major supplier was said to have only about two months of operating room without normalization of Gulf routes.

  • Feed — support for pork, pressure for cattle: Brazil's pork export spread is still around 40% because export prices near $2,500 and low feed costs are supporting margins. In contrast, U.S. cattle commentary warned that drought will tighten forage supplies and raise feed costs.

  • Crop chemicals — residual programs favored: Early-season chemical guidance continues to favor residual products that fit fertilizer passes. Resicore Rev was specifically highlighted for UAN/ATS tank-mix flexibility and longer residual control.

Forward Outlook

  • United States — USDA report risk: The next 24 hours are report-driven. USDA stocks and acres are being treated as a wild card; Market Minute said corn has traded higher on report day for five straight years, with an average move of 12 cents over the past 10 years. But multiple sources cautioned that survey timing predates the latest war/fertilizer shock, so acreage may keep moving after the report.

  • United States — corn versus soybeans: Corn acreage may stay sticky where nitrogen was prepaid or fall-applied, but soybeans still have room to gain if spring fertilizer and fuel costs escalate. One analyst also argued that corn's recent price rally can offset some fertilizer inflation, which may limit late acreage switching.

  • United States — biofuel demand is now more structural: Beyond the immediate bean-oil reaction, EPA said the RFS now creates $31 billion of value for American corn and soybean oil. Starting in 2028, foreign fuels and feedstocks receive only half the RFS compliance value of U.S.-made products, reinforcing the domestic demand story.

  • United States/Brazil — seasonal timing: Southern U.S. planting is already active, Corn Belt rain is arriving close to early plant dates, and Brazilian second-crop corn is expected to rely on a dry first half of April followed by better second-half moisture for grain fill. Large crop potential remains intact in Brazil, but local road failures and input logistics are now important watchpoints alongside weather.

Your time, back.

An AI curator that monitors the web nonstop, lets you control every source and setting, and delivers one verified daily brief.

Save hours

AI monitors connected sources 24/7—YouTube, X, Substack, Reddit, RSS, people's appearances and more—condensing everything into one daily brief.

Full control over the agent

Add/remove sources. Set your agent's focus and style. Auto-embed clips from full episodes and videos. Control exactly how briefs are built.

Verify every claim

Citations link to the original source and the exact span.

Discover sources on autopilot

Your agent discovers relevant channels and profiles based on your goals. You get to decide what to keep.

Multi-media sources

Track YouTube channels, Podcasts, X accounts, Substack, Reddit, and Blogs. Plus, follow people across platforms to catch their appearances.

Private or Public

Create private agents for yourself, publish public ones, and subscribe to agents from others.

Get your briefs in 3 steps

1

Describe your goal

Tell your AI agent what you want to track using natural language. Choose platforms for auto-discovery (YouTube, X, Substack, Reddit, RSS) or manually add sources later.

Stay updated on space exploration and electric vehicle innovations
Daily newsletter on AI news and research
Track startup funding trends and venture capital insights
Latest research on longevity, health optimization, and wellness breakthroughs
Auto-discover sources

2

Confirm your sources and launch

Your agent finds relevant channels and profiles based on your instructions. Review suggestions, keep what fits, remove what doesn't, add your own. Launch when ready—you can always adjust sources anytime.

Discovering relevant sources...

Sam Altman · Profile
3Blue1Brown · Channel
Paul Graham · Account
The Pragmatic Engineer · Newsletter · Gergely Orosz
r/MachineLearning · Community
Naval Ravikant · Profile
AI High Signal · List
Stratechery · RSS · Ben Thompson

3

Receive verified daily briefs

Get concise, daily updates with precise citations directly in your inbox. You control the focus, style, and length.

Harnesses Become the Real Lever as Codex Lands in Claude Code
Mar 31
6 min read
103 docs
Claude
LangChain
Jason Zhou
+17
The best signal today is that coding-agent performance is increasingly a harness problem, not just a base-model problem. Also inside: the open-source Codex bridge into Claude Code, practical workflows for local models and secure orchestration, and the clips worth watching.

🔥 TOP SIGNAL

The strongest signal today: the harness is now a first-order performance variable. Georgi Gerganov says most local coding-agent failures come from the harness, chat template, prompt construction, and inference chain, not just the model. Matt Maher’s 100-feature PRD benchmark found Cursor improved frontier-model results by 11% on average, with Opus scoring 20% higher there than in Claude Code. And the open-source Meta Harness paper summary says changing the harness around a fixed model can create a 6x gap.

For builders, benchmarking the base model alone is increasingly the wrong abstraction; routing, review, retrieval, debugging visibility, and context handling are where a lot of the practical edge is moving.

🛠️ TOOLS & MODELS

  • Codex plugin for Claude Code. OpenAI shipped openai/codex-plugin-cc so Claude Code users can delegate tasks to Codex or have Codex review changes with a ChatGPT subscription. Huet says the pattern they already saw in the wild was Codex for review and GPT-5.4 for more complex tasks. Commands: /codex:review, /codex:adversarial-review, /codex:rescue.
  • Open Codex substrate. Huet says Codex CLI and Codex app server are open source so the same ChatGPT subscription can be used in the app, terminal, JetBrains, Xcode, OpenCode, Pi, and Claude Code. The new Claude Code plugin is built on that same open-source app server + harness, including the same models, parallel tasking, and review flow.
  • Codex got a context upgrade. Mark Chen says Codex now has auto compaction, and an early user report says it remembers tiny details across multiple rounds of compaction.
  • Harness comparison that matters. In Matt Maher’s benchmark of frontier models implementing a 100-feature PRD, Cursor improved results from Gemini 52→57, GPT-5.4 82→88, and Opus 77→93; Theo highlighted Opus being 20% higher there than in Claude Code.
  • Local-model family to test now: Qwen3.5. Georgi Gerganov calls it a step change across device sizes. His tested local coding/chat/MCP set included gpt-oss-120b, Qwen3-Coder-30B, GLM-4.7-Flash, MiniMax-M2.5, and Qwen3.5-35B-A3B, mostly in Q8_0 variants. He says tool-calling quality still depends on both model intelligence and chat-template parsing in llama.cpp.
  • Claude Code widened enterprise support. GitHub Enterprise Server now works across Claude Code on the web, iOS, Android, and Code Review, so self-hosted repos no longer need to move to github.com for async workflows. Docs: code.claude.com/docs/en/github-enterprise-server.
  • Claude Code added computer use, but early cost reports are rough. Anthropic says Claude can open apps, click through UI, and test what it built from the CLI in research preview on Pro/Max plans. Theo’s firsthand reaction: it used up his rate limits in 2 minutes despite a $200/month plan.
  • LangSmith Experiments got a more useful failure view. LangChain rebuilt the detail view to cut clutter and show better traces, clearer evaluator reasoning, and easier comparisons when debugging agent failures. Try it at smith.langchain.com.

💡 WORKFLOWS & TRICKS

  • Cross-model review loop from Claude Code
    1. Install with /plugin marketplace add openai/codex-plugin-cc or from the repo.
    2. Use /codex:review for a standard read-only pass, /codex:adversarial-review when you want a challenge pass, or /codex:rescue to hand a task off.
    3. Keep Claude Code as your front-end if you like, but route review to Codex and heavier tasks to GPT-5.4—the pattern Huet says users were already doing manually.
  • Local-model bring-up: don’t benchmark a broken stack
    1. Start with the highest-quality model that fits your hardware.
    2. Use your own harness—or llama-server’s webui with MCP—so you know what the stack is actually doing.
    3. Only then optimize with quantization or community parameter tuning.
    4. If results still look bad, inspect the whole chain: harness, chat template, prompt construction, and inference bugs.
  • Claude Code can build real artifacts end-to-end. Simon Willison’s flow: clone nanochat, pull model weights, use the Space demo source to fill in the inference script, then have Claude Code read the LLM plugin tutorial and finish the plugin. The output repo is public, and Simon says it was his first full model-plugin build this way and it worked really well.
  • Keep secrets out of context; let the agent do the plumbing. Kent C. Dodds says Claude Desktop cancelled a scheduled Cursor cloud agent, asked for a Cloudflare API token securely so it never entered context, generated an EC P-256 keypair, deployed a Worker, and updated Cloudflare routing to finish a Tesla integration. Reusable pattern: human-mediated auth, agent-executed infra steps, MCP as the handoff surface.
  • If Claude Code usage suddenly spikes, test the CLI path. Theo, relaying a reverse-engineered report he says he did not independently confirm, points to a standalone Bun-binary cache bug with a workaround of npx @anthropic-ai/claude-code, plus a separate --resume issue that may still break cache. He says uncached tokens can be 10x-20x more expensive.
  • Push human effort into plan mode. Jason Zhou says his strongest engineers now spend their time giving context and making technical decisions while the agent executes across multiple sessions.

👤 PEOPLE TO WATCH

  • Georgi Gerganov — probably the clearest current explainer of why local coding agents disappoint: harness, chat templates, prompt construction, inference bugs, and a practical bring-up order. Simon Willison says this matches his own experiments.
  • Romain Huet — high signal because he’s shipping actual workflow glue, not just demos: the open-source Codex plugin for Claude Code, concrete commands, and the open-source Codex app server/CLI underneath.
  • Simon Willison — published a full transcript of Claude Code building a real model plugin end-to-end; good benchmark for what a successful serious use case looks like.
  • Kent C. Dodds — worth following for real MCP + infra orchestration patterns, especially his clear secret-handling boundary where tokens stay out of model context.
  • Jason Zhou — useful on where coding agents meet product/design: he trained non-technical designers on Cursor + GitHub, and now ships a /superdesign skill that scans the codebase before designing in context.

🎬 WATCH & LISTEN

  • 14:15-16:39 — Meta Harness’s self-improvement loop. Fastest clean explanation of the pattern: store source, scores, and traces on disk; let a coding-agent proposer inspect prior failures; iterate the harness instead of stuffing everything into one prompt.
  • 37:25-38:28 — Jason Zhou’s /superdesign flow. Nice crossover example: the agent first scans the codebase and component system, then opens the browser and designs with actual product context instead of guessing from scratch.

📊 PROJECTS & REPOS

  • openai/codex-plugin-cc — open-source bridge letting Claude Code call Codex for task delegation or code review via ChatGPT subscription. Signal that it solves a real behavior: Embiricos says enough developers were already using Codex to review Claude outputs that OpenAI decided to lean into it.
  • Meta Harness — new open-source project from Stanford, MIT, and Krafton for end-to-end optimization of model harnesses; paper and code are already out. The core design is a coding-agent proposer with filesystem access that iterates on prior harnesses instead of relying on a fixed scaffold.
  • simonw/llm-mrchatterbox — a useful reference repo for the Claude Code-built-plugin pattern; Simon says it was his first full model-plugin build this way and he expects to use the method again.
  • Karpathy’s autoresearch — Matthew Berman cites it as a close cousin to Meta Harness, with 61k stars and a self-improving loop that runs experiments and learns from prior results.

Editorial take: the fastest-moving edge in coding agents is no longer just model choice; it’s the harness around the model — memory, routing, review, and debugging visibility.

Extreme Ownership Leads a Founder Reading List on Accountability, Policy, and AI Safety
Mar 31
3 min read
143 docs
The Verge
Boaz Barak
Marc Andreessen
+3
Marc Andreessen's recommendation of Jocko Willink's *Extreme Ownership* stands out because he ties it to a clear operating framework rather than offering generic praise. Other authentic picks include the *Draghi Report*, Tony Fadell's Apple II history article, and Sam Altman's nod to a Boaz Barak AI safety post.

Most compelling recommendation: Extreme Ownership

Marc Andreessen's endorsement of Extreme Ownership is the clearest signal in today's set because he explains exactly what he took from the book and how he uses it. He says the idea of assuming fault first helps him focus on self-improvement, relieve stress, drain resentment, and lean on intrinsic rather than extrinsic motivation; he also says he fell in love with the book when it came out and wanted to share it with founders.

  • Content type: Book
  • Author/creator: Jocko Willink
  • Link/URL: Not provided in the source material
  • Who recommended it: Marc Andreessen
  • Key takeaway: Treat "extreme ownership" as an operating rule: if something goes wrong, start by assuming it is your fault, then improve what you can control
  • Why it matters: This stands out because the recommendation comes with a concrete framework, not just praise. Andreessen ties it to a repeatable psychology he says is useful precisely when external rewards are not enough

"Life just gets a lot simpler if you just assume everything is your own fault."

Other clearly organic picks

A pattern in today's set: the highest-signal recommendations came with explicit operating rules, while the lighter picks still help readers track what prominent tech leaders found worth sharing.

Draghi Report

  • Content type: Report
  • Author/creator: Mario Draghi
  • Link/URL: Not provided in the source material
  • Who recommended it: Marc Andreessen
  • Key takeaway: Andreessen says readers should "just read that report" because "everything's in there," and frames it as a set of prescriptions for building stronger tech ecosystems in Europe
  • Why it matters: He presents it as an execution playbook rather than a diagnosis alone; the gap, in his telling, is implementation

Apple II Forever!

  • Content type: Article
  • Author/creator: The Verge
  • Link/URL: https://www.theverge.com/tech/900677/apple-ii-personal-computer
  • Who recommended it: Tony Fadell
  • Key takeaway: Fadell says the Apple II was his first computer; he saved up by caddying while his grandfather matched his earnings, then read MacWorld and other computer magazines while dreaming of joining the Macintosh team
  • Why it matters: The recommendation is anchored in a formative personal story about how early computing shaped a future hardware builder's ambitions

the state of AI safety in four fake graphs

  • Content type: Blog post
  • Author/creator: Boaz Barak
  • Link/URL: Announcement post — https://x.com/boazbaraktcs/status/2038606572046172443
  • Who recommended it: Sam Altman
  • Key takeaway: Altman called it "a very good post"
  • Why it matters: The source material does not include a fuller summary of the post's thesis, but the endorsement itself is a strong signal for readers tracking what prominent tech leaders are flagging in AI safety discussions

Claude Code Expands, Qwen3.5-Omni Ships, and Harness Engineering Takes Center Stage
Mar 31
9 min read
643 docs
Stephanie Palazzolo
elvis
Jason Weston
+50
The biggest developments were a more capable Claude Code, Alibaba's Qwen3.5-Omni release, and a growing body of evidence that harness design is becoming a core performance lever. This brief also covers measurable enterprise ROI, faster local AI stacks, new research papers, funding and strategy moves, and governance-related updates.

Top Stories

Why it matters: This cycle's biggest signals were about agent execution: models are getting better at acting on software, multimodal systems are widening the interface, and performance is increasingly coming from the harness around the model as much as the model itself.

Claude Code moved closer to a full software-testing loop

Anthropic added Computer use to Claude Code, letting Claude open apps, click through interfaces, and test what it built directly from the CLI; the feature is in research preview on Pro and Max plans. At the same time, Claude Code and Code Review added GitHub Enterprise Server support for async workflows on self-hosted repos. Anthropic staff also said they open sourced a plugin so Claude Code users can call Codex from a ChatGPT subscription for reviews, adversarial reviews, and rescue flows.

Impact: this is a step from code generation toward a tighter write-build-run-verify loop, and it makes Claude Code easier to use inside enterprise GitHub setups.

Qwen3.5-Omni pushed multimodal interaction further into the product layer

Alibaba released Qwen3.5-Omni, a model for text, image, audio, and video understanding with real-time interaction features including semantic interruption, built-in web search, and complex function calling. Alibaba highlighted script-level captioning, support for up to 10 hours of audio or 400 seconds of 720p video, 113 speech-recognition languages, and 36 output languages, plus an "Audio-Visual Vibe Coding" workflow that turns camera-described ideas into a website or game. The company also said the model is open access via Hugging Face, with the caveat that "omni" here refers to interpreting image and voice, not generating them.

Impact: Alibaba is packaging multimodal reasoning, voice interaction, and tool use into a surface that looks closer to a general-purpose AI application platform.

Harness engineering is turning into a primary performance lever

Several results this cycle pointed in the same direction: the system around the model matters more than many teams assumed. Meta Harness said prompt/tool/retry/context choices alone can create a 6x performance gap on the same model, and that harness deltas are now wider than frontier-model deltas. In Matt Maher's 100-feature PRD benchmark, a post said Cursor improved model performance by 11% on average, with Opus rising from 77% to 93%. CMU's CAID paper reported +26.7 points on PaperBench and +14.3 points on Commit0 over single-agent baselines by coordinating isolated git worktrees and explicit integration via git.

"The delta between harness implementations on the same model is not. That's where the leverage is."

Impact: performance gains are increasingly coming from coordination, evaluation loops, and tool design, not only from bigger base models.

Enterprise deployments are producing measurable ROI

Two deployment examples stood out for hard numbers. Novo Nordisk is using AI agents built on Anthropic and OpenAI models to detect trial risks, automate site selection, and flag process redundancies, shaving weeks to months off clinical trials and potentially accelerating time-to-market by hundreds of millions of dollars. Separately, a Shopify case study said the company cut annual AI deployment costs from $5.5M to $73K by decomposing business logic, modeling intent with DSPy, and optimizing a smaller model while maintaining performance; the cited estimate for scaling to 150,000 shops fell from $41M to $73K.

"The juice is clearly worth the squeeze."

Impact: the strongest enterprise signal in the notes was not hype but faster trials, lower operating cost, and maintained performance.

Local AI stacks got faster and more usable

Ollama said it now runs fastest on Apple silicon through MLX, Apple's machine-learning framework. Its preview release also added NVFP4 support, cache reuse across conversations, intelligent checkpoints, and smarter eviction, with a Mac-oriented acceleration path for Qwen3.5-35B-A3B on systems with more than 32GB of unified memory. In parallel, llama.cpp reached 100k GitHub stars, and its creator said local agentic workflows are now practical because tool calling and local models have improved enough to support tasks like search, email, summarization, and home automation.

Impact: the local AI stack is getting closer to real everyday agent use on consumer hardware, especially on Macs.

Research & Innovation

Why it matters: Research this cycle focused less on raw scale and more on leverage: better long-context handling, stronger multimodal designs, cheaper training, and harder benchmarks.

  • Massive-context agents without giant context windows: one paper places very large text corpora into directory structures and lets off-the-shelf coding agents navigate them with shell commands and Python instead of stuffing everything into the context window. The reported results were 88.5% on BrowseComp-Plus versus a published best of 80%, 33.7% on Oolong-Real versus 24.1%, and operation on corpora of up to 3 trillion tokens. Paper: https://arxiv.org/abs/2603.20432.

  • LongCat-Next: a new multimodal model was presented as "lexicalizing modalities as discrete tokens," with claims that it matches or beats SOTA across multimodal benchmarks, delivers SOTA audio on both recognition and TTS accuracy, and adds vision/audio without hurting core language performance. Resources: paper, GitHub, Hugging Face.

  • daVinci-LLM: this pretraining paper was summarized as matching larger-model performance with half the size, adding 23 points on MATH, and arguing that data quality can matter more than dataset scale. Resources: paper, repo.

  • Reasoning and optimization: ParaGator trains candidate generation and aggregation end-to-end for parallel reasoning, using pass@k for generation and pass@1 for aggregation, with the stated goal of avoiding mode collapse and improving math/scientific reasoning. On the systems side, Gram Newton-Schulz was introduced as a drop-in replacement for Newton-Schulz in Muon, with up to 2x faster performance while preserving validation perplexity within 0.01.

  • Benchmarks remain hard: PRBench introduced 30 expert-curated paper-reproduction tasks across 11 physics subfields, and the cited result was stark: all agents showed zero end-to-end callback success. Tau Bench added a banking domain with 698 documents across 21 product categories; best models were cited at 25% task success and under 10% on pass@4.
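The filesystem approach from the first research bullet above can be sketched in a few lines of Python: shard a corpus into plain files so an off-the-shelf agent can search with shell tools (or their equivalents) rather than loading everything into a context window. All names, paths, and chunk sizes here are illustrative assumptions, not details from the paper.

```python
import pathlib
import tempfile

def shard_corpus(docs, root, chunk_chars=2000):
    """Write each document as chunked .txt files under root/<doc_id>/,
    so an agent can navigate with ls/grep/cat instead of a context window.
    (Hypothetical layout; the paper's actual scheme may differ.)"""
    root = pathlib.Path(root)
    for doc_id, text in docs.items():
        d = root / doc_id
        d.mkdir(parents=True, exist_ok=True)
        for i in range(0, len(text), chunk_chars):
            (d / f"chunk_{i // chunk_chars:04d}.txt").write_text(text[i:i + chunk_chars])
    return root

def grep(root, needle):
    """Tiny stand-in for the shell search an agent would run."""
    return [p for p in pathlib.Path(root).rglob("*.txt") if needle in p.read_text()]

root = shard_corpus({"doc1": "alpha " * 500 + "NEEDLE", "doc2": "beta " * 500},
                    tempfile.mkdtemp())
hits = grep(root, "NEEDLE")  # only doc1's matching chunk is touched by the result
```

The point of the design is that corpus size is bounded by disk, not by the model's window: the agent reads only the chunks a search surfaces.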

Products & Launches

Why it matters: Product work moved toward usable systems: better voice models, more local tooling, and clearer paths from research models to daily workflows.

  • Voice products improved at both ends of the stack. OpenAI said gpt-realtime-1.5 improves instruction following, tool calling, and multilingual accuracy in the Realtime API, while a new OpenAI developer post summarized Perplexity's lessons from running voice agents in production around context, audio pipelines, and turn-taking. Separately, Cohere Transcribe launched as a 2B-parameter open-weights speech-to-text model with 4.7% AA-WER, roughly 60x real-time transcription, training from scratch on 14 languages, and availability both through Cohere's API and on Hugging Face under Apache 2.0.

  • Local agent tooling kept expanding. ARC (Agent Remote Control) introduced a browser-based remote monitor for local agents, with real-time tool-call visibility, approvals, messaging, native Hermes Agent integration, open source distribution, and end-to-end encryption. AutoClaw launched as a way to run OpenClaw locally with no API key, support for any model or GLM-5-Turbo, and fully local data handling. litesearch packaged a fully local document-ingestion and retrieval stack for agents like Claude Code, using LiteParse, local embeddings, local Qdrant storage, and CLI-native search.

  • Security-conscious agent wrappers are becoming their own category. PokeeClaw positioned itself as an enterprise-secure alternative to OpenClaw, with a secure sandbox architecture, isolated environments, approval workflows, role-based access control, audit trails, and lower token usage.

  • Composable agent skills are spreading. Base44 added 130+ built-in "Superagent Skills" across marketing, operations, data analysis, design, content, coding, and research, with custom skills created from natural-language descriptions and reusable across workflows.

Industry Moves

Why it matters: Corporate signals this cycle were about who owns the agent operating layer, who controls deployment, and where new capital is going.

  • SycamoreLabs launched as a "trusted agent OS for the enterprise" with a $65M seed led by Coatue and Lightspeed, alongside AbstractVC, Dell Technologies Capital, 8VC, Fellows Fund, e14 Fund, and angel investors.

  • Figure AI described its breakup with OpenAI in unusually direct terms. CEO Brett Adcock said Figure got "no value" from the relationship beyond early fundraising, said Figure's internal team outperformed OpenAI's daily, and said the real break came when OpenAI planned to restart robotics, which would have turned Figure's work into competitor training. Figure has since built its own vision-language-action model, Helix, and the cited post said the company is valued at $39B.

  • Anthropic's growth is creating infrastructure strain. A cited report described the company's success as sparking a server crunch.

  • Hugging Face is explicitly pushing a builder strategy. Clement Delangue said the goal is to help "millions" build AI themselves rather than remain API users, and pointed to hf-autoresearch as an example of agent collaboration around checkpoints, datasets, papers, and Hub workflows.

  • Internal agent deployments are becoming business functions. A post about LangChain said its internal GTM agent drove 250% more lead conversions, using Deep Agents for orchestration, multiple data sources for context, and Slack for approvals. A separate build log said a similar agent was rebuilt on DeeplineCLI + Deep Agents in under an hour with roughly 200 lines of config.

Policy & Regulation

Why it matters: The notes were light on formal government action, but governance questions around data consent, auditing, and safety evaluation were prominent.

  • GitHub Copilot training consent: a widely shared warning said GitHub had opted users into training its models on their code by default, including paying customers, and pointed users to Settings > Privacy to disable it.

  • Governance proposals are getting more concrete: Will MacAskill and Fin Moorhouse proposed eight projects aimed at improving the transition to superintelligence, including independent evaluation of AI character traits, benchmarking strategic and philosophical reasoning, auditing models for sabotage and backdoors, and building monitoring and verification tools for collective coordination.

  • Safety debate stayed active: Boaz Barak published a new post titled the state of AI safety in four fake graphs, which Sam Altman publicly endorsed as "a very good post."

Quick Takes

Why it matters: These smaller items help fill in the operating picture around models, agent frameworks, and supporting infrastructure.

  • Qwen 3.6 Plus Preview went live on OpenRouter for a limited free period; Alibaba asked for feedback and noted prompts/completions may be collected during the preview.
  • Codex auto compaction was reported to improve long-session coherence, with one user saying Codex remembers tiny details across multiple rounds of compaction.
  • Hermes Agent added Multi Agent Profiles, giving independent bots separate memory, gateway connections, skills, and chat histories.
  • A new BOOT.md hook in Hermes lets agents save state before restarts and resume with what one post described as zero context loss.
  • OpenAI's Codex App Server is fully open source, includes sign in with ChatGPT, and powers Codex integrations in products like the Codex app and external tools such as JetBrains and T3 Code.
  • PixVerse V6 launched on fal.ai with text-to-video, image-to-video, transition, and extend endpoints, while PixVerse separately promoted V6 as offering more control, better performance, and 15-second 1080p audiovisual generation.
  • LisanBench launched a live benchmark site with leaderboard visualizations, and its creator said a meta leaderboard is next.
  • Triton-Ascend is now public, giving Huawei Ascend hardware a Triton kernel programming model that commenters said could help frameworks like sglang and vLLM run on Ascend without learning AscendC.
  • Gemini Live is now powered by Gemini 3.1 Flash Live.

Copilot Goes Multi-Model as Open Voice and Local AI Accelerate
Mar 31
4 min read
155 docs
Import AI
clem 🤗
Ben Thompson
+6
Microsoft rolled out multi-model research features in M365 Copilot, while Mistral and other open-model builders pushed audio, speech, and multilingual releases forward. Local AI also crossed a symbolic milestone with llama.cpp at 100k stars, as enterprise competition around OpenAI and Anthropic sharpened.

A few shifts stood out today

Microsoft is turning model plurality into a product, open releases are getting stronger across audio and speech, and local AI keeps looking more deployable. The market feels a bit less centered on one flagship model and more on orchestration, efficiency, and where systems actually run.

Microsoft brings multi-model workflows into Copilot

Microsoft introduced Critique in M365 Copilot, a multi-model deep research system that uses multiple models together to generate responses and reports; Satya Nadella said Microsoft's benchmarks show "best-in-class deep research." It also launched Council, which lets users run multiple models on the same prompt at once to compare alignment, divergence, and unique contributions. Both are available now in Frontier.

Why it matters: This is a notable product signal from a major platform vendor: instead of hiding model plurality behind one answer, Microsoft is exposing model collaboration and disagreement as a feature.

Open models broaden beyond text

Mistral’s Voxtral TTS is a notable open-audio release

Mistral launched Voxtral TTS, an open-weight multilingual text-to-speech model that supports nine languages and targets real-time streaming for voice agents. Latent Space said the model posted a 68.4% win rate against ElevenLabs Flash v2.5, while Mistral speakers described it as state-of-the-art quality at a fraction of proprietary costs.

Its architecture mixes autoregressive semantic speech tokens with flow matching for acoustic tokens, backed by an in-house neural audio codec at 12.5 Hz; the team also said the setup can extend to long generations via larger context windows.

Why it matters: Open voice models are getting closer to the quality, latency, and cost targets that matter for real-time products.

The broader open-model pipeline was unusually diverse

Interconnects highlighted an unusually broad set of open releases: NVIDIA's Nemotron-3-Super-120B-A12B-NVFP4 with a 1M context window, multilingual support, NVFP4 pre-training, and open pre-/post-training datasets; Cohere's cohere-transcribe-03-2026 speech model with 14 languages under Apache 2.0; Sarvam's 105B and 30B models with strong Indic-language positioning; and Mistral-Small-4-119B-2603 as a hybrid reasoning model with coding abilities. Interconnects argued this kind of domain-specific, cheaper model development is becoming an important complement to the strongest closed agents.

Why it matters: The open ecosystem is spreading across speech, multilingual, regional, and reasoning workloads instead of clustering around one general chatbot race.

Local AI looks more like infrastructure

llama.cpp reached 100k stars, and the stack around it keeps firming up

llama.cpp crossed 100k GitHub stars, has 1,500+ contributors, and Hugging Face said it is bringing Georgi Gerganov and ggml into the team behind what it called the most widely used open-source runtime for local AI. Gerganov said useful local agentic workflows became feasible as models improved tool calling on everyday devices, and Clement Delangue argued that many disappointments with smaller local models are really failures of scaffolding, chat templates, prompt construction, or fine-tuning rather than raw model capability.

Gerganov also described Qwen3.5 as a "step change" across device sizes, while Delangue urged open-source agent tools to rely primarily on open models rather than closed APIs that send data to the cloud.

"The technology is too important to be vendor-locked. It has to be developed in the open, by the community, together with the independent hardware vendors."

Why it matters: This is starting to look less like enthusiast momentum and more like a real deployment path for private, on-device, and cross-platform AI.

Strategy watch

Ben Thompson sees OpenAI's enterprise focus as a competitive necessity

Ben Thompson argued that reports of OpenAI cutting side projects should not be overread as an exit from consumer; instead, he sees a rational shift of resources toward enterprise, where customers pay for productivity gains and Codex has been especially strong. He framed the urgency around Anthropic's enterprise growth—described as moving from a $14B to $19B run rate—and the risk that OpenAI gets shut out if large customers standardize elsewhere.

He also noted OpenAI has pushed back on startup-skewed Ramp chart interpretations and may be stronger in the Fortune 500 than those charts suggest, while arguing that ChatGPT's massive consumer scale creates a harder monetization path because ads are difficult and compute is already heavily committed.

Why it matters: The center of gravity in AI competition may be shifting from consumer reach to enterprise distribution, pricing, and workflow lock-in.

Research watch

Self-improving agent scaffolds advanced, but frontier math remained hard

Import AI highlighted Hyperagents, a self-referential scaffold that lets LLM systems iteratively modify their own prompts and tools. In reported results, the setup improved Polyglot coding performance from 14% to 34%, paper review from 0% to 71%, and robotics reward design from 6% to 37%.

The same roundup pointed to HorizonMath, a benchmark of 100 predominantly unsolved math problems with automated verification, where the top model scored only 7% overall and 50% on the easiest subset.

Why it matters: The capability story remains mixed: better scaffolds are producing real gains on structured tasks, while benchmarks aimed at genuine mathematical discovery are still extremely hard.

PM Operating Systems, Product Builders, and Pricing Architecture
Mar 31
10 min read
81 docs
Aakash Gupta
Sachin Rekhi
Teresa Torres
+5
This issue covers three shifts reshaping product management: persistent AI operating systems for PM work, the rise of the cross-functional product builder, and monetization architecture that lets pricing change in hours instead of quarters. It also includes execution lessons on testing handoffs, engineering trust, career positioning, and practical tools to try.

Big Ideas

1) Claude Code is moving from assistant to PM operating system

Aakash Gupta’s core argument: the best Claude Code users are not relying on one-off chats. They build persistent file-based operating systems with skills, sub-agents, hooks, workflows, and markdown knowledge that improve every future prompt. He positions this as the operating-system layer for people spending 8-10 hours a day in the tool, with the potential to move from roughly 80/100 to 95/100 proficiency.

That is what an operating system is. Not a folder full of files. A system where every interaction makes the next one better.

Why it matters: PM work is highly contextual. A persistent workspace lets stakeholder context, project history, goals, and prior fixes survive beyond one chat window.

How to apply:

  • Start with CLAUDE.md and GOALS.md; the source says those two files deliver 80% of the value on day one.
  • Keep CLAUDE.md current weekly so Claude inherits your role, tools, priorities, and recurring instructions in every message.
  • Add persistent people files and project folders so meeting notes, stakeholder preferences, PRDs, research, and launch results compound over time.
  • Use sub-agents for research and CLIs instead of MCPs to protect context: one example dropped a research task from about 10% of the main context window to 0.5%.

2) The product trio is compressing into product builders

Teresa Torres argues product management, design, and engineering are not dead, but the classic PM-design-engineering trio is collapsing toward a broader product-builder foundation with specialties layered on top. In her framing, AI now gives people a base level of programming, design, product management, and business-context capability, so 1-2 product builders can handle much of the routine 80% of feature work while specialists focus on the harder 20%.

Why it matters: This changes team design and individual expectations. Torres expects smaller, more cross-functional teams, while still arguing that human strengths in alignment, trade-off decisions, organizational context, and innovation remain important.

How to apply:

  • Build horizontal AI skills alongside your core craft; Torres describes this as a modern T-shaped product-builder foundation.
  • Learn to specify what you want and plan with an agent; she says that base foundation no longer requires direct exposure to code for many common web-app tasks.
  • Keep investing in your specialty. Her argument is not that expertise disappears, but that expertise is increasingly paired with AI fluency inside the function itself.
  • If you lead teams, start thinking about safety infrastructure now, including security, accessibility, and code-review agents, because broader participation in building raises review demands.

3) Pricing architecture is becoming core PM territory

The Product Compass makes a blunt case: as AI compresses time spent on Jira, PRDs, and standups, PMs are increasingly responsible for business outcomes, and pricing sits near the center of that shift. Its thesis is simple:

Pricing should live in config, not code.

The article contrasts companies that can change pricing in hours with teams that still need quarters. It cites Vercel shipping 5-6 pricing changes per month, while many companies make 1-2 changes per year and consume a quarter of engineering time for each.

Why it matters: If plans, entitlements, usage limits, and experiments are hardcoded, pricing becomes an engineering bottleneck rather than a product lever.

How to apply: Use the four-pillar test for monetization agility:

  • Unified product catalog: one schema for plans, features, entitlements, and prices.
  • Decoupled entitlements: central runtime rules instead of scattered if (plan == ...) checks.
  • Real-time metering: usage visibility for customers, sales, and finance before the invoice surprise.
  • Control plane: a dashboard where non-engineers can run pricing experiments and adjust limits without code deploys.
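The "decoupled entitlements" pillar can be sketched in a few lines. This is a minimal illustration, not any vendor's schema: plans, features, and limits live in one config structure, and a single runtime check replaces scattered if (plan == ...) branches.

```python
# Hypothetical unified catalog: plans, entitlements, and limits as config, not code.
CATALOG = {
    "free": {"features": {"export"}, "limits": {"api_calls": 1_000}},
    "pro":  {"features": {"export", "sso"}, "limits": {"api_calls": 100_000}},
}

def entitled(plan: str, feature: str) -> bool:
    """Central runtime rule; changing a plan is a config edit, not a code deploy."""
    return feature in CATALOG.get(plan, {}).get("features", set())

def within_limit(plan: str, metric: str, usage: int) -> bool:
    """Real-time metering check against the same catalog."""
    return usage < CATALOG.get(plan, {}).get("limits", {}).get(metric, 0)

pro_sso = entitled("pro", "sso")    # True
free_sso = entitled("free", "sso")  # False
```

With this shape, a control plane is just a UI that edits CATALOG (persisted in a database or config service), which is what makes hour-scale pricing changes possible.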

Tactical Playbook

1) Stand up a lightweight PM operating system in Claude Code

  1. Create CLAUDE.md with your role, work style, installed tools, current priorities, and references to your skills.
  2. Add GOALS.md for quarterly priorities; the source recommends starting here before building more structure.
  3. Set up knowledge/people/ and update it after meetings so stakeholder preferences and recent context are reusable in future communication.
  4. Create one folder per active project, then archive completed projects for reuse on similar work later.
  5. Monitor the status line and /context, and push research to sub-agents instead of the main session when context starts climbing.
  6. Use Jupyter notebooks for CSV analysis when you need transparent, reviewable methodology, and use the ask-user-questions tool when requirements or decision criteria are still fuzzy.
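The first few steps above can be scripted. This scaffold is illustrative only: the file names (CLAUDE.md, GOALS.md, knowledge/people/) come from the article, but the contents and exact layout are assumptions.

```python
import pathlib
import tempfile

def scaffold(root: str):
    """Create the skeleton of a file-based PM workspace: the core markdown
    files plus knowledge/people/ and a projects/ area. Layout is illustrative."""
    root = pathlib.Path(root)
    (root / "knowledge" / "people").mkdir(parents=True, exist_ok=True)
    (root / "projects").mkdir(exist_ok=True)
    (root / "CLAUDE.md").write_text(
        "# Role & context\n- Role: ...\n- Tools: ...\n- Current priorities: ...\n")
    (root / "GOALS.md").write_text("# Quarterly goals\n1. ...\n")
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

layout = scaffold(tempfile.mkdtemp())
```

The value is not the script itself but the convention it fixes: every future session starts from the same file layout, so context accumulates instead of resetting.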

Why this matters: The operating model turns scattered PM work into reusable context and lowers the cost of repeating research, analysis, meeting prep, and writing from scratch.

2) Close the gap between acceptance criteria and actual testing

A Reddit post surfaced a familiar failure mode: a PM wrote the checkout flow step by step in the PRD, but QA backlog, outdated scripts after a UI change, and mutual assumptions meant the flow still shipped broken. The PM’s takeaway was that knowing the flow well was not enough because the knowledge never became an executable test.

How to apply:

  1. Identify flows where a broken handoff would create visible customer damage, such as checkout or onboarding.
  2. Convert plain-English acceptance criteria into something that runs against the actual product, not just a documentation artifact.
  3. Review screenshots or pass/fail evidence before sprint review, rather than assuming regression coverage exists.
  4. If QA ownership is fragmented, treat PM participation in testing as a temporary control, not an exception.
  5. Do not rely on documentation alone to solve the problem; one community response argued the handoff gap still comes back to direct communication.
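To make step 2 concrete, each plain-English criterion can become an assertion that actually runs. The checkout model here is entirely hypothetical, a stand-in for your product or a device-level test harness, but it shows the shape of the shift from prose in a PRD to executable checks.

```python
# Hypothetical checkout logic: a stand-in for the real product under test.
# The coupon code, rates, and cart shape are invented for illustration.
def checkout(cart, coupon=None):
    subtotal = sum(price for _, price in cart)
    discount = subtotal * 0.10 if coupon == "SAVE10" else 0.0
    return {"total": round(subtotal - discount, 2), "items": len(cart)}

# Acceptance criteria rewritten as runnable checks instead of PRD prose.
def test_checkout_applies_coupon():
    # Criterion: "SAVE10 takes 10% off the order total."
    order = checkout([("book", 20.0), ("pen", 5.0)], coupon="SAVE10")
    assert order["total"] == 22.5

def test_checkout_without_coupon():
    # Criterion: "Total equals the sum of item prices with no coupon."
    assert checkout([("book", 20.0)])["total"] == 20.0

test_checkout_applies_coupon()
test_checkout_without_coupon()
```

When the UI changes, these checks fail loudly instead of drifting silently out of date the way a Notion page or a stale QA script can.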

3) Put pricing on a monthly operating cadence

The Product Compass suggests a two-hour monthly pricing meeting with four blocks: customer data, learnings scan, product-to-pricing roadmap sync, and decisions/actions.

How to apply:

  1. Review usage, billing, and approaching-limit customers to spot expansion candidates and churn risk.
  2. Add cross-functional input from sales, CS, finance, marketing, and growth on win/loss patterns and pricing friction.
  3. For every feature shipping in the next 30-90 days, decide its monetization stance up front; the rule proposed is that no feature ships without one.
  4. Leave with 1-3 local experiments, each with an owner, hypothesis, timeline, and expected impact.

Why this matters: It separates infrequent global pricing changes from continuous local experiments, giving PMs a repeatable way to connect product roadmap and revenue decisions.

4) When engineering relationships are political, build trust before trying to redirect the roadmap

Community advice in a discussion about resistant developers was consistent on one point: trust comes before leverage. The recommended pattern was to listen first, find the influential developer, make small suggestions once you are situated, and avoid upending a team’s plan immediately as a newcomer.

How to apply:

  1. Treat developers as partners, not order takers; commenters framed weak PM-engineering trust as the root problem in these scenarios.
  2. Build credibility by representing the existing roadmap before advocating major changes.
  3. If your manager reassigns you or inserts themselves into the work, ask directly what pattern they are seeing and what feedback you need to hear.

Case Studies & Lessons

1) Monetization architecture changed shipping speed at Zep, Plotly, and Automox

  • Zep: modeled plans and entitlements, went from trial start to production in 4 days, and later used limit enforcement to improve free-to-paid conversion while giving sales earlier visibility into usage.
  • Plotly: launched two AI products two quarters faster because catalog and entitlements were already modeled centrally.
  • Automox: after years of hardcoded monetization logic across two billing systems, it cut time-to-launch for new pricing tiers by 75% and freed two full-time engineers from maintenance work.

Lesson: Pricing agility is not only a packaging problem. It is an architectural capability that determines how quickly PMs can test monetization ideas.

2) A broken checkout flow showed that a PRD is not a test plan

One PM’s postmortem described a flow that was written clearly in a Notion PRD, demoed repeatedly, and still shipped with a production bug because no one converted that knowledge into an updated test. After adopting a plain-English testing tool that ran on real devices and returned screenshots plus step-level pass/fail, the PM says they caught two production-bound issues in the first week.

Lesson: The verification loop breaks when documentation, QA scripts, and ownership drift apart. The fix is executable validation, not better prose alone.

3) Horizontal expansion can damage the core product

Teresa Torres says Zapier’s expansion into adjacent products has coincided with degradation in the core automation experience, citing repeated failures where zaps did not trigger. Her workaround has been to ask Claude to build custom webhook listeners because she finds the resulting code more reliable and easier to control for error handling. She adds that she is slowly moving off both Zapier and Airtable because of persistent quality issues.

Lesson: New surface area can hide declining reliability in the core workflow. PMs expanding horizontally need to watch quality metrics on the original product, not just adoption of the new bets.

Career Corner

1) The safest career move right now is becoming a stronger product builder

Torres’ career advice is direct: build horizontal AI skills while continuing to deepen your functional expertise. She argues that if you do not learn how to use AI inside your function, you will no longer be seen as an expert in that function, and she notes that job descriptions and interview processes are already changing.

How to apply: Practice two skills now: specifying what you want clearly and planning work with agents, then pair that with deeper expertise in your primary craft.

2) Early-career PMs should optimize for signal, not resume mythology

Advice to an APM with informal startup experience was straightforward: include the work on the resume, but focus on what you did, the problems you solved, and your responsibilities, not on ownership structure or proprietorship details. The same commenter suggested staying in the APM role for at least 1-2 years to build clearer, more relevant product experience before making the next move.

3) Domain switches are harder in an oversupplied market

A PM with about four years in data and analytics product management said they were reaching final rounds for customer-facing roles but losing out to candidates with more direct domain experience, despite feedback that their core PM skills were transferable. They also pointed to candidate oversupply as part of the problem.

Takeaway: In the current market, transferable PM skill is still valuable, but it may not beat direct domain familiarity when employers have many candidates to choose from.

Tools & Resources

1) PM OS starter repos

Why explore them: Both are meant to reduce setup friction and give PMs a concrete file structure, skills layout, and workflow starting point.

2) Jupyter notebooks for auditable analysis

The recommendation here is to ask Claude to analyze data in a Jupyter notebook so every query, output, and chart is preserved as code cells and rendered results.

Use it when: you need analysis that a manager or data scientist can verify step by step, rather than a black-box summary.
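A minimal sketch of what an auditable, cell-by-cell analysis looks like, using only the standard library. The CSV content and metric names are invented for illustration; the point is that every step is code a reviewer can re-run, not a black-box summary.

```python
import csv
import io
import statistics

# Notebook-style cell 1: load the data. Inline CSV stands in for a real file;
# in a notebook this would be open("signups.csv") and the raw rows would render.
raw = "week,signups\n1,120\n2,150\n3,90\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Cell 2: derive the series. The transformation is visible, not hidden.
signups = [int(r["signups"]) for r in rows]

# Cell 3: compute the metrics a reviewer can check by hand.
mean = statistics.mean(signups)          # average weekly signups
wow_change = signups[-1] - signups[-2]   # week-over-week change
print(mean, wow_change)
```

Because each step is a cell with a rendered result, a manager or data scientist can audit the methodology line by line instead of trusting a summary.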

3) The ask-user-questions tool

Claude can generate a terminal UI with checkboxes and input fields to gather requirements, fill context gaps, or support decisions instead of guessing.

Use it when: assumptions are the main failure mode in discovery or planning.

4) A prompt-optimization loop for recurring agent workflows

Aakash Gupta describes a Karpathy-style loop for prompts: pick the prompt to improve, use 2-3 realistic test inputs and 3-6 binary quality checks, run repeated evaluations, mutate one variable at a time, keep winners with version control, and revert losers. He cites a pace of about 12 experiments per hour and roughly 100 overnight.

Use it when: you have a prompt or system instruction that is already good enough, but not yet reliable, in workflows like support, internal automations, extraction, or code review.
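A toy sketch of that loop's structure. Everything here is invented for illustration: score() stands in for running the prompt against real test inputs and counting how many of the binary quality checks pass, and the prompts and cases are placeholders.

```python
# Stand-in scorer: in practice this would call the model on 2-3 realistic
# test inputs and count how many of the 3-6 binary checks pass.
def score(prompt, cases):
    return sum(1 for c in cases if c.lower() in prompt.lower())

# Hypothetical binary checks, expressed as required phrases for the sketch.
cases = ["refund", "order id"]

best = "Answer the customer. Mention their order id."
history = [best]  # crude version control: every winner is kept

# Mutate ONE variable at a time; each variant changes a single thing.
for variant in (
    best + " Offer a refund if eligible.",   # mutation 1: add an instruction
    best.replace("Answer", "Briefly answer") # mutation 2: change tone only
):
    if score(variant, cases) > score(best, cases):
        best = variant          # keep the winner...
        history.append(best)    # ...and record it
    # losers are simply discarded, i.e. the change is reverted

print(best)
```

The version-control step matters at the cited pace of ~12 experiments an hour: without a kept history of winners, overnight runs have nothing reliable to revert to.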

5) Reforge AI Productivity

Sachin Rekhi says the updated live sessions are focused on what has become most actionable for PMs over the last six months: automating PM workflows with Claude Code, the AI prototyping mastery ladder, AI-powered customer discovery, and AI-enhanced product strategy and execution.

USDA Acres Risk, Record Biofuel Demand, and Brazil’s Rising Input Stress
Mar 31
8 min read
153 docs
Market Minute LLC
GrainStats 🌾
Tarım Editörü
+10
U.S. grains head into the USDA acres-and-stocks report with strong corn export pace, weaker soybean exports, and fresh support from record RFS volumes. Brazil remains a major supply anchor, but fertilizer, freight, weather, and logistics are tightening farm economics and shipment flow.

Market Movers

  • United States — row crops: Grain trade is rotating from crude-oil-driven momentum toward Tuesday's USDA acres and quarterly stocks reports. Pro Farmer's survey put corn at 96 million acres and soybeans at 84.25 million acres, while other pre-report references centered closer to 94.4-94.5 million corn acres. Multiple sources noted the survey window predated the latest war- and fertilizer-driven disruption, so the report may not fully capture the most recent acreage recalculation. Quarterly corn stocks are also expected to run more than 1 billion bushels above last year, while funds recently added about 50,000 corn contracts and remain heavily committed to the soybean complex.

  • United States — exports: Weekly export inspections were 70.4 million bushels for corn, 21.5 million for soybeans, and 13.4 million for wheat. Marketing-year corn inspections are 298 million bushels ahead of the pace needed to hit USDA's target, wheat is 53 million bushels ahead, but soybeans remain 112 million bushels behind. GrainStats separately noted that corn exports do not depend on China the way soybeans do; last week's inspections to China were 9.9 million bushels of soybeans and zero corn and wheat.

  • United States — biofuels and soy complex: EPA finalized the highest RFS volumes in program history, keeping conventional biofuels at 15 billion gallons for 2026 and lifting biomass-based diesel to roughly 5.5-5.7 billion gallons. Reallocated small-refinery waivers add another 200-250 million gallons per year to 2026-2027 volumes. Analysts tied the policy to stronger bean-oil-led soybean trade and better long-run domestic use for soybeans.

  • United States/Australia — wheat: Chicago wheat futures rose on concern that the Iran war could lift farmer energy and fertilizer costs, while Plains drought widened the Kansas City-Chicago spread to the largest hard red winter premium since August. Australia was also cited for drought and fuel shortages, even as one analyst noted global wheat stocks-to-use remains high overall.

  • United States — livestock: Cattle futures broke to new highs after a technical breakout, firmer cash trade, and tighter beef supplies. Beef cold storage fell 5% year over year to about 413 million pounds. Hogs closed lower on end-month profit taking even though the March 1 inventory of 74.3 million head came in below pre-report expectations; pork bellies in cold storage were down about 6-7% from a year ago.

Innovation Spotlight

  • Brazil — nematode biotech: BASF said its Soja NRS trait is aimed at cyst and Pratylenchus nematodes that cover about 90% of Brazilian territory. The company tied nematodes to roughly R$35 billion in annual production losses, including R$16-17 billion in soybeans alone, and said the trait could add R$15-18 billion to the value chain. Because the control is embedded in the soybean plant, BASF expects lower nematode populations to carry productivity benefits into second-crop corn or cotton. Commercial introduction in Brazil was described as a 2-3 year timeline after 10-12 years of development.

  • Brazil — lower-emission fertilizers in potatoes: In Paraná, six growers covering about 130 hectares are using Iara's Climate Choice portfolio with 4C management and GHG-protocol tracking. The program estimates up to 40% lower carbon footprint, while participants reported higher productivity, more dry matter, and better tuber quality that earns factory bonuses. PepsiCo is financing the price difference versus conventional fertilizer as part of its 2030 emissions target.

  • United States — machinery and fuel systems: John Deere's E98 ethanol-powered 8R pairs 350 hp with no DEF and a reported fuel-cost advantage of roughly $1/gallon ethanol versus $3/gallon diesel, even though volumetric fuel use is higher. Deere and RDO also described the new high-horsepower 8R/8RX package as an ultimate planting tractor that can power electric row units through a single cord, run up to 110 gpm hydraulics, and plant up to 1,200 acres/day with large 24-row-plus planters in some setups.

Regional Developments

  • United States — planting and weather: Southern corn planting is already moving, with progress reported at 76% in Louisiana, 53% in Texas, 50% in Mississippi, 23% in Arkansas, 2% in Kansas, and just above zero in Illinois. At the same time, substantial mid-week rainfall is expected across Iowa, Illinois, Missouri, Minnesota, Michigan, and Indiana near early planting dates, while hard red winter wheat areas farther west may miss needed rain.

  • Brazil — soybean and corn supply: AG Rural raised Brazil's 2025/26 soybean forecast to 178.4 million tons, with harvest at 75%. Center-South first-crop corn harvest reached 59.4%, versus 56.4% a year ago, while safrinha planting is nearly 98% complete and already 100% done across Mato Grosso's projected 7.3 million hectares. Total corn output was trimmed to 135.7 million tons.

  • Brazil — April weather: Forecasts point to a hotter, drier first half of April followed by stronger second-half rainfall. That pattern was described as supportive for later-planted safrinha corn in Mato Grosso, Goiás, and Minas Gerais because rain is not expected to cut off early, even though late-stage crops face some stress in the hotter opening weeks.

  • Brazil — Mato Grosso logistics: Road conditions on MT-240 in Paranatinga are interrupting soybean movement. One farmer with 1,800 hectares of soybeans still had 800 hectares unharvested and said at least 500 hectares could be lost if truck delays continue. Another producer reported about 15% crop loss where trucks were crossing cropped ground used as a bypass.

  • Brazil — pork exports: Brazil has consolidated as the third-largest global pork exporter. Asia absorbs about 70% of shipments, led by the Philippines, with Vietnam and Japan also important. Slaughter capacity was described as up roughly 3% to meet export demand, although domestic live-hog prices have weakened since February, making export flow more important for balance.

Best Practices

  • Soybeans — variable populations by field: In soybean fields with iron deficiency chlorosis risk, light soils, late planting, or low fertility, Ag PhD recommended raising populations to 160,000-180,000 plants. The main payoff cited was faster canopy closure for better weed control. Lower populations were described as better suited to very high-fertility fields or areas with high white mold risk, and variable-rate population by field was encouraged to balance productivity and seed cost.

  • Soybeans — Southeast U.S. stand management: A separate Southeast system targeted an 80,000 final stand to drive branching and reduce lodging risk. After a pounding rain crusted 15-inch rows, pivot irrigation was used to soften the soil surface and help emergence rather than to add moisture.

  • Weed control — start residual programs early: Brownfield's herbicide guidance emphasized keeping fields clean early in spring, when weed competition can lock in yield loss. Resicore Rev was cited with up to 8 weeks of residual control, three modes of action, and compatibility with UAN and ATS as well as other pre-emerge mixes.

  • Small grains — manage nitrogen and lodging together: Tissue sampling at jointing (GS5/6) was used to map variable-rate nitrogen needs, while the same operation targeted about 80 heads/sq ft for 80+ bushel grain and planned a growth regulator to reduce lodging risk.

  • Soil systems — conserve structure, then manage pressure: Brazil's no-till system now covers nearly 90% of cultivated land, and a separate U.S. field example showed strip-till absorbing 2.5-3 inches of rain and reducing runoff. The trade-off in tropical systems is higher year-round pest, weed, and disease pressure because of the permanent green bridge.

  • Dairy — benchmark economics first: Espírito Santo's new dairy sustainability framework starts with 103 indicators, including 78 economic measures focused on genetics, feeding, productivity, and milk quality. The rollout begins with 400 farms and shared technical assistance from public extension, Senar, and cooperatives.

Input Markets

  • United States — fertilizer availability: Fertilizer remains the sharpest input risk. Producers reported booking anywhere from 50% to 85% of needs, but some still faced delivery problems even with contracts in hand. Forward contracting was emphasized, while crude oil also moved above $100 for the first time since 2022.

  • Brazil — policy and freight pressure: Brazilian producers face a 2% fertilizer cost increase from new PIS/COFINS rules starting April 1, while the minimum freight table removes cheaper return freight from ports and diesel hikes add more cost. Local commentary also said grain prices have not kept pace with fertilizer inflation, worsening the input-output trade ratio.

  • Global — sulfur bottleneck: Brazilian industry said fertilizer production is being reduced worldwide because sulfur flows from the Arabian Gulf have been disrupted. China was cited as having temporarily banned fertilizer exports because of sulfur scarcity, and one major supplier was said to have only about two months of operating room without normalization of Gulf routes.

  • Feed — support for pork, pressure for cattle: Brazil's pork export spread is still around 40% because export prices near $2,500 and low feed costs are supporting margins. In contrast, U.S. cattle commentary warned that drought will tighten forage supplies and raise feed costs.

  • Crop chemicals — residual programs favored: Early-season chemical guidance continues to favor residual products that fit fertilizer passes. Resicore Rev was specifically highlighted for UAN/ATS tank-mix flexibility and longer residual control.

Forward Outlook

  • United States — USDA report risk: The next 24 hours are report-driven. USDA stocks and acres are being treated as a wild card; Market Minute said corn has traded higher on report day for five straight years, with an average move of 12 cents over the past 10 years. But multiple sources cautioned that survey timing predates the latest war/fertilizer shock, so acreage may keep moving after the report.

  • United States — corn versus soybeans: Corn acreage may stay sticky where nitrogen was prepaid or fall-applied, but soybeans still have room to gain if spring fertilizer and fuel costs escalate. One analyst also argued that corn's recent price rally can offset some fertilizer inflation, which may limit late acreage switching.

  • United States — biofuel demand is now more structural: Beyond the immediate bean-oil reaction, EPA said the RFS now creates $31 billion of value for American corn and soybean oil. Starting in 2028, foreign fuels and feedstocks receive only half the RFS compliance value of U.S.-made products, reinforcing the domestic demand story.

  • United States/Brazil — seasonal timing: Southern U.S. planting is already active, Corn Belt rain is arriving close to early plant dates, and Brazilian second-crop corn is expected to rely on a dry first half of April followed by better second-half moisture for grain fill. Large crop potential remains intact in Brazil, but local road failures and input logistics are now important watchpoints alongside weather.
