Hours of research in one daily brief–on your terms.

Tell us what you need to stay on top of. AI agents discover the best sources, monitor them 24/7, and deliver verified daily insights—so you never miss what's important.

Setup your daily brief agent
Discovering relevant sources...
Syncing sources 0/180...
Extracting information
Generating brief

Recent briefs

Cursor Cloud Agents go video-first + test-first, while GPT-5.4 upgrades Codex and always-on automations spread
Mar 6
7 min read
147 docs
OpenAI
swyx
Salvatore Sanfilippo
+16
Cursor’s Cloud Agents show what “agentic IDE design” looks like in practice: dedicated VMs, end-to-end testing, demo videos, and Slack-first collaboration. Plus: GPT‑5.4’s Codex upgrades (/fast mode, Playwright skill, 1M context status), always-on Cursor Automations, and hard lessons on evaluation, manual testing, and CI prompt-injection security.

🔥 TOP SIGNAL

Cursor’s latest Cloud Agents push is a concrete “agentic IDE” redesign: agents run in dedicated VMs, test changes end-to-end, and return a demo video + a tested PR, with remote desktop/terminal access for quick human iteration . Cursor says this flow exists because reviewing code becomes the bottleneck once agents can generate large diffs—video is an easier first review surface (but not a code-review replacement) .

🛠️ TOOLS & MODELS

  • OpenAI — GPT-5.4 rollout (Thinking + Pro), unified frontier model

    • Rolling out in ChatGPT, and also available in the API and Codex.
    • OpenAI describes it as bringing advances in reasoning, coding, and agentic workflows into one model .
    • Practitioner note: Hanson Wang says Codex and Thinking models are now unified.
  • Codex — /fast mode (GPT-5.4)

    • Claimed 1.5x faster with “the same intelligence and reasoning” .
    • Tradeoff called out by the Codex team: 1.5x speed for 2x cost.
  • Codex — Playwright skill + frontend improvements (GPT-5.4 era)

    • Romain Huet says complex frontend work looks “noticeably better,” and calls out a new Playwright skill that lets Codex visually debug and test apps while it builds.
  • Cursor — GPT-5.4 support + 1M context status

    • Cursor says GPT-5.4 is now available and is “more natural and assertive,” leading on their internal benchmarks .
    • Cursor’s Jediah Katz reported an issue with 1M context in GPT-5.4 and said they were fixing it ASAP .
    • Follow-up: Katz says 1M context is now available for GPT-5.4 if you toggle Max Mode on (enterprise legacy pricing: coming behind a separate gpt-5.4-1m slug ).
  • Cursor — Automations (always-on agents)

    • Cursor announced Automations: “continuously monitor and improve your codebase,” running on triggers and instructions you define.
    • Cursor CEO Michael Truell says Automations already run thousands of times per day internally, powering self-healing CI, auto-approving PR flows, compute-intensive security review, and a team-wide memory system.
    • Jediah Katz highlights they can trigger on any event/webhook, run in the cloud (not dependent on one laptop), and are team-owned.
  • Local agents (privacy-driven) — Qwen 3.5 as “good enough” for some tasks

    • Salvatore Sanfilippo says Qwen 3.5 is the first time he feels local agents can work for simpler programming tasks on your own machine (not state of the art, but effective) .
    • He compares the 27B dense model (more stable, good for GPU) and 35B MoE (3B active) (faster iteration, maybe better in practice) .
  • Augment — “Intent” UI for large workloads

    • Theo describes Intent as a shift from chat/autocomplete toward a UI for planning and managing large agentic coding workloads.
    • He also highlights pulling context from Linear, Sentry, GitHub issues, or PRs to keep workstreams compatible .

💡 WORKFLOWS & TRICKS

1) Cursor’s “Cloud Agent” loop (test-first + video-first + HITL)

A replicable loop Cursor describes for cloud-agent work:

  • Kick off an agent in cursor.com/agents; it works longer because it tests end-to-end (starts dev servers, iterates) and aims to return a tested PR.
  • First review pass: watch the demo video (a faster entry point than reviewing a huge diff) .
  • If needed: use remote desktop (VNC-style) + terminal access to interactively verify behavior and iterate .
  • Testing controls:
    • Default behavior is calibrated testing: don’t test “very simple copy changes,” but test complex ones; configurable via agents.md.
    • Use /notest to force skipping tests .

2) Bugfixes that ship faster: /repro before/after videos

Cursor’s **/repro** pattern:

  • Agent reproduces the bug and records a video, then fixes and records an “after” video .
  • Cursor says this moves some bug classes from “hard to repro locally” to “merge in ~90 seconds” .

3) Parallelism you can actually review: Best-of-N via 20s videos

  • Cursor says demo videos made them use best-of-N more often because reviewing four 20-second videos is manageable vs reviewing 4× giant diffs.

4) Slack as the “new IDE” surface (team workflows)

  • Cursor engineers describe Slack threads as a dev surface: you can @cursor in issue/product channels to kick off a cloud agent; teammates can “follow up” in-thread with more context .
  • They say the human discussion shifts to the high-order decisions (“do we ship this?”, “is this the right UX?”) while the agent handles implementation .

5) Subagents for context + compute management

  • Cursor highlights subagents as a way to delegate across prompts/goals/models and keep context manageable .
  • Example: an explore subagent can be routed to a faster model to read lots of code quickly, then summarize back to the parent agent .

6) Long-running agent mode (“grind mode”)

  • Cursor describes a long-running mode (“grind mode”) that aligns on a plan first, then grinds until criteria are met—potentially for days .

7) “Meta-setup” is becoming its own benchmark (Karpathy)

  • Andrej Karpathy says he has agents iterating on nanochat automatically: agents work on feature branches, try ideas, merge improvements, and iterate .
  • In one snapshot he reports 110 changes in ~12 hours reducing validation loss from 0.862415 → 0.858039 (d12 model) with no wall-clock penalty .
  • He calls the real benchmark: “what is the research org agent code that produces improvements on nanochat the fastest?.

8) Let the model improve the model (Hanson Wang’s GPT-5.4 workflow)

  • Hanson Wang says he asked GPT-5.4-xhigh in Codex to autonomously iterate on Codex’s own system prompt; it ran >17 hours, executed 200+ evals, wrote scripts to monitor eval progress, and pruned unpromising branches .

9) Skills need evals (not vibes): LangChain’s skills benchmarking loop

  • LangChain’s Robert Xu outlines an evaluation pipeline: define tasks + define skills, run with/without skills, compare, iterate .
  • Reported outcome (their tests): Claude Code completed tasks 82% of the time with skills vs 9%without skills .
  • Practical detail: they stress consistent clean environments (they used a lightweight Docker scaffold) for reproducible agent tests .

10) Manual testing is still non-negotiable (and agents can help)

  • Simon Willison: “Just because code passes tests doesn’t mean it works as intended… Automated tests are no replacement for manual testing.
  • He recommends having agents execute what they wrote (e.g., Playwright for UI testing) instead of assuming correctness .
  • For evidence, Willison’s Showboat pattern records commands + outputs to discourage agents from writing what they hoped happened .

11) Security footgun: prompt-injected CI agents + cache poisoning (Cline)

  • Cline ran an issue-triage workflow using anthropics/claude-code-action@v1 on every newly opened GitHub issue with --allowedTools "Bash,Read,Write,...".
  • Because the workflow prompt included the untrusted issue title, an attacker could prompt-inject tool execution and use GitHub Actions cache behavior to poison shared caches and steal release secrets, leading to a compromised cline@2.3.0 release (later retracted) .

👤 PEOPLE TO WATCH

  • Jonas Nelle + Samantha Whitmore (Cursor) — unusually specific harness design details: test-first PRs, video review entrypoint, Slack-as-IDE, subagents, and long-running “grind mode” .
  • Michael Truell (Cursor) — adoption signal: Automations running thousands/day internally, including “compute-intensive security review” and team memory .
  • Hanson Wang (OpenAI/Codex) — concrete “agent improves agent” workflow (17h autonomous system-prompt iteration with 200+ evals) .
  • Andrej Karpathy — framing shift: optimize the agent org (meta-setup) and measure “time-to-improvement” loops .
  • Simon Willison — high-signal practical guidance across (1) agentic manual testing and (2) real-world agent CI security failures.
  • swyx — pushes for better rigor + tooling around agent reliability, including an open-sourced Claude compaction viewer for diagnosing bad compactions and a reminder that statistically meaningful SWE-bench comparisons can require 30–60x more compute than cheap samples .

🎬 WATCH & LISTEN

1) Cursor Cloud Agents: test + video + remote desktop as the new review loop (≈02:23–05:33)

Hook: why video is the “entry point” for reviewing agent output, and how remote desktop/terminal access closes the loop on real verification.

2) Slack as the collaboration surface for agents (≈20:57–23:26)

Hook: how agent threads + team follow-ups shift human work from “where does this if-statement go?” to product/UX decisions.

📊 PROJECTS & REPOS


Editorial take: Today’s theme is throughput via autonomous + parallel agents—and the tax you can’t dodge is verification (tests + manual evidence) and security boundaries around what those agents are allowed to touch.

GPT‑5.4 rolls out with native computer use; KARL and FlashAttention‑4 reshape the agent stack
Mar 6
9 min read
1027 docs
More Perfect Union
Lisan al Gaib
Tibo
+43
OpenAI’s GPT‑5.4 rollout dominates the cycle, bringing native computer use, tool-search efficiency, and 1M-token context (with real long-context caveats). Also: Databricks’ RL-trained KARL knowledge agent, FlashAttention‑4’s push into mainstream frameworks, a major Anthropic–Pentagon escalation, and a developer-agent supply-chain security incident.

Top Stories

1) OpenAI rolls out GPT‑5.4 (Thinking + Pro) with native computer use and 1M context

Why it matters: This is a consolidated “frontier model” push that pairs agentic coding + tool use + computer control with very long context, which changes what’s practical in production workflows (especially multi-step, tool-heavy tasks).

Key details (as announced across OpenAI + OpenAI DevRel):

  • Availability / SKUs: GPT‑5.4 is available now in the API and Codex, with GPT‑5.4 Thinking and GPT‑5.4 Pro rolling out in ChatGPT. In the API, it’s available as gpt-5.4 and gpt-5.4-pro.
  • Core capability bundle: Native computer-use capabilities; up to 1M tokens of context (Codex + API); “best-in-class agentic coding for complex tasks”; scalable tool search; more efficient reasoning for long, tool-heavy workflows .
  • Computer use specifics: OpenAI Devs says GPT‑5.4 can write Playwright code, read screenshots, and issue keyboard/mouse actions to operate computers, with steerable behavior and configurable confirmation policies .
  • Benchmarks shared by OpenAI Devs: 83.0% on GDPval, 75.0% on OSWorld‑Verified, 57.7% on SWE‑Bench Pro (Public), 54.6% on Toolathlon .
  • Efficiency + speed knobs in Codex: /fast mode delivers up to 1.5× faster performance across supported models (including GPT‑5.4) . Separately, a user report notes 1.5× speed at 2× credit consumption.
  • Steering mid-response: In ChatGPT, OpenAI says you can now interrupt GPT‑5.4 Thinking mid-response to add instructions or adjust direction, with steering rolling out on Android and web (iOS “coming soon”) .

Practical caveat on long context:

  • Even with a 1M context window, retrieval degrades at very large contexts. One reported MRCR v2 “needle-in-a-haystack” curve shows 97% at 16–32K tokens, 57% at 256–512K, and 36% at 512K–1M—prompting recommendations to compact regularly.

Relevant links:


2) Databricks releases KARL, an RL-trained “knowledge agent” aimed at grounded enterprise reasoning

Why it matters: KARL is a concrete example of applying RL to non-verifiable enterprise knowledge tasks (messy docs, long tool chains), and Databricks frames it as an “assembly line” for producing agents—important for teams trying to move beyond “RAG as a demo.”

What was announced:

  • What it is: KARL (Knowledge Agents from Reinforcement Learning) is an RL-trained agent for document-centric grounded reasoning over complex questions, “millions of documents,” “hundreds of tool calls,” and repeated context compression .
  • Performance framing: Databricks describes “frontier-level performance on complex knowledge workloads at a fraction of the cost and latency of leading proprietary models” .
  • Why RL here: Databricks emphasizes these enterprise tasks “are not strictly verifiable” like unit-test-style RL wins .
  • Mechanics (high level): Off-policy RL with synthetic data (OAPL), multi-task RL that generalizes, and “parallel thinking” test-time compute to manage latency .
  • RAG++++ detail: A VentureBeat summary highlights KARL matching frontier quality on messy enterprise data by running up to 200 vector searches per query.

Links:


3) FlashAttention‑4 goes GA; PyTorch adds a FlashAttention‑4 backend for FlexAttention

Why it matters: Attention kernels are a performance ceiling for both training and inference. FA4 is positioned as a Blackwell-era redesign that shifts bottlenecks away from softmax/SMEM limits, while PyTorch is trying to make these gains accessible for custom attention variants (not only a single “blessed” kernel).

What’s new:

  • FA4 GA: “FlashAttention‑4 is GA” .
  • Core performance claim: FA4 reaches ~1600 TFLOPs attention on Blackwell GPUs and is described as “pretty much at matmul speed,” by changing the algorithm/pipeline so softmax and shared memory bandwidth no longer dictate speed .
  • PyTorch integration: PyTorch added a FlashAttention‑4 backend to FlexAttention on Hopper and Blackwell GPUs; PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FA4 for custom attention variants . PyTorch reports 1.2× to 3.2× speedups over Triton on compute-bound workloads .
  • Transformers integration (in progress): A PR for FA4 integration into Hugging Face Transformers was shared (PR #42435) .

4) Anthropic–Pentagon escalation: “supply chain risk” designation + Amodei statement

Why it matters: This is a high-stakes governance signal: AI labs are increasingly treated as critical suppliers (and potential risks) in national-security procurement, with direct implications for enterprise adoption, contracts, and oversight.

Reported developments:

  • Designation: A post claims the Pentagon formally notified Anthropic it’s been deemed a “supply chain risk”.
  • Amodei response (as summarized): A memo-style summary says Amodei apologized for the tone of a leaked memo, said it was outdated/not his considered view, emphasized keeping warfighters equipped, and offered Claude to the military at nominal cost with forward-deployed engineer support .
  • Anthropic’s statement link: Anthropic shared a statement from Amodei: https://www.anthropic.com/news/where-stand-department-war.

“Anthropic has much more in common with the Department of War than we have differences.”


5) Security incident report: “Clinejection” installs a separate agent (OpenClaw) without consent

Why it matters: Agentic dev tools run with broad local permissions; supply-chain style incidents can turn “developer convenience” into fleet-wide risk.

  • A write-up alleges “every developer who installed or updated Cline got OpenClaw … installed globally on their machine without consent,” describing it as “malicious agent injection” and noting OpenClaw has “full system access” .

Details: https://grith.ai/blog/clinejection-when-your-ai-tool-installs-another

Research & Innovation

Why it matters: This week’s research is converging on a few themes: RL methods for messy tasks, hybrid architectures for scaling efficiency, and benchmarks that better approximate real agent constraints (implicit rules, over/underthinking, interaction).

Open models + hybrid architectures

  • OLMo Hybrid (AI2): Allen AI released OLMo Hybrid, mixing transformer attention with linear RNN layers; the team claims hybrid models are “strictly more expressive” than either alone and that this translates to better scaling (49% fewer tokens to match OLMo 3 MMLU accuracy) .
  • Training “fully in the open”: Lambda says OLMo Hybrid 7B was trained in the open with training logs/recovery metrics/weights, using 3T tokens, 512 NVIDIA Blackwell GPUs, over 7 days, with 97% active training time and median recovery under 4 minutes.

RL + evaluation research (Meta FAIR ICLR set)

  • Meta FAIR says its team co-authored 7 papers accepted to ICLR, covering topics including joint safety agents (“Alignment Waltz”), judge RL (“J1”), experience synthesis for agent learning, and benchmarks for over/underthinking (“OptimalThinkingBench”) .

Data efficiency for language models

  • Semantic Tube Prediction (STP): STP (co-authored by Yann LeCun) is described as forcing hidden states into locally linear “semantic tubes,” matching baseline accuracy with 16× less training data. Paper: https://arxiv.org/abs/2602.22617.

Benchmarks for agent “implicit constraints”

  • Implicit Intelligence: Labelbox Applied ML Research introduced a benchmark testing whether agents respect unstated constraints across implicit reasoning, catastrophic risk, privacy/security, and accessibility . Paper: https://arxiv.org/abs/2602.20424.

Long-running agents: context compression as a core problem

  • Baseten KV-cache compression: Baseten reports one-shot compaction preserves detailed information with 65–80% accuracy at 2–5× compression (outperforming text summarization) and explores what happens when you compress repeatedly for persistent agents .

Products & Launches

Why it matters: The biggest product shifts are around agent scaffolding: better computer-use interfaces, orchestration/automation, and cross-tool connectivity (so agents can actually act, not just chat).

GPT‑5.4 distribution and integrations

  • GitHub Copilot: GitHub says GPT‑5.4 is now generally available and rolling out in Copilot; early testing highlights “enhanced logical reasoning and task execution” . Changelog: https://github.blog/changelog/2026-03-05-gpt-5-4-is-generally-available-in-github-copilot/.
  • Cursor: Cursor says “GPT 5.4 is now available in Cursor,” and they found it “more natural and assertive than previous models” .
  • Perplexity: Perplexity announced GPT‑5.4 and GPT‑5.4 Thinking availability for Pro/Max subscribers .
  • Arena: Arena reports GPT‑5.4 variants in Text/Vision/Code arenas and publishes ranking highlights (e.g., GPT‑5.4‑high tied with Gemini‑3‑Pro in Text Arena) .

Codex tooling updates

  • Codex app on Windows: OpenAI Devs announced Codex is now on Windows with a “native agent sandbox” and PowerShell support . Landing page: https://developers.openai.com/wendows.

Always-on agent operations

  • Cursor Automations: Cursor introduced Automations for always-on agents that run based on triggers and instructions you define . Blog: http://cursor.com/blog/automations.

Office / finance workflow tooling

  • ChatGPT for Excel: OpenAI launched “ChatGPT for Excel,” positioning it as bringing ChatGPT into spreadsheet workflows (“where decisions get made”) . Link: https://openai.com/index/chatgpt-for-excel/.

Video generation continues to split into “engines” vs “story tools”

  • Bing Video Creator: Microsoft rolled out “Sora 2 generative video” in Bing Video Creator, adding audio integration and watermark + C2PA credentials .
  • PAI (Utopai Studios): Utopai says PAI is rolling out as a long-form cinematic model with minutes-long continuous generation, character/scene consistency, and natural-language editing .
  • LTX‑2.3 on fal: fal says LTX‑2.3 is live with Pro (audio-to-video, retake, extend) and Fast modes plus sharper detail/cleaner audio/stronger motion .

Industry Moves

Why it matters: Distribution and enterprise positioning are starting to matter as much as raw model quality—especially for agents (where tool ecosystems + integrations decide what gets adopted).

  • Together AI fundraising (reported): Together AI is reportedly raising $1B at a $7.5B pre-money valuation, generating ~$1B ARR, with growth tied to moving from leasing GPUs to buying their own GPUs to rent out .
  • Codex user growth: Codex surpassed 2M+ active users, up 25% week-over-week (noted as before Windows + GPT‑5.4 launch) .
  • Claude adoption: One post claims “more than a million people are now signing up for Claude every day” .
  • Sakana AI × MUFG: Sakana AI and Mitsubishi UFJ Bank advanced their “AI Lending Expert” system from ~6-month PoC to real-case verification phase. Link: https://sakana.ai/mufg-ai-lending.

Policy & Regulation

Why it matters: Export controls and professional-liability rules can become hard constraints on where AI can be deployed—and what assistants can legally do.

  • US AI chip export restrictions (reported): A post says the Trump Administration is preparing a rule to restrict AI chip shipments globally without US approval, requiring permission for “virtually all exports of AI chips,” with Nvidia and AMD heavily impacted .
  • New York bill targeting “substantive responses”: A New York bill would ban AI from answering questions related to licensed professions (medicine, law, dentistry, nursing, psychology, social work, engineering, and more), and companies would be liable if chatbots give “substantive responses” in these areas .

Quick Takes

Why it matters: Smaller releases often become “quiet defaults” inside stacks—especially around evaluation, routing, and on-device constraints.

  • OpenAI: Chain-of-Thought controllability: OpenAI published a new evaluation suite/paper and says GPT‑5.4 Thinking shows “low ability to obscure its reasoning,” suggesting CoT monitoring remains a useful safety tool .
  • Gemini 3.1 Flash‑Lite preview (pricing): Google launched Gemini 3.1 Flash‑Lite in preview at $0.25 / 1M input tokens for high-volume developer workloads .
  • Perplexity “Model Council”: Perplexity launched a mode that runs GPT‑5.4, Claude Opus 4.6, and Gemini 3.1 Pro simultaneously and selects the best answer in one workflow .
  • OLMo Hybrid (distribution): AI2 released a family of OLMo Hybrid models (base/SFT/DPO) on Hugging Face .
  • FlashAttention‑4 resources: FA4 paper and code links shared (paper PDF + GitHub repo) .
  • LiquidAI on-device agent: A 24B-parameter model (2.3B active per token) is reported to fit in 14.5GB and run tool selection with 385ms average latency (67 tools, 13 MCP servers) with “zero network calls” .
  • OpenHands Critic v1.0: OpenHands released a “critic” model that scores coding agent traces to address the verification bottleneck, with real-time thumbs-up/down monitoring and support in SDK/CLI/Hugging Face .
  • LangChain skills evaluation: LangChain released an evaluation benchmark for LangSmith/LangChain “skills,” emphasizing variance across tasks for coding agents . Repo: https://github.com/langchain-ai/skills-benchmarks.
  • GitHub AGENTS.md guidance: GitHub’s analysis of 2,500+ repos suggests effective AGENTS.md files stay brief and include persona, exact commands, boundaries, and good output examples .
GPT-5.4 rolls out broadly as coding agents, hybrid open models, and interpretability funding accelerate
Mar 6
6 min read
201 docs
OpenAI
Ai2
swyx
+15
OpenAI rolled out GPT-5.4 (Thinking + Pro) across ChatGPT, the API, and Codex—highlighting steering mid-response, 1M-token context, and native computer use—alongside new safety research on chain-of-thought controllability. The digest also covers Cursor’s cloud agents workflow, Perplexity’s multi-model “Model Council,” AllenAI’s open Olmo Hybrid architecture release, Goodfire’s $150M fundraise, and fresh signals of agents moving into enterprise operations.

OpenAI launches GPT-5.4 (Thinking + Pro) across ChatGPT, API, and Codex

GPT-5.4 roll-out + headline capabilities

OpenAI announced GPT-5.4 is available now in the API and Codex, with a gradual rollout in ChatGPT starting today . OpenAI frames GPT-5.4 as combining advances in reasoning, coding, and agentic workflows into one frontier model .

Notable feature claims include:

  • Steering mid-response (interrupt the model and adjust direction)
  • 1M tokens of context
  • Better performance on knowledge work and web search, plus native computer use capabilities
  • “Most factual and efficient” (OpenAI claims fewer tokens and faster speed)

Steering availability: OpenAI says steering is available this week on Android and web, with iOS “coming soon”.

Details: http://openai.com/index/introducing-gpt-5-4/

Codex updates: /fast mode + GPT-5.4 improvements for tool use and frontend work

Sam Altman also highlighted that GPT-5.4 is introduced in Codex with improved token efficiency and better tool calling, computer use, and frontend development. OpenAI is also introducing /fast for a faster version of Codex .

Demo: computer use + frontend workflows (with self-testing efficiency claims)

In OpenAI’s GPT-5.4 Thinking demo, the model uses CUA (computer use) to test its own work (e.g., interacting with a 3D chess Electron app) . The presenter claims token use has dropped by 2/3 in some self-testing cases .

Windows: Codex app + “Windows-native agent sandbox”

OpenAI says the Codex app is now live on Windows. The app can run natively or in WSL with integrated terminals (PowerShell, Command Prompt, Git Bash, or WSL) .

OpenAI also described a Windows-native agent sandbox with OS-level controls that block filesystem writes outside a working folder and prevent outbound network access unless explicitly approved .

Safety research: Chain-of-Thought (CoT) controllability eval

OpenAI published a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. The company reports that GPT-5.4 Thinking shows low ability to obscure its reasoning, suggesting CoT monitoring remains a useful safety tool .

Paper: https://openai.com/index/reasoning-models-chain-of-thought-controllability/

Early tester feedback (including weaknesses flagged)

One tester wrote that after a week of testing, GPT-5.4 felt like “the best model in the world” and reduced their reliance on Pro modes . The same thread praised coding reliability in Codex and speed improvements from using fewer reasoning tokens .

That tester also listed weaknesses: “frontend taste” lagging competitors, missing obvious real-world context in planning, and stopping short before finishing tasks in OpenClaw . Sam Altman replied: “We will be able to fix these three things!” .

Coding agents: Cursor’s cloud agents push toward test-and-video workflows

Cursor’s “cloud agents” are described as having surpassed tab-autocomplete usage internally, reinforcing the claim that “the IDE is Dead” . In this model, agents do more end-to-end work and return artifacts that are easier to review than raw diffs.

Key product mechanics highlighted:

  • Automatic testing of changes before PR submission (with calibrated prompting and a /no test override)
  • Demo videos as an entry point for review, plus Storybook-style galleries
  • Remote VM access (VNC) for live interaction and iteration
  • A /repro workflow for bug reproduction + fix verification with before/after videos

The same discussion frames a near-term “big unlock” as widening throughput via parallel agents and subagents for context management and long-running threads .

Multi-model orchestration: Perplexity adds “Model Council” to Perplexity Computer

Perplexity launched Model Council inside Perplexity Computer, allowing users to run GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro simultaneously and select an orchestrator model . Perplexity’s positioning: “Three frontier models. One workflow. Best answer wins.”

Open models and new architectures: AllenAI releases Olmo Hybrid (7B)

Allen AI released Olmo Hybrid, a fully open 7B model combining transformer and linear RNN (gated delta net / GDN) layers in a 3:1 ratio with full attention . AllenAI and commentary in Interconnects describe it as a strong artifact for studying hybrid architectures, with theory and scaling experiments accompanying the release .

Interconnects reports:

  • Pretraining gains: about a 2× gain on training efficiency vs. Olmo 3 dense
  • Post-training results: mixed (knowledge wins, reasoning losses vs. dense), but still a strong open model overall
  • Practical challenge: OSS tooling and long-context inference issues can negate efficiency gains in practice right now

Resources:

Research workflow shift: Karpathy’s nanochat gets faster—and agents iterate on it autonomously

Andrej Karpathy reported nanochat can now train a GPT-2-capability model in 2 hours on a single 8×H100 node (down from ~3 hours a month ago), largely due to switching from FineWeb-edu to NVIDIA ClimbMix.

He also described AI agents automatically iterating on nanochat, making 110 changes over ~12 hours and improving validation loss from 0.862415 → 0.858039 for a d12 model without increasing wall-clock time (feature branch experimentation + merge when ideas work) . Karpathy later framed the “new meta” benchmark as: “what is the research org agent code that produces improvements on nanochat the fastest?”.

Interpretability funding + “Intentional Design”: Goodfire raises $150M Series B

Mechanistic interpretability startup Goodfire announced a $150M Series B at a $1.25B valuation, less than 2 years after founding . Alongside the raise, the company introduced Intentional Design: complementing reverse engineering with an approach focused on shaping the loss landscape to influence what models learn and how they generalize .

One proof-of-concept described is hallucination reduction using a probe trained to detect hallucinations for both runtime steering and RL reward signals, with a key training trick: run the probe on a frozen copy of the model to reduce incentives/ability to evade the detector during training .

Enterprise adoption notes: MUFG + Sakana AI lending agent moves to real-case testing; Microsoft updates Dragon Copilot

Sakana AI and Mitsubishi UFJ Bank (MUFG) advanced their “AI Lending Expert” agent system from a ~6-month PoC to a real-case verification phase, following their 2025 comprehensive partnership announcement .

Microsoft announced “big updates” to Dragon Copilot at HIMSS, introducing Work IQ to bring the right work context alongside patient data, aiming to reduce admin busywork and let clinicians focus more on patients .

Two cautionary notes circulating: benchmarks and moral-reasoning behavior

  • Benchmark noise: swyx cautioned against a viral claim that Claude Opus 4.6 had its “worst benchmark day,” pointing out that the SWE-bench author does not endorse “cheap sample” benchmarks and arguing 30–60× more compute is needed for statistically meaningful results .

  • Moral-reasoning oddities: Gary Marcus amplified a study thread reporting that GPT answered “yes” to torturing a woman to prevent a nuclear apocalypse but “absolutely not” to harassing a woman in the same scenario—described as a reversal that appeared only when the target was a woman . The thread argues this may reflect mechanical overgeneralization from RLHF rather than reasoning about underlying harms .

Self-help sobriety, database fundamentals, and AI-era signal vs. noise
Mar 6
4 min read
159 docs
Tim Ferriss
martin_casado
Jamie Turner
+3
Today’s highest-signal picks include Shaan Puri’s standout endorsement of Tim Ferriss on the “self-help trap,” Martin Casado’s pointer to a practical database fundamentals video, and Packy McCormick’s curated reading on AI’s impact on hiring and software production—plus two history/strategy recommendations about control, institutions, and high-stakes decision-making.

Most compelling recommendation: a check on “improvement” becoming its own addiction

The Self-Help Trap: What 20+ Years of “Optimizing” Has Taught Me — Tim Ferriss (blog post)

  • Type: Blog post
  • Author/creator: Tim Ferriss
  • Link/URL:https://x.com/tferriss/status/2029283224866770944
  • Recommended by: Shaan Puri (@ShaanVP)
  • Key takeaway (as shared): Shaan connects Ferriss’ point to seeing people at a Tony Robbins event who, after attending 3+ times, seemed to get “addicted to the medicine” .
  • Why it matters: It’s a rare, high-conviction endorsement (“my favorite thing Tim has written in 10+ years”) aimed directly at “fellow self improvers” —useful if you’re trying to ensure learning and self-improvement translate into changed behavior (not just repeated consumption).

Engineering fundamentals worth (re)loading into your brain

Video on database consistency + concurrency tradeoffs — @jamwt (video)

  • Type: Video (posted on X)
  • Author/creator: @jamwt
  • Link/URL:https://x.com/jamwt/status/2029353984792961278
  • Recommended by: Martin Casado (@martin_casado)
  • Key takeaway (as shared): A “fantastic overview” demystifying database consistency, isolation levels, record contention, and pessimistic vs. optimistic concurrency control tradeoffs.
  • Why it matters: Casado frames these as “super important concepts” if you’re “building production systems,” and suggests that as AI reduces the need to “memorize random framework nonsense,” this is the kind of broadly useful material to replace it with .

AI is changing workflows (and the signal-to-noise ratio)

The Tinder-ization of the Job Market — Matt Darling (essay)

  • Type: Article/essay
  • Author/creator: Matt Darling (The Argument)
  • Link/URL:https://www.theargumentmag.com/p/the-tinder-ization-of-the-job-market
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (as highlighted): The argument presented is that the job market is “stuck” (e.g., hiring rate averaged 3.3% in H2 2025) even while unemployment was 4.3% in January and prime-age (25–54) employment was 80.9%. One proposed mechanism: LLMs make it easier to apply to many jobs, increasing volume while weakening traditional signals (e.g., recruiting workload rose 26% in Q3 2024; 38% of job seekers reported “mass applying”; an applications-to-recruiter ratio “about 500–1”) .
  • Why it matters: If you hire, this is a concrete pointer to why screening may be breaking down under application flooding and AI-generated materials—and why process adjustments may be required .

Claude Code Is The Inflection Point — SemiAnalysis (newsletter post)

  • Type: Newsletter post
  • Author/creator: SemiAnalysis
  • Link/URL:https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (as quoted): “4% of GitHub public commits are being authored by Claude Code right now,” with a projection that it could reach “20%+ of all daily commits by the end of 2026” .
  • Why it matters: It’s a specific metric + trajectory claim that can recalibrate how quickly you expect AI-assisted coding to show up in day-to-day software production .

Tool Shaped Objects — Minutes (essay)

  • Type: Article/essay
  • Author/creator: Minutes (publication)
  • Link/URL:https://minutes.substack.com/p/tool-shaped-objects
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (quote highlighted in Not Boring):

“The market for feeling productive is orders of magnitude larger than the market for being productive.”

  • Why it matters: A compact framing for evaluating tools, dashboards, and workflows that optimize for the appearance of progress rather than actual outcomes .

Strategy + history: control, institutions, and who makes the call

The Control Revolution — James R. Beniger (book)

  • Type: Book
  • Author/creator: James R. Beniger
  • Link/URL:https://www.amazon.com/Control-Revolution-Technological-Economic-Information/dp/0674169867
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (as shared): Packy says he’s only started it, but flags Beniger’s idea that modern information technology emerged as a response to industrial scale and complexity—an industrial-era “crisis of control” where information/communication innovations lagged behind energy and manufacturing advances .
  • Why it matters: It’s a lens for thinking about why information systems and coordination mechanisms proliferate when production and complexity accelerate .

The Making of the Atomic Bomb — Richard Rhodes (book)

  • Type: Book
  • Author/creator: Richard Rhodes
  • Link/URL: Not provided in-source
  • Recommended by: Rory (20VC panel)
  • Key takeaway (as shared): Rory describes a lesson from reading it: military leadership “didn’t give a rat’s ass about the scientists” and that expecting “the luxury of getting to be part of the decision” is “unrealistic” .
  • Why it matters: A grounded reminder about institutional power and decision rights—especially relevant when technical teams assume they’ll control downstream use of what they build .

Owen (Intercom) on building Fin on top of an existing business (article/blog post)

  • Type: Article/blog post (not linked in-source)
  • Author/creator: Owen (Intercom)
  • Link/URL: Not provided in-source
  • Recommended by: Rory (20VC panel)
  • Key takeaway (as shared): Described as a “really great piece” about taking an existing business, “gut[ting] it,” and building Fin on top—emphasizing not just technical work but the fortitude to bet on the new thing, including “probably a year of feeling ridiculous” amid customer pressure .
  • Why it matters: A candid reference point for leaders attempting major AI-driven product transitions inside a live, legacy business .
Taste at speed, “GitHub for PM” context layers, and prototype-first workflows
Mar 6
9 min read
67 docs
Sachin Rekhi
Teresa Torres
Casey Winters
+9
This digest focuses on what changes when prototyping and code generation become cheap: PM leverage shifts toward fast judgment, context management, and quality measurement. It also includes practical playbooks for vibe prototyping, building persistent AI assistants (Gems/Projects), and a case-study-driven look at alignment, hardware constraints, and career tactics in AI-era product roles.

Big Ideas

1) When building gets cheap, the bottleneck becomes judgment ("taste at speed")

Teams are increasingly using AI to prototype so fast that the core constraint shifts from "can we build it?" to "should we ship it?". Aakash Gupta highlights Anthropic’s Claude Code team as an extreme example: they build hundreds of working prototypes before shipping a single feature, with Boris Cherny reportedly shipping 20–30 PRs/day across parallel Claude instances, and building "Cowork" in about 10 days.

This shows up in the broader PM community too: one Reddit post describes moving from a weeks-long spec → align → build → measure loop to putting rough versions in front of clients the same day, shrinking feedback loops from weeks to hours.

Why it matters: As prototyping cost collapses, PM leverage moves to rapid evaluation, ruthless focus, and decision quality—especially when stakeholders can react to a demo instead of a doc .

How to apply (this week):

  • Run a prototype-first cycle: build a rough demo, test it, then document decisions after validation (not before) .
  • Treat the PRD as a source of truth after learning, not an authorization artifact .

2) Alignment is becoming an AI problem: “GitHub for product management”

Teresa Torres spotlights Momental’s vision of a “GitHub for product management”: ingest org documents/transcripts/recordings and use AI agents to map them into a structured context layer, then surface “merge conflicts” in strategy (e.g., one team prioritizing retention while another prioritizes conversion) for humans to resolve .

Momental frames an internal “product chain” (signals → learnings → decisions → principles) and models org context as three trees (product tree, wisdom tree, people/time tree) . They emphasize metadata (who said it, when, and in what context) as critical for preventing hallucinations .

Why it matters: Even if engineers ship faster, PMs still spend large amounts of time coordinating alignment; Momental cites the reality that you “don’t know what you don’t know” when conflicts are implicit or distributed .

How to apply:

  • Treat misalignment like a first-class defect: explicitly track decisions with reasoning (not just outcomes), and make conflicts visible for resolution .
  • When adopting AI for org context, prioritize provenance/metadata over “just summarization” to reduce ambiguity .

3) Accurate AI agents require domain knowledge + proprietary data, plus a hybrid architecture

In high-stakes domains (fintech, legal, healthcare), accuracy is “the product,” and out-of-the-box LLMs aren’t naturally reliable enough . Two advantages help close the gap:

  • Domain knowledge: map workflows, stakeholders, and where a “90% answer” is acceptable vs. a failure .
  • Proprietary data: transaction-level data, interaction history, domain corpora for personalization and insights a general model can’t produce .

On architecture, Lisa Huang recommends a hybrid system: LLMs (including multi-agent workflows) where they fit, but deterministic code where you need reliability and control .

Why it matters: Without domain constraints, data advantage, and deterministic guardrails, teams can build fast but ship unreliable behavior in the places users care most .

How to apply:

  • Before building: map tasks/subtasks and define explicit accuracy thresholds by step .
  • Build hybrid: identify components that must be deterministic and keep them in code .

4) “Personalized AI” beats one-off chats: build persistent context assistants (Gems/Projects)

Lisa Huang argues the core issue with typical LLM usage is starting from scratch each chat—role, strategy, writing style, product history all reset . Gemini Gems (and analogous Claude Projects / custom GPTs) aim to retain context across work so you don’t re-brief every time .

Why it matters: Persistent context makes AI useful as a daily collaborator for writing, strategy, and synthesis—not just a “glorified search engine” .

How to apply: Start with three “foundation” assistants:

  • Writing clone: upload PRDs/emails/Slack messages for drafts in your voice .
  • Product strategy advisor: feed strategy docs, positioning, competitor analysis; use as a thought partner (not a replacement for judgment) .
  • User research synthesizer: upload transcripts/surveys/support tickets to extract themes you can’t manually read at scale .

Tactical Playbook

1) Prototype fast without derailing the org (vibe prototyping change management)

A practical framing: call prototypes what they are—not deployable code, but a substitute for a “clickable Figma,” and pilot with one team first .

Step-by-step:

  1. Prototype for yourself first: expect to revise requirements 5–15 times after seeing the first version and noticing what you forgot to specify .
  2. Bring it to the team: use it to get engineering/design feedback without claiming it replaces their work .
  3. Use it for stakeholders: prototypes create shared understanding; senior stakeholders often won’t read PRDs, but they will react to a working flow .
  4. Then validate with users: test with customers to learn quickly .

Prompting discipline (avoid “degrees of freedom”):

  • Provide enough context so the tool doesn’t guess across thousands of possibilities .
  • Include an object model (high-level entities/relationships) so the prototype isn’t built on wrong assumptions .
  • Use a “Goldilocks” amount of context—too little causes wrong guesses; too much can overwhelm context windows .

Build advice for speed: stay front-end as long as possible; delay auth/DB, and “fake it” with sample data (CSV/local storage) until needed . If you realize you’re on the wrong path, restart (“nuke from orbit”) because regenerating is cheap .


2) A quick-start map for the “vibe coding” tool landscape

Dan Olsen’s “Vibe Coding Spectrum” organizes tools from less technical (browser/UI-first) to more technical (IDE/CLI/code-first), with suggested entry points by role :

  • Designer-friendly: Figma Make, Magic Patterns
  • PM-friendly: Lovable, Bolt, Base44
  • More technical: Replit, V0
  • Developer tools: Cursor, GitHub Copilot (and others)

How to apply: start on the left where you can iterate quickly, then migrate right only when you hit constraints .


3) Build a Gem (persistent copilot) with a PM-friendly workflow

Step-by-step:

  1. Write detailed instructions: a full page of context (role, audience, format preferences). Avoid vague prompts like “help me write better” .
  2. Upload your knowledge files: PRDs, emails, competitor teardowns, roadmaps; Gemini Gems rely strictly on instructions + files, so update the files as context changes .
  3. Iterate like a mini product: refine instructions/knowledge over time .

Lisa Huang’s suggested scale: a PM may end up with ~20 Gems/Projects across workflows .


4) Measuring AI agents: a three-layer scoreboard (in order)

  1. Quality: ask “is the AI doing what it’s supposed to do?” via evals, human annotators, and LLM judges (each scales differently) .
  2. Product metrics: adoption, usage, retention, CSAT; also track qualitative signals (social, customer conversations, support tickets) .
  3. Business impact: revenue attribution, retention influence, ARR contribution—tracked consistently on the business scorecard .

The sequence matters: jumping to business impact without a quality foundation is unstable measurement .

Case Studies & Lessons

1) Building AI into hardware changes the design space (Meta Ray-Ban)

Lisa Huang describes constraints that “pure software” teams often don’t face: weight, battery life, privacy, bystander concerns, and even partner pace differences (e.g., Luxottica vs. a Silicon Valley engineering org) . She flags an important trade-off: cloud processing is the default today, but on-device is positioned as the future—especially because “privacy wins over performance” for a device worn on your face all day .

Takeaway: Don’t “fall in love with the technology.” The best AI products sit at the intersection of what users need and what the tech can reliably do today; build fast, observe behavior, and update assumptions .


2) Standing out in AI PM interviews: do the work before you’re asked

Aakash Gupta relays a hiring story from Lisa Huang: a candidate with zero AI experience stood out by watching three hours of TikTok videos from coaches working with small businesses, then bringing synthesized user needs into the interview. No other candidate did comparable pre-work .

Takeaway: The differentiator wasn’t AI credentials—it was initiative and user-centric research depth .


3) Even agents need aligned context (Momental’s pivot)

Momental’s founders described building a “product team of agents” (developer agent, PM agent doing slides/sprint planning), but discovered the agents asked endless sensible questions—mirroring the same alignment problems real teams have. The insight: they hadn’t solved alignment; they needed a context foundation first .

Takeaway: Multi-agent systems can amplify the demand for clear shared context—speed doesn’t remove coordination problems .


4) Cheap code can lead to “shipping slop” unless strategy and focus stay sharp

Casey Winters argues code is now “incredibly cheap,” which can become an excuse for a lack of strategic thinking about what’s worth building . He describes incumbents “DDoS’ing” customers with too many features and notes that running multiple agents doesn’t guarantee value—often it produces “slop” without product sense and business strategy .

Takeaway: Higher build throughput increases the penalty for weak focus: customers get overwhelmed and teams lose clear signal on what’s working .

Career Corner

1) The PM role is shifting toward hybrid builders (judgment stays core)

Aakash Gupta’s framing: AI won’t replace PMs, but it will automate or accelerate execution work (PRDs, mocks, roadmaps, data pulls). Product judgment—deciding in ambiguity what’s worth doing—remains core . Structural changes follow: PM-to-engineer ratios compress and PM expectations shift toward prototyping/design/coding enough to communicate intent .

How to act on it: choose one build-adjacent skill (rapid prototyping, lightweight coding, or system prompt + eval design) and ship artifacts regularly .


2) Breaking into AI PM: remove the “I don’t work on AI” excuse

Gupta’s roadmap includes:

  • Get direct AI experience in-role if possible; otherwise build on the side .
  • Invest in network and referrals (he emphasizes referrals still matter) .
  • Treat interview prep as a skill: practice out loud, get mocks, drill the format (product sense, execution, behavioral, case questions) .

He also argues you don’t need permission, budget, or a team to build AI products—consumer tools provide access to the same models many companies build on, and many companies aren’t fine-tuning at all .


3) Product-manage your career (and keep empathy as the strategy anchor)

Deb Liu recommends treating your career with the same intentionality PMs apply to product roadmaps . She also anchors product strategy in empathy—“vision without customer pain is theater” .


4) Job market signal (EU): Technical Project Manager (AI & Web Infrastructure), Frankfurt

A Frankfurt-based technology startup is hiring a Technical Project Manager to coordinate product strategy and execution for the European market and translate technical capabilities into market-ready products . Responsibilities include market/competitor research, structuring product priorities, coordinating development cycles, supporting validation/evaluation, and exploring AI-based workflow tools . The post lists requirements like CS/technical background, web/cloud fundamentals, structured thinking, and interest in emerging AI tools . Apply via careers@novada.com.

Tools & Resources

  • Claude Code webinar recording (Sachin Rekhi): Rekhi hosted a live session with 1,500 PMs, covering why he views Claude Code as highly productive for PMs, showing 13 automation skills, and walking through setup (editors/terminals/voice tools) . Video link: https://www.youtube.com/watch?v=zsAAaY8a63Q.

  • Gemini Gems masterclass (Lisa Huang): Podcast episode URL: https://www.news.aakashg.com/p/lisa-huang-podcast. (Key build steps: detailed instructions, upload knowledge, iterate) .

  • Vibe Brief template + tool starting point: Dan Olsen recommends starting with Lovable by default and sharing a lightweight “vibe coding brief” at “bitly slash vibebrief” .

  • AI PM feedback loop (community writeup): Reddit thread link includes: https://www.clawrapid.com/en/blog/ai-pm-feedback-loop. One described workflow: rough requirements doc for Claude → prototyping → experimenting → PRD (source of truth) → ship .

  • A caution on “auto-invoked” AI skills: Rekhi noted that installing an auto-invoked “frontend-design” skill made his monthly NPS trend visualizations harder to read, and he prefers skills he can invoke manually .

Fertilizer disruption risk collides with corn acreage debates as Brazil’s Iran-linked corn trade faces uncertainty
Mar 6
8 min read
111 docs
This Week In Regenerative Agriculture
Regenerative Agriculture
Market Minute LLC
+5
Grain and livestock markets reacted to Middle East-driven input and energy uncertainty, while U.S. corn acreage expectations split between supply-disruption risks and claims that nitrogen is already prepaid. This digest also highlights practical agronomy tools, emerging sustainability/traceability workflows, and Brazil’s export exposure to Iran amid weather-driven production and logistics constraints.

1) Market Movers

War premium + inflation narrative lifts grains (U.S.)

Market commentary tied grain strength to inflationary buying and a perceived war premium as energy markets firmed during the Iran conflict .

Futures snapshot (Mar 5, early):

  • May corn $4.46 (+2.25¢)
  • May soybeans $11.74 (+4.5¢)
  • May Chicago wheat $5.75½ (+7.25¢)
  • May KC wheat $5.80½ (+8¢)
  • May spring wheat $6.14 (+4.75¢)

Corn: acreage debate intensifies; demand signals remain mixed (U.S.)

  • Acreage risk narrative: One Farm Journal segment framed U.S. corn acres as “hanging in the balance” as shippers try to move fertilizer out of the Middle East . It noted USDA had been expecting a 5M acre cut to 2026 corn acres before the latest fertilizer spike/disruption .
  • Counterpoint: Another analyst said feedback from subscribers was “almost unanimous” that nitrogen was prepaid and acreage plans aren’t changing—maintaining a 96.5M acre estimate and expecting the Iran situation to do “very, very little” to reduce corn acres .
  • Trade + demand notes:
    • A flash sale cited 5M bushels of corn sold to unknown destinations for delivery this marketing year .
    • U.S. ethanol production fell to 1.1M barrels/day (-1.6% WoW, +1.3% YoY), while stocks rose to 26.34M barrels (+2.7% WoW) .

Soybeans: holding near highs despite Brazil harvest (U.S. + Brazil)

  • Soybean futures were described as holding within 12–13 cents of multi-month highs despite an ongoing Brazilian harvest; the same commentary pointed to low U.S. farmer ownership of old-crop beans as limiting “natural selling” .
  • Separate analysis expected continued buying interest from China in U.S. beans and cited a tighter soybean balance sheet forecast (e.g., 265M bushels 2025/26 ending stocks, crush >2.6B bushels) as part of its pre-WASDE expectations .

Wheat: weather-driven strength in HRW; broader rally context

  • Forecasts showed dry/warm conditions for U.S. HRW wheat areas (western KS/eastern CO/southern NE/TX/OK), raising emergence/prospect concerns and supporting HRW relative strength .
  • KC wheat was described as maintaining an uptrend that began in December .

Livestock: boxed beef firm; cattle and hogs trend higher (U.S.)

  • Live cattle were reported higher (362–477) and feeders higher (672–765), with boxed beef up (Choice $388.57, +$0.52; Select $380.35, +$1.77) .
  • In another market segment, cattle were characterized as supported by tight supplies and robust demand, with expectations for steady-to-higher cash trade . Hogs were described as maintaining an uptrend with improving boxed beef/pork into grilling season .

2) Innovation Spotlight

Corn rootworm: Syngenta’s DuraStack (U.S.)

Syngenta highlighted DuraStack trait technology (available for the 2027 season) featuring three modes of action and a triple-Bt protein stack aimed at corn rootworm control .

High-horsepower tractor redesign: John Deere 8R / 8RX (U.S.)

John Deere described six redesigned models, including a flagship 540 HP (wheeled and 4-track), plus 440 and 490 variants (wheeled and 8RX 4-track) . The segment emphasized:

  • Central tire inflation system (CTIS) for transport vs. field traction
  • Transport capability at 60kph / 37mph
  • Engine braking and updates to suspension/steering and cab visibility/space

Residue-to-nutrient strategies in tight fertilizer markets (U.S.)

A no-till segment described Meristem’s Excavator residue breakdown product, applied in fall or spring to “eat the pith” and make residue easier to manage . It cited studies describing nutrient release equivalent to 100 lbs of a 10-30-30 fertilizer application, and suggested potential savings of $40–$50/acre.

It also highlighted:

  • UpShift C starter system (claimed to replace 30–50% of typical starter fertilizer cost) and a new version adding 1 pint zinc/acre and phenolic acids described as stress mitigators .
  • A hopper-applied zinc pail delivering 1.2 quarts (9% equivalent) zinc with talc/graphite at $3/acre (vs. $5–$7 for a quart of zinc) .

Regenerative systems: biochar + agrivoltaics (Germany / Mexico)

  • A German organic farm study reported that combining minimum tillage with deep-placed biochar (30 cm) increased native soil organic carbon by 2.24 Mg C ha⁻¹, with decreases in bulk density and higher microbial biomass carbon in the top 10 cm .
  • Researchers at Pitzer College proposed a “Regenerative Agrivoltaics” framework combining soil restoration and solar energy, suggesting agricultural productivity could increase by up to 70% while also improving photovoltaic performance (via ambient cooling effects) .

3) Regional Developments

U.S.: fertilizer logistics + acreage risk (and uncertainty)

A Farm Journal report noted analysts estimate fertilizer takes ~30 days to reach the U.S. from the Persian Gulf, plus another 3–4 weeks to reach farmers . It also warned delayed fall application or purchase could force later planting and less corn, while another comment referenced analysts suggesting losses of up to 1M corn acres per week under prolonged disruption .

Brazil: corn export exposure to Iran + corn/ethanol “verticalization” push

  • Mato Grosso (Brazil’s top corn producer) estimated 51.7M tons for 2025/26 and reported early-year exports of 2.53M tons to 28 countries .
  • Iran was described as a major destination: 9M tons shipped in 2024 (about 20% of Brazil’s corn exports) and roughly 80% of Iran’s corn imports sourced from Brazil .
  • In response to geopolitical risk and input costs, Canal Rural commentary argued for verticalizing corn by processing into ethanol and DDG (noting DDG’s 30% protein potential) . It also cited corn as 20% of Brazil’s ethanol mix and emphasized export outlets for ethanol (e.g., Korea and Vietnam, and interest from India) .

Brazil: excess rain and logistics disruptions in northern Mato Grosso (Marcelândia)

In Marcelândia (MT), rainfall totals were reported >2,200 mm with projections to 3,000 mm (vs. an average 1,800–2,000 mm), saturating soils and preventing machinery access . Reported impacts included:

  • Soy harvested at 28–30% moisture (described as about double ideal), driving quality/weight loss and discounts .
  • Estimated losses ranging 15–32% on properties, with a cited minimum 10% productivity loss .
  • Corn planting delayed beyond the ideal window due to rain .
  • Logistics pinch points: limited storage and truck backups, with concerns about road access and the MT-320 section “ceding” .

Brazil: second-crop corn planting delays + weather windows

  • Conab-linked reporting cited second-crop corn planting running about 4–5% behind last year due to excess moisture . Mato Grosso was cited at 85% planted and slightly ahead year-on-year, while São Paulo hadn’t started (awaiting rain), and Goiás/Paraná were flagged as delayed .
  • Bahia storms left 16 municipalities in emergency, with Jacobina reporting >150 mm in 12 hours and river overflow; forecasts suggested improvement then a return of heavier rain mid-month .

Policy / trade: Mercosur–EU agreement (Brazil)

Brazil’s Senate unanimously approved the Mercosur–EU free trade agreement, describing tariff reductions/elimination across >90% of trade; ratification is still required by other Mercosur countries and the EU .

4) Best Practices

Corn: crown rot risk management (field-level)

Ag PhD highlighted crown rot as a multi-pathogen problem (e.g., fusarium, anthracnose, charcoal rot, gibberella, pythium) and noted risk increases under plant stress (drought, insufficient fertility, insects/nematodes, wind/hail damage, high populations/weeds) . Practices cited to “keep it at bay” included:

  • Prioritize great drainage and excellent fertility (including high K, P, and micronutrients)
  • Improve seed treatment and consider an in-furrow fungicide such as Xyway

Grain marketing (producer panel tactics)

A producer panel described using working orders (including “odd numbers” to improve fill odds) and using put options to protect downside while staying open to upside around volatile headline-driven moves .

Sustainability + market access (Brazil): documentation as a practical workflow

Canal Rural coverage framed sustainability as increasingly tied to export market access through traceability and proof of practices, pushing producers toward:

  • Digital traceability for inputs and supplier origin documentation
  • Precision agriculture
  • Bioinputs / biofertilizers
  • Energy solutions like solar and biodigesters (biogas/biofertilizer from waste)

5) Input Markets

Fertilizer: supply chain timing, affordability pressure, and antitrust scrutiny (U.S.)

  • A Farm Journal segment said the corn-to-urea price ratio was already one of the second or third worst in history and was “quickly getting worse” .
  • DOJ is investigating multiple fertilizer companies (including Nutrien, Mosaic, CF Industries, Koch, and Yara) for alleged price collusion; the investigation was described as early-stage and examining potential civil and criminal antitrust violations .
  • Separate commentary cited market concentration figures from an industry watchdog (e.g., Nutrien/Mosaic controlling 90% of potash and phosphate capacity; Nutrien/CF/Koch/Yara controlling ~82% of nitrogen-based fertilizers) .

Freight + diesel cost pressures (Brazil)

Brazilian commentary anticipated higher global freight and diesel costs as oil rises during the conflict .

Crop protection drift risk (France/EU)

A discussion on prosulfocarb (a widely used herbicide in France) described it as highly volatile with dispersion over kilometers, with sales rising from ~1,000 tonnes (2012) to 7,400 tonnes (2022). The same post cited findings that two-thirds of fruit/vegetable samples tested contained residues, and 40% exceeded maximum permitted limits, alongside reports of organic crop rejections and financial losses .

6) Forward Outlook

Key near-term dates and decision points

  • Multiple segments pointed to the March planting intentions report as a key checkpoint for how farmers ultimately respond to fertilizer pricing/availability .

Corn: seasonality + fund positioning as a watch item (U.S.)

MarketMinute noted that new-crop corn has topped before April only three times historically (2013, 2024, 2025) . It also argued current fund positioning is “pretty much flat,” unlike last year when funds were “super long” (over +300k contracts in February) before liquidation into July—an argument presented against another early non-seasonal top this year .

Weather: what markets may (and may not) price right now

One market/weather segment emphasized that spring dryness typically matters less to markets than rainfall during late June to mid-July, while excessive rain would need to be extreme enough to create delayed planting to become a major issue .

Your time, back.

An AI curator that monitors the web nonstop, lets you control every source and setting, and delivers one verified daily brief.

Save hours

AI monitors connected sources 24/7—YouTube, X, Substack, Reddit, RSS, people's appearances and more—condensing everything into one daily brief.

Full control over the agent

Add/remove sources. Set your agent's focus and style. Auto-embed clips from full episodes and videos. Control exactly how briefs are built.

Verify every claim

Citations link to the original source and the exact span.

Discover sources on autopilot

Your agent discovers relevant channels and profiles based on your goals. You get to decide what to keep.

Multi-media sources

Track YouTube channels, Podcasts, X accounts, Substack, Reddit, and Blogs. Plus, follow people across platforms to catch their appearances.

Private or Public

Create private agents for yourself, publish public ones, and subscribe to agents from others.

Get your briefs in 3 steps

1

Describe your goal

Tell your AI agent what you want to track using natural language. Choose platforms for auto-discovery (YouTube, X, Substack, Reddit, RSS) or manually add sources later.

Stay updated on space exploration and electric vehicle innovations
Daily newsletter on AI news and research
Track startup funding trends and venture capital insights
Latest research on longevity, health optimization, and wellness breakthroughs
Auto-discover sources

2

Confirm your sources and launch

Your agent finds relevant channels and profiles based on your instructions. Review suggestions, keep what fits, remove what doesn't, add your own. Launch when ready—you can always adjust sources anytime.

Discovering relevant sources...
Sam Altman Profile

Sam Altman

Profile
3Blue1Brown Avatar

3Blue1Brown

Channel
Paul Graham Avatar

Paul Graham

Account
Example Substack Avatar

The Pragmatic Engineer

Newsletter · Gergely Orosz
Reddit Machine Learning

r/MachineLearning

Community
Naval Ravikant Profile

Naval Ravikant

Profile ·
Example X List

AI High Signal

List
Example RSS Feed

Stratechery

RSS · Ben Thompson
Sam Altman Profile

Sam Altman

Profile
3Blue1Brown Avatar

3Blue1Brown

Channel
Paul Graham Avatar

Paul Graham

Account
Example Substack Avatar

The Pragmatic Engineer

Newsletter · Gergely Orosz
Reddit Machine Learning

r/MachineLearning

Community
Naval Ravikant Profile

Naval Ravikant

Profile ·
Example X List

AI High Signal

List
Example RSS Feed

Stratechery

RSS · Ben Thompson

3

Receive verified daily briefs

Get concise, daily updates with precise citations directly in your inbox. You control the focus, style, and length.

Cursor Cloud Agents go video-first + test-first, while GPT-5.4 upgrades Codex and always-on automations spread
Mar 6
7 min read
147 docs
OpenAI
swyx
Salvatore Sanfilippo
+16
Cursor’s Cloud Agents show what “agentic IDE design” looks like in practice: dedicated VMs, end-to-end testing, demo videos, and Slack-first collaboration. Plus: GPT‑5.4’s Codex upgrades (/fast mode, Playwright skill, 1M context status), always-on Cursor Automations, and hard lessons on evaluation, manual testing, and CI prompt-injection security.

🔥 TOP SIGNAL

Cursor’s latest Cloud Agents push is a concrete “agentic IDE” redesign: agents run in dedicated VMs, test changes end-to-end, and return a demo video + a tested PR, with remote desktop/terminal access for quick human iteration . Cursor says this flow exists because reviewing code becomes the bottleneck once agents can generate large diffs—video is an easier first review surface (but not a code-review replacement) .

🛠️ TOOLS & MODELS

  • OpenAI — GPT-5.4 rollout (Thinking + Pro), unified frontier model

    • Rolling out in ChatGPT, and also available in the API and Codex.
    • OpenAI describes it as bringing advances in reasoning, coding, and agentic workflows into one model .
    • Practitioner note: Hanson Wang says Codex and Thinking models are now unified.
  • Codex — /fast mode (GPT-5.4)

    • Claimed 1.5x faster with “the same intelligence and reasoning” .
    • Tradeoff called out by the Codex team: 1.5x speed for 2x cost.
  • Codex — Playwright skill + frontend improvements (GPT-5.4 era)

    • Romain Huet says complex frontend work looks “noticeably better,” and calls out a new Playwright skill that lets Codex visually debug and test apps while it builds.
  • Cursor — GPT-5.4 support + 1M context status

    • Cursor says GPT-5.4 is now available and is “more natural and assertive,” leading on their internal benchmarks .
    • Cursor’s Jediah Katz reported an issue with 1M context in GPT-5.4 and said they were fixing it ASAP .
    • Follow-up: Katz says 1M context is now available for GPT-5.4 if you toggle Max Mode on (enterprise legacy pricing: coming behind a separate gpt-5.4-1m slug ).
  • Cursor — Automations (always-on agents)

    • Cursor announced Automations: “continuously monitor and improve your codebase,” running on triggers and instructions you define.
    • Cursor CEO Michael Truell says Automations already run thousands of times per day internally, powering self-healing CI, auto-approving PR flows, compute-intensive security review, and a team-wide memory system.
    • Jediah Katz highlights they can trigger on any event/webhook, run in the cloud (not dependent on one laptop), and are team-owned.
  • Local agents (privacy-driven) — Qwen 3.5 as “good enough” for some tasks

    • Salvatore Sanfilippo says Qwen 3.5 is the first time he feels local agents can work for simpler programming tasks on your own machine (not state of the art, but effective) .
    • He compares the 27B dense model (more stable, good for GPU) and 35B MoE (3B active) (faster iteration, maybe better in practice) .
  • Augment — “Intent” UI for large workloads

    • Theo describes Intent as a shift from chat/autocomplete toward a UI for planning and managing large agentic coding workloads.
    • He also highlights pulling context from Linear, Sentry, GitHub issues, or PRs to keep workstreams compatible .

💡 WORKFLOWS & TRICKS

1) Cursor’s “Cloud Agent” loop (test-first + video-first + HITL)

A replicable loop Cursor describes for cloud-agent work:

  • Kick off an agent in cursor.com/agents; it works longer because it tests end-to-end (starts dev servers, iterates) and aims to return a tested PR.
  • First review pass: watch the demo video (a faster entry point than reviewing a huge diff) .
  • If needed: use remote desktop (VNC-style) + terminal access to interactively verify behavior and iterate .
  • Testing controls:
    • Default behavior is calibrated testing: don’t test “very simple copy changes,” but test complex ones; configurable via agents.md.
    • Use /notest to force skipping tests .

2) Bugfixes that ship faster: /repro before/after videos

Cursor’s **/repro** pattern:

  • Agent reproduces the bug and records a video, then fixes and records an “after” video .
  • Cursor says this moves some bug classes from “hard to repro locally” to “merge in ~90 seconds” .

3) Parallelism you can actually review: Best-of-N via 20s videos

  • Cursor says demo videos made them use best-of-N more often because reviewing four 20-second videos is manageable vs reviewing 4× giant diffs.

4) Slack as the “new IDE” surface (team workflows)

  • Cursor engineers describe Slack threads as a dev surface: you can @cursor in issue/product channels to kick off a cloud agent; teammates can “follow up” in-thread with more context .
  • They say the human discussion shifts to the high-order decisions (“do we ship this?”, “is this the right UX?”) while the agent handles implementation .

5) Subagents for context + compute management

  • Cursor highlights subagents as a way to delegate across prompts/goals/models and keep context manageable .
  • Example: an explore subagent can be routed to a faster model to read lots of code quickly, then summarize back to the parent agent .

6) Long-running agent mode (“grind mode”)

  • Cursor describes a long-running mode (“grind mode”) that aligns on a plan first, then grinds until criteria are met—potentially for days .

7) “Meta-setup” is becoming its own benchmark (Karpathy)

  • Andrej Karpathy says he has agents iterating on nanochat automatically: agents work on feature branches, try ideas, merge improvements, and iterate .
  • In one snapshot he reports 110 changes in ~12 hours reducing validation loss from 0.862415 → 0.858039 (d12 model) with no wall-clock penalty .
  • He calls the real benchmark: “what is the research org agent code that produces improvements on nanochat the fastest?.

8) Let the model improve the model (Hanson Wang’s GPT-5.4 workflow)

  • Hanson Wang says he asked GPT-5.4-xhigh in Codex to autonomously iterate on Codex’s own system prompt; it ran >17 hours, executed 200+ evals, wrote scripts to monitor eval progress, and pruned unpromising branches .

9) Skills need evals (not vibes): LangChain’s skills benchmarking loop

  • LangChain’s Robert Xu outlines an evaluation pipeline: define tasks + define skills, run with/without skills, compare, iterate .
  • Reported outcome (their tests): Claude Code completed tasks 82% of the time with skills vs 9%without skills .
  • Practical detail: they stress consistent clean environments (they used a lightweight Docker scaffold) for reproducible agent tests .

10) Manual testing is still non-negotiable (and agents can help)

  • Simon Willison: “Just because code passes tests doesn’t mean it works as intended… Automated tests are no replacement for manual testing.
  • He recommends having agents execute what they wrote (e.g., Playwright for UI testing) instead of assuming correctness .
  • For evidence, Willison’s Showboat pattern records commands + outputs to discourage agents from writing what they hoped happened .

11) Security footgun: prompt-injected CI agents + cache poisoning (Cline)

  • Cline ran an issue-triage workflow using anthropics/claude-code-action@v1 on every newly opened GitHub issue with --allowedTools "Bash,Read,Write,...".
  • Because the workflow prompt included the untrusted issue title, an attacker could prompt-inject tool execution and use GitHub Actions cache behavior to poison shared caches and steal release secrets, leading to a compromised cline@2.3.0 release (later retracted) .

👤 PEOPLE TO WATCH

  • Jonas Nelle + Samantha Whitmore (Cursor) — unusually specific harness design details: test-first PRs, video review entrypoint, Slack-as-IDE, subagents, and long-running “grind mode” .
  • Michael Truell (Cursor) — adoption signal: Automations running thousands/day internally, including “compute-intensive security review” and team memory .
  • Hanson Wang (OpenAI/Codex) — concrete “agent improves agent” workflow (17h autonomous system-prompt iteration with 200+ evals) .
  • Andrej Karpathy — framing shift: optimize the agent org (meta-setup) and measure “time-to-improvement” loops .
  • Simon Willison — high-signal practical guidance across (1) agentic manual testing and (2) real-world agent CI security failures.
  • swyx — pushes for better rigor + tooling around agent reliability, including an open-sourced Claude compaction viewer for diagnosing bad compactions and a reminder that statistically meaningful SWE-bench comparisons can require 30–60x more compute than cheap samples .

🎬 WATCH & LISTEN

1) Cursor Cloud Agents: test + video + remote desktop as the new review loop (≈02:23–05:33)

Hook: why video is the “entry point” for reviewing agent output, and how remote desktop/terminal access closes the loop on real verification.

2) Slack as the collaboration surface for agents (≈20:57–23:26)

Hook: how agent threads + team follow-ups shift human work from “where does this if-statement go?” to product/UX decisions.

📊 PROJECTS & REPOS


Editorial take: Today’s theme is throughput via autonomous + parallel agents—and the tax you can’t dodge is verification (tests + manual evidence) and security boundaries around what those agents are allowed to touch.

GPT‑5.4 rolls out with native computer use; KARL and FlashAttention‑4 reshape the agent stack
Mar 6
9 min read
1027 docs
More Perfect Union
Lisan al Gaib
Tibo
+43
OpenAI’s GPT‑5.4 rollout dominates the cycle, bringing native computer use, tool-search efficiency, and 1M-token context (with real long-context caveats). Also: Databricks’ RL-trained KARL knowledge agent, FlashAttention‑4’s push into mainstream frameworks, a major Anthropic–Pentagon escalation, and a developer-agent supply-chain security incident.

Top Stories

1) OpenAI rolls out GPT‑5.4 (Thinking + Pro) with native computer use and 1M context

Why it matters: This is a consolidated “frontier model” push that pairs agentic coding + tool use + computer control with very long context, which changes what’s practical in production workflows (especially multi-step, tool-heavy tasks).

Key details (as announced across OpenAI + OpenAI DevRel):

  • Availability / SKUs: GPT‑5.4 is available now in the API and Codex, with GPT‑5.4 Thinking and GPT‑5.4 Pro rolling out in ChatGPT. In the API, it’s available as gpt-5.4 and gpt-5.4-pro.
  • Core capability bundle: Native computer-use capabilities; up to 1M tokens of context (Codex + API); “best-in-class agentic coding for complex tasks”; scalable tool search; more efficient reasoning for long, tool-heavy workflows .
  • Computer use specifics: OpenAI Devs says GPT‑5.4 can write Playwright code, read screenshots, and issue keyboard/mouse actions to operate computers, with steerable behavior and configurable confirmation policies .
  • Benchmarks shared by OpenAI Devs: 83.0% on GDPval, 75.0% on OSWorld‑Verified, 57.7% on SWE‑Bench Pro (Public), 54.6% on Toolathlon .
  • Efficiency + speed knobs in Codex: /fast mode delivers up to 1.5× faster performance across supported models (including GPT‑5.4) . Separately, a user report notes 1.5× speed at 2× credit consumption.
  • Steering mid-response: In ChatGPT, OpenAI says you can now interrupt GPT‑5.4 Thinking mid-response to add instructions or adjust direction, with steering rolling out on Android and web (iOS “coming soon”) .

Practical caveat on long context:

  • Even with a 1M context window, retrieval degrades at very large contexts. One reported MRCR v2 “needle-in-a-haystack” curve shows 97% at 16–32K tokens, 57% at 256–512K, and 36% at 512K–1M—prompting recommendations to compact regularly.

Relevant links:


2) Databricks releases KARL, an RL-trained “knowledge agent” aimed at grounded enterprise reasoning

Why it matters: KARL is a concrete example of applying RL to non-verifiable enterprise knowledge tasks (messy docs, long tool chains), and Databricks frames it as an “assembly line” for producing agents—important for teams trying to move beyond “RAG as a demo.”

What was announced:

  • What it is: KARL (Knowledge Agents from Reinforcement Learning) is an RL-trained agent for document-centric grounded reasoning over complex questions, “millions of documents,” “hundreds of tool calls,” and repeated context compression .
  • Performance framing: Databricks describes “frontier-level performance on complex knowledge workloads at a fraction of the cost and latency of leading proprietary models” .
  • Why RL here: Databricks emphasizes these enterprise tasks “are not strictly verifiable” like unit-test-style RL wins .
  • Mechanics (high level): Off-policy RL with synthetic data (OAPL), multi-task RL that generalizes, and “parallel thinking” test-time compute to manage latency .
  • RAG++++ detail: A VentureBeat summary highlights KARL matching frontier quality on messy enterprise data by running up to 200 vector searches per query.

Links:


3) FlashAttention‑4 goes GA; PyTorch adds a FlashAttention‑4 backend for FlexAttention

Why it matters: Attention kernels are a performance ceiling for both training and inference. FA4 is positioned as a Blackwell-era redesign that shifts bottlenecks away from softmax/SMEM limits, while PyTorch is trying to make these gains accessible for custom attention variants (not only a single “blessed” kernel).

What’s new:

  • FA4 GA: “FlashAttention‑4 is GA” .
  • Core performance claim: FA4 reaches ~1600 TFLOPs attention on Blackwell GPUs and is described as “pretty much at matmul speed,” by changing the algorithm/pipeline so softmax and shared memory bandwidth no longer dictate speed .
  • PyTorch integration: PyTorch added a FlashAttention‑4 backend to FlexAttention on Hopper and Blackwell GPUs; PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FA4 for custom attention variants . PyTorch reports 1.2× to 3.2× speedups over Triton on compute-bound workloads .
  • Transformers integration (in progress): A PR for FA4 integration into Hugging Face Transformers was shared (PR #42435) .

4) Anthropic–Pentagon escalation: “supply chain risk” designation + Amodei statement

Why it matters: This is a high-stakes governance signal: AI labs are increasingly treated as critical suppliers (and potential risks) in national-security procurement, with direct implications for enterprise adoption, contracts, and oversight.

Reported developments:

  • Designation: A post claims the Pentagon formally notified Anthropic it’s been deemed a “supply chain risk”.
  • Amodei response (as summarized): A memo-style summary says Amodei apologized for the tone of a leaked memo, said it was outdated/not his considered view, emphasized keeping warfighters equipped, and offered Claude to the military at nominal cost with forward-deployed engineer support .
  • Anthropic’s statement link: Anthropic shared a statement from Amodei: https://www.anthropic.com/news/where-stand-department-war.

“Anthropic has much more in common with the Department of War than we have differences.”


5) Security incident report: “Clinejection” installs a separate agent (OpenClaw) without consent

Why it matters: Agentic dev tools run with broad local permissions; supply-chain style incidents can turn “developer convenience” into fleet-wide risk.

  • A write-up alleges “every developer who installed or updated Cline got OpenClaw … installed globally on their machine without consent,” describing it as “malicious agent injection” and noting OpenClaw has “full system access” .

Details: https://grith.ai/blog/clinejection-when-your-ai-tool-installs-another

Research & Innovation

Why it matters: This week’s research is converging on a few themes: RL methods for messy tasks, hybrid architectures for scaling efficiency, and benchmarks that better approximate real agent constraints (implicit rules, over/underthinking, interaction).

Open models + hybrid architectures

  • OLMo Hybrid (AI2): Allen AI released OLMo Hybrid, mixing transformer attention with linear RNN layers; the team claims hybrid models are “strictly more expressive” than either alone and that this translates to better scaling (49% fewer tokens to match OLMo 3 MMLU accuracy) .
  • Training “fully in the open”: Lambda says OLMo Hybrid 7B was trained in the open with training logs/recovery metrics/weights, using 3T tokens, 512 NVIDIA Blackwell GPUs, over 7 days, with 97% active training time and median recovery under 4 minutes.

RL + evaluation research (Meta FAIR ICLR set)

  • Meta FAIR says its team co-authored 7 papers accepted to ICLR, covering topics including joint safety agents (“Alignment Waltz”), judge RL (“J1”), experience synthesis for agent learning, and benchmarks for over/underthinking (“OptimalThinkingBench”) .

Data efficiency for language models

  • Semantic Tube Prediction (STP): STP (co-authored by Yann LeCun) is described as forcing hidden states into locally linear “semantic tubes,” matching baseline accuracy with 16× less training data. Paper: https://arxiv.org/abs/2602.22617.

Benchmarks for agent “implicit constraints”

  • Implicit Intelligence: Labelbox Applied ML Research introduced a benchmark testing whether agents respect unstated constraints across implicit reasoning, catastrophic risk, privacy/security, and accessibility . Paper: https://arxiv.org/abs/2602.20424.

Long-running agents: context compression as a core problem

  • Baseten KV-cache compression: Baseten reports one-shot compaction preserves detailed information with 65–80% accuracy at 2–5× compression (outperforming text summarization) and explores what happens when you compress repeatedly for persistent agents .

Products & Launches

Why it matters: The biggest product shifts are around agent scaffolding: better computer-use interfaces, orchestration/automation, and cross-tool connectivity (so agents can actually act, not just chat).

GPT‑5.4 distribution and integrations

  • GitHub Copilot: GitHub says GPT‑5.4 is now generally available and rolling out in Copilot; early testing highlights “enhanced logical reasoning and task execution” . Changelog: https://github.blog/changelog/2026-03-05-gpt-5-4-is-generally-available-in-github-copilot/.
  • Cursor: Cursor says “GPT 5.4 is now available in Cursor,” and they found it “more natural and assertive than previous models” .
  • Perplexity: Perplexity announced GPT‑5.4 and GPT‑5.4 Thinking availability for Pro/Max subscribers .
  • Arena: Arena reports GPT‑5.4 variants in Text/Vision/Code arenas and publishes ranking highlights (e.g., GPT‑5.4‑high tied with Gemini‑3‑Pro in Text Arena) .

Codex tooling updates

  • Codex app on Windows: OpenAI Devs announced Codex is now on Windows with a “native agent sandbox” and PowerShell support . Landing page: https://developers.openai.com/wendows.

Always-on agent operations

  • Cursor Automations: Cursor introduced Automations for always-on agents that run based on triggers and instructions you define . Blog: http://cursor.com/blog/automations.

Office / finance workflow tooling

  • ChatGPT for Excel: OpenAI launched “ChatGPT for Excel,” positioning it as bringing ChatGPT into spreadsheet workflows (“where decisions get made”) . Link: https://openai.com/index/chatgpt-for-excel/.

Video generation continues to split into “engines” vs “story tools”

  • Bing Video Creator: Microsoft rolled out “Sora 2 generative video” in Bing Video Creator, adding audio integration and watermark + C2PA credentials .
  • PAI (Utopai Studios): Utopai says PAI is rolling out as a long-form cinematic model with minutes-long continuous generation, character/scene consistency, and natural-language editing .
  • LTX‑2.3 on fal: fal says LTX‑2.3 is live with Pro (audio-to-video, retake, extend) and Fast modes plus sharper detail/cleaner audio/stronger motion .

Industry Moves

Why it matters: Distribution and enterprise positioning are starting to matter as much as raw model quality—especially for agents (where tool ecosystems + integrations decide what gets adopted).

  • Together AI fundraising (reported): Together AI is reportedly raising $1B at a $7.5B pre-money valuation, generating ~$1B ARR, with growth tied to moving from leasing GPUs to buying their own GPUs to rent out .
  • Codex user growth: Codex surpassed 2M+ active users, up 25% week-over-week (noted as before Windows + GPT‑5.4 launch) .
  • Claude adoption: One post claims “more than a million people are now signing up for Claude every day” .
  • Sakana AI × MUFG: Sakana AI and Mitsubishi UFJ Bank advanced their “AI Lending Expert” system from ~6-month PoC to real-case verification phase. Link: https://sakana.ai/mufg-ai-lending.

Policy & Regulation

Why it matters: Export controls and professional-liability rules can become hard constraints on where AI can be deployed—and what assistants can legally do.

  • US AI chip export restrictions (reported): A post says the Trump Administration is preparing a rule to restrict AI chip shipments globally without US approval, requiring permission for “virtually all exports of AI chips,” with Nvidia and AMD heavily impacted .
  • New York bill targeting “substantive responses”: A New York bill would ban AI from answering questions related to licensed professions (medicine, law, dentistry, nursing, psychology, social work, engineering, and more), and companies would be liable if chatbots give “substantive responses” in these areas .

Quick Takes

Why it matters: Smaller releases often become “quiet defaults” inside stacks—especially around evaluation, routing, and on-device constraints.

  • OpenAI: Chain-of-Thought controllability: OpenAI published a new evaluation suite/paper and says GPT‑5.4 Thinking shows “low ability to obscure its reasoning,” suggesting CoT monitoring remains a useful safety tool .
  • Gemini 3.1 Flash‑Lite preview (pricing): Google launched Gemini 3.1 Flash‑Lite in preview at $0.25 / 1M input tokens for high-volume developer workloads .
  • Perplexity “Model Council”: Perplexity launched a mode that runs GPT‑5.4, Claude Opus 4.6, and Gemini 3.1 Pro simultaneously and selects the best answer in one workflow .
  • OLMo Hybrid (distribution): AI2 released a family of OLMo Hybrid models (base/SFT/DPO) on Hugging Face .
  • FlashAttention‑4 resources: FA4 paper and code links shared (paper PDF + GitHub repo) .
  • LiquidAI on-device agent: A 24B-parameter model (2.3B active per token) is reported to fit in 14.5GB and run tool selection with 385ms average latency (67 tools, 13 MCP servers) with “zero network calls” .
  • OpenHands Critic v1.0: OpenHands released a “critic” model that scores coding agent traces to address the verification bottleneck, with real-time thumbs-up/down monitoring and support in SDK/CLI/Hugging Face .
  • LangChain skills evaluation: LangChain released an evaluation benchmark for LangSmith/LangChain “skills,” emphasizing variance across tasks for coding agents . Repo: https://github.com/langchain-ai/skills-benchmarks.
  • GitHub AGENTS.md guidance: GitHub’s analysis of 2,500+ repos suggests effective AGENTS.md files stay brief and include persona, exact commands, boundaries, and good output examples .
GPT-5.4 rolls out broadly as coding agents, hybrid open models, and interpretability funding accelerate
Mar 6
6 min read
201 docs
OpenAI
Ai2
swyx
+15
OpenAI rolled out GPT-5.4 (Thinking + Pro) across ChatGPT, the API, and Codex—highlighting steering mid-response, 1M-token context, and native computer use—alongside new safety research on chain-of-thought controllability. The digest also covers Cursor’s cloud agents workflow, Perplexity’s multi-model “Model Council,” AllenAI’s open Olmo Hybrid architecture release, Goodfire’s $150M fundraise, and fresh signals of agents moving into enterprise operations.

OpenAI launches GPT-5.4 (Thinking + Pro) across ChatGPT, API, and Codex

GPT-5.4 roll-out + headline capabilities

OpenAI announced GPT-5.4 is available now in the API and Codex, with a gradual rollout in ChatGPT starting today . OpenAI frames GPT-5.4 as combining advances in reasoning, coding, and agentic workflows into one frontier model .

Notable feature claims include:

  • Steering mid-response (interrupt the model and adjust direction)
  • 1M tokens of context
  • Better performance on knowledge work and web search, plus native computer use capabilities
  • “Most factual and efficient” (OpenAI claims fewer tokens and faster speed)

Steering availability: OpenAI says steering is available this week on Android and web, with iOS “coming soon”.

Details: http://openai.com/index/introducing-gpt-5-4/

Codex updates: /fast mode + GPT-5.4 improvements for tool use and frontend work

Sam Altman also highlighted that GPT-5.4 is introduced in Codex with improved token efficiency and better tool calling, computer use, and frontend development. OpenAI is also introducing /fast for a faster version of Codex .

Demo: computer use + frontend workflows (with self-testing efficiency claims)

In OpenAI’s GPT-5.4 Thinking demo, the model uses CUA (computer use) to test its own work (e.g., interacting with a 3D chess Electron app) . The presenter claims token use has dropped by 2/3 in some self-testing cases .

Windows: Codex app + “Windows-native agent sandbox”

OpenAI says the Codex app is now live on Windows. The app can run natively or in WSL with integrated terminals (PowerShell, Command Prompt, Git Bash, or WSL) .

OpenAI also described a Windows-native agent sandbox with OS-level controls that block filesystem writes outside a working folder and prevent outbound network access unless explicitly approved .

Safety research: Chain-of-Thought (CoT) controllability eval

OpenAI published a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. The company reports that GPT-5.4 Thinking shows low ability to obscure its reasoning, suggesting CoT monitoring remains a useful safety tool .

Paper: https://openai.com/index/reasoning-models-chain-of-thought-controllability/

Early tester feedback (including weaknesses flagged)

One tester wrote that after a week of testing, GPT-5.4 felt like “the best model in the world” and reduced their reliance on Pro modes . The same thread praised coding reliability in Codex and speed improvements from using fewer reasoning tokens .

That tester also listed weaknesses: “frontend taste” lagging competitors, missing obvious real-world context in planning, and stopping short before finishing tasks in OpenClaw . Sam Altman replied: “We will be able to fix these three things!” .

Coding agents: Cursor’s cloud agents push toward test-and-video workflows

Cursor’s “cloud agents” are described as having surpassed tab-autocomplete usage internally, reinforcing the claim that “the IDE is Dead” . In this model, agents do more end-to-end work and return artifacts that are easier to review than raw diffs.

Key product mechanics highlighted:

  • Automatic testing of changes before PR submission (with calibrated prompting and a /no test override)
  • Demo videos as an entry point for review, plus Storybook-style galleries
  • Remote VM access (VNC) for live interaction and iteration
  • A /repro workflow for bug reproduction + fix verification with before/after videos

The same discussion frames a near-term “big unlock” as widening throughput via parallel agents and subagents for context management and long-running threads .

Multi-model orchestration: Perplexity adds “Model Council” to Perplexity Computer

Perplexity launched Model Council inside Perplexity Computer, allowing users to run GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro simultaneously and select an orchestrator model . Perplexity’s positioning: “Three frontier models. One workflow. Best answer wins.”

Open models and new architectures: AllenAI releases Olmo Hybrid (7B)

Allen AI released Olmo Hybrid, a fully open 7B model combining transformer and linear RNN (gated delta net / GDN) layers in a 3:1 ratio with full attention . AllenAI and commentary in Interconnects describe it as a strong artifact for studying hybrid architectures, with theory and scaling experiments accompanying the release .

Interconnects reports:

  • Pretraining gains: about a 2× gain on training efficiency vs. Olmo 3 dense
  • Post-training results: mixed (knowledge wins, reasoning losses vs. dense), but still a strong open model overall
  • Practical challenge: OSS tooling and long-context inference issues can negate efficiency gains in practice right now

Resources:

Research workflow shift: Karpathy’s nanochat gets faster—and agents iterate on it autonomously

Andrej Karpathy reported nanochat can now train a GPT-2-capability model in 2 hours on a single 8×H100 node (down from ~3 hours a month ago), largely due to switching from FineWeb-edu to NVIDIA ClimbMix.

He also described AI agents automatically iterating on nanochat, making 110 changes over ~12 hours and improving validation loss from 0.862415 → 0.858039 for a d12 model without increasing wall-clock time (feature branch experimentation + merge when ideas work) . Karpathy later framed the “new meta” benchmark as: “what is the research org agent code that produces improvements on nanochat the fastest?”.

Interpretability funding + “Intentional Design”: Goodfire raises $150M Series B

Mechanistic interpretability startup Goodfire announced a $150M Series B at a $1.25B valuation, less than 2 years after founding . Alongside the raise, the company introduced Intentional Design: complementing reverse engineering with an approach focused on shaping the loss landscape to influence what models learn and how they generalize .

One proof-of-concept described is hallucination reduction using a probe trained to detect hallucinations for both runtime steering and RL reward signals, with a key training trick: run the probe on a frozen copy of the model to reduce incentives/ability to evade the detector during training .

Enterprise adoption notes: MUFG + Sakana AI lending agent moves to real-case testing; Microsoft updates Dragon Copilot

Sakana AI and Mitsubishi UFJ Bank (MUFG) advanced their “AI Lending Expert” agent system from a ~6-month PoC to a real-case verification phase, following their 2025 comprehensive partnership announcement .

Microsoft announced “big updates” to Dragon Copilot at HIMSS, introducing Work IQ to bring the right work context alongside patient data, aiming to reduce admin busywork and let clinicians focus more on patients .

Two cautionary notes circulating: benchmarks and moral-reasoning behavior

  • Benchmark noise: swyx cautioned against a viral claim that Claude Opus 4.6 had its “worst benchmark day,” pointing out that the SWE-bench author does not endorse “cheap sample” benchmarks and arguing 30–60× more compute is needed for statistically meaningful results .

  • Moral-reasoning oddities: Gary Marcus amplified a study thread reporting that GPT answered “yes” to torturing a woman to prevent a nuclear apocalypse but “absolutely not” to harassing a woman in the same scenario—described as a reversal that appeared only when the target was a woman . The thread argues this may reflect mechanical overgeneralization from RLHF rather than reasoning about underlying harms .

Self-help sobriety, database fundamentals, and AI-era signal vs. noise
Mar 6
4 min read
159 docs
Tim Ferriss
martin_casado
Jamie Turner
+3
Today’s highest-signal picks include Shaan Puri’s standout endorsement of Tim Ferriss on the “self-help trap,” Martin Casado’s pointer to a practical database fundamentals video, and Packy McCormick’s curated reading on AI’s impact on hiring and software production—plus two history/strategy recommendations about control, institutions, and high-stakes decision-making.

Most compelling recommendation: a check on “improvement” becoming its own addiction

The Self-Help Trap: What 20+ Years of “Optimizing” Has Taught Me — Tim Ferriss (blog post)

  • Type: Blog post
  • Author/creator: Tim Ferriss
  • Link/URL:https://x.com/tferriss/status/2029283224866770944
  • Recommended by: Shaan Puri (@ShaanVP)
  • Key takeaway (as shared): Shaan connects Ferriss’ point to seeing people at a Tony Robbins event who, after attending 3+ times, seemed to get “addicted to the medicine” .
  • Why it matters: It’s a rare, high-conviction endorsement (“my favorite thing Tim has written in 10+ years”) aimed directly at “fellow self improvers” —useful if you’re trying to ensure learning and self-improvement translate into changed behavior (not just repeated consumption).

Engineering fundamentals worth (re)loading into your brain

Video on database consistency + concurrency tradeoffs — @jamwt (video)

  • Type: Video (posted on X)
  • Author/creator: @jamwt
  • Link/URL:https://x.com/jamwt/status/2029353984792961278
  • Recommended by: Martin Casado (@martin_casado)
  • Key takeaway (as shared): A “fantastic overview” demystifying database consistency, isolation levels, record contention, and pessimistic vs. optimistic concurrency control tradeoffs.
  • Why it matters: Casado frames these as “super important concepts” if you’re “building production systems,” and suggests that as AI reduces the need to “memorize random framework nonsense,” this is the kind of broadly useful material to replace it with .

AI is changing workflows (and the signal-to-noise ratio)

The Tinder-ization of the Job Market — Matt Darling (essay)

  • Type: Article/essay
  • Author/creator: Matt Darling (The Argument)
  • Link/URL:https://www.theargumentmag.com/p/the-tinder-ization-of-the-job-market
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (as highlighted): The argument presented is that the job market is “stuck” (e.g., hiring rate averaged 3.3% in H2 2025) even while unemployment was 4.3% in January and prime-age (25–54) employment was 80.9%. One proposed mechanism: LLMs make it easier to apply to many jobs, increasing volume while weakening traditional signals (e.g., recruiting workload rose 26% in Q3 2024; 38% of job seekers reported “mass applying”; an applications-to-recruiter ratio “about 500–1”) .
  • Why it matters: If you hire, this is a concrete pointer to why screening may be breaking down under application flooding and AI-generated materials—and why process adjustments may be required .

Claude Code Is The Inflection Point — SemiAnalysis (newsletter post)

  • Type: Newsletter post
  • Author/creator: SemiAnalysis
  • Link/URL:https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (as quoted): “4% of GitHub public commits are being authored by Claude Code right now,” with a projection that it could reach “20%+ of all daily commits by the end of 2026” .
  • Why it matters: It’s a specific metric + trajectory claim that can recalibrate how quickly you expect AI-assisted coding to show up in day-to-day software production .

Tool Shaped Objects — Minutes (essay)

  • Type: Article/essay
  • Author/creator: Minutes (publication)
  • Link/URL:https://minutes.substack.com/p/tool-shaped-objects
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (quote highlighted in Not Boring):

“The market for feeling productive is orders of magnitude larger than the market for being productive.”

  • Why it matters: A compact framing for evaluating tools, dashboards, and workflows that optimize for the appearance of progress rather than actual outcomes .

Strategy + history: control, institutions, and who makes the call

The Control Revolution — James R. Beniger (book)

  • Type: Book
  • Author/creator: James R. Beniger
  • Link/URL:https://www.amazon.com/Control-Revolution-Technological-Economic-Information/dp/0674169867
  • Recommended by: Packy McCormick (Not Boring)
  • Key takeaway (as shared): Packy says he’s only started it, but flags Beniger’s idea that modern information technology emerged as a response to industrial scale and complexity—an industrial-era “crisis of control” where information/communication innovations lagged behind energy and manufacturing advances .
  • Why it matters: It’s a lens for thinking about why information systems and coordination mechanisms proliferate when production and complexity accelerate .

The Making of the Atomic Bomb — Richard Rhodes (book)

  • Type: Book
  • Author/creator: Richard Rhodes
  • Link/URL: Not provided in-source
  • Recommended by: Rory (20VC panel)
  • Key takeaway (as shared): Rory describes a lesson from reading it: military leadership “didn’t give a rat’s ass about the scientists” and that expecting “the luxury of getting to be part of the decision” is “unrealistic” .
  • Why it matters: A grounded reminder about institutional power and decision rights—especially relevant when technical teams assume they’ll control downstream use of what they build .

Owen (Intercom) on building Fin on top of an existing business (article/blog post)

  • Type: Article/blog post (not linked in-source)
  • Author/creator: Owen (Intercom)
  • Link/URL: Not provided in-source
  • Recommended by: Rory (20VC panel)
  • Key takeaway (as shared): Described as a “really great piece” about taking an existing business, “gut[ting] it,” and building Fin on top—emphasizing not just technical work but the fortitude to bet on the new thing, including “probably a year of feeling ridiculous” amid customer pressure .
  • Why it matters: A candid reference point for leaders attempting major AI-driven product transitions inside a live, legacy business .
Taste at speed, “GitHub for PM” context layers, and prototype-first workflows
Mar 6
9 min read
67 docs
Sachin Rekhi
Teresa Torres
Casey Winters
+9
This digest focuses on what changes when prototyping and code generation become cheap: PM leverage shifts toward fast judgment, context management, and quality measurement. It also includes practical playbooks for vibe prototyping, building persistent AI assistants (Gems/Projects), and a case-study-driven look at alignment, hardware constraints, and career tactics in AI-era product roles.

Big Ideas

1) When building gets cheap, the bottleneck becomes judgment ("taste at speed")

Teams are increasingly using AI to prototype so fast that the core constraint shifts from "can we build it?" to "should we ship it?". Aakash Gupta highlights Anthropic’s Claude Code team as an extreme example: they build hundreds of working prototypes before shipping a single feature, with Boris Cherny reportedly shipping 20–30 PRs/day across parallel Claude instances, and building "Cowork" in about 10 days.

This shows up in the broader PM community too: one Reddit post describes moving from a weeks-long spec → align → build → measure loop to putting rough versions in front of clients the same day, shrinking feedback loops from weeks to hours.

Why it matters: As prototyping cost collapses, PM leverage moves to rapid evaluation, ruthless focus, and decision quality—especially when stakeholders can react to a demo instead of a doc .

How to apply (this week):

  • Run a prototype-first cycle: build a rough demo, test it, then document decisions after validation (not before) .
  • Treat the PRD as a source of truth after learning, not an authorization artifact .

2) Alignment is becoming an AI problem: “GitHub for product management”

Teresa Torres spotlights Momental’s vision of a “GitHub for product management”: ingest org documents/transcripts/recordings and use AI agents to map them into a structured context layer, then surface “merge conflicts” in strategy (e.g., one team prioritizing retention while another prioritizes conversion) for humans to resolve .

Momental frames an internal “product chain” (signals → learnings → decisions → principles) and models org context as three trees (product tree, wisdom tree, people/time tree) . They emphasize metadata (who said it, when, and in what context) as critical for preventing hallucinations .

Why it matters: Even if engineers ship faster, PMs still spend large amounts of time coordinating alignment; Momental cites the reality that you “don’t know what you don’t know” when conflicts are implicit or distributed .

How to apply:

  • Treat misalignment like a first-class defect: explicitly track decisions with reasoning (not just outcomes), and make conflicts visible for resolution .
  • When adopting AI for org context, prioritize provenance/metadata over “just summarization” to reduce ambiguity .

3) Accurate AI agents require domain knowledge + proprietary data, plus a hybrid architecture

In high-stakes domains (fintech, legal, healthcare), accuracy is “the product,” and out-of-the-box LLMs aren’t naturally reliable enough . Two advantages help close the gap:

  • Domain knowledge: map workflows, stakeholders, and where a “90% answer” is acceptable vs. a failure .
  • Proprietary data: transaction-level data, interaction history, domain corpora for personalization and insights a general model can’t produce .

On architecture, Lisa Huang recommends a hybrid system: LLMs (including multi-agent workflows) where they fit, but deterministic code where you need reliability and control .

Why it matters: Without domain constraints, data advantage, and deterministic guardrails, teams can build fast but ship unreliable behavior in the places users care most .

How to apply:

  • Before building: map tasks/subtasks and define explicit accuracy thresholds by step .
  • Build hybrid: identify components that must be deterministic and keep them in code .

4) “Personalized AI” beats one-off chats: build persistent context assistants (Gems/Projects)

Lisa Huang argues the core issue with typical LLM usage is starting from scratch each chat—role, strategy, writing style, product history all reset . Gemini Gems (and analogous Claude Projects / custom GPTs) aim to retain context across work so you don’t re-brief every time .

Why it matters: Persistent context makes AI useful as a daily collaborator for writing, strategy, and synthesis—not just a “glorified search engine” .

How to apply: Start with three “foundation” assistants:

  • Writing clone: upload PRDs/emails/Slack messages for drafts in your voice .
  • Product strategy advisor: feed strategy docs, positioning, competitor analysis; use as a thought partner (not a replacement for judgment) .
  • User research synthesizer: upload transcripts/surveys/support tickets to extract themes you can’t manually read at scale .

Tactical Playbook

1) Prototype fast without derailing the org (vibe prototyping change management)

A practical framing: call prototypes what they are—not deployable code, but a substitute for a “clickable Figma,” and pilot with one team first .

Step-by-step:

  1. Prototype for yourself first: expect to revise requirements 5–15 times after seeing the first version and noticing what you forgot to specify .
  2. Bring it to the team: use it to get engineering/design feedback without claiming it replaces their work .
  3. Use it for stakeholders: prototypes create shared understanding; senior stakeholders often won’t read PRDs, but they will react to a working flow .
  4. Then validate with users: test with customers to learn quickly .

Prompting discipline (avoid “degrees of freedom”):

  • Provide enough context so the tool doesn’t guess across thousands of possibilities .
  • Include an object model (high-level entities/relationships) so the prototype isn’t built on wrong assumptions .
  • Use a “Goldilocks” amount of context—too little causes wrong guesses; too much can overwhelm context windows .

Build advice for speed: stay front-end as long as possible; delay auth/DB, and “fake it” with sample data (CSV/local storage) until needed . If you realize you’re on the wrong path, restart (“nuke from orbit”) because regenerating is cheap .


2) A quick-start map for the “vibe coding” tool landscape

Dan Olsen’s “Vibe Coding Spectrum” organizes tools from less technical (browser/UI-first) to more technical (IDE/CLI/code-first), with suggested entry points by role :

  • Designer-friendly: Figma Make, Magic Patterns
  • PM-friendly: Lovable, Bolt, Base44
  • More technical: Replit, V0
  • Developer tools: Cursor, GitHub Copilot (and others)

How to apply: start on the left where you can iterate quickly, then migrate right only when you hit constraints .


3) Build a Gem (persistent copilot) with a PM-friendly workflow

Step-by-step:

  1. Write detailed instructions: a full page of context (role, audience, format preferences). Avoid vague prompts like “help me write better” .
  2. Upload your knowledge files: PRDs, emails, competitor teardowns, roadmaps; Gemini Gems rely strictly on instructions + files, so update the files as context changes .
  3. Iterate like a mini product: refine instructions/knowledge over time .

Lisa Huang’s suggested scale: a PM may end up with ~20 Gems/Projects across workflows .


4) Measuring AI agents: a three-layer scoreboard (in order)

  1. Quality: ask “is the AI doing what it’s supposed to do?” via evals, human annotators, and LLM judges (each scales differently) .
  2. Product metrics: adoption, usage, retention, CSAT; also track qualitative signals (social, customer conversations, support tickets) .
  3. Business impact: revenue attribution, retention influence, ARR contribution—tracked consistently on the business scorecard .

The sequence matters: jumping to business impact without a quality foundation is unstable measurement .

Case Studies & Lessons

1) Building AI into hardware changes the design space (Meta Ray-Ban)

Lisa Huang describes constraints that “pure software” teams often don’t face: weight, battery life, privacy, bystander concerns, and even partner pace differences (e.g., Luxottica vs. a Silicon Valley engineering org) . She flags an important trade-off: cloud processing is the default today, but on-device is positioned as the future—especially because “privacy wins over performance” for a device worn on your face all day .

Takeaway: Don’t “fall in love with the technology.” The best AI products sit at the intersection of what users need and what the tech can reliably do today; build fast, observe behavior, and update assumptions .


2) Standing out in AI PM interviews: do the work before you’re asked

Aakash Gupta relays a hiring story from Lisa Huang: a candidate with zero AI experience stood out by watching three hours of TikTok videos from coaches working with small businesses, then bringing synthesized user needs into the interview. No other candidate did comparable pre-work .

Takeaway: The differentiator wasn’t AI credentials—it was initiative and user-centric research depth .


3) Even agents need aligned context (Momental’s pivot)

Momental’s founders described building a “product team of agents” (developer agent, PM agent doing slides/sprint planning), but discovered the agents asked endless sensible questions—mirroring the same alignment problems real teams have. The insight: they hadn’t solved alignment; they needed a context foundation first .

Takeaway: Multi-agent systems can amplify the demand for clear shared context—speed doesn’t remove coordination problems .


4) Cheap code can lead to “shipping slop” unless strategy and focus stay sharp

Casey Winters argues code is now “incredibly cheap,” which can become an excuse for a lack of strategic thinking about what’s worth building . He describes incumbents “DDoS’ing” customers with too many features and notes that running multiple agents doesn’t guarantee value—often it produces “slop” without product sense and business strategy .

Takeaway: Higher build throughput increases the penalty for weak focus: customers get overwhelmed and teams lose clear signal on what’s working .

Career Corner

1) The PM role is shifting toward hybrid builders (judgment stays core)

Aakash Gupta’s framing: AI won’t replace PMs, but it will automate or accelerate execution work (PRDs, mocks, roadmaps, data pulls). Product judgment—deciding in ambiguity what’s worth doing—remains core . Structural changes follow: PM-to-engineer ratios compress and PM expectations shift toward prototyping/design/coding enough to communicate intent .

How to act on it: choose one build-adjacent skill (rapid prototyping, lightweight coding, or system prompt + eval design) and ship artifacts regularly .


2) Breaking into AI PM: remove the “I don’t work on AI” excuse

Gupta’s roadmap includes:

  • Get direct AI experience in-role if possible; otherwise build on the side .
  • Invest in network and referrals (he emphasizes referrals still matter) .
  • Treat interview prep as a skill: practice out loud, get mocks, drill the format (product sense, execution, behavioral, case questions) .

He also argues you don’t need permission, budget, or a team to build AI products—consumer tools provide access to the same models many companies build on, and many companies aren’t fine-tuning at all .


3) Product-manage your career (and keep empathy as the strategy anchor)

Deb Liu recommends treating your career with the same intentionality PMs apply to product roadmaps . She also anchors product strategy in empathy—“vision without customer pain is theater” .


4) Job market signal (EU): Technical Project Manager (AI & Web Infrastructure), Frankfurt

A Frankfurt-based technology startup is hiring a Technical Project Manager to coordinate product strategy and execution for the European market and translate technical capabilities into market-ready products . Responsibilities include market/competitor research, structuring product priorities, coordinating development cycles, supporting validation/evaluation, and exploring AI-based workflow tools . The post lists requirements like CS/technical background, web/cloud fundamentals, structured thinking, and interest in emerging AI tools . Apply via careers@novada.com.

Tools & Resources

  • Claude Code webinar recording (Sachin Rekhi): Rekhi hosted a live session with 1,500 PMs, covering why he views Claude Code as highly productive for PMs, showing 13 automation skills, and walking through setup (editors/terminals/voice tools) . Video link: https://www.youtube.com/watch?v=zsAAaY8a63Q.

  • Gemini Gems masterclass (Lisa Huang): Podcast episode URL: https://www.news.aakashg.com/p/lisa-huang-podcast. (Key build steps: detailed instructions, upload knowledge, iterate) .

  • Vibe Brief template + tool starting point: Dan Olsen recommends starting with Lovable by default and sharing a lightweight “vibe coding brief” at “bitly slash vibebrief” .

  • AI PM feedback loop (community writeup): Reddit thread link includes: https://www.clawrapid.com/en/blog/ai-pm-feedback-loop. One described workflow: rough requirements doc for Claude → prototyping → experimenting → PRD (source of truth) → ship .

  • A caution on “auto-invoked” AI skills: Rekhi noted that installing an auto-invoked “frontend-design” skill made his monthly NPS trend visualizations harder to read, and he prefers skills he can invoke manually .

Fertilizer disruption risk collides with corn acreage debates as Brazil’s Iran-linked corn trade faces uncertainty
Mar 6
8 min read
111 docs
This Week In Regenerative Agriculture
Regenerative Agriculture
Market Minute LLC
+5
Grain and livestock markets reacted to Middle East-driven input and energy uncertainty, while U.S. corn acreage expectations split between supply-disruption risks and claims that nitrogen is already prepaid. This digest also highlights practical agronomy tools, emerging sustainability/traceability workflows, and Brazil’s export exposure to Iran amid weather-driven production and logistics constraints.

1) Market Movers

War premium + inflation narrative lifts grains (U.S.)

Market commentary tied grain strength to inflationary buying and a perceived war premium as energy markets firmed during the Iran conflict .

Futures snapshot (Mar 5, early):

  • May corn $4.46 (+2.25¢)
  • May soybeans $11.74 (+4.5¢)
  • May Chicago wheat $5.75½ (+7.25¢)
  • May KC wheat $5.80½ (+8¢)
  • May spring wheat $6.14 (+4.75¢)

Corn: acreage debate intensifies; demand signals remain mixed (U.S.)

  • Acreage risk narrative: One Farm Journal segment framed U.S. corn acres as “hanging in the balance” as shippers try to move fertilizer out of the Middle East . It noted USDA had been expecting a 5M acre cut to 2026 corn acres before the latest fertilizer spike/disruption .
  • Counterpoint: Another analyst said feedback from subscribers was “almost unanimous” that nitrogen was prepaid and acreage plans aren’t changing—maintaining a 96.5M acre estimate and expecting the Iran situation to do “very, very little” to reduce corn acres .
  • Trade + demand notes:
    • A flash sale cited 5M bushels of corn sold to unknown destinations for delivery this marketing year .
    • U.S. ethanol production fell to 1.1M barrels/day (-1.6% WoW, +1.3% YoY), while stocks rose to 26.34M barrels (+2.7% WoW) .

Soybeans: holding near highs despite Brazil harvest (U.S. + Brazil)

  • Soybean futures were described as holding within 12–13 cents of multi-month highs despite an ongoing Brazilian harvest; the same commentary pointed to low U.S. farmer ownership of old-crop beans as limiting “natural selling” .
  • Separate analysis expected continued buying interest from China in U.S. beans and cited a tighter soybean balance sheet forecast (e.g., 265M bushels 2025/26 ending stocks, crush >2.6B bushels) as part of its pre-WASDE expectations .

Wheat: weather-driven strength in HRW; broader rally context

  • Forecasts showed dry/warm conditions for U.S. HRW wheat areas (western KS/eastern CO/southern NE/TX/OK), raising emergence/prospect concerns and supporting HRW relative strength .
  • KC wheat was described as maintaining an uptrend that began in December .

Livestock: boxed beef firm; cattle and hogs trend higher (U.S.)

  • Live cattle were reported higher (362–477) and feeders higher (672–765), with boxed beef up (Choice $388.57, +$0.52; Select $380.35, +$1.77) .
  • In another market segment, cattle were characterized as supported by tight supplies and robust demand, with expectations for steady-to-higher cash trade . Hogs were described as maintaining an uptrend with improving boxed beef/pork into grilling season .

2) Innovation Spotlight

Corn rootworm: Syngenta’s DuraStack (U.S.)

Syngenta highlighted DuraStack trait technology (available for the 2027 season) featuring three modes of action and a triple-Bt protein stack aimed at corn rootworm control .

High-horsepower tractor redesign: John Deere 8R / 8RX (U.S.)

John Deere described six redesigned models, including a flagship 540 HP (wheeled and 4-track), plus 440 and 490 variants (wheeled and 8RX 4-track) . The segment emphasized:

  • Central tire inflation system (CTIS) for transport vs. field traction
  • Transport capability at 60kph / 37mph
  • Engine braking and updates to suspension/steering and cab visibility/space

Residue-to-nutrient strategies in tight fertilizer markets (U.S.)

A no-till segment described Meristem’s Excavator residue breakdown product, applied in fall or spring to “eat the pith” and make residue easier to manage . It cited studies describing nutrient release equivalent to 100 lbs of a 10-30-30 fertilizer application, and suggested potential savings of $40–$50/acre.

It also highlighted:

  • UpShift C starter system (claimed to replace 30–50% of typical starter fertilizer cost) and a new version adding 1 pint zinc/acre and phenolic acids described as stress mitigators .
  • A hopper-applied zinc pail delivering 1.2 quarts (9% equivalent) zinc with talc/graphite at $3/acre (vs. $5–$7 for a quart of zinc) .

Regenerative systems: biochar + agrivoltaics (Germany / Mexico)

  • A German organic farm study reported that combining minimum tillage with deep-placed biochar (30 cm) increased native soil organic carbon by 2.24 Mg C ha⁻¹, with decreases in bulk density and higher microbial biomass carbon in the top 10 cm .
  • Researchers at Pitzer College proposed a “Regenerative Agrivoltaics” framework combining soil restoration and solar energy, suggesting agricultural productivity could increase by up to 70% while also improving photovoltaic performance (via ambient cooling effects) .

3) Regional Developments

U.S.: fertilizer logistics + acreage risk (and uncertainty)

A Farm Journal report noted analysts estimate fertilizer takes ~30 days to reach the U.S. from the Persian Gulf, plus another 3–4 weeks to reach farmers . It also warned delayed fall application or purchase could force later planting and less corn, while another comment referenced analysts suggesting losses of up to 1M corn acres per week under prolonged disruption .

Brazil: corn export exposure to Iran + corn/ethanol “verticalization” push

  • Mato Grosso (Brazil’s top corn producer) estimated 51.7M tons for 2025/26 and reported early-year exports of 2.53M tons to 28 countries .
  • Iran was described as a major destination: 9M tons shipped in 2024 (about 20% of Brazil’s corn exports) and roughly 80% of Iran’s corn imports sourced from Brazil .
  • In response to geopolitical risk and input costs, Canal Rural commentary argued for verticalizing corn by processing into ethanol and DDG (noting DDG’s 30% protein potential) . It also cited corn as 20% of Brazil’s ethanol mix and emphasized export outlets for ethanol (e.g., Korea and Vietnam, and interest from India) .

Brazil: excess rain and logistics disruptions in northern Mato Grosso (Marcelândia)

In Marcelândia (MT), rainfall totals were reported >2,200 mm with projections to 3,000 mm (vs. an average 1,800–2,000 mm), saturating soils and preventing machinery access . Reported impacts included:

  • Soy harvested at 28–30% moisture (described as about double ideal), driving quality/weight loss and discounts .
  • Estimated losses ranging 15–32% on properties, with a cited minimum 10% productivity loss .
  • Corn planting delayed beyond the ideal window due to rain .
  • Logistics pinch points: limited storage and truck backups, with concerns about road access and the MT-320 section “ceding” .

Brazil: second-crop corn planting delays + weather windows

  • Conab-linked reporting cited second-crop corn planting running about 4–5% behind last year due to excess moisture . Mato Grosso was cited at 85% planted and slightly ahead year-on-year, while São Paulo hadn’t started (awaiting rain), and Goiás/Paraná were flagged as delayed .
  • Bahia storms left 16 municipalities in emergency, with Jacobina reporting >150 mm in 12 hours and river overflow; forecasts suggested improvement then a return of heavier rain mid-month .

Policy / trade: Mercosur–EU agreement (Brazil)

Brazil’s Senate unanimously approved the Mercosur–EU free trade agreement, describing tariff reductions/elimination across >90% of trade; ratification is still required by other Mercosur countries and the EU .

4) Best Practices

Corn: crown rot risk management (field-level)

Ag PhD highlighted crown rot as a multi-pathogen problem (e.g., fusarium, anthracnose, charcoal rot, gibberella, pythium) and noted risk increases under plant stress (drought, insufficient fertility, insects/nematodes, wind/hail damage, high populations/weeds) . Practices cited to “keep it at bay” included:

  • Prioritize great drainage and excellent fertility (including high K, P, and micronutrients)
  • Improve seed treatment and consider an in-furrow fungicide such as Xyway

Grain marketing (producer panel tactics)

A producer panel described using working orders (including “odd numbers” to improve fill odds) and using put options to protect downside while staying open to upside around volatile headline-driven moves .

Sustainability + market access (Brazil): documentation as a practical workflow

Canal Rural coverage framed sustainability as increasingly tied to export market access through traceability and proof of practices, pushing producers toward:

  • Digital traceability for inputs and supplier origin documentation
  • Precision agriculture
  • Bioinputs / biofertilizers
  • Energy solutions like solar and biodigesters (biogas/biofertilizer from waste)

5) Input Markets

Fertilizer: supply chain timing, affordability pressure, and antitrust scrutiny (U.S.)

  • A Farm Journal segment said the corn-to-urea price ratio was already one of the second or third worst in history and was “quickly getting worse” .
  • DOJ is investigating multiple fertilizer companies (including Nutrien, Mosaic, CF Industries, Koch, and Yara) for alleged price collusion; the investigation was described as early-stage and examining potential civil and criminal antitrust violations .
  • Separate commentary cited market concentration figures from an industry watchdog (e.g., Nutrien/Mosaic controlling 90% of potash and phosphate capacity; Nutrien/CF/Koch/Yara controlling ~82% of nitrogen-based fertilizers) .

Freight + diesel cost pressures (Brazil)

Brazilian commentary anticipated higher global freight and diesel costs as oil rises during the conflict .

Crop protection drift risk (France/EU)

A discussion on prosulfocarb (a widely used herbicide in France) described it as highly volatile with dispersion over kilometers, with sales rising from ~1,000 tonnes (2012) to 7,400 tonnes (2022). The same post cited findings that two-thirds of fruit/vegetable samples tested contained residues, and 40% exceeded maximum permitted limits, alongside reports of organic crop rejections and financial losses .

6) Forward Outlook

Key near-term dates and decision points

  • Multiple segments pointed to the March planting intentions report as a key checkpoint for how farmers ultimately respond to fertilizer pricing/availability .

Corn: seasonality + fund positioning as a watch item (U.S.)

MarketMinute noted that new-crop corn has topped before April only three times historically (2013, 2024, 2025) . It also argued current fund positioning is “pretty much flat,” unlike last year when funds were “super long” (over +300k contracts in February) before liquidation into July—an argument presented against another early non-seasonal top this year .

Weather: what markets may (and may not) price right now

One market/weather segment emphasized that spring dryness typically matters less to markets than rainfall during late June to mid-July, while excessive rain would need to be extreme enough to create delayed planting to become a major issue .

Discover agents

Subscribe to public agents from the community or create your own—private for yourself or public to share.

Active

Coding Agents Alpha Tracker

Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.

110 sources
Active

AI in EdTech Weekly

Weekly intelligence briefing on how artificial intelligence and technology are transforming education and learning - covering AI tutors, adaptive learning, online platforms, policy developments, and the researchers shaping how people learn.

92 sources
Active

Bitcoin Payment Adoption Tracker

Monitors Bitcoin adoption as a payment medium and currency worldwide, tracking merchant acceptance, payment infrastructure, regulatory developments, and transaction usage metrics

102 sources
Active

AI News Digest

Daily curated digest of significant AI developments including major announcements, research breakthroughs, policy changes, and industry moves

114 sources
Active

Global Agricultural Developments

Tracks farming innovations, best practices, commodity trends, and global market dynamics across grains, livestock, dairy, and agricultural inputs

86 sources
Active

Recommended Reading from Tech Founders

Tracks and curates reading recommendations from prominent tech founders and investors across podcasts, interviews, and social media

137 sources

Supercharge your knowledge discovery

Reclaim your time and stay ahead with personalized insights. Limited spots available for our beta program.

Frequently asked questions