Hours of research in one daily brief, on your terms.

Tell us what you need to stay on top of. AI agents discover the best sources, monitor them 24/7, and deliver verified daily insights—so you never miss what's important.

Set up your daily brief agent

Recent briefs

Git-First Agent Workflows and Harder Test Prompts Take the Lead
Mar 22
4 min read
70 docs
Theo - t3.gg
Yuchen Jin
Salvatore Sanfilippo
+5
The sharpest signal today came from practitioners tightening the loop around coding agents with tests, Git context, and clearer ownership. Also inside: Claude Code web's repo limit, Codex vs. Claude commit attribution, and the clips worth your time.

🔥 TOP SIGNAL

The strongest practical signal today: agent performance is still mostly a scaffolding problem. Simon Willison says tests, docs, CI/CD, and clean code make agents work better—and his own loop starts with uv run pytest; Salvatore Sanfilippo says generic "write tests" prompts miss the hard stuff, and recommends explicitly asking for edge cases, fragile implementation details, and random testing against a simpler reference implementation. Willison's follow-on warning matters just as much: code review is now the bottleneck, while cognitive debt remains unsolved.

🛠️ TOOLS & MODELS

  • Claude Code for web — current repo-auth ceiling: Simon says one session can't check out two private repos at once because Git operations go through a local proxy that only authenticates the repo attached to the session. He also says the docs don't mention this.
  • Claude Code vs Codex — commit metadata means adoption signals can lie: Claude Code auto-adds itself as a co-author on every commit; Codex doesn't. OpenAI engineer Tibo Sottiaux says Codex is designed so the user remains the owner and accountable party, even though that makes repo-level usage harder to observe.

"it exists to help you and it’s important that you remain the owner and accountable for your work without AI taking credit."

  • T3 Code vs Claude Code CLI — creator-posted RAM snapshot: Theo says T3 Code used 350.9 MB vs 635.5 MB for Claude Code CLI in his screenshot, and framed that as roughly 2x better efficiency.
  • Routing pattern worth copying: Matthew Berman describes a 3-tier stack—frontier models for exploratory work, Sonnet-class models for most execution, and local/fine-tuned models once a narrow workflow is ready for production. His own example was using Opus for front-end/HTML work; Jaden Clark described using a cheaper/default model for small personal tools where speed and cost matter more than max capability.

💡 WORKFLOWS & TRICKS

  • Bootstrap a session in 3 moves: (1) run uv run pytest, (2) ask for "recent changes" or "last three commits" so the agent runs git log, (3) only then split into 2-3 parallel sessions.
  • Use Git as an agent power tool, not just a backup: Ask for git status when the repo is messy—Willison says he uses that prompt surprisingly often—then let the agent work through conflicts with tests. For archaeology, have it search the reflog or other branches for lost code, or run git bisect; for cleanup, ask it to rewrite history with git reset --soft HEAD~1, split/combine commits, or extract a library into a new repo while preserving history.
  • Ask for adversarial tests: Tell the model to stress limit conditions and fragile implementation details, and to use random testing plus a simpler in-test reference implementation to check invariants. Sanfilippo says even a small wording change can strongly steer the model, and the resulting tests become guardrails for both AI-written changes and future refactors.
  • Assume review is the scarce resource: Faster generation just moves the pain to review. Willison's warning is blunt: code review is now the biggest slowdown, and "cognitive debt" is still unsolved.
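
Sanfilippo's random-testing pattern is easy to picture in code. A minimal Python sketch, with a hypothetical binary-search function standing in for the code under test (the function names and the task are ours, purely for illustration):

```python
import random

def fast_insert_pos(sorted_xs, target):
    # Hypothetical "fast" implementation under test: binary search.
    lo, hi = 0, len(sorted_xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def reference_insert_pos(sorted_xs, target):
    # Deliberately simple reference: first index whose value is >= target.
    for i, x in enumerate(sorted_xs):
        if x >= target:
            return i
    return len(sorted_xs)

def random_test(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        xs = sorted(rng.randint(-5, 5) for _ in range(rng.randint(0, 8)))
        t = rng.randint(-6, 6)
        # Invariant: fast and reference must always agree.
        assert fast_insert_pos(xs, t) == reference_insert_pos(xs, t), (xs, t)
    return trials

random_test()
```

The reference is deliberately dumb and obviously correct, and the random inputs are drawn from a tiny range so the fragile cases (duplicates, empty lists, targets outside the range) come up constantly.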

👤 PEOPLE TO WATCH

  • Simon Willison — published the first draft of Using Git with coding agents. Why it matters: it turns Git from a safety net into an active agent workflow for context loading, debugging, conflict recovery, bisecting, and history rewriting.
  • Salvatore Sanfilippo — Redis creator; today's high-signal contribution was a prompt pattern for stronger tests that targets brittle implementation details instead of shallow happy-path coverage.
  • Tibo Sottiaux — useful because he's surfacing product philosophy from inside Codex: ownership and accountability over brand visibility in commit history.
  • Theo — worth tracking if you care about coding-agent UX tradeoffs; he keeps posting blunt first-party comparisons while shipping T3 Code.

🎬 WATCH & LISTEN

  • 14:39-17:35 — Hard-test prompting that actually changes model behavior. Sanfilippo explains why "write tests" is too generic, and shows how to request edge-case stress plus random testing against a simpler reference implementation.
  • 1:13:24-1:16:46 — The sim-to-real warning for local/fine-tuned agents. Shaw Walters says harness-specific data can improve narrow tasks quickly, but may not transfer back to broader benchmarks and can even narrow the model's capability space.

📊 PROJECTS & REPOS

  • ELIZA OS — worth watching for routing and safety questions. Walters describes it as an open-source framework for building agents, games, and applications, with deployments ranging from an 8B quantized model up through Sonnet and Opus; he also says security is still the blocker for unsupervised browser + shell agents. Adoption signal: the show introduced it as "the most widely used open source framework for building autonomous agents."
  • Sentient Arena / EVO Skill — still pre-results, but the setup is concrete: the first arena uses Office QA for enterprise-style reading, calculation, and document analysis, and the first cohort closes in the first week of April. The notable mechanic is multi-proposal skill evolution from eval feedback; the team says that setup currently does much better with Opus + Claude Code-style workflows than with open harnesses/open models.

Editorial take: today's real edge was not a flashy new model—it was stronger guardrails around the ones we already have: tests first, Git history in context, and clear human ownership of the output.

Mid-Training Design, Open Model Coalitions, and Inference Hardware Lead the Week
Mar 22
10 min read
534 docs
Reuters
Demis Hassabis
Andrej Karpathy
+29
PRISM supplied unusually concrete evidence that mid-training choices shape what later RL can unlock, while NVIDIA and Huawei made consequential moves in open models and inference hardware. The rest of the cycle brought notable advances in video learning, robotics, agent infrastructure, and AI compliance.

Top Stories

Why it matters: The most consequential developments this cycle were about the infrastructure behind AI progress: how models are trained, how open ecosystems are organized, what hardware can lower inference costs, and how general robot models are being pushed toward precise control.

1) PRISM turns mid-training into a measurable design problem

PRISM frames mid-training as a distinct stage between pretraining and RL, where targeted high-quality data mixtures build reasoning foundations. The project ran controlled experiments on roughly 27B tokens across 7 models, 4 families, and 3B-24B parameters, spanning dense Transformers and attention-Mamba hybrids, while measuring changes in performance, weights, representations, and downstream RL.

"The single biggest lever in mid-training design is Data Composition."

Across those ablations, math-only improved math, math+code improved math and code, and math+code+science produced the best overall results while most improving GPQA-Diamond during later RL. The authors also reported that adding science during mid-training unlocked +17 to +28 points on GPQA-Diamond once RL was added later, while changing the RL data mix itself moved results by less than 2 points.

A separate timing result on Granite-4 Micro found that mid-training after long-context pretraining gave the largest gains in math, code, and science while preserving general reasoning; doing it at 8K context hurt long-context ability, though much of that could be restored with a brief extension phase and model merging. One practitioner summary distilled the practical upshot as 3-4x larger gains during later RL when mid-training is tuned well beforehand, while other practitioners emphasized the work's value as a comprehensive disambiguation of a stage many teams already use. Resources: project and paper.

Impact: PRISM makes mid-training look less like hidden craft knowledge and more like a controllable stage that determines what later RL can actually amplify.

2) NVIDIA is trying to industrialize open model development with the Nemotron Coalition

NVIDIA announced the Nemotron Coalition with Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam AI, and Thinking Machines Lab to develop the open-source Nemotron family of foundation models. NVIDIA's stated idea is to build shared high-end base models that outperform what any single company could build alone, then let partners specialize them for different applications.

The first project is pretraining Nemotron 4 base with Mistral, with later post-training involving more partners. NVIDIA also outlined expected roles including multimodal work from Black Forest Labs, agent systems expertise from LangChain, evaluation datasets and real-world performance requirements from Cursor, and applied-system feedback from Perplexity.

Impact: This is a coordinated attempt to make open foundation models into shared industrial infrastructure rather than one-off lab releases.

3) Huawei is pushing an inference-focused hardware response with Atlas 350

Huawei launched the Atlas 350 accelerator card, powered by its 950PR AI chip, at the Ascend AI Partner Summit on March 20. According to the cited report, Huawei says the card delivers 2.87x the single-card compute performance of NVIDIA's H20 and is currently the only product in China supporting FP4 low-precision inference.

The same report lists 112GB HBM, 60% higher multimodal generation throughput, 4x better memory-access efficiency for small operators, 1.56 PFLOPS at FP4 precision, 1.4 TB/s of memory bandwidth, and 600W TDP. One expert note added that FP4 support matters especially for staying competitive in inference, even without native FP4 training.

Impact: The significance here is not just raw chip specs. It is whether domestic Chinese hardware can materially improve inference cost and throughput at a time when deployment efficiency matters more and more.

4) Physical Intelligence's RL tokens target the precision gap in robotics

Physical Intelligence introduced RL tokens as compact snapshots of robot state that let a small model quickly learn and refine actions in real time. The company argues the bottleneck for general-purpose robot models is often the "last millimeter" of precision, where broad competence is not enough.

Its method compresses high-dimensional VLA embeddings into a low-dimensional token, trains that token with a reconstruction objective, and then uses a small actor-critic module to learn residual action corrections directly on the robot through trial and error. Reported results were robots that are up to 3x faster, make fewer mistakes, can beat human teleoperation in some cases, and learn with as little as 15 minutes of real-world practice. Full research: pi.website/research/rlt.

Impact: The design separates general policy generation from fast local correction, which could be an important pattern for getting broad robot models to reliable task execution.

Research & Innovation

Why it matters: The strongest research signals were about better use of depth, data, memory, and embodiment—areas that often move production systems more than a single benchmark headline.

  • Depth and information reuse: Attention Residuals replaces fixed residual weights with attention over preceding layer outputs to reduce hidden-state dilution; in a 48B model trained on 1.4T tokens, the authors report better gradient distribution and consistent downstream gains. MoDA tackles a similar problem by letting attention read key/value states from preceding layers, while keeping 97.3% of FlashAttention-2 efficiency at 64K context; in 1.5B models it improved perplexity by 0.2 and downstream task scores by 2.11% with a 3.7% FLOP increase.
  • State-space sequence models: Mamba-3 combines discretized SSM recurrence, complex-valued state updates, and a multi-input/multi-output formulation. At 1.5B parameters, it improved average accuracy by 1.8 points over Gated DeltaNet while using half the state size of Mamba-2.
  • Video and visual reasoning: V-JEPA 2.1 adds dense predictive loss, hierarchical self-supervision, and multimodal tokenizers, with reported 20-point gains in action anticipation and robotic grasping and new SOTA results on Ego4D, EPIC-KITCHENS, and TartanDrive. HopChain, from Qwen and Tsinghua LeapLab, synthesizes chained visual-reasoning data for RLVR; added to Qwen3.5 VL training, it improved 20 of 24 benchmarks and topped 50 accuracy points in the ultra-long-CoT regime.
  • Cheaper image generation: Apple researchers' Feature Auto-Encoder trains diffusion models on compressed embeddings from a pretrained vision model, with up to 7x faster training while keeping image quality comparable to state-of-the-art diffusion systems.
  • Memory and planning: GradMem writes context into compact memory states by optimizing memory tokens at test time with a reconstruction loss, rather than only encoding context in a forward pass. Temporal Straightening adds a curvature regularizer that makes latent trajectories more locally straight, aligning Euclidean and geodesic distances and improving goal-reaching success.
  • Evaluating scientific taste: A paper on Reinforcement Learning from Community Feedback trained a "Scientific Judge" on 700,000 citation-matched paper pairs to predict research impact, then used it as a reward model for a "Scientific Thinker" that proposed higher-impact ideas than baselines.
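
The GradMem mechanism (writing context into a compact memory state by optimizing it at test time against a reconstruction loss) can be illustrated with a toy linear decoder. This is a sketch of the idea only; the shapes, step size, and decoder are invented, not the paper's architecture:

```python
import numpy as np

def fit_memory(context, decoder, steps=100, lr=0.1, seed=0):
    """Optimize a small memory vector m at test time so that
    decoder @ m reconstructs the context (loss = ||D m - x||^2)."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=decoder.shape[1])
    for _ in range(steps):
        residual = decoder @ m - context       # reconstruction error
        m -= lr * 2.0 * decoder.T @ residual   # gradient step on the loss
    return m

rng = np.random.default_rng(1)
decoder, _ = np.linalg.qr(rng.normal(size=(8, 4)))  # orthonormal columns
context = decoder @ rng.normal(size=4)              # a reconstructible target
m = fit_memory(context, decoder)
final_loss = float(np.sum((decoder @ m - context) ** 2))
```

Scaling the idea up swaps the linear decoder for a frozen model and the vector for memory tokens, but the test-time optimization loop keeps this shape.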

Products & Launches

Why it matters: Product teams kept translating model progress into working systems—faster agent infrastructure, more enterprise control, more local deployment, and new interfaces that treat existing software as the substrate.

  • OpenAI agent infrastructure: OpenAI said agent workflows can now spin up containers for skills, shell, and code interpreter about 10x faster. The change comes from a container pool in the Responses API that reuses warm infrastructure instead of creating a full container for each session; OpenAI also published a hosted shell quickstart.
  • Enterprise agent stack: LangChain launched an enterprise agent platform built with NVIDIA AI. The stack supports AI-Q plus Deep Agents for enterprise search, shallow and deep research agents using Nemotron and frontier LLMs, LangSmith tracing, and connections to internal data through NeMo Agent Toolkit; LangChain linked a full guide.
  • Vision-native software control: Mat Velloso's Unswitch prototype uses vision to operate existing software "more like a person does." He says prompts are a last resort, and demos show multi-tab research compiled into documents or slides, screenshots turned into formatted Excel sheets with formulas, and spatial organization across files, calendars, contacts, and email without replacing the underlying apps. The prototype runs natively on Mac and Windows and was built without JS or Python.
  • Offline local AI stack: Project N.O.M.A.D. packages local AI models via Ollama + Open WebUI, full Wikipedia via Kiwix, offline maps, and a browser-based management UI into a system that runs without internet or telemetry after install. The project says it can be installed with one curl command on Debian-based systems and accessed across a local network as a headless server.
  • Agent skills as open source: MiniMax open-sourced an official skills repository for agents, with curated skills for iOS and Android development, Office file editing, and GLSL visual effects.
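
OpenAI's container-pool change is a classic object-pool move: pay the boot cost once and hand out warm instances. A minimal Python sketch of the pattern (all class and field names are invented; this is not OpenAI's implementation):

```python
from collections import deque

class Container:
    """Stand-in for a sandboxed execution environment."""
    def __init__(self, cid):
        self.cid = cid  # imagine the expensive boot happened here

class WarmPool:
    def __init__(self, size):
        # Boot the whole pool up front, once.
        self._free = deque(Container(f"warm-{i}") for i in range(size))
        self.cold_starts = 0

    def acquire(self):
        if self._free:
            return self._free.popleft()              # warm path: instant
        self.cold_starts += 1                        # cold path: boot now
        return Container(f"cold-{self.cold_starts}")

    def release(self, container):
        self._free.append(container)  # reset state, then recycle

pool = WarmPool(size=2)
a, b = pool.acquire(), pool.acquire()
pool.release(a)
c = pool.acquire()  # reuses the container `a` held; no new boot
```

The roughly 10x speedup claim is then just the ratio of a pool hand-off to a full container boot.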

Industry Moves

Why it matters: Corporate moves this cycle point to the next layer of competition: monetization, leadership, sector-specific deployment, and the training infrastructure other labs quietly standardize on.

  • OpenAI monetization: Reuters reported that OpenAI will begin showing ads to users of the free and Go versions of ChatGPT in the United States in the coming weeks.
  • DeepMind leadership: Google DeepMind appointed Jas Sekhon as chief strategy officer; Demis Hassabis highlighted Sekhon's prior role as Bridgewater's chief scientist and head of AI when introducing the hire.
  • AI in agriculture: Halter reached a $2B valuation. Its product is AI-powered collars that let ranchers herd cattle from their phones using sound and vibration cues, and Founders Fund is leading the round.
  • Training stack standardization: Multiple labs are reportedly using Megatron for training. Reflection AI and Periodic Labs were both cited, and one practitioner summarized the situation bluntly: for training MoEs, Megatron is "the only game in town."

Policy & Regulation

Why it matters: The legal and compliance edge of AI keeps moving from abstract debate to concrete distribution rules: authorship, app-store boundaries, and the operating cost of monitoring agents at scale.

  • Authorship: A legal explainer emphasized that under U.S. law, AI-generated art without human authorship does not get copyright protection; brands building on AI art were urged to understand that ownership position clearly.
  • Platform rules for AI coding apps: Replit said its App Store coding app has kept the same core generate-code, server-side compile, and webview-preview workflow for 4 years, and that Apple eventually acknowledged it was not violating guidelines. Follow-on commentary argued that the distinction between remotely hosted code and locally downloaded-and-run code may become important if Apple tightens rules around AI coding webviews.
  • Compliance cost: Fiddler's new TCO guide argues that evaluating agents with external LLMs creates a "Trust Tax" that can reach roughly $2.6M per year, because every trace adds external API cost on top of tooling fees.
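
The "Trust Tax" is just trace volume times per-trace judge cost, so a back-of-envelope calculator makes the sensitivity obvious. The volumes and prices below are illustrative assumptions, not Fiddler's figures:

```python
def trust_tax_per_year(traces_per_day, tokens_per_trace, usd_per_million_tokens):
    """Annual external-LLM cost of scoring every agent trace with a judge model."""
    daily_tokens = traces_per_day * tokens_per_trace
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost * 365

# Hypothetical workload: 500k traces/day, 2k judge tokens each, $10 per 1M tokens.
annual = trust_tax_per_year(500_000, 2_000, 10.0)  # 3,650,000.0 dollars/year
```

At that scale, halving judge tokens per trace, or sampling traces instead of scoring all of them, moves the bill by millions.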

Quick Takes

Why it matters: These smaller updates give a useful read on where deployment is heading: cheaper local models, practical agent evaluation, developer ergonomics, and lighter-weight coding stacks.

  • Local deployment: PinchBench results on Qwen3.5 27B using UnslothAI K_XL quantizations showed little degradation in best results; Q4_K_XL averaged about 84% with thinking enabled, Q3_K_XL remained viable at 14.5GB, and a later non-thinking run made Q3_K_XL the top performer for speed-conscious settings. One follow-up said this makes OpenClaw usable on a 16GB card with decent reliability.
  • Autonomous research, reality check: Karpathy's autoresearch package aims to let agents iterate on training code while humans iterate on prompts. In a real-scale test, Mikhail Parakhin ran 103 distributed experiments over a week and found one improvement, calling it a worse batting average than personal experimentation but still a "free" gain.
  • Frontend generation: OpenAI published frontend guidance for GPT-5.4 after one developer said the model can produce "pretty great frontends" when used with enough thought and intentionality.
  • Agent monitoring: LangChain published a conceptual guide arguing that agent observability needs a distinct production playbook because natural-language input is unbounded, prompts are sensitive to small changes, and multi-step reasoning is hard to anticipate in development.
  • Memory footprint: T3 Code claimed significantly lower RAM usage than Claude Code in one comparison—350.9 MB versus 635.5 MB—and said its Electron app was 2x more efficient than a Bun CLI in that setup.
  • Model release watch: MiniMax-M2.7-highspeed was spotted inside OpenCode without specs yet, and GLM-5.1 was teased as an incoming release.
  • Hiring signal: One engineer said interview loops are already changing in light of LLMs, with less weight on LeetCode-style screening.

TERAFAB Puts Space-Scale AI Infrastructure on the Table as Bengio Warns of a Safety Gap
Mar 22
3 min read
130 docs
Canada Info
DogeDesigner
Yoshua Bengio
+4
Tesla, SpaceX, and xAI outlined an unusually ambitious plan that ties chip manufacturing, power, satellites, and AI demand into one infrastructure story. Separately, Yoshua Bengio urged stronger AI guardrails and international coordination, while Andrej Karpathy offered a candid view of the tradeoffs of working inside frontier labs.

The dominant story

Tesla, SpaceX, and xAI outline TERAFAB and a space-first compute strategy

Tesla said it is building TERAFAB with SpaceX and xAI, describing it as a 1TW/year chip manufacturing facility that would combine logic, memory, and advanced packaging under one roof. Tesla's announcement and related posts tied the effort to projected demand from Optimus robots and solar-powered AI satellites, while arguing that terrestrial electricity limits mean much of the added compute would need to move to space.

Related posts sketched the broader stack around that thesis: a 100kW AI "Mini Sat" intended to scale into the megawatt range, and a D3 chip described as optimized for hotter operation in space to reduce radiator mass. Musk also argued that space solar becomes more attractive as launch costs fall, because adding power on Earth gets harder as land, siting, and local opposition increase.

"Most must necessarily go to space, as US electricity is only 0.5TW"

Why it matters: This is a much broader infrastructure claim than a new data center buildout. The TERAFAB framing links chip supply, power generation, and deployment architecture into one strategy, with space presented as the long-run answer to compute growth.

Policy signal

Bengio tells Canadian senators that capability growth is outpacing safeguards

In Senate testimony, Yoshua Bengio said AI capabilities are advancing rapidly while leading companies' efforts to mitigate risk are not keeping up. He pointed to current harms including deepfakes, scams, fraud, disinformation, and cases involving emotional attachment or "AI psychosis," and said misalignment problems can include deceptive or self-preserving behavior such as lying, hacking, or blackmail in experiments.

Bengio also warned that frontier AI power is concentrating in U.S. and Chinese firms, creating economic and sovereignty risks for countries that depend on foreign model access. His recommendation for Canada was stronger transparency and risk regulation, plus coordination with like-minded countries on national laws and international treaties; he cited the EU Code of Practice and California SB 53 as useful templates. He also said he has launched Law Zero and is involved in international AI safety efforts backed by 30 countries and multilateral bodies.

Why it matters: Bengio is framing AI governance as a combined safety, competitiveness, and sovereignty issue—not only a consumer-protection question. That makes this testimony a useful signal for how policy debates may broaden as access to frontier systems concentrates.

Industry dynamics

Karpathy argues for staying close to frontier labs without being fully absorbed by them

In a podcast discussion shared by Nathan Lambert, Andrej Karpathy said researchers can have substantial impact in ecosystem-level roles outside frontier labs, and argued that internal financial incentives and social pressure can make it hard to operate as a fully independent voice from inside them. At the same time, he said frontier labs remain opaque and close to the capability edge, so people who stay outside too long risk losing judgment about what is actually changing inside the systems.

His tentative solution was a rotation model: moving in and out of frontier labs to stay technically grounded without giving up autonomy altogether.

Why it matters: As AI talent and decision-making concentrate in a small number of organizations, Karpathy is describing a structural tension that affects research independence, public commentary, and how the wider field understands the frontier.

The Nature of Gothic and Merchant Ivory Stand Out in Today’s Resource List
Mar 22
2 min read
131 docs
最佳拍档
Patrick Collison
Ivan Zhao
Patrick Collison points readers to Ruskin’s The Nature of Gothic with a direct link and substantive excerpt, while Ivan Zhao highlights Merchant Ivory films as a source of humanistic perspective and optimism. Together, the day’s recommendations skew toward durable aesthetic judgment over tactical startup reading.

Most compelling recommendation

Patrick Collison’s Ruskin pick is the strongest save today because it comes with all three things that matter: a clear endorsement, a direct link, and a concrete glimpse of what he found worth reading.

"Just discovered Ruskin’s The Nature of Gothic. Remarkable essay:"

  • Title: The Nature of Gothic
  • Content type: Essay
  • Author/creator: Ruskin
  • Link/URL: https://www.gutenberg.org/files/30755/30755-h/30755-h.htm#page151
  • Who recommended it: Patrick Collison
  • Key takeaway: The passage he highlighted moves across Mediterranean and northern landscapes, animal life, and human making, arguing that artistic form should be understood in relation to the natural laws and conditions of the places people inhabit
  • Why it matters: This is not a generic link share. Collison points straight to the text and spotlights the specific mode of seeing—connecting geography, life, and craft—that made the essay stand out to him

A second signal: humanistic cinema as a source of optimism

Ivan Zhao’s recommendation is broader, but the conviction behind it is unusually strong: he said he and his wife watched more than 20 Merchant Ivory films, and he cited them when asked what makes him optimistic about the future.

  • Title: Merchant Ivory Productions film catalog (representative title: A Room with a View)
  • Content type: Films
  • Author/creator: Merchant Ivory Productions
  • Who recommended it: Ivan Zhao
  • Key takeaway: Zhao describes the films as low-budget adaptations of period novels with exquisite detail and visuals, centered on emotions, fates, and distinctively human traits in beautiful settings
  • Why it matters: He presents these films as a source of optimism because each work condenses valuable human qualities into stories about emotion and fate

Pattern worth noting

Today’s authentic recommendations lean away from tactical startup content and toward durable aesthetic judgment. One points to a text about how place shapes human expression; the other to films prized for their humanistic depth and beauty.

Agent-First Product Strategy, Evals for AI Features, and Better 1:1s
Mar 22
6 min read
30 docs
Productify by Bandan
Aakash Gupta
andrew chen
This issue covers a shift toward agent-first product design, why structured evals are becoming core PM infrastructure for AI features, and a practical 1:1 framework that surfaces blockers and growth conversations. It also includes growth defensibility guidance, eval-driven case studies, and a short list of resources worth reviewing.

Big Ideas

1) Agent-first products may win as callable primitives, not destinations

Andrew Chen’s argument is that many current AI chat panels and copilots are a transitional local maximum. The longer-term end state may look more like invisible infrastructure that agents orchestrate, with the human UI acting as a debug layer. In that world, products are better thought of as composable APIs or CLIs that expose a narrow, high-leverage capability agents can repeatedly choose.

Why it matters:

  • Distribution shifts from top of funnel to top of call stack; the winner is the default callable primitive in agent-generated plans
  • Product surface area may shrink toward tighter interfaces with opinionated defaults and structured outputs
  • Brand becomes partly machine-legible through reliability, latency, error rates, schema clarity, and integration ease
  • Moats may come from integration depth with agent ecosystems and becoming the sticky default in templates and workflows

How to apply: ask what minimal capability your product can expose that an agent would repeatedly select, then optimize for clean interfaces, structured outputs, and reliability at scale.
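
One way to read "callable primitive": the product is a single well-specified function with a declared schema and a structured result, so an agent can select it and branch on the output. A hypothetical sketch (tool name, schema, and logic are all invented):

```python
# A narrow capability exposed as a tool, not a destination UI.
TOOL_SPEC = {
    "name": "normalize_address",
    "description": "Canonicalize a postal address string.",
    "parameters": {
        "type": "object",
        "properties": {"raw": {"type": "string"}},
        "required": ["raw"],
    },
}

def normalize_address(raw: str) -> dict:
    """Structured output with an explicit status field so agents can branch."""
    cleaned = " ".join(raw.split()).title()
    return {"status": "ok", "normalized": cleaned}

result = normalize_address("  12  main st\napt 4 ")
# {'status': 'ok', 'normalized': '12 Main St Apt 4'}
```

Reliability, latency, and schema clarity then become the "brand" an agent sees; the human UI can sit on top as a debug layer.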

2) Paid acquisition can hide weak defensibility

"Paid acquisition is a tax on your product’s defensibility."

Chen’s warning, aimed at AI companies experimenting with paid marketing, is that if you cannot keep outspending incumbents and competitors, you are renting growth rather than building it.

Why it matters: growth quality determines whether scale improves your position or just increases your spend burden.

How to apply: treat paid as a tactic, not proof of product strength, and pressure-test whether your main channels get cheaper as the product grows.

3) For AI features, evals are becoming core PM infrastructure

Aakash Gupta’s key point is that the farther a team is from the end user, the more it needs structured evals. Teams where engineers are also the users can sometimes rely on intuition; teams farther from the user cannot.

"Evals are the new PRD"

Why it matters:

  • evals bridge the distance between builder and user
  • they create a repeatable quality bar instead of one-off demos
  • preserved failing evals become a durable asset when models change

How to apply: build evals that include failure cases, rerun those failures first when new models arrive, and improve the full loop of dataset, tool access, scoring, and prompting rather than only tuning the prompt.

Tactical Playbook

1) A fast eval loop for AI features

  1. Start with a concrete task and a simple system prompt
  2. Generate a test dataset for that task
  3. Connect the model to the real tools it needs, rather than judging it without working access
  4. Replace vague grading with a clearer scoring function; Gupta’s example used three levels instead of a fuzzy numeric scale
  5. Add few-shot examples and rerun
  6. Keep the failing evals and use them as your first regression suite when models change
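
Steps 4 and 6 are the ones teams most often skip, so here is a minimal sketch of a three-level scorer plus a failure-seeded regression suite. The task, the stub model, and the case data are invented for illustration:

```python
def score(expected, actual):
    """Three discrete levels instead of a fuzzy numeric scale."""
    if actual.strip() == expected.strip():
        return 1.0   # exactly right
    if expected.strip() in actual:
        return 0.5   # right answer buried in extra text
    return 0.0       # wrong or non-answer

def run_evals(model, cases):
    results = [(c, score(c["expected"], model(c["prompt"]))) for c in cases]
    mean = sum(s for _, s in results) / len(results)
    failures = [c for c, s in results if s < 1.0]  # the regression suite
    return mean, failures

def model_v1(prompt):
    """Stub standing in for a real LLM call."""
    return "I'd be happy to help with your tasks!"

cases = [
    {"prompt": "How many tasks are assigned to me?", "expected": "3"},
    {"prompt": "Which project is overdue?", "expected": "Billing revamp"},
]
mean, failures = run_evals(model_v1, cases)  # mean 0.0; both cases kept
```

When a new model ships, rerun the saved failures first: if old failures now pass, the suite shrinks and the delta is your headline number.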

In Gupta’s demo, that loop moved performance from 0 to 0.75 in about 20 minutes.

"If you only have evals that succeed, you don’t know what problems there are."

Why this matters: it turns AI product iteration into something PMs can measure, compare, and revisit.

2) Rebuild 1:1s so the direct report frames the conversation first

Productify’s framework recommends that the direct report’s agenda comes first, unless the manager has something genuinely time-critical.

A practical flow:

  1. Start with progress on key priorities and blockers that need manager help; this is not a status meeting
  2. Move into cross-functional relationships and team issues that need coaching
  3. Cover goals, aspirations, well-being, and leadership growth
  4. Put feedback for the manager on the agenda explicitly, rather than as an afterthought
  5. Handle administrative items
  6. Then have the manager share updates, context, specific goal follow-ups, and feedback

Two useful operating details:

  • if both sides arrive with full lists, write them independently and share them at the same time so the agenda comes from the overlap and gaps, not from whoever spoke first
  • if either side sends context beforehand, the other side should read it; that makes the conversation sharper and shorter

Why this matters: when the first part of the meeting clearly belongs to the direct report, issues surface that might otherwise never come up, and the goal is for the person to leave feeling seen, supported, and taken seriously.

Case Studies & Lessons

1) An AI assistant improved when the team fixed the whole eval system, not just the prompt

Gupta describes an assistant answering questions from Linear. The first run failed completely: when asked, “How many tasks are assigned to me?”, it responded with a generic offer to help, scoring 0 across the board. The improvement came from several coordinated changes: connecting Linear’s MCP server, giving the model access to real tools and telling it to use them, creating a better scoring function, and adding few-shot examples. About 20 minutes later, the score reached 0.75.

Lesson: if an AI feature is underperforming, do not assume the prompt is the only problem; the dataset, tool access, task setup, and evaluator may be where the real gains are.

2) Closed-loop experimentation produced gains on code humans had already optimized

Karpathy’s system runs a tight cycle: read the code and instructions, form a hypothesis, make one change, train, check the score, then either git commit or git reset based on the result. In Gupta’s examples, that loop found 20 improvements on code Karpathy had already optimized by hand, yielding an 11% speedup. Tobi Lutke applied the same pattern to Shopify’s Liquid templating engine, where 93 automated commits led to 53% faster rendering.
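The change-train-score-commit/reset cycle can be sketched as a loop. The score function and the injectable git runner below are illustrative stand-ins so the loop can be dry-run, not Karpathy's actual harness.

```python
# Minimal sketch of a closed-loop experiment cycle: score one change at a time,
# keep it when it beats the best score so far, otherwise discard it.
def experiment_loop(iterations, baseline, score_fn, git=print):
    """Return the best score reached plus a log of commit/reset decisions."""
    best, log = baseline, []
    for i in range(iterations):
        score = score_fn(i)              # one change + training run, scored
        if score > best:
            git(f"git commit -am 'experiment {i}: {score:.2f}'")  # keep it
            best = score
            log.append(("commit", score))
        else:
            git("git reset --hard")      # discard the change
            log.append(("reset", score))
    return best, log

# Dry run with deterministic scores instead of real training:
scores = [0.50, 0.62, 0.58, 0.71]
best, log = experiment_loop(4, 0.55, lambda i: scores[i], git=lambda cmd: None)
print(best)                              # → 0.71
print([action for action, _ in log])     # → ['reset', 'commit', 'reset', 'commit']
```

The safety property the lesson points at is visible in the structure: every change is either kept or fully discarded, so a bad experiment can never corrupt the baseline.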

Lesson: autonomous experimentation is most compelling when the objective is clearly scorable and the system can safely keep or discard each change.

Career Corner

1) Evals are becoming a PM skill, not just an ML concern

If your team is not close enough to “vibe check” with the actual end user, designing structured evals becomes part of product judgment. That includes deciding what success looks like, which failure cases to preserve, and what to retest when models change.

How to apply: treat your eval suite as a living product artifact alongside the spec, not a one-time launch task.

2) PM strategy in AI is shifting from screen design to capability design

As agents start selecting tools based on reliability, latency, error rates, and schema clarity, PMs may need to think less about destination UX alone and more about the minimal callable capability their product exposes. That is a different product design skill: making your product easy for machines to choose and compose.

How to apply: when reviewing a roadmap, ask not only "what is the new feature?" but also "what is the reusable capability?"

3) Listening is a leadership skill, and 1:1 structure signals whether you mean it

Productify’s 1:1 advice is less about etiquette than about who gets to frame the conversation. Explicitly putting the direct report first, adding feedback for the manager to the agenda, and reading pre-shared context all signal openness rather than control.

How to apply: use the agenda structure itself to show that blockers, relationships, and growth matter, not just updates.

Tools & Resources

  • Aakash Gupta on evals for PMs — useful if you are building AI features and need a practical example of dataset creation, tool connection, scoring design, and failure-case management
  • Autoresearch guide for PMs — a follow-on resource if you want to explore scorable experimentation loops more deeply
  • Most 1:1s are run the wrong way — a reusable template for direct-report-led 1:1s, including simultaneous agenda setting and pre-read norms
  • Andrew Chen on agent-first products — a compact strategy prompt for thinking about callable capabilities, agent distribution, and machine-legible product quality
  • Andrew Chen on paid acquisition — a short but useful gut check for growth plans that depend heavily on paid spend
Brazil Financing Strain, China Production Tech, and Water-Stress Tools Ahead of Planting
Mar 22
5 min read
76 docs
Regenerative Agriculture
农业致富经 Agriculture And Farming
Successful Farming
+4
Brazilian producers are managing tighter credit, softer commodity prices, and weather disruptions, while U.S. planting conditions improve in Iowa. The brief also highlights climate-resilient rice breeding, water-stress analytics, livestock and dairy management upgrades, and the input and financing tactics shaping next-season decisions.

1) Market Movers

  • Brazil: Input costs remain the main margin driver. War and geopolitical tension are raising fertilizer and diesel costs while commodity prices soften and the Selic rate sits at 14-14.75%. Excess rain is also disrupting the soy harvest in parts of Mato Grosso, while safrinha corn planting continues under financing pressure. Roughly 2,000 judicial recovery filings last year, up 56%, have worsened lenders' perception of sector risk.
  • United States - Iowa: March precipitation improved soil moisture after a record-dry winter, easing drought pressure as growers prepare for the 2026 planting season.
  • United States - poultry: USDA pushed a poultry payment rule to 2027, drawing criticism from farm groups.

2) Innovation Spotlight

  • China - climate-resilient rice breeding: Hefei breeder Zhang Qin, with 17 years in breeding, is developing hybrid rice with stronger heat tolerance and lodging resistance for more extreme weather. A Sanya breeding base of more than 100 mu and over 10,000 materials is being used to speed variety development. One cold-region line, Quanxin No. 5, posted a 13.5% yield gain in national northern trials and was described as having more than 10% large-scale yield potential. Work also includes combining high-yield backgrounds with anti-lodging traits and adding anti-insect genes from wild rice. The program was reported to have reached 800,000 mu and produced 1 billion jin of grain.
  • Water-stress analytics: In vineyards, AI models are being positioned to predict water stress before symptoms appear by tracking seven variables and combining climate, soil, and crop data, with the goal of reducing yield loss and protecting crop quality. The operational lesson is to capture field data fast enough to use it in season, including voice notes, WhatsApp messages, and photos.
  • Soil hydrodynamics sensing: Research highlighted in regenerative-ag circles used fiber-optic cables as soil-health sensor networks and found that tillage disturbance weakens moisture retention and drought resilience; the practical implication is stronger support for low-disturbance systems as climate adaptation.

3) Regional Developments

  • Brazil: Despite tighter credit and higher costs, one Brazilian market source still described the sector as productive, export-strong, and improving efficiency rather than facing a broad production collapse.
  • China - rabbit sector: China remains the world's largest rabbit-meat producer, with annual output above 300 million rabbits and exports to Europe and America. Commercial rabbit farms are using controlled lighting and tighter house management to improve breeding uniformity and conception rates.
  • China - specialty dairy: In Lingshan County, Guangxi, about 40,000 water buffalo produce roughly 50 tons of milk per day, supported by automated milking systems; buffalo milk is described as having higher fat and dry-matter content than standard cow milk. In Fuping County, Shaanxi, goat dairies are pairing mechanized milking with animal-level production and health tracking.

4) Best Practices

Grains and soil

  • Use no- or low-disturbance tillage where drought buffering matters; the cited soil-hydrodynamics work found that tillage damages moisture retention and recommended low-disturbance systems to preserve structural drought resilience.
  • For irrigation-sensitive crops, monitor climate, soil, and crop variables early enough to act before water stress is visible, and standardize field data capture so observations can be used quickly.

Dairy

  • In buffalo dairies, light music was reported to relax animals and improve health and milk yield, while higher-sugar feed ingredients such as fermented wine lees, pineapple skins, and corn stalks were used to improve milk sweetness and creaminess.
  • In goat dairies, raised mesh-floor housing and daily disinfection were used to reduce mastitis risk and improve hygiene; carousel milking systems with soft liners and electronic ear tags tracked individual yield, milking speed, and health status. Flash steaming around 90°C was used to remove strong flavor notes from milk.

Livestock

  • In rabbit production, controlled lighting can synchronize estrus and simplify batch breeding. Keep breeding does from becoming overfat and prevent direct cold drafts in houses; one 600-doe farm operating at a 75% conception rate versus a normal 85% was projected to lose nearly 5,000 kits and more than 40,000 yuan annually. One expert-recommended setup, using baffles, louvered vents, and plastic-shed buffering to lift house temperature by 6-8°C without electricity or coal, reported conception above 90%.
  • For pasture brush control, the options highlighted were burning, herbicides, and goats.

5) Input Markets

  • Brazil - fertilizer, diesel, and finance: Producers are dealing with rising fertilizer and diesel costs they cannot control, softer commodity prices, and a Selic rate at 14-14.75%. The proposed response is stronger internal management: clearer cash-flow tracking, a usable DRE, better productivity and cost records, and audits where possible to show lenders a lower-risk credit profile.
  • Credit structuring: The same Brazilian source pointed to more sophisticated funding structures, including mixing real- and dollar-denominated borrowing and using derivatives to reduce borrowing cost, rather than relying only on plain local-currency loans. The note specifically said this level of organization is achievable for small, medium, and large farms.
  • Machinery systems: On the equipment side, unsupported legacy GPS platforms are becoming a cost issue. One U.S. tillage operation reported that basic monitor/receiver/control-module replacements for autosteer-ready Caterpillar Challenger tractors can turn into full system swaps of roughly $15,000 because older Topcon, Trimble, and CNH units are no longer supported.
  • Feed formulation: Specialty dairy systems in China are using fermented wine lees, pineapple skins, and corn stalks as ration components, showing continued interest in agricultural byproducts as feed inputs.

6) Forward Outlook

  • Pre-plant decisions: Improving Iowa soil moisture is a constructive signal for U.S. field prep, but financing discipline is the bigger immediate planning issue in Brazil, where credit cost and risk perception are shaping next-season decisions.
  • Climate adaptation is the common theme: Heat-tolerant and lodging-resistant rice, AI-based water-stress prediction, low-disturbance soil systems, and rabbit-house airflow control all point in the same direction: more production systems are being redesigned around heat, drought, and weather volatility.
  • Research watch: A new U.S. study reported that maize yield gains have decoupled from the need for higher plant densities, a result relevant to future seeding-rate decisions.

Your time, back.

An AI curator that monitors the web nonstop, lets you control every source and setting, and delivers one verified daily brief.

Save hours

AI monitors connected sources 24/7—YouTube, X, Substack, Reddit, RSS, people's appearances and more—condensing everything into one daily brief.

Full control over the agent

Add/remove sources. Set your agent's focus and style. Auto-embed clips from full episodes and videos. Control exactly how briefs are built.

Verify every claim

Citations link to the original source and the exact span.

Discover sources on autopilot

Your agent discovers relevant channels and profiles based on your goals. You get to decide what to keep.

Multi-media sources

Track YouTube channels, Podcasts, X accounts, Substack, Reddit, and Blogs. Plus, follow people across platforms to catch their appearances.

Private or Public

Create private agents for yourself, publish public ones, and subscribe to agents from others.

Get your briefs in 3 steps

1

Describe your goal

Tell your AI agent what you want to track using natural language. Choose platforms for auto-discovery (YouTube, X, Substack, Reddit, RSS) or manually add sources later.

Stay updated on space exploration and electric vehicle innovations
Daily newsletter on AI news and research
Track startup funding trends and venture capital insights
Latest research on longevity, health optimization, and wellness breakthroughs
Auto-discover sources

2

Confirm your sources and launch

Your agent finds relevant channels and profiles based on your instructions. Review suggestions, keep what fits, remove what doesn't, add your own. Launch when ready—you can always adjust sources anytime.

Discovering relevant sources...

Sam Altman · Profile
3Blue1Brown · Channel
Paul Graham · Account
The Pragmatic Engineer · Newsletter · Gergely Orosz
r/MachineLearning · Community
Naval Ravikant · Profile
AI High Signal · List
Stratechery · RSS · Ben Thompson

3

Receive verified daily briefs

Get concise, daily updates with precise citations directly in your inbox. You control the focus, style, and length.

Git-First Agent Workflows and Harder Test Prompts Take the Lead
Mar 22
4 min read
70 docs
Theo - t3.gg
Yuchen Jin
Salvatore Sanfilippo
+5
The sharpest signal today came from practitioners tightening the loop around coding agents with tests, Git context, and clearer ownership. Also inside: Claude Code web's repo limit, Codex vs. Claude commit attribution, and the clips worth your time.

🔥 TOP SIGNAL

The strongest practical signal today: agent performance is still mostly a scaffolding problem. Simon Willison says tests, docs, CI/CD, and clean code make agents work better—and his own loop starts with uv run pytest; Salvatore Sanfilippo says generic "write tests" prompts miss the hard stuff, and recommends explicitly asking for edge cases, fragile implementation details, and random testing against a simpler reference implementation. Willison's follow-on warning matters just as much: code review is now the bottleneck, while cognitive debt remains unsolved.

🛠️ TOOLS & MODELS

  • Claude Code for web — current repo-auth ceiling: Simon says one session can't check out two private repos at once because Git operations go through a local proxy that only authenticates the repo attached to the session. He also says the docs don't mention this.
  • Claude Code vs Codex — commit metadata means adoption signals can lie: Claude Code auto-adds itself as a co-author on every commit; Codex doesn't. OpenAI engineer Tibo Sottiaux says Codex is designed so the user remains the owner and accountable party, even though that makes repo-level usage harder to observe.

"it exists to help you and it’s important that you remain the owner and accountable for your work without AI taking credit."

  • T3 Code vs Claude Code CLI — creator-posted RAM snapshot: Theo says T3 Code used 350.9 MB vs 635.5 MB for Claude Code CLI in his screenshot, and framed that as roughly 2x better efficiency.
  • Routing pattern worth copying: Matthew Berman describes a 3-tier stack—frontier models for exploratory work, Sonnet-class models for most execution, and local/fine-tuned models once a narrow workflow is ready for production. His own example was using Opus for front-end/HTML work; Jaden Clark described using a cheaper/default model for small personal tools where speed and cost matter more than max capability.

💡 WORKFLOWS & TRICKS

  • Bootstrap a session in 3 moves: (1) run uv run pytest, (2) ask for "recent changes" or "last three commits" so the agent runs git log, (3) only then split into 2-3 parallel sessions.
  • Use Git as an agent power tool, not just a backup: Ask for git status when the repo is messy—Willison says he uses that prompt surprisingly often—then let the agent work through conflicts with tests. For archaeology, have it search the reflog or other branches for lost code, or run git bisect; for cleanup, ask it to rewrite history with git reset --soft HEAD~1, split/combine commits, or extract a library into a new repo while preserving history.
  • Ask for adversarial tests: Tell the model to stress limit conditions and fragile implementation details, and to use random testing plus a simpler in-test reference implementation to check invariants. Sanfilippo says even a small wording change can strongly steer the model, and the resulting tests become guardrails for both AI-written changes and future refactors.
  • Assume review is the scarce resource: Faster generation just moves the pain to review. Willison's warning is blunt: code review is now the biggest slowdown, and "cognitive debt" is still unsolved.
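Sanfilippo's random-testing pattern above can be sketched concretely: stress an "optimized" routine against an obviously-correct in-test reference on random inputs, including out-of-range edge cases. The function under test (insert_pos, a binary search) is a hypothetical example, not code from the discussion.

```python
# Random testing against a simpler reference implementation.
import random

def insert_pos(xs: list[int], x: int) -> int:
    """Binary search: leftmost index where x can be inserted keeping xs sorted."""
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def insert_pos_reference(xs: list[int], x: int) -> int:
    """Obviously-correct linear scan, used only as a test oracle."""
    for i, v in enumerate(xs):
        if v >= x:
            return i
    return len(xs)

random.seed(0)  # reproducible failures
for _ in range(1000):
    xs = sorted(random.choices(range(20), k=random.randint(0, 10)))
    x = random.randint(-2, 22)  # deliberately includes out-of-range edge cases
    assert insert_pos(xs, x) == insert_pos_reference(xs, x), (xs, x)
print("1000 random cases passed")
```

The oracle is slower but trivially auditable, which is the point: any disagreement flags a bug in the clever version, and the failing `(xs, x)` pair is printed for debugging.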

👤 PEOPLE TO WATCH

  • Simon Willison — published the first draft of Using Git with coding agents. Why it matters: it turns Git from a safety net into an active agent workflow for context loading, debugging, conflict recovery, bisecting, and history rewriting.
  • Salvatore Sanfilippo — Redis creator; today's high-signal contribution was a prompt pattern for stronger tests that targets brittle implementation details instead of shallow happy-path coverage.
  • Tibo Sottiaux — useful because he's surfacing product philosophy from inside Codex: ownership and accountability over brand visibility in commit history.
  • Theo — worth tracking if you care about coding-agent UX tradeoffs; he keeps posting blunt first-party comparisons while shipping T3 Code.

🎬 WATCH & LISTEN

  • 14:39-17:35 — Hard-test prompting that actually changes model behavior. Sanfilippo explains why "write tests" is too generic, and shows how to request edge-case stress plus random testing against a simpler reference implementation.
  • 1:13:24-1:16:46 — The sim-to-real warning for local/fine-tuned agents. Shaw Walters says harness-specific data can improve narrow tasks quickly, but may not transfer back to broader benchmarks and can even narrow the model's capability space.

📊 PROJECTS & REPOS

  • ELIZA OS — worth watching for routing and safety questions. Walters describes it as an open-source framework for building agents, games, and applications, with deployments ranging from an 8B quantized model up through Sonnet and Opus; he also says security is still the blocker for unsupervised browser + shell agents. Adoption signal: the show introduced it as "the most widely used open source framework for building autonomous agents".
  • Sentient Arena / EVO Skill — still pre-results, but the setup is concrete: the first arena uses Office QA for enterprise-style reading, calculation, and document analysis, and the first cohort closes in the first week of April. The notable mechanic is multi-proposal skill evolution from eval feedback; the team says that setup currently does much better with Opus + Claude Code-style workflows than with open harnesses/open models.

Editorial take: today's real edge was not a flashy new model—it was stronger guardrails around the ones we already have: tests first, Git history in context, and clear human ownership of the output.

Mid-Training Design, Open Model Coalitions, and Inference Hardware Lead the Week
Mar 22
10 min read
534 docs
Reuters
Demis Hassabis
Andrej Karpathy
+29
PRISM supplied unusually concrete evidence that mid-training choices shape what later RL can unlock, while NVIDIA and Huawei made consequential moves in open models and inference hardware. The rest of the cycle brought notable advances in video learning, robotics, agent infrastructure, and AI compliance.

Top Stories

Why it matters: The most consequential developments this cycle were about the infrastructure behind AI progress: how models are trained, how open ecosystems are organized, what hardware can lower inference costs, and how general robot models are being pushed toward precise control.

1) PRISM turns mid-training into a measurable design problem

PRISM frames mid-training as a distinct stage between pretraining and RL, where targeted high-quality data mixtures build reasoning foundations. The project ran controlled experiments on roughly 27B tokens across 7 models, 4 families, and 3B-24B parameters, spanning dense Transformers and attention-Mamba hybrids, while measuring changes in performance, weights, representations, and downstream RL.

"The single biggest lever in mid-training design is Data Composition."

Across those ablations, math-only improved math, math+code improved math and code, and math+code+science produced the best overall results while most improving GPQA-Diamond during later RL. The authors also reported that adding science during mid-training unlocked +17 to +28 points on GPQA-Diamond once RL was added later, while changing the RL data mix itself moved results by less than 2 points.

A separate timing result on Granite-4 Micro found that mid-training after long-context pretraining gave the largest gains in math, code, and science while preserving general reasoning; doing it at 8K context hurt long-context ability, though much of that could be restored with a brief extension phase and model merging. One practitioner summary distilled the practical upshot as 3-4x larger gains during later RL when mid-training is tuned well beforehand, while other practitioners emphasized the work's value as a comprehensive disambiguation of a stage many teams already use. Resources: project and paper.

Impact: PRISM makes mid-training look less like hidden craft knowledge and more like a controllable stage that determines what later RL can actually amplify.

2) NVIDIA is trying to industrialize open model development with the Nemotron Coalition

NVIDIA announced the Nemotron Coalition with Black Forest Labs, Cursor, LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam AI, and Thinking Machines Lab to develop the open-source Nemotron family of foundation models. NVIDIA's stated idea is to build shared high-end base models that outperform what any single company could build alone, then let partners specialize them for different applications.

The first project is pretraining Nemotron 4 base with Mistral, with later post-training involving more partners. NVIDIA also outlined expected roles including multimodal work from Black Forest Labs, agent systems expertise from LangChain, evaluation datasets and real-world performance requirements from Cursor, and applied-system feedback from Perplexity.

Impact: This is a coordinated attempt to make open foundation models into shared industrial infrastructure rather than one-off lab releases.

3) Huawei is pushing an inference-focused hardware response with Atlas 350

Huawei launched the Atlas 350 accelerator card, powered by its 950PR AI chip, at the Ascend AI Partner Summit on March 20. According to the cited report, Huawei says the card delivers 2.87x the single-card compute performance of NVIDIA's H20 and is currently the only product in China supporting FP4 low-precision inference.

The same report lists 112GB HBM, 60% higher multimodal generation throughput, 4x better memory-access efficiency for small operators, 1.56 PFLOPS at FP4 precision, 1.4 TB/s of memory bandwidth, and 600W TDP. One expert note added that FP4 support matters especially for staying competitive in inference, even without native FP4 training.

Impact: The significance here is not just raw chip specs. It is whether domestic Chinese hardware can materially improve inference cost and throughput at a time when deployment efficiency matters more and more.

4) Physical Intelligence's RL tokens target the precision gap in robotics

Physical Intelligence introduced RL tokens as compact snapshots of robot state that let a small model quickly learn and refine actions in real time. The company argues the bottleneck for general-purpose robot models is often the "last millimeter" of precision, where broad competence is not enough.

Its method compresses high-dimensional VLA embeddings into a low-dimensional token, trains that token with a reconstruction objective, and then uses a small actor-critic module to learn residual action corrections directly on the robot through trial and error. Reported results were robots that are up to 3x faster, make fewer mistakes, can beat human teleoperation in some cases, and learn with as little as 15 minutes of real-world practice. Full research: pi.website/research/rlt.

Impact: The design separates general policy generation from fast local correction, which could be an important pattern for getting broad robot models to reliable task execution.

Research & Innovation

Why it matters: The strongest research signals were about better use of depth, data, memory, and embodiment—areas that often move production systems more than a single benchmark headline.

  • Depth and information reuse: Attention Residuals replaces fixed residual weights with attention over preceding layer outputs to reduce hidden-state dilution; in a 48B model trained on 1.4T tokens, the authors report better gradient distribution and consistent downstream gains. MoDA tackles a similar problem by letting attention read key/value states from preceding layers, while keeping 97.3% of FlashAttention-2 efficiency at 64K context; in 1.5B models it improved perplexity by 0.2 and downstream task scores by 2.11% with a 3.7% FLOP increase.
  • State-space sequence models: Mamba-3 combines discretized SSM recurrence, complex-valued state updates, and a multi-input/multi-output formulation. At 1.5B parameters, it improved average accuracy by 1.8 points over Gated DeltaNet while using half the state size of Mamba-2.
  • Video and visual reasoning: V-JEPA 2.1 adds dense predictive loss, hierarchical self-supervision, and multimodal tokenizers, with reported 20-point gains in action anticipation and robotic grasping and new SOTA results on Ego4D, EPIC-KITCHENS, and TartanDrive. HopChain, from Qwen and Tsinghua LeapLab, synthesizes chained visual-reasoning data for RLVR; added to Qwen3.5 VL training, it improved 20 of 24 benchmarks and topped 50 accuracy points in the ultra-long-CoT regime.
  • Cheaper image generation: Apple researchers' Feature Auto-Encoder trains diffusion models on compressed embeddings from a pretrained vision model, with up to 7x faster training while keeping image quality comparable to state-of-the-art diffusion systems.
  • Memory and planning: GradMem writes context into compact memory states by optimizing memory tokens at test time with a reconstruction loss, rather than only encoding context in a forward pass. Temporal Straightening adds a curvature regularizer that makes latent trajectories more locally straight, aligning Euclidean and geodesic distances and improving goal-reaching success.
  • Evaluating scientific taste: A paper on Reinforcement Learning from Community Feedback trained a "Scientific Judge" on 700,000 citation-matched paper pairs to predict research impact, then used it as a reward model for a "Scientific Thinker" that proposed higher-impact ideas than baselines.

Products & Launches

Why it matters: Product teams kept translating model progress into working systems—faster agent infrastructure, more enterprise control, more local deployment, and new interfaces that treat existing software as the substrate.

  • OpenAI agent infrastructure: OpenAI said agent workflows can now spin up containers for skills, shell, and code interpreter about 10x faster. The change comes from a container pool in the Responses API that reuses warm infrastructure instead of creating a full container for each session; OpenAI also published a hosted shell quickstart.
  • Enterprise agent stack: LangChain launched an enterprise agent platform built with NVIDIA AI. The stack supports AI-Q plus Deep Agents for enterprise search, shallow and deep research agents using Nemotron and frontier LLMs, LangSmith tracing, and connections to internal data through NeMo Agent Toolkit; LangChain linked a full guide.
  • Vision-native software control: Mat Velloso's Unswitch prototype uses vision to operate existing software "more like a person does." He says prompts are a last resort, and demos show multi-tab research compiled into documents or slides, screenshots turned into formatted Excel sheets with formulas, and spatial organization across files, calendars, contacts, and email without replacing the underlying apps. The prototype runs natively on Mac and Windows and was built without JS or Python.
  • Offline local AI stack: Project N.O.M.A.D. packages local AI models via Ollama + Open WebUI, full Wikipedia via Kiwix, offline maps, and a browser-based management UI into a system that runs without internet or telemetry after install. The project says it can be installed with one curl command on Debian-based systems and accessed across a local network as a headless server.
  • Agent skills as open source: MiniMax open-sourced an official skills repository for agents, with curated skills for iOS and Android development, Office file editing, and GLSL visual effects.

Industry Moves

Why it matters: Corporate moves this cycle point to the next layer of competition: monetization, leadership, sector-specific deployment, and the training infrastructure other labs quietly standardize on.

  • OpenAI monetization: Reuters reported that OpenAI will begin showing ads to users of the free and Go versions of ChatGPT in the United States in the coming weeks.
  • DeepMind leadership: Google DeepMind appointed Jas Sekhon as chief strategy officer; Demis Hassabis highlighted Sekhon's prior role as Bridgewater's chief scientist and head of AI when introducing the hire.
  • AI in agriculture: Halter reached a $2B valuation. Its product is AI-powered collars that let ranchers herd cattle from their phones using sound and vibration cues, and Founders Fund is leading the round.
  • Training stack standardization: Multiple labs are reportedly using Megatron for training. Reflection AI and Periodic Labs were both cited, and one practitioner summarized the situation bluntly: for training MoEs, Megatron is "the only game in town."

Policy & Regulation

Why it matters: The legal and compliance edge of AI keeps moving from abstract debate to concrete distribution rules: authorship, app-store boundaries, and the operating cost of monitoring agents at scale.

  • Authorship: A legal explainer emphasized that under U.S. law, AI-generated art without human authorship does not get copyright protection; brands building on AI art were urged to understand that ownership position clearly.
  • Platform rules for AI coding apps: Replit said its App Store coding app has kept the same core generate-code, server-side compile, and webview-preview workflow for 4 years, and that Apple eventually acknowledged it was not violating guidelines. Follow-on commentary argued that the distinction between remotely hosted code and locally downloaded-and-run code may become important if Apple tightens rules around AI coding webviews.
  • Compliance cost: Fiddler's new TCO guide argues that evaluating agents with external LLMs creates a "Trust Tax" that can reach roughly $2.6M per year, because every trace adds external API cost on top of tooling fees.
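A back-of-the-envelope sketch of how such a "Trust Tax" accumulates: every trace judged by an external LLM adds per-token API cost, and the totals compound daily. All volumes and prices below are illustrative assumptions, not figures from Fiddler's guide.

```python
# Illustrative cost model: annual spend on LLM-as-judge evaluation of agent traces.
def annual_trust_tax(traces_per_day: int, tokens_per_trace: int,
                     usd_per_million_tokens: float) -> float:
    """External API cost of judging every trace, projected over a year."""
    daily = traces_per_day * tokens_per_trace / 1_000_000 * usd_per_million_tokens
    return daily * 365

# Assumed: 500k traces/day, ~2k tokens to judge each, $10 per million tokens.
cost = annual_trust_tax(500_000, 2_000, 10.0)
print(f"${cost:,.0f}/year")  # → $3,650,000/year
```

Even modest-looking per-trace costs land in the same multi-million-dollar range the guide describes, before tooling fees are added on top.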

Quick Takes

Why it matters: These smaller updates give a useful read on where deployment is heading: cheaper local models, practical agent evaluation, developer ergonomics, and lighter-weight coding stacks.

  • Local deployment: PinchBench results on Qwen3.5 27B using UnslothAI K_XL quantizations showed little degradation in best results; Q4_K_XL averaged about 84% with thinking enabled, Q3_K_XL remained viable at 14.5GB, and a later non-thinking run made Q3_K_XL the top performer for speed-conscious settings. One follow-up said this makes OpenClaw usable on a 16GB card with decent reliability.
  • Autonomous research, reality check: Karpathy's autoresearch package aims to let agents iterate on training code while humans iterate on prompts. In a real-scale test, Mikhail Parakhin ran 103 distributed experiments over a week and found one improvement, calling it a worse batting average than personal experimentation but still a "free" gain.
  • Frontend generation: OpenAI published frontend guidance for GPT-5.4 after one developer said the model can produce "pretty great frontends" when used with enough thought and intentionality.
  • Agent monitoring: LangChain published a conceptual guide arguing that agent observability needs a distinct production playbook because natural-language input is unbounded, prompts are sensitive to small changes, and multi-step reasoning is hard to anticipate in development.
  • Memory footprint: T3 Code claimed significantly lower RAM usage than Claude Code in one comparison—350.9 MB versus 635.5 MB—and said its Electron app was 2x more efficient than a Bun CLI in that setup.
  • Model release watch: MiniMax-M2.7-highspeed was spotted inside OpenCode without specs yet, and GLM-5.1 was teased as an incoming release.
  • Hiring signal: One engineer said interview loops are already changing in light of LLMs, with less weight on LeetCode-style screening.
TERAFAB Puts Space-Scale AI Infrastructure on the Table as Bengio Warns of a Safety Gap
Mar 22
3 min read
130 docs
Canada Info
DogeDesigner
Yoshua Bengio
+4
Tesla, SpaceX, and xAI outlined an unusually ambitious plan that ties chip manufacturing, power, satellites, and AI demand into one infrastructure story. Separately, Yoshua Bengio urged stronger AI guardrails and international coordination, while Andrej Karpathy offered a candid view of the tradeoffs of working inside frontier labs.

The dominant story

Tesla, SpaceX, and xAI outline TERAFAB and a space-first compute strategy

Tesla said it is building TERAFAB with SpaceX and xAI, describing it as a 1TW/year chip manufacturing facility that would combine logic, memory, and advanced packaging under one roof. Tesla's announcement and related posts tied the effort to projected demand from Optimus robots and solar-powered AI satellites, while arguing that terrestrial electricity limits mean much of the added compute would need to move to space.

Related posts sketched the broader stack around that thesis: a 100kW AI "Mini Sat" intended to scale into the megawatt range, and a D3 chip described as optimized for hotter operation in space to reduce radiator mass. Musk also argued that space solar becomes more attractive as launch costs fall, because adding power on Earth gets harder as land, siting, and local opposition increase.

"Most must necessarily go to space, as US electricity is only 0.5TW"

Why it matters: This is a much broader infrastructure claim than a new data center buildout. The TERAFAB framing links chip supply, power generation, and deployment architecture into one strategy, with space presented as the long-run answer to compute growth.

Policy signal

Bengio tells Canadian senators that capability growth is outpacing safeguards

In Senate testimony, Yoshua Bengio said AI capabilities are advancing rapidly while leading companies' efforts to mitigate risk are not keeping up. He pointed to current harms including deepfakes, scams, fraud, disinformation, and cases involving emotional attachment or "AI psychosis," and said misalignment problems can include deceptive or self-preserving behavior such as lying, hacking, or blackmail in experiments.

Bengio also warned that frontier AI power is concentrating in U.S. and Chinese firms, creating economic and sovereignty risks for countries that depend on foreign model access. His recommendation for Canada was stronger transparency and risk regulation, plus coordination with like-minded countries on national laws and international treaties; he cited the EU Code of Practice and California SB 53 as useful templates. He also said he has launched Law Zero and is involved in international AI safety efforts backed by 30 countries and multilateral bodies.

Why it matters: Bengio is framing AI governance as a combined safety, competitiveness, and sovereignty issue—not only a consumer-protection question. That makes this testimony a useful signal for how policy debates may broaden as access to frontier systems concentrates.

Industry dynamics

Karpathy argues for staying close to frontier labs without being fully absorbed by them

In a podcast discussion shared by Nathan Lambert, Andrej Karpathy said researchers can have substantial impact in ecosystem-level roles outside frontier labs, and argued that internal financial incentives and social pressure can make it hard to operate as a fully independent voice from inside them. At the same time, he said frontier labs remain opaque and close to the capability edge, so people who stay outside too long risk losing judgment about what is actually changing inside the systems.

His tentative solution was a rotation model: moving in and out of frontier labs to stay technically grounded without giving up autonomy altogether.

Why it matters: As AI talent and decision-making concentrate in a small number of organizations, Karpathy is describing a structural tension that affects research independence, public commentary, and how the wider field understands the frontier.

The Nature of Gothic and Merchant Ivory Stand Out in Today’s Resource List
Mar 22
2 min read
131 docs
最佳拍档
Patrick Collison
Ivan Zhao
Patrick Collison points readers to Ruskin’s The Nature of Gothic with a direct link and substantive excerpt, while Ivan Zhao highlights Merchant Ivory films as a source of humanistic perspective and optimism. Together, the day’s recommendations skew toward durable aesthetic judgment over tactical startup reading.

Most compelling recommendation

Patrick Collison’s Ruskin pick is the strongest save today because it comes with all three things that matter: a clear endorsement, a direct link, and a concrete glimpse of what he found worth reading.

"Just discovered Ruskin’s The Nature of Gothic. Remarkable essay:"

  • Title: The Nature of Gothic
  • Content type: Essay
  • Author/creator: Ruskin
  • Link/URL: https://www.gutenberg.org/files/30755/30755-h/30755-h.htm#page151
  • Who recommended it: Patrick Collison
  • Key takeaway: The passage he highlighted moves across Mediterranean and northern landscapes, animal life, and human making, arguing that artistic form should be understood in relation to the natural laws and conditions of the places people inhabit
  • Why it matters: This is not a generic link share. Collison points straight to the text and spotlights the specific mode of seeing—connecting geography, life, and craft—that made the essay stand out to him

A second signal: humanistic cinema as a source of optimism

Ivan Zhao’s recommendation is broader, but the conviction behind it is unusually strong: he said he and his wife watched more than 20 Merchant Ivory films, and he cited them when asked what makes him optimistic about the future.

  • Title: Merchant Ivory Productions film catalog (representative title: A Room with a View)
  • Content type: Films
  • Author/creator: Merchant Ivory Productions
  • Who recommended it: Ivan Zhao
  • Key takeaway: Zhao describes the films as low-budget adaptations of period novels with exquisite detail and visuals, centered on emotions, fates, and distinctively human traits in beautiful settings
  • Why it matters: He presents these films as a source of optimism because each work condenses valuable human qualities into stories about emotion and fate

Pattern worth noting

Today’s authentic recommendations lean away from tactical startup content and toward durable aesthetic judgment. One points to a text about how place shapes human expression; the other to films prized for their humanistic depth and beauty.

Agent-First Product Strategy, Evals for AI Features, and Better 1:1s
Mar 22
6 min read
30 docs
Productify by Bandan
Aakash Gupta
andrew chen
This issue covers a shift toward agent-first product design, why structured evals are becoming core PM infrastructure for AI features, and a practical 1:1 framework that surfaces blockers and growth conversations. It also includes growth defensibility guidance, eval-driven case studies, and a short list of resources worth reviewing.

Big Ideas

1) Agent-first products may win as callable primitives, not destinations

Andrew Chen’s argument is that many current AI chat panels and copilots are a transitional local maximum. The longer-term end state may look more like invisible infrastructure that agents orchestrate, with the human UI acting as a debug layer. In that world, products are better thought of as composable APIs or CLIs that expose a narrow, high-leverage capability agents can repeatedly choose.

Why it matters:

  • Distribution shifts from top of funnel to top of call stack; the winner is the default callable primitive in agent-generated plans
  • Product surface area may shrink toward tighter interfaces with opinionated defaults and structured outputs
  • Brand becomes partly machine-legible through reliability, latency, error rates, schema clarity, and integration ease
  • Moats may come from integration depth with agent ecosystems and becoming the sticky default in templates and workflows

How to apply: ask what minimal capability your product can expose that an agent would repeatedly select, then optimize for clean interfaces, structured outputs, and reliability at scale.
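To make the "machine-legible" idea concrete, here is a minimal sketch of what a callable primitive can look like: one narrow capability with a declared input schema and a predictable, structured output. Everything here is illustrative (the `convert_currency` tool, the spec shape, and the field names are hypothetical, not any specific agent framework's API).

```python
# A hypothetical callable primitive: narrow capability, machine-legible contract.
import json

# A tool spec an agent could inspect before deciding to call this capability.
TOOL_SPEC = {
    "name": "convert_currency",
    "description": "Convert an amount between two ISO-4217 currencies.",
    "input_schema": {
        "type": "object",
        "properties": {
            "amount": {"type": "number"},
            "from": {"type": "string"},
            "to": {"type": "string"},
        },
        "required": ["amount", "from", "to"],
    },
}

def convert_currency(amount: float, from_ccy: str, to_ccy: str) -> dict:
    """Return the same output shape every call: agents pick reliable defaults."""
    RATES = {("USD", "EUR"): 0.92}  # stub rate table for the sketch
    rate = RATES.get((from_ccy, to_ccy))
    if rate is None:
        # Errors are structured too, not free-text, so a planner can branch on them.
        return {"ok": False, "error": f"unsupported pair {from_ccy}->{to_ccy}"}
    return {"ok": True, "amount": round(amount * rate, 2), "currency": to_ccy}

print(json.dumps(convert_currency(100, "USD", "EUR")))
```

The design choice to notice: the surface area is deliberately tiny, the output is schema-stable even on failure, and the spec is data an agent can read, which is what "distribution at the top of the call stack" would depend on.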

2) Paid acquisition can hide weak defensibility

"Paid acquisition is a tax on your product’s defensibility."

Chen’s warning, aimed at AI companies experimenting with paid marketing, is that if you cannot keep outspending incumbents and competitors, you are renting growth rather than building it.

Why it matters: growth quality determines whether scale improves your position or just increases your spend burden.

How to apply: treat paid as a tactic, not proof of product strength, and pressure-test whether your main channels get cheaper as the product grows.

3) For AI features, evals are becoming core PM infrastructure

Aakash Gupta’s key point is that the farther a team is from the end user, the more it needs structured evals. Teams where engineers are also the users can sometimes rely on intuition; teams farther from the user cannot.

"Evals are the new PRD"

Why it matters:

  • evals bridge the distance between builder and user
  • they create a repeatable quality bar instead of one-off demos
  • preserved failing evals become a durable asset when models change

How to apply: build evals that include failure cases, rerun those failures first when new models arrive, and improve the full loop of dataset, tool access, scoring, and prompting rather than only tuning the prompt.

Tactical Playbook

1) A fast eval loop for AI features

  1. Start with a concrete task and a simple system prompt
  2. Generate a test dataset for that task
  3. Connect the model to the real tools it needs, rather than judging it without working access
  4. Replace vague grading with a clearer scoring function; Gupta’s example used three levels instead of a fuzzy numeric scale
  5. Add few-shot examples and rerun
  6. Keep the failing evals and use them as your first regression suite when models change

In Gupta’s demo, that loop moved performance from 0 to 0.75 in about 20 minutes.

"If you only have evals that succeed, you don’t know what problems there are."

Why this matters: it turns AI product iteration into something PMs can measure, compare, and revisit.
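The loop above is simple enough to sketch in code. This is a hypothetical harness, not Gupta's actual tooling: the names (`EvalCase`, `run_evals`, `must_contain`) are illustrative. The two ideas it demonstrates are the coarse three-level scorer and rerunning previously failing cases first so regressions surface immediately.

```python
# Minimal eval-loop sketch (hypothetical names, not a real framework).
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    must_contain: str      # crude expectation, enough for the sketch
    last_score: float = 0.0

def score(output: str, case: EvalCase) -> float:
    """Three-level scorer: 1 (pass), 0.5 (answered but wrong), 0 (total miss)."""
    if case.must_contain in output:
        return 1.0
    if output.strip():
        return 0.5
    return 0.0

def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    # Rerun prior failures first: lowest last_score goes to the front.
    for case in sorted(cases, key=lambda c: c.last_score):
        case.last_score = score(model(case.prompt), case)
    return sum(c.last_score for c in cases) / len(cases)

# Usage with a stub "model" standing in for the real assistant:
cases = [
    EvalCase("How many tasks are assigned to me?", must_contain="3"),
    EvalCase("List my overdue tasks.", must_contain="overdue"),
]
fake_model = lambda p: "You have 3 tasks." if "How many" in p else ""
print(run_evals(fake_model, cases))  # 0.5: one pass, one total miss
```

Keeping the failing `EvalCase` objects around (step 6 above) is what turns this from a demo into a regression suite: when a new model ships, the sorted rerun hits last release's failures first.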

2) Rebuild 1:1s so the direct report frames the conversation first

Productify’s framework recommends that the direct report’s agenda comes first, unless the manager has something genuinely time-critical.

A practical flow:

  1. Start with progress on key priorities and blockers that need manager help; this is not a status meeting
  2. Move into cross-functional relationships and team issues that need coaching
  3. Cover goals, aspirations, well-being, and leadership growth
  4. Put feedback for the manager on the agenda explicitly, rather than as an afterthought
  5. Handle administrative items
  6. Then have the manager share updates, context, specific goal follow-ups, and feedback

Two useful operating details:

  • if both sides arrive with full lists, write them independently and share them at the same time so the agenda comes from the overlap and gaps, not from whoever spoke first
  • if either side sends context beforehand, the other side should read it; that makes the conversation sharper and shorter

Why this matters: when the first part of the meeting clearly belongs to the direct report, issues surface that might otherwise never come up, and the goal is for the person to leave feeling seen, supported, and taken seriously.

Case Studies & Lessons

1) An AI assistant improved when the team fixed the whole eval system, not just the prompt

Gupta describes an assistant answering questions from Linear. The first run failed completely: when asked, “How many tasks are assigned to me?”, it responded with a generic offer to help, scoring 0 across the board. The improvement came from several coordinated changes: connecting Linear’s MCP server, giving access to real tools, telling the model to use them, creating a better scoring function, and adding few-shot examples. About 20 minutes later, the score reached 0.75 across the board.

Lesson: if an AI feature is underperforming, do not assume the prompt is the only problem; the dataset, tool access, task setup, and evaluator may be where the real gains are.

2) Closed-loop experimentation produced gains on code humans had already optimized

Karpathy’s system runs a tight cycle: read the code and instructions, form a hypothesis, make one change, train, check the score, then either git commit or git reset based on the result. In Gupta’s examples, that loop found 20 improvements on code Karpathy had already optimized by hand, yielding an 11% speedup. Tobi Lutke applied the same pattern to Shopify’s Liquid templating engine, where 93 automated commits led to 53% faster rendering.

Lesson: autonomous experimentation is most compelling when the objective is clearly scorable and the system can safely keep or discard each change.
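The commit-or-reset cycle above reduces to a short control loop. The sketch below is illustrative, not the actual autoresearch package: the version-control and benchmark steps are injected as callables (in a real setup, `commit` and `reset` would shell out to `git commit -am ...` and `git reset --hard HEAD`, and `benchmark` would run training and report the score).

```python
# Hypothetical commit-or-reset experiment loop (assumed names, not Karpathy's code).
from typing import Callable

def experiment_loop(
    propose_change: Callable[[], None],  # agent edits the code: one hypothesis, one change
    benchmark: Callable[[], float],      # train/run the workload, return a score to maximize
    commit: Callable[[], None],          # keep the change, e.g. `git commit -am "keep"`
    reset: Callable[[], None],           # discard it, e.g. `git reset --hard HEAD`
    n_experiments: int,
) -> float:
    best = benchmark()                   # baseline score before any change
    for _ in range(n_experiments):
        propose_change()
        score = benchmark()
        if score > best:                 # strict improvement: keep it
            commit()
            best = score
        else:                            # no gain: restore the last good state
            reset()
    return best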

Career Corner

1) Evals are becoming a PM skill, not just an ML concern

If your team is not close enough to “vibe check” with the actual end user, designing structured evals becomes part of product judgment. That includes deciding what success looks like, which failure cases to preserve, and what to retest when models change.

How to apply: treat your eval suite as a living product artifact alongside the spec, not a one-time launch task.

2) PM strategy in AI is shifting from screen design to capability design

As agents start selecting tools based on reliability, latency, error rates, and schema clarity, PMs may need to think less about destination UX alone and more about the minimal callable capability their product exposes. That is a different product design skill: making your product easy for machines to choose and compose.

How to apply: when reviewing a roadmap, ask not only "what is the new feature?" but also "what is the reusable capability?"

3) Listening is a leadership skill, and 1:1 structure signals whether you mean it

Productify’s 1:1 advice is less about etiquette than about who gets to frame the conversation. Explicitly putting the direct report first, adding feedback for the manager to the agenda, and reading pre-shared context all signal openness rather than control.

How to apply: use the agenda structure itself to show that blockers, relationships, and growth matter, not just updates.

Tools & Resources

  • Aakash Gupta on evals for PMs — useful if you are building AI features and need a practical example of dataset creation, tool connection, scoring design, and failure-case management
  • Autoresearch guide for PMs — a follow-on resource if you want to explore scorable experimentation loops more deeply
  • Most 1:1s are run the wrong way — a reusable template for direct-report-led 1:1s, including simultaneous agenda setting and pre-read norms
  • Andrew Chen on agent-first products — a compact strategy prompt for thinking about callable capabilities, agent distribution, and machine-legible product quality
  • Andrew Chen on paid acquisition — a short but useful gut check for growth plans that depend heavily on paid spend
Brazil Financing Strain, China Production Tech, and Water-Stress Tools Ahead of Planting
Mar 22
5 min read
76 docs
Regenerative Agriculture
农业致富经 Agriculture And Farming
Successful Farming
+4
Brazilian producers are managing tighter credit, softer commodity prices, and weather disruptions, while U.S. planting conditions improve in Iowa. The brief also highlights climate-resilient rice breeding, water-stress analytics, livestock and dairy management upgrades, and the input and financing tactics shaping next-season decisions.

1) Market Movers

  • Brazil: Input costs remain the main margin driver. War and geopolitical tension are raising fertilizer and diesel costs while commodity prices soften and the Selic rate sits at 14-14.75%. Excess rain is also disrupting the soy harvest in parts of Mato Grosso, while safrinha corn planting continues under financing pressure. Roughly 2,000 judicial recovery filings last year, up 56%, have worsened lenders' perception of sector risk.
  • United States - Iowa: March precipitation improved soil moisture after a record-dry winter, easing drought pressure as growers prepare for the 2026 planting season.
  • United States - poultry: USDA pushed a poultry payment rule to 2027, drawing criticism from farm groups.

2) Innovation Spotlight

  • China - climate-resilient rice breeding: Hefei breeder Zhang Qin, with 17 years of breeding experience, is developing hybrid rice with stronger heat tolerance and lodging resistance for more extreme weather. A Sanya breeding base of more than 100 mu and over 10,000 materials is being used to speed variety development. One cold-region line, Quanxin No. 5, posted a 13.5% yield gain in national northern trials and was described as having more than 10% large-scale yield potential. Work also includes combining high-yield backgrounds with anti-lodging traits and adding insect-resistance genes from wild rice. The program was reported to have reached 800,000 mu and produced 1 billion jin of grain.
  • Water-stress analytics: In vineyards, AI models are being positioned to predict water stress before symptoms appear by tracking seven variables and combining climate, soil, and crop data, with the goal of reducing yield loss and protecting crop quality. The operational lesson is to capture field data fast enough to use it in season, including voice notes, WhatsApp messages, and photos.
  • Soil hydrodynamics sensing: Research highlighted in regenerative-ag circles used fiber-optic cables as soil-health sensor networks and found that tillage disturbance weakens moisture retention and drought resilience; the practical implication is stronger support for low-disturbance systems as climate adaptation.

3) Regional Developments

  • Brazil: Despite tighter credit and higher costs, one Brazilian market source still described the sector as productive, export-strong, and improving efficiency rather than facing a broad production collapse.
  • China - rabbit sector: China remains the world's largest rabbit-meat producer, with annual output above 300 million rabbits and exports to Europe and America. Commercial rabbit farms are using controlled lighting and tighter house management to improve breeding uniformity and conception rates.
  • China - specialty dairy: In Lingshan County, Guangxi, about 40,000 water buffalo produce roughly 50 tons of milk per day, supported by automated milking systems; buffalo milk is described as having higher fat and dry-matter content than standard cow milk. In Fuping County, Shaanxi, goat dairies are pairing mechanized milking with animal-level production and health tracking.

4) Best Practices

Grains and soil

  • Use no- or low-disturbance tillage where drought buffering matters; the cited soil-hydrodynamics work found tillage damage to moisture retention and recommended low-disturbance systems to preserve structural drought resilience.
  • For irrigation-sensitive crops, monitor climate, soil, and crop variables early enough to act before water stress is visible, and standardize field data capture so observations can be used quickly.

Dairy

  • In buffalo dairies, light music was reported to relax animals and improve health and milk yield, while higher-sugar feed ingredients such as fermented wine lees, pineapple skins, and corn stalks were used to improve milk sweetness and creaminess.
  • In goat dairies, raised mesh-floor housing and daily disinfection were used to reduce mastitis risk and improve hygiene; carousel milking systems with soft liners and electronic ear tags tracked individual yield, milking speed, and health status. Flash steaming around 90°C was used to remove strong flavor notes from milk.

Livestock

  • In rabbit production, controlled lighting can synchronize estrus and simplify batch breeding. Keep breeding does from becoming overfat and prevent direct cold drafts in houses; one 600-doe farm operating at a 75% conception rate versus a normal 85% was projected to lose nearly 5,000 kits and more than 40,000 yuan annually. An expert-designed system using baffles, louvered vents, and plastic-shed buffering to lift house temperature by 6-8°C without electricity or coal reported conception above 90%.
  • For pasture brush control, the options highlighted were burning, herbicides, and goats.

5) Input Markets

  • Brazil - fertilizer, diesel, and finance: Producers are dealing with rising fertilizer and diesel costs they cannot control, softer commodity prices, and a Selic rate at 14-14.75%. The proposed response is stronger internal management: clearer cash-flow tracking, a usable income statement (DRE), better productivity and cost records, and audits where possible to show lenders a lower-risk credit profile.
  • Credit structuring: The same Brazilian source pointed to more sophisticated funding structures, including mixing real- and dollar-denominated borrowing and using derivatives to reduce borrowing cost, rather than relying only on plain local-currency loans. The note specifically said this level of organization is achievable for small, medium, and large farms.
  • Machinery systems: On the equipment side, unsupported legacy GPS platforms are becoming a cost issue. One U.S. tillage operation reported that basic monitor/receiver/control-module replacements for autosteer-ready Caterpillar Challenger tractors can turn into full system swaps of roughly $15,000 because older Topcon, Trimble, and CNH units are no longer supported.
  • Feed formulation: Specialty dairy systems in China are using fermented wine lees, pineapple skins, and corn stalks as ration components, showing continued interest in agricultural byproducts as feed inputs.

6) Forward Outlook

  • Pre-plant decisions: Improving Iowa soil moisture is a constructive signal for U.S. field prep, but financing discipline is the bigger immediate planning issue in Brazil, where credit cost and risk perception are shaping next-season decisions.
  • Climate adaptation is the common theme: Heat-tolerant and lodging-resistant rice, AI-based water-stress prediction, low-disturbance soil systems, and rabbit-house airflow control all point in the same direction: more production systems are being redesigned around heat, drought, and weather volatility.
  • Research watch: A new U.S. study reported that maize yield gains have decoupled from the need for higher plant densities, a result relevant to future seeding-rate decisions.

Discover agents

Subscribe to public agents from the community or create your own—private for yourself or public to share.

  • Coding Agents Alpha Tracker (active, 110 sources): Daily high-signal briefing on coding agents: how top engineers use them, the best workflows, productivity tips, high-leverage tricks, leading tools/models/systems, and the people leaking the most alpha. Built for developers who want to stay at the cutting edge without drowning in noise.
  • AI in EdTech Weekly (active, 92 sources): Weekly intelligence briefing on how artificial intelligence and technology are transforming education and learning - covering AI tutors, adaptive learning, online platforms, policy developments, and the researchers shaping how people learn.
  • Bitcoin Payment Adoption Tracker (active, 102 sources): Monitors Bitcoin adoption as a payment medium and currency worldwide, tracking merchant acceptance, payment infrastructure, regulatory developments, and transaction usage metrics.
  • AI News Digest (active, 114 sources): Daily curated digest of significant AI developments including major announcements, research breakthroughs, policy changes, and industry moves.
  • Global Agricultural Developments (active, 86 sources): Tracks farming innovations, best practices, commodity trends, and global market dynamics across grains, livestock, dairy, and agricultural inputs.
  • Recommended Reading from Tech Founders (active, 137 sources): Tracks and curates reading recommendations from prominent tech founders and investors across podcasts, interviews, and social media.
