Hours of research in one daily brief, on your terms.

Tell us what you need to stay on top of. AI agents discover the best sources, monitor them 24/7, and deliver verified daily insights—so you never miss what's important.

Set up your daily brief agent
Discovering relevant sources...
Syncing sources 0/180...
Extracting information
Generating brief

Recent briefs

Muse Spark Reopens the Frontier Race as Agent Platforms Mature
Apr 9
8 min read
876 docs
Jason Weston
Cursor
Alexandr Wang
+26
Meta’s new frontier model led the cycle, while Anthropic pushed fully hosted agents and new benchmarks showed how difficult real-world agent work remains. Research also advanced protein design, math, memory systems, and automated scientific writing.

Top Stories

Why it matters: The biggest developments this cycle combined a new frontier model, a push toward fully hosted agent infrastructure, and better evidence about where agent systems still break in real work.

Meta launches Muse Spark and re-enters the frontier race

Meta released Muse Spark as the first model from Meta Superintelligence Labs after a nine-month rebuild of its AI stack, and the model now powers Meta AI. Meta describes it as a natively multimodal reasoning model with tool-use, visual chain of thought, and multi-agent orchestration. Artificial Analysis scored it 52 on the Intelligence Index, placing it in the top five models it has benchmarked. On those benchmarks, Muse Spark was also notably token efficient at 58M output tokens, versus 157M for Claude Opus 4.6 and 120M for GPT-5.4.

The model’s strongest third-party results were in vision and reasoning: 80.5% on MMMU-Pro and 39.9% on Humanity’s Last Exam, while agentic performance trailed leaders on GDPval-AA and TerminalBench Hard. Meta is also gradually rolling out Contemplating mode, which has multiple agents reason in parallel, and says the model is still weaker in long-horizon agentic systems and coding workflows. Muse Spark is available at meta.ai and in the Meta AI app, with private preview API access for select partners.

Impact: This is both a capability jump and a strategy change. Muse Spark is Meta’s first frontier model since Llama 4 Maverick and its first frontier release that is not open weights.

Anthropic moves further up the stack with Claude Managed Agents

Anthropic introduced Claude Managed Agents as a public beta on the Claude Platform, positioning it as a way to build and deploy agents at scale. The product pairs a performance-tuned agent harness with production infrastructure so teams can move from prototype to launch in days. Anthropic’s engineering blog describes it as a hosted service for long-running agents.

Impact: Anthropic is packaging more of the agent stack as a hosted service, shifting competition from model access alone toward runtime, orchestration, and deployment infrastructure.

APEX-Agents-AA shows how hard real agent work still is

Artificial Analysis launched APEX-Agents-AA, a benchmark based on 452 long-horizon tasks from investment banking, management consulting, and corporate law, using MCP-based tools and pass@1 grading across three runs per task. The leaderboard is tightly clustered at the top: GPT-5.4 at 33.3%, Claude Opus 4.6 at 33.0%, and Gemini 3.1 Pro Preview at 32%.

The implementation runs inside Stirrup, Artificial Analysis’s open-source agent harness, and one outside summary noted a very large gap between proprietary and open-source models on this workload.

Impact: The result is a useful reality check. Even the leading models are completing only about one-third of these long-horizon professional tasks.
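Pass@1 grading of the kind described above is typically just the per-run success rate averaged over tasks. A minimal illustrative sketch of that scoring (my own simplification, not Artificial Analysis's actual harness code):

```python
def pass_at_1(run_results: list[list[bool]]) -> float:
    """Average pass@1 across tasks.

    run_results[t][r] is True if run r of task t passed.
    Each task's score is its fraction of passing runs; the
    benchmark score is the mean over tasks."""
    per_task = [sum(runs) / len(runs) for runs in run_results]
    return sum(per_task) / len(per_task)

# Three tasks, three runs each: 3/3, 1/3, and 0/3 runs passing.
score = pass_at_1([[True, True, True],
                   [True, False, False],
                   [False, False, False]])
# score == (1.0 + 1/3 + 0.0) / 3, roughly 0.444
```

Averaging over several runs per task, rather than a single attempt, reduces the variance that makes long-horizon agent leaderboards noisy.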

DISCO pushes generative AI deeper into experimental science

DISCO is a new diffusion system for joint protein sequence-structure co-design from Mila and Frances Arnold’s Caltech lab, with Yoshua Bengio also highlighting the release. In the headline example, it engineered an enzyme for selective C(sp³)–H insertion—described as one of the most challenging transformations in organic chemistry—using a single plate, without pre-specified catalytic residues, templates, theozymes, or inverse folding.

Impact: This is a strong example of multimodal generative modeling moving beyond software tasks into experimentally grounded molecular design.

Research & Innovation

Why it matters: The most interesting research this cycle focused on memory, writing, training methods, and formal reasoning—areas that directly affect whether AI systems become more useful in extended workflows.

OpenAI reports five more Erdős problem solutions

OpenAI released a paper describing solutions to five further Erdős problems using an internal model. One highlighted result is a counterexample for Erdős Problem 1091, and the paper’s Figure 5 was produced by Codex.

Google’s PaperOrchestra targets automated research writing

PaperOrchestra is a multi-agent system that turns raw ideas, notes, and experimental logs into submission-ready LaTeX manuscripts. It uses specialized agents for literature synthesis, plot generation, conceptual diagrams, and iterative refinement, and introduces PaperWritingBench, built from reverse-engineered materials from 200 top AI conference papers. In side-by-side human evaluations, it posted 50–68% absolute win-rate margins on literature review quality and 14–38% on overall manuscript quality over autonomous baselines.

MIA treats agent memory as something that evolves during use

The Memory Intelligence Agent combines a non-parametric memory manager, an RL-trained planner, and an executor, with bidirectional conversion between parametric and non-parametric memory plus test-time learning during inference. Reported gains include up to 9% improvement for GPT-5.4 on LiveVQA and 31% average improvement across 11 benchmarks with a lightweight 7B executor.

Thinking Mid-training inserts reasoning before post-training

A new Thinking Mid-training recipe adds supervised fine-tuning and reinforcement learning between pretraining and post-training, using interleaved thoughts to teach models when and how to reason. On base Llama-3-8B, the authors report a 3.2x improvement on reasoning benchmarks compared with direct RL post-training.

Products & Launches

Why it matters: Product releases were less about generic chat and more about making models cheaper, more grounded, or easier to use in real workflows.

Qwen3.6 Plus improves Alibaba’s hosted model offering

Alibaba released Qwen3.6 Plus, a proprietary model with native vision input and a 1M-token context window, available through Alibaba Cloud’s API. Artificial Analysis scored it 50 on the Intelligence Index, up 5 points from Qwen3.5 397B. It also improved on agentic and reliability-oriented measures, including 1373 Elo on GDPval-AA and an AA-Omniscience move from -30 to +3 via reduced hallucination.

A notable commercial angle is cost: Artificial Analysis estimated about $483 to run the full Intelligence Index on Qwen3.6 Plus, versus much higher costs for frontier proprietary peers.

Google brings notebooks into Gemini

Google is rolling out Notebooks in Gemini as a project workspace where users can organize chats, notes, documents, and PDFs, and get answers grounded in those sources. The feature syncs with NotebookLM in both directions, so sources added in one appear in the other. Rollout starts on the web for Google AI Ultra, Pro, and Plus subscribers.

Cognition ships SWE-1.6 in Windsurf

Cognition released SWE-1.6, which it describes as its best model on both intelligence and model UX, matching its Preview model on SWE-Bench Pro while improving behavior on other axes. It is available in Windsurf with a 200 tok/s free tier and a 950 tok/s fast tier.

LiquidAI targets edge reasoning with LFM2.5-VL-450M

LiquidAI released LFM2.5-VL-450M, a vision-language model built for real-time reasoning on edge devices. It supports bounding boxes, object detection, function calling, and nine-language multilingual use, and processes a 512×512 image in about 240ms on-device.

Industry Moves

Why it matters: Labs are making bigger strategic bets on distribution, compute scale, and applied AI programs beyond core model releases.

Meta shifts its release strategy

Muse Spark is not just a new model. It is Meta’s first frontier release that is not open weights, and Meta is integrating it across Meta AI, Facebook, Instagram, and Threads while saying larger models are already in development.

xAI outlines a larger training slate at Colossus 2

Elon Musk said Colossus 2 now has seven models in training: Imagine V2, two 1T variants, two 1.5T variants, a 6T, and a 10T model. In follow-up posts, Musk said the 1T model is about 2–3 weeks away, the 1.5T about 4–5 weeks, and a pre-training phase is about two months.

OpenAI Foundation commits major Alzheimer’s funding

The OpenAI Foundation said it is taking an end-to-end AI approach to Alzheimer’s, spanning early diagnosis, disease understanding, and drug discovery. It is finalizing over $100M in grants across six institutions this month.

Policy & Regulation

Why it matters: Safety disclosures and evaluation frameworks continued to shape de facto standards for deployment.

Meta publishes a safety framework with Muse Spark

Meta released Muse Spark alongside an Advanced AI Scaling Framework that covers evaluation across bio, chem, cyber, and loss-of-control risks before and after mitigations. In that framework, Muse Spark achieved a 98% bioweapons refusal rate on BioTier-refuse, which Meta says was the highest among the models it benchmarked. Meta says this is the start of a safety system designed to scale with future model capability.

ClawsBench highlights how weak agent safety can still be

ClawsBench measures both capability and safety in stateful agent environments built around tools like Google Workspace CLI and Slack MCP. One key finding is that scaffolding matters more than model choice: adding skills moved results from 0–8% to 39–63%. Another is that capability and safety can diverge: Opus led capability at 63% but also tied for the worst unsafe-action rate at 23%, while GPT-5.4 had the lowest unsafe-action rate at 7% but only mid-tier task performance. Only 1 out of 7,224 trials explicitly detected a prompt injection.

Quick Takes

Why it matters: Smaller releases still showed rapid movement in video generation, developer tooling, model serving, and workflow automation.

  • Bytedance’s Dreamina Seedance 2.0 moved to #1 in Video Arena for both text-to-video and image-to-video, with large gains over its prior version.
  • Google added Flex and Priority service tiers to the Gemini API, including a 50% lower-cost tier for latency-tolerant workloads and a priority tier for critical apps.
  • W&B Automations is now live, adding metric alerts, Slack notifications, and webhook-driven actions like triggering eval pipelines or killing failed jobs.
  • Cursor’s code review agent now learns from PR activity to self-improve in real time; the company says 78% of issues it finds are resolved before merge.
  • Nomic and Muna released on-device layout models for PDF understanding, with no server, no API key, and local parsing of 500-page PDFs.
  • SWE-bench crossed 1 million downloads; an easier inference stack and SWE-bench Multimodal are next.
  • NVIDIA and vLLM submitted the first MLPerf vision-language-model benchmark using vLLM.
  • Runway added custom voices for Runway Characters, generated from text prompts.
Smalltalk Best Practice Patterns Leads Picks on Code, Ecosystems, and Institutions
Apr 9
3 min read
209 docs
The All-In Podcast
Reid Hoffman
David Heinemeier Hansson (DHH)
+5
Today's strongest organic recommendations center on one high-conviction programming book, two books about systems and motivation, and two videos on California policy and education design. DHH's praise of Smalltalk Best Practice Patterns stands out for its unusually specific use case.

What stood out

After filtering for direct, organic recommendations, today's strongest picks split into two useful clusters: builder craft and institutional design. The clearest signal is David Heinemeier Hansson's endorsement of Smalltalk Best Practice Patterns because he gives both a very strong ranking and a specific reason to read it: learning how to structure methods and classes.

Most compelling recommendation

Smalltalk Best Practice Patterns

  • Content type: Book
  • Author/creator: Kent Beck
  • Who recommended it: David Heinemeier Hansson
  • Key takeaway: DHH calls it his #1 recommendation for programmers who want to learn the nitty-gritty of structuring methods and classes, and says it remains his favorite book on tactical programming patterns
  • Why it matters: This is the strongest pick today because the recommendation is both high-conviction and highly specific: it tells readers exactly what skill the book helps build

"Smalltalk Best Practice Patterns is my number one recommendation for any programmer who wants to learn the nitty-gritty of how to structure a method and a class..."

Books that explain how builders operate

Regional Advantage

  • Content type: Book
  • Author/creator: AnnaLee Saxenian
  • Who recommended it: Reid Hoffman
  • Key takeaway: Hoffman points to it as a useful explanation for why Silicon Valley outstripped Boston, emphasizing that looser non-compete rules let network effects and knowledge spread across the region instead of being locked inside individual companies
  • Why it matters: It is a strong systems-level recommendation for readers trying to understand how policy and labor mobility shape startup ecosystems

Man's Search for Meaning

  • Content type: Book
  • Author/creator: Viktor Frankl
  • Who recommended it: David Heinemeier Hansson
  • Key takeaway: DHH invokes Frankl's idea that finding a "why" helps people endure discomfort, and applies that lesson directly to the frustrations of building with computers
  • Why it matters: This is the clearest mindset recommendation in the set because it connects meaning to persistence in technical work

Two watches on institutions

How Matt Mahan Thinks He Can Save California

  • Content type: Podcast/video episode
  • Author/creator: The All-In Podcast
  • Link/URL: https://x.com/theallinpod/status/2035888224308957611
  • Who recommended it: Marc Andreessen
  • Key takeaway: Andreessen says the episode is worth watching and praises Matt Mahan as an outstanding mayor; the discussion covers California's public-sector unions, pension liabilities, housing regulation and fees, and energy policy
  • Why it matters: It is the most concrete policy resource in today's set, with a clear topic map for readers focused on California's operating environment

A 200-Year Vision for American Education

  • Content type: Video
  • Author/creator: @delk and the Primer team
  • Link/URL: https://x.com/delk/status/2041875924811915664
  • Who recommended it: Scott Belsky
  • Key takeaway: Belsky says the team has been reimagining "school for what's ahead" for more than five years and argues that education should prepare the next generation for resilience, flexibility, courage, and mastery
  • Why it matters: It is the clearest education-design recommendation today, and it makes the desired outcomes explicit instead of talking about schooling in generalities

Bottom line

The highest-signal pattern today is specificity. The best recommendations are not generic praise; they come with a precise lesson, whether that's how to structure code, how regions compound knowledge, how to endure hard technical work, or what institutions should optimize for.

DHH Goes Agent-First, PI Pushes Minimalism, and GPT-5.4 Wins a Brutal Benchmark
Apr 9
5 min read
105 docs
DHH
David Heinemeier Hansson (DHH)
Salvatore Sanfilippo
+10
DHH’s production workflow is the clearest practical signal today: start with agents, review diffs, and keep two models running in parallel. Also: PI’s minimalist harness philosophy, a punishing GPT-5.4 vs Opus benchmark, and cost-conscious automation tricks from Kent C. Dodds.

🔥 TOP SIGNAL

DHH has gone from disliking autocomplete-style AI to running an agent-first workflow in production: he now starts new work with an agent draft, reviews diffs in Neovim/Lazygit, and keeps a second model running in parallel for harder problems. The broader takeaway from today’s sources is even more useful: the best practitioners are not converging on full autonomy; they’re converging on lean harnesses, explicit review loops, and senior engineers as validators/redirection layers.

"Now I start with the agent. Now it'll give me the draft. I'll review the draft, and I'll make alterations if need be."

🛠️ TOOLS & MODELS

  • Model winner depends heavily on task + harness. DHH says Opus 4.5 was the inflection point that made agent-first coding viable for him, and he still reaches for Opus on hard problems. In a very different setup, Salvatore Sanfilippo’s multi-day reverse-engineering benchmark had GPT-5.4/Codex doing 99.5–100% of the work while Opus 4.6 mostly spun its wheels. Theo’s smaller prompt example points the same way: he says GPT/Codex treats prompts like instructions, while Opus sometimes treats them like a vibe.
  • PI is the most interesting open-source harness signal today: ~4 built-in tools, a ~20-line system prompt, no automatic AgentMD/MCP clutter, self-customization via editing its own source and /reload, plus a TypeScript extension system for building custom agents/TUIs on top.
  • Cursor shipped two concrete agent updates: remote agents you can kick off from your phone onto a devbox, and BugBot code review that learns from PR activity and says 78% of the issues it finds are resolved by merge.
  • Local review loops are getting first-class support. CodeRabbit’s CLI can now be called directly by an agent, returns structured JSON with issues + fixes, and is being framed as a pre-PR review layer by both creator coverage and Theo/Ben’s discussion of agent-integrated review.
  • OpenClaw keeps leaning into local-model and provider flexibility: Peter Steinberger added support for inferrs, described as a super efficient TurboQuant inference server, and says he has spent significant time making local models easy to use in OpenClaw.

💡 WORKFLOWS & TRICKS

  • Copy DHH’s dual-model loop. Run tmux with Neovim on the left, a faster model in one pane, Opus in another, and a terminal strip below. Start the task in an agent pane, watch the diff in Lazygit, then either commit immediately or edit the code yourself if the diff is close-but-not-right.
  • Use agents for PR triage, not just greenfield code. DHH’s loop is: pass Claude a PR/issue URL, let it analyze, merge the small minority that are good as-is, ask for a clean-room rewrite when the problem is right but the implementation is wrong, and reject the rest. He says that got him through 100 PRs in 90 minutes.
  • For hard problems, make two models argue before either writes code. DHH asks one model for a plan, sends that plan to another model for critique, then ping-pongs a couple more rounds before execution.
  • Queue your agent-triggered CI. DHH says concurrent all-core local CI runs from multiple agents were overrunning his machine, so they added a simple “WAIT YOUR TURN” line for agents.
  • Voice-to-deploy is already real for small projects. Kent bought a domain on Cloudflare, used Claude speech-to-text to tell Kody what he wanted, and Kody built/deployed a Cloudflare Worker landing page with Kit signup integration and an OG image from Cloudflare Browser Rendering.
  • Cost discipline matters. Kent shut off OpenClaw after it started costing real money, then rebuilt the two features he needed with Kody + Cloudflare infra; his explicit tradeoff is that MCP is more limiting, but it can keep usage inside the AI subs he’s already paying for.
  • Timeless harness rule: keep the core loop lean. Ben’s PI case is blunt: fewer tools and smaller prompts work better; don’t dump LSP noise into every turn, let the agent finish its generation, then run lint/checks afterward.
  • Common agent grammar is converging. Simon Willison notes that file tools like view / insert / str_replace and “sub-agent as a tool” patterns are showing up outside Claude-style coding harnesses too, which is a good hint at which abstractions may stick.
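The str_replace pattern in that last bullet is easy to picture: replace exactly one occurrence of a string in a file, and fail loudly when the match is missing or ambiguous, since an ambiguous edit means the agent under-specified it. A minimal sketch of the common pattern (illustrative only, not the tool from any specific harness):

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace exactly one occurrence of `old` in the file at `path`.

    The uniqueness check is the point: zero matches means the agent's
    snapshot of the file is stale, multiple matches mean the edit is
    ambiguous, and in both cases refusing beats guessing."""
    text = Path(path).read_text()
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    Path(path).write_text(text.replace(old, new, 1))
```

The same exact-match contract is why agents quote surrounding lines when editing: extra context makes the target string unique.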

👤 PEOPLE TO WATCH

  • DHH — High-signal because he’s showing an actual senior-engineer production loop, not just hot takes: agent-first starts, diff review, PR triage, CLI design for agent interoperability, and a clear view on where human review still matters.
  • Salvatore Sanfilippo — One of the few people running long, reproducible agent benchmarks on weird real systems work instead of toy app demos; his GPT-5.4 vs Opus emulator test is worth reading for methodology alone.
  • Kent C. Dodds — Useful because he keeps turning agent talk into concrete side-project automation: NAS scripts, tunnels, Cloudflare infra, voice-driven landing pages, and cost-conscious rewrites when a setup gets too expensive.
  • Theo + Ben — Watch them when you want harsh negative signal on harness design. Their main argument today: Claude Code is bloated, PI is minimal, and model quality is only half the story if the execution layer is wasting tokens and polluting context.

🎬 WATCH & LISTEN

  • 45:38-48:06 — DHH’s dual-model layout. One fast model, one stronger model, Neovim in the middle, and human review on the diff instead of autocomplete-driven coding.
  • 1:05:22-1:07:15 — DHH’s PR triage loop. Review the URL, merge the clean ones, clean-room the fix when the idea is right but the code is wrong, and move on fast.
  • 30:52-32:59 — Human steering still matters. Salvatore explains how non-technical nudges—not hand-holding, just expert steering—helped GPT-5.4 break out of plateaus in a days-long emulator reconstruction task.

📊 PROJECTS & REPOS

  • PI — Open-source minimal agent harness with a small team behind it; Ben rebuilt his BTCA research agent and custom TUI around its SDK/extensions because it avoids auto-loading extra context and makes custom tools easier to control.
  • OpenClaw — Latest release adds inferrs support for efficient local inference, while Peter continues pushing local-model usability. The reality check: users like Kent are also finding that unattended agent setups can get expensive fast.
  • 37signals’ internal CLI push — Not open source, but worth watching as a project pattern: DHH says they’re building CLIs for Basecamp, HEY, and Fizzy so agents can pipe work across tools like Sentry, GitHub, and Basecamp with a clean record of what happened.

Editorial take: the edge right now is not “more autonomous agents” — it’s better harnesses, tighter review loops, and humans who know when to redirect the model.

Truthful Roadmaps, Problem-First Strategy, and AI-Native Execution
Apr 9
10 min read
82 docs
Product Management
Product Marketing
Melissa Perri
+7
This brief connects three themes shaping product work right now: truthful roadmaps, problem-first strategy, and the rise of AI-native execution. It also includes concrete playbooks, operating cases, and career signals from across the PM community.

Big Ideas

1) Roadmaps should communicate certainty and constraint

“Feature-based roadmaps are fiction. Everyone on the product team knows it.”

Teresa Torres argues roadmaps should match what the team actually knows: specific about what is being built now, directional about what is next, and outcome-focused further out. She positions Now / Next / Later as a better compromise between flexibility and visibility, especially when combined with opportunity solution trees so teams can explore without overpromising.

The same idea shows up in execution. In Melissa Perri’s date example, the real problem was not missing notifications when dates changed; it was forcing teams to enter overly specific dates they did not actually know. Allowing quarter-, month-, or day-level precision reduced churn and made communication more accurate. Anna Hannemann describes the organizational version: explicit trade-offs in workdays and dedicated dependency pre-alignment were needed because unresolved dependencies caused waiting, late objections, and wasted effort across 12 teams.

  • Why it matters: False certainty erodes trust, creates noise, and hides the real capacity and dependency constraints behind a roadmap.
  • How to apply: Use Now / Next / Later, let date precision match certainty, and make trade-offs and dependencies visible before kickoff.

2) Strong strategy starts with the real job to be done

“We have to solve real problems for real people.”

Melissa Perri’s guidance is to ask for the last concrete moment when someone needed a feature, then reconstruct what actually happened. That is how teams discover whether the request is the solution or only a proxy for the real constraint. The same logic appears in API strategy discussions: platform parity and exposing everything by request volume can produce APIs that ship but do not sell, while better strategies start from operational problems, concrete use cases, adoption scale, company fit, and clear success metrics.

This is also the lesson Melissa surfaced from AI work. Teams that focused on delivering AI solutions instead of solving customer problems spent time on efforts that did not bear fruit. Stack Overflow’s response has been to return to core strengths—human connection and canonical answers—while creating space for adjacent kinds of questions and a broader definition of technologists.

  • Why it matters: Teams waste time when they optimize for feature requests, parity, or shiny technology instead of the underlying problem.
  • How to apply: Ask for a real incident, define the operational problem, size the use case, and only then decide whether the right answer is a feature, an API, or no build at all.

3) Better product bets are measured by ongoing outcomes and incrementality

At International Baccalaureate, Kate Kempe says one of the biggest shifts has been moving the team from project delivery to delivering and maintaining a healthy product in life. Launch is the start of the journey, not the success condition; the real question is what measurable outcomes the product is for. TikTok applies a similar filter in growth: when evaluating what to build next, the team asks whether a use case is truly new, whether it unlocks incremental revenue, and what extra advertiser value it creates.

This discipline also sharpens positioning. In saturated markets, if the real edge is enterprise readiness—security, governance, compliance, integrations, scalability, support, procurement—then the target audience should be the buyers who care about those things. The trade-off is that the winning segment may also cap growth if the market is too small.

  • Why it matters: Output at launch can look successful even when the product is not healthy, the use case is not incremental, or the segment is too diffuse.
  • How to apply: Define the post-launch outcome, test whether the bet opens new budget or a new use case, and narrow messaging to the segment that actually values the differentiation.

Tactical Playbook

1) Make roadmap certainty explicit

  1. Put near-term work in Now and describe it specifically.
  2. Keep Next directional and push the longer-range view into outcomes rather than feature promises.
  3. Let date precision match certainty: quarter, month, or exact day only when you really know it.
  4. Price major initiatives in workdays so stakeholders see the trade-off as “this or that,” not as an unexplained no.
  5. For cross-team work, run a staged dependency review: week 1 for epic completeness, week 2 with the team, weeks 3-4 for domain-level dependency mapping, then weekly realignment after kickoff.

Why it matters: This combination addresses the three failure modes described across the sources: false certainty, hidden dependencies, and late reversals.
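The date-precision idea in step 3 can be pictured as a tiny data model that renders only the precision the team actually has. This is an illustrative sketch; the `RoadmapDate` and `Precision` names are invented for the example, not from any of the cited sources:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Precision(Enum):
    QUARTER = "quarter"
    MONTH = "month"
    DAY = "day"

@dataclass
class RoadmapDate:
    """A target date that carries only as much precision as the team
    actually knows, so stakeholders see 'Q3 2026' instead of a fake day."""
    when: date
    precision: Precision

    def label(self) -> str:
        if self.precision is Precision.DAY:
            return self.when.isoformat()          # e.g. 2026-08-14
        if self.precision is Precision.MONTH:
            return self.when.strftime("%B %Y")    # e.g. August 2026
        quarter = (self.when.month - 1) // 3 + 1
        return f"Q{quarter} {self.when.year}"     # e.g. Q3 2026
```

The point of the type is that the precision travels with the date, so every downstream view (roadmap, status update, notification) communicates the same honest level of certainty.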

2) Turn a request into a product decision

  1. Ask for the last real moment the user felt the need, not a hypothetical preference.
  2. Reconstruct the situation, constraints, and what actually happened.
  3. Test whether the requested feature actually solves the problem; in the date example, notifications were secondary to bad certainty handling.
  4. If the ask is an API or platform request, document the operational problem, the use case, and how important that job is to the customer’s business.
  5. Estimate adoption scale and TAM, define whether the value is revenue, retention, or engagement, and decide how success will be measured.
  6. Check whether the solution fits your data, core competencies, and competitive position before you commit.

Why it matters: This is the discipline that prevents parity-only roadmaps and AI-for-AI’s-sake initiatives.

3) Build for developers without becoming their admin layer

  1. Remove clunky data-entry work and indirect workflows that engineers avoid.
  2. Optimize for speed and directness so the tool gets users to their goal faster.
  3. Design for sophisticated users who will jump through hoops when empowered, but will reject tools that waste their time or disrespect their expertise.
  4. Support the environment they actually work in: local workflows, CI/CD pipelines, and infrastructure as code.

Why it matters: When developer tools are clunky, PMs inherit the coordination and data-entry burden by default.

4) Run growth experiments with tighter commercial loops

  1. Start from observed user behavior; TikTok’s live-shopping push began when people were already sharing phone numbers on live streams to close sales.
  2. Treat high-potential live or commerce events as scheduled programs, not as spontaneous moments.
  3. Pair the event with creators who already have an engaged audience on the platform.
  4. Build an internal workflow that can react quickly, because the businesses that were more nimble in shifting focus to the best-performing product were more successful.

Why it matters: TikTok cited examples of stores doing $1M per hour on live streams when those loops worked well.

Case Studies & Lessons

1) The roadmap-date problem was really a certainty problem

A request for date-change notifications sounded reasonable until the team dug deeper. The underlying issue was that PMs had to enter precise dates they did not actually know, so dates kept changing and stakeholders got noisy updates. The better fix was to let teams express certainty at the right level of granularity—quarter, month, or exact day—which reduced changes and improved communication.

  • Takeaway: When a request keeps surfacing, ask whether it is compensating for a deeper system design flaw.

2) Picnic made cross-team dependency work visible

In Picnic’s warehouse systems domain, 12 product teams support inbound, stock, and outbound workflows. To keep innovation flowing while protecting fulfillment quality and efficiency, the team collects ideas, turns them into explicit trade-off options, and brings those options to founders for choice. For large initiatives, they now give dependency uncovering a four-week time box before kickoff and keep weekly check-ins running after launch.

  • Takeaway: If dependency alignment has no explicit time and place, it shows up later as waiting, rework, and late vetoes.

3) SaaStr’s QB moved from a portal replacement to an agentic CS system

QB started as a custom replacement for a sponsor portal that had poor usability, weak visibility into customer activity, and no agentic behavior. Once it was in production, the team added more automation based on real usage data, including personalized emails, daily gap identification, and task follow-up. The reported impact was a roughly 70% decrease in billable hours, more than 10x engagement, near-universal logins, and AI costs below $200 per month across the apps involved.

  • Takeaway: Custom AI workflows can outperform rigid off-the-shelf tools, but only with spec-first design, incremental rollout, exhaustive testing, and daily maintenance.

4) TikTok Shop shows what happens when the funnel collapses

TikTok described commerce as an end-to-end in-app system spanning discovery, creator promotion, fulfillment, and purchase. The product thesis is simple: every extra click kills conversion, so pushing users closer to the transaction matters. On live shopping, the winning pattern was scheduled lives, creator amplification, and fast pivots toward the products that were already converting. During Black Friday Cyber Monday, TikTok cited examples of stores doing $1M per hour on live-stream sales.

  • Takeaway: Growth loops improve when product design, creator distribution, and internal operating cadence are all aligned.

Career Corner

1) In new environments, listen before you lead

“Be interested and resist the urge to be interesting.”

Kate Kempe’s advice for new roles is to resist the pressure to prove yourself immediately. She describes listening, taking time to absorb context, and building relationships patiently as more effective than trying to make a big impression too early.

  • Why it matters: Moving too fast can lose people, especially in sectors that run at a different pace or depend on broader ecosystem readiness.
  • How to apply: Use early conversations to understand what the product is for, who it serves, and what success means before pushing visible change.

2) AI fluency is showing up as a hiring signal when it looks like a system

Across the notes, the stronger hiring signal is not “I use AI for PRDs.” It is a system: background agents handling work, tool connectors, agent teams, knowledge management, prototypes, or a public project that demonstrates how you work. One senior leader in the Reddit thread says they will not hire PMs who are not learning to use AI tools in ways that actually help them do the job, even if that does not mean shipping code today.

  • Why it matters: In these sources, interviewers and hiring leaders are using AI workflow maturity as a differentiator.
  • How to apply: Build one industry-relevant project, document the workflow, and be ready to explain the system behind it—not just say you use AI.

3) Use structured transitions in a difficult market

Kempe credits a job search council of 4-6 diverse professionals, meeting weekly for 10 or more sessions, with helping her narrow criteria and move deliberately instead of scattering applications. In the Reddit threads, others recommend adjacent roles such as Product Marketing Associate as pragmatic bridge roles when PM hiring is weak, especially when those roles include launches, cross-functional work, and end-to-end customer experience.

  • Why it matters: Both ideas reduce random searching and keep you close to PM-relevant work while the market is tight.
  • How to apply: Get specific about the role you want, use a small support group to pressure-test your search, and evaluate bridge roles by whether they increase launch ownership, cross-functional exposure, and product context.

Tools & Resources

1) Now / Next / Later plus opportunity solution trees

This combination is presented as a better balance between flexibility and visibility than dated feature roadmaps. It is most useful when you want a roadmap artifact that shows what is known now, what is directional next, and what future outcomes matter.

2) Workday-based trade-off slides and a fixed dependency cadence

Picnic’s approach is simple and reusable: show initiative alternatives in workdays, then give dependency uncovering its own time box before kickoff and weekly check-ins after kickoff.

3) Product Alliance’s Google modules

Multiple commenters recommended Product Alliance’s Google modules for PM interview prep because they go deeper on scale, infrastructure implications, technical trade-offs, L4 vs. L6 product sense, and Googleyness than more generic interview prep.

4) A spec-first vibe-coding checklist

The QB example offers a practical build template: write the spec first, provide design references, keep sensitive customer data out of the app itself, rely on source-system integrations for access, test every input and output, roll out to a few users first, and expect daily maintenance after launch.

5) Aakash Gupta’s AI PM learning path

Gupta’s five-step path for PMs is: Claude Code video, Cowork guide, PM OS, AI agents for PMs, and a free AI PM course. He characterizes the setup cost as a weekend, with hours-per-week return and compounding value as the system learns more about the product.

Mythos Debate Sharpens as Meta Launches Muse Spark and Open Models Advance
Apr 9
5 min read
217 docs
AI at Meta
Sebastian Raschka
Z.ai
+10
Debate over Anthropic’s Mythos shifted from alarm to questions of evidence, diffusion, and governance. Meta launched Muse Spark, while new releases and adoption data pointed to faster movement in the open-model ecosystem.

The Mythos debate moved from alarm to evidence

Cyber risk looks real, but the size of the step is contested

Anthropic’s unreleased Mythos is being described by briefed officials and commentators as a potentially dangerous cyber model, and Gary Marcus argued the episode strengthens the case for government oversight rather than leaving release decisions to company leaders. But the claims are already being challenged: Heidy Khlaaf flagged missing comparison benchmarks as a red flag, and Marcus said Mythos may not be as bad as the reporting suggests, even if it could still cause harm without qualifying as AGI.

Why it matters: The conversation is moving away from "is this AGI?" and toward a more practical question: how cyber-capable models should be evaluated, released, and governed.

Open models already reproduce parts of the showcase

A follow-on analysis shared by Clement Delangue found that eight out of eight small, cheap open-weight models detected Mythos’s flagship FreeBSD exploit, including a 3.6B-active model costing $0.11 per million tokens; a 5.1B-active open model also recovered the core chain of a 27-year-old OpenBSD bug. Another post summarized the broader result as a "super jagged" frontier, with rankings reshuffling across tasks rather than one model dominating everything. Martin Casado said models getting better at vulnerability finding could be positive if it lowers the cost of discovery and reduces zero-day hoarding.

"The models are ready. The question is whether the rest of the ecosystem is."

Why it matters: If useful cyber capability is already diffusing into smaller open models, defenders may need to focus less on a single frontier release and more on integrating these tools into real workflows now.

Meta turned its rebuilt AI stack into a product

Muse Spark is now live in Meta AI

Meta introduced Muse Spark, the first model from Meta Superintelligence Labs, describing it as a natively multimodal reasoning model with tool use, visual chain of thought, and multi-agent orchestration. It is available today in Meta AI and the Meta AI app, with a private-preview API for select partners, and Meta said future versions may be open-sourced. Meta also said the model shows competitive performance in multimodal perception, reasoning, health, and agentic tasks, while it continues investing in long-horizon agents and coding workflows where it sees current gaps.

Why it matters: This is Meta’s first public model from its new superintelligence lab, and the company is shipping it as a product while positioning larger models as the next step.

Meta’s bigger claim is about efficient scaling — and that is already being debated

In a technical thread, Meta said a rebuilt pretraining stack can reach the same capabilities with over an order of magnitude less compute than Llama 4 Maverick, and that its RL stack delivers smooth gains plus more token-efficient reasoning through thinking-time penalties and multi-agent orchestration at comparable latency. Meta is also rolling out Contemplating mode, which it says uses parallel agents to compete with the extreme reasoning modes of Gemini Deep Think and GPT Pro. François Chollet pushed back, arguing the new model already looks overoptimized for public benchmarks at the expense of actual usefulness.

Why it matters: Meta is not just launching a model; it is making a broader claim that its new stack scales efficiently. The immediate pushback shows how central the benchmark-versus-utility debate has become.

Open-weight competition keeps shifting toward coding agents and Chinese adoption

GLM-5.1 makes a strong bid for the top open-weight coding model

Z.ai launched GLM-5.1 and said it ranks #1 among open-source models and #3 globally on SWE-Bench Pro, Terminal-Bench, and NL2Repo. The company said the model is built for long-horizon tasks, with autonomous runs up to eight hours and thousands of refinement iterations, while Sebastian Raschka described it as a DeepSeek-V3.2-like architecture with more layers and called it "THE flagship open-weight model now" based on the published benchmarks.

Why it matters: The open-weight race is getting more focused on sustained coding and agent execution, not just chat quality.

New adoption data points to continued momentum for Chinese open models

The new ATOM Report says Chinese models are continuing to accelerate in open-model adoption and hold a strong lead in derivative models and OpenRouter inference share. Its RAM metric highlighted Qwen 3.5, Nemontron 3, and Kimi K2.5 as standout recent models, based on a manually curated set of roughly 1,500 important language models. Google, meanwhile, said Gemma 4 passed 10 million downloads within a week of launch, taking the Gemma family past 500 million total downloads.

Why it matters: Distribution remains broad, but the newest adoption data suggests Chinese model families are still gaining share quickly inside the open ecosystem.

Two quieter infrastructure signals worth watching

Anthropic productized long-running agents

Anthropic announced Managed Agents, a hosted service for long-running agents, and framed the engineering challenge as designing systems for "programs as yet unthought of".

Why it matters: Labs are increasingly packaging agent runtime infrastructure as a product, not just releasing stronger base models.

Safetensors moved deeper into the core AI stack

Hugging Face said Safetensors, created with collaborators including EleutherAI and Stability AI, has become the most popular way to share models safely and is now joining the PyTorch Foundation, with further scale-up including possible torch-core integration.

Why it matters: Secure model distribution is becoming part of the default ecosystem plumbing, not a side project.

War Premium Fades, Fertilizer Risk Persists, and Biological Inputs Move Mainstream
Apr 9
8 min read
234 docs
Secretary Brooke Rollins
Market Minute LLC
ABC Rural
+9
Grain markets pulled back sharply after the Iran ceasefire, but fertilizer logistics remain strained from the Strait of Hormuz to Brazil. The brief also highlights measurable gains from Brazilian biologicals, low-carbon grain incentives, and key regional livestock, trade, and policy shifts.

Market Movers

  • Global / U.S. grains: Iran ceasefire headlines removed a large chunk of war premium. May WTI crude fell 19.4% from the prior close to the overnight low, and grains repriced lower: May corn settled at $4.45¾, May soybeans at $11.59½, Chicago wheat at $5.80½, Kansas City wheat at $5.91¾, and spring wheat at $6.28¼. Wheat was hit hardest because the ceasefire coincided with a wetter forecast for winter wheat areas.

  • Soybeans / China: Soybeans recovered better than corn and wheat, with analysts tying part of that resilience to hopes that China’s role in the ceasefire could reopen U.S.-China talks and soybean buying. That support is being offset by China’s feed reform: one Brazilian market interview said swine rations could use 30% less soybean by 2030, while another source said fermented feed has risen to about 8% of industrial feed use and could reach 15% by 2030, potentially trimming soybean imports by as much as 6.3% from last year’s level. Technically, $11.58 is being watched as a key soybean support zone.

  • Wheat / Black Sea: Russia’s wheat crop is now projected at 88.7 MMT, up from 86.5 MMT, helped by better yields and more planted area. With Russia accounting for 19.5% of global wheat exports and Argus rating crop conditions at 3.4/5, the larger crop adds another bearish input to a market already losing war premium.

Innovation Spotlight

  • Brazil — soybean biological nitrogen fixation: Embrapa’s soybean package combines Bradyrhizobium with co-inoculation using Azospirillum brasilense. The technology, launched in 2014 after more than 10 years of research, is now used on about 35% of Brazil’s soybean area and has been cited as enabling yields up to 6,000 kg/ha without synthetic nitrogen fertilizer. Reported gains include stronger root growth, better phosphorus and potassium use efficiency, more drought tolerance, roughly R$150 billion/year in savings, and 230-260 million tons of CO2e avoided per crop, while reducing dependence on nitrogen fertilizer that Brazil imports at an 85% rate.
  • Brazil — scaling biologicals beyond large farms: Brazil was described as the global leader in biological use, but biologicals still represent only about 10% of chemical use, with available technologies seen as capable of lifting that share toward 40-50%. At Tecnoshow Comigo 2026, suppliers highlighted microbial products that optimize phosphate fertilization and can replace chemical fungicides and insecticides. Embrapa also said access remains skewed toward large-volume crops and is testing a cooperative biofactory model with Copavel in Paraná to reach small and medium growers.

  • U.S. — low-carbon grain incentives under 45Z: Discussants described 45Z as a tax credit paid to fuel producers, not directly to farmers, but said lower-carbon-intensity grain systems could still return roughly $10-20/acre through supply-chain payments. Practices cited as favorable for carbon-intensity scores included no-till or strip-till, manure in place of commercial fertilizer, split nutrient applications, and variable-rate fertilizer, with 2025-2027 crop data trackable through farm software.

Regional Developments

  • Brazil — protein exports stayed resilient, but concentration risk remains: Q1 2026 exports of beef, pork, and poultry rose 10% to 2.38 million tons, led by beef (+20% to 700,000 tons), pork (+15% to 330,000 tons), and poultry (+5% to 1.35 million tons). Exporters rerouted cargo through alternatives such as Suez feeders, the Strait of Magellan, and Saudi trucking, with some of the extra logistics cost shared with importers. The main risk is concentration: 68% of Brazil’s Q1 beef exports went to four countries, and China’s 1.1 million-ton quota carries a 12% tariff that rises to 55% above the quota.

  • Paraguay — pork capacity is expanding: Paraguay Pork plans to double sow numbers from 2,500 to 5,000 by end-2026 on the same site. The business linked sector growth to Taiwan’s market opening, additional slaughter capacity, and a planned US$50 million factory in Villeta, with export markets now including Taiwan and Russia.

  • Brazil / Mato Grosso — biomass sourcing is now an export issue: Canal Rural sources said a 2022 Mato Grosso rule allowing native-forest wood from authorized clearing to be used as industrial biomass conflicts with the Forest Code, which requires large biomass users consuming more than 24,000 m³/year to source from planted forests or management plans. The stakes are commercial as well as environmental: export buyers linked to the EU and U.S. may reject inputs tied to deforestation, even when legal. The same discussion pointed to 3.8 million hectares of degraded land in Mato Grosso that could support planted alternatives.

  • Brazil — financing and biofuel policy are shifting together: Rural delinquency reached a record 7.4% in February, and producer groups are pushing a debt-renegotiation bill they want advanced before the next crop cycle. At the same time, the energy ministry said Brazil will raise the ethanol blend in gasoline from 30% to 32% in the first half of 2026.

Best Practices

  • Use soil holding capacity to decide nitrogen timing: Ag PhD’s rule of thumb is to multiply cation exchange capacity (CEC) by 10 to estimate roughly how much nitrogen the soil can hold at one time. A full early-spring nitrogen application fits heavy soils, lower-rainfall areas, and moderate N demand; split applications fit light soils, higher-rainfall regions, and high-N-demand crops.
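The CEC rule of thumb above is simple arithmetic, and can be sketched in code. This is only an illustration of the stated heuristic: the ×10 multiplier comes from the source, while the function names and the single-vs-split decision logic are assumptions for the example, not agronomic advice.

```python
def max_soil_n_lbs_per_acre(cec: float) -> float:
    """Ag PhD rule of thumb: the soil can hold roughly CEC x 10 lb of
    nitrogen per acre at one time."""
    return cec * 10


def n_application_plan(cec: float, total_n_needed: float) -> str:
    """Hypothetical decision helper: apply once if the soil can hold the
    full rate, otherwise split applications."""
    capacity = max_soil_n_lbs_per_acre(cec)
    if total_n_needed <= capacity:
        return f"single early-spring application of {total_n_needed:.0f} lb/acre"
    return (f"split applications: soil holds ~{capacity:.0f} lb/acre, "
            f"crop needs {total_n_needed:.0f} lb/acre")


# A heavy soil (CEC 25) with moderate demand fits one pass;
# a light soil (CEC 8) with high N demand points to splits.
print(n_application_plan(25, 180))
print(n_application_plan(8, 180))
```

As in the text, the heuristic only sizes a single application; rainfall and crop demand still drive the final timing choice.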

  • For soybean systems under nitrogen pressure, use biological stacking where it is validated: The Brazilian co-inoculation model pairs nitrogen-fixing bacteria with root-growth-promoting bacteria. The practical benefits reported were stronger rooting, better phosphorus and potassium uptake, and more drought tolerance, making it especially relevant when imported nitrogen is expensive or uncertain.

  • Build low-carbon grain programs around field practices that can be measured: The stack most consistently cited for stronger carbon-intensity scores was no-till/strip-till, manure, split nutrient applications, and variable-rate fertilizer. In current 45Z discussions, those are the practices most often linked to potential on-farm payments through fuel supply chains.

  • For livestock expansion, match capital with partner strengths instead of forcing full ownership: Oklahoma cattle operators described partnership structures as a practical way to scale: move small sets of cows to partners who have grass, hay, or labor, pay based on per-head-per-day operating cost, and begin only with a clear buyer and business plan. That discipline matters in a market where bred cows are around $5,000 and weaned calves $2,500-$3,000.

Input Markets

  • Global nitrogen remains the main input risk: Angie Setzer summarized that one-third of globally traded fertilizer moves through the Strait of Hormuz, only 21 fertilizer tankers had crossed since Feb. 28, and U.S. Gulf urea had jumped 35% to above $800/ton. Even under a ceasefire, more than 1,000 vessels are backlogged, insurance is up 300%, and chemical plants take weeks to recommission.

  • Hormuz is not back to normal: Canal Rural added that more than 200 ships were inside the strait but only 62-63 were actively moving, with oil and gas vessels getting priority. Fertilizer cargo remains exposed to freight and insurance pressure, and QatarEnergy has halted urea production under force majeure.

  • Brazil’s nitrogen market is exposed to Russia and China at the same time: Brazil imported 1.2 million tons of ammonium nitrate in 2025, nearly all from Russia, so a one-month Russian suspension would remove about 100,000 tons from the market. China is also directing urea to its domestic spring market until August. Brazilian analysts said this does not necessarily create outright shortage if Russian exports resume next month, but it does keep 2026 production costs and fertilizer prices elevated because gas, freight, fuel, and logistics are all tighter.

  • Bioinputs are moving from niche to budget line: Brazil still depends on imports for about 86% of fertilizer use, but bioinputs were described as already present on roughly 30% of the country’s fields. Market projections put bio-defensives at 25% of defensives spending by 2029-2030, up from 10% today.

  • Feed formulation is becoming a demand variable: Fermented feed in China was described as a rising share of industrial feed use, but market sources also flagged open questions around hog health, growth performance, and meat quality versus soybean-based diets.

Forward Outlook

  • The next immediate trigger is the April USDA report: Market Minute’s review of the last decade suggests the release has usually been a modest mover. Corn averaged about 4 cents after the report and exceeded 10 cents only once; soybeans averaged 11 cents, or 7 cents excluding the 2022 war year.

  • The market is shifting from geopolitics back toward weather and acreage: U.S. planting has started, but Northern Plains snow and moisture point to a slower start and could shift acres toward later crops. Traders are watching $4.45-$4.50 in corn and $11.58 in soybeans as important technical markers.

  • Brazilian seasonal planning remains very regional: Current forecasts do not point to strong early frost in Paraná’s Catanduvas, but temperatures below 10°C remain a germination risk for safrinha corn in Paraná and southern Mato Grosso do Sul. In Bahia’s Sealba zone, growers were urged to use the next 15 days and roughly 70 mm of rain for planting before agricultural rains thin sharply in May and June.

  • Confidence is firmer than the cost backdrop: The Purdue/CME Ag Economy Barometer rose to 127 in March, and 65% of respondents said the U.S. is headed in the right direction, even as input costs remained a stated concern. In Brazil, rising delinquency is pushing restructuring efforts before the next planting cycle.

"Farmers used to manage costs; now they manage risks"

That line matches the present backdrop in livestock as well: rising costs are being described as an unpredictable risk problem, not a single-price problem.

Your time, back.

An AI curator that monitors the web nonstop, lets you control every source and setting, and delivers one verified daily brief.

Save hours

AI monitors connected sources 24/7—YouTube, X, Substack, Reddit, RSS, people's appearances and more—condensing everything into one daily brief.

Full control over the agent

Add/remove sources. Set your agent's focus and style. Auto-embed clips from full episodes and videos. Control exactly how briefs are built.

Verify every claim

Citations link to the original source and the exact span.

Discover sources on autopilot

Your agent discovers relevant channels and profiles based on your goals. You get to decide what to keep.

Multi-media sources

Track YouTube channels, Podcasts, X accounts, Substack, Reddit, and Blogs. Plus, follow people across platforms to catch their appearances.

Private or Public

Create private agents for yourself, publish public ones, and subscribe to agents from others.

Get your briefs in 3 steps

1

Describe your goal

Tell your AI agent what you want to track using natural language. Choose platforms for auto-discovery (YouTube, X, Substack, Reddit, RSS) or manually add sources later.

Stay updated on space exploration and electric vehicle innovations
Daily newsletter on AI news and research
Track startup funding trends and venture capital insights
Latest research on longevity, health optimization, and wellness breakthroughs
Auto-discover sources

2

Confirm your sources and launch

Your agent finds relevant channels and profiles based on your instructions. Review suggestions, keep what fits, remove what doesn't, add your own. Launch when ready—you can always adjust sources anytime.

Discovering relevant sources...

  • Sam Altman · Profile
  • 3Blue1Brown · Channel
  • Paul Graham · Account
  • The Pragmatic Engineer · Newsletter · Gergely Orosz
  • r/MachineLearning · Community
  • Naval Ravikant · Profile
  • AI High Signal · List
  • Stratechery · RSS · Ben Thompson

3

Receive verified daily briefs

Get concise, daily updates with precise citations directly in your inbox. You control the focus, style, and length.

Muse Spark Reopens the Frontier Race as Agent Platforms Mature
Apr 9
8 min read
876 docs
Jason Weston
Cursor
Alexandr Wang
+26
Meta’s new frontier model led the cycle, while Anthropic pushed fully hosted agents and new benchmarks showed how difficult real-world agent work remains. Research also advanced protein design, math, memory systems, and automated scientific writing.

Top Stories

Why it matters: The biggest developments this cycle combined a new frontier model, a push toward fully hosted agent infrastructure, and better evidence about where agent systems still break in real work.

Meta launches Muse Spark and re-enters the frontier race

Meta released Muse Spark as the first model from Meta Superintelligence Labs after a nine-month rebuild of its AI stack, and the model now powers Meta AI. Meta describes it as a natively multimodal reasoning model with tool use, visual chain of thought, and multi-agent orchestration. Artificial Analysis scored it 52 on the Intelligence Index, placing it in the top five models it has benchmarked. On those benchmarks, Muse Spark was also notably token efficient at 58M output tokens, versus 157M for Claude Opus 4.6 and 120M for GPT-5.4.

The model’s strongest third-party results were in vision and reasoning: 80.5% on MMMU-Pro and 39.9% on Humanity’s Last Exam, while agentic performance trailed leaders on GDPval-AA and TerminalBench Hard. Meta is also gradually rolling out Contemplating mode, which has multiple agents reason in parallel, and says the model is still weaker in long-horizon agentic systems and coding workflows. Muse Spark is available at meta.ai and in the Meta AI app, with private preview API access for select partners.

Impact: This is both a capability jump and a strategy change. Muse Spark is Meta’s first frontier model since Llama 4 Maverick and its first frontier release that is not open weights.

Anthropic moves further up the stack with Claude Managed Agents

Anthropic introduced Claude Managed Agents as a public beta on the Claude Platform, positioning it as a way to build and deploy agents at scale. The product pairs a performance-tuned agent harness with production infrastructure so teams can move from prototype to launch in days. Anthropic’s engineering blog describes it as a hosted service for long-running agents.

Impact: Anthropic is packaging more of the agent stack as a hosted service, shifting competition from model access alone toward runtime, orchestration, and deployment infrastructure.

APEX-Agents-AA shows how hard real agent work still is

Artificial Analysis launched APEX-Agents-AA, a benchmark based on 452 long-horizon tasks from investment banking, management consulting, and corporate law, using MCP-based tools and pass@1 grading across three runs per task. The leaderboard is tightly clustered at the top: GPT-5.4 at 33.3%, Claude Opus 4.6 at 33.0%, and Gemini 3.1 Pro Preview at 32%.

The implementation runs inside Stirrup, Artificial Analysis’s open-source agent harness, and one outside summary noted a very large gap between proprietary and open-source models on this workload.
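Pass@1 graded over multiple runs is conventionally estimated as the average per-task success rate. A minimal sketch of that calculation, assuming three boolean run outcomes per task (the task names and data layout are invented for illustration, not from Artificial Analysis):

```python
from statistics import mean


def pass_at_1(results_per_task: dict[str, list[bool]]) -> float:
    """Estimate pass@1 as the mean per-task success probability,
    where each task has several independent runs (here, three)."""
    per_task = [mean(runs) for runs in results_per_task.values()]
    return mean(per_task)


# Three runs per task, as in the benchmark setup described above.
results = {
    "ib_dcf_model":    [True, False, False],   # 1/3
    "consulting_deck": [True, True, False],    # 2/3
    "contract_review": [False, False, False],  # 0/3
}
print(f"pass@1 = {pass_at_1(results):.1%}")  # pass@1 = 33.3%
```

Averaging per task first keeps tasks equally weighted even if some tasks were run more times than others.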

Impact: The result is a useful reality check. Even the leading models are completing only about one-third of these long-horizon professional tasks.

DISCO pushes generative AI deeper into experimental science

DISCO is a new diffusion system for joint protein sequence-structure co-design from Mila and Frances Arnold’s Caltech lab, with Yoshua Bengio also highlighting the release. In the headline example, it engineered an enzyme for selective C(sp³)–H insertion—described as one of the most challenging transformations in organic chemistry—using a single plate, without pre-specified catalytic residues, templates, theozymes, or inverse folding.

Impact: This is a strong example of multimodal generative modeling moving beyond software tasks into experimentally grounded molecular design.

Research & Innovation

Why it matters: The most interesting research this cycle focused on memory, writing, training methods, and formal reasoning—areas that directly affect whether AI systems become more useful in extended workflows.

OpenAI reports five more Erdős problem solutions

OpenAI released a paper describing solutions to five further Erdős problems using an internal model. One highlighted result is a counterexample for Erdős Problem 1091, and the paper’s Figure 5 was produced by Codex.

Google’s PaperOrchestra targets automated research writing

PaperOrchestra is a multi-agent system that turns raw ideas, notes, and experimental logs into submission-ready LaTeX manuscripts. It uses specialized agents for literature synthesis, plot generation, conceptual diagrams, and iterative refinement, and introduces PaperWritingBench, built from reverse-engineered materials from 200 top AI conference papers. In side-by-side human evaluations, it posted 50–68% absolute win-rate margins on literature review quality and 14–38% on overall manuscript quality over autonomous baselines.

MIA treats agent memory as something that evolves during use

The Memory Intelligence Agent combines a non-parametric memory manager, an RL-trained planner, and an executor, with bidirectional conversion between parametric and non-parametric memory plus test-time learning during inference. Reported gains include up to 9% improvement for GPT-5.4 on LiveVQA and 31% average improvement across 11 benchmarks with a lightweight 7B executor.

Thinking Mid-training inserts reasoning before post-training

A new Thinking Mid-training recipe adds supervised fine-tuning and reinforcement learning between pretraining and post-training, using interleaved thoughts to teach models when and how to reason. On base Llama-3-8B, the authors report a 3.2x improvement on reasoning benchmarks compared with direct RL post-training.

Products & Launches

Why it matters: Product releases were less about generic chat and more about making models cheaper, more grounded, or easier to use in real workflows.

Qwen3.6 Plus improves Alibaba’s hosted model offering

Alibaba released Qwen3.6 Plus, a proprietary model with native vision input and a 1M-token context window, available through Alibaba Cloud’s API. Artificial Analysis scored it 50 on the Intelligence Index, up 5 points from Qwen3.5 397B. It also improved on agentic and reliability-oriented measures, including 1373 Elo on GDPval-AA and an AA-Omniscience move from -30 to +3 via reduced hallucination.

A notable commercial angle is cost: Artificial Analysis estimated about $483 to run the full Intelligence Index on Qwen3.6 Plus, versus much higher costs for frontier proprietary peers.
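Benchmark-run cost estimates like the one above reduce to tokens times per-million-token price, summed over input and output. A hedged sketch of that arithmetic; all token counts and prices below are hypothetical, not Qwen3.6 Plus's actual figures:

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of a benchmark run given per-million-token pricing."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m


# Hypothetical example: 100M input + 60M output tokens at $1/M in, $4/M out.
print(f"${run_cost(100_000_000, 60_000_000, 1.0, 4.0):,.0f}")  # $340
```

The same formula explains why token efficiency compounds with price: a model that emits fewer output tokens cuts the larger of the two terms.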

Google brings notebooks into Gemini

Google is rolling out Notebooks in Gemini as a project workspace where users can organize chats, notes, documents, and PDFs, and get answers grounded in those sources. The feature syncs with NotebookLM in both directions, so sources added in one appear in the other. Rollout starts on the web for Google AI Ultra, Pro, and Plus subscribers.

Cognition ships SWE-1.6 in Windsurf

Cognition released SWE-1.6, which it describes as its best model on both intelligence and model UX, matching its Preview model on SWE-Bench Pro while improving behavior on other axes. It is available in Windsurf with a 200 tok/s free tier and a 950 tok/s fast tier.

LiquidAI targets edge reasoning with LFM2.5-VL-450M

LiquidAI released LFM2.5-VL-450M, a vision-language model built for real-time reasoning on edge devices. It supports bounding boxes, object detection, function calling, and nine-language multilingual use, and processes a 512×512 image in about 240ms on-device.

Industry Moves

Why it matters: Labs are making bigger strategic bets on distribution, compute scale, and applied AI programs beyond core model releases.

Meta shifts its release strategy

Muse Spark is not just a new model. It is Meta’s first frontier release that is not open weights, and Meta is integrating it across Meta AI, Facebook, Instagram, and Threads while saying larger models are already in development.

xAI outlines a larger training slate at Colossus 2

Elon Musk said Colossus 2 now has seven models in training: Imagine V2, two 1T variants, two 1.5T variants, a 6T, and a 10T model. In follow-up posts, Musk said the 1T model is about 2–3 weeks away, the 1.5T about 4–5 weeks, and the next pre-training phase about two months out.

OpenAI Foundation commits major Alzheimer’s funding

The OpenAI Foundation said it is taking an end-to-end AI approach to Alzheimer’s, spanning early diagnosis, disease understanding, and drug discovery . It is finalizing over $100M in grants across six institutions this month .

Policy & Regulation

Why it matters: Safety disclosures and evaluation frameworks continued to shape de facto standards for deployment.

Meta publishes a safety framework with Muse Spark

Meta released Muse Spark alongside an Advanced AI Scaling Framework that covers evaluation across bio, chem, cyber, and loss-of-control risks before and after mitigations . In that framework, Muse Spark achieved a 98% bioweapons refusal rate on BioTier-refuse, which Meta says was the highest among the models it benchmarked . Meta says this is the start of a safety system designed to scale with future model capability .

ClawsBench highlights how weak agent safety can still be

ClawsBench measures both capability and safety in stateful agent environments built around tools like Google Workspace CLI and Slack MCP . One key finding is that scaffolding matters more than model choice: adding skills moved results from 0–8% to 39–63%. Another is that capability and safety can diverge: Opus led capability at 63% but also tied for the worst unsafe-action rate at 23%, while GPT-5.4 had the lowest unsafe-action rate at 7% but only mid-tier task performance . Only 1 out of 7,224 trials explicitly detected a prompt injection .

Quick Takes

Why it matters: Smaller releases still showed rapid movement in video generation, developer tooling, model serving, and workflow automation.

  • ByteDance’s Dreamina Seedance 2.0 moved to #1 in Video Arena for both text-to-video and image-to-video, with large gains over its prior version.
  • Google added Flex and Priority service tiers to the Gemini API, including a 50% lower-cost tier for latency-tolerant workloads and a priority tier for critical apps .
  • W&B Automations is now live, adding metric alerts, Slack notifications, and webhook-driven actions like triggering eval pipelines or killing failed jobs .
  • Cursor’s code review agent now learns from PR activity to self-improve in real time; the company says 78% of issues it finds are resolved before merge .
  • Nomic and Muna released on-device layout models for PDF understanding, with no server, no API key, and local parsing of 500-page PDFs .
  • SWE-bench crossed 1 million downloads; an easier inference stack and SWE-bench Multimodal are next .
  • NVIDIA and vLLM submitted the first MLPerf vision-language-model benchmark using vLLM .
  • Runway added custom voices for Runway Characters, generated from text prompts .
Smalltalk Best Practice Patterns Leads Picks on Code, Ecosystems, and Institutions
Apr 9
3 min read
209 docs
The All-In Podcast
Reid Hoffman
David Heinemeier Hansson (DHH)
+5
Today's strongest organic recommendations center on one high-conviction programming book, two books about systems and motivation, and two videos on California policy and education design. DHH's praise of Smalltalk Best Practice Patterns stands out for its unusually specific use case.

What stood out

After filtering for direct, organic recommendations, today's strongest picks split into two useful clusters: builder craft and institutional design. The clearest signal is David Heinemeier Hansson's endorsement of Smalltalk Best Practice Patterns because he gives both a very strong ranking and a specific reason to read it: learning how to structure methods and classes .

Most compelling recommendation

Smalltalk Best Practice Patterns

  • Content type: Book
  • Author/creator: Kent Beck
  • Who recommended it: David Heinemeier Hansson
  • Key takeaway: DHH calls it his #1 recommendation for programmers who want to learn the nitty-gritty of structuring methods and classes, and says it remains his favorite book on tactical programming patterns
  • Why it matters: This is the strongest pick today because the recommendation is both high-conviction and highly specific: it tells readers exactly what skill the book helps build

"Smalltalk Best Practice Patterns is my number one recommendation for any programmer who wants to learn the nitty gritty of how to structure a method and a class..."

Books that explain how builders operate

Regional Advantage

  • Content type: Book
  • Author/creator: AnnaLee Saxenian
  • Who recommended it: Reid Hoffman
  • Key takeaway: Hoffman points to it as a useful explanation for why Silicon Valley outstripped Boston, emphasizing that looser non-compete rules let network effects and knowledge spread across the region instead of being locked inside individual companies
  • Why it matters: It is a strong systems-level recommendation for readers trying to understand how policy and labor mobility shape startup ecosystems

Man's Search for Meaning

  • Content type: Book
  • Author/creator: Viktor Frankl
  • Who recommended it: David Heinemeier Hansson
  • Key takeaway: DHH invokes Frankl's idea that finding a "why" helps people endure discomfort, and applies that lesson directly to the frustrations of building with computers
  • Why it matters: This is the clearest mindset recommendation in the set because it connects meaning to persistence in technical work

Two watches on institutions

How Matt Mahan Thinks He Can Save California

  • Content type: Podcast/video episode
  • Author/creator: The All-In Podcast
  • Link/URL: https://x.com/theallinpod/status/2035888224308957611
  • Who recommended it: Marc Andreessen
  • Key takeaway: Andreessen says the episode is worth watching and praises Matt Mahan as an outstanding mayor; the discussion covers California's public-sector unions, pension liabilities, housing regulation and fees, and energy policy
  • Why it matters: It is the most concrete policy resource in today's set, with a clear topic map for readers focused on California's operating environment

A 200 year vision for American Education

  • Content type: Video
  • Author/creator: @delk and the Primer team
  • Link/URL: https://x.com/delk/status/2041875924811915664
  • Who recommended it: Scott Belsky
  • Key takeaway: Belsky says the team has been reimagining "school for what's ahead" for more than five years and argues that education should prepare the next generation for resilience, flexibility, courage, and mastery
  • Why it matters: It is the clearest education-design recommendation today, and it makes the desired outcomes explicit instead of talking about schooling in generalities

Bottom line

The highest-signal pattern today is specificity. The best recommendations are not generic praise; they come with a precise lesson, whether that’s how to structure code, how regions compound knowledge, how to endure hard technical work, or what institutions should optimize for.

DHH Goes Agent-First, PI Pushes Minimalism, and GPT-5.4 Wins a Brutal Benchmark
Apr 9
5 min read
105 docs
DHH
David Heinemeier Hansson (DHH)
Salvatore Sanfilippo
+10
DHH’s production workflow is the clearest practical signal today: start with agents, review diffs, and keep two models running in parallel. Also: PI’s minimalist harness philosophy, a punishing GPT-5.4 vs Opus benchmark, and cost-conscious automation tricks from Kent C. Dodds.

🔥 TOP SIGNAL

DHH has gone from disliking autocomplete-style AI to running an agent-first workflow in production: he now starts new work with an agent draft, reviews diffs in Neovim/Lazygit, and keeps a second model running in parallel for harder problems . The broader takeaway from today’s sources is even more useful: the best practitioners are not converging on full autonomy; they’re converging on lean harnesses, explicit review loops, and senior engineers as validators/redirection layers.

"Now I start with the agent. Now it'll give me the draft. I'll review the draft, and I'll make alterations if need be."

🛠️ TOOLS & MODELS

  • Model winner depends heavily on task + harness. DHH says Opus 4.5 was the inflection point that made agent-first coding viable for him, and he still reaches for Opus on hard problems . In a very different setup, Salvatore Sanfilippo’s multi-day reverse-engineering benchmark had GPT-5.4/Codex doing 99.5-100% of the work while Opus 4.6 mostly spun its wheels . Theo’s smaller prompt example points the same way: he says GPT/Codex treats prompts like instructions, while Opus sometimes treats them like a vibe .
  • PI is the most interesting open-source harness signal today: ~4 built-in tools, a ~20-line system prompt, no automatic AgentMD/MCP clutter, self-customization via editing its own source and /reload, plus a TypeScript extension system for building custom agents/TUIs on top .
  • Cursor shipped two concrete agent updates: remote agents you can kick off from your phone onto a devbox, and BugBot code review that learns from PR activity and says 78% of the issues it finds are resolved by merge .
  • Local review loops are getting first-class support. CodeRabbit’s CLI can now be called directly by an agent, returns structured JSON with issues + fixes, and is being framed as a pre-PR review layer by both creator coverage and Theo/Ben’s discussion of agent-integrated review .
  • OpenClaw keeps leaning into local-model and provider flexibility: Peter Steinberger added support for inferrs, described as a super efficient TurboQuant inference server, and says he has spent significant time making local models easy to use in OpenClaw .

💡 WORKFLOWS & TRICKS

  • Copy DHH’s dual-model loop. Run tmux with Neovim on the left, a faster model in one pane, Opus in another, and a terminal strip below. Start the task in an agent pane, watch the diff in Lazygit, then either commit immediately or edit the code yourself if the diff is close-but-not-right .
  • Use agents for PR triage, not just greenfield code. DHH’s loop is: pass Claude a PR/issue URL, let it analyze, merge the small minority that are good as-is, ask for a clean-room rewrite when the problem is right but the implementation is wrong, and reject the rest. He says that got him through 100 PRs in 90 minutes.
  • For hard problems, make two models argue before either writes code. DHH asks one model for a plan, sends that plan to another model for critique, then ping-pongs a couple more rounds before execution .
  • Queue your agent-triggered CI. DHH says concurrent all-core local CI runs from multiple agents were overrunning his machine, so they added a simple “WAIT YOUR TURN” line for agents .
  • Voice-to-deploy is already real for small projects. Kent bought a domain on Cloudflare, used Claude speech-to-text to tell Kody what he wanted, and Kody built/deployed a Cloudflare Worker landing page with Kit signup integration and an OG image from Cloudflare Browser Rendering .
  • Cost discipline matters. Kent shut off OpenClaw after it started costing real money, then rebuilt the two features he needed with Kody + Cloudflare infra; his explicit tradeoff is that MCP is more limiting, but it can keep usage inside the AI subs he’s already paying for .
  • Timeless harness rule: keep the core loop lean. Ben’s PI case is blunt: fewer tools and smaller prompts work better; don’t dump LSP noise into every turn, let the agent finish its generation, then run lint/checks afterward .
  • Common agent grammar is converging. Simon Willison notes that file tools like view / insert / str_replace and “sub-agent as a tool” patterns are showing up outside Claude-style coding harnesses too, which is a good hint at which abstractions may stick .
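The plan-critique ping-pong DHH describes above is easy to sketch as a small loop. Everything here is illustrative, not his actual setup: `planner` and `critic` stand in for calls to two different models, and the fixed round count is an assumed stopping rule.

```python
from typing import Callable

# A "model" here is just any prompt-in, text-out callable.
Model = Callable[[str], str]

def debate_plan(task: str, planner: Model, critic: Model, rounds: int = 2) -> str:
    """Have one model draft a plan and a second model critique it for a
    few rounds before any code gets written. Returns the final plan."""
    plan = planner(f"Write an implementation plan for: {task}")
    for _ in range(rounds):
        critique = critic(f"Critique this plan:\n{plan}")
        plan = planner(f"Revise the plan.\nPlan:\n{plan}\nCritique:\n{critique}")
    return plan
```

The point of the structure is that neither model sees its own output unchallenged: every revision is forced through the other model's critique before execution starts.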
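DHH’s “WAIT YOUR TURN” note above is a one-line fix for a classic problem: several agents launching all-core CI runs at once. A minimal sketch of the same idea in Python, using an advisory file lock so only one CI run executes at a time (the lock path and the commented `bin/ci` invocation are hypothetical, not 37signals’ actual setup; `fcntl` is Unix-only):

```python
import fcntl
from contextlib import contextmanager

LOCK_PATH = "/tmp/ci.lock"  # hypothetical shared lock file

@contextmanager
def ci_turn(lock_path: str = LOCK_PATH):
    """Block until it is this process's turn to run CI.

    fcntl.flock takes an exclusive advisory lock, so a second agent
    calling ci_turn() simply waits instead of oversubscribing cores.
    """
    with open(lock_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # waits its turn
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# Each agent wraps its CI invocation:
# with ci_turn():
#     subprocess.run(["bin/ci"], check=True)
```

Because the lock is advisory, every agent has to go through the same wrapper; the upside is that a crashed agent releases the lock automatically when its file handle closes.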
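The file-tool grammar Willison points to is easy to picture with a minimal sketch. This is an illustrative Python take on a `str_replace`-style edit tool, not any harness’s actual code; the single-occurrence check mirrors the common convention in Claude-style tools, where an ambiguous match is rejected rather than guessed at.

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> str:
    """Replace one exact occurrence of `old` with `new` in a text file.

    The edit is rejected unless `old` matches exactly once, so the
    model cannot silently patch the wrong spot.
    """
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return f"error: no match for {old!r}"
    if count > 1:
        return f"error: {count} matches for {old!r}; make the snippet unique"
    Path(path).write_text(text.replace(old, new, 1))
    return f"ok: replaced 1 occurrence in {path}"
```

`view` and `insert` follow the same shape: a plain function taking strings and returning a string result, which is part of why the pattern ports so easily between harnesses.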

👤 PEOPLE TO WATCH

  • DHH — High-signal because he’s showing an actual senior-engineer production loop, not just hot takes: agent-first starts, diff review, PR triage, CLI design for agent interoperability, and a clear view on where human review still matters .
  • Salvatore Sanfilippo — One of the few people running long, reproducible agent benchmarks on weird real systems work instead of toy app demos; his GPT-5.4 vs Opus emulator test is worth reading for methodology alone .
  • Kent C. Dodds — Useful because he keeps turning agent talk into concrete side-project automation: NAS scripts, tunnels, Cloudflare infra, voice-driven landing pages, and cost-conscious rewrites when a setup gets too expensive .
  • Theo + Ben — Watch them when you want harsh negative signal on harness design. Their main argument today: Claude Code is bloated, PI is minimal, and model quality is only half the story if the execution layer is wasting tokens and polluting context .

🎬 WATCH & LISTEN

  • 45:38-48:06 — DHH’s dual-model layout. One fast model, one stronger model, Neovim in the middle, and human review on the diff instead of autocomplete-driven coding .
  • 1:05:22-1:07:15 — DHH’s PR triage loop: review the URL, merge the clean ones, clean-room the fix when the idea is right but the code is wrong, and move on fast.
  • 30:52-32:59 — Human steering still matters. Salvatore explains how non-technical nudges—not hand-holding, just expert steering—helped GPT-5.4 break out of plateaus in a days-long emulator reconstruction task .

📊 PROJECTS & REPOS

  • PI — Open-source minimal agent harness with a small team behind it; Ben rebuilt his BTCA research agent and custom TUI around its SDK/extensions because it avoids auto-loading extra context and makes custom tools easier to control .
  • OpenClaw — Latest release adds inferrs support for efficient local inference, while Peter continues pushing local-model usability. The reality check: users like Kent are also finding that unattended agent setups can get expensive fast .
  • 37signals’ internal CLI push — Not open source, but worth watching as a project pattern: DHH says they’re building CLIs for Basecamp, HEY, and Fizzy so agents can pipe work across tools like Sentry, GitHub, and Basecamp with a clean record of what happened .

Editorial take: the edge right now is not “more autonomous agents” — it’s better harnesses, tighter review loops, and humans who know when to redirect the model.

Truthful Roadmaps, Problem-First Strategy, and AI-Native Execution
Apr 9
10 min read
82 docs
Product Management
Product Marketing
Melissa Perri
+7
This brief connects three themes shaping product work right now: truthful roadmaps, problem-first strategy, and the rise of AI-native execution. It also includes concrete playbooks, operating cases, and career signals from across the PM community.

Big Ideas

1) Roadmaps should communicate certainty and constraint

“Feature-based roadmaps are fiction. Everyone on the product team knows it.”

Teresa Torres argues roadmaps should match what the team actually knows: specific about what is being built now, directional about what is next, and outcome-focused further out. She positions Now / Next / Later as a better compromise between flexibility and visibility, especially when combined with opportunity solution trees so teams can explore without overpromising .

The same idea shows up in execution. In Melissa Perri’s date example, the real problem was not missing notifications when dates changed; it was forcing teams to enter overly specific dates they did not actually know. Allowing quarter-, month-, or day-level precision reduced churn and made communication more accurate . Anna Hannemann describes the organizational version: explicit trade-offs in workdays and dedicated dependency pre-alignment were needed because unresolved dependencies caused waiting, late objections, and wasted effort across 12 teams .

  • Why it matters: False certainty erodes trust, creates noise, and hides the real capacity and dependency constraints behind a roadmap .
  • How to apply: Use Now / Next / Later, let date precision match certainty, and make trade-offs and dependencies visible before kickoff .

2) Strong strategy starts with the real job to be done

“We have to solve real problems for real people.”

Melissa Perri’s guidance is to ask for the last concrete moment when someone needed a feature, then reconstruct what actually happened. That is how teams discover whether the request is the solution or only a proxy for the real constraint . The same logic appears in API strategy discussions: platform parity and exposing everything by request volume can produce APIs that ship but do not sell, while better strategies start from operational problems, concrete use cases, adoption scale, company fit, and clear success metrics .

This is also the lesson Melissa surfaced from AI work. Teams that focused on delivering AI solutions instead of solving customer problems spent time on efforts that did not bear fruit . Stack Overflow’s response has been to return to core strengths—human connection and canonical answers—while creating space for adjacent kinds of questions and a broader definition of technologists .

  • Why it matters: Teams waste time when they optimize for feature requests, parity, or shiny technology instead of the underlying problem .
  • How to apply: Ask for a real incident, define the operational problem, size the use case, and only then decide whether the right answer is a feature, an API, or no build at all .

3) Better product bets are measured by ongoing outcomes and incrementality

At International Baccalaureate, Kate Kempe says one of the biggest shifts has been moving the team from project delivery to delivering and maintaining a healthy product throughout its life. Launch is the start of the journey, not the success condition; the real question is what measurable outcomes the product is for . TikTok applies a similar filter in growth: when evaluating what to build next, the team asks whether a use case is truly new, whether it unlocks incremental revenue, and what extra advertiser value it creates .

This discipline also sharpens positioning. In saturated markets, if the real edge is enterprise readiness—security, governance, compliance, integrations, scalability, support, procurement—then the target audience should be the buyers who care about those things. The trade-off is that the winning segment may also cap growth if the market is too small .

  • Why it matters: Output at launch can look successful even when the product is not healthy, the use case is not incremental, or the segment is too diffuse .
  • How to apply: Define the post-launch outcome, test whether the bet opens new budget or a new use case, and narrow messaging to the segment that actually values the differentiation .

Tactical Playbook

1) Make roadmap certainty explicit

  1. Put near-term work in Now and describe it specifically .
  2. Keep Next directional and push the longer-range view into outcomes rather than feature promises .
  3. Let date precision match certainty: quarter, month, or exact day only when you really know it .
  4. Price major initiatives in workdays so stakeholders see the trade-off as “this or that,” not as an unexplained no .
  5. For cross-team work, run a staged dependency review: week 1 for epic completeness, week 2 with the team, weeks 3-4 for domain-level dependency mapping, then weekly realignment after kickoff .

Why it matters: This combination addresses the three failure modes described across the sources: false certainty, hidden dependencies, and late reversals .

2) Turn a request into a product decision

  1. Ask for the last real moment the user felt the need, not a hypothetical preference .
  2. Reconstruct the situation, constraints, and what actually happened .
  3. Test whether the requested feature actually solves the problem; in the date example, notifications were secondary to bad certainty handling .
  4. If the ask is an API or platform request, document the operational problem, the use case, and how important that job is to the customer’s business .
  5. Estimate adoption scale and TAM, define whether the value is revenue, retention, or engagement, and decide how success will be measured .
  6. Check whether the solution fits your data, core competencies, and competitive position before you commit .

Why it matters: This is the discipline that prevents parity-only roadmaps and AI-for-AI’s-sake initiatives .

3) Build for developers without becoming their admin layer

  1. Remove clunky data-entry work and indirect workflows that engineers avoid .
  2. Optimize for speed and directness so the tool gets users to their goal faster .
  3. Design for sophisticated users who will jump through hoops when empowered, but will reject tools that waste their time or disrespect their expertise .
  4. Support the environment they actually work in: local workflows, CI/CD pipelines, and infrastructure as code .

Why it matters: When developer tools are clunky, PMs inherit the coordination and data-entry burden by default .

4) Run growth experiments with tighter commercial loops

  1. Start from observed user behavior; TikTok’s live-shopping push began when people were already sharing phone numbers on live streams to close sales .
  2. Treat high-potential live or commerce events as scheduled programs, not as spontaneous moments .
  3. Pair the event with creators who already have an engaged audience on the platform .
  4. Build an internal workflow that can react quickly, because the businesses that were more nimble in shifting focus to the best-performing product were more successful .

Why it matters: TikTok cited examples of stores doing $1M per hour on live streams when those loops worked well .

Case Studies & Lessons

1) The roadmap-date problem was really a certainty problem

A request for date-change notifications sounded reasonable until the team dug deeper. The underlying issue was that PMs had to enter precise dates they did not actually know, so dates kept changing and stakeholders got noisy updates. The better fix was to let teams express certainty at the right level of granularity—quarter, month, or exact day—which reduced changes and improved communication .

  • Takeaway: When a request keeps surfacing, ask whether it is compensating for a deeper system design flaw .

2) Picnic made cross-team dependency work visible

In Picnic’s warehouse systems domain, 12 product teams support inbound, stock, and outbound workflows. To keep innovation flowing while protecting fulfillment quality and efficiency, the team collects ideas, turns them into explicit trade-off options, and brings those options to founders for choice. For large initiatives, they now give dependency uncovering a four-week time box before kickoff and keep weekly check-ins running after launch .

  • Takeaway: If dependency alignment has no explicit time and place, it shows up later as waiting, rework, and late vetoes .

3) SaaStr’s QB moved from a portal replacement to an agentic CS system

QB started as a custom replacement for a sponsor portal that had poor usability, weak visibility into customer activity, and no agentic behavior. Once it was in production, the team added more automation based on real usage data, including personalized emails, daily gap identification, and task follow-up. The reported impact was a roughly 70% decrease in billable hours, more than 10x engagement, near-universal logins, and AI costs below $200 per month across the apps involved .

  • Takeaway: Custom AI workflows can outperform rigid off-the-shelf tools, but only with spec-first design, incremental rollout, exhaustive testing, and daily maintenance .

4) TikTok Shop shows what happens when the funnel collapses

TikTok described commerce as an end-to-end in-app system spanning discovery, creator promotion, fulfillment, and purchase. The product thesis is simple: every extra click kills conversion, so pushing users closer to the transaction matters . On live shopping, the winning pattern was scheduled lives, creator amplification, and fast pivots toward the products that were already converting. During Black Friday Cyber Monday, TikTok cited examples of stores doing $1M per hour on live-stream sales .

  • Takeaway: Growth loops improve when product design, creator distribution, and internal operating cadence are all aligned .

Career Corner

1) In new environments, listen before you lead

“Be interested and resist the urge to be interesting.”

Kate Kempe’s advice for new roles is to resist the pressure to prove yourself immediately. She describes listening, taking time to absorb context, and building relationships patiently as more effective than trying to make a big impression too early .

  • Why it matters: Moving too fast can lose people, especially in sectors that run at a different pace or depend on broader ecosystem readiness .
  • How to apply: Use early conversations to understand what the product is for, who it serves, and what success means before pushing visible change .

2) AI fluency is showing up as a hiring signal when it looks like a system

Across the notes, the stronger hiring signal is not “I use AI for PRDs.” It is a system: background agents handling work, tool connectors, agent teams, knowledge management, prototypes, or a public project that demonstrates how you work . One senior leader in the Reddit thread says they are not hiring PMs who are not learning how to use AI tools in ways that actually help them do the job, even if that does not mean shipping code today .

  • Why it matters: In these sources, interviewers and hiring leaders are using AI workflow maturity as a differentiator .
  • How to apply: Build one industry-relevant project, document the workflow, and be ready to explain the system behind it—not just say you use AI .

3) Use structured transitions in a difficult market

Kempe credits a job search council of 4-6 diverse professionals, meeting weekly for 10 or more sessions, with helping her narrow criteria and move deliberately instead of scattering applications . In the Reddit threads, others recommend adjacent roles such as Product Marketing Associate as pragmatic bridge roles when PM hiring is weak, especially when those roles include launches, cross-functional work, and end-to-end customer experience .

  • Why it matters: Both ideas reduce random searching and keep you close to PM-relevant work while the market is tight .
  • How to apply: Get specific about the role you want, use a small support group to pressure-test your search, and evaluate bridge roles by whether they increase launch ownership, cross-functional exposure, and product context .

Tools & Resources

1) Now / Next / Later plus opportunity solution trees

This combination is presented as a better balance between flexibility and visibility than dated feature roadmaps. It is most useful when you want a roadmap artifact that shows what is known now, what is directional next, and what future outcomes matter .

2) Workday-based trade-off slides and a fixed dependency cadence

Picnic’s approach is simple and reusable: show initiative alternatives in workdays, then give dependency uncovering its own time box before kickoff and weekly check-ins after kickoff .

3) Product Alliance’s Google modules

Multiple commenters recommended Product Alliance’s Google modules for PM interview prep because they go deeper on scale, infrastructure implications, technical trade-offs, L4 vs. L6 product sense, and Googleyness than more generic interview prep .

4) A spec-first vibe-coding checklist

The QB example offers a practical build template: write the spec first, provide design references, keep sensitive customer data out of the app itself, rely on source-system integrations for access, test every input and output, roll out to a few users first, and expect daily maintenance after launch .

5) Aakash Gupta’s AI PM learning path

Gupta’s five-step path for PMs is: Claude Code video, Cowork guide, PM OS, AI agents for PMs, and a free AI PM course. He characterizes the setup cost as a weekend, with hours-per-week return and compounding value as the system learns more about the product .

Mythos Debate Sharpens as Meta Launches Muse Spark and Open Models Advance
Apr 9
5 min read
217 docs
AI at Meta
Sebastian Raschka
Z.ai
+10
Debate over Anthropic’s Mythos shifted from alarm to questions of evidence, diffusion, and governance. Meta launched Muse Spark, while new releases and adoption data pointed to faster movement in the open-model ecosystem.

The Mythos debate moved from alarm to evidence

Cyber risk looks real, but the size of the step is contested

Anthropic’s unreleased Mythos is being described by briefed officials and commentators as a potentially dangerous cyber model, and Gary Marcus argued the episode strengthens the case for government oversight rather than leaving release decisions to company leaders. But the claims are already being challenged: Heidy Khlaaf flagged the missing comparison benchmarks as a red flag, and Marcus said Mythos may not be as bad as the reporting suggests, even if it could still cause harm without qualifying as AGI.

Why it matters: The conversation is moving away from "is this AGI?" and toward a more practical question: how cyber-capable models should be evaluated, released, and governed .

Open models already reproduce parts of the showcase

A follow-on analysis shared by Clement Delangue found that eight out of eight small, cheap open-weight models detected Mythos’s flagship FreeBSD exploit, including a 3.6B-active model costing $0.11 per million tokens; a 5.1B-active open model also recovered the core chain of a 27-year-old OpenBSD bug . Another post summarized the broader result as a "super jagged" frontier, with rankings reshuffling across tasks rather than one model dominating everything . Martin Casado said models getting better at vulnerability finding could be positive if it lowers the cost of discovery and reduces zero-day hoarding .

"The models are ready. The question is whether the rest of the ecosystem is."

Why it matters: If useful cyber capability is already diffusing into smaller open models, defenders may need to focus less on a single frontier release and more on integrating these tools into real workflows now .

Meta turned its rebuilt AI stack into a product

Muse Spark is now live in Meta AI

Meta introduced Muse Spark, the first model from Meta Superintelligence Labs, describing it as a natively multimodal reasoning model with tool use, visual chain of thought, and multi-agent orchestration . It is available today in Meta AI and the Meta AI app, with a private-preview API for select partners, and Meta said future versions may be open-sourced . Meta also said the model shows competitive performance in multimodal perception, reasoning, health, and agentic tasks, while it continues investing in long-horizon agents and coding workflows where it sees current gaps .

Why it matters: This is Meta’s first public model from its new superintelligence lab, and the company is shipping it as a product while positioning larger models as the next step .

Meta’s bigger claim is about efficient scaling — and that is already being debated

In a technical thread, Meta said a rebuilt pretraining stack can reach the same capabilities with over an order of magnitude less compute than Llama 4 Maverick, and that its RL stack delivers smooth gains plus more token-efficient reasoning through thinking-time penalties and multi-agent orchestration at comparable latency . Meta is also rolling out Contemplating mode, which it says uses parallel agents to compete with the extreme reasoning modes of Gemini Deep Think and GPT Pro . François Chollet pushed back, arguing the new model already looks overoptimized for public benchmarks at the expense of actual usefulness .

Why it matters: Meta is not just launching a model; it is making a broader claim that its new stack scales efficiently. The immediate pushback shows how central the benchmark-versus-utility debate has become .

Open-weight competition keeps shifting toward coding agents and Chinese adoption

GLM-5.1 makes a strong bid for the top open-weight coding model

Z.ai launched GLM-5.1, saying it ranks #1 among open-source models and #3 globally on SWE-Bench Pro, Terminal-Bench, and NL2Repo. The company said the model is built for long-horizon tasks, with autonomous runs of up to eight hours and thousands of refinement iterations. Sebastian Raschka described it as a DeepSeek-V3.2-like architecture with more layers and called it "THE flagship open-weight model now" based on the published benchmarks.

Why it matters: The open-weight race is getting more focused on sustained coding and agent execution, not just chat quality.

New adoption data points to continued momentum for Chinese open models

The new ATOM Report says Chinese models are continuing to accelerate in open-model adoption and hold a strong lead in derivative models and OpenRouter inference share. Its RAM metric highlighted Qwen 3.5, Nemotron 3, and Kimi K2.5 as standout recent models, based on a manually curated set of roughly 1,500 important language models. Google, meanwhile, said Gemma 4 passed 10 million downloads within a week of launch, taking the Gemma family past 500 million total downloads.

Why it matters: Distribution remains broad, but the newest adoption data suggests Chinese model families are still gaining share quickly inside the open ecosystem.

Two quieter infrastructure signals worth watching

Anthropic productized long-running agents

Anthropic announced Managed Agents, a hosted service for long-running agents, and framed the engineering challenge as designing systems for "programs as yet unthought of".

Why it matters: Labs are increasingly packaging agent runtime infrastructure as a product, not just releasing stronger base models.

Safetensors moved deeper into the core AI stack

Hugging Face said Safetensors, created with collaborators including EleutherAI and Stability AI, has become the most popular way to share models safely and is now joining the PyTorch Foundation, with plans to scale it further, including possible integration into PyTorch core.

Why it matters: Secure model distribution is becoming part of the default ecosystem plumbing, not a side project.
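For readers who have not looked inside the format: Safetensors is deliberately simple, which is much of why it spread. Below is a minimal, dependency-free sketch of the on-disk layout (an 8-byte little-endian header length, a JSON index mapping tensor names to dtype, shape, and byte offsets, then the raw tensor bytes). The function names are illustrative; real code should use the official `safetensors` library rather than a hand-rolled reader.

```python
import json
import struct

def save_safetensors(path, tensors):
    """Write tensors to a Safetensors-style file.

    tensors: dict of name -> (dtype string, shape list, raw little-endian bytes)
    """
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, data) in tensors.items():
        header[name] = {"dtype": dtype, "shape": list(shape),
                        "data_offsets": [offset, offset + len(data)]}
        blobs.append(data)
        offset += len(data)
    hdr = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(hdr)))  # 8-byte little-endian header size
        f.write(hdr)                          # JSON index
        for blob in blobs:                    # contiguous tensor byte buffer
            f.write(blob)

def load_safetensors(path):
    """Read the file back into the same name -> (dtype, shape, bytes) mapping."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n).decode("utf-8"))
        buf = f.read()
    out = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = meta["data_offsets"]
        out[name] = (meta["dtype"], meta["shape"], buf[begin:end])
    return out

# Round-trip a small float32 vector.
vec = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
save_safetensors("demo.safetensors", {"weight": ("F32", [4], vec)})
restored = load_safetensors("demo.safetensors")
```

Because the JSON index comes first, a reader can locate any single tensor's byte range without parsing or executing anything else in the file, which is the core of the format's safety and lazy-loading appeal.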

War Premium Fades, Fertilizer Risk Persists, and Biological Inputs Move Mainstream
Apr 9
8 min read
234 docs
Secretary Brooke Rollins
Market Minute LLC
ABC Rural
+9
Grain markets pulled back sharply after the Iran ceasefire, but fertilizer logistics remain strained from the Strait of Hormuz to Brazil. The brief also highlights measurable gains from Brazilian biologicals, low-carbon grain incentives, and key regional livestock, trade, and policy shifts.

Market Movers

  • Global / U.S. grains: Iran ceasefire headlines removed a large chunk of war premium. May WTI crude fell 19.4% from the prior close to the overnight low, and grains repriced lower: May corn settled at $4.45¾, May soybeans at $11.59½, Chicago wheat at $5.80½, Kansas City wheat at $5.91¾, and spring wheat at $6.28¼. Wheat was hit hardest because the ceasefire coincided with a wetter forecast for winter wheat areas.

  • Soybeans / China: Soybeans recovered better than corn and wheat, with analysts tying part of that resilience to hopes that China’s role in the ceasefire could reopen U.S.-China talks and soybean buying. That support is being offset by China’s feed reform: one Brazilian market interview said swine rations could use 30% less soybean by 2030, while another source said fermented feed has risen to about 8% of industrial feed use and could reach 15% by 2030, potentially trimming soybean imports by as much as 6.3% from last year’s level. Technically, $11.58 is being watched as a key soybean support zone.

  • Wheat / Black Sea: Russia’s wheat crop is now projected at 88.7 MMT, up from 86.5 MMT, helped by better yields and more planted area. With Russia accounting for 19.5% of global wheat exports and Argus rating crop conditions at 3.4/5, the larger crop adds another bearish input to a market already losing war premium.

Innovation Spotlight

  • Brazil — soybean biological nitrogen fixation: Embrapa’s soybean package combines Bradyrhizobium with co-inoculation using Azospirillum brasilense. The technology, launched in 2014 after more than 10 years of research, is now used on about 35% of Brazil’s soybean area and has been cited as enabling yields up to 6,000 kg/ha without synthetic nitrogen fertilizer. Reported gains include stronger root growth, better phosphorus and potassium use efficiency, more drought tolerance, roughly R$150 billion/year in savings, and 230-260 million tons of CO2e avoided per crop, while reducing dependence on nitrogen fertilizer, about 85% of which Brazil imports.
  • Brazil — scaling biologicals beyond large farms: Brazil was described as the global leader in biological use, but biologicals still represent only about 10% of chemical use, with available technologies seen as capable of lifting that share toward 40-50%. At Tecnoshow Comigo 2026, suppliers highlighted microbial products that optimize phosphate fertilization and can replace chemical fungicides and insecticides. Embrapa also said access remains skewed toward large-volume crops and is testing a cooperative biofactory model with Copavel in Paraná to reach small and medium growers.

  • U.S. — low-carbon grain incentives under 45Z: Discussants described 45Z as a tax credit paid to fuel producers, not directly to farmers, but said lower-carbon-intensity grain systems could still return roughly $10-20/acre through supply-chain payments. Practices cited as favorable for carbon-intensity scores included no-till or strip-till, manure in place of commercial fertilizer, split nutrient applications, and variable-rate fertilizer, with 2025-2027 crop data trackable through farm software.

Regional Developments

  • Brazil — protein exports stayed resilient, but concentration risk remains: Q1 2026 exports of beef, pork, and poultry rose 10% to 2.38 million tons, led by beef (+20% to 700,000 tons), pork (+15% to 330,000 tons), and poultry (+5% to 1.35 million tons). Exporters rerouted cargo through alternatives such as Suez feeders, the Strait of Magellan, and Saudi trucking, with some of the extra logistics cost shared with importers. The main risk is concentration: 68% of Brazil’s Q1 beef exports went to four countries, and China’s 1.1 million-ton quota carries a 12% tariff that rises to 55% above the quota.

  • Paraguay — pork capacity is expanding: Paraguay Pork plans to double sow numbers from 2,500 to 5,000 by end-2026 on the same site. The business linked sector growth to Taiwan’s market opening, additional slaughter capacity, and a planned US$50 million factory in Villeta, with export markets now including Taiwan and Russia.

  • Brazil / Mato Grosso — biomass sourcing is now an export issue: Canal Rural sources said a 2022 Mato Grosso rule allowing native-forest wood from authorized clearing to be used as industrial biomass conflicts with the Forest Code, which requires large biomass users consuming more than 24,000 m³/year to source from planted forests or management plans. The stakes are commercial as well as environmental: export buyers linked to the EU and U.S. may reject inputs tied to deforestation, even when legal. The same discussion pointed to 3.8 million hectares of degraded land in Mato Grosso that could support planted alternatives.

  • Brazil — financing and biofuel policy are shifting together: Rural delinquency reached a record 7.4% in February, and producer groups are pushing a debt-renegotiation bill they want advanced before the next crop cycle. At the same time, the energy ministry said Brazil will raise the ethanol blend in gasoline from 30% to 32% in the first half of 2026.

Best Practices

  • Use soil holding capacity to decide nitrogen timing: Ag PhD’s rule of thumb is to multiply cation exchange capacity (CEC) by 10 to estimate roughly how much nitrogen the soil can hold at one time. A full early-spring nitrogen application fits heavy soils, lower-rainfall areas, and moderate N demand; split applications fit light soils, higher-rainfall regions, and high-N-demand crops.

  • For soybean systems under nitrogen pressure, use biological stacking where it is validated: The Brazilian co-inoculation model pairs nitrogen-fixing bacteria with root-growth-promoting bacteria. The practical benefits reported were stronger rooting, better phosphorus and potassium uptake, and more drought tolerance, making it especially relevant when imported nitrogen is expensive or uncertain.

  • Build low-carbon grain programs around field practices that can be measured: The stack most consistently cited for stronger carbon-intensity scores was no-till/strip-till, manure, split nutrient applications, and variable-rate fertilizer. In current 45Z discussions, those are the practices most often linked to potential on-farm payments through fuel supply chains.

  • For livestock expansion, match capital with partner strengths instead of forcing full ownership: Oklahoma cattle operators described partnership structures as a practical way to scale: move small sets of cows to partners who have grass, hay, or labor, pay based on per-head-per-day operating cost, and begin only with a clear buyer and business plan. That discipline matters in a market where bred cows are around $5,000 and weaned calves $2,500-$3,000.
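The CEC rule of thumb in the first practice above reduces to simple arithmetic. Here is a sketch of it as a decision helper; the lb N/acre units are an assumption consistent with common U.S. extension usage, and the threshold logic and function names are illustrative, not Ag PhD's exact recommendation.

```python
def soil_n_holding_lb_per_acre(cec):
    """Rule of thumb: the soil can hold roughly CEC x 10 lb of nitrogen
    per acre at one time (units assumed, per common U.S. extension usage)."""
    return cec * 10

def n_application_plan(cec, planned_n_lb_per_acre):
    """Suggest a single early application when the soil can hold the full
    planned rate at once, otherwise split the nitrogen across passes."""
    if planned_n_lb_per_acre <= soil_n_holding_lb_per_acre(cec):
        return "single early application"
    return "split applications"

# A heavy soil (CEC 25) holds ~250 lb N/acre, so a 200 lb plan fits one pass;
# a light sandy soil (CEC 8) holds ~80 lb, so a 180 lb plan should be split.
heavy = n_application_plan(25, 200)
light = n_application_plan(8, 180)
```

This mirrors the brief's guidance: full early applications suit heavy soils and moderate nitrogen demand, while split applications suit light soils, higher rainfall, and high-demand crops.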

Input Markets

  • Global nitrogen remains the main input risk: Angie Setzer summarized that one-third of globally traded fertilizer moves through the Strait of Hormuz, only 21 fertilizer tankers had crossed since Feb. 28, and U.S. Gulf urea had jumped 35% to above $800/ton. Even under a ceasefire, more than 1,000 vessels are backlogged, insurance is up 300%, and chemical plants take weeks to recommission.

  • Hormuz is not back to normal: Canal Rural added that more than 200 ships were inside the strait but only 62-63 were actively moving, with oil and gas vessels getting priority. Fertilizer cargo remains exposed to freight and insurance pressure, and QatarEnergy has halted urea production under force majeure.

  • Brazil’s nitrogen market is exposed to Russia and China at the same time: Brazil imported 1.2 million tons of ammonium nitrate in 2025, nearly all from Russia, so a one-month Russian suspension would remove about 100,000 tons from the market. China is also directing urea to its domestic spring market until August. Brazilian analysts said this does not necessarily create outright shortage if Russian exports resume next month, but it does keep 2026 production costs and fertilizer prices elevated because gas, freight, fuel, and logistics are all tighter.

  • Bioinputs are moving from niche to budget line: Brazil still depends on imports for about 86% of fertilizer use, but bioinputs were described as already present on roughly 30% of the country’s fields. Market projections put bio-defensives at 25% of defensives spending by 2029-2030, up from 10% today.

  • Feed formulation is becoming a demand variable: Fermented feed in China was described as a rising share of industrial feed use, but market sources also flagged open questions around hog health, growth performance, and meat quality versus soybean-based diets.

Forward Outlook

  • The next immediate trigger is the April USDA report: Market Minute’s review of the last decade suggests the release has usually been a modest mover. Corn has averaged a move of about 4 cents after the report and exceeded 10 cents only once; soybeans have averaged 11 cents, or 7 cents excluding the 2022 war year.

  • The market is shifting from geopolitics back toward weather and acreage: U.S. planting has started, but Northern Plains snow and moisture point to a slower start and could shift acres toward later crops. Traders are watching $4.45-$4.50 in corn and $11.58 in soybeans as important technical markers.

  • Brazilian seasonal planning remains very regional: Current forecasts do not point to strong early frost in Paraná’s Catanduvas, but temperatures below 10°C remain a germination risk for safrinha corn in Paraná and southern Mato Grosso do Sul. In Bahia’s Sealba zone, growers were urged to use the next 15 days and roughly 70 mm of rain for planting before agricultural rains thin sharply in May and June.

  • Confidence is firmer than the cost backdrop: The Purdue/CME Ag Economy Barometer rose to 127 in March, and 65% of respondents said the U.S. is headed in the right direction, even as input costs remained a stated concern. In Brazil, rising delinquency is pushing restructuring efforts before the next planting cycle.

"Farmers used to manage costs; now they manage risks"

That line matches the present backdrop in livestock as well: rising costs are being described as an unpredictable risk problem, not a single-price problem.
