ZeroNoise Logo zeronoise

AI High Signal Digest

Active
Public Daily at 7:00 AM Agent time: 8:00 AM GMT+01:00 – Europe / London

by avergin 1 source

Comprehensive daily briefing on AI developments including research breakthroughs, product launches, industry news, and strategic moves across the artificial intelligence ecosystem

GrandCode Tops Codeforces as Specialist Agent Models Advance and Web Risks Rise
Apr 2
8 min read
638 docs
Apoorv Agrawal
Omar Sanseviero
Sakana AI
+33
GrandCode's live Codeforces wins, stronger specialist agent models such as Holo3 and GLM-5V-Turbo, and DeepMind's AI Agent Traps paper defined this cycle. The brief also covers new research on multi-agent systems and memory, fresh product releases, major industry moves, and policy-adjacent developments around sovereign AI and AI literacy.

Top Stories

Why it matters: The clearest signals this cycle were about agents getting better at coding and computer use, while the security risks around deploying them on the open web became much clearer.

GrandCode reached first place in live competitive programming

GrandCode ranked first in Codeforces Rounds 1087, 1088, and 1089, ahead of all human participants, including grandmasters . The system is a Qwen-based multi-agent reinforcement learning stack that coordinates modules for hypothesis generation, solving, test generation, and summarization, then improves them with post-training and online test-time RL . A comparison shared with the result shows how quickly this frontier has moved: OpenAI o3 was listed at 175th in April 2025, Gemini 3.1 Pro at 8th in February 2026, and GrandCode at 1st in March 2026 .

Impact: Competitive coding is increasingly becoming a live benchmark where agentic systems are posting top-tier results, not just strong offline demos .

Specialist agent models pushed deeper into computer use and visual coding

H Company launched Holo3, reporting 78.9% on OSWorld-Verified and claiming performance ahead of GPT-5.4 and Opus 4.6 at one-tenth the cost; weights are on Hugging Face and the API is live . A separate model summary describes Holo3 as a Qwen3.5-based 35B A3B model with Transformers support and a free license . Z.ai also released GLM-5V-Turbo, a vision-coding model that natively handles images, videos, design drafts, and document layouts, and can generate runnable code from screenshots and web interfaces . Z.ai says it leads benchmarks in multimodal coding, tool use, GUI agents, design-draft reconstruction, and visual code generation while keeping stable text-coding performance .

Impact: One analysis framed these releases as evidence that the agent stack is fragmenting into specialist layers for perception, planning, and execution rather than converging on a single general model .

DeepMind's AI Agent Traps paper reframed agent security

A new Google DeepMind paper introduces AI Agent Traps, a framework for adversarial content embedded in web pages and digital resources that targets autonomous agents . The taxonomy covers six attack classes, including hidden instructions in HTML/CSS and memory attacks such as RAG poisoning and latent memory corruption . The paper says hidden prompt injections can partially commandeer agents in up to 86% of scenarios, and latent memory poisoning can exceed 80% attack success with less than 0.1% data contamination .

The attack surface is no longer just the model. It is every web page, every retrieved document, every piece of content the agent ingests at inference time.

Impact: The security boundary for agents is shifting from model weights to the full environment they read and act on .

Research & Innovation

Why it matters: New research is focusing less on raw scale alone and more on how agents organize, remember, scale predictably, and act in the physical world.

Multi-agent design is getting both stronger and more constrained

One study found that self-organizing LLM agents spontaneously developed specialized roles and outperformed manually designed role assignments across 25,000 tasks with up to 256 agents . It reports a 14% edge for sequential coordination over centralized approaches, more than 5,000 organically generated roles, and open-source models reaching 95% of closed-source quality at lower cost . A separate MIT theoretical result pulls in the opposite direction: when agents only subdivide shared context and do not receive new exogenous signals, delegated multi-agent planning is decision-theoretically dominated by a centralized Bayes decision maker with the same information . That paper argues that splitting tasks across agents introduces irrecoverable information loss and that multi-agent setups help only when agents access genuinely different information sources .

Memory and context are becoming trainable subsystems

MemFactory proposes a unified framework that treats agent memory as a first-class, trainable component instead of separate storage, retrieval, and training systems . It adds modular memory components, native GRPO integration for RL-based memory policy tuning, and reports up to 14.8% relative gains over baselines . Separately, Baseten researchers built a 7M-parameter perceiver that compresses KV caches 8x while retaining 90%+ factual retention in a single forward pass, positioning it as an early step toward models that can extend working memory more efficiently .

Training science keeps getting tighter

The Delphi 1e23 run finished within 0.005 of a preregistered projected loss, even though the forecast was based on models more than 100x smaller at 3e20 FLOPs . Related posts said Marin's scaling laws extrapolated at least 100x, though the run still showed loss spikes and bending curves that the team says it is trying to fix . Liquid AI's LFM2.5-350M, trained on 28T tokens with scaled RL, reported large jumps over LFM2-350M in instruction following, data extraction, and tool use . A separate comment noted that this works out to roughly 100,000 tokens per parameter versus Chinchilla's cited optimum of 20 .

Robotics agents are getting better evaluation loops

CaP-X was released as an open-source framework and benchmark for coding agents in robotics, where agents write code for perception and control, execute it on simulated and real robots, then iteratively improve reliability . The release includes a toolkit across perception, control, and visualization, a 187-task CaP-Gym benchmark, and CaP-RL results where a 7B open model improved from 20% to 72% success after 50 training iterations, with transfer to real robots .

Products & Launches

Why it matters: The shipping layer is moving quickly: labs are turning model advances into tools for coding, documentation, storage, and media workflows.

  • Arcee AI released Trinity-Large-Thinking on the Arcee API with open weights on Hugging Face under Apache 2.0, aimed at developers and enterprises that want models they can inspect, post-train, host, distill, and own .
  • Claude Code added NO_FLICKER mode, an experimental terminal renderer that Anthropic says most internal users already prefer and that supports mouse events; it is enabled with CLAUDE_CODE_NO_FLICKER=1 claude. Claude Code is also available in the Claude mobile app on iOS and Android, with session handoff to the local CLI .
  • OpenAI Codex got a Linear plugin designed to keep the ticket and the work in sync .
  • Together AI open-sourced 12 agent skills for Claude Code and Codex so coding agents can use Together's SDK patterns, model IDs, and API calls without copying docs by hand .
  • LangChain embedded Chat LangChain directly in its docs, grounding answers in the full docs, knowledge base, and open-source code . Hugging Face also launched Storage Buckets for Spaces, letting teams mount persistent storage volumes directly inside Spaces .
  • WAN 2.7-Image is now available on fal with features including realistic faces, color-palette extraction, multilingual text rendering, and interactive editing .

Industry Moves

Why it matters: Capital, partnerships, and infrastructure constraints are shaping where AI can actually scale.

  • The Information reported that OpenRouter is raising $120M at a $1.3B valuation led by CapitalG; the company gives developers access to 300+ models through one API and is reportedly already at $50M+ ARR .
  • A Business Insider report said OpenAI's Stagecraft project uses 3,000-4,000 freelancers, paid at least $50 per hour, to create ChatGPT training materials across 439 occupations ranging from commercial pilots to HR specialists . The stated goal is to "map economically relevant tasks and evaluate the model's capabilities," and the work runs through Handshake AI .
  • Cohere expanded its partnership with EnsembleHP to build what it describes as the healthcare industry's first revenue-cycle-management-native LLM, purpose-built for complex financial workflows in healthcare operations .
  • A Bloomberg-linked note said half of US data centers planned for 2026 are expected to be delayed or canceled because of shortages in transformers, switchgear, and batteries, while US manufacturing capacity remains insufficient and imports are needed . A separate macro view of the AI stack argued that after $350B in revenue growth, semis still capture 79% of profits, infra 14%, and apps 7% .

Policy & Regulation

Why it matters: The policy-adjacent updates this cycle were less about new laws and more about sovereign AI cooperation, public trust, and AI literacy.

  • Sakana AI signed an MoU with French Current AI, with the French AI Ambassador signing on France's behalf during a visit to Sakana AI . The agreement covers international cooperation on the AI stack and contributions to the Global South, with the stated goal of helping establish a sovereign AI ecosystem alongside France and other partner countries .
  • Anthropic published a report covering 80,508 Claude users across 159 countries and 70 languages on what people want from AI, what they have already gotten from it, and what they fear .
  • Google Research expanded AI Quests, a gamified AI literacy experience built with the Stanford Accelerator for Learning, to eight additional languages including Spanish and Malay .

Quick Takes

Why it matters: These smaller signals help track where evaluation, retrieval, and developer tooling are moving next.

  • Arena added Pareto frontier charts across Text, Vision, Search, Document, and Code leaderboards to show performance versus blended token price; on the current Text frontier, Google DeepMind had five models, with xAI and DeepSeek at two each .
  • Kaggle introduced Standardized Agent Exams so agents can register for an exam, solve it, and join a leaderboard .
  • YC-Bench was introduced as a benchmark for whether an agent can run a simulated startup over a one-year horizon spanning hundreds of turns .
  • Tinker added longer context windows for select models: 128k for Kimi K2.5 and GPT-OSS-120B, and 256k for Nemotron 3 Super 120B and Qwen3.5 397B .
  • Qdrant reported that adding hard negatives to sparse-embedding training improved search relevance by 28% over BM25 on real benchmarks . In a follow-up on specialization versus generalization, it reported 28% in-domain gains and 8-10% cross-domain gains, but failure out of domain because of overfitting .
  • SkyPilot added native support for VAST Data storage so AI workloads can mount large datasets directly instead of waiting for data copying before training starts .
OpenAI’s $122B Raise, Anthropic’s Leak, and a Benchmark Reset for Multimodal AI
Apr 1
9 min read
687 docs
Chaofan Shou
Xiuyu Li
Artificial Analysis
+44
This brief covers OpenAI’s massive financing and platform push, the Claude Code leak and what it revealed about proactive agents, Stanford’s challenge to multimodal benchmarks, and key launches across video, spreadsheets, and enterprise copilots.

Top Stories

Why it matters: This cycle was defined by capital concentration, a rare agent-code leak, a challenge to multimodal benchmark validity, and stronger evidence that useful AI can run much closer to the edge.

OpenAI paired massive financing with a broader product ambition

OpenAI said it closed its latest funding round with $122 billion in committed capital at an $852B post-money valuation. The company said the funding gives it resources to lead at scale and expand AI's benefits by putting useful intelligence in people's hands early . Separate posts interpreting the announcement framed the next phase as consolidation of ChatGPT, Codex, browsing, and agents into a single AI superapp. Widely shared posts also cited steep commercialization progress, including $1B within a year of ChatGPT, $1B per quarter by end-2024, and $2B per month now.

Impact: OpenAI is pairing balance-sheet scale with a platform strategy, raising the competitive bar on both infrastructure and distribution.

The Claude Code leak exposed Anthropic's proactive-agent design

Multiple posts said Claude Code source code leaked through an npm source map . Reviews of the leaked code described KAIROS as an always-on proactive mode behind internal feature flags, with heartbeat prompts, push notifications, file delivery, pull-request subscriptions, append-only daily logs, and nightly memory consolidation via autoDream. Posts reviewing the leak also said the code referenced unreleased Anthropic model names and variants including Mythos/Capybara, Opus 4.7, and Sonnet 4.8. Anthropic then sent DMCA requests against repositories carrying the leaked code , and an official statement on the leak was reported .

"every few seconds, KAIROS gets a heartbeat. basically a prompt that says 'anything worth doing right now?'"

Impact: The leak offered a rare view into how frontier coding agents may move from reactive copilots toward background autonomy, while also highlighting the security and IP fragility of agent products.

Stanford's MIRAGE result challenged multimodal evaluation

A widely shared summary of Stanford's MIRAGE paper, co-authored by Fei-Fei Li, said leading vision-language models still scored 70-80% on six major vision benchmarks even after images were silently removed . The same summary said a 3B text-only super-guesser trained on text from chest X-ray questions ranked #1 on held-out tests, beating VLMs and radiologists . A cleanup method called B-Clean reportedly removed 74-77% of questions from existing vision benchmarks because they did not truly test vision .

Impact: If these reported results hold up, current multimodal leaderboards may be overstating visual understanding and understating shortcut exploitation—especially in medical settings .

PrismML pushed 1-bit local models into the spotlight

PrismML emerged from stealth arguing that the next AI gains will come from intelligence density rather than only parameter count . Its 1-bit Bonsai 8B model fits in 1.15GB of memory and is described as 14x smaller, 8x faster, 5x more energy efficient, and over 10x the intelligence density of its full-precision counterparts, while remaining competitive in its class; Bonsai 8B, 4B, and 1.7B were open-sourced under Apache 2.0 . PrismML says this should enable on-device agents, real-time robotics, and offline intelligence. A follow-up post said the 1-bit Bonsai family shifts the Pareto frontier of intelligence vs. size dramatically to the left , and a demo showed Bonsai 8B running locally on an M4 Pro with much lower memory use and higher throughput than a standard 16-bit 8B model .

Impact: Small local models are starting to look less like a fallback and more like a distinct product and infrastructure strategy.

Research & Innovation

Why it matters: The most interesting technical work this cycle focused on better reasoning training, longer-lived agent memory, smaller useful models, and more reliable evaluation.

  • OpenAI on Erdős problems: OpenAI researchers said an internal model found short and elegant proofs for three further open problems due to Erdős, with the paper posted on arXiv . A separate OpenAI executive post framed the broader trend as AI solving more open problems while producing more elegant proofs as models improve .
  • Token-level RL credit assignment: Qwen Pilot introduced FIPO, which uses a GAE-style Future KL signal to assign credit to individual tokens during reasoning. The claim is that, unlike GRPO, it can reinforce helpful tokens and suppress derailing ones, producing longer and more accurate chains beyond 10k tokens with strong gains on AIME24.
  • Long-term memory for agents:GAAMA proposes a hierarchical memory system that combines RAG with knowledge graphs. The reported result is 78.9% mean reward on LoCoMo-10, outperforming HippoRAG and tuned RAG baselines . The core claim is that graph-augmented retrieval plus higher-order reflections improves multi-session recall .
  • Useful small models kept improving: Liquid AI released LFM2.5-350M, a 350M-parameter model aimed at agentic loops, reliable data extraction, and tool use . It was trained on 28T tokens with scaled RL , with reported gains from LFM2-350M in instruction following (18.20 → 40.69), data extraction (11.67 → 32.45), and tool use (22.95 → 44.11) . Quantized size is under 500MB, making it usable in constrained environments .
  • GPU kernel scheduling got more automated: Modular said it built a constraint solver in Mojo that automatically derives pipeline schedules for GPU kernels, tackling the complexity of FA4 on Blackwell with 14 ops, 5 hardware units, and 28 dependency edges. The reported outcome is simpler kernels, race conditions defined away, and more portable intra-kernel composition while keeping full hardware control .
  • Benchmark methodology is getting more careful: Google Research announced a new framework for improving benchmark reproducibility by optimizing the ratio of items to human raters per item, with the goal of better capturing human disagreement in subjective tasks .

Products & Launches

Why it matters: Vendors are turning multi-model orchestration, cheaper video generation, spreadsheet workflows, and agent interfaces into products people can actually use.

  • Microsoft pushed multi-model workflows into M365 Copilot:Council lets users run multiple models on the same prompt to compare where they align and diverge . Critique is a new multi-model deep research system that Microsoft says uses multiple models together to generate better responses and reports, with a feedback loop aimed at improving factual accuracy, analytical breadth, and presentation .
  • Veo 3.1 Lite widened access to video generation: Google made Veo 3.1 Lite available in the Gemini API and Google AI Studio for rapid prototyping and high-volume generation at $0.05/sec, or half the cost of Veo 3.1 Fast . It supports text-to-video and image-to-video, 16:9 and 9:16 output, and 4s, 6s, and 8s clips . Fal.ai also put Veo 3.1 Lite live with first-last-frame-to-video and both 720p and 1080p options .
  • OpenAI expanded practical workflow surfaces:ChatGPT for Excel is now available worldwide except EU consumer plans . Separately, the GitHub plugin in the Codex app can review issues, address feedback, commit changes, and open pull requests .
  • Google AI Studio added music tooling:Music Playground, powered by Lyria 3, launched with a Composer Mode that lets users describe music, hear it, then export the result to code and build from it .
  • Agent interfaces kept broadening: Perceptron launched an MCP server that gives agents stronger vision via Isaac at lower cost than general-purpose multimodal models . In open-source tooling, a new Hermes Agent PR added computer use on a real Mac from a phone, with no sandbox and real-time control over desktop apps .

Industry Moves

Why it matters: Companies are reorganizing around agents, security, and open-model infrastructure rather than treating AI as an isolated feature.

  • OpenAI broadened its infrastructure posture: A reported partnership with Amazon would build infrastructure for AI agents on AWS, signaling a wider cloud posture around deployment .
  • Microsoft formalized its OpenClaw bet: Omar Shahine said he joined Microsoft to bring OpenClaw + personal agents to Microsoft 365, with a goal of proactive workplace assistants that take on tasks end-to-end; he also said a fully integrated Teams plugin is already deployed .
  • Perplexity moved into security research: The company launched the Secure Intelligence Institute, led by Purdue's Dr. Ninghui Li, to work with top cryptography, security, and ML teams . Its first paper responds to NIST's request for information on securing autonomous agents .
  • Open-model enterprise adoption kept strengthening: Hugging Face CEO Clement Delangue said companies including Pinterest, Airbnb, Notion, Cursor, and Intercom are finding it better, cheaper, faster to use and train open models in-house for many tasks . Hugging Face also released TRL v1 with 75+ post-training methods including SFT, DPO, GRPO, and async RL .
  • QodoAI raised more capital for AI coding infrastructure: QodoAI announced a $70M raise, with the company arguing that software development has fundamentally changed but that enterprise-grade transformation is still early .
  • Gemma's ecosystem scale kept growing: Two years after launch, Google's Gemma family of open models reached 400M downloads and 100,000 variants.

Policy & Regulation

Why it matters: Formal regulation remains uneven, but the policy surface is expanding through safety partnerships, legislative proposals, legal enforcement, and geopolitical risk.

  • Australia and Anthropic signed a safety MOU: Anthropic said it signed an MOU with the Australian Government to collaborate on AI safety research and support Australia's National AI Plan.
  • US debate over AI rules intensified: Sen. Bernie Sanders said 74% of Americans believe the government is not doing enough to regulate AI and pointed to his proposed moratorium bill as a way to address AI risks and broaden who benefits . Separately, Andrew Ng said he supports the White House's proposed national AI legislative framework with federal preemption to avoid a patchwork of state-level restrictions .
  • Anthropic's leak response turned legal: After the Claude Code leak, Anthropic sent DMCA requests to shut down repositories hosting the source code .
  • Geopolitical risk to AI infrastructure rose: A cited post reported that the IRGC accused American AI companies of being 'the primary element in designing and tracking assassination targets' and threatened to treat them as 'legitimate targets' . Another post interpreted that as a threat to data centers .

Quick Takes

Why it matters: These smaller signals help track where capability, adoption, and risk are moving next.

  • KAT-Coder-Pro V2 reached 44 on the Artificial Analysis Intelligence Index, matching Claude Sonnet 4.6 among non-reasoning models. Reported strengths were 49% on Terminal-Bench Hard, about 109 output tokens/sec, and $73 benchmark cost; reported weaknesses were long-context reasoning and knowledge regressions versus V1 .
  • IBM Granite 4.0-3B-Vision launched as a document-focused VLM with state-of-the-art performance for its size on tables and charts, compatibility with Transformers and vLLM, and a free license .
  • Qdrant Agent Skills positions vector search as structured, composable retrieval for agents. Qdrant's reported comparison showed 96% vs 65% pass rate, 1.8x faster execution, 13% fewer tokens, and 3x more consistency with Skills enabled .
  • OpenRouter's Model Fusion combines outputs from multiple models into one answer; OpenRouter said every Deep Research agent preferred the fused response over its own in testing, and the feature does not require a subscription .
  • LangChain added more operational guidance for teams putting agents into production, including a free course on monitoring production agents and a trace-centered agent improvement loop guide built around costs, latency, evals, prompt injection, and PII leakage .
  • Arena rankings kept shifting:Claude Opus 4.6 stayed on top of Text Arena, while Gemini-3.1 Pro, GPT-5.4 High, and Grok-4.20 (Reasoning) entered the top 10 . Grok-4.20 also landed #3 in Medicine & Healthcare and #6 across Expert Prompts, Math, and Legal & Government slices .
  • Security risk in the AI developer stack stayed elevated: A security roundup said TeamPCP poisoned tools including LiteLLM, the axios npm incident gave attackers remote control on affected machines, and AI-software pace may be amplifying classic supply-chain failures and human error .
Claude Code Expands, Qwen3.5-Omni Ships, and Harness Engineering Takes Center Stage
Mar 31
9 min read
643 docs
Stephanie Palazzolo
elvis
Jason Weston
+50
The biggest developments were a more capable Claude Code, Alibaba's Qwen3.5-Omni release, and a growing body of evidence that harness design is becoming a core performance lever. This brief also covers measurable enterprise ROI, faster local AI stacks, new research papers, funding and strategy moves, and governance-related updates.

Top Stories

Why it matters: This cycle's biggest signals were about agent execution: models are getting better at acting on software, multimodal systems are widening the interface, and performance is increasingly coming from the harness around the model as much as the model itself.

Claude Code moved closer to a full software-testing loop

Anthropic added Computer use to Claude Code, letting Claude open apps, click through interfaces, and test what it built directly from the CLI; the feature is in research preview on Pro and Max plans . At the same time, Claude Code and Code Review added GitHub Enterprise Server support for async workflows on self-hosted repos . Anthropic staff also said they open sourced a plugin so Claude Code users can call Codex from a ChatGPT subscription for reviews, adversarial reviews, and rescue flows .

Impact: this is a step from code generation toward a tighter write-build-run-verify loop, and it makes Claude Code easier to use inside enterprise GitHub setups .

Qwen3.5-Omni pushed multimodal interaction further into the product layer

Alibaba released Qwen3.5-Omni, a model for text, image, audio, and video understanding with real-time interaction features including semantic interruption, built-in web search, and complex function calling . Alibaba highlighted script-level captioning, support for up to 10 hours of audio or 400 seconds of 720p video, 113 speech-recognition languages, and 36 output languages, plus an "Audio-Visual Vibe Coding" workflow that turns camera-described ideas into a website or game . The company also said the model is open access via Hugging Face, with the caveat that "omni" here refers to interpreting image and voice, not generating them .

Impact: Alibaba is packaging multimodal reasoning, voice interaction, and tool use into a surface that looks closer to a general-purpose AI application platform.

Harness engineering is turning into a primary performance lever

Several results this cycle pointed in the same direction: the system around the model matters more than many teams assumed. Meta-Harness said prompt/tool/retry/context choices alone can create a 6x performance gap on the same model, and that harness deltas are now wider than frontier-model deltas . In Matt Maher's 100-feature PRD benchmark, a post said Cursor improved model performance by 11% on average, including Opus from 77% to 93%. CMU's CAID paper reported +26.7 points on PaperBench and +14.3 points on Commit0 over single-agent baselines by coordinating isolated git worktrees and explicit integration via git .

"The delta between harness implementations on the same model is not. That's where the leverage is."

Impact: performance gains are increasingly coming from coordination, evaluation loops, and tool design, not only from bigger base models.

Enterprise deployments are producing measurable ROI

Two deployment examples stood out for hard numbers. Novo Nordisk is using AI agents built on Anthropic and OpenAI models to detect trial risks, automate site selection, and flag process redundancies, shaving weeks to months off clinical trials and potentially accelerating time-to-market by hundreds of millions of dollars. Separately, a Shopify case study said the company cut annual AI deployment costs from $5.5M to $73K by decomposing business logic, modeling intent with DSPy, and optimizing a smaller model while maintaining performance; the cited scale-up estimate cut 150,000-shop coverage from $41M to $73K.

"The juice is clearly worth the squeeze."

Impact: the strongest enterprise signal in the notes was not hype but faster trials, lower operating cost, and maintained performance.

Local AI stacks got faster and more usable

Ollama said it now runs fastest on Apple silicon through MLX, Apple's machine-learning framework . Its preview release also added NVFP4 support, cache reuse across conversations, intelligent checkpoints, and smarter eviction, with a Mac-oriented acceleration path for Qwen3.5-35B-A3B on systems with more than 32GB of unified memory . In parallel, llama.cpp reached 100k GitHub stars, and its creator said local agentic workflows are now practical because tool calling and local models have improved enough to support tasks like search, email, summarization, and home automation .

Impact: the local AI stack is getting closer to real everyday agent use on consumer hardware, especially on Macs.

Research & Innovation

Why it matters: Research this cycle focused less on raw scale and more on leverage: better long-context handling, stronger multimodal designs, cheaper training, and harder benchmarks.

  • Massive-context agents without giant context windows: one paper places very large text corpora into directory structures and lets off-the-shelf coding agents navigate them with shell commands and Python instead of stuffing everything into the context window. The reported results were 88.5% on BrowseComp-Plus versus 80% best published, 33.7% on Oolong-Real versus 24.1%, and operation up to 3 trillion tokens. Paper: https://arxiv.org/abs/2603.20432.

  • LongCat-Next: a new multimodal model was presented as "lexicalizing modalities as discrete tokens," with claims that it matches or beats SOTA across multimodal benchmarks, delivers SOTA audio on both recognition and TTS accuracy, and adds vision/audio without hurting core language performance . Resources: paper, GitHub, Hugging Face.

  • daVinci-LLM: this pretraining paper was summarized as matching larger-model performance with half the size, adding 23 points on MATH, and arguing that data quality can matter more than dataset scale . Resources: paper, repo.

  • Reasoning and optimization:ParaGator trains candidate generation and aggregation end-to-end for parallel reasoning, using pass@k for generation and pass@1 for aggregation, with the stated goal of avoiding mode collapse and improving math/scientific reasoning . On the systems side, Gram Newton-Schulz was introduced as a drop-in replacement for Newton-Schulz in Muon, with up to 2x faster performance while preserving validation perplexity within 0.01.

  • Benchmarks remain hard:PRBench introduced 30 expert-curated paper-reproduction tasks across 11 physics subfields, and the cited result was stark: all agents showed zero end-to-end callback success. Tau Bench added a banking domain with 698 documents across 21 product categories; best models were cited at 25% task success and under 10% on pass@4 .

Products & Launches

Why it matters: Product work moved toward usable systems: better voice models, more local tooling, and clearer paths from research models to daily workflows.

  • Voice products improved at both ends of the stack. OpenAI said gpt-realtime-1.5 improves instruction following, tool calling, and multilingual accuracy in the Realtime API, while a new OpenAI developer post summarized Perplexity's lessons from running voice agents in production around context, audio pipelines, and turn-taking . Separately, Cohere Transcribe launched as a 2B-parameter open-weights speech-to-text model with 4.7% AA-WER, roughly 60x real-time transcription, training from scratch on 14 languages, and availability both through Cohere's API and on Hugging Face under Apache 2.0 .

  • Local agent tooling kept expanding.ARC (Agent Remote Control) introduced a browser-based remote monitor for local agents, with real-time tool-call visibility, approvals, messaging, native Hermes Agent integration, open source distribution, and end-to-end encryption . AutoClaw launched as a way to run OpenClaw locally with no API key, support for any model or GLM-5-Turbo, and fully local data handling . litesearch packaged a fully local document-ingestion and retrieval stack for agents like Claude Code, using LiteParse, local embeddings, local Qdrant storage, and CLI-native search .

  • Security-conscious agent wrappers are becoming their own category.PokeeClaw positioned itself as an enterprise-secure alternative to OpenClaw, with a secure sandbox architecture, isolated environments, approval workflows, role-based access control, audit trails, and lower token usage .

  • Composable agent skills are spreading.Base44 added 130+ built-in "Superagent Skills" across marketing, operations, data analysis, design, content, coding, and research, with custom skills created from natural-language descriptions and reusable across workflows .

Industry Moves

Why it matters: Corporate signals this cycle were about who owns the agent operating layer, who controls deployment, and where new capital is going.

  • SycamoreLabs launched as a "trusted agent OS for the enterprise" with a $65M seed led by Coatue and Lightspeed, alongside AbstractVC, Dell Technologies Capital, 8VC, Fellows Fund, e14 Fund, and angel investors .

  • Figure AI described its breakup with OpenAI in unusually direct terms. CEO Brett Adcock said Figure got "no value" from the relationship beyond early fundraising, said Figure's internal team outperformed OpenAI's daily, and said the real break came when OpenAI planned to restart robotics, which would have turned Figure's work into competitor training . Figure has since built its own vision-language-action model, Helix, and the cited post said the company is valued at $39B.

  • Anthropic's growth is creating infrastructure strain. A cited report described the company's success as sparking a server crunch.

  • Hugging Face is explicitly pushing a builder strategy. Clement Delangue said the goal is to help "millions" build AI themselves rather than remain API users, and pointed to hf-autoresearch as an example of agent collaboration around checkpoints, datasets, papers, and Hub workflows .

  • Internal agent deployments are becoming business functions. A post about LangChain said its internal GTM agent drove 250% more lead conversions, using Deep Agents for orchestration, multiple data sources for context, and Slack for approvals . A separate build log said a similar agent was rebuilt on DeeplineCLI + Deep Agents in under an hour with roughly 200 lines of config .

Policy & Regulation

Why it matters: The notes were light on formal government action, but governance questions around data consent, auditing, and safety evaluation were prominent.

  • GitHub Copilot training consent: a widely shared warning said GitHub had opted users into training its models on their code by default, including paying customers, and pointed users to Settings > Privacy to disable it .

  • Governance proposals are getting more concrete: Will MacAskill and Fin Moorhouse proposed eight projects aimed at improving the transition to superintelligence, including independent evaluation of AI character traits, benchmarking strategic and philosophical reasoning, auditing models for sabotage and backdoors, and building monitoring and verification tools for collective coordination .

  • Safety debate stayed active: Boaz Barak published a new post titled the state of AI safety in four fake graphs, which Sam Altman publicly endorsed as "a very good post" .

Quick Takes

Why it matters: These smaller items help fill in the operating picture around models, agent frameworks, and supporting infrastructure.

  • Qwen 3.6 Plus Preview went live on OpenRouter for a limited free period; Alibaba asked for feedback and noted prompts/completions may be collected during the preview .
  • Codex auto compaction was reported to improve long-session coherence, with one user saying Codex remembers tiny details across multiple rounds of compaction .
  • Hermes Agent added Multi Agent Profiles, giving independent bots separate memory, gateway connections, skills, and chat histories .
  • A new BOOT.md hook in Hermes lets agents save state before restarts and resume with what one post described as zero context loss .
  • OpenAI's Codex App Server is fully open source, includes sign in with ChatGPT, and powers Codex integrations in products like the Codex app and external tools such as JetBrains and T3 Code .
  • PixVerse V6 launched on fal.ai with text-to-video, image-to-video, transition, and extend endpoints, while PixVerse separately promoted V6 as offering more control, better performance, and 15-second 1080p audiovisual generation .
  • LisanBench launched a live benchmark site with leaderboard visualizations, and its creator said a meta leaderboard is next .
  • Triton-Ascend is now public, giving Huawei Ascend hardware a Triton kernel programming model that commenters said could help frameworks like sglang and vLLM run on Ascend without learning AscendC .
  • Gemini Live is now powered by Gemini 3.1 Flash Live.
AI Research Agents Reach Nature as World Models and Hidden Costs Move Up the Agenda
Mar 30
7 min read
522 docs
vitrupo
Cheng Lou
Sakana AI
+24
Sakana AI’s Nature publication made automated research the biggest milestone in this cycle, while world-model work drew both fresh capital and sharper evaluation. The brief also covers hidden cost reversals in reasoning models, new scientific systems, and launches in speech, translation, and agent tooling.

Top Stories

Why it matters: This cycle's biggest signals were about credibility, not just capability: automated research reached a new publication milestone, world models drew both capital and new benchmarks, real deployment costs got harder to read from list prices, and scientific AI kept moving deeper into domain work.

Automated AI research crossed a new credibility threshold

Sakana AI published The AI Scientist: Towards Fully Automated AI Research in Nature, describing a system that can invent ideas, write code, run experiments, and draft papers across the full machine-learning research lifecycle . Sakana says AI Scientist-v2 produced the first fully AI-generated paper to pass a rigorous human peer-review process, and the paper introduces an Automated Reviewer that matches human review judgments and exceeds standard inter-human agreement . The paper is open access and builds on Sakana's earlier open-source releases . Sakana also reports a scaling law of science: stronger foundation models and more inference compute lead to higher-quality AI-generated papers .

Impact: Automated research is moving from a provocative demo toward a benchmarked, publishable systems category.

World models are attracting both capital and new measurement

The notes cite a TechCrunch report that Yann LeCun's AMI Labs raised $1.03 billion to build world models . On the research side, LeWorldModel is described as a stable end-to-end JEPA from pixels that cuts tunable hyperparameters by 83% and plans up to 48x faster than foundation-model-based alternatives . On the evaluation side, World Reasoning Arena is presented as a benchmark that exposes a substantial gap between current world models and human-level hypothetical reasoning .

Impact: Money, architectures, and evaluation are converging around the same question: how to build models that can reason about the world, not just respond to prompts.

Reasoning-model pricing is less transparent than list prices suggest

A new paper summary reports that 21.8% of model-pair comparisons across eight frontier reasoning models and nine tasks show a pricing reversal, where the model advertised as cheaper turns out to cost more in practice; the gap reaches as high as 28x. In one cited example, Gemini 3 Flash was listed 78% cheaper than GPT-5.2 but wound up 22% more expensive on actual workload cost; Claude Opus 4.6 was listed at 2x Gemini 3.1 Pro but actually cost 35% less. The cited cause is 'thinking token heterogeneity': one model can use 900% more thinking tokens than another on the same query . The paper's recommendation is practical: benchmark real workload costs, not posted prices .

Impact: Model selection is increasingly a systems and finance problem, not just a benchmark-ranking problem.

Scientific AI kept moving into domain-specific systems

Intern-S1-Pro is described as a 1 trillion-parameter scientific multimodal foundation model covering more than 100 tasks across chemistry, biology, and earth sciences, while also performing strongly on general and domain benchmarks . Separate work on automated near-term quantum algorithm discovery says an LLM-powered system reached chemical precision for LiH, H2O, and F2 while reducing circuit evaluations and gate counts by orders of magnitude .

Impact: Labs are not only pursuing broader assistants; they are also aiming AI directly at high-value scientific workflows.

Research & Innovation

Why it matters: The strongest papers in the notes pushed on long-horizon agents, scientific models, safety evaluation, and representation learning.

  • Composer 2 uses a two-phase training setup—continued pretraining plus large-scale reinforcement learning—to improve long-horizon planning and coding, and is reported as state of the art on SWE-bench Multilingual and Terminal-Bench .
  • AIRA2 is presented as Meta's answer to bottlenecks in AI research agents, with state-of-the-art performance on MLE-bench-30 .
  • Natural-Language Agent Harnesses move controller logic into portable natural-language artifacts executed by an Intelligent Harness Runtime, with cited viability on coding and computer-use benchmarks .
  • Claudini uses LLM autoresearch to discover stronger jailbreaks, reaching 40% success on CBRN queries versus prior methods below 10%, with 100% transfer to Meta-SecAlign-70B .
  • Bootleg predicts hidden-layer representations for self-supervised learning and reports 76.7% ImageNet-1K with ViT-B, plus large gains on iNaturalist and segmentation benchmarks .
  • A separate paper argues self-distillation can degrade reasoning, with drops up to 40% across Qwen, DeepSeek-Distill, and Olmo models .

Products & Launches

Why it matters: Product work in the notes focused on getting AI into everyday interfaces and production stacks: speech, translation, inline UI, and broader hardware support.

  • Voxtral: Mistral's TTS model turns about three seconds of reference audio into expressive multilingual speech by separating semantic tokens from acoustic tokens. The cited release says it supports 9 languages, works best with roughly 3–25 seconds of audio, and posts a 68.4% win rate versus ElevenLabs Flash v2.5 in voice cloning; paper and weights are available .
  • Google Live Translate: the notes say Google's new Live Translate works with any headphones across 70+ languages, while the cited Apple alternative requires specific hardware, newer iPhones, iOS 26+, and Apple Intelligence .
  • Claude inline rendering: Claude can now render arbitrary HTML/JS/CSS inline, a step toward chat interfaces that can output working UI instead of only text .
  • vLLM-Omni v0.18.0: the release adds production TTS/Omni serving for Qwen3-TTS, Qwen3-Omni, Fish Speech S2 Pro, and Voxtral TTS, plus a refactored diffusion runtime and a unified quantization framework .
  • Suno now lets users make music with their own voice .
  • AI Toolkit is now working on Apple Silicon on a mac_support branch, pending more testing and cleanup before merge .
  • Google Gemma now has a dedicated GitHub organization with a cookbook for inference and fine-tuning recipes .

Industry Moves

Why it matters: Capital and expansion decisions this cycle point to where companies expect durable value: world models, AI-native software, and infrastructure hiring.

  • The notes cite a TechCrunch report that AMI Labs raised $1.03 billion to build world models .
  • Swyx said Redpoint published a ranked list of SaaS businesses to rebuild with AI, and highlighted survey data suggesting 46% of enterprise CIOs are open to AI-native startups over incumbents .
  • Modular officially opened its Edinburgh expansion at the Bayes Centre and says it is hiring rapidly; Chris Lattner said he plans to visit on April 15/16.

Policy & Regulation

Why it matters: The policy-relevant material in this cycle centered more on preparedness and state use of AI than on new formal rules.

Europe's competitiveness debate sharpened

A slide deck highlighted by John Myers argues European policymakers need to prepare their economies to benefit from AI advances or risk being left behind . The cited economic warning says that if AI becomes a gross substitute for human labor, labor's share of GDP may shrink and developed-country GDP per capita may diverge more sharply .

A US lawmaker described AI-driven protest identification at scale

Rep. Clay Higgins said authorities collected millions of digital images and billions of identifying data points from 'No Kings' rallies, including height, weight, shoe size, tattoos, and gait, for AI processing . Blanche Minerva responded that AI developers have a 'moral imperative' not to build or support models for such purposes .

Quick Takes

Why it matters: These smaller items fill in the operational picture around agents, infrastructure, benchmarks, and real-world deployment.

  • Jeff Dean said AI tools built for human-speed workflows will cap agent gains: even if models become infinitely fast, overall improvement could still be only 2–3x unless the surrounding tools are redesigned .
  • Dean also said there is still major data headroom in video, audio, robotics, autonomous-vehicle, and synthetic data.
  • Open agent-trace infrastructure is growing: the Agent Data Protocol dataset already unifies 3M+ trajectories in one format, and contributors say it could potentially triple in size .
  • Kai Stephens released an agent-trace-prompt-bank built from 20+ open prompt datasets, said it has already been used with GLM-5 in hermes-agent to gather about 120 million tokens, and separately uploaded about 4,000 GLM-5 hermes-agent traces to Hugging Face .
  • MLB is now using Sony's Hawk-Eye system for final ball-strike rulings, the first time a human umpire's call is not final; the system is described as accurate to a sixth of an inch, and 69% of fans reportedly prefer the AI system .
  • DeepSeek Web suffered more than five hours of outage while API V3.2 remained functional; separate posts said the web/app model now consistently identifies itself as V3.
  • A user with little frontend experience said Claude Code helped build a UI demo using Pretext in 20 minutes, while Pretext itself is described as a pure-TypeScript text-measurement system for laying out pages without CSS reflow .
Coding Agents Harden as Security Demos and Reasoning Gains Accelerate
Mar 29
8 min read
489 docs
sankalp
clem 🤗
Agentica
+32
This brief covers the hardening of coding-agent infrastructure, Anthropic's reported zero-day demo, fast-moving reasoning benchmarks, and new research on efficient post-training, inference, and agent architectures. It also highlights enterprise governance pressures as autonomous systems spread.

Top Stories

Why it matters: The notes point to AI moving deeper into enterprise software, closer to real security work, and further up the reasoning curve, while cost and supply constraints become harder to ignore .

Coding agents are becoming enterprise infrastructure

Posts this cycle said OpenAI is acquiring Astral, the team behind the Python tools uv, Ruff, and ty, to deepen the Codex ecosystem . At the same time, Cursor moved self-hosted cloud agents into general availability so code and tool execution can stay inside enterprise infrastructure while Cursor manages orchestration and inference . OpenAI also said Codex Security remains free during preview, has seen steadily increasing adoption, and is already being used by thousands of organizations to identify hundreds of thousands of security issues .

Impact: These are signs that coding agents are being built out as infrastructure and security workflows, not just chat-based coding assistants .

Claude’s security demo showed how far autonomous vulnerability work has moved

A post describing a live Anthropic conference demo said Claude found a zero-day in Ghost, described there as a 50,000-star GitHub project with no prior critical vulnerabilities, by identifying a blind SQL injection in 90 minutes and exfiltrating the admin API key . The same post said Claude then repeated the exploit pattern on the Linux kernel .

"Both exciting and terrifying"

Impact: The notes show frontier models moving beyond code generation into vulnerability discovery and exploitation workflows, with obvious upside for security teams and equally obvious dual-use risk .

Frontier reasoning benchmarks keep climbing

Posts this cycle said GPT-5.4 reached 95% on USAMO 2025, while another post said GPT-5.4 xhigh scored 95% on USAMO 2026, alongside claims of a sharp year-over-year jump in model performance on the competition . Separately, a model on Arena under the name significant-otter identified itself as Gemma 4 from Google DeepMind, with a reported lineup of 2B, 4B, and 120B15A models .

Impact: The combination of stronger benchmark claims and near-release signals suggests frontier labs are still pushing both raw capability and release cadence .

Token economics are becoming a first-order constraint

Mustafa Suleyman said the next few years of AI will be defined by demand far outstripping token supply, making margin to pay for tokens a key competitive factor . That matches reports from engineers who say companies are already spending more than $1,000 per day on Claude Code or Codex tokens . In parallel, multiple companies including Pinterest, Airbnb, Notion, Cursor, and Intercom were cited as finding it better, cheaper, and faster to train or use open models in-house for many tasks rather than rely on APIs .

Impact: Cost, throughput, and deployment control are increasingly strategic product decisions, not back-end implementation details .

Research & Innovation

Why it matters: Research attention in these notes is centered on cheaper post-training, more efficient inference, and architectures that give agents more useful memory and control .

PivotRL cuts down expensive RL rollouts

NVIDIA’s PivotRL works on existing SFT trajectories, identifies informative intermediate pivots where sampled actions have mixed outcomes, and trains only on those moments instead of full rollouts . In the cited results, it preserved out-of-domain performance at +0.21 points on average versus -9.83 for standard SFT, while delivering +14.11 in-domain gains over the base model versus +9.94 for SFT . On SWE-Bench, the post said it matched end-to-end RL accuracy with 4x fewer rollout turns and 5.5x less wall-clock time, and is already used in production for Nemotron-3-Super-120B post-training .

KV-cache compression remains one of the highest-leverage efficiency targets

Posts about Google’s TurboQuant said it compresses KV cache from 32 bits to 3 bits without retraining, with identical accuracy, and can shrink a 16 GB context footprint to under 3 GB . A separate technical read said the compression looked genuine, but the speed claims in a blog relied on an unrealistic float32 einsum baseline and the paper itself made no speed claims .

EGGROLL revisits gradient-free scaling

A post highlighted NVIDIA and Oxford’s EGGROLL as a way to train billion-parameter models with evolution strategies rather than backpropagation, using hundreds of thousands of parallel mutations and low-rank mutation matrices . The same post said models can be pretrained from scratch using simple integers rather than gradients or decimals .

Researchers are treating transformer depth as something models can retrieve from

Two methods highlighted this cycle—Attention Residuals and Mixture-of-Depths Attention—make transformer layers depth-aware, so layers or heads can draw from multiple earlier layers rather than only token positions .

Ego2Web links real-world perception to web actions

Google DeepMind and UNC Chapel Hill’s Ego2Web, accepted to CVPR 2026, pairs egocentric video perception with web execution so agents can read first-person context and take grounded actions online .

Products & Launches

Why it matters: Product work is focusing on deployability: keeping execution inside enterprise boundaries, reducing security toil, and giving developers more flexible ways to run agents .

Cursor put self-hosted cloud agents into GA

Cursor said self-hosted cloud agents are now generally available, keeping code and tool execution inside enterprise infrastructure while Cursor manages orchestration and inference . Details are in its blog post.

Codex Security is being positioned as a security workflow, not just a coding feature

OpenAI describes Codex Security as a tool to find, validate, and fix vulnerabilities . It remains free during preview, and OpenAI said thousands of organizations are already using it to identify hundreds of thousands of issues . Product page: developers.openai.com/codex/security.

Cohere published browser-capable transcription weights and a noisy-condition demo

Cohere released Transcribe as an open-source ASR model that runs in the browser and said it sets a new accuracy standard in real-world noisy conditions, including with a blender running . The model weights are on Hugging Face, and Cohere shared a public demo link .

New tooling is making multi-harness and long-memory agents easier to run

Hankweave now lets developers switch between harnesses such as the Agents SDK, Codex, Gemini, and Opencode with a unified input and logging layer . Separately, CAR added Hermes as a first-class ACP runtime, emphasizing global context shared across sessions for repo work and multi-repo workflows . Repos: multi-harness-hank and codex-autorunner.

Industry Moves

Why it matters: Competitive position is increasingly being shaped by ecosystems, business models, and who controls deployment costs .

  • Claude’s paid base is expanding quickly. TechCrunch-linked reporting and a separate post citing credit card data said paid subscribers have more than doubled in under six months, with record new and returning users in January and February; ChatGPT still leads overall .
  • Open models are gaining enterprise ground. Posts cited Pinterest, Airbnb, Notion, Cursor, and Intercom as public examples saying open models are better, cheaper, and faster than APIs for many tasks, with many more companies reportedly doing the same privately .
  • OpenAI is reinforcing the Codex ecosystem. A post this cycle said OpenAI is acquiring Astral, the team behind uv, Ruff, and ty, to deepen Codex . In parallel, a Codex ambassador program now spans 82 developers across 27 countries and 5 continents .
  • Hark is hiring across the full stack for native AI devices. The company posted 25 roles across AI infra, embedded software, foundation models, computer-use agents, and hardware, and said its new office will include fabrication and hardware labs .

Policy & Regulation

Why it matters: The clearest policy signal in these notes was not a new law but rising pressure to govern autonomous systems already in production .

Governance is lagging deployment

IDC and Rubrik material cited in the notes said autonomous AI is already in production in more than 50% of organizations, while governance is falling behind and agent sprawl is becoming the next enterprise risk . The same material framed agents as machine-speed security challenges and emphasized visibility, control, and organizational changes as the response .

Internet traffic is increasingly machine-generated

A Human Security report cited in the notes said automated traffic grew 8x faster than human activity in 2025, and AI-agent traffic surged nearly 8,000%, pushing bot traffic past human traffic overall .

Biosecurity concerns are getting more explicit

One post argued that tools capable of helping vibe-code cancer vaccines could also help generate far more dangerous biological designs, and François Fleuret said he shares that concern and wants a serious discussion of it .

Quick Takes

Why it matters: These smaller updates round out the picture on robotics, benchmarks, real-world AI use, and how people are working with frontier systems day to day .

  • Figure 03 was shown autonomously sorting deformable packages and placing them labels-down for scanning; one observer said it looked far better than the Unitree G1 he owns at home .
  • Separate posts said Unitree robots are already being used in hospitals as caregivers and assistants .
  • Agentica said its SDK reached 36.08% on ARC-AGI-3 in one day .
  • A 17-year-old, Naveen Dhar, built a gunshot-detection model for rainforest anti-poaching work that the cited post says almost never false-alarms, after earlier systems produced overwhelming false positives .
  • Users reporting on 1M-token contexts said complex work still degrades around 150k tokens, leading them to hand off sessions around 100k-150k despite much larger advertised windows .
  • MoonDream 3 drew criticism for exposing different API surfaces across its Hugging Face, local Station, and hosted Cloud deployments .
  • Karpathy said LLMs are extremely good at arguing in multiple directions; his advice was to use that strength for opinion formation, while asking from different directions and watching for sycophancy .
  • François Chollet argued that intelligence is better thought of as a bounded conversion ratio than an unbounded scalar, while noting that machines still gain from speed, working memory, and recall advantages .
Anthropic Leak, Compute Bottlenecks, and the Agent Playbook Take Center Stage
Mar 28
8 min read
608 docs
Tibo
Software Mansion
Zixuan Li
+35
The brief covers leaked Anthropic model details and the security fallout, tightening memory and power bottlenecks, the steady open-vs-closed model gap, and new research and product launches across agents, voice, vision, and chip design.

Top Stories

Why it matters: Four themes stood out: frontier-model security, physical infrastructure constraints, the economics of open vs. closed models, and a more formal operating model for AI agents.

Anthropic’s unreleased model leak became a security story

According to posts citing leaked materials, Anthropic has been testing a model called Mythos with select customers. Those posts described it as a new tier above Opus—later edited in one post to Capybara—with stronger results in coding, academic reasoning, and cybersecurity, plus a slow rollout because of compute intensity and security concerns . Fortune was separately cited for reporting that Anthropic left details of an unreleased model in an unsecured data trove .

Impact: Frontier-model competition is now tied not just to capability, but to selective access, cyber risk, and operational security .

Compute constraints are showing up in memory, power, and construction schedules

Epoch AI said the total memory bandwidth of AI chips shipped since 2022 has reached 70 million terabytes per second and is growing 4.1x per year, while AI inference is often bottlenecked by memory bandwidth rather than raw compute . It also said AI chips consumed more than 90% of total HBM production in 2025 and that HBM prices spiked in early 2026 as demand outpaced supply . At the same time, Microsoft said it is partnering with Crusoe on a 900MW AI factory in Abilene, Texas , OpenAI said steel beams went up this week at its Michigan Stargate site with Oracle and Related Digital , and NVIDIA said Vera Rubin + Groq 3 LPX can deliver up to 35x more performance per megawatt for trillion-parameter models and massive context workloads .

Impact: The competitive bottleneck is increasingly about watts, memory bandwidth, and buildout speed—not only model quality .

The open/closed gap is much smaller than it used to be, but the frontier is still closed

Arena said the gap between top open-source and proprietary text models has held at roughly 50-60 points for about 14 months, down from 100-150 points before mid-2024 . It also said proprietary models currently occupy the first 20 places on the Text Arena leaderboard, while the leading open models are GLM-5 at #20, Kimi-K2.5-Thinking at #23, and Qwen3.5-397b-a17b at #27 . In separate Arena analysis, GPT-5.4 High, Mini, and Nano behaved like scaled versions of the same model, suggesting price differences mainly reflect efficiency rather than different core capabilities .

Impact: Open models are closer than before, but the leading edge still sits with closed labs, and pricing is becoming more about efficiency per task than a simple proxy for intelligence .

The agent era is getting its own playbook

A new Google-linked report argues that intelligence explosions are social rather than individual, and that future progress may come from human-AI configurations and agent institutions rather than bigger monolithic models . In plain language, the argument is that groups of agents with roles, checks, and protocols may matter more than one ever-larger model .

Every prior intelligence explosion in human history was social, not individual.

IBM’s new survey on workflow optimization for LLM agents organizes agent systems by when workflow structure is set, what components are optimized, and which signals guide the optimization . Artificial Analysis also launched AA-AgentPerf, a hardware benchmark for the agent era that uses real coding-agent workloads and reports maximum concurrent users per accelerator, per kW, per dollar, and per rack .

Impact: The discussion is moving from which single model is best to how agent systems should be structured, evaluated, and deployed .

Research & Innovation

Why it matters: Research attention is shifting toward unified multimodal systems, better long-context reasoning, more stable world models, and more realistic evaluations.

  • Apple AToken: Apple introduced AToken, a shared tokenizer and encoder for images, video, and 3D objects in one framework. The post said it beats or rivals specialized models and allows knowledge transfer across media types .
  • SAGE: This closed-loop multi-agent training method co-evolves a Challenger, Planner, Solver, and Critic from one LLM backbone using just 500 seed examples. On Qwen-2.5-7B, it reportedly improved out-of-distribution performance by 4.2% while maintaining in-distribution accuracy .
  • Together Research’s divide-and-conquer approach: A Planner rewrites tasks for parallel Workers and a Manager combines their outputs. Together said Llama-3-70B and Qwen-72B using this setup can match or beat GPT-4o single-shot on long-context retrieval, QA, and summarization as context length grows, though the method still struggles when important clues are spread across distant chunks .
  • LeWorldModel: Yann LeCun’s team released LeWorldModel, described as a world model that avoids collapse by adding a SIGReg regularizer to its prediction loss. The post also claimed 15M parameters, training on one GPU in hours, 48x faster planning, and about 200x fewer tokens for encoding .
  • CursorBench: A new benchmark for coding agents uses real Cursor team coding sessions, evaluates more than functional correctness, emphasizes long-horizon tasks with a median 181 lines changed per task, and keeps the data refreshed with recent sessions .

Products & Launches

Why it matters: Product releases this cycle focused on deployability: lower-latency voice agents, faster video processing, more local execution, and tools that slot directly into agent workflows.

  • OpenAI gpt-realtime-1.5: OpenAI showed a clinic concierge demo for a Singapore health clinic. It speaks naturally with patients, collects the needed details, and books appointments in real time .
  • Meta SAM 3.1: Meta released SAM 3.1 as a drop-in update to SAM 3. Its core change is object multiplexing, which lets the model track up to 16 objects in one forward pass and doubles throughput from 16 to 32 FPS on a single H100 for medium-object videos . Meta said the point is to make high-performance video applications feasible on smaller, more accessible hardware .
  • Cohere Transcribe in the browser: Cohere’s multilingual speech recognition model can run entirely locally in a browser on WebGPU. A post said it can transcribe 1 hour of audio in 100 seconds, is fully private, free, and requires no installation .
  • LiteParse: LlamaIndex’s LiteParse is a model-free, open-source document parser for AI agents. It processes about 500 pages in 2 seconds on commodity hardware, supports 50+ file formats, and is designed to plug into agent tools, while the authors note it is not meant to replace OCR-heavy workflows for scanned documents .
  • Hermes Agent + Hugging Face: Hermes Agent is positioned as an open-source agent that remembers what it learns through a multi-level memory system and persistent machine access . Hugging Face is now a first-class inference provider inside Hermes, with 28 curated models in the picker and custom access to 100+ more .
  • Gemini video creation: Google added a Create video workflow in Gemini’s app and web experience, where users select the tool, describe the video, optionally upload a reference image or choose a template, and generate directly from the interface .

Industry Moves

Why it matters: Business activity keeps pointing to three battlegrounds: capital markets, distribution, and AI-shaped hardware.

  • Anthropic IPO talk is getting more concrete: A post citing reporting said Anthropic is eyeing a Q4 2026 IPO with a raise above $60 billion, that its annualized revenue more than doubled to $19 billion in the first two months of 2026, and that bankers think it could reach public markets before OpenAI because of its enterprise and developer focus plus a shorter projected path to profitability .
  • Perplexity expanded Samsung distribution: Perplexity said it now powers Samsung’s Browsing Assist in Samsung Browser on Galaxy Android and Windows . In a separate post, Aravind Srinivas said the broader partnership now reaches a browser pre-installed on more than 1 billion Samsung devices, extends prior work with Bixby, and includes pre-loading on Galaxy S26 devices alongside Gemini .
  • Microsoft added more physical capacity: Mustafa Suleyman said Microsoft is partnering with Crusoe on a 900MW AI factory in Abilene, Texas to add capacity to its AI fleet and support Microsoft AI infrastructure .
  • RicursiveAI is betting RL can compress chip design cycles: Lightspeed said it led RicursiveAI’s $300 million Series A in January. The company says its reinforcement-learning-based semiconductor design platform can compress chip development from years to weeks .

Policy & Regulation

Why it matters: Formal AI policy is still uneven, but courts, safety packs, and billing controls are increasingly shaping how models are deployed.

  • Anthropic won a major preliminary court ruling: A federal judge in California indefinitely blocked the Pentagon’s effort to label Anthropic a supply chain risk, though the ruling is temporary and a parallel case is still underway in Washington, D.C. .
  • OpenAI published a teen safety policy pack: OpenAI released a set of prompt-based safety policies intended to create age-appropriate protections for teens, and published the repository publicly .
  • Gemini API billing is getting harder to overspend: Starting April 1, Gemini API billing tiers get a monthly spending cap, with API access pausing until the next month or a tier upgrade if the cap is hit. Users can also set per-project spend caps in AI Studio .

Quick Takes

Why it matters: These are smaller updates, but they show where tooling, benchmarks, and open-source ecosystems are moving next.

  • OpenAI launched a Codex use-case gallery with starter prompts that can open directly in the app, and separately reset Codex usage limits across all plans so users can experiment with newly launched plugins .
  • GLM-5.1 is now available to all GLM Coding Plan users, and a separate post said GLM-5.1 will be open source.
  • Epoch AI removed one FrontierMath: Open Problems item after GPT-5.2 Pro solved it, because the problem did not meet the benchmark’s minimum notability bar; it also updated sourcing guidelines afterward .
  • Hugging Face’s HF Papers CLI adds semantic search and markdown retrieval for arXiv papers, aimed at supporting autoresearch workflows .
  • Strix packages multi-agent application pentesting with a built-in browser, proxy, terminal, and Python runtime, aiming to cut automated pentesting from weeks to hours .
  • React Native ExecuTorch v0.8.0 adds Vision Camera integration for real-time computer-vision inference on live camera feeds, including support for RF-DETR and Liquid AI’s vision-language models .
  • Qdrant is pushing sparse embeddings for e-commerce search, arguing they preserve exact matches and interpretability better than dense embeddings for product attributes such as SKU, size, and brand .
  • Huawei’s 950PR AI chip was priced at ¥70,000 with a 2H shipment target of 750,000 units, while one commenter argued it is not comparable to Nvidia’s H200 for training workloads .
Gemini Live Goes Global as Codex Plugins and Open Audio Models Expand AI Workflows
Mar 27
7 min read
763 docs
Financial Times
Alexander Panfilov
The Wall Street Journal
+35
Google pushed Gemini 3.1 Flash Live across Search, Gemini, and developer channels, while OpenAI broadened Codex with open-source plugins. The brief also covers open audio models, new research systems, industry partnerships, and the latest safety and compliance signals.

Top Stories

Why it matters: The biggest developments today pushed AI deeper into real-time interaction, connected workflow automation, open audio infrastructure, and operational safety.

Google turned Gemini 3.1 Flash Live into a broad real-time platform

Google rolled out Gemini 3.1 Flash Live across Gemini Live, Search Live, Google AI Studio, and Google Cloud, positioning it as a production-ready realtime model for voice and vision agents . Google said it improved quality, reliability, latency, conversation memory, and instruction-following, while Search Live is now available in more than 200 countries and territories with multilingual support . Independent benchmarking also showed a clear speed/quality tradeoff: 95.9% on Big Bench Audio at the high thinking setting with 2.98s time-to-first-audio, versus 70.5% and 0.96s on minimal thinking .

Impact: Google is not just shipping a model. It is distributing one live audio stack across consumer search, the Gemini app, developer tooling, and enterprise channels.

OpenAI expanded Codex from coding assistant to connected work surface

OpenAI is rolling out plugins in Codex so it can work with tools like Slack, Figma, Notion, Gmail, and Google Drive, including Docs, Sheets, and Slides . OpenAI said plugins extend Codex into planning, research, coordination, and post-coding workflows; they are available in the Codex app, CLI, and IDE extensions . OpenAI also said users will be able to build and share their own plugins, and that today's plugins are open source .

Impact: This moves Codex closer to a general work agent that operates inside the tools teams already use, not just inside a code editor.

Open speech models got stronger on both input and output

Cohere launched Cohere Transcribe, its first audio model, under Apache 2.0. The company said it is state of the art in open-source speech recognition, ranks #1 on the Open ASR leaderboard, supports 14 languages, and reached 5.42% English word error rate in human evaluation . Mistral released Voxtral TTS as an open-weight text-to-speech model with low latency, emotional expressiveness, and support for 9 languages; the company published weights and a technical report .

Impact: The open audio stack is improving at both ends: transcription on the way in, expressive speech generation on the way out.

Safety work became more operational

Google DeepMind published new research on harmful manipulation based on studies with more than 10,000 people, finding high influence in finance but lower influence in health where existing guardrails blocked false medical advice . Separately, METR said it spent three weeks red-teaming Anthropic's internal monitoring and security systems, found several new vulnerabilities, and produced artifacts to improve future monitoring, while saying none of the findings severely undermined major claims in Anthropic's sabotage risk report .

Impact: Frontier labs are moving from abstract safety principles toward live testing, measurement, and third-party scrutiny.

Research & Innovation

Why it matters: The strongest technical work today focused on specialized systems: brain modeling, search agents, self-modifying agents, and automated security research.

  • Meta FAIR's TRIBE v2: Meta introduced a foundation model trained on 500+ hours of fMRI recordings from 700+ people to predict how the human brain responds to sights and sounds. Meta says it supports zero-shot predictions for new subjects, languages, and tasks, improves 2-3x over prior methods on movies and audiobooks, and is being released with code, paper, and demo .
  • Chroma Context-1: Chroma launched a 20B search agent it says pushes the pareto frontier of agentic search and is an order of magnitude faster and cheaper . The model was trained with SFT + RL on 8,000+ synthetic multi-hop tasks across web, SEC filings, patent law, and email, and Chroma open-sourced both the weights and the task-generation codebase .
  • Hyperagents and DGM-H: Hyperagents are presented as self-modifying AI systems that can rewrite both the task-solving and self-improvement parts of the agent. In the DGM-H setup, reported performance improved across coding, paper review, and robotics, with gains accumulating across runs .
  • Autoresearch for jailbreaking: A new paper used Claude Code in an autoresearch loop to discover novel jailbreaking algorithms that reportedly beat 30+ existing GCG-like attacks and generalized better to unseen models than prior work. The authors said this suggests some incremental safety and security research can now be automated .

Products & Launches

Why it matters: Product launches kept reducing friction around memory, provisioning, orchestration, and domain-specific deployment.

  • Gemini import tools: Gemini is rolling out memory import and chat history import, letting users bring preferences and prior chats from other AI apps into Gemini on desktop, with mobile coming later .
  • Stripe Projects: Stripe launched Projects in developer preview so agents can provision third-party services from the CLI. Stripe's example command creates a PostHog account, gets an API key, and sets up billing without leaving the terminal .
  • Cline Kanban: Cline launched a free, open-source standalone app for CLI-agnostic multi-agent orchestration, compatible with Claude, Codex, and Cline. Tasks run in worktrees, can be linked into dependency chains, and include built-in git views .
  • Glass Developer API: Glass Health made its Developer API self-serve inside its web app. The API supports clinical question answering, differential diagnosis, treatment planning, and documentation, with structured JSON, in-text citations, and HIPAA compliance with BAA .
  • Ollama in VS Code: Visual Studio Code can now use local or cloud Ollama models through GitHub Copilot if Ollama is installed .

Industry Moves

Why it matters: Partnerships and financing are showing where companies think AI value will concentrate: manufacturing, multi-agent systems, and new revenue lines.

  • Sakana x Mitsubishi Electric: Sakana AI announced a strategic partnership and investment from Mitsubishi Electric. The two companies said they will combine Mitsubishi's manufacturing data and domain knowledge with Sakana's AI systems, and Sakana framed manufacturing and physical AI as its third major pillar after finance and defense .
  • OpenAI backs Isara: Isara raised $94 million at a $650 million valuation. Posts describing the company say it coordinates thousands of AI agents to solve complex problems, used roughly 2,000 agents to forecast gold prices, and plans to sell predictive modeling tools to finance firms first .
  • OpenAI ads pilot: Reporting shared on X said OpenAI's ads pilot surpassed $100 million in ARR six weeks after launch, expanded to more than 600 advertisers, and plans self-serve advertiser access in April .
  • Anthropic IPO talk: A post linking The Information said Anthropic has discussed going public as soon as the fourth quarter and that bankers pitching the company think an IPO could raise more than $60 billion .

Policy & Regulation

Why it matters: The clearest policy signals today were around safety governance, privacy, and compliance rather than formal rulemaking.

  • Third-party red-teaming: METR said Anthropic gave an external researcher substantial access to internal monitoring and security systems for a three-week exercise, and METR said some vulnerabilities found during the exercise have already been patched .

"This kind of adversarial testing by external researchers is valuable for discovering vulnerabilities, as well as for developing best practices for embedding third party evaluators inside frontier AI companies."

  • Manipulation measurement: Google DeepMind said it built a first-of-its-kind empirically validated toolkit to measure real-world AI manipulation, based on nine studies involving more than 10,000 participants across three countries .
  • OpenAI put an erotic chatbot plan on hold: Posts citing the Financial Times said OpenAI indefinitely shelved a planned adult-mode chatbot amid concerns about risks to minors, unhealthy emotional attachments, and the difficulty of filtering illegal material while generating explicit content .
  • Encrypted inference: Chutes said its end-to-end encrypted AI inference keeps user data encrypted until it reaches a GPU inside a trusted execution environment, and uses ML-KEM-768 with fresh ephemeral keypairs for forward secrecy and post-quantum resistance .

Quick Takes

Why it matters: These were smaller updates, but they point to where tooling, creator software, and AI operations are moving next.

  • Moondream Photon claims 46ms end-to-end VLM inference and 60+ fps on a single H100, from edge devices to servers .
  • Runway's Multi-Shot App turns a prompt or image into a scene with dialogue, sound effects, cuts, pacing, and cinematic framing .
  • Google's Lyria 3 Pro can generate music tracks up to three minutes with structure-aware sections such as intros, verses, choruses, and bridges .
  • Stanford NLP's sycophancy study reported that sycophantic LLMs can make users more self-centered, increase confidence that they are right, and reduce willingness to repair interpersonal conflicts, even while users prefer and trust those systems more .
  • Anthropic tightened peak-hour Claude limits, while OpenAI responded by offering temporary 2x Codex rate limits across ChatGPT subscriptions .
  • AxiomMath open-sourced Axplorer, a tool for searching interesting or optimal mathematical objects under constraints; the company said it matched state of the art on several combinatorics problems with much less compute and time .
AI Scientist Reaches Nature as ARC-AGI-3 Debuts and GPT-5.4 Gets Cheaper
Mar 26
9 min read
718 docs
Cohere
Chubby♨️
Nathan Benaich
+34
Sakana AI’s Nature paper, ARC-AGI-3’s human-AI gap, and OpenAI’s GPT-5.4 mini and nano headline the cycle. The brief also covers new research architectures, product rollouts, hiring and funding signals, and the latest policy and governance moves.

Top Stories

Why it matters: This cycle mixed a research milestone, a new benchmark gap, cheaper frontier-model variants, and a deployment-level inference breakthrough.

Sakana AI took The AI Scientist into Nature

Sakana AI said The AI Scientist: Towards Fully Automated AI Research is now published in Nature. The system is described as an agent built from foundation models that can run the full machine-learning research loop: invent ideas, write code, run experiments, and draft the paper . Sakana also said AI Scientist-v2 produced the first fully AI-generated paper to pass rigorous human peer review, and that the Nature paper introduces an Automated Reviewer that matches human judgments and exceeds standard inter-human agreement . The paper reports a "scaling law of science": stronger foundation models—and, in later commentary, more inference compute—produce higher-quality generated papers . The work is open-source and was done with collaborators at UBC, the Vector Institute, and Oxford .

Why it matters: this is one of the clearest public attempts to combine end-to-end research automation, peer-reviewed validation, and open release in a single result.

ARC-AGI-3 opened with a wide human-AI gap—and immediate debate about the metric

ARC-AGI-3 was released as a benchmark for agentic intelligence in interactive reasoning environments, with the stated goal of measuring whether an AI can match human-level action efficiency on unseen tasks . ARC Prize said humans solve 100% of environments on first contact with no prior training or instructions, while frontier AI models are under 1% at launch . A set of posted scores put Gemini 3.1 Pro at 0.37%, GPT-5.4 at 0.26%, Opus 4.6 at 0.25%, and Grok 4.2 at 0% . François Chollet separately said ARC-AGI is not a final exam for AGI, but a moving target aimed at the residual gap between what is easy for humans and hard for AI .

"Most benchmarks test what models already know, ARC-AGI-3 tests how they learn"

The benchmark design is already under scrutiny. Official posts say the human baseline uses the action count of the second-best tester out of 10, and a score measures how close a system gets to matching or exceeding that baseline . External commentary noted quadratic scaling of steps and warned that ARC-AGI-3 scores should be interpreted differently from standard benchmarks , while other critics questioned the "human score 100%" framing and whether prior puzzle or game exposure makes the human comparison less clean than advertised .

Why it matters: ARC-AGI-3 is now both a hard new public target for agentic systems and a live debate over how progress should be measured.

OpenAI widened the GPT-5.4 line with cheaper mini and nano models

Artificial Analysis reported that OpenAI released GPT-5.4 mini and GPT-5.4 nano, both with the same reasoning effort modes as GPT-5.4, multimodal image input, and a 400K-token context window . Pricing was listed at $0.75/$4.50 per 1M input/output tokens for mini and $0.20/$1.25 for nano, versus $2.50/$15 for GPT-5.4 . The same evaluation said nano outperformed Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview on several reasoning and terminal-style tests, while mini posted stronger agentic GDPval-AA scores than Gemini 3 Flash Preview but trailed Claude Sonnet 4.6 . The tradeoff is efficiency: both models used far more output tokens than peers at highest reasoning effort, and both showed weak AA-Omniscience results driven by high hallucination rates .

Why it matters: OpenAI is pushing its frontier line further downmarket, but the benchmark data suggests buyers still need to watch token consumption and hallucination behavior.

TurboQuant moved from paper result to open inference deployment

Google Research introduced TurboQuant as a compression algorithm that cuts LLM key-value cache memory—the working memory models use during generation—by at least 6x and delivers up to 8x speedup with zero accuracy loss . A separate technical summary said the method needs no retraining, converts data into polar coordinates to remove storage overhead, and applies a 1-bit correction step; tests on Gemma and Mistral models reportedly matched full-precision quality on question answering and code generation while also beating prior methods in vector search . The result quickly showed up in the open serving stack: one developer said they implemented TurboQuant for vLLM and fit 4,083,072 KV-cache tokens on a USB-charger-sized HP ZGX, which the vLLM project then praised publicly .

Why it matters: this is a case where an inference paper is already showing concrete deployment effects in open tooling.

Research & Innovation

Why it matters: Beyond the headline stories, this cycle emphasized self-improving agents, shared memory, hybrid architectures, and native multimodality.

  • Hyperagents: Meta and collaborators introduced self-referential agents where the self-improvement process itself is editable, rather than fixed . The DGM-Hyperagent combines a task agent and a meta agent in one modifiable program, discovering improvements such as persistent memory and performance tracking that transfer across domains . Reported gains included paper review accuracy moving from 0.0 to 0.710, robotics reward design from 0.060 to 0.372, and zero-shot transfer to Olympiad-level math grading at 0.630 .
  • MemCollab: New research on memory sharing across heterogeneous agents uses contrastive trajectory distillation to separate universal task knowledge from agent-specific biases . In plain terms, it compares how different agents reason through the same task to extract shared constraints, then uses task-aware retrieval to apply the right constraints later . The authors report gains in both accuracy and inference-time efficiency for math reasoning and code generation, even across model families .
  • Hybrid Associative Memory (HAM): ZyphraAI proposed a Transformer/RNN hybrid that lets the RNN handle predictable tokens and the Transformer handle surprising ones based on a user-selected KV-cache budget . At 800M parameters, HAM was reported to outperform pure Transformer, pure RNN, and prior hybrid baselines on language modeling and long-context retrieval while using only 50% KV cache . The architecture also allows adjustable KV cache at inference time and even within a single sequence .
  • LongCat-Next: Meituan introduced a native autoregressive multimodal model with 68.5B total parameters and 3B active parameters, built on a shared discrete token space across language, vision, and audio . The model combines a new any-resolution vision transformer with capabilities in OCR, charts, GUI understanding, document analysis, arbitrary-resolution visual generation, audio comprehension, and voice cloning .

Products & Launches

Why it matters: New releases this cycle were less about one giant model launch and more about turning AI into usable, task-specific software.

  • AssemblyAI Medical Mode: AssemblyAI added a medical correction layer on top of Universal-3 Pro, aimed at fixing the drug names, dosages, and terminology errors that make general-purpose ASR unsafe for clinical workflows . The company says the base model's noise handling and latency stay the same, while the correction focuses on key medical tokens; it is available for both pre-recorded and streaming audio, with HIPAA BAA included .
  • Lyria 3 Pro rollout: Google DeepMind and Gemini said Lyria 3 Pro now supports tracks up to three minutes, with structure controls for intros, verses, choruses, and bridges . Access is rolling out in the Gemini App for Google AI Plus, Pro, and Ultra users, while developers can build against it in Google AI Studio and the Gemini API . Google also said all Lyria 3 and Lyria 3 Pro outputs carry SynthID watermarking .
  • Claude work tools on mobile: Anthropic said Claude's work tools are now available on mobile, including access to Figma designs, Canva slides, and Amplitude dashboards from a phone .
  • Cursor self-hosted cloud agents: Cursor said its cloud agents can now run on customer infrastructure, keeping code and tool execution inside the user's own network while preserving the same agent harness and experience .
  • LangSmith Fleet shareable skills: LangChain added shareable skills to LangSmith Fleet, letting teams capture domain knowledge once, attach it to any agent, and create skills from prompts, past chats, manual entry, or templates .

Industry Moves

Why it matters: Hiring patterns, partnerships, and funding are showing where companies think the next wave of value will come from.

  • AI labs are hiring for go-to-market and adoption at scale: Epoch AI's analysis of job postings at OpenAI, Anthropic, xAI, and DeepMind said sales and go-to-market roles are now the largest hiring category at OpenAI and Anthropic, at 31% and 28% of open roles respectively, while research roles account for 7% and 12% . The same analysis pointed to heavy hiring for "AI Success Engineer" and "Forward Deployed Engineer" roles, 15 OpenAI roles tied to a consumer hardware device, and growing investment in robotics at both OpenAI and DeepMind .
  • Cohere partnered with RWS: Cohere said its frontier models are being integrated into RWS Group's Language Weaver Pro to provide enterprise-grade translation for high-stakes environments, including enterprise and government use cases .
  • Gumloop raised $50M: Gumloop raised a $50M Series B led by Benchmark, bringing total funding to $70M for its no-code AI agent automation platform .
  • AirStreet closed a larger AI-first fund: AirStreet said it raised $232,323,232 for Fund III to back AI-first companies in the U.S. and Europe, making it the largest solo GP venture firm in Europe by its own description .

Policy & Regulation

Why it matters: AI policy is now reaching physical infrastructure, while labs are continuing to publish formal governance frameworks for model behavior.

  • Sanders targets data-center buildout: The Washington Post said Sen. Bernie Sanders will introduce legislation to block construction of new data centers until lawmakers enact AI regulations .
  • OpenAI highlighted its Model Spec: OpenAI described the Model Spec as the public framework for how its models are intended to behave, covering what they should and should not do as capability grows . The company said the framework includes a chain of command for resolving conflicting instructions and evolves over time through real-world use, feedback, and new model capabilities .
  • Anthropic documented auto-mode safety decisions: Anthropic said Claude Code auto mode is meant to be a safer middle ground between prompting for approval on every action and running without permission prompts, using built and tested classifiers to make approval decisions .

Quick Takes

Why it matters: These items were smaller, but they point to where tooling, interfaces, and agent infrastructure are moving next.

  • Google Research's Vibe Coding XR turns prompts into interactive, physics-aware WebXR apps through Gemini Canvas and XR Blocks
  • LLaDA2 became the first discrete diffusion pipeline for text in Diffusers; it uses a 16B total-parameter MoE architecture
  • Browserbase and PrimeIntellect launched BrowserEnv so users can train browser agents or custom models for their own workflows in a few hours
  • A 24B model was shown running locally in a web browser at about 50 tokens/sec on an M4 Max using WebGPU and Transformers.js
  • Georgia Tech SSLab's Vibe Radar tracks public CVEs linked to AI-generated code, scanning 50k+ advisories and finding dozens of confirmed cases across tools such as Claude Code, Copilot, and Cursor
  • Anthropic launched inline interactive charts, diagrams, and visualizations in Claude chat, in beta across all plan types
  • Together AI added four new image models spanning text rendering, character consistency, search-grounded generation, and unified generation/editing on its serverless stack
  • ARC Prize 2026 went live with three tracks and $2,000,000 in prizes
Sora Shuts Down, LiteLLM Is Compromised, and Siri Gets an AI Agent Reboot
Mar 25
7 min read
740 docs
vLLM
Daniel Hnyk
Perplexity
+38
OpenAI is shutting down Sora while preparing its next model, LiteLLM’s compromise exposed a major supply-chain risk in AI tooling, and a new report says Apple is rebuilding Siri into a system-wide AI agent. The brief also covers key research advances, product launches, corporate moves, and safety-related updates across the AI landscape.

Top Stories

Why it matters: This cycle combined a major OpenAI product retreat, a supply-chain security shock, a fresh consumer-AI platform wager from Apple, and one of the clearest public disclosures yet on how a frontier coding model was trained.

1) OpenAI is winding down Sora as Spud nears

Reporting shared on X said OpenAI has finished pretraining or initial development of a new model codenamed Spud and is winding down Sora’s app, API, and video capabilities in ChatGPT. The same reporting said Sam Altman is dropping oversight of some direct reports and focusing on raising capital, supply chains, and datacenter buildout at unprecedented scale.

“We’re saying goodbye to Sora. To everyone who created with Sora, shared it, and built community around it: thank you. What you made with Sora mattered, and we know this news is disappointing.”

“We’ll share more soon, including timelines for the app and API and details on preserving your work.”

A post quoting the report said Sora had become a drag on computing resources during heightened competition.

Impact: The reporting points to a shift of compute and leadership attention toward the next large model and infrastructure buildout rather than a standalone video product.

2) The LiteLLM compromise turned AI infrastructure into the day’s security story

Researchers said PyPI release 1.82.8 of LiteLLM contained litellm_init.pth with base64-encoded instructions to exfiltrate SSH keys, cloud credentials, git credentials, API keys, shell history, crypto wallets, SSL keys, CI/CD secrets, and database passwords, then self-replicate. Karpathy added that LiteLLM sees about 97 million downloads per month and that dependents such as dspy were also exposed through transitive installs. The poisoned release appears to have been live for less than an hour before a RAM crash in a Cursor MCP plugin helped uncover it.

“Supply chain attacks like this are basically the scariest thing imaginable in modern software.”

The incident also spilled into the agent ecosystem: Hermes users who installed recently were told to review a security notice, and Hermes installs were blocked when litellm was quarantined on PyPI.

Impact: This was not just one bad package version. It showed how reused AI-agent infrastructure can turn a single compromised dependency into a much broader credential-exposure problem.

3) A new report says Apple is turning Siri into a system-wide AI agent

A Bloomberg report shared by Mark Gurman says iOS 27 will rebuild Siri into a system-wide AI agent. Reported features include a standalone Siri app with chat history and file uploads, text-and-voice interaction, an Ask Siri button for contextual actions across apps, unified Siri-and-Spotlight search, and Write with Siri editing tools. A separate summary of the report said many advanced features will continue rolling out into late 2026.

That same summary said the system will be powered by Apple Foundation Models plus a Google Gemini partnership.

Impact: If the report holds, Apple is moving from assistant-style AI features toward deeper system control, but on a staggered timeline.

4) Cursor published a rare training report for a frontier coding model

Cursor released a technical report on how Composer 2 was trained, saying the model reached frontier-level coding through extensive research and that the report shares details meant to be useful to the community. Commentary on the report highlighted continual pretraining improving RL performance, a multi-token prediction head for speculative decoding, length-penalty RL for long tasks, self-summarization for context compaction, and detailed sections on kernels, parallelism, quantization, and distributed RL.

Impact: The value here is the level of disclosure: the report gives builders concrete training and infrastructure choices, not just benchmark claims.

Research & Innovation

Why it matters: Technical progress this cycle focused less on one giant model launch and more on the systems around models: memory, serving, evaluation, and retrieval.

  • TurboQuant: Google Research introduced TurboQuant, a compression algorithm that reduces LLM key-value cache memory by at least 6x and can deliver up to 8x speedup with zero accuracy loss.
  • APEX-SWE: Mercor and Cognition launched a benchmark for realistic software-engineering work such as shipping systems and debugging failures, arguing that traditional coding benchmarks do not reflect how software is actually built and maintained. On the initial leaderboard, OpenAI GPT 5.3 Codex (High) led at 41.5% Pass@1.
  • vLLM Model Runner V2: vLLM rebuilt its execution core into Model Runner V2 with modular design, GPU-native input preparation, async-first execution with zero CPU–GPU sync, and a Triton-native sampler. Separate GTC notes said the project is also reducing memory waste to 0–12% across OSS models and improving multimodal P99 throughput by up to 2.5x through encoder prefill disaggregation.
  • Late-interaction retrieval: A 150M Reason-ModernColBERT model reached nearly 90% on BrowseComp-Plus and beat models up to 54× larger, while Mixedbread Search was reported to approach oracle-level performance on knowledge-intensive agentic benchmarks.

Products & Launches

Why it matters: New releases kept pushing agents deeper into everyday workflows—permissions, browsers, filesystems, APIs, and open browser-use models.

  • Claude Code auto mode: Anthropic added an auto mode that lets Claude make permission decisions for file writes and bash commands on the user’s behalf, with safeguards checking each action before it runs.
  • Perplexity Computer and Comet: Perplexity said its Computer product uses Comet to kick off workflows in a local browser. Arav Srinivas described Comet as an autonomous Internet Computer, and the demo showed it opening five tabs, running parallel image-generation tasks, downloading and cropping outputs, and assembling a comparison deck.
  • Hermes Agent v0.4.0: NousResearch’s largest Hermes update this week merged 300 PRs and added a background self-improvement loop, an OpenAI-compatible API backend, and major CLI upgrades.
  • hf-mount: Hugging Face introduced hf-mount, which can attach a storage bucket, model, or dataset from the Hub as a local filesystem. The project says it can expose remote storage 100× larger than a local disk and is well suited to agentic storage workflows.
  • MolmoWeb: AI2 released MolmoWeb 4B and 8B browser-use models and their datasets under Apache 2.0.

Industry Moves

Why it matters: Labs and platform companies kept reallocating capital, talent, and partnerships toward agents, AI-native software, robotics, and new interfaces.

  • Hark emerged from stealth: Brett Adcock said Hark spent eight months in stealth building the most advanced personal intelligence in the world, paired with next-generation hardware as a human-machine interface. Separate reporting said Adcock put in $100M of his own money, assembled a 45+ person team from Apple, Tesla, Google, Meta, and Amazon, expects thousands of NVIDIA B200 GPUs online by April, and plans a first model this summer.
  • Microsoft added senior AI2 talent: Mustafa Suleyman welcomed Ali Farhadi, Hanna Hajishirzi, and Ranjay Krishna to Microsoft Superintelligence, describing them as impactful contributors to AI research and open source.
  • Google DeepMind partnered with Agile Robots: DeepMind said a new research partnership will integrate Gemini foundation models with Agile Robots hardware to build more helpful and useful robots.
  • Meta’s internal AI push shifted upward: Reporting on X said CTO Andrew Bosworth is taking over supervision of Meta’s effort to become AI-native, including the company’s AI For Work initiative.

Policy & Regulation

Why it matters: This cycle’s policy signal came less from governments and more from safety, access, and institutional compliance moves around powerful models.

  • OpenAI Foundation: OpenAI said the Foundation will spend at least $1 billion over the next year, initially focusing on areas such as disease cures, AI resilience, civil society, philanthropy, and threats including novel bio risks, fast economic change, and complex emergent effects from capable models. Wojciech Zaremba is moving to lead AI resilience.
  • Teen safety policies for developers: OpenAI Devs released prompt-based teen safety policies for gpt-oss-safeguard, designed to help developers identify and moderate teen-specific content and turn policy requirements into classifiers for real-time filtering or offline analysis.
  • NeurIPS sanctions rule: A post citing a NeurIPS Foundation announcement said the conference will no longer accept submissions from US-sanctioned institutions.

Quick Takes

Why it matters: These updates were smaller, but they help map where agent design, model usage, and deployment practices are going next.

  • Google’s Gemini API now supports combining Google Search and custom functions in a single request, with Gemini choosing tools and order automatically.
  • Gemini 3.1 Flash-Lite is being shown generating websites in real time as users click, search, and navigate.
  • Anthropic’s March Economic Index said longer-term Claude users iterate more carefully, hand over less autonomy, attempt higher-value tasks, and get more successful responses; the top 10 consumer tasks now account for 19% of conversations, down from 24% since November 2025.
  • Similarweb said Claude has overtaken DeepSeek, Grok, and Gemini to become the second most-used gen-AI app daily after ChatGPT.
  • Perplexity said its search embedding models crossed 1 million downloads in less than a month.
  • AssemblyAI said better speech models exposed flaws in human truth files and released tooling for corrected truth-file workflows, semantic word lists, and production-ready benchmarking.
  • Alibaba released the open-weight Qwen3.5 vision-language family, with smaller models such as Qwen3.5-9B said to rival or beat much larger competitors.
Claude’s Computer Use Launch, a FrontierMath Result, and Meta’s Dreamer Move
Mar 24
9 min read
564 docs
Stephanie Palazzolo
Deep Learning Weekly
The Wall Street Journal
+38
Anthropic pushed Claude into direct desktop control, Epoch AI reported a FrontierMath open problem solved with GPT-5.4 Pro, and Meta absorbed Dreamer’s personal-agent team. The brief also covers Mistral’s new open model, OpenAI’s Helion power talks, notable research updates, product launches, and new policy signals.

Top Stories

Why it matters: The biggest developments this cycle combined new agent surfaces, measurable capability progress, and strategic moves around talent and power.

1) Anthropic put Claude into the operating system

Claude can now use a computer to open apps, navigate the browser, and fill spreadsheets in a research preview inside Claude Cowork and Claude Code on macOS . Separate coverage described the feature as control of the mouse, keyboard, and screen, and noted it can pair with Dispatch for remote control from mobile .

The launch drew a useful framing from product commentators: computer use changes the product surface because it lets models operate in software environments where APIs do not exist and workflows were never designed to be automated .

2) GPT-5.4 Pro was credited with solving a FrontierMath open problem

Epoch AI said AI solved one of the problems in FrontierMath: Open Problems, a benchmark of real research problems that mathematicians had tried and failed to solve . The newly solved item was a Moderately Interesting conjecture from a 2019 paper by Will Brian and Paul Larson that had remained unsolved through several attempts . Kevin Barreto and Liam Price produced a construction using GPT-5.4 Pro that Brian confirmed, with a write-up planned for publication . Epoch also said Gemini 3.1 Pro, GPT-5.4 (xhigh), and Opus 4.6 (max) can solve the problem at least some of the time in its scaffold .

This is a concrete example of frontier models contributing to an unsolved research benchmark, though Epoch noted that only one Moderately Interesting problem has been solved so far .

3) Meta brought Dreamer’s personal-agent team into MSL

Dreamer co-founders dps, hbarra, and alcor said the entire Dreamer team is joining Meta Superintelligence Labs and licensing its technology to Meta . Dreamer said thousands of users had already used its Sidekick to build personal intelligent software in English for email, calendars, to-dos, learning tools, travel, work, health, and other bespoke needs traditional software does not prioritize .

The deal gives Meta both a team and a product vision centered on personal, malleable software shaped by the user .

4) OpenAI and Helion moved from overlap to active partnership exploration

Reporting linked by Axios said OpenAI is in advanced talks to buy electricity from Helion Energy, with OpenAI potentially securing an initial 12.5% of Helion’s production . Sam Altman separately said he is stepping down from Helion’s board because Helion and OpenAI are starting to explore working together at significant scale, while Helion said the change should make future partnership discussions easier from a governance standpoint .

Taken together, the disclosures move the OpenAI-Helion relationship from investment adjacency to active infrastructure planning .

5) Mistral released Small 4

Mistral Small 4 was described as an open-source 119B-parameter mixture-of-experts model that unifies reasoning, multimodal, and coding capabilities while delivering 40% lower latency and 3x higher throughput than its predecessor . Mistral linked the announcement directly from its site .

For readers tracking open models, the notable point is that the release is being positioned around both capability breadth and serving efficiency .

Research & Innovation

Why it matters: Several of the strongest research signals were about turning AI into a more reliable tool for science, browser interaction, memory, and robotics.

Anthropic launched a science blog with concrete AI-assisted research examples

Anthropic said its new Science Blog will feature research and stories of scientists using AI to accelerate their work .

“AI can’t yet do original work autonomously, but it can vastly accelerate it.”

Its launch examples included Harvard physicist Matthew Schwartz guiding Claude Opus 4.5 through a graduate-level calculation; Anthropic said the model could accelerate the work, while Alex Albert summarized Schwartz’s view as roughly second-year grad student level and a 10x acceleration . Another post described Claude being run over days on a JAX-based differentiable cosmological Boltzmann solver, and Anthropic argued that some long-horizon tasks are better suited to a single agent working sequentially than to splitting work across many agents .

WebArena-Infinity makes browser-task environments much cheaper to build

WebArena-Infinity was introduced as a scalable way to automatically generate high-authenticity, high-complexity browser environments with verifiable tasks for RL training and benchmarking . Compared with the 2023 WebArena effort—seven grad students, more than six months, five environments, and 812 tasks—the new system claims environment creation in under 10 hours and for less than $100, with easy parallel generation . Even open models already scoring 60%+ on WebArena and OSWorld complete fewer than 50% of tasks here .

Supermemory reported about 99% on LongMemEval_s without a vector database

Supermemory said it reached about 99% on LongMemEval_s using an experimental method called Agentic Search and Memory Retrieval, or ASMR . The system replaces vector search and embeddings with parallel observer agents that extract structured knowledge across six vectors from raw multi-session histories, then uses specialized search agents for direct facts, related context, and temporal reconstruction . The team said the method will be open-sourced in 11 days .

Robotics research pushed on data scale and human demonstrations

EgoVerse was introduced as an ecosystem for robot learning from egocentric human data, built by four research labs and three industry partners . The dataset includes more than 1,300 hours, 240 scenes, and more than 2,000 tasks . Commentary from NVIDIA’s Jim Fan argued that behavior cloning directly from humans can break the limitations of teleoperation and support scaling robot learning without robots in 2026 .

SWE-rebench broadened its evaluation setup

SWE-rebench removed demonstrations and the 80-step limit so modern models can use huge contexts, and added auxiliary interfaces to evaluate larger tasks fairly . The reported takeaways were that top models perform similarly, Opus 4.6 sits on top, GPT-5.4 is the most token-efficient top-five model at 774k tokens per task, and Qwen3-Coder-Next plus Step-3.5-Flash benefit heavily from very large contexts .

Products & Launches

Why it matters: Product releases kept pushing AI into day-to-day workflows—chat, file management, search, subscriptions, long-running agents, and always-on desktop context.

  • Sakana Chat: Sakana AI launched its first public-facing service, free for anyone in Japan. The chat product emphasizes web search and fast responses and is backed by the Namazu alpha model series, which Sakana says is tuned to reduce biases, reflect Japanese values, and adapt safely to local context .
  • ChatGPT file library: OpenAI said ChatGPT now makes it easier to find, reuse, and build on uploaded files through recent-file access in the toolbar, questions over uploaded content, and a new Library tab on the web. The rollout is global for Plus, Pro, and Business users, with EEA, Switzerland, and UK availability coming later .
  • MiniMax Token Plan: MiniMax introduced what it called the first all-modality API subscription, with flat-rate access to text, speech, music, video, and image models, plus use in third-party harnesses .
  • Cursor Instant Grep: Cursor can now search millions of files and return results in milliseconds, which the company says materially speeds up agent task completion. Cursor also published the algorithms and tradeoffs behind the feature .
  • Factory Missions: Factory AI made Missions available to all users as long-running agents for large software tasks such as building applications from scratch, migrations, and AI research . Feedback highlighted the product as a particularly accessible implementation of long-running agents .
  • Littlebird: Littlebird launched as a desktop app and announced an $11M raise. The product reads across meetings, messages, documents, browsing, and recorded notes to build a broader context model of what the user is doing and cares about .

Industry Moves

Why it matters: Company moves this cycle point to the next layer of competition: enterprise automation, monetization, defense partnerships, and the economics of model development.

  • PlayerZero raised $20M: PlayerZero described itself as an Engineering World Model that automates debugging, fixing, and testing code on autopilot . The company said it connects code, telemetry, incidents, docs, customer tickets, Slack threads, PR reviews, and CI/CD history into a single context graph . PlayerZero said it has raised $20M and claimed customer outcomes including 30% more engineering bandwidth, 90% faster resolution, 95% of breaking changes caught, and 80% fewer support escalations .
  • OpenAI hired an ads leader: The Wall Street Journal reported that OpenAI hired former Meta advertising executive Dave Dugan to lead ad sales . Separate commentary said he will lead global ad solutions, signaling that OpenAI is getting serious about building an advertising business around ChatGPT and other products .
  • Cohere and Saab signed an AI collaboration MOU: Cohere said it signed a Memorandum of Understanding with Saab to explore advanced AI partnerships for aerospace platforms and deliver tailored AI solutions critical to Saab’s operations .
  • Final training runs are only a minority of R&D compute spend: Epoch AI estimated that across OpenAI, MiniMax, and Z.ai, less than 30% of R&D compute spending goes to final training runs, with the rest going to experiments, synthetic data generation, and other workloads . Epoch’s earlier estimate for OpenAI alone was about 10% of $5B in 2024 R&D compute spending .
  • Coding tool loyalty remains low: The Information reported that hundreds of Notion engineers are switching from Cursor to Anthropic’s Claude Code and OpenAI’s Codex, alongside the broader point that engineers are quick to move when a better coding tool appears .

Policy & Regulation

Why it matters: Government and multilateral institutions are moving from abstract AI concern to named bureaucracies, concrete risk language, and supply-chain scrutiny.

  • U.S. State Department: The State Department said it is launching a Bureau of Emerging Threats to address current and future threats in cyberspace, outer space, critical infrastructure, cyberattacks, and AI risks .
  • UN-linked AI deception brief: ScienceBoard_UN released a brief defining AI deception as systems misleading people about what they know, intend, or can do, warning that this could undermine oversight, fuel misinformation, and create serious global risks as systems grow more capable . Yoshua Bengio said evidence of deceptive behavior has already appeared in widely used AI systems and that the risk should grow as systems become more capable, autonomous, and embedded in decision-making .
  • Pentagon supply-chain tension around Claude: A report summarized in the notes said the Pentagon is moving to integrate Palantir’s AI as a core system across U.S. military operations, but that deeper Maven adoption is complicated by use of Anthropic’s Claude, which Reuters previously reported had been deemed a supply-chain risk amid a dispute over AI safety guardrails .

Quick Takes

Why it matters: These are smaller updates, but each points to a live thread in models, agents, robotics, or evaluation.

  • Jensen Huang said, “I think we’ve achieved AGI,” while also saying AGI is hard to define because there is no uniform standard and that 2026 could be a turning point; Yuchenj_UW said he disagrees with Huang’s definition while still finding the perspective interesting .
  • Figure 03 was described as fully autonomous, reasoning from camera pixels and computing torque to control more than 30 motors .
  • AMD open-sourced Apex, an end-to-end agent using Claude Code plus Codex to optimize AMD kernels through iteration and feedback rather than one-shot code generation .
  • LiteParse added URL parsing and buffer or stream support, letting agents read internet PDFs in seconds without using a VLM under the hood .
  • OpenClaw v2026.3.22 added a ClawHub plugin marketplace, MiniMax M2.7 and GPT-5.4 mini/nano support, per-agent reasoning, OpenShell plus SSH sandboxes, and more search integrations .
  • Roboflow’s RF-DETR 1.6 update makes fine-tuning 30% faster without accuracy loss, building on the earlier Apache 2.0 real-time segmentation release .
  • Qwen3.5 can score very high on AIME and LiveCodeBench yet remain unstable across repeated runs; one example said 32 runs on AIME can produce 32 different outcomes, which is why some benchmark builders are working on less brittle evals .