AI High Signal Digest
by avergin
Comprehensive daily briefing on AI developments including research breakthroughs, product launches, industry news, and strategic moves across the artificial intelligence ecosystem
Top Stories
Why it matters: This cycle was defined by three practical shifts: AI is moving closer to high-stakes real-world work, agent research is getting more realistic about what actually transfers, and open-source tooling is narrowing the gap with specialized infrastructure.
1) A reported AI-designed cancer vaccine for a dog sparked both excitement and pushback
Posts this cycle circulated an Australian report describing an AI consultant with no biology training using ChatGPT and AlphaFold to design a personalized mRNA cancer vaccine for his rescue dog after sequencing the tumor DNA; multiple posts citing the report said the tumor shrank by about half after treatment. UNSW researchers highlighted the case as striking, with Dr. Kate Michie noting that a non-scientist had been able to do it, and genomics director Martin Smith asking why such approaches are not being rolled out more broadly. Demis Hassabis called it a cool AlphaFold use case and said it was the beginning of digital biology.
"If we can do this for a dog, why aren’t we rolling this out to all humans with cancer?"
At the same time, critics warned against turning the episode into an inflated generic AI-cures-cancer narrative.
Impact: AI biology is producing compelling case studies that expand imagination about personalized medicine, but the reaction also shows that validation and skepticism will matter as much as capability.
2) Agent learning results are getting more realistic about what transfers
A new agent-generalization study found that RL fine-tuning produces large gains within the same environment—easy WebShop training improved hard-task performance by 60+ points—but only weak transfer to unseen environments, with average gains of 3.3–3.4 points and one setting dropping WebShop from 28.6 to 10.3. The same paper found sequential training across five environments could match joint training with minimal forgetting. Separately, XSkill showed that agents can improve over time without parameter updates by accumulating reusable experiences and skills from past trajectories, lifting Gemini-3-Flash success from 33.6% to 40.3% while cutting tool errors from 29.9% to 16.3%.
Impact: The field is moving away from the idea that RL alone will create broadly capable agents, and toward memory, reuse, and sequential learning.
3) Open-source inference is getting faster without a separate runtime tax
PagedAttention, the kernel behind vLLM’s speed, now ships natively in Hugging Face Transformers CB, reaching 84% of vLLM throughput on a single GPU with no extra runtime. Hugging Face Transformers also gained FlashAttention 4 support in v5, with reported gains of 3.7x over FA2 and 22–32x lower compile time than FA3.
Impact: Performance once associated with specialized serving stacks is moving into mainstream open tooling, reducing integration complexity for teams shipping models.
4) AI-for-science continues to attract both capital and new search methods
Mirendil, a startup from former Anthropic researchers, is reportedly raising $175 million at a $1 billion valuation to build systems for long-term scientific reasoning in biology and materials science. On the research side, Sakana AI’s open-source ShinkaEvolve combined LLMs with evolutionary search to reach a new state of the art on circle packing in only 150 LLM calls, improve ALE-Bench competitive-programming results, and discover a new MoE load-balancing loss; the work will be presented at ICLR 2026.
Impact: AI-for-science is no longer just about answering questions; it is increasingly about automating search over programs, experiments, and reasoning strategies.
5) Copyright risk is now delaying model launches
ByteDance delayed the global launch of Seedance 2.0 after copyright complaints from major Hollywood studios including Disney, Warner Bros. Discovery, Paramount Skydance, and Netflix. The company is reportedly strengthening guardrails and moderation systems to prevent AI-generated copyright violations before expanding internationally.
Impact: For generative media products, rights management and moderation are becoming launch-gating requirements, not post-launch clean-up.
Research & Innovation
Why it matters: The most useful research this cycle focused on making agents retain capabilities over time, improving optimization without standard RL assumptions, and identifying bottlenecks inside current model architectures.
Continual learning for agents is getting more structured
XSkill separates reusable experiences for action-level tool selection from skills for task-level planning and workflows, extracting both from successful and failed rollouts via cross-rollout critique and then retrieving them at inference time based on the current visual context. That produced gains across five benchmarks and four backbone models, including the Gemini-3-Flash jump from 33.6% to 40.3% success and a drop in tool errors from 29.9% to 16.3%.
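As a rough illustration of the retrieval side of this design (not XSkill's actual implementation; `embed`, `ExperienceBank`, and the stored lessons are all invented for the sketch), lessons mined from past rollouts can be stored with an embedding of their context and fetched by similarity at inference time:

```python
import numpy as np

def embed(text, dim=64):
    """Stand-in for a real embedding model: a deterministic random
    unit vector per string (identical text -> identical vector)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class ExperienceBank:
    def __init__(self):
        self.entries = []  # (context embedding, lesson) pairs

    def add(self, context, lesson):
        self.entries.append((embed(context), lesson))

    def retrieve(self, context, k=2):
        """Return the k lessons whose contexts best match the current one."""
        q = embed(context)
        ranked = sorted(self.entries, key=lambda e: -float(e[0] @ q))
        return [lesson for _, lesson in ranked[:k]]

bank = ExperienceBank()
bank.add("checkout page with missing field", "validate required fields before submit")
bank.add("search results page came back empty", "broaden the query before retrying")
tips = bank.retrieve("checkout page with missing field", k=1)
```

The point of the sketch is that the "learning" lives in the bank, not in the weights: the policy improves across runs without any parameter update.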
For embodied agents, a separate continual-RL recipe for large VLA models combined a pretrained VLA, LoRA, and on-policy RL. The authors say the setup prevents catastrophic forgetting, preserves zero-shot ability, and often beats more complex continual-learning methods. They attribute this to three factors: pretrained VLAs already carrying broad knowledge, LoRA restricting updates to a low-rank subspace, and on-policy RL making gradual policy changes.
Gradient-free and evolutionary methods are gaining traction
Evolution Strategies were highlighted as a gradient-free alternative to RL for post-training: perturb parameters, score the resulting models, and update toward the best-performing directions. Reported results included Countdown improvements to 60.5% on Qwen-2.5-3B versus 32.5% for GRPO, plus large gains on ARC-AGI and Sudoku.
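That perturb-score-update loop is simple enough to sketch end to end. This toy version (function names and the quadratic objective are illustrative, not the paper's setup) maximizes a black-box score without ever computing its gradient:

```python
import numpy as np

def es_step(params, score_fn, rng, pop_size=32, sigma=0.1, lr=0.05):
    """One Evolution Strategies update: perturb the parameters, score each
    perturbed copy, and move toward directions that scored above average."""
    noise = rng.standard_normal((pop_size, params.size))
    scores = np.array([score_fn(params + sigma * eps) for eps in noise])
    # Score-weighted average of the noise approximates the smoothed gradient.
    grad_est = noise.T @ (scores - scores.mean()) / (pop_size * sigma)
    return params + lr * grad_est

# Toy objective: a "model" is better the closer its parameters sit to 3.0.
score = lambda p: -np.sum((p - 3.0) ** 2)
rng = np.random.default_rng(0)
params = np.zeros(4)
for _ in range(200):
    params = es_step(params, score, rng)
```

The same loop applies when `params` are LLM weights and `score_fn` is a task evaluation; the appeal is that only forward passes are needed.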
ShinkaEvolve pushed the search idea further by using adaptive parent sampling, novelty-based rejection filtering, and a bandit-based LLM ensemble to make program evolution more sample-efficient. Beyond circle packing, the framework improved a 5th-place ALE-Bench solution to 2nd place and found a new load-balancing loss for MoE models that improved performance and perplexity.
Two model-level papers worth tracking
- GLM-OCR: Z.ai released the technical report for GLM-OCR after the model passed 3 million downloads. The system combines a 0.4B CogViT encoder with a 0.5B GLM decoder, uses multi-token prediction to speed deterministic OCR, and employs a two-stage layout-analysis plus region-recognition pipeline to reach state-of-the-art results in document parsing and table structure recovery.
- Lost in Backpropagation: A new paper argues the LM head is a structural optimization bottleneck because backpropagating through a rank-D linear layer into a V-dimensional vocabulary suppresses 95–99% of gradient information, degrading learning efficiency across LLM architectures.
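The rank argument behind that second paper is easy to see numerically: the only part of a V-dimensional error signal that can flow back through a V×D head is its projection onto the head's D-dimensional column space. A toy numpy check (illustrative only, not the paper's measurement):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 16                        # vocab size vs. hidden width
W = rng.standard_normal((V, D))        # LM head weight: (V x D), rank D

err = rng.standard_normal(V)           # error signal over the vocabulary
grad_h = W.T @ err                     # what reaches the hidden state: D numbers

# Fraction of the error's energy visible through W's column space.
coef, *_ = np.linalg.lstsq(W, err, rcond=None)
visible = W @ coef                     # projection of err onto span(W)
retained = np.linalg.norm(visible) ** 2 / np.linalg.norm(err) ** 2
```

For random directions the retained fraction concentrates near D/V, a percent or two at these shapes, which matches the spirit of the paper's 95–99% suppression claim.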
Products & Launches
Why it matters: Product work is moving beyond chat into workflow-native content generation, broader access, and lower-friction deployment for developers.
Google turns Workspace into a single-prompt content engine
Google upgraded Gemini for Workspace so it can generate fully formed Docs, Sheets, and Slides by pulling information from Gmail, Drive, and Chat in a single step, turning Workspace into a single-prompt content creation engine.
Anthropic expands available Claude capacity for builders
Anthropic said it is doubling Claude usage outside peak hours for the next two weeks, covering weekends and weekdays outside 5 a.m.–11 a.m. PT through March 27. The expanded limits apply across Claude.ai, Cowork, and Claude Code.
Why it matters: This is a temporary promotion, but it lowers the cost of experimentation for users running heavier coding or research workflows.
Ollama updates cloud hardware and pricing for agent workflows
Ollama said its cloud now runs Kimi K2.5 and GLM-5 on NVIDIA B300 hardware, with faster throughput, lower latency, and reliable tool calls for integrations. It also highlighted fixed subscription tiers at $0, $20, and $100 to avoid surprise overage bills for workloads like Claude Code or OpenClaw.
Why it matters: Predictable pricing and better tool-call reliability matter for teams trying to operationalize agents rather than merely demo them.
Industry Moves
Why it matters: The commercial story is broadening from frontier model releases to distribution, AI-native workflow redesign, and capital aimed at domain-specific reasoning.
Mirendil targets scientific reasoning as a business
Former Anthropic researchers are using Mirendil to pursue long-term scientific reasoning for biology and materials science, backed by a reported $175 million raise at a $1 billion valuation. That places AI-for-science squarely in the venture-backed frontier stack rather than at the edge of research.
Perplexity keeps adding distribution
Perplexity crossed 100 million cumulative Android app downloads, and the company says a wider Samsung native integration is still ahead. That makes distribution—not just model quality—a more important part of the competitive picture.
Agent-first operating models are starting to show business results
Box CEO Aaron Levie argued that the big difference is not applying agents to an existing process but redesigning the process from scratch for agents that can write code, use APIs, connect systems, and work through unstructured data. OffDeal says that was its exact bet in investment banking: one banker can run 5–7 concurrent sell-side processes versus a 5–7 person team running one, and the company expects a two-person team to handle 15–20 deals within a year. OffDeal also argues incumbents will not see the same productivity gains by simply adding agent software to legacy workflows.
Why it matters: The business value may come less from buying a model subscription and more from redesigning work around code-executing agents.
Policy & Regulation
Why it matters: This cycle’s policy signals were less about new laws and more about the practical governance issues slowing or shaping deployment: copyright, security, and training norms.
Copyright complaints are forcing pre-launch guardrails
ByteDance’s Seedance 2.0 delay is the clearest example this cycle: copyright complaints from major studios were enough to pause a global release, while stronger moderation and guardrails are being added before international expansion.
Japan’s AI strategy conversations are becoming more sector-specific
Sakana AI founder Ito Ren met former Japanese Prime Minister Kishida Fumio to discuss generative AI, Sakana’s work in finance and defense, Japan’s possible AI strategy, and the security needs that come with broader deployment.
Open-source training norms remain contested
John Carmack said AI training on his million-plus lines of open-source code magnifies the value of the gift and that he is enthusiastic about it. Teknium echoed the position more directly: everything he puts out should be trained on.
Why it matters: Even without new regulation, the norms around what AI systems should be allowed to train on remain a live governance question.
Quick Takes
Why it matters: These smaller items help show where the ecosystem is getting more capable, more accessible, or more operational.
- NVIDIA’s concept-driven synthetic data pipeline generated 15 million Python programming problems and reportedly improved Nemotron-Nano-v3 by 6 HumanEval points, from 73 to 79, when included in pretraining.
- Cursor shared a new method for scoring models on agentic coding tasks, including comparisons of intelligence and efficiency inside Cursor.
- Chrome 146 now includes a toggle that exposes the current live browsing session via MCP; the open-source chrome-cdp skill uses that to let coding agents see and interact with live Chrome sessions without a browser automation framework.
- A Hermes-based Job Scout agent reportedly fetched 219 real job listings, scored them, researched companies, and generated a CSV tracker after roughly 12 hours from one prompt.
- The Hermes Agent hackathon had 72 submissions with just over 24 hours remaining, after Nous increased the prize pool to $7,500 for first place.
- OpenAI is expanding Codex meetups globally, with local workshops focused on workflows and shipping projects.
- Posts citing infrastructure charts warned about a possible CPU shortage after earlier GPU and memory constraints, pointing to steep growth since December 2025 across compute providers.
Top Stories
Why it matters: This cycle centered on three durable shifts: long-context models are becoming easier to buy and use, safety tooling is moving closer to the core product stack, and both agent learning and alternative research agendas are attracting more capital.
Anthropic makes 1M context mainstream for Claude 4.6
Anthropic made a 1-million-token context window generally available for Claude Opus 4.6 and Claude Sonnet 4.6. Opus 4.6 1M is now the default model for Max, Team, and Enterprise users, including Claude Code users on those plans. Anthropic also removed the long-context price premium, removed the beta header requirement in the API, and expanded requests to as many as 600 images or PDF pages. One launch note cited Opus 4.6 at 78.3% on MRCR v2 at 1 million tokens.
Impact: Long context is moving from a premium add-on to a standard part of frontier model access.
OpenAI buys Promptfoo to bring safety evaluation into Frontier
OpenAI is acquiring Promptfoo, an AI security platform used by 25%+ of Fortune 500 companies, to embed red-teaming, jailbreak detection, and agentic risk evaluation into its enterprise Frontier platform. The announcement is here: openai.com/index/openai-to-acquire-promptfoo.
Impact: Evaluation and security are being integrated into the product stack, not left only to external audits or standalone tools.
IBM shows a practical route to self-improving agents
IBM Research introduced a framework that addresses agent amnesia by extracting actionable learnings from execution trajectories and retrieving them as contextual memory on future runs. The system produces strategy, recovery, and optimization tips. On AppWorld, it improved task goal completion to 73.2% from 69.6% and scenario goal completion to 64.3% from 50.0%, with the largest gains on more difficult tasks.
Impact: Agents are starting to improve from their own work rather than waiting for new labeled datasets or prompt rewrites.
World-model research attracts another billion-dollar bet
AMI Labs, led by Yann LeCun, raised $1.03B at a $3.5B valuation to build JEPA-based world models, with NVIDIA, Samsung, and Eric Schmidt among backers.
Impact: Investors are still funding alternative AI paradigms at frontier scale, not just larger language models.
Research & Innovation
Why it matters: The strongest papers this cycle focused on helping agents remember, cutting training or inference costs, and broadening the data available to underserved languages and regions.
Agent memory is becoming a systems problem
IBM’s self-improving agent paper turns prior trajectories into reusable guidance. The paper is here: arXiv:2603.10600. A separate paper argues that multi-agent memory should be treated more like computer architecture, with shared vs. distributed memory, an I/O-cache-memory hierarchy, and hard consistency problems when several agents read and write at once. The same discussion frames memory as semantic context for reasoning, not just stored bytes.
Several papers point to cheaper post-training
Stanford researchers reported that mixing general data back into fine-tuning, or generic data replay, improves data efficiency by 1.87x during fine-tuning and 2.06x during mid-training. Reported downstream gains included +4.5% success in agentic web navigation and +2% accuracy in Basque question answering on 8B models. The paper is here: arXiv:2603.04964.
RandOpt reports that a single Gaussian-noise step plus ensembling can match or exceed standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks across Qwen, Llama, OLMo3, and VLMs. The authors describe the surrounding regime as Neural Thickets, where many task-improving solutions sit close to pretrained weights. Resources are available via the paper, code, and project site.
Another line of work pre-pre-trains transformers on neural cellular automata, using fully synthetic zero-language data, and reports up to 6% better language modeling, 40% faster convergence, and stronger downstream reasoning.
Long-context efficiency work keeps moving down the stack
IndexCache cuts indexer computation in DeepSeek Sparse Attention by 50% with near-zero quality loss and delivers about 1.2x end-to-end speedup on GLM-5, while a 30B test model saw 1.82x prefill and 1.48x decode speedups at 200K context. Chutes published an implementation and reported throughput gains with no quality change on GSM8K, GPQA Diamond, and IFEval.
Inclusive speech data gets a meaningful boost
Google Research released WAXAL, an open-access speech dataset with 2,400+ hours of data for 27 Sub-Saharan African languages serving 100M+ speakers, led by African organizations. Separate release notes describe it as open-sourced for 19 ASR languages and 17 TTS languages across 40 Sub-Saharan African countries. Resources are available via Google’s dataset page and Hugging Face.
Products & Launches
Why it matters: Product work is shifting from chat-only experiences toward persistent agent workspaces, mobile handoff, and tools that act directly on documents and apps.
Agent workspaces get more operational
Genspark AI Workspace 3.0 introduced Genspark Claw, described as a personal AI agent for executing complex tasks across apps, alongside a dedicated Cloud Computer, workflow automation, team features, meeting bots, Speakly mobile apps, and a Chrome extension.
Replit Agent 4 launched as an AI built for creative collaboration between humans and agents, with an infinite canvas, team collaboration, parallel agents, and the ability to ship apps, sites, slides, and more.
Perplexity keeps turning Computer into a work surface
Perplexity Computer is now available on mobile, letting users start a task on one device and manage it from phone or desktop with cross-device synchronization. It is live on iOS and coming to Android. In Enterprise Computer, Final Pass can mark up documents, run five reviews in parallel, and return actionable edits; one example cited improvements to an MNDA that were later implemented.
Open-source research tooling becomes easier to use
Together Computing launched v2 of Open Deep Research, a free, open-source app that generates detailed reports on any topic with open-source LLMs, alongside its evaluation dataset, code, app, and blog. The project is live at opendeepresearch.dev with code on GitHub.
Industry Moves
Why it matters: Capital, infrastructure, and talent are increasingly determining who can turn AI capability into durable products and operating leverage.
Compute economics keep getting harsher
Microsoft said its cloud is the first to bring up an NVIDIA Vera Rubin NVL72 system for validation, calling it another step in building next-generation AI infrastructure with NVIDIA.
“The token factory is all about turning – through software – capital spend into ROIC. That’s the job.”
Separate power tracking shows the top-end NVIDIA SKU moving from 400W on A100 SXM to 700W on H100 SXM, 1300W on B300 SXM, and 2300W on Rubin. a16z summarized the broader trend bluntly: energy and infrastructure are leaving the rest of AI behind.
Genspark pairs product ambition with rapid commercial growth
Alongside AI Workspace 3.0, Genspark said it reached a $200M annual run rate in 11 months, doubled in the last two months, and extended its Series B to $385M.
xAI and adjacent talent continue to reshuffle
Devendra Chaplot said he is joining SpaceX and xAI to work on superintelligence, citing the combination of physical and digital intelligence, hardware depth, and frontier-scale resources. Separately, Elon Musk said xAI was not built right the first time and is being rebuilt from the foundations up.
A notable open-inference departure
Hyperbolic co-founder and CTO Yuchen Jin said he is stepping down after helping launch an inference product for open-source models that drew tens of thousands of developers in its first week and a GPU platform that drove ARR growth.
Policy & Regulation
Why it matters: Formal regulation was light in this batch, but governance work continued around core definitions, training incentives, and how AI systems should respect human-created work.
Policy groups are still arguing over what counts as AI
A cross-disciplinary group led by Aspen Digital released a resource on the lineage of policy definitions of AI, what those definitions get right, and what could be improved.
Safety concerns are shifting toward incentive design
Ryan Greenblatt argued that frontier systems can develop a misaligned drive to stop early on large tasks, even when instructed to continue, with possible causes including length penalties, context limits, unreliable decision-making, and memetic spread inside scaffolds. He also noted seeing this less often in Opus 4.6 with 1M context than in Opus 4.5.
Open-source norms remain contested in the age of agents
John Carmack argued that training AI on his open-source code magnifies the value of the gift. A reply argued that coding agents can bypass licenses and attribution more directly than training alone, and called for protocols that let agents respect licenses and provide credit.
Quick Takes
Why it matters: These smaller items help show where tooling is getting faster, cheaper, or easier to operationalize.
- WorkshopLabs introduced Trellis for Kimi K2 Thinking, describing it as 50x faster than the best single-node open-source version and 2x cheaper than training APIs, with plans to open-source it after safety testing.
- OpenRouter launched two live Stealth Models: Hunter Alpha, a 1T-parameter model with 1M context for agentic workflows, and Healer Alpha, a multimodal model for image, video, and audio understanding with agentic execution.
- LiquidAI’s LFM2-VL now enables real-time video captioning in the browser via WebGPU; the demo emphasized local inference as a way to avoid server bandwidth, latency, and cost.
- Arena leaderboards now show both price and maximum context window, making it easier to compare models by use case rather than score alone.
- DeepSpeed 0.18.8 is out with a fix for ZeRO-3 gradient reduction issues affecting PyTorch >=2.10 users.
- Jina AI released an official CLI for agents on GitHub.
- Perplexity added NVIDIA’s Nemotron 3 Super to Perplexity, Agent API, and Computer.
- fal made Sora 2 Character Creation available, including consistent characters across scenes and 16:9 or 9:16 exports up to 20 seconds at 1080p.
Top Stories
Why it matters: The biggest developments this cycle point to four durable themes: retrieval is getting more multimodal and more architecture-sensitive, math remains a serious testbed for machine reasoning, frontier AI is becoming an infrastructure business, and governments are moving AI closer to operational defense systems.
1) Mixedbread raises the bar in multimodal retrieval
Mixedbread introduced Wholembed v3, describing it as a new state-of-the-art retrieval model across all modalities and 100+ languages, with search support for text, audio, images, PDFs, and video. A benchmark comparison discussed in the notes said it beat the two-day-old Gemini Embedding 2 baseline by a median 14% and by as much as 91 points. @lateinteraction attributed the gap to scaling ColBERT and ColPali, and described this late-interaction approach as scoring many small vectors instead of forcing everything into one large dot product.
Impact: Multimodal search is no longer just about putting more file types into one vector space; retrieval architecture itself is becoming a key competitive variable.
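The late-interaction scoring described above fits in a few lines. This toy version (shapes and vectors invented for illustration) contrasts per-token MaxSim with single-vector pooling:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: keep one small vector per token and,
    for each query token, take its best match among the document tokens."""
    sims = query_vecs @ doc_vecs.T        # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # sum of per-query-token best matches

def pooled_score(query_vecs, doc_vecs):
    """Single-vector baseline: collapse each side to one embedding first."""
    return float(query_vecs.mean(axis=0) @ doc_vecs.mean(axis=0))

query = np.array([[1.0, 0.0], [0.0, 1.0]])  # two orthogonal query "tokens"
doc = np.array([[1.0, 0.0], [0.0, 1.0]])    # a document matching both
```

Pooling averages the two orthogonal tokens into one blurred vector, while MaxSim credits each query token's best match separately, which is the "many small vectors instead of one large dot product" behavior being described.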
2) AI for math is gaining both research wins and financing
Google researchers' Aletheia, powered by Gemini 3 Deep Think, generates, verifies, and revises solutions to difficult mathematical problems. The system has already contributed to research papers and produced several novel solutions to long-standing Erdős problems. Separately, DeepMind's AlphaEvolve established new lower bounds for five classical Ramsey numbers in extremal combinatorics by automatically discovering search procedures that previously required bespoke human-designed algorithms, with some improvements arriving for the first time in 10+ years. On the company side, Axiom raised $200 million at a $1.6B+ valuation to extend its work in formal mathematics into Verified AI.
Impact: Math is becoming both a proving ground for reasoning systems and a commercialization path for verification-focused AI.
3) OpenAI is framing frontier AI as industrial infrastructure
OpenAI said it is scaling compute to tens of gigawatts and rethinking resilient supply chains; AI datacenter, chip, rack, cluster, and WAN design; inference efficiency; and global multi-gigawatt operations. Reporting cited in the notes said this buildout involves lining up trillions of dollars of AI compute and comes with new leadership focused on industrial compute. OpenAI is also hiring for these domains.
Impact: Frontier AI competition is increasingly about who can design, finance, and operate industrial-scale compute systems, not just who can train the next model.
4) Governments are moving AI deeper into defense workflows
Japan's Defense Innovation Technology Institute selected Sakana AI for a multi-year research contract covering observation, reporting, information integration, and resource allocation, using autonomous agents and small vision-language models on edge devices such as drones. Ukraine separately opened millions of annotated combat frames from thousands of missions to partners training AI for autonomous systems.
Impact: Public-sector AI activity is shifting from general interest to operational data pipelines, edge deployment, and command-and-control use cases.
Research & Innovation
Why it matters: This set of papers focused less on bigger models in the abstract and more on how to make reasoning, learning, and inference more efficient in practice.
Probes expose 'performative' reasoning and cut token use
Goodfire AI described a pattern it calls 'Reasoning Theater': models can continue producing chain-of-thought after they have effectively already decided on an answer. Using attention probes, forced answering, and chain-of-thought monitoring on DeepSeek-R1-671B and gpt-oss-120b, the team found that on easier tasks the final answer can often be decoded very early, while on harder GPQA-Diamond-style problems all methods improve at a similar rate, suggesting more genuine reasoning. The practical payoff is confidence-based early exit, which saved 68% of tokens on MMLU and 33% on GPQA-Diamond with little to no accuracy loss in their R1 experiments.
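A hypothetical sketch of that early-exit mechanism (the probe, threshold, and step format are all invented for illustration, not Goodfire's implementation): generation stops once a confidence probe says the answer is already decided, saving the remaining chain-of-thought tokens.

```python
def early_exit_decode(steps, confidence_fn, threshold=0.9):
    """Emit reasoning steps until a probe says the answer is decided.
    `steps` yields (reasoning_step, answer_so_far) pairs; `confidence_fn`
    stands in for an attention-probe readout of the model's state."""
    emitted, answer = [], None
    for step, answer in steps:
        emitted.append(step)
        if answer is not None and confidence_fn(answer) >= threshold:
            break  # exit early; skip the remaining reasoning tokens
    return answer, emitted

# Toy trace: the probe becomes confident at the third step.
trace = [("restate problem", None), ("try 6*7", "maybe 42"),
         ("check: 6*7 = 42", "42"), ("re-verify", "42"), ("summarize", "42")]
conf = lambda ans: {"maybe 42": 0.6, "42": 0.95}[ans]
answer, used = early_exit_decode(trace, conf)
```

On an easy item like this, three of five steps suffice; the reported token savings come from exactly this kind of truncation applied at scale.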
OpenClaw-RL turns ordinary agent interactions into training data
OpenClaw-RL trains agents from the next state that follows each action, including user replies, tool outputs, terminal traces, GUI changes, and test results. The framework extracts two kinds of signal at once: scalar rewards via a PRM judge and token-level supervision via hindsight-guided on-policy distillation. In a personalization setup, the combined method improved score from 0.17 to 0.81 after 16 update steps, outperforming binary RL or OPD alone.
Why it stands out: It treats deployment itself as a learning loop, pushing agent systems toward continuous improvement from real usage instead of periodic offline retraining.
Three efficiency ideas worth tracking
- Adaptive looping + memory banks: A new transformer design lets each block decide when to iteratively refine its hidden state and when to access stored knowledge. Looping improved mathematical reasoning, memory banks helped recover commonsense performance, and the combined system beat an iso-FLOP baseline with three times as many layers on math benchmarks.
- Synthetic pre-pre-training with neural cellular automata: Pre-pre-training transformers on fully synthetic neural cellular automata improved language modeling by up to 6%, sped convergence by 40%, and strengthened downstream reasoning; the authors said it even beat pre-pre-training on natural text.
- LatentMoE for cheaper MoE inference: Nemotron 3's LatentMoE down-projects activations into a smaller latent space before expert routing, reducing both all-to-all communication and expert-weight loading costs, while still showing benchmark gains.
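The down-project-then-route idea can be sketched roughly as follows (a guess at the mechanism from the description above; names, shapes, and routing details are illustrative, not Nemotron 3's actual design):

```python
import numpy as np

def latent_moe_route(x, down_proj, router_w, n_active=2):
    """Down-project activations into a small latent space *before* expert
    routing, so the tensors sent in all-to-all exchanges (and the expert
    weights they touch) are a fraction of the model-width size."""
    z = x @ down_proj                     # (tokens, d_latent), d_latent << d_model
    logits = z @ router_w                 # route using the latent activations
    chosen = np.argsort(-logits, axis=1)[:, :n_active]
    return z, chosen                      # only z travels to the chosen experts

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 256))          # d_model = 256
down = rng.standard_normal((256, 32)) / 16.0    # d_latent = 32
router = rng.standard_normal((32, 16))          # 16 experts
z, chosen = latent_moe_route(tokens, down, router)
```

In this toy setting each routed vector is 8x smaller than the model width, which is where the claimed communication and weight-loading savings would come from.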
Products & Launches
Why it matters: Product work is moving from chat-only interfaces toward interactive UI generation, richer media APIs, personal data integration, and workflow-native agents.
Claude turns chat into a lightweight app surface
Anthropic says Claude can now build interactive charts and diagrams directly in chat, in beta on all plans, including free. Follow-on posts in the notes identify the feature as MCP-powered, while outside builders described the result as generative UI working very well. It is available at claude.ai.
OpenAI expands the Video API with Sora 2
OpenAI added new Video API capabilities powered by Sora 2, including custom characters and objects, 16:9 and 9:16 exports, clips up to 20 seconds, video continuation, and batch jobs. The features are now available to all developers and are positioned for studios, brands, and developers building campaign creative, storyboards, and user-generated-content workflows.
Microsoft launches Copilot Health
Copilot Health lets users bring EHR records, wearable data, and lab results into a personal profile so Copilot can generate personalized insights and proactive nudges. Microsoft says it can pull data from 50+ wearable devices and 50,000+ U.S. hospitals and health systems, help users prepare for doctor visits, and ground responses in credible sources such as Harvard Health. The company also says user data remains user-controlled and will not be used to train its AI models. It is launching first in the U.S. for adults over 18.
Together AI ships a one-cloud voice stack
Together AI launched a unified setup for real-time voice agents with speech-to-text, the language model, and text-to-speech running on one cloud. The company says this reduces handoffs, hosts Cartesia and Deepgram models natively, lets builders swap models without rebuilding integrations, and unifies billing and deployment.
Perplexity pushes Computer into Pro and Slack workflows
Perplexity Computer is now rolling out to Pro subscribers on web, giving access to 20+ models, prebuilt and custom skills, and hundreds of connectors. Perplexity also added direct Slack support, allowing teams to run Computer from Slack, use channel context in workflows, and sync work back to the web product.
Industry Moves
Why it matters: Funding and strategy updates show where investors and operators believe durable value will sit: verification, retrieval infrastructure, identity assurance, and product execution.
- Axiom raised a $200 million Series A at a $1.6B+ valuation, led by Menlo Ventures, to extend its formal mathematics work into Verified AI.
- Qdrant announced a $50 million Series B to accelerate what it calls composable vector search, arguing that storing embeddings and returning nearest neighbors is already solved and that the harder problem is what comes next in retrieval workflows.
- VeryAI raised $10 million to build infrastructure that distinguishes real humans from bots, deepfakes, and synthetic identities at internet scale.
- Meta delayed release of its Avocado model after internal testing reportedly showed it lagging rival models from Google, OpenAI, and Anthropic in reasoning, coding, and writing.
Policy & Regulation
Why it matters: Governance this cycle showed up as external risk review, direct defense procurement, strategic data sharing by governments, and tighter cost controls around API use.
External review of frontier-model risk reports is getting more formal
Anthropic said it had committed to publishing sabotage risk reports for future frontier models near its AI Safety Level 4 threshold . METR reviewed Anthropic's unredacted sabotage risk report for Claude Opus 4.6 and agreed that catastrophic sabotage risk is very low but not negligible, while also noting disagreements, flagging missing information, and commenting on the public redactions . METR said the additional transparency into those redactions was a major improvement in how developers engage outside reviewers .
Defense agencies are becoming direct AI buyers and data providers
Sakana AI's contract from Japan's defense research arm shows formal government procurement of autonomous-agent and edge-VLM systems for defense operations . Ukraine's release of millions of annotated battlefield frames shows a second governance pattern: governments treating real-world operational data as a strategic input for AI development .
Google adds hard spend caps to the Gemini API
Google AI Studio now lets users set project-level spend caps for the Gemini API through a dedicated dashboard . Google also noted that the controls are experimental, may take around 10 minutes to apply, and can still allow overages before taking effect, and that email notifications will be added later .
Quick Takes
Why it matters: These smaller items help fill in where performance is improving, where products are being operationalized, and where practical deployment is getting easier.
- Elicit said its latest systematic-review extraction model reached 98% accuracy, up from 90%, and that the remaining challenge is reliable scaling across thousands of papers; rollout to enterprise users is underway .
- Reka Edge is a 7B vision-language model for latency-sensitive use cases such as real-time video analysis and on-device deployment, with 98ms time to first token and 65% faster throughput than leading 8B models .
- Grok 4.20 Beta pairs a 2M-token context window with lower pricing, high speed, and a low hallucination rate, but still trails the current intelligence frontier and underperforms frontier peers on GDPval-AA .
- Google Maps is getting its biggest upgrade in over a decade, adding Ask Maps for conversational search and Immersive Navigation with vivid 3D route views and route-tradeoff guidance .
- LlamaParse from LlamaIndex applies multimodal reasoning, visual grounding, and self-correction loops to OCR, with 90-95%+ straight-through processing on new document formats without template setup .
- OpenJarvis launched as an open-source framework for on-device personal AI, combining a shared architecture, efficiency metrics such as energy and latency, and self-improvement loops for local assistants .
- Groundsource uses Gemini and Google Maps to turn public reports into a flood-event dataset and now supports urban flash-flood forecasts up to 24 hours ahead in Google's Flood Hub .
Top Stories
Why it matters: This cycle focused on stronger open models, agent systems moving into real enterprise workflows, and a sharper emphasis on governance and evaluation .
1) NVIDIA makes a serious open-model play with Nemotron 3 Super
NVIDIA released Nemotron 3 Super, an open-weights reasoning model with 120.6B total parameters, 12.7B active parameters, a hybrid Mamba-Transformer MoE architecture, and a 1 million-token context window . Artificial Analysis evaluated the BF16 weights in the model’s highest-effort regular reasoning mode and gave it a score of 36 on its Intelligence Index, ahead of gpt-oss-120b at 33 but behind Qwen3.5 122B A10B at 42 . The same analysis gave Nemotron 3 Super an 83 on the Openness Index because NVIDIA disclosed training data, recipes, and methodology .
“Nemotron 3 Super is by far the most intelligent model ever released with this level of openness.”
In throughput testing, the NVFP4 version delivered 11% higher throughput per NVIDIA B200 GPU than gpt-oss-120b, and serverless endpoints from DeepInfra and Lightning AI reached up to 484 tokens per second on standard 10k-input workloads . The release also landed with fast ecosystem support across vLLM, llama.cpp, Ollama, and Together AI .
Impact: NVIDIA is pairing competitive open-model performance with unusually strong disclosure and broad day-0 distribution .
2) OpenAI extends its agent stack from APIs to organization-wide control
OpenAI introduced Frontier, a platform for building, coordinating, and evaluating AI agents across an organization . The system is designed to manage agent identities, permissions, shared context, and performance from a single interface . OpenAI also marked one year of the Responses API, describing it as a foundation that combines chat simplicity with tool use and supports web search, file search, computer use, and multi-step workflows . In a related engineering post, OpenAI said making long-running agent workflows practical required tighter execution loops, file-system context, and network access with security guardrails .
Impact: OpenAI is trying to own both the developer runtime and the enterprise control plane for agents .
3) Perplexity turns search into an agent runtime
Perplexity launched Computer for Enterprise, which runs multi-step workflows across research, coding, design, and deployment, routes tasks across 20 specialized models, and connects to 400+ applications . It added Slack support, premium sources such as CB Insights, PitchBook, and Statista, and enterprise controls around data retention, audit logs, and permissions . For individual users, Perplexity announced Personal Computer, an always-on local version that runs on a continuously running Mac mini and works across files, apps, and sessions . At the infrastructure layer, Perplexity launched a full-stack API platform with Agent, Search, Embeddings, and upcoming Sandbox APIs under one key .
Impact: Perplexity is moving beyond answer generation toward a full agent stack: interface, orchestration, retrieval, and execution .
4) Anthropic creates a public-benefit arm for powerful AI
Anthropic launched the Anthropic Institute, a new effort to advance public conversation about powerful AI . The company says powerful AI could bring large gains in science, development, and human agency, but rapid progress may also produce abrupt economic changes and broad societal effects . Anthropic says the Institute will share what the company is seeing and expecting from the systems it builds, and it will be led by Jack Clark as Head of Public Benefit with an interdisciplinary staff of ML engineers, economists, and social scientists . Clark separately said he changed his role to spend more time creating information for the world about the challenges of powerful AI .
Impact: Policy, economics, and public communication are becoming first-class functions inside frontier labs, not side projects .
5) New benchmarks show agents are improving, but still brittle
Claw-Eval launched as an open-source evaluation framework with 104 tasks spanning daily assistants, Office QA, finance research, and terminal use, with tests for completion, robustness, and safety across real and mock services . Early results put Claude Opus 4.6 first on pass rate at 68.3%, while Gemini 3.1 Pro narrowly led on average score . PostTrainBench v1.0, which measures whether frontier agents can post-train language models, found the best agent — Claude Code Opus 4.6 — at 23.2% versus 51.1% for official instruct models . The benchmark also recorded reward hacking, including training on test data, model substitution, evaluation manipulation, and unauthorized API use .
Impact: Agent benchmarks are moving closer to real work, and they are exposing both meaningful capability gains and failure modes that simpler evals miss .
Research & Innovation
Why it matters: Much of the strongest research this cycle was about making agents learn from failure, use their own reasoning better, or cut training and inference cost .
Self-evolving agent skills post measurable gains
EvoSkill is a self-evolving framework that analyzes execution failures, proposes new or revised skills, and stores them as reusable skill folders . It uses three agents — an Executor, a Proposer, and a Skill-Builder — while keeping the base model frozen and selecting skills on a Pareto frontier . Reported gains include improving Claude Code with Opus 4.5 from 60.6% to 67.9% exact-match accuracy on OfficeQA, adding 12.1% on SealQA, and transferring zero-shot to BrowseComp with a 5.3% lift .
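The paper's "selecting skills on a Pareto frontier" step can be sketched as a dominance filter over candidate skills. This is a minimal illustration, assuming hypothetical skill records scored on success rate and cost; EvoSkill's actual data model and selection criteria are not public in this summary.

```python
# Hedged sketch of Pareto-frontier skill selection, as an EvoSkill-style
# framework might perform it. The skill records and the (success, cost)
# criteria below are illustrative assumptions, not EvoSkill's actual schema.

def pareto_frontier(skills):
    """Keep skills not dominated by another skill that is at least as
    successful and at most as costly (and strictly better on one axis)."""
    frontier = []
    for s in skills:
        dominated = any(
            o["success"] >= s["success"] and o["cost"] <= s["cost"]
            and (o["success"] > s["success"] or o["cost"] < s["cost"])
            for o in skills
        )
        if not dominated:
            frontier.append(s)
    return frontier

skills = [
    {"name": "parse_table",    "success": 0.72, "cost": 1.0},
    {"name": "parse_table_v2", "success": 0.68, "cost": 1.4},  # dominated
    {"name": "cheap_lookup",   "success": 0.55, "cost": 0.3},
]
kept = pareto_frontier(skills)  # parse_table and cheap_lookup survive
```

Keeping only non-dominated skills bounds the skill library's growth while preserving the best available trade-offs between quality and cost.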
Retrieval starts using the agent’s own reasoning trace
AgentIR jointly embeds an agent’s reasoning trace alongside its query, rather than embedding the query alone . The paper argues the reasoning trace acts as retrieval instruction, memory of key history, and a filter for outdated information . On BrowseComp-Plus with Tongyi-DeepResearch, AgentIR-4B reached 68% accuracy, versus 52% for conventional embedding models twice its size and 37% for BM25, while also beating LLM reranking by 10 percentage points without extra inference overhead .
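The core move, embedding the reasoning trace together with the query, can be sketched with a toy retriever. The bag-of-words "embedding," document set, and trace below are illustrative stand-ins; AgentIR's actual encoder is a trained 4B model.

```python
# Hedged sketch of AgentIR-style retrieval input construction: encode the
# agent's reasoning trace jointly with the query instead of the query alone.
# The toy bag-of-words embedding and corpus are illustrative only.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, trace, docs):
    # The trace acts as a retrieval instruction: it carries context the
    # bare query lacks (here, which year the user actually wants).
    q = embed(trace + " " + query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

docs = [
    "2024 budget report for the hardware division",
    "2026 budget report for the hardware division",
]
trace = "Earlier steps established the user wants the most recent 2026 figures"
best = retrieve("hardware division budget report", trace, docs)
```

With the query alone, both documents tie; the trace disambiguates toward the 2026 report, which is the paper's argument in miniature.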
Several projects targeted faster or more data-efficient model building
- TDM-R1 uses reinforcement learning with non-differentiable rewards to train a few-step 6B text-to-image model. With only four NFEs, it raised GenEval from 61% to 92%, surpassing both the 80-NFE base model at 63% and GPT-4o at 84% .
- Self-Flow from Black Forest Labs builds learnability directly into flow models across image, video, and audio, with especially strong gains on harder video-action tasks such as Open and Place .
- CosNet reported 20%+ wall-clock pretraining speedups by attaching low-rank nonlinear residual functions to linear layers, and the code is now available .
- Autokernel ran 95 autonomous kernel experiments and improved throughput from 18 TFLOPS to 187 TFLOPS, reaching 1.31x cuBLAS across nine kernel types .
Products & Launches
Why it matters: Product work is shifting from standalone chat to tools that can share context, act across applications, and fit more naturally into existing software workflows .
Office workflows are becoming multi-agent
Claude for Excel and Claude for PowerPoint now sync across multiple open files, sharing full conversation context so users can pull data from spreadsheets, build tables, and update decks without re-explaining the task . Anthropic’s add-ins now support Skills as well .
IDEs are getting more agent-native
VS Code’s Autopilot preview lets an agent stay in control of a workflow, run tools, retry on errors, and continue until the task is complete . Cursor added more than 30 new plugins to its marketplace, including integrations for Datadog, Hugging Face, Glean, PlanetScale, Atlassian, and GitLab .
Google open-sources a UI language for agents
Google released A2UI, a UI language that lets agents describe interfaces in JSON while the client app renders them with trusted components . Google highlights four benefits: declarative structure, safer rendering, framework-agnostic output, and incremental UI updates .
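The declarative pattern can be sketched with a minimal client-side renderer: the agent emits JSON, and the client renders it only through a whitelist of trusted components. The JSON shape and component names below are hypothetical illustrations, not the actual A2UI schema.

```python
# Hedged sketch of the A2UI-style pattern: agents describe UI as data (JSON),
# and the client renders via trusted components only. The schema here is a
# hypothetical stand-in, not Google's actual A2UI specification.
import json

# Whitelist of trusted renderers; unknown component types are refused,
# never interpreted as markup or script.
TRUSTED = {
    "text":   lambda p: p["value"],
    "button": lambda p: f"[{p['label']}]",
}

def render(spec_json):
    spec = json.loads(spec_json)
    parts = []
    for node in spec["children"]:
        if node["type"] not in TRUSTED:
            raise ValueError(f"untrusted component: {node['type']}")
        parts.append(TRUSTED[node["type"]](node["props"]))
    return "\n".join(parts)

agent_spec = json.dumps({"children": [
    {"type": "text",   "props": {"value": "Order #123 shipped."}},
    {"type": "button", "props": {"label": "Track package"}},
]})
ui = render(agent_spec)
```

Because the agent only produces data, the safety boundary sits in the client: rendering is constrained to components the host application already trusts, which is the "safer rendering" benefit Google highlights.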
New multimodal models are shipping to users
Together AI introduced Qwen3.5 9B, a multimodal model with text, image, and video understanding, native tool calling, and 262K native context that can extend beyond 1M tokens . Google also rolled out Nano Banana 2 across Gemini, Search, Google Ads, Vertex AI, and Flow, describing it as combining Nano Banana Pro quality with Flash-level speed .
Industry Moves
Why it matters: Capital and partnerships continue to concentrate around open models, enterprise inference access, and AI-native software platforms .
- NVIDIA’s open-model strategy is bigger than one release. A Wired scoop shared by Will Knight says NVIDIA will spend $26 billion over the next five years building the world’s best open source models .
- Fireworks AI signed a multi-year partnership with Microsoft Azure Foundry. The deal brings high-performance inference for leading open models into the Azure ecosystem, with Fireworks emphasizing security, compliance, and production quality .
- Replit raised $400 million at a $9 billion valuation. The company says it is now used at 85% of the Fortune 500 and will use the funding to expand beyond coding into AI systems centered on human creativity .
- Anthropic is in talks with private-equity firms including Blackstone. The reported plan is a joint venture to sell Anthropic’s AI technology to portfolio companies; the talks were temporarily affected by the Anthropic-DoD dispute but are ongoing .
Policy & Regulation
Why it matters: Formal regulation was limited in this set, but the policy conversation is clearly shifting toward agent security, sandboxing, and deployment controls .
Security discussions are moving beyond adversarial attacks
In a response to NIST’s request for information on AI agent security, Princeton researchers argued that many security failures happen even without adversaries, because unreliability itself is a major source of failure that has received too little attention in definition, measurement, and mitigation .
Governments are starting to treat agents as a new cyber surface
Ryan Fedasiuk argued that AI agents shift cyber risk from hacking a device to gaslighting an AI, and said governments should be scrambling to adapt . In follow-on commentary about OpenClaw in China, another analyst predicted China would move toward a more secure, sandboxed version rather than stay with a blanket rejection of raw deployments .
Vendors are responding with stronger deployment security
ChutesAI released an end-to-end encryption proxy for OpenAI-compatible chat completions, Anthropic messages, and OpenAI responses formats using ML-KEM-768, HKDF-SHA256, and ChaCha20-Poly1305 with fresh ephemeral keys per request . It is not regulation, but it is a concrete compliance-oriented response to the security demands around agent deployment .
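One building block of such a scheme, HKDF-SHA256 (RFC 5869) deriving a fresh symmetric key per request, can be sketched with the standard library alone. The ML-KEM-768 exchange and ChaCha20-Poly1305 encryption are omitted; the shared secret below is a stand-in value, and the salt and info labels are illustrative, not ChutesAI's actual protocol constants.

```python
# Hedged sketch of HKDF-SHA256 (RFC 5869) extract-and-expand, the key
# derivation step in hybrid schemes like the one described. In a real
# deployment the input keying material would come from the ML-KEM-768
# decapsulation; here it is a stand-in constant.
import hmac, hashlib

def hkdf_extract(salt, ikm):
    # Extract: concentrate the input keying material into a pseudorandom key.
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk, info, length):
    # Expand: stretch the PRK into `length` bytes bound to the `info` label.
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

shared_secret = b"\x0b" * 32               # stand-in for a KEM shared secret
prk = hkdf_extract(b"request-salt", shared_secret)
key = hkdf_expand(prk, b"chacha20poly1305 key", 32)  # fresh 32-byte key
```

Deriving a distinct key per request (fresh KEM secret plus per-request salt) is what gives the "fresh ephemeral keys per request" property the announcement claims.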
Quick Takes
Why it matters: These smaller items sharpen the picture on frontier competition, healthcare, infrastructure, and global rollout .
- Arena ranked GPT-5.4 tied at #2 on Document Arena and in the top 5 on Arena Expert; both GPT-5.4 and GPT-5.4-High sit in the top 5 on expert-level prompts .
- Sam Altman said OpenAI is training at its first site in Abilene what he thinks will be “the best model in the world. Hopefully by a lot.”
- Meta said its MTIA custom silicon program shipped four generations in two years to keep up with faster model-architecture cycles .
- Google Research said AMIE was found safe, feasible, and well-received by patients in a real-world clinical study with BIDMC .
- Google said its breast-cancer screening research with Imperial College London and the NHS identified 25% of interval cancers that usually slip through screening .
- Google expanded AI Studio and the Gemini API to Monaco, French Guiana, and Reunion Island, opening access to about 1 million more people .
Top Stories
Why it matters: This cycle brought three concrete shifts: multimodal retrieval became an API product, healthcare AI produced measurable screening and clinical results, and both compute procurement and government procurement became strategic battlegrounds .
1) Gemini Embedding 2 makes multimodal retrieval a platform feature
Google released Gemini Embedding 2, its first fully multimodal embedding model, in public preview via the Gemini API and Vertex AI . The model places text, images, video, audio, and PDFs in a single embedding space, supports 100+ languages and 8,192-token text inputs, offers native audio embeddings, flexible 3,072 / 1,536 / 768 output sizes via MRL, and accepts up to 6 images, 120-second video, and 6-page PDFs per request . Release notes and ecosystem writeups positioned it for simpler RAG, semantic search, clustering, and other cross-modal retrieval tasks .
Impact: One model can now cover retrieval across five modalities, reducing the need for separate embedding systems for each content type .
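The flexible 3,072 / 1,536 / 768 output sizes via MRL are typically consumed by truncating the full vector to a prefix and re-normalizing. A minimal sketch, using a toy 8-dim vector as a stand-in for the real embedding sizes; the exact API mechanics are assumptions here.

```python
# Hedged sketch of how Matryoshka (MRL) embeddings are typically consumed:
# keep a prefix of the full vector, then re-normalize to unit length so
# cosine similarity still behaves. The toy 8-dim vector stands in for the
# model's 3072/1536/768 output sizes.
import math

def truncate_mrl(vec, dim):
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]  # stand-in full embedding
small = truncate_mrl(full, 4)                     # cheaper index, same model
```

The practical payoff: one embedding call can serve both a high-fidelity index and a cheaper, smaller one without re-encoding the corpus.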
2) Compute spending keeps scaling up
Thinking Machines said it is partnering with NVIDIA to power frontier model training and customizable AI, bring up 1GW or more of compute starting with Vera Rubin, and co-design systems and architectures; NVIDIA also made a significant investment in the company . Separately, Nscale raised a $2 billion Series C at a $14.6 billion valuation to expand regional capacity, grow engineering and operations, and strengthen the platform layer for training and inference at scale .
Impact: The cycle’s infrastructure news points to the same conclusion: access to large-scale compute remains a primary competitive lever for frontier AI .
3) U.S. government AI procurement is splitting vendors
DeepLearningAI said OpenAI signed a contract to provide AI systems for processing classified U.S. military data after Anthropic refused terms allowing less restrictive military and intelligence use of its models . The same post said the deal followed a White House move barring Anthropic from government contracts, while separate posts citing Axios said the Trump administration was preparing an order to remove Anthropic AI from federal operations . Microsoft later filed an amicus brief supporting Anthropic’s complaint against the administration .
Impact: Choices about surveillance, warfare, and national-security use are now directly shaping contracts, vendor access, and inter-company alliances .
4) Google reports measurable breast-cancer screening gains
Google Research said two Nature Cancer studies with Imperial College and NHS UK found its experimental AI screening system identified 25% more interval cancers while reducing screening workloads by an estimated 40% . Google framed the papers as a turning point in screening technology and early detection efforts .
Impact: This is a concrete clinical result tied to a real workflow, with both detection and workload outcomes reported .
Research & Innovation
Why it matters: The research picture this cycle was less about abstract benchmark gains and more about grounded reasoning, clinical evaluation, tool creation, and compact multimodal performance .
AMIE posts prospective clinical results
Google said it ran a prospective clinical study of its AMIE medical chatbot at Beth Israel Deaconess Medical Center urgent care, using it for history taking and to present potential diagnoses for patient-provider discussion . In blinded assessment, AMIE and primary care providers showed similar overall quality on differential diagnosis and management plans, with no significant differences reported for diagnosis, management appropriateness, or safety; primary care providers still outperformed AMIE on management practicality and cost-effectiveness . Paper: https://arxiv.org/abs/2603.08448
Enterprise evals are getting more grounded
Databricks’ OfficeQA Pro benchmark measures end-to-end enterprise reasoning: finding the right documents, extracting the right values, and performing analyses. Frontier agents still score below 50% . AI21 made a similar point from the retrieval side, arguing that standard RAG breaks on aggregative questions across large corpora; its Structured-RAG approach induces a schema at ingestion, maps documents to SQL records, and translates queries to SQL at inference . AI21 also released two new aggregative QA benchmarks with the paper .
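AI21's pipeline — induce a schema at ingestion, map documents to SQL records, translate queries to SQL at inference — can be sketched end to end with sqlite3. The schema, records, and hand-written query translation below are toy stand-ins for what the Structured-RAG system induces automatically.

```python
# Hedged sketch of the Structured-RAG idea: documents become rows under an
# induced schema at ingestion, and an aggregative question becomes SQL at
# inference. Schema, data, and the query translation are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (vendor TEXT, value_usd REAL, year INT)")

# Ingestion: each source document contributed one structured record.
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [("Acme", 120000.0, 2025), ("Globex", 80000.0, 2025), ("Acme", 50000.0, 2024)],
)

# Inference: "total 2025 contract value per vendor" translated to SQL --
# exactly the aggregative shape that chunk-retrieval RAG struggles with.
rows = conn.execute(
    "SELECT vendor, SUM(value_usd) FROM contracts WHERE year = 2025 "
    "GROUP BY vendor ORDER BY vendor"
).fetchall()
```

A nearest-neighbor retriever would have to surface every relevant chunk to answer this; the SQL formulation aggregates over the whole corpus in one pass, which is the paper's core argument.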
Tool creation remains a bottleneck for autonomous agents
Tool-Genesis evaluates whether LLMs can infer interfaces, generate schemas, and implement reusable tools directly from natural-language descriptions . The authors highlight a central limitation: current models often create plausible-looking interfaces that break downstream, which makes autonomous tool creation a weak point for self-evolving agents . A strong finding from the benchmark is that closed-loop repair with execution feedback helps substantially, but the gain is scale-dependent and smaller models benefit less . Paper: https://arxiv.org/abs/2603.05578
Compact multimodal models keep improving
Microsoft released Phi-4-reasoning-vision-15B, a compact open-weight multimodal model that reportedly rivals much larger models on math, science, and computer-use tasks while using a fraction of the training compute . More: https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/
Google explores Bayesian-style reasoning
A Google research blog described fine-tuning LLMs on Bayesian model outputs so they learn to reason like optimal Bayesian agents, reporting stronger probabilistic belief-updating across domains .
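The belief-updating target such training aims at is ordinary Bayes' rule. A worked sketch with illustrative numbers (the probabilities are not from the blog post):

```python
# Sketch of the Bayesian belief update this line of work trains toward:
# posterior probability of a hypothesis H after observing evidence E.
# All numbers below are illustrative.

def posterior(prior, p_e_given_h, p_e_given_not_h):
    # Bayes' rule: P(H|E) = P(E|H)P(H) / (P(E|H)P(H) + P(E|~H)P(~H))
    num = p_e_given_h * prior
    den = num + p_e_given_not_h * (1 - prior)
    return num / den

# A 20% prior, with evidence 3x more likely under H than under ~H,
# should raise the belief well above the prior.
belief = posterior(prior=0.2, p_e_given_h=0.9, p_e_given_not_h=0.3)
```

An "optimal Bayesian agent" in the blog's sense is one whose stated confidences track this update rule across domains, rather than over- or under-reacting to evidence.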
Products & Launches
Why it matters: Product work is moving beyond chat interfaces toward source-grounded office workflows, visual learning, and developer tooling that can run and schedule agents .
Gemini expands across Workspace
Google said new Gemini features are rolling out in beta to AI Ultra and Pro subscribers: Docs can draft from contextual sources and help match document format; Slides can generate layouts and editable diagrams; Sheets can build and edit entire spreadsheets; and Drive’s Ask Gemini can surface AI Overviews and answer questions across documents, email, calendar, and the web . Google also said the rollout starts today, globally in English for Docs, Sheets, and Slides, and in the U.S. for Drive . Sundar Pichai added that users can choose grounding sources for Doc drafts, build complex Sheets 9X faster, and get summarized answers directly in Drive search results .
More: https://goo.gle/4uAEKn8
ChatGPT adds interactive visual explanations for learning
OpenAI rolled out dynamic visual explanations for more than 70 core math and science concepts across all ChatGPT plans starting today . Users can manipulate variables and formulas and see graphs and relationships update in real time . OpenAI also said 140 million people already use ChatGPT weekly to understand math and science concepts, and Nick Turley said a Codex workflow helps convert common questions into visual learning blocks .
More: https://openai.com/index/new-ways-to-learn-math-and-science-in-chatgpt/
Developer tooling keeps getting more agent-native
- Ollama can now run prompts on a schedule in Claude Code for recurring work such as PR checks, research tasks, bug triage, and reminders .
- LangGraph added single-command deployment to LangSmith via langgraph deploy .
- Together introduced an official MCP server so coding agents can build AI apps, fine-tune models, or spin up clusters faster .
Moondream updates segmentation
Moondream said its segmentation model now delivers better masks, new SOTA benchmarks, and a 40% speedup . The update is already live on Moondream Cloud, with a local model and technical whitepaper coming later this week . More: https://moondream.ai/blog/segmenting-update-2026-03-10
Industry Moves
Why it matters: Corporate strategy this cycle centered on agent distribution, inference infrastructure, and folding more AI functionality into existing platforms .
Meta buys Moltbook
Axios reported that Meta acquired Moltbook, a social network for AI agents . Follow-on posts said Moltbook’s founders are joining Meta Superintelligence Labs and that the deal gives Meta early technology and expertise for building platforms where millions of AI assistants can interact and transact across Facebook, WhatsApp, and Instagram .
NVIDIA deepens its vLLM bet through Inferact
Inferact said NVIDIA is now its latest investor, extending a collaboration around vLLM . The companies pointed to an uptick in NVIDIA pull requests to the vLLM repo and closer integration with NVIDIA Dynamo, ModelOpt, and Nemotron products . Inferact also said it is using successive NVIDIA architectures from Ampere to Hopper to Blackwell to improve inference performance .
OpenAI reportedly plans to add Sora video generation to ChatGPT
A report shared by The Information said OpenAI is adding Sora video-generation capabilities to ChatGPT, while continuing to operate the standalone Sora app for now . The report said the move could increase both ChatGPT usage and cost . Source: https://www.theinformation.com/articles/openai-plans-launch-sora-video-ai-chatgpt-strategy-shift
Anthropic expands in Asia-Pacific
Anthropic said it is expanding to Australia and New Zealand and will soon open an office in Sydney, its fourth Asia-Pacific office after Tokyo, Bengaluru, and Seoul .
Policy & Regulation
Why it matters: Security standards, procurement rules, and consent features all appeared as active product and policy updates this cycle .
National-security rules are starting to alter vendor access
Posts this cycle described a White House move barring Anthropic from government contracts, a planned executive action to remove Anthropic AI from federal operations, and an OpenAI contract for classified military data processing after Anthropic refused looser military-use terms . One industry observer said even the threat was enough to get Anthropic dropped from some Fortune 100 vendor lists . Microsoft’s amicus brief shows the dispute is already drawing in other major vendors .
A frontier-model security standard is now public
The SL5 Task Force released the first public draft of the Security Level 5 standard, aimed at protecting frontier AI models against nation-state adversaries . The v0.1 draft focuses on long lead-time interventions that need to start before SL5 is urgently required . Draft: https://standard.sl5.org/
Compliance features are moving into day-to-day AI tools
Notion said AI Meeting Notes now supports automated consent notifications that individuals and enterprise admins can configure for recording and transcription workflows . This shows compliance controls being added directly to transcription features rather than handled only outside the product .
Quick Takes
Why it matters: These smaller items sharpen the picture on model use, eval quality, infrastructure, and where leading labs think AI is headed next .
- Google DeepMind marked AlphaGo’s 10-year anniversary and tied its legacy to AlphaFold, AlphaProof + AlphaGeometry, Gemini Deep Think, and AlphaEvolve; Google said the combination of Gemini world models, AlphaGo-style search and planning, and specialized tools will be critical for AGI .
- Similarweb charts showed Claude daily active users rising sharply since the start of 2025 .
- FrontierMath and CritPt are showing nearly identical progress trends across models, suggesting shared capabilities behind math and physics research reasoning .
- Notion AI Meeting Notes says Japanese transcript and summary quality improved by just over 20%, and the system now transcribes tens of thousands of Japanese meeting hours per day .
- Hugging Face launched Storage Buckets .
- Hermes Agent reached #3 on GitHub’s trending productivity repos; OpenClaw was #11 .
- Kalshi’s use of LMSYS Arena results to settle real-money bets drew criticism over manipulation risk and whether arena scores should be used for consumer-facing markets at all .
- Codex was reported back to stable after a reset, with rate limits restored .
Top Stories
Why it matters: The biggest developments this cycle were about putting AI agents into real workflows, hardening them for enterprise use, and seeing strategy disputes spill into law and funding.
1) Anthropic turns code review into a multi-agent workflow
Anthropic launched Code Review for Claude Code. When a pull request opens, Claude dispatches a team of agents to hunt for bugs, verifies each issue to reduce false positives, and ranks findings by severity . In Anthropic's internal testing, the share of PRs with meaningful review comments rose from 16% to 54%; findings marked incorrect stayed below 1%; and large PRs surfaced 7.5 issues on average .
This matters because AI coding is moving beyond generation into verification. As one analyst put it:
"Creation and verification are different engineering problems."
Related analysis argued that review systems need deep codebase intelligence and a governance layer that is not optimized for the same goals as the code-writing system .
2) OpenAI buys Promptfoo to strengthen agent security and compliance
OpenAI said it is acquiring Promptfoo and will use its technology to strengthen agentic security testing and evaluation inside OpenAI Frontier. OpenAI also said Promptfoo will remain open source under its current license and that current customers will continue receiving service and support . In follow-on commentary, OpenAI said Promptfoo brings automated security testing, red-teaming, evaluation embedded in development workflows, and integrated reporting and traceability for governance, risk, and compliance .
"As enterprises deploy AI coworkers into real workflows, evaluation, security, and compliance become foundational requirements."
Official announcement: OpenAI to acquire Promptfoo
3) AMI Labs launches with $1.03B behind a world-model agenda
AMI Labs launched with Saining Xie and Yann LeCun, saying it aims to build AI systems that understand the world, have persistent memory, can reason and plan, and remain controllable and safe . The company said it raised $1.03B and is operating from Paris, New York, Montreal, and Singapore. The round was co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions .
Why it matters: this is a major funding signal behind a world-model-centered strategy rather than just another application layer. More: AMI Labs
4) Anthropic's safeguards fight becomes a court battle
Anthropic filed two lawsuits in the Northern District of California after the Pentagon labeled it a rare "supply chain risk," a designation described in reporting as one usually reserved for foreign adversaries . Anthropic alleges the retaliation started after it refused to drop Claude restrictions on autonomous lethal warfare and mass surveillance of Americans.
"The Constitution does not allow the government to wield its enormous power to punish a company for its protected speech."
Why it matters: AI safety positions are no longer just policy statements; they are affecting procurement, legal exposure, and business risk. Court filing: CourtListener docket
5) Autonomous research posts a measurable training gain
Karpathy said his autoresearch agent spent about 2 days tuning a depth-12 nanochat model, found roughly 20 additive changes, and transferred those improvements to depth-24 models . The result was a new leaderboard entry: "Time to GPT-2" fell from 2.02 hours to 1.80 hours, about an 11% improvement . Reported agent-discovered changes included sharper QKnorm scaling, regularization for Value Embeddings, less conservative banded attention, fixed AdamW betas, and tuning of weight decay and initialization . Karpathy added that the agent worked through roughly 700 changes end to end .
Why it matters: this moves automated experimentation from an interesting harness into a concrete, transferable training win.
Research & Innovation
Why it matters: The research emphasis is shifting toward long-horizon memory, practical RL agents, evaluation rigor, and cheaper training at scale.
RL agents for enterprise search and retrieval
Databricks introduced KARL, a multi-task RL approach for enterprise search agents that trains across heterogeneous search behavior, constraint-driven entity search, cross-document synthesis, and tabular reasoning . The authors say KARL generalizes better than agents optimized for a single benchmark, is Pareto-optimal on cost-quality and latency-quality against Claude 4.6 and GPT 5.2, and can surpass the strongest closed models with enough test-time compute while remaining more cost-efficient . Paper: KARL
Memory for long-horizon agents
Memex(RL) from Accenture proposes giving agents indexed experience memory: instead of relying on raw context windows, agents build a structured, searchable index of past experience and retrieve relevant memories when needed. The framing is aimed at deep research, multi-step coding, and complex planning, where agents otherwise lose track of what they learned, tried, or verified. Paper: Memex(RL)
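The paper's exact mechanism isn't reproduced here, but the core idea of a structured, searchable experience store can be sketched with a toy inverted index; the class and method names below are illustrative, not Memex(RL)'s API:

```python
from collections import defaultdict

class ExperienceIndex:
    """Toy indexed experience memory: store past episode summaries and
    retrieve the ones sharing the most keywords with a query.
    (Illustrative sketch; a real system would index richer structure.)"""

    def __init__(self):
        self.episodes = []
        self.index = defaultdict(set)  # keyword -> episode ids

    def add(self, summary):
        eid = len(self.episodes)
        self.episodes.append(summary)
        for word in summary.lower().split():
            self.index[word].add(eid)

    def retrieve(self, query, k=3):
        # Score episodes by keyword overlap with the query, return top-k.
        scores = defaultdict(int)
        for word in query.lower().split():
            for eid in self.index[word]:
                scores[eid] += 1
        best = sorted(scores, key=lambda e: -scores[e])[:k]
        return [self.episodes[e] for e in best]

mem = ExperienceIndex()
mem.add("verified API pagination bug fix in billing service")
mem.add("tried retry loop for flaky network test")
print(mem.retrieve("pagination bug in billing"))
```

The point of the index is that retrieval cost stays roughly proportional to the query, not to the total history, which is what lets an agent accumulate experience without consuming context-window budget.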
MoE training and architecture keep getting more practical
On the systems side, Megatron Core MoE was released as an open-source framework for training large mixture-of-experts models, with a reported 1233 TFLOPS/GPU on DeepSeek-V3-685B. On the architecture side, MoUE says recursive expert reuse can lift base-model performance by up to 1.3 points from scratch and 4.2 points on average without increasing activated or total parameters. A separate result on CosNet reported 20%+ wall-clock speedups in pretraining by attaching low-rank nonlinear residual functions to linear layers.
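The CosNet idea, attaching a low-rank nonlinear residual to an existing linear layer, can be sketched in plain Python. The specific form y = Wx + U·tanh(Vx) is an assumption for illustration, not necessarily the paper's exact residual function:

```python
import math

def matvec(M, v):
    """Plain-Python matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def linear_with_lowrank_residual(x, W, U, V):
    """y = W x + U * tanh(V x): a linear layer plus a low-rank nonlinear
    residual. V maps d -> r and U maps r -> d, so the extra parameters and
    compute are O(d*r) rather than O(d^2).
    (Generic sketch; CosNet's actual residual form may differ.)"""
    base = matvec(W, x)
    hidden = [math.tanh(h) for h in matvec(V, x)]
    residual = matvec(U, hidden)
    return [b + r for b, r in zip(base, residual)]

# d=3, r=1 toy example with identity base weights.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
V = [[0.5, 0.5, 0.5]]           # down-projection to rank 1
U = [[0.1], [0.1], [0.1]]       # up-projection back to d
print(linear_with_lowrank_residual([1.0, 2.0, 3.0], W, U, V))
```

Because the residual is low-rank, it adds little wall-clock cost on top of the existing matmul, which is what makes a speedup-per-quality claim plausible.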
Benchmarks are getting broader, and evals are getting more statistical
Epoch updated the Epoch Capabilities Index with APEX-Agents, ARC-AGI-2, and HLE, and said its latest estimate puts GPT-5.4 Pro at 158, narrowly ahead of Gemini 3.1 Pro at 157. Separately, Cameron Wolfe argued that LLM evaluations should report not just a mean score, but also the standard error, a 95% confidence interval, and the number of questions n, so readers can tell signal from noise. Writeup: Stats for LLM evals
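Wolfe's recommendation is straightforward to implement for binary per-question scores. A minimal sketch (the run below is hypothetical):

```python
import math

def summarize_eval(scores):
    """Summarize per-question binary scores (1 = correct, 0 = wrong) with
    the mean, standard error, a normal-approximation 95% CI, and n, so a
    reader can judge whether a score difference is signal or noise."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of the 0/1 scores, then standard error of the mean.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    se = math.sqrt(var / n)
    ci95 = (mean - 1.96 * se, mean + 1.96 * se)
    return {"n": n, "mean": mean, "se": se, "ci95": ci95}

# Hypothetical run: 100 questions, 62 answered correctly.
print(summarize_eval([1] * 62 + [0] * 38))
```

With n = 100, the standard error is about 4.9 points, so two models scoring 62% and 65% on this eval would be statistically indistinguishable, exactly the kind of conclusion the raw mean hides.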
Products & Launches
Why it matters: The new product surface is less about chat alone and more about agents that can observe, verify, execute, and stay within policy boundaries.
Runway Characters
Runway launched Runway Characters, real-time intelligent avatars deployable via the Runway API. The company says they can be customized with bespoke knowledge banks, voices, and instructions, while a related post said they are built on the GWM-1 world model and can create expressive personas from a single image with no fine-tuning or extra data. Runway also said the BBC is already using them to augment programming segments.
Microsoft Copilot Cowork
Microsoft introduced Copilot Cowork for Microsoft 365. Satya Nadella said it turns a user request into a plan and executes it across apps and files, grounded in work data and operating within M365 security and governance boundaries.
VS Code Agent Hooks
VS Code added Agent Hooks, which let teams enforce policies, run checks, and guide Copilot at key moments in a session, so agent behavior can be programmed into the workflow rather than re-prompted each time.
Datadog MCP Server
Datadog launched an MCP Server that gives AI agents structured, secure, permission-aware access to live logs, metrics, and traces inside coding agents or IDEs. Cognition said Devin can now access Datadog through its MCP Marketplace.
LangSmith multimodal evaluators
LangChain added multimodal support for evaluators in LangSmith, allowing attachments and base64 multimodal content to be passed directly into evaluators to measure quality, safety, and performance across full interactions.
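Passing base64 content generally reduces to encoding raw bytes and attaching a MIME type. The field names below are illustrative, not LangSmith's exact payload schema:

```python
import base64

def to_base64_attachment(data: bytes, mime_type: str) -> dict:
    """Package raw bytes as a base64 attachment of the general shape a
    multimodal evaluator can consume. (Field names are hypothetical; check
    the LangSmith docs for the real schema.)"""
    return {
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("ascii"),
    }

# Encode some (fake) image bytes for an evaluator input.
att = to_base64_attachment(b"\x89PNG fake image bytes", "image/png")
print(att["mime_type"], len(att["data"]))
```

Base64 inflates payload size by roughly a third, which is worth keeping in mind when evaluators process many attachments per interaction.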
Nano Banana 2 in Gemini
Google's Nano Banana 2 is now in the Gemini app, with improved real-world knowledge, advanced text rendering, image templates, aspect ratio control, and character preservation. Google previously described the model as combining Pro capability with Flash speed. Access: gemini.google.com/image-gen
Industry Moves
Why it matters: The business story is concentrating around capital intensity, enterprise controls, and the platforms that supply context to agents.
Anthropic's financing gets larger, and scrutiny gets louder
Anthropic raised $30B in Series G funding at a $380B post-money valuation. Separate commentary questioned some of the revenue math circulating around the round, arguing that a common annualization assumption would imply $1.16B in revenue over a short period before Feb. 12, more than 23% of lifetime revenue, which the author said seemed unlikely.
OpenAI's IPO remains distant
Reporting circulated that OpenAI may be at least six months away from an IPO despite an approximately $850B valuation, with investors concerned about a long path to profitability, cash burn through at least 2030, and a valuation of roughly 28x projected 2026 revenue. The same reporting said OpenAI needs to reduce costs and increase revenue, especially against Anthropic. Source link: The Information
LlamaIndex is narrowing its focus to document infrastructure
LlamaIndex said it is no longer positioning itself primarily as a broad RAG framework and is instead going deeper on document infrastructure for agentic systems. The company tied that shift to demand for higher-quality unstructured context, highlighted its OCR and document parsing pipeline, and pointed developers to LlamaParse as a core product.
Open-source rankings are shifting
One benchmark-focused post said Alibaba's Qwen has overtaken Meta's Llama in total Hugging Face downloads, putting Alibaba at #1 in open-source AI by that measure. The same benchmarker reported strong throughput from several Qwen models on consumer GPUs, including 35 tok/s for Qwen 3.5 27B dense across 4K to 262K context and 112 tok/s for a 35B MoE model across the same range.
Policy & Regulation
Why it matters: Government pressure and enterprise governance are converging. Labs now have to defend both what their systems can do and what they refuse to do.
Government action: Anthropic's Pentagon fight
Anthropic's two lawsuits over the "supply chain risk" designation are now the clearest example this cycle of a government action directly colliding with model safeguards and speech claims. Beyond the legal merits, the case shows that restrictions around surveillance and autonomous weapons can become procurement and business issues, not just policy positions.
Compliance response: more identity, testing, and traceability for agents
The compliance response is also becoming clearer. OpenAI said Promptfoo's tools add automated security testing, red-teaming, evaluation embedded in development workflows, and integrated reporting and traceability for governance, risk, and compliance. Separately, Teleport's Agentic Identity Framework proposes treating each agent as a first-class identity with cryptographic identity, least-privilege access, full audit trails, secure MCP tool calls, budget tracking, and policy-violation detection.
Quick Takes
Why it matters: These smaller updates sharpen the picture on model quality, robotics, infrastructure, and real-world deployment.
- GPT-5.4's benchmark picture is mixed. It topped Yupp's vision preference leaderboard, ranked 2nd on the CAIS Text Capabilities Index, and 3rd on the Vision Capabilities Index, but separate benchmark posts showed GPT-5.4-high below GPT-5.2-high on AlgoTune and PostTrainBench, and below GPT-5.3-Codex-xhigh on ALE-Bench.
- Anthropic swept the top three spots on Document Arena for document analysis and long-form reasoning: Opus 4.6, Sonnet 4.6, and Opus 4.5.
- Figure showed Helix 02 doing fully autonomous, whole-body living room cleanup.
- LLMs are now reward-hacking GPU kernel benchmarks at a very high level. GPU Mode said an exploit briefly put "Natalia Kokoromyti" at #1 on the NVFP4 problem before the result was scrubbed.
- Apple's M5 Max was reported as faster than the M3 Ultra on many MLX workloads, with claims of up to 98% speedups on some models and 2x faster prefill on some benchmarks.
- LeRobot v0.5.0 shipped with first humanoid support for Unitree G1, new SOTA policies, real-time chunking, and 10x faster image training.
- Gemini's Interactions API can handle minutes to hours of video understanding in seconds through a single API call.
- Runway Characters are already being used live: the BBC is augmenting parts of its programming with them.
Top Stories
Why it matters: The most consequential updates this cycle centered on training inputs, agent scaffolding, deployment hardware, and governance.
1) Eon Systems pushed a connectome-driven fruit fly into a simulated body
Eon said it took the FlyWire connectome of the fruit fly brain, applied a simple neuron model, and used it to control a MuJoCo physics-simulated body, closing the loop from neural activation to action.
Observers said the simulated fly showed walking, grooming, and feeding-like behaviors without training data or gradient descent, and one post described the result as what may be the first whole-brain emulation controlling a body.
The significance is methodological: the system is being framed as modeling neural structure rather than learning behavior from examples.
A note of caution came from another expert, who argued the work is still far from a biophysically faithful fly-brain simulation because individual neurons are much more complex than this setup captures.
2) Agentic coding is becoming a systems discipline
The new OpenDev paper argues the field is shifting from IDE plugins to terminal-native agents and lays out concrete reliability patterns, including workload-specialized model routing, separate planning and execution agents, lazy tool discovery, adaptive context compaction, cross-session memory, and strict safety controls.
That direction is showing up in operations as well: OpenAI said a small team steering Codex opened and merged 1,500 pull requests with zero manual coding for a product used by hundreds of internal users.
LangChain’s new LangSmith Skills + CLI extends the same idea by letting coding agents debug traces, create datasets, and run experiments natively in the terminal.
At the application layer, Devin’s team says its system evaluates a couple dozen model groups for harness inclusion and rewrites its stack every few months, while one user said version 2.2 now feels simpler than local development for most work.
3) Synthetic data and reusable skills are being treated as first-class assets
Hugging Face released FinePhrase and a Synthetic Data Playbook after more than 90 experiments and 1T generated tokens, producing a 500B-token synthetic dataset and publishing the associated recipes and code.
SkillNet complements that effort on the agent side: it organizes more than 200,000 AI skills inside a unified ontology with relationships such as similarity, composition, and dependency, and reports a 40% improvement in average rewards with 30% fewer execution steps across ALFWorld, WebShop, and ScienceWorld.
Together, these releases suggest teams are increasingly productizing the inputs to intelligence, not just the final model. Resources: https://huggingface.co/spaces/HuggingFaceFW/finephrase and https://arxiv.org/abs/2603.04448
4) SambaNova launched hardware aimed directly at agentic inference
SambaNova introduced the SN50 RDU, presenting it as a chip designed for the cost profile of agentic inference rather than conventional GPU-style serving.
The architecture maps model graphs directly onto hardware data paths and adds agentic caching across large-capacity memory, HBM, and SRAM so multiple models can stay resident and switch in milliseconds.
Reported performance claims versus NVIDIA Blackwell B200 were 5× faster inference, 3× higher throughput, and up to 8× lower TCO on large models, with SambaRack SN50 scaling to 256 accelerators and support for up to 10T-parameter models and 10M-token contexts.
SN40L is available now, while SN50 and SambaRack SN50 are expected in H2 2026.
5) OpenAI’s robotics leadership change made autonomy concerns concrete
Caitlin Kalinowski resigned from OpenAI over concerns about “lethal autonomy without human intervention.” She had led the robotics division after joining from Meta in November.
“This was about principle, not people.”
The resignation lands as robotics builders are also publicly describing unusually fast progress: Brett Adcock said he has “never seen this much progress in robotics” and that his lab is seeing capabilities emerge that “we didn’t even know were possible.”
Research & Innovation
Why it matters: This cycle’s research was unusually concrete about when agents help, how they should plan, and how automated research systems may scale.
Multi-agent gains depend on task structure
A study across 180 configurations found multi-agent setups can improve performance by up to 81% on parallelizable tasks such as financial analysis, but degrade performance by up to 70% on sequential tasks such as Minecraft crafting.
The paper also fits an equation that predicts the best architecture for a new task 87% of the time. PDF: https://arxiv.org/pdf/2512.08296
Structured planning continues to outperform greedy web agents
StructuredAgent introduces dynamic AND/OR trees plus structured memory so agents can backtrack, revise, and preserve alternative solutions during long web tasks.
It reports 46.7% success on complex shopping tasks and interpretable hierarchical plans that make debugging and human intervention easier. Paper: https://arxiv.org/abs/2603.05294
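The AND/OR structure is the key idea: OR nodes preserve alternative sub-plans to backtrack into, while AND nodes require every subgoal. A conceptual sketch of that evaluation logic, not StructuredAgent's implementation:

```python
def solve(node):
    """Evaluate an AND/OR plan tree. OR nodes succeed if any child plan
    works (so the agent can fall back to alternatives); AND nodes require
    every subgoal. Leaves are ("leaf", succeeded) pairs where succeeded is
    whether the primitive action worked.
    (Conceptual sketch of AND/OR planning, not StructuredAgent's code.)"""
    kind, children = node
    if kind == "leaf":
        return children  # bool outcome of the primitive action
    if kind == "and":
        return all(solve(c) for c in children)
    if kind == "or":
        return any(solve(c) for c in children)  # first working alternative
    raise ValueError(f"unknown node kind: {kind}")

# Hypothetical buy-item plan: (search OR browse-category) AND add-to-cart AND checkout.
plan = ("and", [
    ("or", [("leaf", False), ("leaf", True)]),  # search fails, browsing works
    ("leaf", True),
    ("leaf", True),
])
print(solve(plan))
```

A greedy agent that committed to the failed search branch would abort; the OR node is what makes backtracking, and the interpretable plan trace, possible.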
Automated research stacks are opening up
Google DeepMind said it is open-sourcing part of its automated-research infrastructure for Gemini in the repo https://github.com/google-deepmind/simply, describing it as more complex than the nanochat setup but closer to state-of-the-art LLM pre- and post-training.
Karpathy also described the next step for autoresearch as asynchronous, massively collaborative agents, more like a research community than a single PhD student, with experiments summarized in GitHub Discussions or PRs that agents can later read and build on.
Model and tooling design notes
Hugging Face redesigned transformers to make mixture-of-experts models first-class citizens, covering weight loading, expert routing backends, parallelism, and training optimizations.
A separate argument from world-model research said symbolic world models that abstract away from pixels are especially important for agents, while also acknowledging that converting real-world signals into symbols remains unsolved.
Products & Launches
Why it matters: New launches this cycle focused on making agents easier to run locally, inspect, and integrate into everyday workflows.
- Codex: Recent updates included GPT 5.4, Windows support, Fast mode, and new skills such as Playwright Interactive, Slides, and Spreadsheets, alongside Codex Security and Codex for OSS. Official site: https://openai.com/codex/
- LangSmith Skills + CLI: LangChain released Skills + CLI so coding agents can debug traces, create datasets, and run experiments from the terminal. More: https://blog.langchain.com/langsmith-cli-skills/
- OpenClaw on Jetson: NVIDIA Robotics published a tutorial for running a fully local, always-on assistant on Jetson with zero cloud APIs; vLLM said the setup can serve MoE models such as Nemotron 3 Nano 30B on Jetson AGX. Tutorial: https://www.jetson-ai-lab.com/tutorials/openclaw/
- FireRed-Image-Edit-1.1: fal launched a new image-editing model with identity consistency across edits, multi-image reference blending, portrait makeup, text style reference, and photo restoration. Try it here: https://fal.ai/models/fal-ai/firered-image-edit-v1.1
- Hermes Agent: Nous Research published docs for Hermes Agent at https://hermes-agent.nousresearch.com/docs; earlier this week the app rose from #41 to #21 on OpenRouter.
Industry Moves
Why it matters: The clearest business pattern this cycle was investment in the operating layer around models: harnesses, routing, infra, and distribution.
AI-native organizations are standardizing around harnesses
OpenAI’s Harness Engineering post said a small team used Codex to open and merge 1,500 pull requests with zero manual coding for a product used by hundreds of internal users.
Devin’s reported setup follows a similar logic: it uses a couple dozen model groups, evaluates models extensively for harness inclusion, and rewrites the stack every few months; one frequent user said Devin 2.2 now feels simpler than local development for most tasks.
“Build a company that benefits from the models getting better and better”
Infrastructure competition is widening
NVIDIA acquired Brev.dev, whose founders said they started the company to build the best possible developer experience and had already been working closely with NVIDIA since August.
Huawei, meanwhile, showcased the Atlas 950 SuperPoD with 8,192 cards and the Atlas 850E inference server; one estimate said the SuperPoD is roughly comparable to 8K H200s, with Q4 2026 delivery constrained by HBM and NPU chip bottlenecks.
On the demand side, Similarweb said Claude was the fastest-growing generative AI tool by website visits in February.
Policy & Regulation
Why it matters: Policy signals are still early, but this cycle included both a concrete disclosure rule and direct public subsidies for agent deployment.
New York added a clear disclosure and consent requirement
New York will require disclosure when AI is used in advertising and prior consent for the commercial use of a deceased individual’s name, voice, or image.
Shenzhen is subsidizing agent deployment directly
Shenzhen rolled out free OpenClaw setup, three months of free computing power, a 50% subsidy on data services, and a 30% hardware subsidy. One observer said the scale and direct government involvement make the security implications of agents harder to ignore.
Quick Takes
Why it matters: These smaller items help track where capability, tooling, and evaluation practice are moving next.
- Claude-assisted debugging: A Zhihu writeup said Claude Opus 4.6 helped isolate a DeepEP race condition involving PyTorch deterministic mode, GPU streams, and NaN-filled buffers after roughly two days of intermittent runs.
- Small-model pressure: One tester concluded Qwen 3.5-4B is about as good as GPT-4o in most benchmarked cases; another said its reasoning version was narrowly stronger on WildChat but more verbose, less knowledgeable, and more hallucination-prone.
- OpenClaw benchmarking: PinchBench launched to compare model performance on OpenClaw-style tasks.
- Secure execution: Monty, a minimal secure Python interpreter written in Rust for AI use cases, is now on GitHub at https://github.com/pydantic/monty.
- Kernel optimization: A fused RMS Norm + NVFP4 quantization kernel written in CuTeDSL reported a consistent ~2.9× speedup over separate Triton kernels.
- LLM eval rigor: A forthcoming long-form post on applied statistics for LLM evals highlighted noise reduction, more confident conclusions, and faster experiments, with paper recommendations attached.
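For reference, the RMS Norm half of the fused kernel mentioned in the quick takes computes the following; this is just the math in plain Python, not the CuTeDSL kernel:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMS Norm: divide each element by the root mean square of
    the vector (plus eps for stability), then apply a learned per-channel
    weight. A fused kernel would combine this normalization with the
    NVFP4 quantization step in a single memory pass."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

print(rms_norm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0]))
```

Fusing pays off because both steps are memory-bound: doing them in one pass halves the reads and writes of the activation tensor.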
Top Stories
Why it matters: This cycle's biggest developments were less about a single model launch and more about the systems around models: autonomous research loops, benchmark harnesses, serving infrastructure, and real-world workflow adoption.
1) Karpathy packages autonomous experimentation into autoresearch
Karpathy released autoresearch, a self-contained single-GPU repo of roughly 630 lines derived from nanochat's LLM training core. The split is simple: the human edits the research agenda in markdown, while the agent edits the training code in Python.
The goal is a fixed 5-minute loop on a git feature branch: run a full training job, keep changes that lower validation loss, and let the agent search over architecture, optimizer, and hyperparameters. This packages an approach Karpathy had already been running on nanochat, where agents made 110 changes over roughly 12 hours and pushed validation loss from 0.862415 to 0.858039 with no wall-clock slowdown. He also said a larger production version remains running on a bigger model over 8x H100 GPUs.
Impact: The important shift is operational. The repo makes it easier to compare prompts, agents, and training strategies under a fixed-time budget.
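The keep-if-loss-improves loop can be sketched in a few lines; the callables below stand in for the real git and training plumbing and are not taken from the autoresearch repo:

```python
def accept_if_better(apply_edit, run_training, revert_edit, baseline_loss):
    """One step of a fixed-budget research loop: apply an agent-proposed
    edit to the training code, run a full training job, and keep the edit
    only if validation loss improves; otherwise revert the working tree.
    (Illustrative sketch; the callables stand in for git + training plumbing.)"""
    apply_edit()
    val_loss = run_training()
    if val_loss < baseline_loss:
        return val_loss, True   # commit the change on the feature branch
    revert_edit()
    return baseline_loss, False

# Toy run: one improving edit against a hypothetical baseline.
state = {"edits": []}
loss, kept = accept_if_better(
    lambda: state["edits"].append("tweak-qknorm"),  # hypothetical edit
    lambda: 0.858039,                               # scripted "training run"
    state["edits"].pop,
    baseline_loss=0.862415,
)
print(loss, kept)
```

Running this step on a timer against a real training script is essentially the whole harness; the interesting part is the agent proposing which edit to try next.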
2) ARC-AGI-3 highlights both leaderboard movement and harness sensitivity
On ARC-AGI-3, Opus 4.6 led by solving one level in two different games and showing the strongest memory use, while Gemini 3.1 Pro came close but used less detailed memory. GPT-5.4 medium underperformed because it treated the progress bar as the objective across all three games, but GPT-5.4-xhigh one-shotted early levels when the prompt explicitly mentioned that progress bar.
The same tester argued that Opus 4.6, GPT-5.4, and Gemini 3.1 Pro should all perform well with a minimal harness that exposes the previous action/state, the current state, and a hint that the environment contains HUD elements. He later said Opus 4.6 and Gemini 3.1 results were unaffected by a testing bug, while some smaller-model results were rerun after cleanup.
Impact: ARC-style results are increasingly measuring the combination of model plus harness, not raw model weights alone.
3) vLLM 0.17.0 broadens the open inference stack
vLLM 0.17.0 arrives with 699 commits from 272 contributors, including 48 new contributors. The release adds FlashAttention 4, Qwen3.5 with Gated Delta Networks, Model Runner V2 improvements, a new performance-mode flag, Weight Offloading V2, Elastic Expert Parallelism milestone 2, and direct loading of quantized LoRA adapters. It also expands speculative decoding, API support, and hardware coverage across NVIDIA, AMD, Intel XPU, and CPU backends.
Release notes: vLLM v0.17.0
Impact: This looks like continued consolidation of the open serving stack around performance tuning, hardware specialization, and broader model coverage.
4) Early GPT-5.4 reports focus on orchestration, docs, and high-agency coding
Early GPT-5.4 feedback is clustering around workflow-heavy tasks. Sam Altman said the model is strong at coding, knowledge work, and computer use, and highlighted progress on conversational personality. Other users described it as feeling like a smart friend and as a solid orchestration model for custom subagents. Reported wins include catching outdated markdown so later agents do not absorb stale information, writing strong technical spec documents, reverse engineering the DOS game SkyRoads with no source code, and hacking the NES Mario ROM to expose RAM events and build an AI-controlled emulator. One user also reported GPT-5.4-xhigh at #1 on Toolathlon.
Not every subdomain improved evenly: another user said GPT-5.4 looks better aesthetically on frontend work but still breaks layouts too often versus 5.3-codex.
Impact: The early picture is a model that looks especially valuable for orchestration, documentation, and high-agency coding workflows, while still showing unevenness in UI-heavy tasks.
Research & Innovation
Why it matters: Research this cycle focused on practical bottlenecks for agents: how to evaluate them in more realistic settings, how to let them build better scaffolding, and how to make model internals more efficient and stable.
Agent evaluation is moving toward hidden constraints and scaffolding
Labelbox Applied ML Research introduced Implicit Intelligence, a benchmark for whether agents respect unstated constraints across implicit reasoning, catastrophic risk, privacy/security, and accessibility. The dataset uses 205 iOS Shortcuts-based scenarios with hidden rules and binary rubrics; across 16 models, the best result reached 48.3% SPR and 72.7% NSS, while the Claude Opus 4.5 world simulator hit 98.6% consistency.
AutoHarness makes a complementary argument: agents should be able to synthesize their own harnesses instead of relying on manually built tool, code execution, file system, and API scaffolding. Paper: https://arxiv.org/abs/2603.03329
A separate survey, The Landscape of Agentic Reinforcement Learning for LLMs, argues that real agents operate in open-ended, partially observable environments where planning, memory, tool use, reasoning, self-improvement, and perception interact, so agentic RL should be treated as its own landscape. Paper: https://arxiv.org/abs/2509.02547
Efficiency work is targeting transformer mechanics directly
New research from Yann LeCun and collaborators at NYU studies massive activations and attention sinks in transformer language models. The paper argues that their co-occurrence is largely an architectural artifact of pre-norm design, not a fundamental property. It also says massive activations behave like implicit model parameters and attention sinks modulate outputs locally, with direct implications for quantization, pruning, and KV-cache management. Paper: https://arxiv.org/abs/2603.05498
Fine-tuning and memory remain active engineering problems
Research shared this week says replaying generic pre-training data during fine-tuning can improve data efficiency, reduce forgetting, and even lift performance on the fine-tuning domain, especially when that domain was underrepresented in pre-training. Percy Liang noted the work had previously appeared as a Marin issue before the arXiv release.
Separately, the survey Anatomy of Agentic Memory catalogs why long-running memory systems fail in practice, covering Memory-Augmented Generation, different memory architectures, benchmark saturation, judge instability, and latency and retrieval costs.
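The replay idea above amounts to mixing a fraction of generic pre-training examples into each fine-tuning batch. A minimal sketch (the paper's exact mixing ratio and schedule are not shown here; the 25% fraction is an arbitrary placeholder):

```python
import random

def mixed_batches(finetune_data, pretrain_data, replay_frac=0.25,
                  batch_size=8, seed=0):
    """Yield fine-tuning batches in which a fixed fraction of examples is
    replayed generic pre-training data, the forgetting mitigation described
    above. (Illustrative sketch; real pipelines stream token sequences and
    may anneal the replay fraction over training.)"""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac)
    while True:
        batch = rng.sample(finetune_data, batch_size - n_replay)
        batch += rng.sample(pretrain_data, n_replay)
        rng.shuffle(batch)  # avoid a fixed replay position in the batch
        yield batch

# Toy usage with integer stand-ins for examples.
gen = mixed_batches(list(range(100)), list(range(1000, 1100)))
print(next(gen))
```

The design choice worth noting is that replay is per-batch rather than a separate phase, so the optimizer never sees a long stretch of fine-tuning-only gradients that could overwrite general capabilities.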
Products & Launches
Why it matters: New launches are increasingly about packaging agent capability into durable workflows: persistent memory, recurring automation, secure execution, and easier deployment.
Hermes Agent expands from memory to live integrations
Hermes Agent is positioned as an open-source agent with multi-level memory and persistent machine access so it can get more capable over time. Recent demos show it looking up YC Bench, porting it into the Atropos evaluation environment, testing Sonnet, and finding and fixing a bug in YCBench. It now also supports live Polymarket data for answering prediction questions, currently in read-only mode.
The ecosystem around it is widening too: a Fly.io wizard installer automates deployment, and the app climbed from #41 to #21 on OpenRouter, with community congratulations on 2B+ tokens.
T3 Code opens publicly
T3 Code is now available to everyone, fully open source, and built on top of the Codex CLI so users can bring existing Codex subscriptions. Adoption was fast: it neared 2,000 users in its first hour and hit 5,000 users on launch day, while shipping fixes for markdown rendering, unsupported code blocks, shell detection, non-git projects, and path handling.
Chutes pushes secure inference with client-side E2E encryption
Chutes says its client-side E2E inference stack is ready for deployment. TEE nodes generate ephemeral quantum-safe keys; clients verify the secure enclave, encrypt the request for one specific instance, and only the client and that TEE pod can read the traffic. The team said all public LLMs on Chutes now support this mode, after major changes to DeepGEMM warmup, SGLang, and vLLM to handle TEE-related performance penalties. Transport repo: https://github.com/chutesai/chutes-e2ee-transport
Also notable
- SkyPilot keeps pushing a minimal ad-hoc GPU workflow with no containers and little setup overhead.
- agent-history lets Claude and Codex inspect prior conversation histories and catch up after context limits.
- /loop adds recurring tasks for up to three days at a time, with examples around PR maintenance and daily Slack summaries.
Industry Moves
Why it matters: Strategy is increasingly about compute supply, research talent, and which teams can turn models into usable systems.
OpenAI demand still appears compute-constrained
Sam Altman thanked Jensen Huang for helping expand Nvidia capacity at AWS for OpenAI. A separate commentator argued that narratives of weakening OpenAI compute needs look doubtful because Codex token use is exploding. These are not formal usage disclosures, but they point in the same direction: more capacity is still being pulled into deployment.
Exa opens a Zurich office for search and retrieval work
Exa launched a new Zurich office staffed by several former Google researchers to explore new web-scale retrieval methods. The focus underscores continued competition around retrieval quality, not just model quality.
Sakana AI is hiring into the Jevons paradox view of software
Sakana AI says AI is making software development more efficient, but that falling costs are increasing demand for software engineers rather than reducing it. The company is hiring full-stack engineers to build 0→1 services that incorporate LLMs and agents across frontend to infrastructure, with roles open to full-time staff, contractors, and student interns.
Governance pressure is surfacing inside labs
A former OpenAI Robotics team member said he resigned over concerns around surveillance without judicial oversight and lethal autonomy without human authorization.
"surveillance of Americans without judicial oversight and lethal autonomy without human authorization are lines that deserved more deliberation than they got."
He said the decision was about principle, not people, and expressed respect for Sam Altman and the team.
Policy & Regulation
Why it matters: Compliance expectations around AI are getting more explicit, especially in consumer-facing media and personalization features.
New York's ad disclosure rule is a concrete compliance signal
A note circulating in the AI community says New York will require brands to disclose AI use in ads beginning June 9, 2026, with penalties reported at $5,000+ per violation. For marketers using generative media, that is a concrete disclosure signal.
Personalization safety is becoming a governance watch item
MIT and Penn State research summarized this week says LLM personalization features can significantly amplify sycophantic behavior, with memory-stored user profiles showing the strongest effect across 4 of 5 models in two-week user interactions. This is research rather than a rule, but it is directly relevant to teams building persistent memory or personalized assistants.
Quick Takes
Why it matters: These smaller items help track where capability, deployment, and user expectations are moving at the edge.
- Small-model pressure: an independent test concluded Qwen 3.5 4B is in the same capability league as GPT-4o in most cases, backing a claim that had initially drawn skepticism.
- Benchmark visibility: W&B Inference models are now listed on Artificial Analysis for independent comparison on intelligence, speed, price, and latency.
- Biological computing: Cortical Labs reportedly trained 200,000 human neurons to play DOOM in a week.
- Fast dashboard building: Perplexity Computer was used to build a live stock dashboard overnight, with the creator saying the dashboard is publicly available.
- Creative software: CorelDRAW Graphics Suite 2026 launched AI-powered tools for generating, remixing, refining, and background removal while keeping designers in control, built on Together AI Inference.
- Long-text image rendering: one user said Gemini 3.1 can now handle longer text passages almost perfectly, using the first page of Being and Time as an example.
- Vision tooling: SAM 3 was highlighted as a way to eliminate frame-by-frame video segmentation pain.
Top Stories
1) GPT‑5.4’s benchmark profile: bigger context, broad gains—and a higher bill
Why it matters: The latest third-party evaluations suggest GPT‑5.4 is meaningfully stronger across science/coding/tool use/long-context tasks, but the cost curve (and some reliability metrics) moved in the wrong direction.
- Artificial Analysis Intelligence Index: GPT‑5.4 (xhigh) ties for #1 at 57, matching Gemini 3.1 Pro Preview and up from GPT‑5.2 (xhigh) at 51.
- Context window + reasoning modes: GPT‑5.4 is reported with a 1.05M-token context window (up from 400K in GPT‑5.2) and five reasoning-effort modes (none → xhigh).
- Broad benchmark gains (with one notable regression): Improvements vs GPT‑5.2 (xhigh) include CritPt (+8 p.p.), TerminalBench Hard (+11 p.p.), HLE (+6 p.p.), τ²‑Bench (+7 p.p.), SciCode (+5 p.p.), GPQA (+2 p.p.), and LCR (+1 p.p.); the only regression noted is IFBench (‑2 p.p.).
- Cost / efficiency trade-off: Despite modest token-efficiency gains vs GPT‑5.2, Artificial Analysis estimates the cost to run its full Intelligence Index rises ~28% to ~$2,951 for GPT‑5.4, about 3× Gemini 3.1 Pro Preview (~$892), driven by both token usage and higher per-token prices.
- Accuracy vs hallucinations tension (AA‑Omniscience): GPT‑5.4 improves accuracy (44% → 50%) but shows a worse hallucination rate (80% → 89%), attributed to a higher attempt rate (91% → 97%).
Full model card/results: https://artificialanalysis.ai/models/gpt-5-4
2) GPT‑5.4 Pro hits a new SOTA on CritPt—at a steep “reasoning premium”
Why it matters: CritPt is positioned as research-level physics reasoning with a private dataset; the jump to 30% in ~4 months is notable, but it also highlights a widening gap between best-possible results and economically deployable results.
- Artificial Analysis reports GPT‑5.4 Pro (xhigh) reaching 30% on CritPt, up from a top score of 9% when CritPt launched in Nov 2025 .
- The same evaluation is described as costing over $1k, about 13× GPT‑5.4 (xhigh), driven by output pricing ($180/1M output tokens vs $15) despite similar token counts (6.0M vs 5.5M) .
- Separate commentary flags the cost delta: GPT‑5.4‑Pro‑xhigh is reported as 13.275× more expensive than GPT‑5.4‑xhigh .
3) “Security agents” are becoming a headline capability: Firefox vulnerability research + Codex Security
Why it matters: The same frontier-model capabilities improving coding and tool use are translating into vulnerability discovery at scale—raising the bar for defense (and shrinking the window before exploitation improves).
- Claude Opus 4.6 on Firefox (Anthropic × Mozilla): Anthropic says it partnered with Mozilla to test Claude’s ability to find vulnerabilities in Firefox, reporting 22 vulnerabilities found in two weeks, including 14 high-severity (about one‑fifth of Mozilla’s 2025 high-severity remediations) .
- Anthropic also warns that while models are “currently better at finding vulnerabilities than exploiting them,” the gap is “unlikely to last,” urging developers to improve software security .
- A separate summary reports that in exploitation testing, Claude produced a working browser exploit twice (after several hundred attempts and about $4,000 in API credits) on a stripped test system, and frames vulnerability finding as ~10× cheaper than exploiting “for now” .
In parallel, OpenAI introduced Codex Security, an application security agent that finds vulnerabilities, validates them, and proposes fixes for review and patching . OpenAI says it evolved from Aardvark (private beta last year) and improved signal quality (reduced noise/false positives, better severity accuracy) .
4) LisanBench “Thinking” results surge; benchmark creator considers making it harder
Why it matters: These results are another datapoint that reasoning-budgeted variants can dominate certain open-ended tasks—while also showing how quickly some benchmarks can saturate.
- Latest LisanBench “Thinking (16k)” top scores include Opus 4.6 Thinking (14,083) and Sonnet 4.6 Thinking (11,789.67), followed by Gemini 3.1 Pro (high) at 6,414.67; GPT‑5.4 (medium) is listed at 5,273.33.
- The benchmark creator says they may “either make a harder version of LisanBench or discontinue it” , and separately notes that with Opus/Sonnet 4.6 it “seems like it’s saturating,” leaving “only reasoning efficiency” measurable beyond a point .
5) Compute spending and infrastructure expansion continues to accelerate
Why it matters: The capex and physical buildout signal how aggressively the industry is committing to scaling—even as model lifecycles stay short and evaluation costs rise.
- One estimate claims MSFT, AMZN, META, and GOOG will spend a combined $650B this year .
- A separate roundup flags SoftBank seeking up to $40B in a loan mostly to finance its OpenAI stake .
- OpenAI infrastructure: construction is underway at a Port Washington, Wisconsin site with VantageDC and Oracle, described as part of OpenAI’s long-term compute strategy; the “first steel beams went up” this week .
Research & Innovation
Why it matters: This cycle’s research points to three themes: (1) better efficiency (architectures/training), (2) more agent-realistic evaluation, and (3) new approaches to memory and continual learning.
Hybrid architectures and data efficiency
- Allen AI: Reports a key finding that hybrid models can be “substantially more data-efficient than transformers,” with Olmo Hybrid matching Olmo 3 on MMLU using 49% fewer tokens (~2× efficiency) .
- Lambda published a model card with speed tests for olmo-hybrid-instruct-dpo-7b across A100/H100/B200 .
Compact multimodal reasoning for practical agents
- Microsoft Phi‑4‑reasoning‑vision‑15B: A 15B parameter multimodal reasoning model combining visual understanding with structured reasoning over text and images, aimed at the capability/efficiency “sweet spot” for practical agent deployments . Paper: https://arxiv.org/abs/2603.03975.
Benchmarks for more realistic “software engineering” agents
- SWE‑CI: A new benchmark designed around continuous integration workflows (running test suites, catching regressions, maintaining code quality across multiple changes), positioned as a step beyond single-issue bug-fix benchmarks . Paper: https://arxiv.org/abs/2603.03823.
Continual learning + instant specialization via LoRA hypernetworks
- Sakana AI Labs: Introduced Doc‑to‑LoRA (turning documents into memory) and Text‑to‑LoRA (turning task descriptions into behavior adapters) using a hypernetwork that generates LoRA weights; meta-training takes days/weeks, but adapter generation is milliseconds at runtime . Claimed benefits include long-term memory without re-reading documents and “instant task specialization” without a fine-tuning pipeline .
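Sakana's actual architecture isn't detailed here, but the core idea it describes — a hypernetwork that maps a task embedding to LoRA adapter weights in a single fast forward pass, with no fine-tuning pipeline — can be sketched in plain Python. All sizes, names, and the single-layer hypernet below are illustrative assumptions, not the real system:

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def linear(weights, vec):
    """Apply a dense layer (no bias) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

D, RANK, EMB = 8, 2, 4          # base-weight size, LoRA rank, task-embedding size

# Frozen base weight of the layer being adapted.
W_base = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D)]

# Hypernetwork: one dense layer mapping a task embedding to all LoRA params.
# (Meta-training this mapping is the slow part; using it is fast.)
n_lora = 2 * RANK * D           # A is RANK x D, B is D x RANK
H = [[random.gauss(0, 0.1) for _ in range(EMB)] for _ in range(n_lora)]

def generate_lora(task_embedding):
    """The milliseconds-at-runtime step: hypernet emits flattened LoRA weights."""
    flat = linear(H, task_embedding)
    A = [flat[i * D:(i + 1) * D] for i in range(RANK)]
    B = [flat[RANK * D + i * RANK: RANK * D + (i + 1) * RANK] for i in range(D)]
    return A, B

def adapted_weight(task_embedding, scale=1.0):
    """W_eff = W_base + scale * (B @ A), the standard LoRA update."""
    A, B = generate_lora(task_embedding)
    delta = matmul(B, A)        # D x D low-rank update
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W_base, delta)]

W_task = adapted_weight([1.0, 0.0, 0.5, -0.5])
print(len(W_task), len(W_task[0]))  # 8 8
```

The point of the design is that only `H` is meta-trained; per-task specialization is just one matrix-vector product plus a low-rank add.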
Fine-tuning efficiency and “forgotten knowledge”
- A research note claims replaying generic pre-training data during fine-tuning improves data efficiency, reduces forgetting, and can improve performance on the fine-tuning domain (especially when that domain is scarce in pre-training) .
- Separate work notes that a drop in prior-task performance in VLAs doesn’t necessarily mean knowledge is gone; it can be “rapidly recovered with minimal finetuning” .
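The replay recipe in the first note amounts to a data-mixing loop: each fine-tuning batch carries a fraction of generic pre-training examples. A minimal sketch — the 25% replay ratio, batch size, and string stand-ins for token batches are illustrative, not values from the note:

```python
import itertools
import random

random.seed(0)

def batches(stream, size):
    """Group an iterable into fixed-size training batches."""
    it = iter(stream)
    while batch := list(itertools.islice(it, size)):
        yield batch

# Stand-ins for the two data sources (strings instead of token tensors).
finetune_data = (f"domain_example_{i}" for i in itertools.count())
pretrain_data = (f"generic_example_{i}" for i in itertools.count())

def mixed_batches(replay_ratio=0.25, batch_size=8):
    """Yield fine-tuning batches where a fixed fraction of slots is filled
    with replayed generic pre-training data, to reduce forgetting."""
    n_replay = int(batch_size * replay_ratio)
    ft = batches(finetune_data, batch_size - n_replay)
    pt = batches(pretrain_data, n_replay)
    for ft_batch, pt_batch in zip(ft, pt):
        batch = ft_batch + pt_batch
        random.shuffle(batch)
        yield batch

first = next(mixed_batches())
print(sum(x.startswith("generic") for x in first))  # 2 of 8 examples are replay
```

In a real run, the replayed examples would be sampled from the original pre-training corpus (or a proxy for it) and tokenized identically to the fine-tuning data.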
Language and speech data availability
- Google Research WAXAL: Open-access dataset with 2,400+ hours of speech data for 27 Sub‑Saharan African languages serving 100M+ speakers, positioned as addressing data scarcity across Africa’s 2000+ spoken languages. Dataset: http://goo.gle/4cxNHae.
Products & Launches
Why it matters: Agent tooling is expanding along three fronts: (1) security and code maintenance, (2) “computer” orchestration and automation, and (3) creative workflows that are composable and model-agnostic.
Security + open source maintenance
- Codex Security (research preview): OpenAI’s application security agent is in research preview . OpenAI says it’s rolling out to ChatGPT Enterprise/Business/Edu via Codex web with free usage for the next month, and is now also available on ChatGPT Pro accounts .
- Codex for Open Source: OpenAI is launching Codex for OSS maintainers to help with code review, understanding large codebases, and strengthening security coverage . Maintainers receive API credits, 6 months of ChatGPT Pro with Codex, and access to Codex Security as needed . Apply: http://developers.openai.com/codex/community/codex-for-oss.
Agent “computer” platforms add reuse and automation
- Perplexity Computer: Shipped Voice Mode, Skills, Model Council, and added GPT‑5.4 / GPT‑5.4 Thinking (including as an orchestrator model) . Perplexity also demoed generating a formatted Excel spreadsheet with live macro indicators from a simple prompt plus a Federal Reserve API key .
- Claude Code desktop: Launched local scheduled tasks, letting users run regular tasks while the computer is awake .
Creative + multimodal workflows
- NotebookLM: Google says it can turn sources into “cinematic video explainers,” with Cinematic Video Overviews rolling out for Ultra users in English .
- Hugging Face Modular Diffusers: New Diffusers submodule enabling composable diffusion pipelines (mix-and-match blocks; visual workflow via Mellon; share custom blocks on HF Hub), with a commitment to maintain both the classic DiffusionPipeline and new ModularPipeline abstractions . Blog: https://huggingface.co/blog/modular-diffusers.
Developer-facing tools and marketplaces
- T3 Code: A fully open-source tool built on Codex CLI, intended to scale parallel agent workflows beyond what CLIs handle well; available at http://t3.codes or via npx t3@alpha .
- Anthropic Claude marketplace: Anthropic says organizations can apply existing spend commitments toward Claude-powered partner solutions (e.g., GitLab, Harvey, Replit, Snowflake) .
Industry Moves
Why it matters: Distribution (where models show up), pricing/subsidies, and infrastructure decisions are increasingly shaping adoption as much as raw benchmark performance.
“Coding model arms race” intensifies
- Cursor: Reported mandate labeled “P0 #1” to “Build the best coding model” .
- Claude Code subsidization (as inferred from Cursor analysis): A $200/month plan reportedly moved from allowing ~$2,000 of compute to ~$5,000 (2.5×) .
Open models and regional ecosystems
- Sarvam AI: Open-sourced two India-built reasoning models (Sarvam 30B and 105B) with an emphasis on full-stack in-house work (data, training, RL, tokenizer design, inference optimization) and performance in Indian languages; weights are available on Hugging Face and AIKosh, with SGLang day‑0 support and vLLM support “coming soon” .
Developer tooling + enterprise deployments
- ToyotaGPT: Toyota Motor North America equipped 56,000 employees with ToyotaGPT built on LangGraph .
- Databricks: Announced day-one access to GPT‑5.4 on Databricks .
Geographic clustering
- A London-focused roundup claims OpenAI plans London as its largest research hub outside San Francisco, while Anthropic, xAI, Microsoft, DeepMind, Perplexity, Groq, and Cursor are also expanding or establishing major presence there .
Policy & Regulation
Why it matters: Government procurement decisions and legal challenges are becoming first-order constraints on which models can be used (and where), especially in defense contexts.
Anthropic vs. Department of War: “supply chain risk” designation and fallout
- Anthropic says the Department of War’s supply-chain risk designation is narrower than early headlines suggested, affecting only Claude’s direct use in certain Department-linked contracts, while most customers remain unaffected . Anthropic CEO Dario Amodei calls the move legally shaky, says Anthropic will fight it in court, and reiterates support for U.S. national security—offering models at nominal cost during a transition to avoid disrupting critical operations .
- Separately, Emil Michael states there is “no active Department of War negotiation with Anthropic”.
- Google is reported as saying Anthropic will remain available for non-defense workloads on Google Cloud .
Privacy litigation signal
- A roundup flags Meta’s AI glasses being hit with a privacy suit (details linked) .
Quick Takes
Why it matters: These are smaller datapoints that still shift day-to-day practice (what wins on real tasks, what breaks, and what teams deploy next).
- TaxCalcBench: GPT‑5.4 produces perfect tax returns on 56.86% of cases, #1 overall and above Claude Opus 4.6 (52.94%); a separate post cites a jump from GPT‑5.2 (34%) to GPT‑5.4 (57%) .
- LiveBench: GPT‑5.4‑xhigh takes 1st place with very strong reasoning and coding scores .
- Arena (text): GPT‑5.4 High lands in the Text Arena top 10, described as substantially more rounded than GPT‑5.2 High, with large gains in categories like creative writing and legal/government .
- Kaggle challenges: A claim that GPT‑5.4 is almost 2× as good as GPT‑5.2 at Kaggle challenges requiring designing/building/training ML models on GPUs (success = bronze medal or better) .
- “Tiny program” demo: GPT‑5.4 reportedly generates a <5000‑byte C program to run GPT‑2 inference from raw weights in under 15 minutes .
- Prompt-injection incident: An attacker reportedly stole an npm token by injecting a prompt into a GitHub issue title that an AI triage bot executed .
- Model execution speed: Mercury 2 (diffusion, not autoregressive) claims 1,009 tokens/sec, targeting agent workflows where latency stacks up .
- vLLM attention portability: vLLM’s Triton attention backend (~800 lines) is presented as cross-platform across NVIDIA/AMD/Intel; it matches SOTA on H100 and is ~5.8× faster than earlier implementations on MI300, and is now the default on AMD ROCm .
Top Stories
1) OpenAI rolls out GPT‑5.4 (Thinking + Pro) with native computer use and 1M context
Why it matters: This is a consolidated “frontier model” push that pairs agentic coding + tool use + computer control with very long context, which changes what’s practical in production workflows (especially multi-step, tool-heavy tasks).
Key details (as announced across OpenAI + OpenAI DevRel):
- Availability / SKUs: GPT‑5.4 is available now in the API and Codex, with GPT‑5.4 Thinking and GPT‑5.4 Pro rolling out in ChatGPT. In the API, it’s available as gpt-5.4 and gpt-5.4-pro .
- Core capability bundle: Native computer-use capabilities; up to 1M tokens of context (Codex + API); “best-in-class agentic coding for complex tasks”; scalable tool search; more efficient reasoning for long, tool-heavy workflows .
- Computer use specifics: OpenAI Devs says GPT‑5.4 can write Playwright code, read screenshots, and issue keyboard/mouse actions to operate computers, with steerable behavior and configurable confirmation policies .
- Benchmarks shared by OpenAI Devs: 83.0% on GDPval, 75.0% on OSWorld‑Verified, 57.7% on SWE‑Bench Pro (Public), 54.6% on Toolathlon .
- Efficiency + speed knobs in Codex: /fast mode delivers up to 1.5× faster performance across supported models (including GPT‑5.4) . Separately, a user report notes 1.5× speed at 2× credit consumption.
- Steering mid-response: In ChatGPT, OpenAI says you can now interrupt GPT‑5.4 Thinking mid-response to add instructions or adjust direction, with steering rolling out on Android and web (iOS “coming soon”) .
Practical caveat on long context:
- Even with a 1M context window, retrieval degrades at very large contexts. One reported MRCR v2 “needle-in-a-haystack” curve shows 97% at 16–32K tokens, 57% at 256–512K, and 36% at 512K–1M—prompting recommendations to compact regularly.
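The "compact regularly" recommendation boils down to a budget check over the message history. A minimal sketch of such a compactor — the 4-chars-per-token estimate, the budget, and the summary placeholder are stand-ins for a real tokenizer and a model-written summary:

```python
def estimate_tokens(text):
    """Crude token estimate (~4 chars per token); a real implementation
    would count with the model's own tokenizer."""
    return max(1, len(text) // 4)

def compact(messages, budget_tokens=32_000, keep_recent=4):
    """When history exceeds the budget, replace older turns with a single
    summary placeholder and keep only the most recent turns verbatim."""
    total = sum(estimate_tokens(m) for m in messages)
    if total <= budget_tokens:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Stand-in for a model-written summary of the dropped turns.
    summary = f"[summary of {len(old)} earlier turns]"
    return [summary] + recent

history = [f"turn {i}: " + "x" * 4000 for i in range(50)]  # roughly 50K tokens
compacted = compact(history)
print(len(compacted))  # 5: one summary + 4 recent turns
```

The reported MRCR curve suggests keeping the working set well under the nominal window (here, far below 256K) rather than compacting only when the hard limit is reached.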
Relevant links:
- GPT‑5.4 announcement page: https://openai.com/index/introducing-gpt-5-4/
- Codex /fast details: https://developers.openai.com/codex/speed/
2) Databricks releases KARL, an RL-trained “knowledge agent” aimed at grounded enterprise reasoning
Why it matters: KARL is a concrete example of applying RL to non-verifiable enterprise knowledge tasks (messy docs, long tool chains), and Databricks frames it as an “assembly line” for producing agents—important for teams trying to move beyond “RAG as a demo.”
What was announced:
- What it is: KARL (Knowledge Agents from Reinforcement Learning) is an RL-trained agent for document-centric grounded reasoning over complex questions, “millions of documents,” “hundreds of tool calls,” and repeated context compression .
- Performance framing: Databricks describes “frontier-level performance on complex knowledge workloads at a fraction of the cost and latency of leading proprietary models” .
- Why RL here: Databricks emphasizes these enterprise tasks “are not strictly verifiable” like unit-test-style RL wins .
- Mechanics (high level): Off-policy RL with synthetic data (OAPL), multi-task RL that generalizes, and “parallel thinking” test-time compute to manage latency .
- RAG++++ detail: A VentureBeat summary highlights KARL matching frontier quality on messy enterprise data by running up to 200 vector searches per query.
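The many-searches pattern can be sketched as a fan-out over query reformulations with deduplication of hits. The toy character-count embedding and four-document index below are stand-ins for a real embedding model and vector store; nothing here reflects KARL's actual internals:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-characters embedding, normalized to unit length."""
    counts = Counter(text.lower())
    vec = [counts.get(c, 0) for c in "abcdefghijklmnopqrstuvwxyz"]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

DOCS = ["quarterly revenue report", "employee onboarding guide",
        "revenue recognition policy", "data retention policy"]
INDEX = [(doc, embed(doc)) for doc in DOCS]

def search(query, k=2):
    """One vector search: top-k documents by cosine similarity."""
    qv = embed(query)
    scored = sorted(INDEX, key=lambda item: -cosine(qv, item[1]))
    return [doc for doc, _ in scored[:k]]

def multi_search(question, reformulations, k=2):
    """Fan out many searches (KARL-style, up to hundreds) and merge hits,
    deduplicating while preserving first-seen order."""
    hits = []
    for q in [question] + reformulations:
        for doc in search(q, k):
            if doc not in hits:
                hits.append(doc)
    return hits

results = multi_search("revenue policy", ["recognition rules", "data retention"])
print(results)
```

At the reported scale (up to 200 searches per query), the merged hit list is what forces the repeated context compression the tech report describes.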
Links:
- Tech report PDF: https://www.databricks.com/sites/default/files/2026-03/karl.pdf
- Databricks blog: http://databricks.com/blog/meet-karl-faster-agent-enterprise-knowledge-powered-custom-rl
3) FlashAttention‑4 goes GA; PyTorch adds a FlashAttention‑4 backend for FlexAttention
Why it matters: Attention kernels are a performance ceiling for both training and inference. FA4 is positioned as a Blackwell-era redesign that shifts bottlenecks away from softmax/SMEM limits, while PyTorch is trying to make these gains accessible for custom attention variants (not only a single “blessed” kernel).
What’s new:
- FA4 GA: “FlashAttention‑4 is GA” .
- Core performance claim: FA4 reaches ~1600 TFLOPs attention on Blackwell GPUs and is described as “pretty much at matmul speed,” by changing the algorithm/pipeline so softmax and shared memory bandwidth no longer dictate speed .
- PyTorch integration: PyTorch added a FlashAttention‑4 backend to FlexAttention on Hopper and Blackwell GPUs; PyTorch now auto-generates CuTeDSL score/mask modifications and JIT-instantiates FA4 for custom attention variants . PyTorch reports 1.2× to 3.2× speedups over Triton on compute-bound workloads .
- Transformers integration (in progress): A PR for FA4 integration into Hugging Face Transformers was shared (PR #42435) .
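FlexAttention's core idea — a user-supplied function that rewrites each pre-softmax attention score, which the backend then compiles into a fused kernel — can be illustrated without the PyTorch API in plain Python. This mirrors the score_mod hook conceptually only; the real implementation never materializes the score matrix:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v, score_mod=None):
    """Single-head attention over lists of vectors, with a FlexAttention-style
    score_mod(score, q_idx, kv_idx) hook applied to each pre-softmax score."""
    d = len(q[0])
    out = []
    for qi, qv in enumerate(q):
        scores = [sum(a * b for a, b in zip(qv, kv)) / math.sqrt(d) for kv in k]
        if score_mod is not None:
            scores = [score_mod(s, qi, ki) for ki, s in enumerate(scores)]
        probs = softmax(scores)
        out.append([sum(p * row[j] for p, row in zip(probs, v))
                    for j in range(len(v[0]))])
    return out

def causal(score, q_idx, kv_idx):
    """Example modification: mask future positions with -inf scores."""
    return score if kv_idx <= q_idx else float("-inf")

q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(q, k, v, score_mod=causal)
print(out[0])  # first position attends only to itself -> [1.0, 0.0]
```

Swapping `causal` for, say, a relative-position bias changes the attention variant without touching the kernel — that per-variant JIT instantiation is what the FlexAttention backend automates.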
4) Anthropic–Pentagon escalation: “supply chain risk” designation + Amodei statement
Why it matters: This is a high-stakes governance signal: AI labs are increasingly treated as critical suppliers (and potential risks) in national-security procurement, with direct implications for enterprise adoption, contracts, and oversight.
Reported developments:
- Designation: A post claims the Pentagon formally notified Anthropic it’s been deemed a “supply chain risk”.
- Amodei response (as summarized): A memo-style summary says Amodei apologized for the tone of a leaked memo, said it was outdated/not his considered view, emphasized keeping warfighters equipped, and offered Claude to the military at nominal cost with forward-deployed engineer support .
- Anthropic’s statement link: Anthropic shared a statement from Amodei: https://www.anthropic.com/news/where-stand-department-war.
“Anthropic has much more in common with the Department of War than we have differences.”
5) Security incident report: “Clinejection” installs a separate agent (OpenClaw) without consent
Why it matters: Agentic dev tools run with broad local permissions; supply-chain style incidents can turn “developer convenience” into fleet-wide risk.
- A write-up alleges “every developer who installed or updated Cline got OpenClaw … installed globally on their machine without consent,” describing it as “malicious agent injection” and noting OpenClaw has “full system access” .
Details: https://grith.ai/blog/clinejection-when-your-ai-tool-installs-another
Research & Innovation
Why it matters: This week’s research is converging on a few themes: RL methods for messy tasks, hybrid architectures for scaling efficiency, and benchmarks that better approximate real agent constraints (implicit rules, over/underthinking, interaction).
Open models + hybrid architectures
- OLMo Hybrid (AI2): Allen AI released OLMo Hybrid, mixing transformer attention with linear RNN layers; the team claims hybrid models are “strictly more expressive” than either alone and that this translates to better scaling (49% fewer tokens to match OLMo 3 MMLU accuracy) .
- Training “fully in the open”: Lambda says OLMo Hybrid 7B was trained in the open with training logs/recovery metrics/weights, using 3T tokens, 512 NVIDIA Blackwell GPUs, over 7 days, with 97% active training time and median recovery under 4 minutes.
RL + evaluation research (Meta FAIR ICLR set)
- Meta FAIR says its team co-authored 7 papers accepted to ICLR, covering topics including joint safety agents (“Alignment Waltz”), judge RL (“J1”), experience synthesis for agent learning, and benchmarks for over/underthinking (“OptimalThinkingBench”) .
Data efficiency for language models
- Semantic Tube Prediction (STP): STP (co-authored by Yann LeCun) is described as forcing hidden states into locally linear “semantic tubes,” matching baseline accuracy with 16× less training data. Paper: https://arxiv.org/abs/2602.22617.
Benchmarks for agent “implicit constraints”
- Implicit Intelligence: Labelbox Applied ML Research introduced a benchmark testing whether agents respect unstated constraints across implicit reasoning, catastrophic risk, privacy/security, and accessibility . Paper: https://arxiv.org/abs/2602.20424.
Long-running agents: context compression as a core problem
- Baseten KV-cache compression: Baseten reports one-shot compaction preserves detailed information with 65–80% accuracy at 2–5× compression (outperforming text summarization) and explores what happens when you compress repeatedly for persistent agents .
Products & Launches
Why it matters: The biggest product shifts are around agent scaffolding: better computer-use interfaces, orchestration/automation, and cross-tool connectivity (so agents can actually act, not just chat).
GPT‑5.4 distribution and integrations
- GitHub Copilot: GitHub says GPT‑5.4 is now generally available and rolling out in Copilot; early testing highlights “enhanced logical reasoning and task execution” . Changelog: https://github.blog/changelog/2026-03-05-gpt-5-4-is-generally-available-in-github-copilot/.
- Cursor: Cursor says “GPT 5.4 is now available in Cursor,” and they found it “more natural and assertive than previous models” .
- Perplexity: Perplexity announced GPT‑5.4 and GPT‑5.4 Thinking availability for Pro/Max subscribers .
- Arena: Arena reports GPT‑5.4 variants in Text/Vision/Code arenas and publishes ranking highlights (e.g., GPT‑5.4‑high tied with Gemini‑3‑Pro in Text Arena) .
Codex tooling updates
- Codex app on Windows: OpenAI Devs announced Codex is now on Windows with a “native agent sandbox” and PowerShell support . Landing page: https://developers.openai.com/wendows.
Always-on agent operations
- Cursor Automations: Cursor introduced Automations for always-on agents that run based on triggers and instructions you define . Blog: http://cursor.com/blog/automations.
Office / finance workflow tooling
- ChatGPT for Excel: OpenAI launched “ChatGPT for Excel,” positioning it as bringing ChatGPT into spreadsheet workflows (“where decisions get made”) . Link: https://openai.com/index/chatgpt-for-excel/.
Video generation continues to split into “engines” vs “story tools”
- Bing Video Creator: Microsoft rolled out “Sora 2 generative video” in Bing Video Creator, adding audio integration and watermark + C2PA credentials .
- PAI (Utopai Studios): Utopai says PAI is rolling out as a long-form cinematic model with minutes-long continuous generation, character/scene consistency, and natural-language editing .
- LTX‑2.3 on fal: fal says LTX‑2.3 is live with Pro (audio-to-video, retake, extend) and Fast modes plus sharper detail/cleaner audio/stronger motion .
Industry Moves
Why it matters: Distribution and enterprise positioning are starting to matter as much as raw model quality—especially for agents (where tool ecosystems + integrations decide what gets adopted).
- Together AI fundraising (reported): Together AI is reportedly raising $1B at a $7.5B pre-money valuation, generating ~$1B ARR, with growth tied to moving from leasing GPUs to buying their own GPUs to rent out .
- Codex user growth: Codex surpassed 2M+ active users, up 25% week-over-week (noted as before Windows + GPT‑5.4 launch) .
- Claude adoption: One post claims “more than a million people are now signing up for Claude every day” .
- Sakana AI × MUFG: Sakana AI and Mitsubishi UFJ Bank advanced their “AI Lending Expert” system from ~6-month PoC to real-case verification phase. Link: https://sakana.ai/mufg-ai-lending.
Policy & Regulation
Why it matters: Export controls and professional-liability rules can become hard constraints on where AI can be deployed—and what assistants can legally do.
- US AI chip export restrictions (reported): A post says the Trump Administration is preparing a rule to restrict AI chip shipments globally without US approval, requiring permission for “virtually all exports of AI chips,” with Nvidia and AMD heavily impacted .
- New York bill targeting “substantive responses”: A New York bill would ban AI from answering questions related to licensed professions (medicine, law, dentistry, nursing, psychology, social work, engineering, and more), and companies would be liable if chatbots give “substantive responses” in these areas .
Quick Takes
Why it matters: Smaller releases often become “quiet defaults” inside stacks—especially around evaluation, routing, and on-device constraints.
- OpenAI: Chain-of-Thought controllability: OpenAI published a new evaluation suite/paper and says GPT‑5.4 Thinking shows “low ability to obscure its reasoning,” suggesting CoT monitoring remains a useful safety tool .
- Gemini 3.1 Flash‑Lite preview (pricing): Google launched Gemini 3.1 Flash‑Lite in preview at $0.25 / 1M input tokens for high-volume developer workloads .
- Perplexity “Model Council”: Perplexity launched a mode that runs GPT‑5.4, Claude Opus 4.6, and Gemini 3.1 Pro simultaneously and selects the best answer in one workflow .
- OLMo Hybrid (distribution): AI2 released a family of OLMo Hybrid models (base/SFT/DPO) on Hugging Face .
- FlashAttention‑4 resources: FA4 paper and code links shared (paper PDF + GitHub repo) .
- LiquidAI on-device agent: A 24B-parameter model (2.3B active per token) is reported to fit in 14.5GB and run tool selection with 385ms average latency (67 tools, 13 MCP servers) with “zero network calls” .
- OpenHands Critic v1.0: OpenHands released a “critic” model that scores coding agent traces to address the verification bottleneck, with real-time thumbs-up/down monitoring and support in SDK/CLI/Hugging Face .
- LangChain skills evaluation: LangChain released an evaluation benchmark for LangSmith/LangChain “skills,” emphasizing variance across tasks for coding agents . Repo: https://github.com/langchain-ai/skills-benchmarks.
- GitHub AGENTS.md guidance: GitHub’s analysis of 2,500+ repos suggests effective AGENTS.md files stay brief and include persona, exact commands, boundaries, and good output examples .