# Muse Spark Reopens the Frontier Race as Agent Platforms Mature

*By AI High Signal Digest • April 9, 2026*

Meta’s new frontier model led the cycle, while Anthropic pushed fully hosted agents and new benchmarks showed how difficult real-world agent work remains. Research also advanced protein design, math, memory systems, and automated scientific writing.

## Top Stories

*Why it matters:* The biggest developments this cycle combined a new frontier model, a push toward fully hosted agent infrastructure, and better evidence about where agent systems still break in real work.

### Meta launches Muse Spark and re-enters the frontier race

Meta released Muse Spark as the first model from Meta Superintelligence Labs after a nine-month rebuild of its AI stack, and the model now powers Meta AI [^1]. Meta describes it as a natively multimodal reasoning model with tool-use, visual chain of thought, and multi-agent orchestration [^2]. Artificial Analysis scored it **52** on the Intelligence Index, placing it in the top five models it has benchmarked [^3]. On those benchmarks, Muse Spark was also notably token efficient at **58M** output tokens, versus **157M** for Claude Opus 4.6 and **120M** for GPT-5.4 [^3].

The model’s strongest third-party results were in vision and reasoning: **80.5%** on MMMU-Pro and **39.9%** on Humanity’s Last Exam, while agentic performance trailed leaders on GDPval-AA and TerminalBench Hard [^3]. Meta is also gradually rolling out Contemplating mode, which has multiple agents reason in parallel, and says the model is still weaker in long-horizon agentic systems and coding workflows [^4][^5]. Muse Spark is available at [meta.ai](http://meta.ai) and in the Meta AI app, with private preview API access for select partners [^2].

**Impact:** This is both a capability jump and a strategy change. Muse Spark is Meta’s first frontier model since Llama 4 Maverick and its first frontier release that is not open weights [^3].

### Anthropic moves further up the stack with Claude Managed Agents

Anthropic introduced Claude Managed Agents as a public beta on the Claude Platform, positioning it as a way to build and deploy agents at scale [^6]. The product pairs a performance-tuned agent harness with production infrastructure so teams can move from prototype to launch in days [^6]. Anthropic’s engineering blog describes it as a hosted service for long-running agents [^7].

**Impact:** Anthropic is packaging more of the agent stack as a hosted service, shifting competition from model access alone toward runtime, orchestration, and deployment infrastructure.

### APEX-Agents-AA shows how hard real agent work still is

Artificial Analysis launched APEX-Agents-AA, a benchmark based on **452** long-horizon tasks from investment banking, management consulting, and corporate law, using MCP-based tools and pass@1 grading across three runs per task [^8]. The leaderboard is tightly clustered at the top: **GPT-5.4** at **33.3%**, **Claude Opus 4.6** at **33.0%**, and **Gemini 3.1 Pro Preview** at **32%** [^8].
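For readers unfamiliar with the grading scheme, pass@1 averaged over three runs per task reduces to a mean of per-run success rates. The sketch below is an illustration of that arithmetic, not the benchmark's actual harness code:

```python
def pass_at_1(run_results):
    """Compute pass@1 averaged over multiple runs per task.

    run_results: list of tasks, each a list of booleans
    (one per independent run, True = task solved).
    For k=1, the standard pass@k estimator reduces to the
    fraction of successful runs, averaged across tasks.
    """
    per_task = [sum(runs) / len(runs) for runs in run_results]
    return sum(per_task) / len(per_task)
```

Under this scheme, a model that solves a task on one of its three runs earns one-third credit for that task, so clustered scores near 33% can reflect quite different per-task success patterns.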

The implementation runs inside Stirrup, Artificial Analysis’s open-source agent harness, and one outside summary noted a very large gap between proprietary and open-source models on this workload [^8][^9][^10].

**Impact:** The result is a useful reality check. Even the leading models are completing only about one-third of these long-horizon professional tasks.

### DISCO pushes generative AI deeper into experimental science

DISCO is a new diffusion system for joint protein sequence-structure co-design from Mila and Frances Arnold’s Caltech lab, with Yoshua Bengio also highlighting the release [^11][^12]. In the headline example, it engineered an enzyme for selective **C(sp³)–H insertion**—described as one of the most challenging transformations in organic chemistry—using a single plate, without pre-specified catalytic residues, templates, theozymes, or inverse folding [^11].

**Impact:** This is a strong example of multimodal generative modeling moving beyond software tasks into experimentally grounded molecular design.

## Research & Innovation

*Why it matters:* The most interesting research this cycle focused on memory, writing, training methods, and formal reasoning—areas that directly affect whether AI systems become more useful in extended workflows.

### OpenAI reports five more Erdős problem solutions

OpenAI released a paper describing solutions to **five further Erdős problems** using an internal model [^13]. One highlighted result is a counterexample for **Erdős Problem 1091**, and the paper’s Figure 5 was produced by Codex [^13].

### Google’s PaperOrchestra targets automated research writing

PaperOrchestra is a multi-agent system that turns raw ideas, notes, and experimental logs into submission-ready LaTeX manuscripts [^14]. It uses specialized agents for literature synthesis, plot generation, conceptual diagrams, and iterative refinement, and introduces **PaperWritingBench**, built from reverse-engineered materials from **200** top AI conference papers [^14]. In side-by-side human evaluations, it posted **50–68%** absolute win-rate margins on literature review quality and **14–38%** on overall manuscript quality over autonomous baselines [^14].

### MIA treats agent memory as something that evolves during use

The Memory Intelligence Agent combines a non-parametric memory manager, an RL-trained planner, and an executor, with bidirectional conversion between parametric and non-parametric memory plus test-time learning during inference [^15]. Reported gains include up to **9%** improvement for GPT-5.4 on LiveVQA and **31%** average improvement across **11** benchmarks with a lightweight **7B** executor [^15].

### Thinking Mid-training inserts reasoning before post-training

A new Thinking Mid-training recipe adds supervised fine-tuning and reinforcement learning between pretraining and post-training, using interleaved thoughts to teach models when and how to reason [^16]. On base Llama-3-8B, the authors report a **3.2x** improvement on reasoning benchmarks compared with direct RL post-training [^16].

## Products & Launches

*Why it matters:* Product releases were less about generic chat and more about making models cheaper, more grounded, or easier to use in real workflows.

### Qwen3.6 Plus improves Alibaba’s hosted model offering

Alibaba released **Qwen3.6 Plus**, a proprietary model with native vision input and a **1M-token** context window, available through Alibaba Cloud’s API [^17]. Artificial Analysis scored it **50** on the Intelligence Index, up **5 points** from Qwen3.5 397B [^17]. It also improved on agentic and reliability-oriented measures, including **1373 Elo** on GDPval-AA and an AA-Omniscience move from **-30** to **+3** via reduced hallucination [^17][^18].

A notable commercial angle is cost: Artificial Analysis estimated about **$483** to run the full Intelligence Index on Qwen3.6 Plus, versus much higher costs for frontier proprietary peers [^17][^19].

### Google brings notebooks into Gemini

Google is rolling out Notebooks in Gemini as a project workspace where users can organize chats, notes, documents, and PDFs, and get answers grounded in those sources [^20]. The feature syncs with NotebookLM in both directions, so sources added in one appear in the other [^21][^22]. Rollout starts on the web for Google AI Ultra, Pro, and Plus subscribers [^23].

### Cognition ships SWE-1.6 in Windsurf

Cognition released **SWE-1.6**, which it describes as its best model on both intelligence and model UX, matching its Preview model on SWE-Bench Pro while improving behavior on other axes [^24]. It is available in Windsurf with a **200 tok/s** free tier and a **950 tok/s** fast tier [^24][^25].

### LiquidAI targets edge reasoning with LFM2.5-VL-450M

LiquidAI released **LFM2.5-VL-450M**, a vision-language model built for real-time reasoning on edge devices [^26]. It supports bounding boxes, object detection, function calling, and multilingual use across nine languages, and processes a **512×512** image in about **240ms** on-device [^27][^26].

## Industry Moves

*Why it matters:* Labs are making bigger strategic bets on distribution, compute scale, and applied AI programs beyond core model releases.

### Meta shifts its release strategy

Muse Spark is not just a new model. It is Meta’s first frontier release that is not open weights, and Meta is integrating it across Meta AI, Facebook, Instagram, and Threads while saying larger models are already in development [^3][^5].

### xAI outlines a larger training slate at Colossus 2

Elon Musk said Colossus 2 now has **seven models** in training: Imagine V2, two **1T** variants, two **1.5T** variants, a **6T**, and a **10T** model [^28]. In follow-up posts, Musk said the **1T** model is about **2–3 weeks** away, the **1.5T** about **4–5 weeks**, and a pre-training phase about **two months** out [^29][^30].

### OpenAI Foundation commits major Alzheimer’s funding

The OpenAI Foundation said it is taking an end-to-end AI approach to Alzheimer’s, spanning early diagnosis, disease understanding, and drug discovery [^31]. It is finalizing **over $100M** in grants across **six institutions** this month [^31].

## Policy & Regulation

*Why it matters:* Safety disclosures and evaluation frameworks continued to shape de facto standards for deployment.

### Meta publishes a safety framework with Muse Spark

Meta released Muse Spark alongside an **Advanced AI Scaling Framework** that covers evaluation across bio, chem, cyber, and loss-of-control risks before and after mitigations [^32]. In that framework, Muse Spark achieved a **98% bioweapons refusal rate** on BioTier-refuse, which Meta says was the highest among the models it benchmarked [^32]. Meta says this is the start of a safety system designed to scale with future model capability [^33].

### ClawsBench highlights how weak agent safety can still be

ClawsBench measures both capability and safety in stateful agent environments built around tools like Google Workspace CLI and Slack MCP [^34]. One key finding is that scaffolding matters more than model choice: adding skills moved results from **0–8%** to **39–63%** [^34]. Another is that capability and safety can diverge: Opus led capability at **63%** but also tied for the worst unsafe-action rate at **23%**, while GPT-5.4 had the lowest unsafe-action rate at **7%** but only mid-tier task performance [^34]. Only **1** out of **7,224** trials explicitly detected a prompt injection [^34].
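The divergence ClawsBench reports, where a model can lead on task completion while also having the worst unsafe-action rate, follows from tracking the two rates as independent tallies over the same trials. The sketch below is a hypothetical illustration of that bookkeeping (the trial-record shape is assumed, not taken from the benchmark):

```python
from collections import defaultdict

def score_trials(trials):
    """Tally capability and safety separately per model.

    trials: iterable of (model, completed, unsafe) tuples, one per trial,
    where `completed` marks task success and `unsafe` marks whether the
    agent took an unsafe action during the trial.
    Returns {model: (task_success_rate, unsafe_action_rate)}.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n_trials, n_completed, n_unsafe]
    for model, completed, unsafe in trials:
        t = totals[model]
        t[0] += 1
        t[1] += bool(completed)
        t[2] += bool(unsafe)
    return {m: (c / n, u / n) for m, (n, c, u) in totals.items()}
```

Because the two rates share a denominator but count different events, nothing forces them to move together: a model can complete more tasks precisely by acting more aggressively, raising both numbers at once.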

## Quick Takes

*Why it matters:* Smaller releases still showed rapid movement in video generation, developer tooling, model serving, and workflow automation.

- **ByteDance’s Dreamina Seedance 2.0** moved to **#1** in Video Arena for both text-to-video and image-to-video, with large gains over its prior version [^35][^36].
- Google added **Flex** and **Priority** service tiers to the Gemini API, including a **50% lower-cost** tier for latency-tolerant workloads and a priority tier for critical apps [^37].
- **W&B Automations** is now live, adding metric alerts, Slack notifications, and webhook-driven actions like triggering eval pipelines or killing failed jobs [^38][^39][^40].
- **Cursor’s** code review agent now learns from PR activity to self-improve in real time; the company says **78%** of issues it finds are resolved before merge [^41].
- **Nomic** and **Muna** released on-device layout models for PDF understanding, with no server, no API key, and local parsing of **500-page** PDFs [^42].
- **SWE-bench** crossed **1 million downloads**; an easier inference stack and **SWE-bench Multimodal** are next [^43].
- **NVIDIA** and **vLLM** submitted the first MLPerf vision-language-model benchmark using vLLM [^44].
- **Runway** added custom voices for Runway Characters, generated from text prompts [^45][^46].

---

### Sources

[^1]: [𝕏 post by @alexandr_wang](https://x.com/alexandr_wang/status/2041909376508985381)
[^2]: [𝕏 post by @AIatMeta](https://x.com/AIatMeta/status/2041910285653737975)
[^3]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/2041913043379220801)
[^4]: [𝕏 post by @AIatMeta](https://x.com/AIatMeta/status/2041910291395760175)
[^5]: [𝕏 post by @AIatMeta](https://x.com/AIatMeta/status/2041910288480649636)
[^6]: [𝕏 post by @claudeai](https://x.com/claudeai/status/2041927687460024721)
[^7]: [𝕏 post by @AnthropicAI](https://x.com/AnthropicAI/status/2041929199976640948)
[^8]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/2041896261826310598)
[^9]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/1998785300433338692)
[^10]: [𝕏 post by @scaling01](https://x.com/scaling01/status/2041909847306723568)
[^11]: [𝕏 post by @jarridrb](https://x.com/jarridrb/status/2041893841301860542)
[^12]: [𝕏 post by @Yoshua_Bengio](https://x.com/Yoshua_Bengio/status/2041929520421151043)
[^13]: [𝕏 post by @mehtaab_sawhney](https://x.com/mehtaab_sawhney/status/2042072817395757467)
[^14]: [𝕏 post by @dair_ai](https://x.com/dair_ai/status/2041897121948000469)
[^15]: [𝕏 post by @omarsar0](https://x.com/omarsar0/status/2041895109252542730)
[^16]: [𝕏 post by @jaseweston](https://x.com/jaseweston/status/2041864833214095484)
[^17]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/2041970925873320203)
[^18]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/2041970932198052018)
[^19]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/2041970934794612897)
[^20]: [𝕏 post by @Google](https://x.com/Google/status/2041984380336279782)
[^21]: [𝕏 post by @Google](https://x.com/Google/status/2041984382861299852)
[^22]: [𝕏 post by @Google](https://x.com/Google/status/2041984377102512293)
[^23]: [𝕏 post by @Google](https://x.com/Google/status/2041984385843405304)
[^24]: [𝕏 post by @cognition](https://x.com/cognition/status/2041588234191552782)
[^25]: [𝕏 post by @cognition](https://x.com/cognition/status/2041944195188093351)
[^26]: [𝕏 post by @liquidai](https://x.com/liquidai/status/2041912441060143251)
[^27]: [𝕏 post by @maximelabonne](https://x.com/maximelabonne/status/2041916697234317361)
[^28]: [𝕏 post by @elonmusk](https://x.com/elonmusk/status/2041754402239975479)
[^29]: [𝕏 post by @elonmusk](https://x.com/elonmusk/status/2041894999823151124)
[^30]: [𝕏 post by @elonmusk](https://x.com/elonmusk/status/2041756412783817079)
[^31]: [𝕏 post by @woj_zaremba](https://x.com/woj_zaremba/status/2041970232521322877)
[^32]: [𝕏 post by @summeryue0](https://x.com/summeryue0/status/2041956901769113948)
[^33]: [𝕏 post by @summeryue0](https://x.com/summeryue0/status/2041956911797727473)
[^34]: [𝕏 post by @xdotli](https://x.com/xdotli/status/2041941268884759001)
[^35]: [𝕏 post by @arena](https://x.com/arena/status/2041713742485045590)
[^36]: [𝕏 post by @arena](https://x.com/arena/status/2041969275196567780)
[^37]: [𝕏 post by @_philschmid](https://x.com/_philschmid/status/2041906698038702196)
[^38]: [𝕏 post by @wandb](https://x.com/wandb/status/2041948335863689338)
[^39]: [𝕏 post by @wandb](https://x.com/wandb/status/2041948419263230189)
[^40]: [𝕏 post by @wandb](https://x.com/wandb/status/2041948680606117904)
[^41]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2041969870234120231)
[^42]: [𝕏 post by @andriy_mulyar](https://x.com/andriy_mulyar/status/2041893915347812710)
[^43]: [𝕏 post by @OfirPress](https://x.com/OfirPress/status/2041889888035864689)
[^44]: [𝕏 post by @vllm_project](https://x.com/vllm_project/status/2042029880217567497)
[^45]: [𝕏 post by @runwayml](https://x.com/runwayml/status/2041940020861223079)
[^46]: [𝕏 post by @c_valenzuelab](https://x.com/c_valenzuelab/status/2041947331080175988)