ZeroNoise

AI High Signal Digest

Daily at 8:00 AM (GMT+00:00 – Europe/London)

by avergin · 1 source

Comprehensive daily briefing on AI developments including research breakthroughs, product launches, industry news, and strategic moves across the artificial intelligence ecosystem

Minimalist RL recipes, self-play coding agents, and shifting compute competition signals
26 December 2025
9 minutes read
This update highlights a push toward simpler and more scalable RL training (JustRL, self-play SWE-RL), new small-model and coding releases, and fresh signals in the compute race—from Nvidia stack performance claims to China’s H200-class ambitions and Huawei’s Ascend 950 expansion.

Top Stories

1) JustRL: a minimal, fixed-hyperparameter RL recipe hits strong 1.5B reasoning results

Why it matters: If a single-stage, fixed-hyperparameter RL setup can reliably outperform more complex pipelines, it lowers operational complexity and compute cost for improving smaller reasoning models.

  • New work introduces JustRL, described as a minimal RL recipe using single-stage training with fixed hyperparameters (basic GRPO) rather than multi-stage pipelines, dynamic schedules, curriculum learning, or length penalties.
  • Reported results include 54.9% average accuracy across nine math benchmarks for JustRL-DeepSeek-1.5B, and 64.3% for JustRL-Nemotron-1.5B.
  • The post claims roughly half the compute of more sophisticated approaches, and on AIME 2024 accuracy improves from 28% to 58% over 4,000 steps without the typical collapses or plateaus.
  • Ablations reported that adding “standard tricks” like explicit length penalties and robust verifiers degraded performance by collapsing exploration; the model naturally compresses responses from 8,000 to 4,000–5,000 tokens without a penalty term.
  • Paper: https://arxiv.org/abs/2512.16649.
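The “basic GRPO” at the core of such a recipe can be sketched as a group-relative advantage computation. This is a minimal illustration of the GRPO idea, not JustRL’s actual code:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response is scored
    against the mean and std of its own sampling group, so no learned
    value network, schedule, or curriculum is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled answers scored 1.0 (correct) / 0.0 (wrong):
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers get positive advantages, wrong ones negative, and each group’s advantages sum to zero; the fixed-hyperparameter claim amounts to keeping this computation and the optimizer settings unchanged for the entire run.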

2) Self-play SWE-RL: one coding agent learns by injecting and repairing bugs in real repos

Why it matters: Training that doesn’t rely on human-labeled issues or tests could be a lever for scaling continuous self-improvement in software agents.

  • Self-play SWE-RL (SSR) trains a single LLM agent to self-play between bug-injection and bug-repair, grounded in real-world repositories, with no human-labeled issues or tests.
  • The approach is presented as enabling software agents to self-improve via self-play RL.
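The injection/repair pairing described above can be sketched as a zero-sum round between two roles played by one policy. Everything here (function names, the reward pairing, the toy instantiation) is an illustrative assumption, not SSR’s published setup:

```python
def self_play_round(codebase, inject, repair, run_tests):
    """One self-play round: the injector is rewarded when the repairer
    fails, the repairer when the test suite passes again, so neither
    role needs human-labeled issues or tests."""
    broken = inject(codebase)      # bug-injection move
    patched = repair(broken)       # bug-repair move
    repair_reward = 1.0 if run_tests(patched) else 0.0
    inject_reward = 1.0 - repair_reward
    return inject_reward, repair_reward

# Toy instantiation: the "bug" appends "!", the repairer strips it.
rewards = self_play_round(
    "ok",
    inject=lambda code: code + "!",
    repair=lambda code: code.rstrip("!"),
    run_tests=lambda code: code == "ok",
)
```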

3) Pure-RL small model release: LFM2-2.6B-Exp ships as an experimental checkpoint

Why it matters: If “pure RL” can yield strong instruction-following/knowledge/math at ~3B scale (even as an imperfect artifact), it expands the practical design space for small, deployable models.

  • LFM2-2.6B-Exp is an experimental checkpoint built using pure reinforcement learning, with no SFT warm-up or distillation.
  • Training is described as sequential RL starting with instruction following, then expanding to knowledge, math, and a bit of tool use.
  • Reported performance: trained specifically on instruction following/knowledge/math and described as particularly strong vs other ~3B models; an example claim says its IFBench score surpasses DeepSeek R1-0528 (263× larger).
  • Caveat: it’s “not polished” and “can doom loop,” but is described as surprisingly strong and still significantly improvable.
  • Available on Hugging Face: https://huggingface.co/LiquidAI/LFM2-2.6B-Exp.

4) Training hardware reality check: Blackwell Ultra vs TPU + “real bottlenecks” framing

Why it matters: Teams planning training infrastructure need to optimize around memory/data-movement/synchronization—not just FLOPS—and vendors are increasingly arguing the stack is the moat.

  • A post cites vendor results claiming NVIDIA Blackwell Ultra delivers up to 1.9× higher training performance per chip vs Ironwood TPU on heavy workloads, attributed to Nvidia’s integrated hardware + interconnect + software stack.
  • Another note emphasizes that “top speed” chip metrics (FLOPS) can look great while real training is limited by memory, data movement, and synchronization bottlenecks.

5) China compute competition signals: “H200-class” timelines + Huawei Ascend 950 in South Korea

Why it matters: Multiple threads point to near-term pressure on compute supply chains and to efforts to establish ecosystems outside China.

  • A shared ByteDance expert-call transcript claims China could become self-sufficient on H200-class performance in six months. It also claims ByteDance expects its lead to last 5–6 months, and that after May–June 2026 several domestic chips may approach or exceed H200 performance (excluding price-performance and supply-chain stability).
  • Examples cited: next-gen Baidu Kunlun reportedly targeting B200 specs, and Cambricon MLU790 expected to rival H200 single-card performance.
  • Separately, Huawei Korea plans to launch Ascend 950 in South Korea. Commentary frames the goal as avoiding ecosystem stagnation by gaining a foothold outside China, with a possible Korea angle on HBM cooperation.

Research & Innovation

Why it matters: This set of updates is about (1) scaling RL-driven agent training, (2) sharpening what “reasoning progress” means via benchmarks, and (3) new papers across representation learning and multimodal planning.

Scaling agentic RL runs for code

  • One practitioner described orchestrating a “massive agentic RL training run,” with hundreds of inference nodes generating code at millions of tokens per second, thousands of sandboxes executing code in parallel, and training nodes learning from rewards.

ARC-AGI roadmap: benchmark as “compass,” not an AGI threshold

  • François Chollet frames the ARC-AGI series as not an AGI threshold but a compass pointing the research community toward key questions.
  • He describes ARC-AGI-1 as a minimal fluid-intelligence test requiring a move past “pretraining scaling + static models at inference” toward test-time adaptation.
  • ARC-AGI-2 is described as probing deeper reasoning complexity (especially concept composition) and still solvable in minutes by regular people without tools.
  • ARC-AGI-3 (launching March 2026) is described as probing interactive reasoning: exploring unknown environments, modeling them, setting self-goals, and planning/executing autonomously without instructions. Work has also started on ARC-AGI-4 and ARC-AGI-5.
  • He notes that saturating ARC-AGI-1 or 2 does not mean we have AGI.

RL post-training as a tool-use workaround for attention limitations (opinion)

  • One researcher expects sufficient RL post-training can overcome many linear attention deficits, e.g., models learning to read files with tools more often to keep important info in context.

Products & Launches

Why it matters: Shipping paths (CLIs, hosted inference, IDE integrations) increasingly determine adoption—and many updates are about making models easier to use (or easier to evaluate) rather than changing architecture.

GroqCloud adds Kimi K2 0905 with posted throughput and pricing

  • Groq introduced Kimi K2 0905 on GroqCloud.
  • Posted performance/pricing: “200+ T/s” at a blended $1.50/M tokens ($1.00/M input, $3.00/M output).
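As a sanity check on the posted figures, a blended rate is a traffic-weighted average of the input and output prices. A 3:1 input:output token mix reproduces the $1.50 figure; the mix itself is an assumption, since the posting does not state Groq’s weighting:

```python
def blended_price(input_price, output_price, input_share):
    """Blended $/M tokens given the share of traffic that is input tokens."""
    return input_price * input_share + output_price * (1.0 - input_share)

# $1.00/M input, $3.00/M output, 75% input tokens:
rate = blended_price(1.00, 3.00, 0.75)  # -> 1.50
```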

MiniMax M2.1: hands-on evaluation, demos, and broader tooling access

  • A detailed hands-on review describes MiniMax M2.1 as a new multilingual coding SOTA and reports a “rocky evaluation process” (routing issues, deployment bugs, multiple retests), with a verdict of coding gains but some regression vs M2 in logical reasoning.
    • Improvements: more stable/precise instruction following; stronger coding & engineering, especially frontend/app development; “closer to Claude-style engineering”; strong one-shot generation, but weak self-debugging.
    • “Roughly the same”: long-chain reasoning trades wins with M2 but uses fewer tokens (suggesting better reasoning efficiency); hallucinations similar and degrade with longer context.
    • Regressions: noticeable drop in math, and a “major loss” in spatial intelligence vs M2.
    • Review links: https://www.zhihu.com/question/1986842290577224663/answer/1987302908056856323 and https://zhuanlan.zhihu.com/p/1987656504711586985.
  • Demo: MiniMax-M2.1 fully generates a “Voxelmas” Christmas scene; shared at https://2ibfysp9zmbv.space.minimax.io/.
  • Distribution: M2.1 is now available in the Blackbox CLI, with a getting-started page at https://docs.blackbox.ai/features/blackbox-cli/introduction.
  • MiniMax also credits experts-in-the-loop as key to M2.1’s multilingual coding strength and says it plans to open-source soon.

Kling VIDEO 2.6: upgraded Motion Control

  • Kling says upgraded Motion Control is live in Kling VIDEO 2.6, offering “full control over every action & expression”.
  • Claimed capabilities include detailed full-body motion, fast/complex action support, precise hand gestures, and expressive faces with lip sync. It also supports uploading 3–30s motion references and refining scene details via text prompts.

New developer tools

  • skills-cli: a new CLI to manage agent skills, including validating/converting to prompts (like Anthropic’s skills-ref) plus creating blank skills, pushing to Anthropic, installing to codex/claude folders, and pulling from GitHub. Repo: https://github.com/taylorai/skills-cli.
  • AI Studio UX changes (ETAs): docs integrated into AI Studio (Q1), auto-create projects/keys for existing users (2nd week of Jan), auto-select project (1st week of Jan), auto-populate key names (2nd week of Jan), and billing setup directly in AI Studio (initial rollout Jan 20).

Industry Moves

Why it matters: Competitive advantage is shifting toward integrated stacks (chips + memory + software), distribution surfaces, and reliable inference/benchmarking.

Memory economics: HBM demand vs consumer RAM pricing

  • A post claims memory suppliers like SK Hynix and Micron find selling HBM to Nvidia more profitable than selling consumer DDR5.
  • Shared price comparison: a 256GB RAM stick listed at $4,799.99, versus Nvidia DGX Spark at $3,999.99 and Mac Studio (M4 Max, 128GB unified memory) at $3,329.99.

Consumer attention: “time spent” leadership changes among GenAI websites

  • Similarweb data shared on “time spent on leading Gen AI websites” says Gemini surpassed ChatGPT in September, and Grok surpassed Gemini in October and led for the past two months.

Benchmark reliability: provider-side errors affecting scores

  • Epoch AI Research notes that for tested models, some providers had errors that affected benchmark scores, with recently released models more impacted. A follow-on post calls it a “serious issue” and asks for guidance on avoiding quality loss during inference.

Talent and ecosystem constraints (Japan commentary)

  • One thread argues Japan has sufficient compute to lease it out but “seriously lack[s] talents to utilize the compute,” citing recruitment difficulty and factors like limited US experience, low English literacy/disengagement from global AI communities, and weaker SWE competition/culture vs China/US.

Policy & Regulation

Why it matters: In the absence of concrete regulatory actions in this set of sources, the signal is in strategic autonomy narratives—how governments and commentators frame “sovereign AI” as a national capability.

  • UK commentary argues the UK needs “sovereign AI” (alongside other sovereign defense capabilities), citing recent signals of US intent. A reply argues feasibility requires stopping deindustrialization and talent bleed.
  • Sakana AI’s COO delivered a keynote at an Abu Dhabi symposium themed “Who decides the future of AI?”, under the symposium title “Blueprint for Breakthrough: Japan-UAE Cooperation on Artificial Intelligence and Space”. Video link shared: https://www.youtube.com/live/xPY8XlUyoxk&t=5000.

Quick Takes

Why it matters: These are smaller datapoints that hint at near-term adoption drivers (feedback loops, tooling, and “agentic” workflows) and broader capability perceptions.

  • Grok on X as “truth-friendliness” tooling: Vitalik Buterin says easy Grok-calling on Twitter is a major boost to truth-friendliness after community notes, and that not seeing Grok’s response ahead of time helps challenge biased expectations.
  • Grok Imagine adds explicit user ratings: xAI now lets users rate Grok Imagine videos from “hated it” to “loved it,” framed as high-quality signal beyond likes/views; “We’re always looking for critical feedback”.
  • Holiday Codex model variant: Codex launched GPT-5.2-Codex-XMas, stated to perform exactly the same as GPT-5.2-Codex but with a seasonal personality upgrade (“Santa Codex”). Usage: $ codex -m gpt-5.2-codex-xmas.
  • “Special tier” code-auditing models (opinion): One post says models 5.2, Speciale (and sometimes Opus) excel at auditing large “vibecoded” artifacts, and suggests a “phase transition” where reasoning becomes a primary mode of thinking/testing complex things.
  • AGI evaluation framing: “Prediction and discovery are the hardest-to-fake benchmarks for AGI”.
  • General intelligence debate: Saining Xie argues human intelligence is better seen as “socially driven cognitive adaptations” and says current AI is “nowhere near” recreating much of intelligence; Yann LeCun echoes that intelligence is multidimensional and “None is general”; Demis Hassabis argues brains are “extremely general” and approximate Turing Machines capable of learning anything computable given time/memory/data.
  • Creative writing anxiety: A New Yorker essay describes a writer finding an AI model’s imitation “eerily close,” with readers unable to correctly identify any of the excerpts.
  • Fast quadruped robotics: MirrorMe Technology’s Black Panther II robot dog hit 13.4 m/s peak speed (also stated as 48.24 km/h / 30 mph).
  • Autonomous driving anecdotes: One post claims Tesla FSD V14 avoided a t-bone collision; another reports interruption-free home-to-office drives and encountering a self-driving Waymo driving in the opposite lane.
  • “Encode for machine, not encode for human”: Shawn Shen (memories_ai) argues for rethinking compression and indexing in AI in an interview; YouTube link shared.
NVIDIA’s robotics milestone stack, the Nvidia–Groq inference deal, and OpenAI’s GPT‑5.2 refresh
25 December 2025
9 minutes read
This digest covers major AI developments across strategy, research, and products: the Nvidia–Groq inference deal (with conflicting acquisition vs licensing narratives), NVIDIA GEAR’s robotics model stack, OpenAI’s broad December model refresh, xAI’s datacenter buildout, and a key Italian WhatsApp/AI competition decision.

Top Stories

1) Nvidia–Groq: acquisition claims vs. Groq’s stated “non-exclusive licensing” + team move

Why it matters: Inference is becoming a primary battleground; whether this is a full acquisition or a licensing + talent transfer, the outcome likely shapes how fast inference hardware and software stacks consolidate.

  • A viral post claimed: “Nvidia is buying Groq for $20B in cash”.
  • Groq’s Jonathan Ross stated Groq entered a non-exclusive licensing agreement with Nvidia for Groq’s inference technology; he and other Groq team members will join Nvidia to help integrate the licensed tech; GroqCloud will continue to operate without interruption. (Press release link shared: https://groq.com/newsroom/groq-and-nvidia-enter-non-exclusive-inference-technology-licensing-agreement-to-accelerate-ai-inference-at-global-scale.)
  • Commentary framed it as “not an acquisition” but “smarter” licensing + “pulling the brains in-house”, while others still described it as a $20B “acqui-hire” or as neutralizing a potential narrative/valuation threat.

2) NVIDIA GEAR’s 2025 robotics releases: open humanoid foundation models + synthetic data engines

Why it matters: This thread lays out a coherent “robot learning stack” roadmap—foundation models, world models, whole-body control, and sim2real recipes—framed as building blocks others can reuse.

From an NVIDIA GEAR lab thread, milestones include:

  • GR00T N1 (open-sourced): described as a 2B-parameter foundation model for humanoids trained on teleop data, large-scale simulation (including 300K+ trajectories open-sourced), “neural trajectories” (synthetic video data), and latent action extraction; presented as an end-to-end model from “photons to actions,” combining a VLM “System 2” planner with a diffusion transformer “System 1” action generator at 120 Hz.
  • GR00T N1.5 / N1.6 (open-sourced): iterations intended to improve motion smoothness, language following, and cross-embodiment capability, with links shared for N1.5 and N1.6.
  • DreamGen / GR00T Dreams: a “dreaming” approach that uses video generation models to generate synthetic robot trajectories; reported results include 0% → 43% success on novel verbs and 0% → 28% in unseen environments in described tests.
  • SONIC: humanoid whole-body motion tracking work described as trained with 9k+ GPU hours and 100M+ motion frames, plus a universal kinematic planner and VR teleop modes, and integration with VLA models.
  • PLD: a recipe combining residual real-world RL with supervised fine-tuning to let VLA models self-improve for high-precision manipulation.
  • VIRAL / DoorMan: sim-to-real claims emphasizing zero teleop and zero real-world data for “autonomous humanoid loco-manipulation in reality”, and a separate RGB-only door-opening effort (“DoorMan”) trained on “100% sim data”.

3) OpenAI’s December developer updates: GPT‑5.2 (+Codex), image, and audio model snapshots

Why it matters: This is a broad model+API refresh across text, coding, image editing, transcription, and realtime speech—directly affecting what developers can ship.

A summary of OpenAI’s December updates for developers includes:

  • GPT-5.2 / gpt-5.2-chat-latest: described as a “flagship jump” with stronger long-context, agentic tool-calling, and vision, adding “xhigh” reasoning, context compaction, and concise reasoning summaries.
  • GPT-5.2-Codex: positioned for long-horizon engineering, with context compaction for multi-window work and improved performance on big refactors/migrations, plus a “defensive cyber” focus.
  • GPT-Image-1.5: described as improving instruction following and logo/key-visual preservation in edits, with better text rendering and ~20% cheaper vs GPT Image 1.
  • Audio model updates (noted as same API costs): includes a transcribe model claiming 89% fewer hallucinations vs whisper-1, and a realtime model claiming +18.6pp instruction-following and +12.9pp tool-calling vs the prior snapshot, plus interruptions/VAD support and background function calls while speaking.

4) xAI’s “Colossus 2” datacenter: visible buildout toward ~400MW near-term and “>2GW” trajectory

Why it matters: Power and cooling are increasingly the limiting factor for frontier compute; this is a concrete infrastructure signal, not a model announcement.

A post citing satellite imagery reported xAI painted “MACROHARD” across the roof of its “Colossus 2” datacenter in Tennessee. The same thread described the campus pushing toward ~400 MW active capacity in the near term, with gas turbines arriving and more cooling towers, and claimed a “credible trajectory” toward more than 2 GW of IT capacity over time via on-site generation, ordered hardware, and permitted infrastructure.

5) Italy blocks Meta’s reported plan to ban Meta AI competitors from WhatsApp

Why it matters: Distribution inside major messaging surfaces can decide winners; this is an early signal of how regulators may treat “AI competitors” inside platform ecosystems.

An update said Italy’s Competition Authority blocked Meta’s plan to ban Meta AI competitors from WhatsApp. The same post described a hearing with Meta, Interaction, and OpenAI present and said Italy agreed Meta’s defense was “Groundless”, with the poster arguing that whether messages are “user-to-user (free) or AI-to-user (paid!) makes zero technical difference” given WhatsApp API payments. The post also noted hope for EU-wide interim measures before a Jan 15 ban date.


Research & Innovation

Why it matters: Several threads converge on a theme: multi-turn agents require RL + careful engineering (credit assignment, tool environments, stability), while eval and interpretability work tries to keep up.

Agent-R1: end-to-end RL for tool-using, multi-turn LLM agents

A shared summary described Agent-R1 as a framework for training LLM agents with end-to-end reinforcement learning over multi-turn interactions, motivated by the limitations of ReAct-style loops and fixed workflows. It frames tool use as introducing stochastic state transitions and extends the MDP formulation to include full interaction history and environmental feedback, with dense rewards.

Reported results on multi-hop QA show RL-trained agents outperforming baselines: GRPO achieved 0.3877 average EM vs 0.1328 for RAG, described as up to 2.5× better. An ablation reported disabling an “advantage mask” dropped PPO from 0.3719 to 0.3136, and disabling a loss mask reduced it further to 0.3022. Paper link: https://arxiv.org/abs/2511.14460.
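The EM figures above are averages of per-example exact match. A minimal sketch follows; the lowercase/strip normalization is an illustrative assumption, since benchmarks differ in those details:

```python
def exact_match(prediction, gold):
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def average_em(pairs):
    """Average EM over (prediction, gold) pairs, as reported above."""
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)

score = average_em([("Paris", "paris"), ("Lyon", "Marseille")])  # -> 0.5
```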

Multi-turn search RL engineering for Qwen3 8B / A3B

One post shared engineering practices that boosted Qwen3 8B and Qwen3 A3B from 1–2 turns and 10% accuracy on Browsecomp-Plus to 15+ / 20+ turns with 30% accuracy.

A separate technical breakdown attributed improvements in multi-turn agentic search to GRPO-style training and stabilization techniques (trajectory denoising, train/inference parity, and multi-turn synthetic trajectories).

AI-Driven Research for Systems (ADRS): LLMs that iterate on systems algorithms with automated verification

A shared summary described ADRS as a framework where LLMs generate, evaluate, and refine algorithms for systems performance problems automatically. Across ten tasks, the post highlighted outcomes including 13× faster MoE load balancing vs a best-known proprietary implementation, 35% greater cost savings for multi-region cloud scheduling with spot instances vs an expert baseline, and 60% makespan improvement in offline transaction scheduling vs state-of-the-art. It also reported that most tasks completed in under 5 hours for <$30. Paper link: https://arxiv.org/abs/2512.14806.

Attention scaling debates: L2 normalization critique + alternative stabilizers

A thread disputed claims that L2-normalizing attention weights is variance-preserving, arguing it only holds under uncorrelated value vectors; with correlated values, variance depends on both ‖A‖₂ and ‖A‖₁, and L2 normalization can cause length-dependent blow-up. A concrete example with uniform attention yielded variance growing as (1 − ρ) + ρN under L2 normalization.
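The uniform-attention example can be reconstructed as follows, as a sketch consistent with the quoted growth term, assuming values with common variance $\sigma^2$ and pairwise correlation $\rho$. With uniform attention over $N$ positions, L2 normalization rescales each weight to $a_i = (1/N)/\|A\|_2 = 1/\sqrt{N}$, so:

```latex
\operatorname{Var}\Big(\sum_i a_i v_i\Big)
  = \sum_i a_i^2\,\sigma^2 + \sum_{i\neq j} a_i a_j\,\rho\,\sigma^2
  = \sigma^2\big(1 + (N-1)\rho\big)
  = \sigma^2\big((1-\rho) + \rho N\big)
```

For $\rho = 0$ this stays at $\sigma^2$ (the variance-preserving case), but for any $\rho > 0$ it grows linearly in $N$, which is the length-dependent blow-up the thread describes.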

Separately, “Attention Z-Reg” was described as adding a loss term penalizing the absolute value of attention logits to keep numeric ranges near zero.
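As described, Z-Reg is just an auxiliary penalty on logit magnitude added to the task loss. A minimal plain-Python sketch; the mean-absolute form and coefficient are assumptions beyond “penalizing the absolute value of attention logits”:

```python
def z_reg_loss(attn_logits, coeff=1e-4):
    """Auxiliary regularizer: coeff times the mean absolute value of the
    pre-softmax attention logits, pulling their numeric range toward zero."""
    flat = [abs(x) for row in attn_logits for x in row]
    return coeff * sum(flat) / len(flat)

# 2x2 toy logits with mean |logit| = 3.0:
penalty = z_reg_loss([[2.0, -4.0], [0.0, 6.0]], coeff=0.5)  # -> 1.5
```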


Products & Launches

Why it matters: Tooling and distribution are accelerating: coding workspaces are shipping parallel-agent UX, open image-edit models are landing in multiple frontends, and usage limits/costs are shifting.

Windsurf Wave 13 (Cognition): parallel agents + free SWE-1.5 for 3 months

Wave 13 (“Shipmas Edition”) includes:

  • SWE-1.5 Free: “full intelligence” at standard throughput, free for the next 3 months.
  • True parallel agents with Git Worktrees plus multi-pane/multi-tab Cascade.
  • A dedicated terminal for more reliable command execution.

Blog link: https://windsurf.com/blog/windsurf-wave-13.

Qwen Image Edit 2511 expands integrations (Replicate, ComfyUI, TostUI) + finetuning support

  • Qwen-Image-Edit-2511 launched as an enhanced version with “notably better consistency” and is live on Replicate.
  • ComfyUI announced Qwen Image Edit 2511 and Qwen Image Layered availability; Qwen Image Layered is described as decomposing images into editable RGBA layers.
  • TostUI shared a Docker launch path for Qwen-Image-Edit-2511 and said it was tested on RTX 3090/4090/5090 on Windows and Linux (8-bit).
  • A separate post said you can train LoRAs for Qwen Image Edit 2511 with AI Toolkit, and cited a “3bit Accuracy Recovery Adapter” enabling finetuning at 3-bit with <24GB VRAM.

MiniMax M2.1 distribution: BlackboxAI + YouWare + user demos

  • BlackboxAI said 30 million developers now have access to MiniMax M2.1 on its platform.
  • YouWare announced M2.1 is live for “agentic workflows”.
  • Examples shared include a 3D gesture-controlled Christmas tree demo built with M2.1 (link: https://yuyl27wq92.space.minimax.io/) and a separate “dirty window” canvas project built with Trae_ai SOLO + M2.1 using “just two prompts” and “one console-log bug fix”.

OpenAI Apps SDK: “Your Year in ChatGPT” as a demo app pattern

A post noted “Your year with ChatGPT” shipped as a full-screen experience built with the new Apps SDK; another described it as a demo ChatGPT app illustrating the kind of experiences others can build.

Usage/cost changes: Claude limits and TextQL compute

  • Claude said Pro and Max plans will have usage limits through New Year’s Eve.
  • TextQL announced “TextQL compute is now 80% cheaper,” citing that AI agents query warehouses “10× more” and the company decided customers shouldn’t pay for that increase. It also reported cache hit rate improvements (40% → 52%) to reduce token costs, and a net 30%+ reduction in total costs for most customers.

Industry Moves

Why it matters: Compute supply (and the politics around it) is still a gating factor, and major players are making stack-level bets across chips, licensing, and robotics.

H200 to China: pricing and performance claims

A post citing Chinese media said Nvidia’s H200 sales to China are “virtually confirmed,” and that Jensen Huang is reportedly scheduled to visit China in Jan 2026. It also cited pricing for an H200 8-card module at 1.4M yuan (~$200k).

Another thread emphasized a performance-per-density metric (TPP), claiming H200 compute value of 15,832—about 6.7× H20—while price is only ~1.3× higher.
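Taken at face value, the two multiples reduce to a performance-per-price ratio (a worked check on the thread’s numbers, not a figure from the source):

```python
def perf_per_price(perf_multiple, price_multiple):
    """Relative value vs the baseline chip: performance gain per price gain."""
    return perf_multiple / price_multiple

# H200 vs H20 using the thread's multiples: ~6.7x compute at ~1.3x price,
# i.e. roughly 5.2x the performance per unit price.
ratio = perf_per_price(6.7, 1.3)
```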

“NVIDIA is doing ASICs”

A post claimed Nvidia is “doing ASICs”.

Robotics: positioning “Physical Turing Test” as a core mission

In the NVIDIA GEAR thread, the author described a “singular mission to solve the Physical Turing Test for robotics” and highlighted the lab’s scope across foundation models, world models, simulation, whole-body control, and RL.


Policy & Regulation

Why it matters: AI competition is increasingly mediated by platform access and antitrust—especially where “default” distribution sits inside messaging and app ecosystems.

  • Italy’s Competition Authority blocked Meta’s plan to ban Meta AI competitors from WhatsApp, describing Meta’s defense as “Groundless”. The same post suggested the European Commission is moving quickly, calling it one of just five antitrust cases opened by the EC this year and expressing hope for EU-wide interim measures before Jan 15.

Quick Takes

Why it matters: These smaller datapoints hint at where capabilities and developer workflows are trending (cheap automation, open model packaging, and evaluation reliability).

  • LMArena GPT-5 tracking: Arena reported GPT-5.2 (no-reasoning) at #14, with improvements in hard prompts/coding/instruction following but small dips in writing and business/finance; GPT-5.1-High peaked at #8 and GPT-5.2-High was described as closer to original GPT-5-High in several categories.
  • Gemini 3 Flash Connect 4: a post claimed it won 49/50 games and “dominated” other SOTA LLMs in Connect 4.
  • Pure CSS generation: Claude Opus 4.5 (Claude Code) generated a “pure CSS animation of a bat” zero-shot; the author compared it to “Alex the CSS husky” and hosted demos + code.
  • Provider eval pain: @xeophon described provider evaluation as “a mess” with issues like rate limits, timeouts, and missing parameters.
  • Bloom (Anthropic): an open-source tool that auto-generates behavioral evaluations by crafting and judging scenarios (e.g., sycophancy, sabotage).
  • LoRA training for Qwen Image Edit 2511: AI Toolkit-based LoRA training and a 3-bit adapter were highlighted as reducing VRAM requirements for finetuning.
ARC-AGI-2 jumps to 75% (reported), CoT monitorability lands, and agents consolidate into platforms
24 December 2025
8 minutes read
Key developments include a reported 75% ARC-AGI-2 result using GPT-5.2 X-High, OpenAI’s new framework for chain-of-thought monitorability, and ClickUp’s acquisition of Codegen as agentic workflows consolidate into platforms. Also covered: new benchmarks on API-calling reliability, major image/video model rollouts, and fresh policy scrutiny around DeepSeek and NVIDIA chip use.

Top Stories

1) ARC-AGI-2: Poetiq reports a 75% score at under $8/problem using GPT-5.2 X-High

Why it matters: If reproducible, this is a major jump on a headline reasoning benchmark, and it reinforces how much performance can move from systems + prompting (not just base model changes).

Poetiq AI says it ran its existing “Poetiq harness” with GPT-5.2 X-High on ARC-AGI-2, reaching 75% on the full PUBLIC-EVAL dataset at under $8/problem, which it says beats the previous SOTA by ~15 percentage points. A separate post characterizes this as exceeding the human baseline.

Reaction in the thread frames it as an unexpectedly large jump for 2025, with one commenter attributing the jump to “a good prompting system” and expecting more jumps soon. Another reply remains skeptical that the approach generalizes, while noting rapid recent progress from 20–30% a month ago to close to 80%.

2) OpenAI publishes a framework for “chain-of-thought monitorability”

Why it matters: As models become more agentic, “can we understand what they’re thinking before they act?” becomes a practical safety and debugging question—not just a research topic.

OpenAI introduced a framework for evaluating chain-of-thought monitorability, described as assessing whether we can understand AI reasoning before actions. Reported takeaways include: longer reasoning helps, bigger models muddle things, and “thinking out loud” may become a key safety layer as AI scales.

Resources: the post links to the announcement and paper.

3) ClickUp acquires AI coding startup Codegen; founder joins as Head of AI

Why it matters: “Coding agents” are increasingly being positioned as general knowledge-work agents, and incumbents with integrated work surfaces can become a natural distribution point.

Codegen’s founder announced that Codegen has been acquired by ClickUp, and that he will join ClickUp as Head of AI with the team/product continuing inside ClickUp. He describes Codegen’s original goal as “level 5 self-driving” for software engineering—navigating large codebases, collaborating with developers, and shipping substantial software end-to-end.

The thread argues that code agents extend beyond coding into generalist work (e.g., spreadsheets, simulations) and that ClickUp’s “converged workspace” (docs, tasks, chat, sheets) enables agents to operate across workflows without silos.

4) Epoch AI: evidence that AI capability progress accelerated in 2024

Why it matters: If improvements are arriving faster, evaluation, product cycles, and safety work all need to keep pace.

Epoch AI reports that, per its Epoch Capabilities Index, frontier model improvement nearly doubled in 2024 from ~8 points/year to ~15 points/year. It says this coincides with the rise of reasoning models and increased reinforcement learning focus, and notes the METR Time Horizon benchmark shows a similar pattern with a 40% acceleration in October 2024. Separate discussion notes acceleration across a composite of major benchmarks as well.

5) Replit appears inside ChatGPT for in-chat app building

Why it matters: Integrations that collapse “IDE + agent + deployment” into the chat surface can change how non-experts build and iterate on software.

A post claims Replit in ChatGPT lets users build “real apps” directly inside ChatGPT with no setup and no tab-switching, by describing what they want.


Research & Innovation

Why it matters: This cycle’s research signals are about (1) making agents reliable in real integrations, (2) specializing multimodal systems for edge use, and (3) improving evaluation quality.

Web APIs remain a brittle spot for code models; constrained decoding is proposed as a fix

New research introduces WAPIIBench, a benchmark for LLM-generated web API invocation code across four real-world APIs (Asana, Google Calendar, Google Sheets, Slack). The thread highlights common failure modes: open-source models solving <40% of tasks, 6–31% illegal arguments even with correct endpoints, and 14–39% hallucinated URLs.

A proposed solution is constrained decoding that translates OpenAPI specs into regex constraints to filter token predictions during generation, aiming to enforce compliance without model changes or prompt adjustments. The post claims correctness gains of 90% (full completion) and 135% (argument completion), with illegal URLs/methods/arguments dropping to zero. Paper link: https://arxiv.org/abs/2509.20172.
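The constrained-decoding idea is easy to picture with a toy prefix constraint. The sketch below uses a hypothetical endpoint list and token vocabulary (the paper's actual machinery compiles OpenAPI specs into regexes; this is a simplified stand-in): at each step, only tokens that can still extend to a spec-legal string survive, so a model biased toward a hallucinated path is steered back onto a real one.

```python
def allowed_next_tokens(prefix, vocab, legal_strings):
    """Tokens t such that prefix + t can still extend to a legal string."""
    return [t for t in vocab
            if any(s.startswith(prefix + t) for s in legal_strings)]

def constrained_greedy_decode(score, vocab, legal_strings):
    """score(prefix, token) -> float stands in for the model's logits."""
    prefix = ""
    while prefix not in legal_strings:
        options = allowed_next_tokens(prefix, vocab, legal_strings)
        if not options:                       # no legal continuation left
            raise ValueError(f"dead end at {prefix!r}")
        prefix += max(options, key=lambda t: score(prefix, t))
    return prefix

# Hypothetical endpoints a spec might allow, and a tiny token vocabulary:
LEGAL = ["GET /tasks", "GET /events", "POST /tasks"]
VOCAB = ["GET ", "POST ", "DELETE ", "/tasks", "/events", "/taskz"]

# A model that prefers the hallucinated "/taskz" is forced onto "/tasks":
biased = lambda prefix, t: {"/taskz": 2.0, "/tasks": 1.0}.get(t, 0.5)
print(constrained_greedy_decode(biased, VOCAB, LEGAL))  # GET /tasks
```

The filter never touches model weights or prompts; it only masks token choices, which is the property the paper's approach relies on.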

Xiaomi: home-centric, edge-deployable VLM with on-device specialization

A technical report describes MiMo-VL-Miloco-7B as a home-centric, edge-deployable VLM built on MiMo-VL-7B, released with GGUF weights. The training approach is described as a two-stage process: SFT with chain-of-thought and token-budget-aware reasoning on curated home data, followed by difficulty-aware GRPO reinforcement learning to restore video/GUI/general reasoning performance.

Reported results include SOTA F1 on home activities/gestures (including an 18-point absolute F1 gain on the Shaka Sign gesture vs. the strongest baseline) and gains on Video-MME, Video-MMMU, and Charades-STA.

Benchmark quality work: finding “flawed questions”

A Stanford AI blog-linked summary highlights a measurement-theoretic framework that identifies flawed questions in AI benchmarks with up to 84% precision, detecting issues across nine datasets.

System-design perspective on reasoning architectures

François Chollet argues that Transformers are fundamentally parallel processors of context, while reasoning is sequential/iterative, and suggests models need an internal “scratchpad” enabling differentiable looping/branching/backtracking beyond output chain-of-thought.


Products & Launches

Why it matters: A clear pattern: new models are shipping directly into platforms (coding agents, creative suites, TTS), shortening the path from release to daily usage.

Coding & agent models: MiniMax M2.1 expands distribution

MiniMax’s M2.1 is presented as a coding & agent model with a 200K context window, 128K max output, and MoE architecture (10B active / 230B total) in Cline. It’s also reported live in multiple tools:

  • Kilo: MiniMax “dropped M2.1” and it’s “already live in Kilo,” with posted metrics including 74.0% SWE-Bench Verified and 91.5 on VIBE-Web.
  • Ollama: ollama run minimax-m2.1:cloud, with an update noting improved performance across Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, and JavaScript.
  • TRAE: available as a custom model via OpenRouter, positioned for multilingual development, long-horizon planning, and complex toolchain execution.

Image + video generation on fal: new endpoints, longer form, and “day-0” availability

  • Kandinsky 5.0 Video Pro: a 19B parameter model live on fal for HD video generation with controllable camera motion (5s and 10s), supporting text-to-video and image-to-video.
  • Seedance 1.5 (ByteDance): released on fal with day-0 availability; capabilities cited include directorial control, synchronized multilingual dialogue with lip sync, and multi-shot sequences with consistent characters.
  • Lucy Restyle Long-Form (DecartAI): long-form video restyling for production use, up to 30 minutes.

Qwen-Image-Edit-2511: broader release + speedups + local formats

Alibaba Qwen describes Qwen-Image-Edit-2511 as a major upgrade with stronger multi-person consistency, built-in community LoRAs, improved identity consistency, and better geometric reasoning, with availability spanning multiple platforms.

Developer workflow updates: VS Code and Claude Code

  • VS Code 1.107: adds Claude skills discovery from ~/.claude/skills/ and workspace .claude/skills/ folders; built-in GitHub MCP Server in Copilot Chat (issues/PRs/repo info via existing GitHub auth); and terminal output rendering directly in chat with preserved output.
  • Claude Code plugins now support LSP servers, providing real-time diagnostics, definition jumps, and type info.
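The folder-based discovery rule above is simple enough to sketch. Assuming skills are directories containing a SKILL.md file, looked up first in the user root and then in the workspace root (an assumed layout for illustration, not VS Code's actual implementation), a scanner might look like:

```python
from pathlib import Path

def discover_skills(roots):
    """Map skill name -> folder; later roots override earlier ones."""
    found = {}
    for root in roots:
        root = Path(root).expanduser()
        if not root.is_dir():
            continue                      # missing root: silently skipped
        for child in sorted(root.iterdir()):
            # A skill is any subfolder that carries a SKILL.md manifest.
            if child.is_dir() and (child / "SKILL.md").is_file():
                found[child.name] = child
    return found

# User-level skills first, workspace skills second (workspace wins on clashes):
skills = discover_skills(["~/.claude/skills", ".claude/skills"])
for name, path in skills.items():
    print(name, "->", path)
```

Listing the workspace root last gives project-local skills precedence over user-global ones, which matches how editors usually layer configuration.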

Industry Moves

Why it matters: Agent ecosystems are consolidating into platforms (work surfaces, inference stacks, and TTS infrastructure), while infra startups draw capital.

vLLM fundraising signal

A post says the startup behind open-source inference framework vLLM is fundraising at least $160M, as VCs look for tech that makes AI systems run more efficiently.

Voice stack heats up: models + “production” infrastructure claims

  • Together AI announced MiniMax Speech 2.6 Turbo, described as multilingual TTS with human-level emotional awareness and sub-250ms latency, plus support for 40+ languages and 10-second voice cloning, and claims around SOC 2/HIPAA-ready/PCI-compliant infrastructure.
  • Alibaba Qwen launched Qwen3-TTS VoiceDesign & VoiceClone, including “3 seconds of audio” cloning and “10 languages” support, plus posted comparisons (e.g., 15% lower WER in multilingual tests).

OpenAI team and product focus notes

  • OpenAI welcomed Ernest Ryu “to help accelerate scientific and mathematical discoveries” with ChatGPT, and Ryu solicited failure cases and success stories from users.
  • OpenAI framed “capability overhang” as gaps between what models can do and what people actually do with them, and said 2026 progress depends on deployment and effective usage alongside frontier research.

New venue for “AI Systems” work: ACM CAIS 2026

The inaugural ACM Conference on AI and Agentic Systems (CAIS 2026) was announced as a home for engineering problems like composing agents, optimizing non-differentiable pipelines, and evaluating/debugging probabilistic systems. It’s scheduled for May 26–29, 2026 (San Jose) with a Feb 27 paper deadline.


Policy & Regulation

Why it matters: Export controls and data-privacy concerns continue to intersect with compute supply and geopolitical competition.

U.S. Select Committee scrutiny on DeepSeek and NVIDIA chip use

A U.S. Select Committee year-in-review post references a bipartisan investigation titled “DeepSeek Unmasked: Exposing the CCP’s Latest Tool for Spying, Stealing, and Subverting U.S. Export Control Restrictions”. It links to a press release demanding answers from NVIDIA over DeepSeek’s chip use.


Quick Takes

Why it matters: Smaller shipping and measurement updates often predict where teams will focus next—speed, reliability, and “agent-ready” workflows.

  • GLM 4.7: debuted #1 open-weight on the Vals Index and #9 overall, cited as +9.5% vs GLM 4.6 and “much lower latency”. It’s also now on Ollama (ollama run glm-4.7:cloud).
  • Provider benchmarking remains messy: Epoch AI notes that benchmark implementations vary across orgs and provider errors can affect scores—especially for recently released models.
  • TurboPuffer: rolled out a new indexing queue on shared regions, claiming ~10× lower index queue time and faster queries on new data, built on object storage (no Kafka).
  • TikTok saturation: one post claims TikTok is “completely flooded” with AI images/videos and most people don’t notice, especially on photos.
  • Robotics demo: “Helix” is shown handing out swag “fully autonomously, no teleop,” and can interact with people via questions/instructions.
GLM-4.7 and MiniMax M2.1 raise the open-model bar as DeepMind ships Gemma Scope 2
23 December 2025
8 minutes read
This issue highlights two fast-moving open model releases (GLM-4.7 and MiniMax M2.1), a major interpretability tooling drop (Gemma Scope 2), and concrete agent progress via an AtCoder contest win. It also includes new security posture detail for agent prompt-injection defense, plus a curated set of research and product updates.

Top Stories

1) GLM-4.7 lands as a new open model contender for coding + tool use

Why it matters: Open models that are competitive on real coding/agent tasks—and easy to serve—raise the baseline for developer tooling and self-hosted agents.

Z.ai released GLM-4.7, describing substantial improvements over GLM-4.6 in coding, complex reasoning, and tool usage, and positioning it as a new open-source SOTA-class release (also improving chat/creative writing/role-play).

Reported results and signals:

  • SWE-bench Verified: 73.8% (vs. Kimi K2 Thinking 73.4%, DeepSeek-V3.2 73.1%; Claude Sonnet 4.5 (closed) 77.2%)
  • Additional claimed edges: 95.7% AIME and 87.4% τ²-Bench
  • LM Arena Code/WebDev: #6 on WebDev leaderboard and #1 open-model spot, +83 points vs GLM-4.6

Adoption/operational notes:

  • Day-0 ecosystem support includes vLLM serving with MTP decode, tool/function calling, and thinking controls.
  • GLM-4.7 is now the default in Cline.
  • Z.ai also introduced/refined multiple “thinking mode” variants (Interleaved / Preserved / Turn-level), and one note recommends benchmarking via the official API due to changes in interleaved thinking.

Links: Weights · Tech blog · Chat


2) MiniMax M2.1 launches as an “agentic coding” open-source model—shipping into tools fast

Why it matters: Coding agents are becoming productized, and models optimized for long context + structured workflows are now arriving with day-0 integrations.

MiniMax announced M2.1 as a coding & agent model aimed at real-world engineering workflows. It’s described as an MoE model (10B active / 230B total parameters) with 200K context and 128K max output, now available in Cline.

Performance/claims highlighted in posts:

  • 72.5% on SWE-multilingual and 88.6% on newly open-sourced VIBE-bench (and claims of exceeding some closed models).
  • MiniMax also claims SOTA across SWE-Verified, SWE-Multilingual, Multi-SWE, VIBE-Bench, and Terminal-Bench 2.0.

Availability rollout:

  • API is available on MiniMax Open Platform, plus a hosted MiniMax Agent experience; a “full open-source release in 2 days” is stated.
  • Ollama support is live (ollama run minimax-m2.1:cloud).
  • LM Arena Code Arena listing is live for head-to-head evaluation (results pending).

3) Google DeepMind releases Gemma Scope 2 (SAEs + transcoders across every layer)

Why it matters: Interpretability tooling that’s “already computed” across all layers lowers the cost of doing serious safety and behavior investigations.

Google DeepMind released Gemma Scope 2: SAEs and transcoders on every layer of every Gemma 3 model (270M–27B, base & chat).

Key usage patterns called out:

  • Use an SAE/cross-coder view to inspect what concepts the model is using (including an experimental cross-coder across four layers).
  • Use transcoders for deeper mechanism work, forming attribution graphs (Neuronpedia support; generating custom tools “coming soon”).

Demo GUI: https://www.neuronpedia.org/gemma-scope-2.


4) Sakana AI’s ALE-Agent wins an AtCoder Heuristic Contest

Why it matters: Competitive performance on hours-long optimization tasks is a tangible milestone for “agentic” search and planning beyond short-form QA.

Sakana AI reports its ALE-Agent (AtCoder account: fishylene) won 1st place in AtCoder Heuristic Contest 058 (ALGO ARTIS Programming Contest 2025 December) on Dec 14, 2025, among 804 participants—described as the first AI agent win in an AHC contest.

They also claim the agent discovered efficient simulated-annealing neighborhood operations beyond the problem setters’ expectations.

Resources: https://sakana.ai/ahc-2025 · https://sakanaai.github.io/fishylene-ahc058/


5) OpenAI details ongoing hardening of ChatGPT Atlas against prompt injection

Why it matters: As agents gain tool access and operate in browser/app environments, prompt injection becomes a practical and continuously evolving security risk.

OpenAI published a post describing how it continuously hardens ChatGPT Atlas (and other agents) against novel prompt-injection attacks, framing this as an ongoing security and frontier research problem. The post highlights investment in automated red teaming, reinforcement learning, and rapid response loops.

Link: https://openai.com/index/hardening-atlas-against-prompt-injection/


Research & Innovation

Training + systems work is converging on “stability under RL”

Why it matters: As RL is used more heavily for agentic behavior, engineering mismatches between training and inference engines are becoming first-order bottlenecks.

  • Rollout Routing Replay (R3) (SGLang + Miles): addresses RL instability for MoE models by recording expert routing decisions during inference and replaying them during training, reducing training–inference discrepancy and preventing collapse. It supports distributed training and lists compatibility with models including Qwen3-30B-A3B and deepseek_v2.
  • A separate thread notes labs working on numerics for RL to match inference-engine logprobs to training-engine logprobs, emphasizing minimizing mismatch without excessive throughput loss.
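The replay trick at the heart of R3 is compact. A toy sketch, assuming a top-k softmax router (illustrative shapes and names, not the Miles/SGLang API): record the expert indices the inference engine chose, then force the training pass to reuse those indices even though its router logits differ slightly.

```python
import math, random

def route(router_logits, k, replay=None):
    """Return (expert_indices, gate_weights). If replay is given, reuse it."""
    if replay is None:
        # Normal path: pick the top-k experts from this engine's logits.
        idx = sorted(range(len(router_logits)),
                     key=lambda i: router_logits[i], reverse=True)[:k]
    else:
        idx = list(replay)              # replayed routing from inference
    # Gate weights are still computed from the local logits (softmax over k).
    w = [math.exp(router_logits[i]) for i in idx]
    s = sum(w)
    return idx, [x / s for x in w]

random.seed(0)
infer_logits = [random.gauss(0, 1) for _ in range(8)]
# Training engine sees slightly different numerics for the same token:
train_logits = [x + random.gauss(0, 0.05) for x in infer_logits]

idx_inf, _ = route(infer_logits, k=2)                 # recorded at rollout
idx_rep, w = route(train_logits, k=2, replay=idx_inf)  # replayed in training
assert idx_rep == idx_inf   # training updates the same experts inference used
```

Keeping the gate weights on the training-engine logits while pinning the indices is one plausible reading of "replaying routing decisions": the discrete choice is frozen, the differentiable part is not.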

Interpretability: extending tools to latent reasoning

Why it matters: If reasoning shifts into “latent” representations, safety and debugging will depend on whether current interp approaches still work.

A small study reports mech-interp techniques can uncover interpretable structure in latent reasoning models (at least on simple math), where latent vectors represent intermediate calculations. Neel Nanda calls the results tentative but encouraging for tooling ahead of potential SOTA adoption.

Benchmarking and evaluation signals

Why it matters: Evaluation is fragmenting into domain-specific suites (long context, medical capability, coding/agent behavior).

  • Epoch AI benchmarked open-weight Chinese models on FrontierMath, reporting Tier 1–3 performance lagging the overall frontier by about seven months, and only DeepSeek-V3.2 (Thinking) scoring non-zero on Tier 4 (1/48 ≈ 2%).
  • A large-scale study of agent framework usage analyzed 1,575 projects and 11,910 discussions, noting that 96% of top-starred agent projects use multiple frameworks and mapping common failure modes (logic failures, termination issues, version-compatibility problems, and RAG latency). Paper: https://arxiv.org/abs/2512.01939.

Selected paper drops (from shared links)

Why it matters: These point to active exploration beyond standard transformer-only text LMs.


Products & Launches

Google pushes Gemini-powered creation and agent runtimes

Why it matters: “Agents” are arriving as consumer-facing workflows (game creation, search experiences) and as developer APIs (state + background execution).

  • YouTube Playables Builder (Gemini 3): a web app for creating bite-sized games from text/video/image prompts. Beta registration is mentioned for users in the US, Canada, Great Britain, and Australia. More: https://goo.gle/youtube-playables-builder.
  • Gemini 3 Pro in Search (AI Mode): described as generating dynamic visual layouts with interactive tools and simulations; access expanded to everyone in the U.S.
  • Interactions API: launched with server-side state management, background execution, and access to the Gemini Deep Research agent. Details: https://blog.google/technology/developers/interactions-api/.

New eval + workflow tooling

Why it matters: Domain evaluation and reliability work is becoming a product surface, not just internal infrastructure.

  • Medmarks v0.1: described as the largest completely open-source automated evaluation suite for medical LLM capability; developed with the MedARC_AI community and PrimeIntellect support, exploring 46 models and spanning 15+ environments. Hub: https://app.primeintellect.ai/dashboard/team/medarc.
  • Cursor v2.3: holiday release focused on bug fixes and reliability, plus easier default layout customization with keybindings. Changelog: https://cursor.com/changelog/2-3.

OpenAI user-facing feature: “Your Year with ChatGPT”

Why it matters: Product personalization is increasingly tied to saved memory and chat history settings.

OpenAI is rolling out “Your Year with ChatGPT” to users in the US, UK, Canada, New Zealand, and Australia who have reference saved memory and reference chat history enabled. Users are told to ensure the app is updated.


Industry Moves

Agent businesses and go-to-market

Why it matters: Revenue and distribution signals help separate “agent demos” from sustained products.

  • Replit Agent: posts describe Replit growing from $10M to >$250M ARR this year, crediting Agent v2+.
  • MiniMax named multiple launch partners (Ollama, FactoryAI, Cline, OpenRouter, Vercel, etc.) as part of the M2.1 rollout.

Compute access and China-facing chip flows

Why it matters: Near-term compute availability continues to shape what models can be trained and where.

  • Reuters is cited reporting NVIDIA will begin shipping H200 chips to China using existing inventory, with initial shipments expected around 40,000–80,000 units.
  • A separate thread discusses Huawei chip production projections and notes that the Atlas 950 SuperCluster (scheduled for Q4 2026) requires 524K chips per system.

National competition framing: Korea’s “LLM tournament arc”

Why it matters: Government-backed GPU allocation and “national champion” dynamics can reshape regional ecosystems.

Posts describe a Korea national-scale LLM competition involving LG, SKT, NAVER, NC AI, and Upstage, competing for NVIDIA Blackwell GPUs. Source article: https://namu.wiki/w/%EA%B5%AD%EA%B0%80%EB%8C%80%ED%91%9C%20AI.


Policy & Regulation

Why it matters: Even without formal rule changes in this cycle, “who gets chips” and national AI programs are policy-adjacent forces that directly constrain capability.

  • NVIDIA H200 shipments to China (via Reuters citation) underscore continuing cross-border compute dynamics ahead of major holidays.
  • Korea’s national-scale “LLM competition” framed around a prize of Blackwell GPUs indicates state-involved resource allocation (labs: LG, SKT, NAVER, NC AI, Upstage).

Quick Takes

Why it matters: Smaller launches and measurement updates often foreshadow broader shifts in where attention and budgets go.

  • ERNIE-5.0-Preview-1203 entered the LMArena Text leaderboard with a score of 1451, described as a 23-point improvement vs the prior preview and strong on creative writing/hard prompts (scores noted as preliminary).
  • T5Gemma 2: Google introduced a next-gen encoder–decoder model built on Gemma 3, highlighting multimodality, extended long context, and 140+ languages; a post notes it uses “three-way weight tying”.
  • Claude 4.5 long-context evals (128k cap): Context Arena added Opus/Sonnet/Haiku 4.5 with Extended Thinking (High budget) and reports results as modest vs current SOTA; 1M Sonnet testing is stated as pending.
  • Kling Video 2.6 Motion Control: fal announced day-0 availability with up to 30 seconds of one-take motion control and synchronized motion/expression/lip sync.
  • Meta SAM for flood monitoring: USRA and USGS fine-tuned Segment Anything Models to automate a bottleneck in real-time river mapping.
  • Transluce fundraiser: Transluce is running an end-of-year fundraiser; Ethan Perez describes it as a top-tier AI safety lab and potential third-party auditor.
vLLM v0.13.0 shipping upgrades, distillation momentum, and sharper agent eval debates
22 December 2025
8 minutes read
Key updates span infrastructure (vLLM v0.13.0 with Blackwell Ultra support), growing emphasis on distillation across LLMs and autonomy, and sharper debate over agent evaluation costs/runtimes. Also included: standout benchmark claims (DeepCode on PaperBench, FACTS Leaderboard), new agent tooling/protocols, and fast-moving open-weights image model competition.

Top Stories

1) vLLM v0.13.0 lands major serving + hardware upgrades

Why it matters: Serving stacks are becoming a competitive layer: kernel selection, prefix caching, KV connectors, and hardware bring-ups translate directly into lower latency and higher throughput for deployed models.

  • Engine core changes include selective kernel compilation via compile_ranges, PrefixLM support for FlexAttention + TritonAttention, CUDA graphs for 3D Triton attention, xxHash for prefix caching, chunked prefill for pooling tasks, and Model Runner V2 updates (min-p sampling, logits NaN detection).
  • Hardware & perf: adds NVIDIA Blackwell Ultra SM103 (GB300) support with CUDA 13.
  • DeepSeek-V3.1 benchmarks (vLLM optimizations): DeepEP High-Throughput CUDA graph enabled by default (+5.3% throughput, +4.4% TTFT), DeepGEMM fused layout kernel (+4.3% throughput, +10.7% TTFT), and group_topk kernel (+1.9% throughput, +2.1% TPOT).
  • Large-scale serving adds items like Mooncake Transfer Engine KV connector, /reset_prefix_cache, KV events, failure recovery config, NIXL handshake checks, and external launcher mode.
  • API updates: Responses API adds MCP type infrastructure, Browser/Container MCP tools, and a full MCP Python loop.

Full release notes: https://github.com/vllm-project/vllm/releases/tag/v0.13.0.
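Of the sampling additions listed above, min-p is easy to state precisely: keep only tokens whose probability is at least min_p times the top token's probability, then renormalize over the survivors. A minimal pure-Python sketch (not vLLM's actual kernel):

```python
import math

def min_p_filter(logits, min_p):
    """Return a renormalized {token_index: prob} dict after min-p pruning."""
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]              # softmax over the vocab
    cutoff = min_p * max(probs)                 # threshold scales with the peak
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}
    z = sum(kept.values())
    return {i: p / z for i, p in kept.items()}  # renormalize survivors

# Two dominant tokens survive; the two low-probability ones are pruned:
dist = min_p_filter([3.0, 2.5, 0.0, -2.0], min_p=0.2)
```

Because the cutoff scales with the top token's probability, min-p prunes aggressively when the model is confident and permissively when the distribution is flat, which is its usual selling point over a fixed top-p.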

2) “Distillation” is showing up as both a technique and a narrative

Why it matters: Multiple threads point to distillation as a lever for cheaper inference (students) while retaining high-cost training-time “teacher” signal—across LLMs and autonomy.

  • Gemini 3 Flash is confirmed to use Distillation Pretraining, and its distillation TL describes Flash as a “huge success”.
  • A separate thread argues privileged teacher distillation (LUPI) can let self-driving systems train with expensive inputs (3D maps/lidar/radar/RGB/IMU) but deploy a lighter student using only RGB/IMU/radar, reducing the need for global HD mapping.
  • A “theme call” predicts: “2025 was the year of rl. 2026 is the year of distillation”.
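The LUPI asymmetry described above can be sketched in a few lines: a teacher that consumes privileged inputs produces targets, and a student restricted to deployment-time inputs is fit to those targets. Everything below (scalar stand-in models, toy data) is illustrative, not a driving stack:

```python
def teacher(rich):                 # privileged inputs: (lidar, rgb)
    lidar, rgb = rich
    return 0.7 * lidar + 0.3 * rgb

def student(rgb, w):               # deployment-time input only: rgb
    return w * rgb

# Toy dataset of (rich_inputs, deployable_input) pairs:
data = [((1.0, 0.9), 0.9), ((2.0, 2.1), 2.1), ((0.5, 0.4), 0.4)]

# Fit the one-parameter student to the teacher's outputs by least squares:
num = sum(teacher(rich) * rgb for rich, rgb in data)
den = sum(rgb * rgb for _, rgb in data)
w = num / den

# Distillation loss: how far the cheap student is from the rich teacher.
distill_loss = sum((teacher(rich) - student(rgb, w)) ** 2 for rich, rgb in data)
```

The key structural point is that the lidar input appears only on the teacher side; at deployment only student(rgb, w) is evaluated, which is what removes the need for the expensive sensor or map.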

3) Agent evaluation is shifting from “scores” to cost/runtime clarity (and critiques)

Why it matters: As agents move into long-running work, practitioners are pushing for evals that expose time and dollar tradeoffs—not just success rates.

  • A METR-related discussion questions unclear reporting and asks for definitions of working_time and usd in raw YAML.
  • One interpretation suggests working_time is minutes to complete the whole benchmark once (1 of 8 attempts) and usd is cost for one run (again 1 of 8 attempts).
  • Another thread flags confusion between reported “32M tokens” and “working time,” with a reply clarifying the 32M number is input+output tokens (not accounting for cache hits) and that agents may use <<1M output tokens per task.

4) Coding/science agents and benchmarks: bigger claims, more structured measurement

Why it matters: New benchmarks and replication-focused evals are starting to make “agent performance” concrete—across coding, scientific reasoning, and factuality.

  • DeepCode reports a 73.5% replication score on OpenAI’s PaperBench benchmark, described as a 70% relative improvement over the best LLM-agent baseline (o1 at 43.3%), and higher than Cursor (58.4%), Claude Code (58.7%), and Codex (40.0%). On a 3-paper subset evaluated by ML PhD students, humans scored 72.4% while DeepCode scored 75.9%.
  • OpenAI introduced FrontierScience, a benchmark for expert-level scientific reasoning across physics, chemistry, and biology.
  • Google’s FACTS Leaderboard is positioned as a factuality suite across four dimensions, with Gemini 3 Pro leading at 68.8% overall (Gemini 2.5 Pro 62.1%, GPT-5 61.8%).

5) Image generation is competing on speed, licensing, and “open weights” rankings

Why it matters: The “best model” conversation is fragmenting into (a) open weights leaderboards, (b) sub-second generation, and (c) licensing + unit economics.

  • Z-Image Turbo is reported as the #1 open-weights text-to-image model on the Artificial Analysis Image Arena leaderboard, surpassing FLUX.2 [dev], HunyuanImage 3.0 (Fal), and Qwen-Image. It’s described as Apache 2.0 licensed and available via API on Alibaba Cloud, fal, and Replicate. It’s also priced at $5/1k images on Alibaba Cloud and described as a 6B model runnable with 16GB memory.
  • FLUX.2 Flash & Turbo are now live on fal and Yupp; the fal announcement describes “timestep-distilled” Flux models with sub-1 second generation.

Research & Innovation

Why it matters: This cycle spans (1) architectural simplification, (2) budget-aware tool agents, (3) compression for long-context, (4) multimodal reasoning recipes, and (5) evaluation suites for factuality and science.

Model architecture and training mechanics

  • Normalization-free Transformers: Derf (Dynamic erf) is introduced as a simple point-wise layer that allows norm-free Transformers to work and “outperform their normalized counterparts”.
  • Reasoning training phases (CMU): Researchers attribute distinct roles to pre-training, mid-training, and RL: RL improves reasoning only in specific conditions; generalizing across contexts needs some pre-training; mid-training matters significantly; and process-aware rewards are essential.
  • Polychromic RL (diversity collapse): A thread claims RL can collapse the entropy distribution of skills (elicit vs learn) and suggests that operating on sets of sequences can penalize diversity collapse and increase creativity.

Budget- and tool-aware agents

  • Budget Aware Test-time Scaling (BATS): On BrowseComp, BATS with Gemini-2.5-Pro reports 24.6% accuracy vs 12.6% for ReAct under identical 100-tool budgets; on BrowseComp-ZH, 46.0% vs 31.5%; on HLE-Search, 27.0% vs 20.5%—all without task-specific training. A “Budget Tracker” variant is reported to match ReAct with 10x less budget (10 vs 100 tool calls) and reduce overall cost by 31.3%.
  • Agentic AI adaptation taxonomy (UIUC/Stanford/Harvard): Adapting the agent vs adapting its tools leads to four types (A1, A2, T1, T2), and the thread argues the best systems combine both approaches.
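The “Budget Tracker” idea in BATS reduces, at minimum, to explicit bookkeeping in the agent loop: spend from a fixed tool-call allowance and commit to an answer before it runs out. A toy harness showing that bookkeeping (an illustrative sketch, not the BATS method itself):

```python
def run_agent(step, budget):
    """step(calls_left) -> ("tool", info) or ("answer", value)."""
    trace = []
    for calls_left in range(budget, 0, -1):
        kind, payload = step(calls_left)   # policy sees its remaining budget
        trace.append(kind)
        if kind == "answer":
            return payload, trace
    return None, trace                     # budget exhausted without an answer

# A toy policy: search while budget is comfortable, then commit to an answer.
def policy(calls_left):
    if calls_left > 8:
        return ("tool", "search")
    return ("answer", 42)

answer, trace = run_agent(policy, budget=10)
assert answer == 42 and trace == ["tool", "tool", "answer"]
```

Surfacing calls_left to the policy is the point: a budget-aware agent can trade extra searching against the risk of being forced to answer blind, which is what the reported 10x budget reduction exploits.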

Long-context efficiency via compression

  • CLaRa (unified RAG compression): At 16x compression, CLaRa-Mistral-7B is reported to surpass a text-based DRO-Mistral-7B on NQ (51.41 vs 51.01 F1) and 2Wiki (47.18 vs 43.65 F1) while processing far less context.

Multimodal reasoning and perception

  • Vision-language synergy reasoning: A method reports up to 7.25% improvement on Gemini and 4.5% on o4-mini over text-only baselines, while text-only self-correction can degrade across rounds; the approach improves consistently each iteration. In fine-tuning, “vision-language synergy training” reports 13.25% on ARC-AGI with Qwen3-8B, higher than text-only fine-tuning (9.75%) and a cited GPT-4o baseline (8.25%).
  • SHARP (single-image 3D): On ScanNet++, SHARP reports 0.071 DISTS vs 0.090 for Gen3C (21% improvement) and LPIPS 0.154 vs 0.227 (32% reduction). It’s also described as running in under 1 second vs ~850 seconds for Gen3C (roughly 1000× speedup).

Products & Launches

Why it matters: The “agent stack” is filling in around UI protocols, sandboxes, memory patterns, and off-the-shelf multi-agent apps.

  • A2UI (Agent-to-User Interface): Google introduces A2UI as a protocol for agent-driven interfaces that enables agents to generate interactive user interfaces; it’s open source. Repo: https://github.com/google/A2UI/.
  • MiniMax M2.1 in Code Arena: M2.1 is now in LM Arena’s Code Arena for live coding evals (planning, scaffolding, debugging, building step-by-step), with Battle Mode voting and results forthcoming.
  • Moondream 3 (local): Moondream 3 adds MLX native Mac support and runs on Mac/Linux/Windows; install via pip install moondream-station.
  • LangAlpha (equity research agents): An AI equity analysis platform that uses LangGraph’s multi-agent system to synthesize market data, news, and financials into reports. Repo: https://github.com/Chen-zexi/LangAlpha.
  • Agent Skills for Context Engineering: A repo framed as a “Meta-Agent” knowledge base with markdown/code skills for context fundamentals, degradation, optimization, multi-agent patterns, memory systems, tool design, and evaluation.
  • FLUX.2 Flash/Turbo availability: The models are now live on Yupp, described as engineered for speed without compromising quality; Yupp access: http://yupp.ai.

Industry Moves

Why it matters: Compute access, infrastructure control, and how teams run agents in sandboxes are becoming as important as model weights.

  • Tencent compute access via Japan data centers: Tencent reportedly cut contracts (~$1.2B) to use most of Datasection’s 15,000 Nvidia Blackwell (B200) processors in Japan; the post frames overseas AI data centers as an attractive option when firms can’t import the latest chips directly.
  • Nvidia acquires SchedMD (Slurm): Nvidia acquired SchedMD, the developers of Slurm, prompting practitioner reactions and discussion of Slurm’s strengths and pain points (CLI args, slow controller, configuration).
  • xAI compute scale: A post claims xAI’s Colossus in Memphis has more compute than all current and planned supercomputing capacity in Britain.

Policy & Regulation

Why it matters: Policy actions are increasingly entangled with compute access, “military affiliation” allegations, and export-control strategy.

  • US lawmakers urge action on DeepSeek and Xiaomi: A Reuters-linked post says US lawmakers urged the Pentagon to add DeepSeek and Xiaomi to a list of firms allegedly aiding China’s military.
  • Export-control intent (historical framing): A thread cites a GAO report claiming US government practice and intent since at least March 2001 has been to keep China’s semiconductor industry two generations behind state of the art, and frames today as “H200 but no EUV” (3.5 generations).

Quick Takes

Why it matters: These smaller signals often preview bigger shifts: trust, content quality, and what people optimize for.

  • Gemini 3 Flash additional signals: Gemini 3 Flash scores 61.6% on WeirdML (Gemini 2.5 Flash 41.9%, Gemini 2.5 Pro 54.0%), and one post notes its code execution times frequently bunch near the 2-minute max.
  • AI short video saturation: A post says YouTube searches for “tsunami footage” in 2025 return “almost every video” as AI-generated, with millions of views each; another notes the top comment is often “ai” with 3000 likes.
  • Chain-of-thought monitorability discussion: Sam Altman links to OpenAI’s chain-of-thought monitorability post, while another thread argues telling models CoT is a “safe space” is a contradiction aware models can detect.
  • “Measure what matters” warning: A thread cautions against hyperfixation on intermediate metrics like lines of code generated or long-running agent time, noting it’s trivial to generate “100s of lines of slop code per minute”.
  • M&A caution: One post warns against DIY-ing legal work in meaningful M&A transactions with AI, even as another claims frontier models are better than the median US M&A attorney and can reduce back-and-forth and catch issues.
Gemini 3 Flash hits 1M-context MRCR as Nemotron-3 and new eval tooling land
21 December 2025
11 minutes read
Gemini 3 Flash posts a standout 1M-context MRCR result while NVIDIA’s Nemotron 3 introduces an open-weight hybrid Mamba/Transformer MoE design. The brief also covers Anthropic’s open-source Bloom misalignment eval generator, ongoing ARC-AGI rules debates, and major policy signals from Japan and chip export-control rhetoric.

Top Stories

1) Gemini 3 Flash pushes long-context performance into the spotlight

Why it matters: Reliable reasoning over very long inputs is becoming a gating capability for agents (docs, codebases, logs). A measurable jump at 1M context also pressures the ecosystem to clarify what architectural/efficiency tradeoffs enable it.

  • MRCR @ 1M context: Gemini 3 Flash hit 90% accuracy on OpenAI’s MRCR benchmark at 1 million context length; the post describes this as state of the art, noting most top models can’t go past 256k context.
  • Pricing context: One post reports $0.50/M input and $3/M output tokens and attributes the price point to “efficient attention”.
  • Additional reported benchmarks: A separate roundup cites 90.4% on GPQA Diamond and 78% on SWE-bench Verified, with Gemini 3 Flash 3× faster than Gemini 2.5 Pro at $0.50 per million input tokens.

2) NVIDIA releases Nemotron 3 and leans into hybrid architectures

Why it matters: If hybrid Mamba/Transformer designs deliver better throughput at similar quality, they could reshape how “frontier-ish” open models are deployed—especially for long sequences.

NVIDIA released the Nemotron 3 series with three sizes: Nano (30B-A3B), Super (100B), and Ultra (500B). As of Dec 19, only Nano had been released as an open-weight model.

Key technical notes on Nemotron 3 Nano:

  • A 52-layer MoE Mamba-Transformer hybrid that interleaves Mamba-2 blocks with sparse MoE feed-forward layers; self-attention appears only in a subset of layers.
  • Each MoE layer has 128 experts, activating 1 shared + 6 routed experts per token.
  • Commentary frames Mamba-2 as a gated-state-space update that scales linearly with sequence length (vs. quadratic attention).
  • The same thread claims strong performance vs similarly sized pure transformers while achieving higher tokens-per-second throughput.
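The shared-plus-routed selection described above can be sketched in a few lines. This is an illustrative router (softmax over the top-k routed experts' scores, shared expert always on), not NVIDIA's actual implementation:

```python
import math

def route_token(router_logits, num_routed=6):
    """Pick top-k routed experts for one token and weight them by softmax.

    Hypothetical sketch of shared+routed MoE routing in the style reported
    for Nemotron 3 Nano (1 shared + 6 routed of 128 experts); the real
    router may bias or normalize scores differently.
    """
    # Indices of the top-k routed experts by router score.
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:num_routed]
    # Softmax over the selected experts' logits only.
    mx = max(router_logits[i] for i in topk)
    exps = [math.exp(router_logits[i] - mx) for i in topk]
    z = sum(exps)
    weights = [e / z for e in exps]
    # The shared expert is always active alongside the routed ones.
    return {"shared": True, "routed": list(zip(topk, weights))}

# 128 router scores for one token; only 7 experts end up active.
choice = route_token([0.1 * i for i in range(128)])
```

Sparsity is the point: each token pays for 7 of 128 experts, which is what keeps active-parameter counts (and tokens-per-second) favorable versus a dense model of similar total size.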

3) Anthropic open-sources Bloom for behavioral misalignment evaluation

Why it matters: Open, reusable safety tooling can help researchers compare models on behaviorally grounded tests (frequency/severity), not just single benchmark scores.

Anthropic released Bloom, an open-source tool for generating behavioral misalignment evals for frontier models. Bloom lets researchers specify a behavior, then measure its frequency and severity across automatically generated scenarios.

More: https://www.anthropic.com/research/bloom

4) ARC-AGI “new Pareto frontier” claim expands into a rules/measurement dispute

Why it matters: ARC-AGI results are being used as a narrative wedge for “progress,” but threads show confusion about what’s allowed (and what the benchmark is really measuring).

  • A widely shared claim: 27.5% on ARC-AGI for $2 (also reported as $2.12), via a tiny vanilla transformer trained in roughly 2–3 hours, open source.
  • Cost details in the thread: the script runs in 3.01 hours, training on a 40GB A100 and inference on an 80GB A100, totaling $2.12 (with an amortization step described in the same post).
  • Rule confusion: one clarification states that for test-time training (TTT) methods, a single test input may be used at test time, but not all test inputs together before any single task is evaluated.
  • Disagreement persists over whether touching eval inputs violates the “don’t see eval tasks at all” spirit of the guidelines referenced in the threads.

5) Japan signals large-scale government investment in “reliable AI”

Why it matters: National AI plans and public funding can materially shape domestic AI ecosystems (labs, deployments, and regulatory posture).

Japan’s Prime Minister Takaichi presented an AI Basic Plan draft; Sakana AI co-founder/COO Ren Ito participated as an expert council member. The PM also stated the government would invest over 1 trillion yen in AI-related policies to promote public-private investment for reliable AI.


Research & Innovation

Evaluation: long-horizon metrics gain attention—and pushback

Why it matters: Agent capability is increasingly discussed in terms of how long it can execute tasks, but multiple threads highlight how brittle the measurement can be.

  • METR on Opus 4.5 uncertainty: METR says its current suite lacks enough long tasks to confidently upper-bound Opus 4.5’s 50%-time horizon, and that the high upper CI bound likely overstates capabilities; updates are underway.
  • “Gaming the METR plot” critique: A blog argues the METR plot influenced 2025 timelines and investment decisions while the 1–4 hour region is driven by just 14 prompts (many about cybersecurity CTFs and ML model training). The same thread suggests post-training on CTF/ML codebases can inflate horizon lengths and warns against overindexing on a logistic success-vs-length model assumption.
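The logistic success-vs-length model these horizon estimates lean on can be made concrete: model success probability as a logistic in log task length, and read off the 50%-time horizon as the length where probability crosses 0.5. A minimal sketch (this parameterization is illustrative, not METR's actual fit):

```python
import math

def p_success(t_minutes, h50, beta=1.0):
    """Logistic success-vs-length model used in METR-style horizon analyses.

    Success probability falls with log task length; h50 is the 50%-time
    horizon and beta the slope. Illustrative form only: METR's published
    fits have their own parameterization and per-model slopes.
    """
    return 1.0 / (1.0 + math.exp(beta * (math.log(t_minutes) - math.log(h50))))

# At the 50%-time horizon the model succeeds half the time by construction.
p = p_success(289, h50=289)  # 289 min ~= the 4h49m figure reported for Opus 4.5
```

The critique above is visible in this form: with only 14 prompts in the 1–4 hour region, the fitted `h50` and `beta` are sensitive to which tasks happen to populate the tail, so small, unrepresentative task sets can move the headline horizon a lot.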

Retrieval: SA-RAG applies “spreading activation” to multi-hop RAG

Why it matters: Multi-hop retrieval remains a failure mode for many RAG pipelines; SA-RAG proposes a training-free module that can improve multi-hop QA even with small open-weight models.

SA-RAG applies spreading activation over a knowledge graph built from text chunks; activation propagates outward from seed entities instead of relying on the LLM to decide iterative fetches. Reported results include:

  • MuSiQue: 67% answer correctness with phi4, vs 45% for naive RAG and 55% for chain-of-thought iterative retrieval.
  • Combined with CoT iterative retrieval: 74% on MuSiQue and 87% on 2WikiMultiHopQA.
  • 25–39% absolute improvement over naive RAG across benchmarks using small open-weight models like phi4/gemma3, with no fine-tuning.

Paper: https://arxiv.org/abs/2512.15922
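The spreading-activation idea can be sketched without any of the paper's machinery: activation starts at query-seeded entities, decays as it propagates outward over the chunk/entity graph, and the top-activated chunks become the retrieval set. A toy sketch (the decay and propagation rules here are illustrative assumptions, not SA-RAG's actual algorithm):

```python
def spread_activation(graph, seeds, decay=0.5, hops=2):
    """Minimal spreading-activation sketch over a chunk/entity graph.

    graph: node -> list of neighbor nodes; seeds: entities found in the
    query. Each hop, activation flows to neighbors scaled by `decay`, so
    multi-hop evidence gets reached without asking the LLM to plan fetches.
    """
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        nxt = {}
        for node, act in frontier.items():
            for nb in graph.get(node, []):
                nxt[nb] = max(nxt.get(nb, 0.0), act * decay)
        for nb, act in nxt.items():
            if act > activation.get(nb, 0.0):
                activation[nb] = act
        frontier = nxt
    return activation  # rank chunks by activation and retrieve the top ones

# Toy graph: an entity linked to a chunk, which mentions another entity.
g = {"alice": ["doc1"], "doc1": ["bob"], "bob": ["doc2"]}
scores = spread_activation(g, ["alice"])
```

Because the propagation is training-free graph traversal, it composes with small open-weight readers, which is consistent with the phi4/gemma3 results reported above.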

Training recipes for reasoning: CMU analysis on pre-training, mid-training, and RL

Why it matters: “Just add RL” is not a stable recipe. This work breaks down when RL helps, and emphasizes mid-training and process-aware rewards.

CMU researchers analyze how pre-training, mid-training, and RL contribute to reasoning gains. Key claims:

  • RL helps at the frontier: It improves reasoning when tasks are at the edge of model capability—too easy or too unfamiliar yields little benefit.
  • Mid-training matters: A structured phase between pre-training and RL gives bigger gains than RL alone under the same compute budget.
  • Generalization needs some pre-training exposure: ~1% pre-training exposure is described as enough for RL to transfer to new contexts.
  • Process-aware rewards: Step-level feedback reduces reward hacking and improves faithfulness; combine dense step feedback with sparse answer rewards.

Paper/GitHub: https://arxiv.org/abs/2512.07783 and https://github.com/Interplay-LM-Reasoning/Interplay-LM-Reasoning
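The process-aware reward recommendation (dense step feedback plus a sparse answer reward) can be made concrete with a toy combiner. The 0.3/0.7 weighting and the averaging scheme below are illustrative assumptions, not the paper's recipe:

```python
def combined_reward(step_scores, answer_correct, step_weight=0.3):
    """Blend dense per-step process scores with a sparse outcome reward.

    Illustrative sketch of 'dense step feedback + sparse answer reward':
    average the step-level scores (e.g., from a process reward model),
    then mix in the binary answer reward. Weights are assumptions.
    """
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = 1.0 if answer_correct else 0.0
    return step_weight * process + (1.0 - step_weight) * outcome

# A trace with one flawed step still earns less than a clean correct trace,
# which is the anti-reward-hacking pressure the step term provides.
r = combined_reward([1.0, 1.0, 0.0], answer_correct=True)
```

The design intuition: a pure outcome reward lets the policy reach correct answers via unfaithful shortcuts, while the step term penalizes those trajectories even when the final answer checks out.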

End-to-end interpretability: Activation Oracles and Predictive Concept Decoders

Why it matters: Approaches that train models to explain their own activations aim to make interpretability scale with model size (and reduce reliance on ad hoc prompting).

  • Activation Oracles: A new paper trains LLMs to decode their own neural activations and answer questions about them in natural language, reporting surprising generalization—e.g., uncovering misaligned goals in fine-tuned models without specific training for that.
  • PCD framing: Posts describe “end-to-end interpretability” as directly training models to map activations to explanations. Neel Nanda calls it a “wild idea” that worked surprisingly well, especially Activation Oracles. Video link: https://youtu.be/Aroazwb_QW8.
  • Follow-on commentary: Jacob Steinhardt (senior author on LatentQA and PCD) adds perspective on this space in a follow-up thread.

AIxBio: MultiCell models embryo-scale cell dynamics

Why it matters: Predicting tissue-level development from single-cell dynamics is a long-standing challenge; this work targets cell-by-cell forecasting from 4D microscopy.

MultiCell represents a developing embryo as a dual graph combining cells as moving points and as a junction network, learning dynamics from geometry and connectivity. On 4D light-sheet movies of Drosophila gastrulation (~5,000 cells), it predicts junction loss, rearrangements, and divisions, along with their timing, with “high accuracy” at single-cell resolution.

Systems: DistCA speedup claims draw skepticism

Why it matters: Training-system speedups can shift the economics of scaling, but claims are scrutinized when absolute metrics aren’t provided.

Hao AI Lab released DistCA (built on Megatron-LM), claiming a 1.35× speedup over SOTA training systems and over Megatron-LM across model sizes and datasets. A reply calls the presentation “sus” due to the lack of non-relative performance metrics.


Products & Launches

Claude Code adds LSP-based “code intelligence”

Why it matters: Better navigation (go-to-definition, references) reduces friction for agent-assisted development and review.

Claude Code 2.0.74 adds an LSP tool for go-to-definition, find references, and hover docs. The same changelog lists improved /context visualization and additional terminal setup support (Kitty, Alacritty, Zed, Warp). Changelog link: https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md#2074.

Claude browser control shows up as a Chrome extension demo

Why it matters: “Computer use” is moving from research demos into everyday tooling that can navigate web UIs.

A post highlights a Claude Chrome Extension that controls a browser, with a demo video. Another post notes a similar capability was a focus for Adept, which raised $350M to train AI to use existing software and APIs.

Amazon’s Nova 2 family + browser automation agents

Why it matters: Big providers are bundling model families with customization and agentic UI automation.

Amazon released the Nova 2 family (Pro, Omni, Lite, Sonic) with “competitive multimodal reasoning/generation,” plus Nova Forge for mixing customer data with Amazon checkpoints for custom training. Nova Act introduces browser automation agents (navigate, fill forms, extract data).

Image tooling: speed and layer-native editing keep improving

Why it matters: Visual generation is turning into workflow tooling (fast iteration, editable structure) rather than single-shot images.

  • Flux 2 Flash & Turbo went live on fal, described as timestep-distilled models with sub-1-second generation.
  • Qwen Image Layered is live on fal and supports “Photoshop-grade” physically isolated RGBA layers with explicit layer control.
  • ComfyUI supports Qwen Image Layered “day 0”.

Local/offline and Mac-friendly tooling

Why it matters: Lower friction for local inference expands who can experiment and deploy.

  • Moondream 3 adds MLX native Mac support and runs on Mac/Linux/Windows; install via pip install moondream-station.
  • Chatterbox-turbo supports real-time audio streaming (and voice cloning) on MLX-Audio.

Agents ecosystem: harnesses, memory, and deployment patterns

Why it matters: Agent builders are standardizing around shared infrastructure: harnesses, persistent memory, checkpointing, and sandboxed execution.

  • Agent Harness: LangChain’s team promotes an open-source, model-agnostic “agent harness” for “deep agents”.
  • Persistent memory: zkStash is a TypeScript SDK for persistent memory in agents, integrating with LangChain via MCP tools or middleware and using Zod schemas.
  • Gemini 3 agent examples: Google’s developer blog lists agentic tools (ADK, Agno, Browser Use, CAMEL, Letta, mem0) for Gemini 3 projects.

Industry Moves

AI chips and geopolitics: “Manhattan Project”-style framing

Why it matters: Export controls and supply-chain constraints remain central to AI compute access.

A Reuters link is shared describing how China built a “Manhattan Project” to rival the West in AI chips. The quoted commentary argues the U.S. and allies should control not just end products but critical components and supply chains, and update export-control laws to account for alleged tech theft tactics.

Meta/FAIR’s influence reframed around infrastructure

Why it matters: Tooling and open releases can shape the whole field, beyond any one model.

A post argues “Meta gave us PyTorch,” claiming that even without papers like Llama/DINO/SAM, Meta might still be the most influential AI player. Another thread defends FAIR’s impact, noting Llama as an early high-performance open-source model line that “filled a void”.

Hiring signals: post-training and RL-at-scale

Why it matters: Hiring priorities often reveal where labs expect the next iteration gains.

  • Nous Research is hiring for a post-training team across areas like code agents, instruction following/RLHF, multimodality, and data synthesis infra; fully remote.
  • Databricks is hiring interns for empirical RL at scale on non-verifiable tasks and for tooling that helps people specify desired AI behaviors (e.g., via evals).

Agentic coding workflows: “review becomes the bottleneck”

Why it matters: As code generation scales, review/testing load becomes a constraint—and a product opportunity.

  • One thread argues code review AI tools may have a larger TAM than codegen as “vibe coding” increases review load.
  • Martin Casado: “Previously we were limited by how quickly we could write code, but now the bottleneck is how quickly we can review it”.

Policy & Regulation

Japan’s AI Basic Plan + funding push

Why it matters: Large public investment can accelerate domestic AI deployment and shape “reliable AI” policy priorities.

Japan’s Prime Minister presented an AI Basic Plan draft and discussed 1T+ yen investment for AI-related policies to promote public-private investment for “reliable AI”.

Export controls and research-security rhetoric intensifies

Why it matters: Policy language can translate into constraints on collaboration, hiring, and supply-chain access.

A quoted statement argues for controlling critical AI chip components/supply chains and updating export controls based on claims about technology theft strategies involving researchers.


Quick Takes

Why it matters: Smaller product changes and social signals often become default assumptions—about what AI can do, and what users will trust.

  • PostTrainBench launches to measure how well agents (e.g., Claude Code) can post-train base LLMs; positioned as an indicator for AI R&D automation.
  • Mistral OCR 3 claims new benchmarks in accuracy/efficiency, while critics say the benchmarks are unspecified and comparisons to frontier open OCR models are missing or inaccessible.
  • tldraw Fairies: multi-agent collaboration on an infinite canvas (December-only), with a one-time $25 purchase offer ending EOY.
  • AI misinformation signal: one post says searching “tsunami footage” on YouTube returns mostly AI videos with millions of views.
  • Metrics caution: “Measure what matters”—a thread warns against proxy metrics like lines of code generated or agent runtime as stand-ins for productivity/agent quality.
  • MoE kernel tuning: a developer reports an MoE activation kernel faster than their Triton version and slightly faster than vLLM’s CUDA version, but only for a specific model shape/dtype.
  • Training precision/tooling debate: a thread lists FP8 libraries (Transformer Engine, torchao, MS-AMP), claims MS-AMP “doesn’t work” now, and states a preference for torchao over TE.
Long-task evals for Claude Opus 4.5, Gemini 3 Flash product upgrades, and a surge of open releases
20 December 2025
8 minutes read
OpenAI Newsroom
LlamaIndex 🦙
vLLM
+32
Key developments include METR’s new long-task estimate for Claude Opus 4.5, Google’s end-of-year Gemini Drops led by Gemini 3 Flash, and major open releases from Xiaomi and Qwen. We also cover new agent workflow primitives like Codex “skills,” plus notable partnerships and policy signals.

Top Stories

1) METR publishes a new long-task estimate for Claude Opus 4.5

Why it matters: Long-horizon task capability is increasingly central to agent utility, but measurement is fragile when the task suite has few very-long tasks.

METR estimates Claude Opus 4.5 has a 50%-time horizon of ~4h 49m on its task suite (95% CI: 1h 49m to 20h 25m)—its highest published to date. METR also reports Opus 4.5’s 80%-time horizon is 27 minutes, similar to past models and below GPT-5.1-Codex-Max’s 32 minutes, attributing the gap to Opus differentially succeeding on longer tasks.

METR cautions the high upper CI bound likely reflects insufficient long tasks to confidently upper-bound performance, and says it is updating the suite; based on experience, it would be surprised if Opus had a 20+ hour 50%-time horizon.

2) Google ships the “final Gemini Drops of 2025,” led by Gemini 3 Flash + Gemini App upgrades

Why it matters: Google is pairing a new default-speed model with product “papercuts” that reduce friction in daily use, pushing capability upgrades directly into consumer workflows.

Google highlights Gemini 3 Flash as a major upgrade over 2.5 Flash and says it’s available globally. A separate Google post frames Gemini 3 Flash as offering frontier-level performance at 3× the speed of 2.5 Pro.

Alongside the model, the Gemini App updates include:

  • Image edit targeting via circling/drawing/annotating directly on images
  • Google Maps integration for local results with photos/ratings and other info inside chat
  • Adding notebooks as sources for document-grounded responses
  • Gemini Live improvements: fewer mid-sentence cutoffs and mic mute to reduce interruptions
  • Expanded support to 65+ languages

Hub: http://gemini.google/gemini-drops.

3) Xiaomi releases MiMo-V2-Flash (309B open-weights reasoning model)

Why it matters: Strong open-weights releases at very large scale continue to broaden who can deploy frontier-ish reasoning systems, with cost/latency and hallucination tradeoffs now more clearly quantified.

Artificial Analysis reports Xiaomi launched MiMo-V2-Flash, a 309B open-weights reasoning model (with 15B active at inference) under an MIT license, scoring 66 on the Artificial Analysis Intelligence Index.

Notable reported metrics:

  • Strengths: 95% on τ²-Bench Telecom (category leader) and 96% on AIME 2025
  • Pricing: $0.10/M input and $0.30/M output; full eval suite cost $53
  • Caveats: generated ~150M reasoning tokens during the eval suite (latency implications) and scored -62 on the AA-Omniscience Index (high hallucination rate)

4) Qwen open-sources “Qwen-Image-Layered” for native image decomposition

Why it matters: Layer-native image manipulation is moving closer to “graphics editor” workflows (isolated RGBA layers) rather than single-shot generation.

Alibaba Qwen released Qwen-Image-Layered, a fully open-sourced model for native image decomposition into physically isolated RGBA layers with editability. It supports prompt-controlled structure (explicitly specify 3–10 layers) and “infinite decomposition” (layers within layers).

Resources: model on Hugging Face, GitHub, blog, and a technical report. The team also says the model was optimized for speed via @PrunaAI.
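Isolated RGBA layers recombine into a flat image with the standard “over” compositing operator. A per-pixel sketch of that operator (how layered outputs like these are typically flattened; the model's own compositing pipeline is not specified in the posts):

```python
def over(top, bottom):
    """Standard 'over' alpha compositing of two RGBA pixels (values in 0..1)."""
    r1, g1, b1, a1 = top
    r2, g2, b2, a2 = bottom
    a = a1 + a2 * (1 - a1)  # combined coverage
    if a == 0:
        return (0.0, 0.0, 0.0, 0.0)
    blend = lambda c1, c2: (c1 * a1 + c2 * a2 * (1 - a1)) / a
    return (blend(r1, r2), blend(g1, g2), blend(b1, b2), a)

def flatten(layers):
    """Composite a back-to-front list of RGBA layers (single pixels here)."""
    out = (0.0, 0.0, 0.0, 0.0)
    for layer in layers:
        out = over(layer, out)
    return out

# Opaque red base layer with a half-transparent blue layer on top.
px = flatten([(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 0.5)])
```

Physically isolated layers matter precisely because this operation is invertible in practice: you can edit or delete one layer and re-flatten, rather than inpainting around baked-in pixels.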

5) OpenAI Codex adds “skills” as reusable task bundles

Why it matters: Standardized, shareable “skills” shift agent workflows from ad-hoc prompting toward composable, versioned automation.

OpenAI says Codex now supports skills—reusable bundles of instructions/scripts/resources—callable directly via $.skill-name or auto-selected from a prompt. Skills follow the agentskills.io folder standard (SKILL.md + optional assets/scripts).

Users can install skills per-user (~/.codex/skills) or per-repo (repo_path/.codex/skills), and Codex ships with system skills like plan, skill-creator, and skill-installer. Docs: https://developers.openai.com/codex/skills.
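The folder convention is simple enough to scaffold by hand. A hypothetical helper (the directory names follow the SKILL.md + assets/scripts layout described above; the skill name and contents are made up):

```python
import tempfile
from pathlib import Path

def scaffold_skill(base_dir, name, instructions):
    """Create a skill folder in the SKILL.md + assets/scripts layout.

    Illustrative scaffold only; real skills typically carry more metadata
    in SKILL.md than a title and a single instruction paragraph.
    """
    skill = Path(base_dir) / name
    (skill / "assets").mkdir(parents=True, exist_ok=True)
    (skill / "scripts").mkdir(exist_ok=True)
    (skill / "SKILL.md").write_text(f"# {name}\n\n{instructions}\n")
    return skill

# Scaffold into a temp dir here; per the docs, real installs go under
# ~/.codex/skills (per-user) or repo_path/.codex/skills (per-repo).
path = scaffold_skill(tempfile.mkdtemp(), "changelog-writer",
                      "Summarize merged PRs into CHANGELOG entries.")
```

Because a skill is just a folder, it versions cleanly in git alongside the repo it serves, which is what makes the per-repo install path useful for teams.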


Research & Innovation

Why it matters: This week’s research emphasizes (1) interpretability tools that scale with model size, (2) agentic RL systems, and (3) efficiency improvements that raise practical throughput ceilings.

Interpretability tooling at scale: Gemma Scope 2

Google DeepMind introduced Gemma Scope 2, positioned as its largest open release of interpretability tooling, totaling over 1T parameters and functioning as a “microscope” for analyzing Gemma 3 internal activations and chat behaviors.

Neel Nanda highlights the release of sparse autoencoders and transcoders on every layer of every Gemma 3 model (270M to 27B, base and chat). DeepMind frames this as helping researchers trace internal reasoning, debug behaviors, and identify risks.

Formal math agents: Seed-Prover 1.5

ByteDance-Seed’s Seed-Prover 1.5 is described as an agentic Lean prover with SOTA performance in formal math. Reported results include 87.9% on PutnamBench (580 solved) and 11/12 Putnam 2025 problems solved within ≤9 hours (max budget 40 H100-days/problem). Repo: https://github.com/ByteDance-Seed/Seed-Prover/tree/main/SeedProver-1.5.

Training system speedups: DistCA

Hao AI Lab announced DistCA, built on Megatron-LM, with a reported 1.35× speedup vs SOTA training systems and vs Megatron-LM across model sizes/datasets. Paper: https://arxiv.org/abs/2510.18121.

Reward discovery for RL via bilevel optimization

A new framework is described as automatically discovering reward functions via bilevel optimization and regret minimization, without expert demonstrations or human feedback. Reported outcomes include >60% energy reductions in data center energy management vs 21–52% for baselines, and PPO succeeding in UAV trajectory tracking where hand-designed rewards “failed entirely”.

ARC-AGI: new “Pareto frontier” claims trigger measurement debates

A thread claims a new ARC-AGI Pareto frontier at 27.5% for $2, using a vanilla transformer trained in 2 hours, open source. The discussion includes disputes about “training on test,” distinctions between using test inputs vs labels, and broader questions about what the benchmark is actually measuring.


Products & Launches

Why it matters: Consumer and developer AI products are converging on “workflow primitives” (editing, tracing, tiers, and repeatable skills) rather than one-off chat experiences.

ChatGPT: personalization + email “writing blocks”

OpenAI added ChatGPT personalization controls to adjust traits like warmth, enthusiasm, and emoji use, available in “Personalization” settings.

Separately, ChatGPT shipped writing blocks for emails: edit/format in chat, highlight for changes with accept/reject, then open in your email client to send.

Gemini App: “study partner” + holiday cards

Google shows a Gemini App flow where users upload an audio file explaining a topic, and Gemini 3 Flash identifies missed info and creates a quiz. Gemini also launched a holiday card generator: upload a photo, choose a festive style, and add a greeting using Nano Banana Pro templates.

LlamaParse v2: simpler tiers + lower costs

LlamaIndex released LlamaParse v2, introducing four tiers (Fast, Cost Effective, Agentic, Agentic Plus) and claiming up to 50% cost reduction; it also emphasizes improved accuracy and reduced hallucinations on complex multimodal documents. The “Cost Effective” mode is cited at ≤0.3¢ per page and can parse charts/diagrams into coherent tables.

Elicit: more defensible systematic reviews

Elicit added strict screening criteria to automatically exclude papers failing critical criteria (with manual override) and expanded report generation to 80 papers (up from 40). These features are live for Pro, Teams, and Enterprise users.

LangSmith tracing for Claude Code

LangChain announced a Claude Code → LangSmith integration to view every LLM and tool call for observability. Docs: https://docs.langchain.com/langsmith/trace-claude-code.


Industry Moves

Why it matters: Distribution, partnerships, and infrastructure-scale usage are now first-order signals for which models and products will dominate real-world workflows.

Disney signs an exclusive deal for OpenAI’s Sora app

Disney reportedly struck a three-year exclusive agreement allowing Sora to generate 30-second clips with 200+ Disney characters, with some fan creations streaming on Disney+.

DOE “Genesis Mission” partnerships continue expanding

OpenAI says it’s expanding collaboration with the U.S. Department of Energy on AI and advanced computing, building on work with national labs and advancing the Genesis Mission. Google DeepMind also says it is supporting the Genesis Mission by providing national labs accelerated access to frontier AI models and agentic tools, starting with AI co-scientist.

Usage at scale: ByteDance Cloud token volume

ByteDance Cloud (Volcano Cloud) is reported to process 50 trillion tokens per day, described as approaching Google’s 1,300 trillion tokens monthly.

Cursor + Graphite

Cursor announced that Graphite is joining Cursor.

LLM adoption: US polling and shifting drivers

Epoch AI reports a survey of 5,660 Americans: a majority use AI weekly, with 35% using ChatGPT, 24% Gemini, and 13% Meta AI. It also reports fewer than 10% paid for a subscription, with OpenAI leading at 4.6%.


Policy & Regulation

Why it matters: Frontier evaluation is becoming a governance tool, and the institutions supporting research (and its transparency) are under visible funding pressure.

UK AI Security Institute releases a “Frontier AI Trends Report”

The UK AI Security Institute released its first Frontier AI Trends Report, reporting evaluation results on 30+ frontier models from the past two years and noting rapid progress in chemistry/biology, cyber capabilities, autonomy, and more. Link: https://www.aisi.gov.uk/frontier-ai-trends-report.

OpenReview funding appeals

OpenReview reports that in 2025 it supported 1,300+ conferences/workshops, served 3.3M active monthly users, and handled 278,000+ submissions, while remaining underfunded. Researchers are urging donations: https://openreview.net/donate.


Quick Takes

Why it matters: These smaller updates often become baseline capabilities: faster inference, cheaper retrieval, and better agent ergonomics.

  • vLLM + NVIDIA Blackwell throughput: vLLM reports up to 33% higher maximum throughput per Blackwell GPU after one month of collaboration with NVIDIA.
  • vLLM-Omni diffusion acceleration: diffusion cache backends (TeaCache/Cache-DiT) report 1.91× and 1.85× speedups on Qwen-Image (H200), and 2.38× on Qwen-Image-Edit with Cache-DiT.
  • Milvus AISAQ vector index: a disk-based index reports 3,200× memory reduction (32 GB to 10 MB) for billion-scale vector search by storing data on SSD with optimized layouts.
  • Factory on context compression: Factory evaluated compaction strategies on 36,000+ messages from real agentic software development sessions and argues context compression is a prerequisite for long-running agents.
  • NitroGen gaming foundation model: an open-source model trained via behavior cloning on 40K+ hours of action-labeled gameplay across 1,000+ games, with a universal simulator for cross-game generalization.
GPT-5.2-Codex ships as OpenAI formalizes CoT monitorability and Google pushes open on-device Gemma models
19 December 2025
8 minutes read
Jukan
机器之心 JIQIZHIXIN
François Chollet
+34
OpenAI shipped GPT-5.2-Codex and paired it with a more cautious cybersecurity rollout, while also releasing a new chain-of-thought monitorability evaluation suite. Google pushed open-weight models for on-device function calling and multimodal encoder-decoder work, and openness/document intelligence advanced via MBZUAI’s K2-V2 and Mistral OCR 3.

Top Stories

Why it matters: This cycle pairs a major step-change in AI coding agents with sharper safety/evaluation instrumentation and a continued shift toward open-weight, on-device models.

1) OpenAI launches GPT-5.2-Codex (agentic coding + terminal use)

OpenAI introduced GPT-5.2-Codex, positioning it as its best agentic coding model for complex, real-world software engineering, citing native compaction, stronger long-context understanding, and improved tool-calling. OpenAI describes it as trained specifically for agentic coding and terminal use and reports state-of-the-art results on SWE-Bench Pro and Terminal-Bench 2.0.

Security is a core theme of the rollout: OpenAI says a researcher using GPT-5.1-Codex-Max with Codex CLI found and responsibly disclosed a React vulnerability that could lead to source code exposure. OpenAI also says GPT-5.2-Codex is more cyber-capable than its predecessor—benefiting defenders while raising dual-use risks that require careful deployment.

Availability: GPT-5.2-Codex is available today in Codex for all paid ChatGPT users, with API access coming soon, and OpenAI is piloting invite-only trusted access to frontier cyber capabilities for vetted defensive teams.

2) OpenAI publishes a chain-of-thought monitorability evaluation suite

OpenAI released a framework and evaluation suite to measure chain-of-thought (CoT) monitorability, spanning 13 evaluations across 24 environments, intended to detect when models verbalize targeted aspects of their internal reasoning.

Key reported findings include:

  • Monitoring CoT is described as far more effective than monitoring actions or final answers; longer CoTs make issues easier to spot.
  • Frontier RL “doesn’t seem to wreck monitorability” and can help early reasoning steps, but OpenAI notes a tradeoff: smaller models run with higher reasoning effort can be easier to monitor at similar capability, at the cost of extra inference compute (a “monitorability tax”).
  • The monitor’s access and capability matter: stronger monitors that can read CoTs and use more test-time compute improve quickly; post-hoc follow-ups can surface previously unspoken thoughts.

OpenAI frames CoT monitoring as complementary to mechanistic interpretability, and plans to expand evaluations to inform future modeling and data decisions.

3) Google releases new open-weight Gemma models for on-device agents and multimodal encoder-decoder work

Google released two new open-weight Gemma models—FunctionGemma and T5Gemma 2—positioning them as optimized for on-device agentic actions and multimodal applications.

  • FunctionGemma (270M): fine-tuned for function calling, designed as a base for local agents that translate natural language into executable API actions. Google and ecosystem posts say it can run on-device (including phones) and is designed for task specialization via fine-tuning. It’s also available via Ollama (ollama pull functiongemma).
  • T5Gemma 2: an encoder-decoder model line built on Gemma 3 in compact sizes (270M-270M, 1B-1B, 4B-4B), described as multimodal, long-context, and heavily multilingual (140+ languages).
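A function-calling model's output still needs validation and dispatch on the device. A generic sketch of that plumbing (the tool registry, JSON call format, and `set_alarm` tool here are illustrative assumptions, not FunctionGemma's actual interface):

```python
import json

# Hypothetical tool registry: a local agent exposes callable actions with
# simple parameter schemas, and a function-calling tuned model is expected
# to emit a JSON call naming one of them.
TOOLS = {
    "set_alarm": {
        "params": {"time": str},
        "fn": lambda time: f"alarm set for {time}",
    },
}

def dispatch(model_output):
    """Validate a model-emitted function call and execute it (illustrative)."""
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]          # unknown tool names raise KeyError
    args = call["arguments"]
    for key, typ in tool["params"].items():
        if not isinstance(args.get(key), typ):
            raise ValueError(f"bad argument: {key}")
    return tool["fn"](**args)

# What the model would emit for "wake me at seven":
result = dispatch('{"name": "set_alarm", "arguments": {"time": "07:00"}}')
```

Keeping validation outside the model is the point of the pattern: a 270M model can be wrong or malformed often enough that the dispatcher, not the model, must be the safety boundary for executing actions.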

4) MBZUAI’s K2-V2 pushes “full openness” with competitive 70B reasoning performance

MBZUAI’s Institute of Foundation Models released K2-V2, a 70B reasoning model that Artificial Analysis says is tied for #1 on its Openness Index and is the first UAE model on its leaderboards. The post emphasizes K2-V2’s openness: beyond weights, it provides access to pre- and post-training data and publishes training methodology and code under a permissive Apache license.

On performance, Artificial Analysis reports an Intelligence Index of 46 in High reasoning mode (with ~130M tokens used to complete the Intelligence Index), and highlights instruction-following strength (60% on IFBench).

5) Mistral OCR 3 targets enterprise document intelligence

Mistral announced Mistral OCR 3, claiming new benchmarks in accuracy and efficiency, outperforming enterprise document processing solutions and AI-native OCR. A Mistral post says improvements focus on handwritten content, low-quality scans, and complex tables & forms common in enterprise documents.


Research & Innovation

Why it matters: Research is concentrating on making models and agents more scalable and controllable through systems work (MoE speed/memory), better long-horizon memory, and interpretability approaches that operate directly on activations.

Faster MoE training: SonicMoE

SonicMoE is presented as a fast MoE implementation optimized for NVIDIA Hopper GPUs, reducing activation memory by 45% and running 1.86× faster on H100 than previous SOTA. A deeper explanation claims ~2× faster MoE training with ~2× less memory via (1) a mathematical rewrite of the MoE backward pass, (2) fusing gather with grouped GEMM, and (3) a bitonic top-k routing algorithm reported as 20–30× faster than PyTorch top-k for small k. Paper: https://arxiv.org/abs/2512.14080.

Long-horizon agents with constant memory: MEM1

MEM1 is an RL framework described as unifying memory and reasoning by training agents to maintain constant memory across multi-turn tasks via compact internal state updates, discarding prior observations and actions each turn. Reported results include a 3.5× performance gain and 3.7× memory reduction on 16-objective multi-hop QA (vs Qwen2.5-14B-Instruct), and better token efficiency on WebShop navigation vs a larger baseline agent.
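The constant-memory loop can be sketched with stand-in functions for the learned calls (illustrative only; MEM1 trains the summarize/act behavior with RL rather than using hand-written rules like these):

```python
def run_agent(task_turns, summarize, act, max_state_len=200):
    """Constant-memory agent loop in the spirit of MEM1.

    Each turn the agent folds the new observation into a compact internal
    state and discards the raw history, so memory stays bounded no matter
    how many turns the task runs. `summarize` and `act` stand in for
    learned model calls.
    """
    state, action = "", None
    for obs in task_turns:
        state = summarize(state, obs)[:max_state_len]  # compact update; old obs dropped
        action = act(state)
    return state, action

# Toy stand-ins: keep only the most recent characters of the running state.
summ = lambda state, obs: (state + " " + obs).strip()[-60:]
act = lambda state: f"answer based on: {state[-20:]}"

# 50 turns of observations, yet the state never grows past the cap.
state, action = run_agent([f"obs{i}" for i in range(50)], summ, act)
```

Contrast with the usual pattern of appending every observation to the prompt: there, context (and cost) grows linearly with turns, which is the failure mode the reported 3.7× memory reduction targets.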

Interpretability via activation-level decoders

  • Predictive Concept Decoders (PCD) (Transluce): an encoder-decoder that reads activations through a sparse bottleneck, trained to answer questions about model behavior; Transluce claims PCDs can verbalize behaviors the LM itself struggles to verbalize (e.g., detecting a jailbroken harmful output) and describe injected steering vectors more often than prompting baselines. Paper/blog/demo links are provided by Transluce.
  • Activation Oracles: a paper describing LLMs trained to decode their own activations and answer questions about them, claiming generalization such as uncovering misaligned goals in fine-tuned models without being trained specifically for that outcome.

AR → diffusion adaptation without starting from scratch: NBDiff

Researchers from PKU and Huawei describe NBDiff, a method for gradually adapting autoregressive LLMs to block-diffusion models while aiming to preserve AR capabilities. They report NBDiff-7B-Instruct scoring a 78.8 average vs a base model average of 64.3, arguing AR→diffusion adaptation can work without training diffusion models from scratch. Paper/code: https://arxiv.org/abs/2512.06776 and https://github.com/YuchuanTian/NBDiff.


Products & Launches

Why it matters: New “agent surfaces” are appearing inside IDEs and CLIs, while document and media tooling keeps getting packaged into deployable, tiered products.

Agent Skills becomes a cross-tool standard (Claude, VS Code, Stirrup)

  • Anthropic’s Skills are now available on Team and Enterprise plans and are being made easier to deploy and discover.
  • VS Code announced support for Agent Skills as an open standard: “Create skills once, use them everywhere”.
  • Artificial Analysis added Agent Skills support to Stirrup, describing Skills as folders of instructions/scripts/resources (often markdown) that agents load on demand.

Claude Code expands “agentic dev” workflows

  • Claude Code now supports web browsing, enabling background agents that track and report items of interest (example: AI-related posts on X).
  • A new Claude Code Chrome extension lets Claude test code directly in the browser and see client-side errors via console logs; users can run /chrome in the latest Claude Code to activate it.

Document ingestion and OCR tooling

  • LlamaParse v2 introduces four fixed tiers (Fast, Cost Effective, Agentic, Agentic Plus) and claims up to 50% cost reduction, along with versioned parsing and reduced hallucinations (especially for complex multimodal documents).
  • Mistral OCR 3 emphasizes handling handwriting, low-quality scans, and complex tables/forms.

Local/distributed ML tooling updates

  • MLX adds a distributed backend (JACCL) using RDMA over Thunderbolt 5 for low-latency communication across multiple Macs, and adds CUDA install support (pip install mlx[cuda13]) for x86 and arm.
  • mlx-lm adds tensor-parallel inference using the low-latency JACCL backend and updates to support Transformers v5.

Industry Moves

Why it matters: The market is splitting into (a) specialized agent products and (b) infrastructure plays (data, retrieval, and evaluation). Hiring and new labs remain leading indicators of 2026 strategy.

OpenAI expands distribution and institutional adoption

  • OpenAI reportedly sold 700K+ ChatGPT licenses to ~35 US public universities, with 14M+ uses in September (Bloomberg via Techmeme).
  • OpenAI also launched Pinned Chats, rolling out across iOS, Android, and web.

New labs, funding, and executive moves

  • Figure AI CEO Brett Adcock is launching a new AI lab called Hark, funded by $100M of his personal capital, aiming to build “human-centric AI” while he remains CEO of Figure.
  • Shunyu Yao (姚顺雨), described as a key contributor to OpenAI’s Deep Research and CUA, was appointed Chief AI Scientist at Tencent.
  • A post says Yann LeCun will launch Advanced Machine Intelligence Labs in January as executive chair and that fundraising is in early stages (reported as €500m at a €3bn valuation, subject to change).

Government science partnerships (Genesis Mission)

  • OpenAI and the U.S. Department of Energy are expanding collaboration on AI and advanced computing, building on work with national labs and advancing the Genesis Mission.
  • Anthropic says it is providing Claude to the DOE ecosystem along with a dedicated engineering team, aiming to accelerate discovery across energy, biosecurity, and basic research.
  • Google DeepMind says it is supporting DOE’s Genesis Mission by providing national labs access to AI tools.

Policy & Regulation

Why it matters: “Governance” is moving from principles to concrete artifacts: model behavior specs, content verification mechanisms, and export-control constraints.

OpenAI updates its Model Spec (intended behavior)

OpenAI updated the Model Spec, described as explicit rules, priorities, and tradeoffs for how models are intended to behave, including a changelog and “teen protections”. The spec is published at https://model-spec.openai.com/2025-12-18.html.

Media provenance features expand in consumer apps

Google added Gemini app support for verifying whether images/videos were generated or edited with Google AI by scanning for the imperceptible SynthID watermark, including identifying specific audio/visual segments and time ranges.

Semiconductor export-control dynamics (EUV)

A post citing Bernstein suggests that if China succeeds in developing EUV lithography, it could catalyze the U.S. to ease export controls and allow ASML to sell EUV systems to China.


Quick Takes

Why it matters: These smaller signals often become tomorrow’s defaults—especially around benchmarks, open-source infrastructure, and agent evaluation.

  • Search Arena: OpenAI’s GPT-5.2-Search ranks #2 (1211) and xAI’s Grok-4.1-Fast-Search ranks #4 (1185), both debuting ahead of their predecessors.
  • Text leaderboard: GPT-5.2 enters at #17 (1439), with best performance reported in Creative Writing, Hard Prompts, and Longer Queries.
  • Arena transparency: LMArena open-sourced Arena-Rank, the paired-comparison ranking package used to compute its leaderboards (Bradley–Terry variants, confidence intervals).
  • vLLM serving: community results for wide expert-parallel MoE inference on multi-node H200 report sustained ~2.2k tokens/s per GPU.
  • Keras 3.13: adds LiteRT export, GPTQ quantization support, and Adaptive Pooling layers.
  • OpenReview funding: OpenReview is described as underfunded despite supporting 1,300+ conferences/workshops and handling 278,000+ submissions in 2025.
  • Benchmarks & skepticism: posts claim Gemini 3 Flash scores higher than GPT-5.2 on SWE-Bench Verified, alongside calls for “new benchmarks”.
Gemini 3 Flash rolls out everywhere as voice agents, evals, and app ecosystems accelerate
18 December 2025
8 minutes read
Gemini 3 Flash becomes a new default across Google’s products and quickly spreads through developer tools, while xAI launches Grok Voice Agent API with strong third-party audio reasoning results. The week also brings a notable GPT-5 proof claim, a METR benchmark correction affecting Claude Sonnet 4.5, and OpenAI’s new ChatGPT app submission pipeline.

Top Stories

Why it matters: This cycle is dominated by a new “fast-but-frontier” model shipping broadly into end-user products and developer workflows, alongside stronger voice-agent competition and renewed scrutiny of how we evaluate long-horizon and safety-relevant behavior.

1) Gemini 3 Flash rolls out broadly as Google’s new speed-focused default

Google positions Gemini 3 Flash as bringing Pro-grade reasoning to Flash-level latency, “pushing out the Pareto frontier of efficiency vs. intelligence” and unlocking near real-time applications that still require complex thought. It’s available in the API and rolling out as the default model in AI Mode in Search and the Gemini app globally.

Impact: Shipping a fast model as default across search + consumer app + API shifts “model choice” from a developer decision to an ecosystem baseline—especially as Flash is also landing inside major coding and agent tools (see Products & Launches).

Key cost/perf signals:

  • Pricing cited at $0.50 / 1M input tokens and $3.00 / 1M output tokens.
  • Artificial Analysis reports Gemini 3 Flash Preview scoring 71 on their Intelligence Index (a 13-point gain over Gemini 2.5 Flash), and being 2× cheaper than Gemini 3 Pro Preview with only a 2-point drop.
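Given the cited per-token pricing, per-request cost is easy to estimate; a quick back-of-envelope helper (the 10k/2k token counts below are illustrative, not from the announcement):

```python
IN_PRICE = 0.50 / 1_000_000   # $ per input token (cited pricing)
OUT_PRICE = 3.00 / 1_000_000  # $ per output token (cited pricing)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Linear token-based pricing: cost scales with input and output counts.
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# A 10k-token prompt with a 2k-token reply:
# 10_000 * $0.50/1M + 2_000 * $3.00/1M = $0.005 + $0.006 = $0.011
cost = request_cost(10_000, 2_000)
```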

“With Gemini 3 Flash ⚡️, we are seeing reasoning capabilities previously reserved for our largest models, now running at Flash-level latency.”

2) xAI launches Grok Voice Agent API; third-party evals put it at #1 on audio reasoning

xAI announced the Grok Voice Agent API, aiming to let developers build voice agents that speak dozens of languages, call tools, and search real-time data. Artificial Analysis reports Grok Voice Agent as the new leading speech-to-speech reasoning model at 92.3% on Big Bench Audio, surpassing Gemini 2.5 Flash Native Audio and GPT Realtime.

Impact: Voice agents are becoming a first-class API surface, and the competition is now measurable on reasoning-oriented audio benchmarks—not just latency or “voice quality.”

Notable characteristics (per Artificial Analysis):

  • 0.78s average time to first token/audio (3rd fastest on their leaderboard)
  • $3 per hour of audio pricing
  • Tool calling, SIP telephony (Twilio/Vonage), 100+ languages, 5 voices

3) GPT-5 reportedly produces a complete, correct proof for an open math problem (no hints)

A post claims GPT-5 autonomously solved an open math problem submitted to IMProofBench, producing a complete, correct proof without human hints or intervention. The contribution is described as a “small but novel contribution to enumerative geometry”.

Impact: This is another data point that “research-adjacent” tasks (proofs, writeups, formal structure) are increasingly within reach—while also raising questions about authorship and disclosure practices.

4) METR fixes issues in its “time horizon” suite; Claude Sonnet 4.5 moves materially

METR says it found two issues in its time-horizon task suite, including an unfair scoring problem that disproportionately impacted Claude models. A set of tasks was misconfigured such that “success” required greatly exceeding the stated threshold; Claude models tended to stop at the stated threshold and were graded as failures.

After the fixes, METR reports Sonnet 4.5’s time horizon rising 16% and landing around 2 hrs 2 mins (still above the original estimate of 1 hr 53 mins).

Impact: Benchmark plumbing details can matter as much as model changes—especially when model behaviors differ (e.g., “satisficing” vs reward-hacking patterns).

5) OpenAI opens app submissions for a new ChatGPT in-app directory

OpenAI says developers can now submit ChatGPT apps for review, and approved apps will appear in a new in-ChatGPT app directory for user discovery. Apps are powered by an Apps SDK (beta), with open-source example apps, an open-source UI library for chat-native interfaces, and a quickstart guide.

Impact: This is a distribution and product-surface shift: “apps inside ChatGPT” becomes a channel with explicit quality/UX/safety review expectations.


Research & Innovation

Why it matters: A recurring theme is making systems faster and more controllable without retraining huge models—via retrieval tricks, serving/inference work, and frameworks that turn existing agents into RL-ready pipelines.

Training-free retrieval improvements: FB-RAG

FB-RAG (Forward-Backward RAG) is described as a training-free framework that improves retrieval by using a lightweight model to generate candidate reasoning/answers and scoring context chunks by relevance to those attempts. It uses a three-stage pipeline: retriever for recall → 8B model to sample reasoning/answers and score chunks → 70B generator for the final answer.

Reported results include >48% latency reduction while matching a leading baseline on EN-QA, or 8% performance improvement with 10% latency reduction.
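The three-stage pattern can be sketched schematically; all callables here are hypothetical stand-ins, and the toy word-overlap score merely stands in for FB-RAG's actual chunk-scoring method, which isn't detailed in this summary:

```python
def fb_rag_answer(question, retrieve, small_lm, big_lm, top_n=4):
    # Stage 1: high-recall retrieval of candidate chunks.
    chunks = retrieve(question)
    # Stage 2 ("forward"): a lightweight model drafts candidate reasoning/
    # answers; chunks are then scored by relevance to those drafts ("backward").
    drafts = small_lm(question)
    scored = sorted(chunks, key=lambda c: overlap(c, drafts), reverse=True)
    # Stage 3: the large generator answers from the best-scoring chunks only.
    return big_lm(question, scored[:top_n])

def overlap(chunk, drafts):
    # Toy relevance score: words shared between a chunk and the drafts.
    words = set(chunk.lower().split())
    return sum(len(words & set(d.lower().split())) for d in drafts)
```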

Reinforcement learning for agents without rewrites: Agent Lightning (Microsoft)

Microsoft’s Agent Lightning is presented as an open-source framework that adds RL to agent workflows without rewriting core code. It separates execution from training to turn workflows into RL-ready data and supports multi-step, tool-using, multi-agent workflows.

Fast video generation: TurboDiffusion

TurboDiffusion claims to accelerate video diffusion models by 100–205×.

Systems & compute utilization focus

One post argues current training often tops out at ~20% MFU (model FLOPs utilization) and inference utilization is often single-digit, suggesting the ceiling is software–hardware co-design rather than the GPUs themselves.


Products & Launches

Why it matters: The most practical changes this cycle are “new defaults” and “new surfaces”—models shipping into tools people already use (Search, IDEs, terminals), plus new marketplaces/directories that change distribution.

Gemini 3 Flash: where it’s showing up

Google says Gemini 3 Flash is rolling out as the default in the Gemini app and Search AI Mode, and is also available in developer and enterprise products. It highlights developer use cases like iterative development, “high-frequency workflows,” and applications needing quick answers plus deep reasoning.

Third-party and ecosystem integrations called out in the notes:

  • GitHub Copilot public preview
  • Cursor availability (noted as good for quickly investigating bugs)
  • Warp terminal uses Gemini 3 Flash for generated code diffs, citing a quality bump vs 2.5 Flash
  • Gemini CLI availability with install command npm install -g @google/gemini-cli@latest
  • Ollama cloud run command ollama run gemini-3-flash-preview:cloud
  • Cline adds Gemini 3 Flash Preview (noting 1M context / 64K output and native multimodal inputs)
  • Perplexity makes Gemini 3 Flash available to Pro/Max subscribers
  • tldraw adds Gemini 3 Flash at tldraw.computer

New clinical workflow product: Glass 5.0

Glass Health announced Glass 5.0 for ambient scribing and clinical decision support, adding patient-centric workflows like creating a patient record with shared context, file uploads (PDF/TXT/PNG/JPEG with OCR), EHR connectivity, and patient-tailored live insights during scribing.

Image-to-3D asset generation: TRELLIS.2 on fal

fal announced TRELLIS.2, an image-to-3D model producing up to 1536³ PBR textured assets, supporting arbitrary topology and multiple texture channels, with 16× spatial compression.

Replit inside ChatGPT

Replit says users can tag Replit in any ChatGPT chat to turn an idea into a working app inside ChatGPT, “no copying prompts, no context lost”.

Perplexity ships a new native iPad app

Perplexity launched a new iPad app optimized for iPad workflows (multitasking, wide screen), bringing core desktop features (Labs, Deep Research, Finance, Spaces, Discover) to iPadOS.


Industry Moves

Why it matters: “Compute + data” remain the hard constraints. Labs are pursuing infrastructure buildouts, data sourcing, and distribution channels—while new benchmarks and procurement patterns reshape what gets prioritized.

OpenAI: new U.S. compute infrastructure + data sourcing conversations

OpenAI says it’s building new AI infrastructure in the U.S., including a data center in Wisconsin, projecting 4,000+ skilled construction jobs and 1,000+ long-term jobs, and designed to be energy- and water-positive for the community.

Separately, a post reports OpenAI and Anthropic have held talks with biotech, financial services, and consumer healthcare companies to license or buy data for training.

Code as training data: failed startups selling codebases

A post describes a trend where data curation firms like Turing and AfterQuery buy failed startups’ codebases as AI training data.

AI safety org funding

Transluce announced an end-of-year 2025 fundraiser and describes its work building automated oversight tools, including an agent eval platform and interpretability tools.

Shipping inference know-how as readable code: mini-SGLang

LMSYS released mini-SGLang, distilling SGLang from ~300K to ~5,000 lines while keeping the core design and near-identical performance.


Policy & Regulation

Why it matters: Regulation is moving from abstract “AI rules” to concrete chokepoints: data centers and access, long-task evaluation standards, and security-related content restrictions.

Data center politics: moratorium proposal + counterargument

Sen. Bernie Sanders said he will push for a moratorium on construction of data centers powering the “unregulated sprint to develop & deploy AI”. A reply argues he is “terribly wrong,” saying democracy didn’t pause the Industrial Revolution but invented the 40-hour work week.

Power constraints and “who wins” narratives

Epoch AI Research argues the U.S. can likely build enough power for AI scaling through 2030 “as long as they’re willing to spend a lot,” and notes AI power demand could approach ~100 GW by 2030 under aggressive assumptions.

Jailbreaking/prompt injection ban (starting Jan 15)

One post claims new terms of service will ban jailbreaking and prompt injection starting January 15; the post does not name the provider.


Quick Takes

Why it matters: These smaller signals often show where the ecosystem is hardening: new evaluation tooling, faster iteration loops, and more “agent-native” product surfaces.

  • Gemini 3 app modes: the Gemini app describes three modes: Fast (quick answers), Thinking (complex reasoning), Pro (deep math/coding).
  • Gemini 3 Flash token behavior: Google says at the highest thinking level Flash can modulate how much it thinks and uses 30% fewer tokens on average than 2.5 Pro on typical traffic.
  • Gemini 3 Flash hallucination signals (third-party): Artificial Analysis reports Gemini 3 Flash Preview has a 91% hallucination rate in its AA-Omniscience benchmark (defined as answering incorrectly when it should refuse or admit not knowing).
  • Long-context evals: Context Arena reports Flash Preview ranking #1 at 1M context on 4-needle and 8-needle tests (AUC 68.0% and 49.4%) versus Pro’s 57.3% and 39.0% respectively.
  • Claude UI updates: Claude will sometimes suggest your next prompt in ghost text after a task finishes, and Claude Code adds syntax highlighting to diffs.
  • Exa “People Search”: Exa AI Labs says it enables semantic search over 1 billion people using a hybrid retrieval system backed by finetuned embeddings. A user later clarified results appear to be cached versions of LinkedIn pages when tested with a certain configuration.
  • OpenAI image generation “compute” messaging: OpenAI says compute enabled its first image generation launch, which drove a +32% jump in WAU, and says it needs more compute for what’s next.
ChatGPT Images (GPT Image 1.5) launches as open models and science evals accelerate
17 December 2025
8 minutes read
OpenAI’s new ChatGPT Images (GPT Image 1.5) rolls out broadly and immediately reshapes public image-model leaderboards, while Xiaomi’s MiMo‑V2‑Flash raises the bar for fast open MoE models. The brief also covers new science-focused evaluation (FrontierScience + wet-lab results), Meta’s open SAM Audio release, major funding/compute moves, and notable research breakthroughs.

Top Stories

Why it matters: This cycle blends a major leap in consumer-facing image creation, continued acceleration in open model competition, and a clear push toward harder evaluations—from PhD-level science reasoning to wet-lab workflows.

1) OpenAI rolls out ChatGPT Images + GPT Image 1.5 (and it’s already reshaping leaderboards)

OpenAI introduced ChatGPT Images, powered by a new flagship image generation model, with upgrades in instruction following, precise editing, detail preservation, and 4× faster generation. It’s rolling out to all ChatGPT users and is available in the API as GPT Image 1.5.

Early benchmark signals:

  • Artificial Analysis Image Arena: GPT Image 1.5 is reported #1 in Text-to-Image and Image Editing, surpassing Nano Banana Pro. Pricing is described as token-based, e.g., for a 1MP image: ~$133/1k images (high quality) and $9/1k (low quality).
  • LMSYS/Image Arena (preliminary): gpt-image-1.5 is #1 Text-to-Image (1264); chatgpt-image-latest is #1 Image Edit (1409); gpt-image-1.5 is #4 Image Edit (1395). Arena notes these scores are preliminary.

The launch is also triggering mixed qualitative takes. Some users report strong results (e.g., “my favourite results”), while others argue the model “fails Vibe Checks” despite leaderboard claims.

2) Xiaomi’s MiMo-V2-Flash: a 309B MoE aimed at fast “agentic AI”

Xiaomi introduced MiMo‑V2‑Flash, an open-source MoE model with 309B total parameters (15B active), positioned as “Designed for Agentic AI”. Highlights include:

  • Hybrid Attention (5:1 interleaved 128-window sliding-window attention + global) with 256K context
  • Claims to match DeepSeek‑V3.2 on general benchmarks “at a fraction of the latency”
  • SWE‑Bench Verified: 73.4% and SWE‑Bench Multilingual: 71.7% (described as new SOTA for open-source models)
  • 150 output tokens/s with day‑0 support noted

A companion engineering thread highlights:

  • A Hybrid SWA architecture where a fixed KV cache “plays way nicer with current infra,” with window size 128 described as the “magic number” (and “sink values” called essential).
  • MOPD (On‑Policy‑Distillation) claimed to match a teacher model using <1/50th the compute of a standard SFT+RL pipeline.

3) OpenAI launches FrontierScience + shows a 79× improvement in a wet lab protocol

OpenAI released FrontierScience, a new benchmark intended to measure PhD‑level scientific reasoning across physics, chemistry, and biology, using hard expert-written questions (olympiad-style and longer research-style tasks). The benchmark is positioned as “upstream” of the more meaningful goal: enabling novel scientific discoveries.

OpenAI says GPT‑5.2 is its strongest model on FrontierScience, with gains on hard scientific tasks, while the benchmark exposes a gap between structured problems and open-ended, iterative research reasoning.

Separately, OpenAI describes real-world lab testing with Red Queen Bio: GPT‑5 proposed, ran (via a controlled framework), and iterated on experiments that increased a standard molecular cloning protocol’s efficiency by 79×, including a new enzyme-based approach.

4) Meta open-sources SAM Audio for multimodal audio separation

Meta introduced SAM Audio, described as the first unified model that can isolate sounds from complex audio mixtures using text, visual, or span prompts. Meta says it is sharing the model with the community alongside a perception encoder, benchmarks, and research papers.

Meta also claims SAM Audio outperforms previous models across a wide range of benchmarks and tasks.

5) OpenAI reportedly in talks for $10B+ from Amazon and use of AWS Trainium

Posts citing reported discussions say OpenAI is in talks to raise $10B+ from Amazon, plans to use AWS Trainium chips, and is discussing commerce partnership opportunities. Another post says such an investment would help OpenAI afford its commitments, including those to AWS.


Research & Innovation

Why it matters: The research stack continues to move in three directions: (1) attention and efficiency redesigns, (2) multimodal systems that point/ground rather than just describe, and (3) benchmarks that reveal where models fail in realistic settings.

DeepSeek v3.2’s “DSA” attention: sparse attention via an indexer

DeepSeek v3.2 introduces DeepSeek Sparse Attention (DSA), described as sparse attention using an indexer to select the top‑k relevant key tokens per query token. The indexer produces an “index mask” that replaces the causal mask in multi-head latent attention (MLA).
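A toy sketch of the index-mask idea as described (the real indexer is a learned lightweight network; here `index_scores` is just a precomputed T×T score table, and the mask construction is the only point being illustrated):

```python
def dsa_mask(index_scores, k):
    # For each query position q, keep only the top-k keys among the causally
    # visible positions (j <= q); the resulting boolean mask stands in for the
    # dense causal mask during attention, making it sparse.
    T = len(index_scores)
    mask = [[False] * T for _ in range(T)]
    for q in range(T):
        causal = [(index_scores[q][j], j) for j in range(q + 1)]
        for _, j in sorted(causal, reverse=True)[:k]:
            mask[q][j] = True
    return mask
```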

Apple’s SHARP: single-image 3D Gaussian “splats” in under 1 second

Apple research introduces SHARP, generating a complete 3D Gaussian representation from a single image in under 1 second on a standard GPU. Reported comparisons vs. Gen3C on ScanNet++ include DISTS 0.071 vs 0.090 and LPIPS 0.154 vs 0.227 (lower is better), plus a latency comparison of <1s vs ~850s.

Google Research’s FACTS Leaderboard: factuality across four dimensions

Google Research introduced the FACTS Leaderboard, a suite measuring factuality across multimodal, parametric knowledge, search, and document grounding dimensions. Results cited include Gemini 3 Pro at 68.8% overall, Gemini 2.5 Pro at 62.1%, and GPT‑5 at 61.8%. The writeup emphasizes that a single “factuality number” can hide behavioral differences (e.g., coverage vs. contradictions).

Diffusion training speedups: “SpeedrunDiT” hits ImageNet SOTA fast

A reported result: SR‑DiT (SpeedrunDiT) combines multiple recent techniques into a modern baseline and achieves SOTA ImageNet diffusion results in 10 hours on a single H200 node, with a claimed 360× convergence speedup logged via Weights & Biases.

Molmo 2: video/image “pointing” with coordinates and timestamps

Molmo 2 is described as returning coordinates and timestamps over videos and images, supporting tasks like QA, counting, dense captioning, artifact detection, and subtitle-aware analysis. It’s also described as Apache 2.0 licensed, with released image/video datasets and a separate 4B model for video pointing/counting.


Products & Launches

Why it matters: Products are pushing toward (a) making agents deployable and governable, and (b) packaging advanced multimodal generation into workflows users can actually operate.

OpenAI: new Images surface in ChatGPT + API workflow improvements

OpenAI added an Images surface inside ChatGPT (“tap ‘Images’ in the sidebar”) and says the model adheres more reliably to intent, changing only what you ask for while keeping lighting, composition, and appearance consistent across edits. It also highlights multiple edit operations such as adding, subtracting, combining, blending, and transposing.

For developers, OpenAI notes improvements like more precise editing and preservation of logos and faces, better prompt adherence, and improved text rendering. It also states the model is 20% cheaper for image inputs/outputs, and suggests cost optimization via a low-quality setting.

Google: “CC” agent + new Gemini “Gems” mini-apps + visual Deep Research

  • Google Labs CC is a new experimental Gemini-based agent that connects Gmail, Google Calendar, Google Drive, and the web to deliver a daily “Your Day Ahead” briefing. It’s launching in early access to U.S./Canada consumer accounts (18+), starting with Google AI Ultra subscribers.
  • Gemini is rolling out new Gems: interactive “AI mini-apps” on desktop that turn prompts into actionable tools, with examples like Recipe Genie, Marketing Maven, and a Claymation Explainer.
  • Gemini’s Deep Research can now generate visual reports with charts/diagrams/animations (for AI Ultra subscribers on desktop).

AssemblyAI: Self-Hosted Voice AI

AssemblyAI launched Self‑Hosted Voice AI, deploying its Universal‑Streaming model on customer infrastructure, aimed at compliance/data residency and “tighter control” requirements. It highlights session-based pricing with volume discounts and “no self-hosting premium”.

vLLM Router: serving-aware load balancing

The vLLM project introduced vLLM Router, a Rust-based, prefill/decode-aware load balancer for vLLM fleets designed to improve throughput and tail latency by accounting for KV-cache locality and P/D disaggregation.
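To make the KV-cache-locality idea concrete, here is a deliberately simplified prefix-affinity router; this is an illustrative assumption, not vLLM Router's actual policy, and `pick_replica` and `prefix_len` are hypothetical names:

```python
import hashlib

def pick_replica(prompt: str, replicas: list, prefix_len: int = 256) -> str:
    # Requests that share a prompt prefix hash to the same replica, so the
    # shared prefix's KV cache is likely already resident on that server.
    # A production router would also weigh live load, prefill/decode phase,
    # and queue depth, which this sketch ignores.
    key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
    return replicas[int(key, 16) % len(replicas)]
```

Because routing depends only on the prefix, two requests that share a long system prompt land on the same replica even if their suffixes differ.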

VS Code: governance features for agentic tools

VS Code’s latest release highlights org controls including a Private Marketplace for extension curation, fine‑grained URL approval flows for fetch tools, and centralized management of which tools can be auto-approved.


Industry Moves

Why it matters: Capital, compute partnerships, and consolidation are shaping what gets built (and who can afford to build it).

Databricks raises $4B+ and shares growth metrics

Databricks announced a $4B+ fundraise led by Insight Partners, Fidelity Investments, and J.P. Morgan. It reported a $4.8B revenue run-rate with 55%+ YoY growth, plus a $1B run-rate each for Data Warehousing and AI products, and cash-flow positivity over the last 12 months.

OpenAI enters agreement to acquire neptune.ai

OpenAI entered a definitive agreement to acquire neptune.ai, described as strengthening tooling and infrastructure supporting frontier research.

fal raises $140M Series D

Multimodal AI startup fal raised a $140M Series D led by Sequoia; the post claims 300% revenue growth since July and 600+ multimodal generation models.


Policy & Regulation

Why it matters: Government programs and political proposals are now directly targeting the infrastructure and workflows that determine AI’s pace.

U.S. “Genesis Mission” links national labs, supercomputers, and private partners

A post says U.S. President Trump signed an executive order creating the Genesis Mission, a Department of Energy program linking national labs, supercomputers, and private partners (including Anthropic, Nvidia, and OpenAI) to train models on federal datasets and automate experiments across areas like energy, biotech, materials, and semiconductor research.

Proposed moratorium on AI data centers (and pushback)

Sen. Bernie Sanders said he will push for a moratorium on data center construction powering “the unregulated sprint” to develop and deploy AI. A reply argued the moratorium would, “ironically,” ensure benefits accrue only to “the 1%”.


Quick Takes

Why it matters: Smaller signals often reveal where the ecosystem is headed next—tooling hardening, benchmarking becoming more adversarial, and multimodal capabilities expanding beyond text.

  • GPT‑5.2 on Arena: GPT‑5.2-high appears on Arena’s WebDev leaderboard at #2 (1486), while GPT‑5.2 is #6 (1399). Text leaderboard results cited place GPT‑5.2-high at #13 (1441), below GPT‑5.1-high at #6.
  • COLT open problem solved with GPT‑5.2 Pro: A thread claims GPT‑5.2 solves the COLT 2022 open problem on accelerated L1-regularized PageRank under a complementarity margin assumption; proofs were generated by GPT‑5.2 Pro and auto-formalized with other systems, with the author reporting manual verification twice.
  • Tencent HY World 1.5 (WorldPlay): Tencent’s world model is described as offering real-time interaction and long-term memory and is said to go open-source “tomorrow”.
  • Google DeepMind Native Audio: Google DeepMind released an updated Gemini 2.5 Flash Native Audio model for live voice agents, described as better at following instructions and holding more natural conversations.
  • NVIDIA Nemotron 3 additional details: A thread claims releases include 3T tokens of new pretraining, 18M post-training samples, and an open-source RL environment (“NeMo Gym”), while framing Nemotron 3 as optimized for NVIDIA hardware.