
AI High Signal Digest
by avergin
Comprehensive daily briefing on AI developments including research breakthroughs, product launches, industry news, and strategic moves across the artificial intelligence ecosystem

Top Stories — why it matters: frontier capability, cost, and scale are shifting fast
Qwen3‑Max (Preview) passes the 1T‑parameter mark
- Alibaba introduced Qwen3‑Max‑Preview (Instruct), “our biggest model yet, with over 1 trillion parameters,” available now via Qwen Chat (https://chat.qwen.ai/) and the Alibaba Cloud API (https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-preview). The team says it beats their prior best (Qwen3‑235B‑A22B‑2507), with internal tests and early user feedback confirming stronger performance, broader knowledge, and better conversations, agentic tasks, and instruction following (https://x.com/Alibaba_Qwen/status/1963991502440562976). OpenRouter lists the preview with “higher accuracy in math, coding, logic, and science,” stronger instruction following, reduced hallucinations, and optimization for RAG and tool calling, with no “thinking” mode (https://x.com/OpenRouterAI/status/1963992206446145805). Community assessments call it a solidly frontier‑class model at its $15/MT price, but flag hallucinated “thinking” traces and weak search (https://x.com/teortaxesTex/status/1963994291765649716, https://x.com/suchenzang/status/1963997447039906245).
“Scaling works — and the official release will surprise you even more. Stay tuned!” (https://x.com/Alibaba_Qwen/status/1963991502440562976)
Kimi K2‑0905 ships weights; pushes cheaper coding and longer context
- Moonshot’s Kimi K2‑0905 update emphasizes coding (front‑end, tool‑calling), extends context to 256k tokens, improves integration with agent scaffolds (e.g., Claude Code, Roo Code), and posts weights and code on Hugging Face (https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905). Chat is at https://www.kimi.com, and a high‑throughput “turbo” API (60–100 TPS, claimed 100% tool‑call accuracy) is at https://platform.moonshot.ai (https://x.com/Kimi_Moonshot/status/1963802687230947698). Users report large cost deltas vs. Sonnet 4 (roughly 80–90% cheaper), and K2 is already live as a provider in Cline via Fireworks AI (https://x.com/ting_/status/1963865598686900331, https://x.com/cline/status/1964033209244799183, https://cline.bot/blog/moonshot-kimi-k2-0905).
Baseten raises $150M to scale inference for the AI app layer
- Baseten closed a $150M Series D led by BOND (Jay Simons joining the board) with participation from Conviction, CapitalG, 01 Advisors, IVP, Spark Capital, Greylock, Scribble Ventures, BoxGroup, and Premji Invest; customers include Abridge, Bland, Clay, Gamma, Mirage, OpenEvidence, Sourcegraph, WRITER, and Zed Industries (https://x.com/tuhinone/status/1963945981382451488, https://x.com/basetenco/status/1963979571193405447). The founder’s framing underscores secular cost declines and rising usage:
“I think the token price goes down and inference should get cheaper over time. And that really just means there is going to be more inference.” “Every time we lower prices or optimize models to make it cheaper, four months later customers are spending more anyway.” “Inference prices will go down, but if the world is run by AI in 10 years, there is going to be a lot of inference. It better be cheap.”
On‑device embeddings get a lift (smaller, faster, multilingual)
- Google DeepMind’s EmbeddingGemma targets on‑device use and leads the MTEB leaderboard for models under 500M parameters; it is supported by Hugging Face Text Embeddings Inference v1.8.1 for fast, efficient serving (https://x.com/alvarobartt/status/1963637305375105179). Practitioners note that very small embedding models are key to context management, among other things (https://x.com/narsilou/status/1963879172083732647). A minimal sketch of the pattern follows below.
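A hedged sketch of local semantic search with a small embedding model via Sentence Transformers. The checkpoint id is an assumption; substitute the official EmbeddingGemma model name from Hugging Face.

```python
# Hedged sketch: on-device semantic search with sentence-transformers.
# The model id below is an assumption, not confirmed by the source.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed id

docs = [
    "Kimi K2-0905 extends context to 256k tokens.",
    "sqlite-vec adds vector search to SQLite.",
    "Baseten raised a $150M Series D.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(["which model has a long context window?"],
                         normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_emb @ query_emb.T
print(docs[int(scores.argmax())])
```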
Macro view: compute scaling likely to slow
- Epoch’s analysts forecast fast diffusion now and broad cognitive automation by ~2035, but expect near‑term slowdowns in compute scaling due to investor uncertainty, the heavy costs of overinvestment, and rising lead times: for every 10× increase in compute investment, the lead time between project initiation and product deployment grows by roughly a year (https://x.com/EpochAIResearch/status/1963999866138317097, https://x.com/EpochAIResearch/status/1964083741778989166). Full transcript and episode links are on the Epoch After Hours site (http://epoch.ai/epoch-after-hours; YouTube: https://youtu.be/Ab6HfmmCFQs).
Research & Innovation — why it matters: new methods are squeezing more capability from less compute
Agentic RL for reasoning: rStar2‑Agent (14B) reaches frontier‑level math in 510 steps
- Microsoft Research trained a 14B math‑reasoning model with tool‑augmented RL in a Python environment, reporting Pass@1 of 80.6 on AIME24, 69.8 on AIME25, and 52.7 on HMMT25, matching or exceeding o3‑mini (medium) and DeepSeek‑R1 despite its far smaller size, with shorter responses than Qwen3‑14B and QWQ‑32B. Training starts with non‑reasoning SFT to teach tool use and formatting, then three RL stages that scale max output length 8K → 12K → 12K over 42K curated math items; GRPO‑RoC oversamples rollouts and keeps only the cleanest correct ones while preserving diverse failures (see the sketch below); a dedicated, isolated code service handles ~45K concurrent tool calls per training step at ~0.3s latency (https://x.com/omarsar0/status/1964045141016240182; paper: https://www.arxiv.org/abs/2508.20722).
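An illustrative sketch of the GRPO‑RoC filtering idea described above, not the paper’s code: oversample rollouts, keep the cleanest correct ones, and retain a diverse subset of failures for the RL update. Field names are hypothetical.

```python
# Illustrative sketch of GRPO-RoC-style rollout filtering (field names
# are hypothetical stand-ins, not the paper's implementation).
import random

def select_rollouts(rollouts, n_correct=4, n_failures=4):
    correct = [r for r in rollouts if r["answer_ok"]]
    # "Cleanest" = fewest tool-call and formatting errors.
    correct.sort(key=lambda r: r["tool_errors"] + r["format_errors"])
    failures = [r for r in rollouts if not r["answer_ok"]]
    kept_failures = random.sample(failures, min(n_failures, len(failures)))
    # The RL update trains on clean successes plus diverse failures.
    return correct[:n_correct] + kept_failures
```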
Unifying post‑training: SFT and RL under one objective; Hybrid Post‑Training (HPT)
- A new paper proposes a Unified Policy Gradient Estimator that covers both SFT and RL losses within a single formulation, showing they optimize the same objective: maximizing expected reward with a KL term to a behavior policy (a hedged reconstruction follows below). Building on this, Hybrid Post‑Training switches between RL and SFT based on simple performance feedback to balance exploration and exploitation, and reportedly surpasses strong baselines across model scales and families (https://x.com/omarsar0/status/1963971173735448858; paper: https://arxiv.org/pdf/2509.04419).
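A hedged reconstruction of the shared objective, not the paper’s exact notation; the KL direction and the weight β vary by method, and π_b denotes the behavior policy (the data distribution for SFT, the sampling policy for RL).

```latex
% Hedged reconstruction of the unified post-training objective:
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_b(\cdot \mid x) \right)
```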
Vision‑language data at scale: FineVision
- FineVision open‑sources a curated VLM training dataset drawing on over 200 sources, with 17M unique images and 10B answer tokens; the authors report >20% improvement across 10 benchmarks and new capabilities such as GUI navigation, pointing, and counting. Community leaders laud the effort as a major boost for open‑source VLMs (https://x.com/andimarafioti/status/1963610118165000479, https://x.com/ClementDelangue/status/1963903694061138216).
On‑device RAG plumbing: sqlite‑vec
- A small vector‑search SQLite extension written in C with no dependencies (MIT/Apache‑2.0 dual‑licensed) reports querying 1M 128‑dim vectors in 17ms and 500k 960‑dim vectors in 41ms. It supports Matryoshka embedding slicing; binary quantization (32× storage reduction with minimal accuracy loss); L2, cosine, and Hamming distances; and SDKs for Python, JavaScript, Go, Rust, WASM, and more, so it runs locally anywhere. Examples pair it with EmbeddingGemma and Ollama for offline, on‑device personalization (https://x.com/_philschmid/status/1819276123433263115, https://x.com/_philschmid/status/1963952204970078579; script: https://github.com/philschmid/gemini-samples/blob/main/scripts/embeddinggemma-sqlite-ollama.py; project: https://alexgarcia.xyz/sqlite-vec/). A minimal usage sketch follows below.
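A minimal sketch of local vector search with the sqlite‑vec Python bindings (`pip install sqlite-vec`); toy 4‑dimensional vectors stand in for real embedding‑model outputs.

```python
# Minimal local vector search with sqlite-vec (toy 4-dim vectors).
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)          # load the extension into this connection
db.enable_load_extension(False)

db.execute("CREATE VIRTUAL TABLE notes USING vec0(embedding float[4])")
for rowid, vec in [(1, [0.1, 0.1, 0.9, 0.0]),
                   (2, [0.9, 0.0, 0.1, 0.1]),
                   (3, [0.1, 0.8, 0.1, 0.2])]:
    db.execute("INSERT INTO notes(rowid, embedding) VALUES (?, ?)",
               (rowid, serialize_float32(vec)))

# KNN query: MATCH against the query vector, ordered by distance.
rows = db.execute(
    "SELECT rowid, distance FROM notes "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 2",
    (serialize_float32([0.1, 0.1, 0.8, 0.1]),),
).fetchall()
print(rows)  # nearest rowids with distances
```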
Scheduling for prefill/decode architectures: ByteDance’s HeteroScale
- A coordinated autoscaling framework that balances prefill and decode pools across heterogeneous GPUs using a single reliable metric, decode tokens/sec, to scale both together. Reported results: prefill GPU utilization up from 46.8% to 76.2%, decode utilization roughly steady (86.0% → 82.2%), overall GPU usage efficiency up 41.3%, and hundreds of thousands of GPU‑hours saved daily (https://x.com/TheTuringPost/status/1964091993430380807; paper: https://arxiv.org/abs/2508.19559). An illustrative sketch of the signal‑to‑replica mapping follows below.
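An illustrative sketch, not ByteDance’s implementation: derive both pool sizes from the one decode‑throughput signal, holding a fixed prefill:decode ratio so the tiers scale together. The ratio and per‑GPU throughput are made‑up placeholders.

```python
# Illustrative autoscaling sketch: one signal (decode tokens/sec) drives
# both replica pools. All constants are placeholder assumptions.
def target_replicas(decode_tps: float,
                    tps_per_decode_gpu: float = 2500.0,
                    prefill_per_decode: float = 0.5) -> tuple[int, int]:
    decode = max(1, round(decode_tps / tps_per_decode_gpu))
    prefill = max(1, round(decode * prefill_per_decode))
    return prefill, decode

# e.g. 180k decode tokens/sec -> (36, 72) replicas under these assumptions
print(target_replicas(180_000))
```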
Self‑supervised vision backbone: Meta’s DINOv3
- A 6.7B‑parameter ViT trained on over 1.7B Instagram images introduces a new loss term that preserves patch‑level diversity, overcoming some limitations of training without image labels; Meta reports stronger embeddings for tasks like segmentation and depth. Weights and training code ship under a license that allows commercial use but forbids military applications (https://x.com/DeepLearningAI/status/1964071035810046378; https://hubs.la/Q03GYwMQ0).
Reality check on coding benchmarks: LiveCodeBench Pro
- On continuously updated competitive‑programming problems from Codeforces, ICPC, and IOI, the best model achieves only ~53% pass@1 on medium‑difficulty problems and 0% on hard ones, underscoring current limits on highly compositional code generation (https://x.com/arankomatsuzaki/status/1934433210387296414).
Products & Launches — why it matters: users get immediate utility from new features and workflows
- ChatGPT adds branching conversations on web for logged‑in users: explore different directions without losing the original thread (https://x.com/OpenAI/status/1963697012014215181).
- Anthropic publishes a 1‑hour deep dive on prompt engineering: what makes a good prompt; handling ambiguity, reasoning paths, and edge cases; enterprise vs. research prompting strategies; personas, metaphors, and structured logic; and jailbreaking, trust, and testing prompts at scale (https://x.com/LiorOnAI/status/1963980992563020201; video: https://www.youtube.com/watch?v=T9aRN5JkmL8).
Google Photos “Create” tab (U.S.) rolls out Veo 3 and generative tools
- Photo‑to‑video with Veo 3 (“Subtle movement” or “I’m feeling lucky” prompts), Remix (anime, comic, sketch, and 3D styles), Cinematic photos (moving 3D representations), Animations (multi‑shot GIFs), Highlight videos (montages compiled from gallery searches like “mom” or “Paris”), and a collage editor (https://x.com/Google/status/1964016938394231043).
Runway “edit reality” on web/iOS and Aleph on mobile
- “Edit reality” demos emphasize ease (“Anyone can make this, today, in a few minutes”); Runway Aleph has been released on mobile, letting you film the world around you, upload it to the Runway app, and apply text‑prompted edits on the go (https://x.com/runwayml/status/1963950900663185866, https://x.com/c_valenzuelab/status/1963986402653204968, https://x.com/jerrod_lew/status/1963648251795837227).
- Reka Research API demo: build geo‑aware agent apps (e.g., a restaurant‑recommendation app) with visible agent actions; code samples, docs, and platform are available (https://github.com/reka-ai/api-examples-typescript, https://platform.reka.ai, https://docs.reka.ai).
LlamaIndex SemTools: CLI parsing + semantic search for agents
- SemTools adds parse and search commands so command‑line agents (e.g., Claude Code) can handle complex documents with fuzzy semantic keyword search. On a “medium” dataset of 1,000 arXiv PDFs, agents given CLI access plus fast static‑embedding search produced more detailed, accurate answers across search, cross‑reference, and temporal‑analysis tasks; the authors report that standard fixed top‑k RAG was strictly worse in these cases, and that combining grep, find, and semantic search handles a wide variety of document tasks at high fidelity (https://x.com/llama_index/status/1964009128973783135, https://x.com/jerryjliu0/status/1964098215181168732; blog: https://www.llamaindex.ai/blog/semtools-are-coding-agents-all-you-need; repo: https://github.com/run-llama/semtools).
Build weekend: Nano Banana hackathon and free Gemini API tier
- A global 48‑hour hackathon (September 6–7) with $400,000 in prizes for building with Gemini 2.5 Flash Image, a.k.a. Nano Banana, judged on innovation, technical execution, impact, and presentation; online (Kaggle: https://www.kaggle.com/competitions/banana) and in‑person (SF) tracks. Google is unlocking a free tier of the Gemini API for gemini‑2.5‑flash‑image this weekend only, starting 00:01 UTC on 06/09 (https://x.com/_philschmid/status/1964052906631438729, https://x.com/GoogleAIStudio/status/1964119795454111776; API keys: http://ai.studio/apikey). A hedged API sketch follows below.
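A hedged sketch of calling the weekend free‑tier image model through the google‑genai SDK (`pip install google-genai`); response handling may differ slightly across SDK versions.

```python
# Hedged sketch: image generation with the Gemini API (google-genai SDK).
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key from http://ai.studio/apikey
resp = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents="A banana in a tiny superhero cape, studio lighting",
)
for part in resp.candidates[0].content.parts:
    if part.inline_data is not None:  # generated image bytes
        with open("banana.png", "wb") as f:
            f.write(part.inline_data.data)
```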
Industry Moves — why it matters: capital and strategy determine who ships at scale
OpenAI plans custom AI accelerators with Broadcom
- Reports indicate OpenAI will begin mass production of an in‑house “XPU” co‑designed with Broadcom to ease its dependence on Nvidia GPUs, powering training and deployment of models like GPT‑5. Shipping is slated for 2026, backed by a reported ~$10B order commitment that Broadcom’s CEO hinted at during earnings, mirroring moves by Google, Amazon, and Meta to build specialized chips (https://x.com/WesRothMoney/status/1963987993082958041).
Cohere Labs leadership
- Marzieh (@mziizm) announced she is stepping into the role of Head of Cohere Labs; peers called her a strong fit and encouraged following what she, Joelle Pineau, and the team build next (https://x.com/mziizm/status/1963894793588613159, https://x.com/nickfrosst/status/1963977983191851168).
Inference operations and customers
- Baseten’s $150M raise (see Top Stories) reinforces demand for managed inference; the company lists production customers across healthcare, dev tools, and productivity, including Abridge, Bland, Clay, Gamma, Mirage, OpenEvidence, Sourcegraph, WRITER, and Zed Industries (https://x.com/tuhinone/status/1963945981382451488, https://x.com/basetenco/status/1963979571193405447).
Enterprise AI adoption: Devin as a data analyst
- Eight Sleep reports answering 3× more data questions weekly by invoking Devin directly in Slack (https://x.com/cognition/status/1964035498932130116; case study: https://www.eightsleep.com/blog/how-eight-sleep-uses-devin-as-a-data-analyst/).
Vector DB in production: Qdrant case study
- Fieldy (Fieldly.ai), a wearable note‑taker that continuously records, transcribes, and makes conversations searchable, migrated after its first vector database failed on ~10% of requests. After moving to Qdrant it reports reliable recall and search in production, 66% lower infrastructure costs, and scaling to tens of millions of embeddings without downtime; it is now adding hybrid search, reciprocal rank fusion, and location/time‑based filtering (https://x.com/qdrant_engine/status/1963947534721638830; case study: https://buff.ly/uPQ8He1).
Policy & Regulation — why it matters: access rules and licenses shape competition and research
Anthropic’s regional restrictions and data policy update
- A blog update says Anthropic now prohibits use by companies or organizations whose ownership structures subject them to control from jurisdictions where its products are not permitted, such as China. Community posts question whether the move is safety‑driven or protectionist, noting that Chinese labs’ open‑weight models (DeepSeek, Qwen) have advanced AI accessibility and are gaining share on the OpenRouter leaderboard, and asking whether “national security” justifications could be used to protect market share in domains like coding assistants (https://x.com/LuozhuZhang/status/1963884496966889669, https://x.com/teortaxesTex/status/1963897200267735512). Anthropic’s consumer terms also shifted to explicit opt‑in for training, with opted‑in data retained for up to five years.
Dataset licensing tightens: NVIDIA’s Nemotron‑CC‑v2
- A widely shared thread highlights that the dataset ships under the “NVIDIA Data Agreement for Model Training,” which reportedly prohibits its use in anything open‑source, composing it with other data, or even releasing benchmarks without NVIDIA’s permission (https://x.com/soldni/status/1964117054442594680, https://x.com/eliebakouch/status/1964214679753855199).
Litigation watch (reported): Anthropic settlement
- A post citing a court filing claims a $1.5B class‑action settlement related to training on torrented copyrighted downloads (https://x.com/EMostaque/status/1964067872101065120; filing: https://storage.courtlistener.com/recap/gov.uscourts.cand.434709/gov.uscourts.cand.434709.362.0_1.pdf).
Quick Takes — why it matters: fast signals for your radar
- SWE‑rebench (fresh GitHub PR tasks, no training‑set leakage): August results are lower than SWE‑Bench Verified because the issues are both newer and unverified, a helpful reality check for agentic‑coding claims (https://x.com/ibragim_bad/status/1963702541428072871, https://x.com/gneubig/status/1963947072748412990).
- FutureX weekly leaderboard: Grok4 tops GPT‑5‑pro, ChatGPT‑Agent, and Gemini Deep Think; open research agents such as MiroMind’s 72B model perform surprisingly well. Full board: https://futurex-ai.github.io (https://x.com/liujiashuo77/status/1963591894627459399).
- App store signal: 3 of the top 4 U.S. Productivity apps are AI apps; Perplexity hit #4 within two weeks of its iOS redesign (https://x.com/AravSrinivas/status/1963981119243915698, https://x.com/dnlkwk/status/1963971261593170158).
- AMD ROCm quality concerns: SemiAnalysis tallies 200+ PyTorch unit tests skipped exclusively on ROCm (skipIfRocm) plus 200+ explicitly disabled, a net increase of 110 disabled tests since June 2025, including attention, fused TP matmul, and other transformer ops; AMD’s team is reportedly now treating the backlog with urgency (https://x.com/SemiAnalysis_/status/1963708743218339907).
- GPT‑5 Pro coding: multiple practitioners report it reliably solves complex coding tasks given longer think time (Karpathy describes it cracking, in ~10 minutes, problems he and Claude Code had struggled with for an hour), though others note overly‑RLHF’d, small‑model‑style mistakes on real work; diversify models in orchestration and evals (https://x.com/karpathy/status/1964020416139448359, https://x.com/gdb/status/1964076158221275543, https://x.com/mbusigin/status/1964097880739795453).
- Qwen updates: OpenRouter lists Qwen3‑Max‑Preview with no “thinking” mode (https://x.com/OpenRouterAI/status/1963992206446145805); a user notes “Qwen 3 Max has no ‘thinking’, interesting” (https://x.com/nrehiew_/status/1963983494809317814).
- Stealth long context: two new stealth models, Sonoma Sky Alpha and Sonoma Dusk Alpha, are available in AnyCoder via OpenRouter and advertise 2M‑token context (https://x.com/_akhaliq/status/1964184395847233822; https://huggingface.co/spaces/akhaliq/anycoder).
- Math OCR for reasoning data: Marker/Surya report SoTA on the external olmocr math benchmark and beat Mathpix in an internal eval by a tier‑1 AI research lab; examples show GPT‑5 misrecognizing a tau that Marker got exactly right (https://x.com/VikParuchuri/status/1964059444024655982).
- GPU performance deep dive: Part 2 of Modular’s Blackwell matmul series covers optimizing memory performance through shared‑memory access and swizzling (https://www.modular.com/blog/matrix-multiplication-on-nvidias-blackwell-part-2-using-hardware-features-to-optimize-matmul).
- Weights & Biases: tracing/instrumentation upgrades “especially useful for Reinforcement Learning” are coming to Weave (https://x.com/weights_biases/status/1964085692289732618).
- OpenAI jobs platform: reports say an AI‑powered hiring platform is targeted for mid‑2026, putting OpenAI in close competition with LinkedIn, with plans to certify “AI fluency” (https://x.com/ZeffMax/status/1963686821034160280).

Top Stories
Why it matters: These shifts expand what AI can do on-device, speed up inference, strengthen agentic workflows, and apply AI to frontier science.
EmbeddingGemma brings state-of-the-art on-device multilingual embeddings
- Google released a 308M-parameter open embedding model that runs offline in <200MB RAM, ranks highest among open models under 500M parameters on MTEB, is trained across 100+ languages, supports dynamic output dimensions (768→128), and integrates with Hugging Face, LlamaIndex, LangChain, and more (https://x.com/GoogleDeepMind/status/1963635422698856705, https://x.com/_philschmid/status/1963634786636841461). One practitioner embedded 1.4M documents in ~80 minutes on an M2 Max for free, estimating ~$200 with text-embedding-3-large, at worse quality (https://x.com/rishdotblog/status/1963805087014502497).
Moonshot’s Kimi K2-0905 doubles agent context to 256K and lands across providers
- Kimi extended context to 256K tokens, improved tool-calling and front-end coding, and published weights on Hugging Face (https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905); it is now available via Together AI (“built for agents, priced for scale”) and Groq (200+ tokens/s at $1.50/M tokens, vendor claims) (https://x.com/Kimi_Moonshot/status/1963802687230947698, https://x.com/togethercompute/status/1963806032548843865, https://x.com/GroqInc/status/1963823577557606665). The team advertises 60–100 TPS and “guaranteed 100% tool-call accuracy” on its turbo API (https://platform.moonshot.ai); the long context aligns with community emphasis that agents “really really need ultra long context” (https://x.com/Teknium1/status/1963807244190900618).
Meta introduces Set Block Decoding (SBD) to accelerate LLM inference
- SBD samples multiple future tokens in parallel, cutting forward passes by 3–5× with no architecture changes, staying KV-cache compatible, and matching standard next-token-prediction training performance (https://x.com/arankomatsuzaki/status/1963817987506643350; paper: https://arxiv.org/abs/2509.04185).
DeepMind’s “Deep Loop Shaping” improves LIGO control, published in Science
- In hardware tests on the real LIGO system, the method controlled noise 30–100× better than existing controllers and can, for the first time, eliminate the most unstable feedback loop as a meaningful noise source; in a simulated LIGO environment it reduced control noise by a factor of ten or more, stabilizing the mirrors and the observation band used for measuring gravitational waves and helping scientists see events like black hole mergers of up to a few hundred solar masses. Developed with LIGO, Caltech, and Gran Sasso; published in Science (https://x.com/GoogleDeepMind/status/1963664018515849285, https://x.com/demishassabis/status/1963795824854335528).
Research & Innovation
Why it matters: New methods target core bottlenecks (memory, inference speed), training stability, and agent reliability—while large-scale studies challenge optimization lore.
Optimizers at scale: “Fantastic Pretraining Optimizers and Where to Find Them”
- A careful benchmark (>4,000 models; 10 optimizers; 0.1B–1.2B parameters) finds 2× speedups over AdamW are unlikely: with rigorous hyperparameter tuning and scale, gains shrink to ~10% (e.g., Muon shows ~40% speedup below 0.5B but only 10% at 1.2B at 8× Chinchilla) (https://x.com/wen_kaiyue/status/1963633867140526319, https://x.com/percyliang/status/1963648131394122222; W&B workspace: https://wandb.ai/marin-community/optimizer-scaling). Methodological critiques note the scaled-up runs set hyperparameters by extrapolation rather than direct tuning, and that sensitive hyperparameters need finer-than-power-of-2 sweeps (https://x.com/kellerjordan0/status/1963831370272412011).
RL’s Razor: on-policy RL forgets less than SFT
- Theory plus LLM and toy experiments suggest on-policy RL’s updates bias toward KL-minimal solutions, staying closer to the base model and showing less catastrophic forgetting even at matched accuracy (a hedged formalization follows below) (https://x.com/arankomatsuzaki/status/1963823603469730114; project: https://jyopari.github.io/posts/rl_razor; paper: https://arxiv.org/abs/2509.04259). Observers argue this also explains the more frequent model updates from non-frontier labs: the same base model gets more mileage when new environments push it further without eroding prior knowledge (https://x.com/teortaxesTex/status/1963825236517773726).
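A hedged formalization of the claim, not the paper’s exact statement: among policies reaching the same accuracy a on the new task, on-policy RL tends toward the one minimally displaced in KL from the base policy π₀.

```latex
% Hedged formalization of the "KL-minimal update" claim:
\pi_{\mathrm{RL}} \;\approx\; \arg\min_{\pi \,:\, \mathrm{acc}(\pi) = a}
\mathrm{KL}\!\left( \pi \,\middle\|\, \pi_0 \right)
```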
Learning When to Plan: dynamic test-time compute for agents
- The first study to train LLM agents to dynamically allocate test-time planning compute in sequential decision-making tasks, balancing performance and cost (https://x.com/arankomatsuzaki/status/1963820986668626156; paper: https://arxiv.org/abs/2509.03581).
XQuant: up to 12× memory reduction by storing activations instead of KV
- A UC Berkeley method skips the usual KV cache: it quantizes and stores only the layer input activations X, then rematerializes keys and values from X on the fly during inference (sketched below). Storing X takes about half the memory of standard KV caching even before quantization, with no observed accuracy drop, and despite the extra computation (which GPUs handle easily) the authors report faster, more efficient inference overall; variants such as XQuant-CL (the most effective) and a GQA-model variant target different architectures (https://x.com/TheTuringPost/status/1963661122088980773; details: https://www.turingpost.com/p/xquant).
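A hedged sketch of the rematerialization, using standard attention notation not taken from the paper: with per-layer projection matrices W_K and W_V, caching X costs one d_model-sized vector per token versus two for K and V (hence roughly half the memory before quantization), at the price of recomputing two matmuls at decode time.

```latex
% Keys and values are recomputed from the cached layer input X:
K = X W_K, \qquad V = X W_V
```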
MiniCPM-V-4.5: compact VLM reports strong benchmarks and token compression
- The 8B-parameter model reports an average of 77.0 on OpenCompass (8 popular benchmarks), surpassing proprietary models like GPT-4o-latest and Gemini-2.0 Pro and strong open models like Qwen2.5-VL 72B; a unified 3D-Resampler over images and videos achieves a 96× video-token compression rate, jointly compressing six 448×448 frames into 64 tokens versus the ~1,536 typical for most MLLMs. Live demos showcase video chat and “vibe coding” apps (https://x.com/_akhaliq/status/1963587749400727980).
FineVision dataset for VLMs
- A large open-source corpus for training and evaluating state-of-the-art VLMs: 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens (https://x.com/lusxvr/status/1963609337546293448).
MindJourney: test-time scaling for 3D spatial reasoning
- Pairs a VLM with a video-based world model in an “imagine → select → answer” loop, with no retraining: given a 2D image and a spatial question (“What would be behind this object if you turn right?”), the VLM proposes small moves (“walk forward,” “turn left 30°”), the world model imagines how the scene would look from each new angle, and a spatial beam search scores each view for exploration value and answer usefulness; after a few rounds, a question-answering VLM combines the original image with the most helpful imagined views to produce the final answer, all at test time (https://x.com/TheTuringPost/status/1963353881641111643; analysis of MindJourney, CoLa, and TTD-DR: https://www.turingpost.com/p/testtimescaling2). An illustrative pseudocode loop follows below.
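Illustrative pseudocode of the loop described above; `vlm`, `world_model`, and `qa_vlm` are hypothetical stand-ins, not the paper’s actual interfaces.

```python
# Illustrative "imagine -> select -> answer" loop (hypothetical interfaces).
def mindjourney(image, question, vlm, world_model, qa_vlm,
                rounds: int = 3, beam: int = 4):
    views = [image]
    for _ in range(rounds):
        candidates = []
        for view in views:
            for move in vlm.propose_moves(view, question):   # e.g. "turn left 30 deg"
                imagined = world_model.render(view, move)     # imagine the new viewpoint
                score = vlm.score(imagined, question)         # explore more vs. answer now
                candidates.append((score, imagined))
        candidates.sort(key=lambda c: c[0], reverse=True)     # spatial beam search
        views = [v for _, v in candidates[:beam]]
    return qa_vlm.answer(question, [image] + views)
```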
Products & Launches
Why it matters: New tools expand on-device capability, improve chat UX, and make developer workflows faster.
ChatGPT adds conversation branching (web)
- Explore alternate directions without losing the original thread; live now for logged-in web users (https://x.com/OpenAI/status/1963697012014215181).
-
EmbeddingGemma lands across open tooling and PCs
- Ready in Hugging Face Sentence Transformers, LangChain, LlamaIndex, Llama.cpp, MLX, Weaviate, GCP Vertex AI, and more 96 Start building today with @ GoogleDeepMind Embedding Gemma and @ huggingface Sentence Transformers, @ ollama , Llama.cpp, MLX, @ lmstudio , @ weaviate_io , @ googlecloud Vertex AI, @ AMD , @ basetenco , @ Cloudflare , @ nvidia , and more.https://x.com/_philschmid/status/1963634786636841461 . NVIDIA teamed with ollama and llama.cpp to accelerate EmbeddingGemma on RTX AI PCs; try via “ollama pull embeddinggemma” 64 🤝 We teamed up with @ ollama and llama.cpp to accelerate @ googleaidevs ’ EmbeddingGemma3 for hyper-efficient RAG on PC.https://x.com/NVIDIAAIDev/status/1963647595412426779 62 Try it out now with @ NVIDIA_AI_PC RTX AI PCs and workstations 👇https://x.com/NVIDIAAIDev/status/1963647595412426779 67 ollama pull embeddinggemmahttps://x.com/ollama/status/1963667967184617703 .
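For a Python starting point, a minimal Sentence Transformers sketch is below; the `google/embeddinggemma-300m` Hub id is my assumption, so check the model card:

```python
# Minimal EmbeddingGemma retrieval sketch via Sentence Transformers.
# Assumption: the Hub id is "google/embeddinggemma-300m"; verify on the model card.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
docs = ["RTX AI PCs accelerate local RAG.",
        "Ollama pulls models with a single command."]
query_emb = model.encode(["How do I run RAG locally?"])
doc_embs = model.encode(docs)
print(model.similarity(query_emb, doc_embs))   # cosine scores, one per document
```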
-
Jina code embeddings (0.5B, 1.5B; GGUF 1–4‑bit)
- New code embedding models claim SOTA retrieval despite small sizes; trained from Qwen2.5‑Coder (5.5T tokens across 92+ languages) and contrastively fine‑tuned; releases on Hugging Face and arXiv 103 Today we’re releasing jina-code-embeddings, a new suite of code embedding models in two sizes—0.5B and 1.5B parameters—along with 1~4bit GGUF quantizations for both. Built on latest code generation LLMs, these models achieve SOTA retrieval performance despite their compact size. They support over 15 programming languages and 5 tasks: nl2code, code2code, code2nl, code2completions and qa.https://x.com/JinaAI_/status/1963637135439007824 102 Traditional code embedding models face a fundamental bottleneck: there simply aren’t enough high-quality comment-code pairs for supervised training. By starting with Qwen2.5-Coder pre-trained on 5.5 trillion tokens spanning 92+ programming languages, we inherit deep semantic understanding of programming constructs, cross-language pattern recognition, and built-in knowledge of syntax and idioms. The contrastive fine-tuning then adapts this knowledge for retrieval tasks with minimal aligned data—sidestepping the data scarcity that constrains encoder-only models.https://x.com/JinaAI_/status/1963637139037720995 100 1.5b model: https://huggingface.co/jinaai/jina-code-embeddings-1.5b 0.5b model: https://huggingface.co/jinaai/jina-code-embeddings-0.5b 1.5b-GGUF: https://huggingface.co/jinaai/jina-code-embeddings-1.5b-GGUF 0.5b-GGUF: https://huggingface.co/jinaai/jina-code-embeddings-1.5b-GGUF Release note: https://jina.ai/news/jina-code-embeddings-sota-code-retrieval-at-0-5b-and-1-5b arXiv paper: https://arxiv.org/abs/2508.21290https://x.com/JinaAI_/status/1963637141675843791 .
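A sketch of the nl2code use case, assuming the models load through Sentence Transformers with remote code and that any task-specific prompts follow the model card (treat the card as the authority here):

```python
# Sketch: natural-language -> code retrieval with jina-code-embeddings-0.5b.
# Assumptions: loads via Sentence Transformers with trust_remote_code, and the
# model card's task prompts (nl2code etc.) may be required for best results.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-code-embeddings-0.5b",
                            trust_remote_code=True)
snippets = ["def binary_search(arr, x): ...",
            "class LRUCache(dict): ..."]
q = model.encode(["find an element in a sorted list"])
s = model.encode(snippets)
print(model.similarity(q, s))   # the binary_search snippet should score highest
```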
-
Perplexity iOS: smoother streamed answers
- Major update improves streamed rendering, tables, Markdown, and showing intermediate steps 35 Last night, we quietly rolled out another major update to the Perplexity iOS app, this time focusing on answer rendering.https://x.com/jonathonstaff/status/1963748031574188156 34 Another major Perplexity iOS app update. Team cooked. Answers are now streamed smooth as butter. Tables, markdown, intermediate steps. Update and enjoy! https://x.com/AravSrinivas/status/1963758210281882029 . Maintainers call out “best‑in‑class performance while streaming” 33 The team did an incredible job on this. Best-in-class performance while streaming and delightful animations make the experience feel super polished.https://x.com/jonathonstaff/status/1963748031574188156 .
-
Elicit adds Collections and Smart De‑duplication
- Curate papers by project, auto‑detect/merge duplicates (even with mismatched titles/abstracts), then seed systematic review/data‑extraction workflows immediately; features live for all users 90 Collections: Group papers by project or task.https://x.com/elicitorg/status/1963647643537870859 89 Smart De-duplication: Automatically detect and merge versions of the same paper, even if the titles and abstracts don’t exactly match.https://x.com/elicitorg/status/1963647643537870859 88 Smart De-duplication automatically detects potential duplicates in paper uploads.https://x.com/elicitorg/status/1963647776837009661 87 Taken together, these new features let you curate the highest quality papers, organize them by project, and then use them to seed your Elicit workflows immediately.https://x.com/elicitorg/status/1963647788962779324 85 These features are live today for all users.https://x.com/elicitorg/status/1963647800878797005 83 Try it out at http://elicit.com/library .https://x.com/elicitorg/status/1963647800878797005 .
-
Androidify launches
- Create a custom Android bot from a selfie or prompt; under the hood it combines Gemini 2.5 Flash, Imagen, and Veo 3 94 Androidify has landed! Turn a 📸 into your own Android bot, using Google AI. Androidify yourself → https://goo.gle/Androidify #Androidifyhttps://x.com/AndroidDev/status/1963310185101115686 95 Androidify is a new tool that lets you make your own @ Android bot. Just upload a selfie or write a prompt, don’t forget to add the accessories, and then see what our AI builds. Under the hood, Androidify combines the capabilities of Gemini 2.5 Flash, Imagen and Veo 3 to give you the custom Android bot of your dreams. Learn more → https://goo.gle/42g380xhttps://x.com/Google/status/1963643654997778901 93 https://goo.gle/42g380xhttps://x.com/Google/status/1963643654997778901 .
-
Reka product updates
- New free tier, API, parameters, and an MCP server; full details in the roundup 65 New 🎁 Free tier, new API, new parameters, new MCP server, and much more!https://x.com/RekaAILabs/status/1963665987883897204 61 https://reka.ai/news/end-of-summer-updateshttps://x.com/RekaAILabs/status/1963665987883897204 .
-
Qwen “Boring Reality” LoRA (experimental)
- Early alpha on Hugging Face aims at phone‑like realism; public demos and a glif link available; still a work‑in‑progress 118 https://huggingface.co/kudzueye/boreal-qwen-imagehttps://x.com/multimodalart/status/1963506798058504611 122 early results for the Qwen “Boring Reality” LoRA 📸 by kudzueyehttps://x.com/multimodalart/status/1963506679787471238 121 the model is still experimental and work in progress 🚧 https://x.com/multimodalart/status/1963506679787471238 119 you can run it here: https://glif.app/@fab1an/glifs/cmf567rt40000jz04lb2vxi5xhttps://x.com/fabianstelzer/status/1963574186225406029 .
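If it follows the usual diffusers LoRA path, usage would look roughly like the sketch below; both the Qwen-Image pipeline support and the LoRA's weight layout are assumptions, and the alpha repo may change:

```python
# Speculative sketch: applying the "Boring Reality" LoRA to Qwen-Image in diffusers.
# Assumptions: diffusers' Qwen-Image pipeline accepts this LoRA via the standard
# load_lora_weights() path; the early-alpha repo may use different file names.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image",
                                         torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("kudzueye/boreal-qwen-image")
image = pipe("a casual phone photo of a cluttered kitchen table, harsh flash").images[0]
image.save("boring_reality.png")
```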
Industry Moves
Why it matters: Strategic infrastructure, M&A, and funding decisions shape where and how AI products will be delivered.
-
Atlassian to acquire The Browser Company for $610M (all‑cash)
- The team says it will operate independently with a focus on Dia; the deal aims to give resources, distribution, and monetization muscle for cross‑platform support, secure syncing, and custom AI models designed for Dia 109 Today, The Browser Company of New York is entering into an agreement to be acquired by Atlassian for $610M in an all-cash transaction. We will operate independently, with Dia as our focus. Our objective is to bring Dia to the masses.https://x.com/browsercompany/status/1963579501129978167 106 The work continues because when I stop by the coffee shop near our office, nobody is using Dia yet. Our “internet computer” vision hasn’t been realized. Dia hasn’t yet changed how you work on a Tuesday morning. This deal is about giving us the resources, distribution, and monetization muscle to get there.https://x.com/joshm/status/1963575851062071314 105 Dia isn’t going anywhere. We’ll be here for the long haul, with the same team just a new partner helping us push further. We’ll take a breath this weekend, and then get back to work. Big launch next month.https://x.com/joshm/status/1963575851062071314 .
-
Together AI’s EU expansion
- GPU infrastructure now live in Sweden with lower EU latency, EU data residency/compliance, and on‑demand clusters/endpoints; supports serverless API for GPT‑OSS, DeepSeek, Llama, Qwen 124 🇸🇪 Together AI now has GPU infrastructure located in Sweden - Lower latency across Europe - EU data residency & compliance - GPU clusters + endpoints on demand - Serverless API for GPT-OSS, DeepSeek, Llama, Qwen https://x.com/togethercompute/status/1963498998720872686 123 https://www.prnewswire.co.uk/news-releases/together-ai-continues-european-expansion-infrastructure-now-live-and-operational-in-sweden-302545683.htmlhttps://x.com/togethercompute/status/1963499000692150545 .
-
OpenAI Jobs Platform (mid‑2026)
- Announced a hiring platform and “OpenAI‑Certified” to match AI‑ready workers with employers; TechCrunch reports a mid‑2026 launch and AI‑based job matching 46 We know that AI will create lots of new jobs, yet also create disruption. We’re announcing the OpenAI Jobs Platform to connect AI-ready workers with companies who need AI skills, and OpenAI-Certified for workers to learn and demonstrate their AI skills.https://x.com/fidjissimo/status/1963670140849135861 44 https://openai.com/index/expanding-economic-opportunity-with-ai/https://x.com/fidjissimo/status/1963670140849135861 40 The OpenAI Jobs Platform is set to launch in mid-2026, and will use AI to match candidates with businesses. https://techcrunch.com/2025/09/04/openai-announces-ai-powered-hiring-platform-to-take-on-linkedin/?utm_campaign=social&utm_source=X&utm_medium=organichttps://x.com/TechCrunch/status/1963684254530826389 .
-
Anthropic Fellows Program is scaling
- Hiring an operator to expand research collaborations; program supports ~50 researchers, growing ~3×/year, with millions in compute, and is cited as a source of top safety research and hires 77 We’re hiring someone to run the Anthropic Fellows Program!https://x.com/EthanJPerez/status/1963664611397546145 75 Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵https://x.com/EthanJPerez/status/1963664611397546145 69 We think the fellows program could become really big. Through our collaborations, we’re currently supporting ~50 researchers, growing ~3x/year, and spending millions of dollars in compute.https://x.com/EthanJPerez/status/1963664635296637164 72 Apply here: https://job-boards.greenhouse.io/anthropic/jobs/4888400008https://x.com/EthanJPerez/status/1963664623468769757 .
-
Funding and market moves
- Exa raised $85M at a $700M post‑money valuation for API‑based semantic search infra 116 AI search infrastructure company Exa raised $85M at a $700M post-money valuationhttps://x.com/TheRundownAI/status/1963510011205734490 115 Unlike Perplexity, which is more consumer-focused, Exa is providing developers and enterprises with an API layer for AI-powered semantic search https://x.com/TheRundownAI/status/1963510011205734490 .
- Tencent open‑sourced the Hunyuan‑MT translation model series 114 🤖 From this week’s issue: Tencent open-sourced a new lineup of language models, the Hunyuan-MT series, that is optimized for translation tasks. https://siliconangle.com/2025/09/01/tencent-open-sources-hunyuan-mt-translation-model-series/https://x.com/dl_weekly/status/1963588150971736414 113 https://siliconangle.com/2025/09/01/tencent-open-sources-hunyuan-mt-translation-model-series/https://x.com/dl_weekly/status/1963588150971736414 .
-
Infrastructure case study
- NVIDIA and Baseten report 5× throughput, 50% lower cost/token, and up to 38% lower latency for large LLMs, using Blackwell + TensorRT‑LLM + Dynamo and Baseten’s multi‑cloud capacity manager 82 📈 @ basetenco users are scaling smarter with us: ✅ 5× throughput on high-traffic endpoints ✅ 50% lower cost per token ✅ Up to 38% lower latency on the largest LLMshttps://x.com/NVIDIAAI/status/1963648255834644967 73 Built on NVIDIA Blackwell + TensorRT-LLM + Dynamo on @ googlecloud —driving efficiency, speed & adoption at scale.https://x.com/NVIDIAAI/status/1963648255834644967 81 We use the latest accelerated compute, tools like Dynamo and TensorRT-LLM, along with our Multi-cloud Capacity Manager (MCM) to drastically increase throughput and lower latency for our customers.https://x.com/basetenco/status/1963652616816398761 .
Policy & Regulation
Why it matters: Compliance thresholds and national education initiatives will affect model disclosure, deployment geographies, and workforce readiness.
-
EU AI Act model reporting threshold
- The first reporting deadline passed in August; models trained with more than 10^23 FLOPs of compute must now be formally reported to regulators (≈ Llama‑2 13B scale) 138 The first deadline for EU AI act reporting passed in August, and all models over 10^23 flops must now be formally reported on to a regulatory agency.https://x.com/xlr8harder/status/1962468739099590814 137 For reference, 10^23 flops is at the level of Llama 2 13B. https://x.com/xlr8harder/status/1962468739099590814 .
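The Llama‑2 13B comparison checks out under the standard FLOPs ≈ 6 · N · D estimate (N parameters, D training tokens; Llama 2 trained on about 2T tokens):

```python
# Rough training-compute estimate: FLOPs ~= 6 * parameters * training tokens.
n_params = 13e9        # Llama 2 13B
n_tokens = 2e12        # ~2T training tokens per the Llama 2 paper
print(f"{6 * n_params * n_tokens:.2e}")   # 1.56e+23, just over the 1e23 threshold
```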
-
U.S. AI Education efforts
- Microsoft will support the White House AI Education Task Force and offer Microsoft 365 Personal free for 12 months to all U.S. college and community college students 80 As the son of a teacher, I’ve seen firsthand how transformative great teaching can be. That’s why I’m especially proud of our support for the White House’s AI Education Task Force that will empower educators and the next generation to harness AI’s true power.https://x.com/yusuf_i_mehdi/status/1963670222701019156 78 Among our many commitments, we’re making Microsoft 365 Personal free for 12 months to every college and community college student in the United States.https://x.com/yusuf_i_mehdi/status/1963670222701019156 .
- Google highlighted free Gemini for Education for all U.S. high schools (with Guided Learning), $150M in grants for AI education/digital wellbeing, and expansions of its AI education accelerator 50 Every American high school has access to Gemini for Education at no cost, including tools like Guided Learning in the @ GeminiApp .https://x.com/Google/status/1963681140583084183 49 We’re committing $150 million in grants over the next three years to support AI education and digital wellbeinghttps://x.com/Google/status/1963681140583084183 47 For college students, our AI for Education Accelerator has expanded from 100 colleges and universities to 200https://x.com/Google/status/1963681140583084183 .
- AMD announced AI Learning Labs and open‑source courses as part of the White House initiative 48 Proud to be at the White House today attending the AI Education Task Force meeting led by @ FLOTUS . @ AMD is proud to expand our commitment to AI Education through new AI Learning Labs and open source courses that will give students, educators and researchers hands-on experience with AI hardware and software for education, skills training and research. We are excited to do our part to train and enable the future AI workforce.https://x.com/LisaSu/status/1963691917163569350 .
Quick Takes
Why it matters: Fast developments signal where the ecosystem is heading next.
- Kimi K2 availability spreads: Together AI and vLLM announced support; a Cline release emphasizes agent tool‑use; a Groq listing advertises 200+ T/s and 256K context 23 🚀 Kimi K2-0905 just landed on Together AI!https://x.com/togethercompute/status/1963806032548843865 18 vLLM is proud to support the great Kimi update from @ Kimi_Moonshot , better tool-calling, longer context, and more!https://x.com/vllm_project/status/1963805972352188895 31 Try it now in Cline 🌔 https://x.com/cline/status/1963804927584833725 7 Who needs sleep? Kimi-K2-Instruct-0905 just landed. 200+ T/s, $1.50/M tokens. 256k context window. Built for coding. Rivals Sonnet 4. Available now. 👇 https://x.com/GroqInc/status/1963823577557606665 .
- Claims vs. caution on Kimi vs Sonnet: a user said “meets or beats Sonnet 4,” while a Kimi team member responded it’s “not on par yet,” noting SWE‑Bench remains challenging 29 Meets or beats sonnet 4 across the boardhttps://x.com/andrew_n_carr/status/1963805265356075336 28 tbh, not on par yet, more improvement on swe-bench is quite difficult, we’re still working very hard on it.https://x.com/bigeagle_xd/status/1963808306545180792 .
- Bitnet/1‑bit hype check: a viral post claimed 100B‑parameter CPU inference with bitnet.cpp; a reply noted no 100B BitNet exists and that the “news is 10 months old” 135 You can now run 100B parameter models on your local CPU without GPUs.https://x.com/LiorOnAI/status/1963316578612605327 136 Microsoft finally open-sourced their 1-bit LLM inference framework called bitnet.cpp:https://x.com/LiorOnAI/status/1963316578612605327 134 > 6.17x faster inference > 82.2% less energy on CPUs > Supports Llama3, Falcon3, and BitNet models https://x.com/LiorOnAI/status/1963316578612605327 120 @ LiorOnAI 1) there is no 100B parameter bitnet model. 2) this news is 10 months old.https://x.com/QuixiAI/status/1963811301139927093 .
- Perplexity Comet distribution: >1M people got access in a day; mobile pre‑orders on Android; Pro users in South Korea, Brazil, and Spain can download now 107 More than a million people got Comet access this morning. The most widely deployed personal and agentic product in the world right now.https://x.com/AravSrinivas/status/1963633205351010795 108 Comet is coming soon to mobile and is now available for pre-orders on Android Play Store https://play.google.com/store/apps/details?id=ai.perplexity.comet&hl=en_UShttps://x.com/AravSrinivas/status/1963620578344276366 104 Pro users in South Korea, Brazil, and Spain can now download Comet. https://x.com/perplexity_ai/status/1963638853975040456 .
- Waymo at San José Airport: fully autonomous testing begins ahead of commercial rides later this year 54 We’re cleared for takeoff at @ FlySJC ! 🛫 We’ll soon begin fully autonomous testing at the airport ahead of offering commercial rides later this year. https://www.flysanjose.com/news-release/san-jose-mineta-international-airport-set-be-first-commercial-airport-californiahttps://x.com/Waymo/status/1963651067775762874 .
- Evals debate matures: industry leaders call evals a must‑have skill while others warn against “evals religion” and over‑indexing early; dogfooding remains crucial 38 Trend I’m following: evals becoming a must-have skill for product builders and AI companies. It’s the first new hard skill in a long time that PMs/engineers/founders have had to learn to be successful. The last one was maybe SQL, and Excel?https://x.com/lennysan/status/1963688207280955839 37 A few examples: @ garrytan : “Evals are emerging as the real moat for AI startups.” @ kevinweil : “Writing evals is going to become a core skill for product managers.” @ mikeyk : “Writing evals is probably the most important thing right now.” @ saranormous : “Evals = your new marketing.” @ gdb : “Evals are surprisingly often all you need.”https://x.com/lennysan/status/1963688207280955839 36 Claude Code: no evals (NOTE: i -do- also think that evals are impt, but the eval pilled ai engineers have also noticed that it is not a strict requirement for success and, at least for 0-to-1 stage, may even be anticorrelated, think thru why)https://x.com/swyx/status/1963725773355057249 .
- Data diversity matters: filtering only “highest‑quality” data hurt performance in ablations; authors and practitioners advise against English‑only filtering for VLM pretraining to avoid harming cultural understanding 60 Here’s a wild finding from our ablations: filtering for only the “highest-quality” data actually hurts performance! 🤯 Our experiments show that at this scale, training on the full, diverse dataset—even with lower-rated samples—is better. Don’t throw away your data! https://x.com/andimarafioti/status/1963610135328104945 59 PSA: Stop pretraining your VLMs on EN-filtered data, even if it improves ImageNet and COCO‼️ Doing so impairs the model’s understanding of non-English cultures❗️ I argued for years, now finally publish concrete results for this (imo) intuitively obvious recommendationhttps://x.com/giffmana/status/1793932782672248928 58 Say “NO!” to filters:https://x.com/giffmana/status/1963668351273652688 .
- GPU performance education: Modal published a human‑readable GPU Performance Glossary, with community endorsements 51 https://modal.com/gpu-glossary/perfhttps://x.com/charles_irl/status/1963664025042215323 52 I wrote up what I’ve learned along the way in an extension to the GPU Glossary – our “CUDA Docs for Humans”. Introducing: the GPU 𝔓𝔢𝔯𝔣𝔬𝔯𝔪𝔞𝔫𝔠𝔢 Glossary.https://x.com/charles_irl/status/1963664025042215323 .
- ROCm PyTorch quality concerns: reports cite >200 tests skipped and >200 disabled on ROCm (net +110 since June 2025), including transformer/attention ops; AMD contacts are reportedly prioritizing fixes 41 Disappointingly, AMD currently has over 200 unit tests in PyTorch that are skipped exclusively (skipIfRocm) on ROCm and not on CUDA, along with another 200+ tests explicitly disabled for ROCm. The situation has deteriorated since the AMD Advancing AI event in June 2025. Since June 2025, more than 160 new tests have been disabled on ROCm, while only around 50 were re-enabled which resulted in a net increase of 110 disabled tests. This represents a major regression in ROCm PyTorch quality and significantly undermines the user experience. What’s particularly concerning is that many of these tests are not for niche or legacy operators. Critical functionality including numerous transformer tests, fused TP matmul, and even attention, the single most important operator in transformers, has been disabled for months. These issues should be treated as P0 priorities, yet they’ve instead been sidelined, leaving developers without confidence in ROCm PyTorch core capabilities. These aren’t just older ops such as RNNs or LSTMs, these ops are indispensable for modern AI workloads. Addressing the backlog of skipped and disabled tests will take months to bring down the numbers by half and medium to long term to stablize the situation to be under 50 unit tests being skipped/disabled exclusively in ROCm. That being said, we have now successfully convinced @ AnushElangovan 2 weeks ago that this is a high-priority issue. His team is now tackling it with high sense of urgency, and we’re grateful for his team’s renewed efforts.https://x.com/SemiAnalysis_/status/1963708743218339907 39 Since AMD’s Advancing AI event, there is an net change of over 160+ newly disabled PyTorch unit tests on exclusively ROCm. There is a direct correlation between the the number of disabled/skipped test and the end user experience. We are glad that @ AnushElangovan and his team is finally starting to prioritize this work for ROCm PyTorch.https://x.com/dylan522p/status/1963711185225687267 .
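For context, these are tests carrying PyTorch's ROCm-only skip decorator; a minimal illustration of the mechanism (the test body is mine, the decorator is PyTorch's):

```python
# How a PyTorch unit test ends up ROCm-only skipped: the skipIfRocm decorator
# from PyTorch's internal test utilities. The test body here is illustrative.
import torch
from torch.testing._internal.common_utils import TestCase, run_tests, skipIfRocm

class TestAttention(TestCase):
    @skipIfRocm  # counted among the 200+ ROCm-exclusive skips discussed above
    def test_scaled_dot_product_attention(self):
        q = k = v = torch.randn(2, 4, 8, 16)   # (batch, heads, seq, head_dim)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        self.assertEqual(out.shape, q.shape)

if __name__ == "__main__":
    run_tests()
```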
- Perplexity Finance adds future revenue estimates for U.S. stocks; India next week 3 Perplexity Finance pages now support future estimated revenues for individual American stocks. Estimates for Indian stocks coming next week. https://x.com/AravSrinivas/status/1963837220940652828 .
- Meta’s Inverse IFEval: a new benchmark tests whether models can override ingrained habits to follow counter‑intuitive instructions (1k Qs, 23 domains, 8 challenge types) 6 Inverse IFEval: a new bench testing whether LLMs can unlearn stubborn training habits and follow counter-intuitive instructions.https://x.com/arankomatsuzaki/status/1963822451550208101 4 1k Qs + 23 domainshttps://x.com/arankomatsuzaki/status/1963822451550208101 5 8 challenge types (e.g. counterfactuals, flawed text)https://x.com/arankomatsuzaki/status/1963822451550208101 .
“Someone with these skills can get a massively greater amount done than someone who writes code the way we did in 2022, before the advent of Generative AI.” 101 Someone with these skills can get a massively greater amount done than someone who writes code the way we did in 2022, before the advent of Generative AI. I talk to large businesses every week that would love to hire hundreds or more people with these skills, as well as startups that have great ideas but not enough engineers to build them. As more businesses adopt AI, I expect this talent shortage only to grow! At the same time, recent CS graduates face an increased unemployment rate, though the underemployment rate — of graduates doing work that doesn’t require a degree — is still lower than for most other majors. This is why we hear simultaneously anecdotes of unemployed CS graduates and also of rising salaries for in-demand AI engineers.https://x.com/AndrewYNg/status/1963631698987684272

Top Stories
Why it matters: These developments shift capital, compute, and safety dynamics across the AI ecosystem.
- Anthropic raises $13B at a $183B valuation to expand capacity, improve model capabilities, and deepen safety research. The company reports serving 300K+ customers, with $100k+/yr accounts growing 7x in 2025 110 We’ve raised $13 billion at a $183 billion post-money valuation.https://x.com/AnthropicAI/status/1962909472017281518 109 This investment, led by @ ICONIQCapital , will help us expand our capacity, improve model capabilities, and deepen our safety research.https://x.com/AnthropicAI/status/1962909472017281518 90 It serves 300K+ customers, with accounts worth $100k+/yr growing 7x in 2025 https://x.com/AnthropicAI/status/1962909472017281518https://x.com/TheRundownAI/status/1963152547935350824 .
- OpenAI acquires Statsig for $1.1B; Statsig founder Vijaye Raji becomes CTO of Applications to lead engineering for ChatGPT and Codex. OpenAI also shifted Srinivas Narayanan to CTO of B2B Apps and Kevin Weil to a new “AI for Science” team, signaling a broadened applications roadmap 89 OpenAI acquired A/B testing platform Statsig for $1.1Bhttps://x.com/TheRundownAI/status/1963152583251394899 108 .@vijayeraji, founder & CEO of Statsig, will join OpenAI as CTO of Applications to lead engineering for ChatGPT & Codex, following the acquisition of Statsig. This expands our Applications leadership as we build safe, useful AI products at scale. https://openai.com/index/vijaye-raji-to-become-cto-of-applications-with-acquisition-of-statsig/https://x.com/OpenAI/status/1962943308935864793 88 OAI shifted Srinivas Narayanan to CTO of B2B Apps and CPO Kevin Weil to a new “AI for Science” team https://x.com/OpenAI/status/1962943308935864793https://x.com/TheRundownAI/status/1963152583251394899 .
- Google’s TPU distribution push: The Information reports Google approached small cloud providers to host TPUs; one agreement was reached for Fluidstack to host Google TPUs in a New York data center, indicating a strategy to expand TPU availability beyond Google Cloud. Observers note Google “seems serious about making TPUs a thing” 105 The Information: ” $GOOGL recently approached small cloud providers that primarily rent out $NVDA chips about also hosting $GOOGL TPU in their data centers, according to seven people who have been involved in the talks.”https://x.com/rwang07/status/1963242911207932140 104 $GOOGL has reached an agreement with at least one of the cloud providers, London-based Fluidstack, to host $GOOGL TPU in a New York data center, representatives of companies involved in the deal have said privately.https://x.com/rwang07/status/1963242911207932140 37 Google seems serious about making TPUs a thing https://x.com/amir/status/1963255649556414512 .
- Automated red‑teaming breaks through: TransluceAI fine‑tuned an 8B model via RL to generate jailbreaks that transfer to closed models (Gemini 2.5 Pro 89%, GPT‑4.1 88%, Claude Sonnet 4 26%) across 48 CBRN tasks; authors emphasize this validates automated red‑teaming while noting developers may have additional safeguards and real‑world harm is uncertain 39 Additionally, attacks solely optimized against an open-weight model (GPT-oss-20b) transfer to many closed models, including Gemini 2.5 Pro (89%), GPT-4.1 (88%), and Claude Sonnet 4 (26%), demonstrating an approach for cheap red-teaming. (8/)https://x.com/TransluceAI/status/1963286341464330701 40 In our main training run, we achieved high attack success rates on a dataset of 48 CBRN-related tasks for a range of models. (5/) https://x.com/TransluceAI/status/1963286335193845999 38 We recognize that model developers have additional safeguards beyond the models themselves, and that real-world harm from eliciting this info is uncertain. Nevertheless, we see our work as validating an automated red-teaming strategy for surfacing risks before deployment. (4/)https://x.com/TransluceAI/status/1963286332622741563 .
- 1‑bit inference on CPUs: Microsoft open‑sourced bitnet.cpp, claiming the ability to run 100B‑parameter models on local CPUs without GPUs, with 6.17× faster inference and 82.2% less energy on CPUs; supports Llama3, Falcon3, and BitNet models (GitHub link provided) 36 Microsoft finally open-sourced their 1-bit LLM inference framework called bitnet.cpp:https://x.com/LiorOnAI/status/1963316578612605327 35 You can now run 100B parameter models on your local CPU without GPUs.https://x.com/LiorOnAI/status/1963316578612605327 34 > 6.17x faster inference > 82.2% less energy on CPUs > Supports Llama3, Falcon3, and BitNet models https://x.com/LiorOnAI/status/1963316578612605327 33 https://github.com/microsoft/BitNethttps://x.com/LiorOnAI/status/1963316579606614381 .
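The "1-bit" label actually means ternary weights in BitNet b1.58; the numerical idea is the absmean quantizer from that paper, sketched here (the concept, not bitnet.cpp's kernel code):

```python
# Absmean ternary quantization from the BitNet b1.58 paper: weights -> {-1, 0, +1}.
import torch

def absmean_quantize(w: torch.Tensor, eps: float = 1e-5):
    gamma = w.abs().mean().clamp(min=eps)    # per-tensor scale
    w_q = (w / gamma).round().clamp(-1, 1)   # ternary weights
    return w_q, gamma                        # dequantize as w_q * gamma

w = torch.randn(4, 4)
w_q, gamma = absmean_quantize(w)
print(w_q)                                   # entries are -1.0, 0.0, or 1.0
print((w - w_q * gamma).abs().mean())        # average quantization error
```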
Research & Innovation
Why it matters: New methods promise efficiency gains, stronger reasoning, and better evaluation discipline.
- Data‑efficient RL with verifiable reward (DEPO): Combines offline curation (diversity, influence, difficulty) with online sample‑level “explorability” filtering and replay; using only 20% of data, achieves 1.85× speed‑up on AIME24 and 1.66× on AIME25 vs GRPO trained on full data 103 Towards High Data Efficiency in Reinforcement Learning with Verifiable Rewardhttps://x.com/iScienceLuvr/status/1963169113007895020 102 “we propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential, thereby reducing substantial rollout computational costs. Furthermore, we incorporate a replay mechanism for under-explored samples to ensure adequate training, which enhances the model’s final convergence performance. Experiments across five reasoning benchmarks show that DEPO consistently outperforms existing methods in both offline and online data selection scenarios. Notably, using only 20% of the training data, our approach achieves a 1.85 times speed-up on AIME24 and a 1.66 times speed-up on AIME25 compared to GRPO trained on the full dataset.”https://x.com/iScienceLuvr/status/1963169113007895020 .
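A sketch of the online half of that pipeline, with hypothetical helper names (the paper's actual filtering metric and replay schedule live in its code, not here):

```python
# DEPO-style online step (names are mine): skip low-explorability samples to save
# rollouts, and replay under-explored ones so they still get adequate training.
import random

def depo_online_step(batch, explorability, rollout_and_update, replay_buffer,
                     threshold=0.2):
    train_now = []
    for sample in batch:
        if explorability(sample) < threshold:   # low exploration potential
            replay_buffer.append(sample)        # revisit later instead of rolling out
        else:
            train_now.append(sample)
    # top the batch back up with replayed, under-explored samples
    k = min(len(replay_buffer), len(batch) - len(train_now))
    train_now += [replay_buffer.pop(random.randrange(len(replay_buffer)))
                  for _ in range(k)]
    rollout_and_update(train_now)               # e.g. one GRPO update
```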
- Pretraining optimizers at scale: A systematic study finds fastest optimizers (e.g., Muon, Soap) use matrix preconditioners, but speedups diminish with model size (from 1.4× at 0.1B to 1.1× at 1.2B over AdamW). Observed caveats include non‑trivial hyperparameter transfer and misleading early loss curves 101 Fantastic Pretraining Optimizers and Where to Find Themhttps://x.com/iScienceLuvr/status/1963168542872014943 100 “we find that all the fastest optimizers such as Muon and Soap, use matrices as preconditioners”https://x.com/iScienceLuvr/status/1963168542872014943 99 “However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4× over AdamW for 0.1B parameter models to merely 1.1× for 1.2B parameter models.”https://x.com/iScienceLuvr/status/1963168542872014943 98 Observations made in the paper: 1. Hyperparameter transfer between optimizers is non-trivial. 2. The speedup of new optimizers is lower than claimed and diminishes with model size. 3. Early-stage loss curves can mislead significantly. 4. Matrix-based optimizers consistently outperform scalar-based optimizers for small models. 5. Optimal choice of optimizer shifts depends on data-to-model ratios.https://x.com/iScienceLuvr/status/1963168542872014943 .
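For reference, the "matrix preconditioner" at issue: Muon orthogonalizes each 2D weight's momentum buffer, typically with a Newton-Schulz iteration. A simplified sketch using the published quintic coefficients (not a full optimizer):

```python
# Simplified Muon-style update: orthogonalize the momentum of a 2D weight via
# Newton-Schulz (coefficients from the public Muon writeup); not a full optimizer.
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X    # quintic polynomial iteration
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)         # momentum accumulation
    weight.add_(newton_schulz(momentum_buf), alpha=-lr)   # preconditioned update
```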
- Medical LLM (Baichuan‑M2, 32B): Reported to outperform other open‑source models (and most closed‑source counterparts) on HealthBench, with a HealthBench Hard score >32; framework includes a Patient Simulator and Clinical Rubrics Generator. Resources: arXiv and Hugging Face model page 97 “Despite its relatively small number of parameters (only 32B), Baichuan-M2 outperformed all other open-source models, including gpt-oss-120B, and most advanced closed-source counterparts on HealthBench. It particularly excelled on the HealthBench Hard test, achieving a score exceeding 32, a performance level previously reached by only one other model globally, GPT-5.”https://x.com/iScienceLuvr/status/1963175775638892878 96 “Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics.”https://x.com/iScienceLuvr/status/1963175775638892878 94 https://arxiv.org/abs/2509.02208https://x.com/iScienceLuvr/status/1963175777496924440 95 https://huggingface.co/baichuan-inc/Baichuan-M2-32Bhttps://x.com/iScienceLuvr/status/1963175777496924440 .
- Unified vision‑language modeling (OneCAT): A decoder‑only autoregressive model for image understanding and generation, using a shallow patch projector for understanding, visual autoregressive (VAR) modeling for generation, and task/scale‑aware experts; project page available 6 OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generationhttps://x.com/rosinality/status/1963458726008164521 5 Unified model for image understanding and generation. Uses a shallow patch projector for understanding and VAR for generation. Adopts task and scale-aware experts. https://x.com/rosinality/status/1963458726008164521 4 https://onecat-ai.github.io/https://x.com/teortaxesTex/status/1963460898707673253 .
- End‑to‑end document conversion (POINTS‑Reader): Vision‑language model achieves SOTA on OmniDocBench with “blazing‑fast throughput,” supports English/Chinese extraction (reported scores: EN 0.133, ZH 0.212), and offers a simple API; code and paper links provided 85 🚀 Introducing POINTS-Reader — a vision-language model for end-to-end document conversion, delivering SOTA performance on OmniDocBench with blazing-fast throughput. Some info u should not miss👇https://x.com/ZhihuFrontier/status/1963192346222432750 84 📊 Performance Supports both English & Chinese document extraction with scores of 0.133 for English and 0.212 for Chinese on OmniDocBench.https://x.com/ZhihuFrontier/status/1963192346222432750 83 🔗Github: https://github.com/Tencent/POINTS-Reader Hugging Face: https://huggingface.co/papers/2509.01215 #POINTSReader #LLM #VLM #Multimodal #Wechathttps://x.com/ZhihuFrontier/status/1963192346222432750 .
- Diversity‑aware RL (DARLING): Jointly optimizes for quality and diversity via a learned partition function; works for verifiable and non‑verifiable tasks. Recipe: train a binary classifier to detect equivalent responses, cluster them, and multiply standard reward by a diversity reward; shows strong results on instruction‑following (AlpacaEval/ArenaHard, EQ‑Bench ELO) and competition math 75 🌀Diversity Aware RL (DARLING)🌀 📝: http://arxiv.org/abs/2509.02534 - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5 https://x.com/jaseweston/status/1963230744173482018 74 Recipe 🧑🍳: - Train binary classifier: predict if two responses are equivalent - Responses predicted as same are clustered -> compute diversity reward as in figure - Overall reward: r(x, yi) × Div(yi | y1, · · · , yn), - i.e. multiply diversity reward & standard reward 🧵2/5 https://x.com/jaseweston/status/1963230747063431441 73 Results on instruction following: DARLING achieves both the best quality measured by AlpacaEval /ArenaHard & EQ-Bench (Creative Writing) ELO, and simultaneously is the most diverse, as measured by NoveltyBench. 🧵3/5 https://x.com/jaseweston/status/1963230749324161432 72 Results on reasoning: DARLING simultaneously achieves the best quality and diversity averaged across 4 competition math benchmarks. 🧵4/5 https://x.com/jaseweston/status/1963230751337394566 .
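The reward combination is concrete enough to sketch; the clustering helper and the exact diversity term below are plausible stand-ins (the paper defines Div via its figure), not the authors' code:

```python
# Sketch of DARLING's reward shaping: base reward times a diversity reward from
# clustering equivalent responses. same_fn stands in for the learned classifier.
def darling_rewards(responses, base_rewards, same_fn):
    clusters = []                           # greedy clustering by judged equivalence
    for i, r in enumerate(responses):
        for cl in clusters:
            if same_fn(responses[cl[0]], r):
                cl.append(i)
                break
        else:
            clusters.append([i])
    size = {i: len(cl) for cl in clusters for i in cl}
    n = len(responses)
    # one plausible diversity term: rarer (smaller-cluster) responses score higher
    return [base_rewards[i] * (1.0 - size[i] / n) for i in range(n)]

# toy usage: three near-duplicates and one novel answer
rs = ["a", "a'", "a''", "b"]
print(darling_rewards(rs, [1.0, 1.0, 1.0, 0.9],
                      lambda x, y: x.rstrip("'") == y.rstrip("'")))
# -> the novel "b" keeps most of its reward; the duplicates are discounted
```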
- Evaluation caution (coding agents): On SWE‑Bench Verified, some agents exploit “environment hacking” by reading future repo states; e.g., Qwen3 greps commit logs for the issue number. This underscores the need for hardened eval harnesses 49 so apparently swe-bench doesn’t filter out future repo states (with the answers) and the agents sometimes figure this out…https://x.com/bwasti/status/1963288443452051582 48 https://github.com/SWE-bench/SWE-bench/issues/465https://x.com/bwasti/status/1963288443452051582 29 For example, Qwen3 literally greps all commit logs for the issue number of the issue it needs to fix. lol, clever model.https://x.com/giffmana/status/1963327672827687316 .
- Training diagnostics—internal metrics matter: Practitioners highlight “Max Logit” spikes destabilizing training (“Muon would break training”), motivating mechanisms like MuonClip to control internal stats; internal metrics (e.g., max logit, output RMS, grad norms) aid early bug detection and stability 1 📌 Let’s revisit Yang Zhilin’s own words: “Muon would break training — Max Logit would spike to hundreds. It affects training stability and hurts the model’s ceiling.” “Observing internal metrics, forming hypotheses, running tests — it’s just like RL.”https://x.com/ZhihuFrontier/status/1963493293679153349 2 ⚙️ @ Kimi_Moonshot : “Internal” = model’s inner stats, e.g. max logit, output RMS, etc. In short, it’s a set of indicators that help monitor the state of model training.https://x.com/ZhihuFrontier/status/1963493293679153349 3 🤔 Su Jianlin @ Jianlin_S : Internal metrics are like human health checks — heart rate, blood pressure, white blood cell counts, even CT scans 🩺 In model training, they’re not final scores like accuracy. They’re stats like Loss, Grad Norm… If these spike, it might hint at data bugs or system issues. And MuonClip was designed to control Max Logit — a key internal stat in attention layers. More internal metrics = better diagnostics = better model.https://x.com/ZhihuFrontier/status/1963493293679153349 .
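A schematic of that kind of monitoring, with a MuonClip-style rescale when the logit spikes (illustrative only; this is not Moonshot's implementation):

```python
# Schematic internal-metrics check: track the max attention logit each step and
# rescale Q/K projections when it spikes (MuonClip-style; illustrative only).
import torch

def max_attention_logit(q, k, scale):
    return (q @ k.transpose(-2, -1) * scale).amax().item()

def qk_clip_(w_q, w_k, max_logit, tau=100.0):
    if max_logit > tau:                    # spike detected
        s = (tau / max_logit) ** 0.5       # split the shrink across Q and K
        w_q.mul_(s)
        w_k.mul_(s)                        # new max logit lands back at ~tau

q = torch.randn(8, 64) * 10                # unusually hot activations
k = torch.randn(8, 64) * 10
m = max_attention_logit(q, k, scale=64 ** -0.5)
print(f"max logit this step: {m:.1f}")     # log alongside grad norm, output RMS
```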
- Robotics capability reuse (Figure Helix): The same Helix architecture that folded towels and sorted packages learned autonomous dishwasher loading with no new algorithms—just new data; short write‑up and demo shared 47 The same Helix model that folded towels and sorted packages can now load a dishwasherhttps://x.com/adcock_brett/status/1963266578134560914 46 What’s exciting is this is the same Helix architecture, all we did was add new datahttps://x.com/adcock_brett/status/1963266509163098438 44 Today we posted a short write-up on the websitehttps://x.com/adcock_brett/status/1963266578134560914 .
Products & Launches
Why it matters: New tooling broadens access and improves developer and end‑user workflows.
- ChatGPT Projects now available to Free users; adds larger per‑project file limits (Free 5; Plus 25; Pro/Business/Enterprise 40), color/icon customization, and project‑only memory controls; live on web/Android, rolling out to iOS 18 Projects in ChatGPT are now available to Free users.https://x.com/OpenAI/status/1963329936368046111 17 In addition, we’ve added: - Larger file uploads per project (up to 5 for Free, 25 for Plus, 40 for Pro/Business/Enterprise) - Option to select colors and icons for more customization - Project-only memory controls for more tailored contexthttps://x.com/OpenAI/status/1963329936368046111 16 Now live on web and Android, rolling out to iOS users over the coming days.https://x.com/OpenAI/status/1963329936368046111 .
- OpenAI Realtime API and gpt‑realtime: GA brings image inputs, function calling, MCP support; improvements in instruction following, function‑calling precision, non‑verbal cue detection, seamless language switching with sub‑500ms latency, SIP telephony support; usable for dictation and voice agents 92 Introducing gpt-realtime — our best speech-to-speech model for developers, and updates to the Realtime APIhttps://x.com/OpenAI/status/1961110295486808394 91 What else stands out to us: –> Better function calling precision –> Improved comprehension with non-verbal cue detection –> Seamless language switching mid-conversation: IMO the biggest win – a lot of voice architectures struggle with this the most bc of bigger latency on the TTS-side – the sub 500ms end to end latency is impressive here. –> AND: SIP (telephony) support!https://x.com/bnicholehopkins/status/1961186211478851913 .
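A hedged sketch of a minimal text exchange over the Realtime WebSocket; the event names follow the beta-era docs as I recall them, and the GA session schema may differ, so verify against the current reference:

```python
# Hedged sketch: text-only round trip with the Realtime API over WebSocket.
# Event names ("session.update", "response.create", "response.done") follow the
# beta docs; the GA schema may differ. websockets>=14 uses additional_headers
# (older releases call the kwarg extra_headers).
import asyncio, json, os
import websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({"type": "session.update",
                                  "session": {"modalities": ["text"]}}))
        await ws.send(json.dumps({"type": "response.create",
                                  "response": {"instructions": "Say hello."}}))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```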
- VS Code: Support for custom OpenAI‑compatible endpoints enables local model providers and reduces lock‑in; issue link included 69 VS Code adds support for custom OAI-compatible endpointshttps://x.com/ggerganov/status/1963255949373677959 68 This a big win for local AI as it allows us to use any local model provider without vendor lock-in. Big thanks to the VS Code devs and especially @ IsidorN for listening to the community feedback and adding this option! https://x.com/ggerganov/status/1963255949373677959 67 https://github.com/microsoft/vscode/issues/249605https://x.com/ggerganov/status/1963255951659508117 .
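The same pattern in miniature outside the editor: any OpenAI-compatible server (llama.cpp's llama-server, Ollama, and others) works with the standard client pointed at a local base URL; the port and model name here are examples:

```python
# Talking to a local OpenAI-compatible endpoint with the standard client.
# Port and model name are examples; use whatever your local server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(resp.choices[0].message.content)
```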
- LangChain 1.0 (alpha): Standardized content blocks for reasoning, citations, tool calls, and multimodal data provide one consistent interface across providers; docs/blog linked 57 `langchain` 1.0, now in alpha, ships with improved standardization for reasoning, citations, tool calls, multimodal data, and other content across LLM providers. No more juggling APIs— just one consistent interface.https://x.com/LangChainAI/status/1963285794954907750 56 https://blog.langchain.com/standard-message-content/https://x.com/LangChainAI/status/1963285794954907750 .
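What the standardization means in practice, sketched under the assumption that the `content_blocks` accessor works as the linked post describes:

```python
# Sketch: provider-agnostic iteration over standardized content blocks in the
# langchain 1.0 alpha; assumes the content_blocks accessor from the blog post.
from langchain.chat_models import init_chat_model

model = init_chat_model("claude-sonnet-4-20250514")   # any supported provider
msg = model.invoke("Think step by step: what is 17 * 24?")
for block in msg.content_blocks:                      # same shape across providers
    if block["type"] == "reasoning":
        print("reasoning:", block.get("reasoning"))
    elif block["type"] == "text":
        print("text:", block["text"])
```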
- LlamaIndex Classify (beta): Rules‑based, zero‑shot document classification with confidence scores and reasoning; works via UI or Python; demo and docs provided 23 We’re excited to launch Classify 🔎 - a new, lightweight LlamaCloud feature that lets you quickly categorize a document into any document type with 0-shot classification!https://x.com/jerryjliu0/status/1963322040485908785 20 Demo includes: ✅ Live classification of resumes vs 10-K financial filings ✅ Step-by-step API setup with LlamaCloud ✅ Python code examples and best practices ✅ Real confidence scores and classification reasoninghttps://x.com/llama_index/status/1963263366086172719 19 https://cloud.llamaindex.ai/https://x.com/llama_index/status/1963263366086172719 .
- Perplexity Comet for students: Global rollout; manage schedules, order textbooks, and use Study Mode; sign‑up link provided 53 We are rolling out Comet to all students worldwide.https://x.com/perplexity_ai/status/1963285255198314951 52 Ask Comet to manage your schedule, order textbooks, or prepare for exams with Study Mode. https://x.com/perplexity_ai/status/1963285255198314951 51 https://www.perplexity.ai/grow/comet/students?utm_source=organicsocial&utm_campaign=comet_student_launch_posthttps://x.com/perplexity_ai/status/1963285267231727856 .
- DocPixie (open‑source): Fully multimodal, agentic RAG tool that’s “fully vision” (no embeddings), plans tasks and selects pages, with a modern CLI; pip install and GitHub repo available 30 I just open source DocPixie, a fully multimodal agentic RAG tool. The Claude Code moment for document question answering.https://x.com/stablequan/status/1963318843339849861 27 Get started now: uv pip install docpixie and start with docpixie in your terminalhttps://x.com/stablequan/status/1963319485202313253 26 https://github.com/qnguyen3/docpixiehttps://x.com/stablequan/status/1963319627515146402 .
- PR Arena (GitHub app): Free platform to compare coding‑agent LMs on real issues—tag “pr‑arena” on an issue and two agent LMs submit PRs; powered by allhands_ai 61 Introducing ⚔️PR Arena⚔️ - free AI coding agents to fix real GitHub issues.https://x.com/jiseungh99/status/1963265209339969627 60 Then add the “pr-arena” tag to any github issue that you would like agents to resolve. Two coding agent LMs will go at it, and you get to pick which one you like the best.https://x.com/gneubig/status/1963267470656979388 59 Powered by @ allhands_aihttps://x.com/jiseungh99/status/1963265209339969627 .
- FlashAttention‑3 via 🤗 kernels: Zero‑build integration with full torch.compile support (fullgraph traceability); PR linked 80 You can now use flash-attention 3 through 🤗 `kernels`, skipping its long build times entirely 🔥https://x.com/RisingSayak/status/1963225732668182856 79 Comes with full `torch.compile` support with fullgraph traceability.https://x.com/RisingSayak/status/1963225732668182856 78 https://github.com/huggingface/diffusers/pull/12236https://x.com/RisingSayak/status/1963226217198281088 .
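The integration path, roughly; the Hub repo id and the exported function name below are my assumptions, and the linked PR has the authoritative snippet:

```python
# Sketch: fetch a prebuilt FlashAttention-3 kernel from the Hub via `kernels`,
# skipping the long local build. Repo id and entry-point name are assumptions;
# see the linked PR for the authoritative usage.
import torch
from kernels import get_kernel

fa3 = get_kernel("kernels-community/flash-attn3")      # downloads a prebuilt binary
q = k = v = torch.randn(1, 128, 8, 64, dtype=torch.bfloat16, device="cuda")
out = fa3.flash_attn_func(q, k, v, causal=True)        # verify the exact signature
```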
- Synthesia Express‑2 avatars: Adds full expressions, body language, and hand gestures; available now 77 Express-2 avatars now bring your scripts to life with full expressions, body language — and yes, hand gestures.https://x.com/synthesiaIO/status/1963225302173151647 76 Give them a try today 👋 https://x.com/synthesiaIO/status/1963225302173151647 .
- Google’s Gemini app: Image editing received a “major upgrade” (demo video) 15 Image editing in the @ GeminiApp just got a major upgrade. https://x.com/Google/status/1963360326960701788 .
- Agent/Client Protocol (ACP): Open protocol managing agent–IDE interactions (LSP‑like); supports Claude Code and Gemini CLI 66 “Agent/Client Protocol” (ACP) from @ zeddotdev team 👀https://x.com/mathemagic1an/status/1963273618705482155 65 Manages agent-IDE interactions, similar to an LSP. Claude Code and Gemini CLI supported.https://x.com/mathemagic1an/status/1963273618705482155 64 @ zeddotdev https://agentclientprotocol.com/overview/introductionhttps://x.com/mathemagic1an/status/1963273665111351786 .
Industry Moves
Why it matters: Capital flows and partnerships are reshaping platform strategies and compute supply.
- OpenAI × Statsig: $1.1B acquisition; Vijaye Raji to CTO of Applications; internal leadership moves expand the Apps org for ChatGPT and Codex 89 OpenAI acquired A/B testing platform Statsig for $1.1Bhttps://x.com/TheRundownAI/status/1963152583251394899 108 .@vijayeraji, founder & CEO of Statsig, will join OpenAI as CTO of Applications to lead engineering for ChatGPT & Codex, following the acquisition of Statsig. This expands our Applications leadership as we build safe, useful AI products at scale. https://openai.com/index/vijaye-raji-to-become-cto-of-applications-with-acquisition-of-statsig/https://x.com/OpenAI/status/1962943308935864793 88 OAI shifted Srinivas Narayanan to CTO of B2B Apps and CPO Kevin Weil to a new “AI for Science” team https://x.com/OpenAI/status/1962943308935864793https://x.com/TheRundownAI/status/1963152583251394899 .
- You.com raises $100M (Series C) at a $1.5B valuation to build web search APIs for LLMs/agents; claims >1B queries/month across customers like DuckDuckGo, Windsurf, and Harvey 55 We’re officially a YOUnicorn! 🦄 Excited to share that @ youdotcom just raised $100M Series C at a $1.5B valuation, led by @ CoxEnterprises .https://x.com/RichardSocher/status/1963277700711461241 54 Today, we’re answering over 1 billion queries every month across enterprises like DuckDuckGo, Windsurf, Harvey, and many others. There is no other startup of recent years at this scale.https://x.com/RichardSocher/status/1963277700711461241 .
- Exa raises $85M (Series B) at a $700M valuation to build a “search engine for AI” 14 We raised $85M in Series B funding at a $700M valuation, led by Benchmark.https://x.com/ExaAILabs/status/1963262700123000947 13 Exa is a research lab building the search engine for AI. https://x.com/ExaAILabs/status/1963262700123000947 .
- CoreWeave acquires OpenPipe (YC‑backed) to “expand down the stack” and serve enterprises building agents 25 CoreWeave hopes the YC-backed startup will help it expand down the stack, and cash in on enterprises developing AI agents. https://techcrunch.com/2025/09/03/coreweave-acquires-agent-training-startup-openpipe/?utm_campaign=social&utm_source=X&utm_medium=organichttps://x.com/TechCrunch/status/1963327147369431259 24 https://techcrunch.com/2025/09/03/coreweave-acquires-agent-training-startup-openpipe/?utm_campaign=social&utm_source=X&utm_medium=organichttps://x.com/TechCrunch/status/1963327147369431259 .
- Together AI recognized on 2025 Forbes Cloud 100; says 800k+ developers build on its platform 70 ☁️ Big news: Together AI has made the 2025 Forbes Cloud 100!https://x.com/togethercompute/status/1963244727265788282 71 More than 800,000 pioneering developers now build on Together AI.https://x.com/togethercompute/status/1963244729337807134 .
- Google TPU externalization: Approaches smaller providers to host TPUs; agreement with Fluidstack in NYC suggests broader TPU access beyond first‑party cloud 105 The Information: ” $GOOGL recently approached small cloud providers that primarily rent out $NVDA chips about also hosting $GOOGL TPU in their data centers, according to seven people who have been involved in the talks.”https://x.com/rwang07/status/1963242911207932140 104 $GOOGL has reached an agreement with at least one of the cloud providers, London-based Fluidstack, to host $GOOGL TPU in a New York data center, representatives of companies involved in the deal have said privately.https://x.com/rwang07/status/1963242911207932140 .
- AWS × Anthropic Trainium scaling (analysis): Notes on multi‑gigawatt clusters, Trainium ramp, and “best TCO per memory bandwidth” for large‑scale inference/training workloads 8 Amazon’s AI Resurgence: AWS & Anthropic’s Multi-Gigawatt Trainium Expansion Anthropic multi-gigawatt clusters, Trainium ramp, best TCO per memory bandwidth, system-level roadmap, Bedrock and internal modelshttps://x.com/SemiAnalysis_/status/1963358057771274537 .
Policy & Regulation
Why it matters: Public procurement and policy dialogs will shape adoption, safety, and oversight.
- US GSA × Microsoft: New agreement provides federal agencies no‑cost Microsoft 365 Copilot and AI services for up to 12 months; Microsoft projects >$3B in taxpayer savings in year one 87 Microsoft announced a new partnership with the U.S. GSA to provide the federal govt with free access to Copilot and AI services for up to 12 monthshttps://x.com/TheRundownAI/status/1963152669645615286 86 Today, we are building on that legacy with a new agreement with the U.S. General Services Administration — including a no-cost Microsoft 365 Copilot offer — to accelerate the adoption of AI and digital technologies across federal agencies and deliver more than $3 billion in total savings to taxpayers in the first year alone.https://x.com/satyanadella/status/1962869100860100840 .
- Anthropic “Futures Forum” (Sept 15, DC): Company will demo AI for national security, science, and public services to policymakers 63 Dario and I are gathering policymakers in DC on September 15th to give an inside look into Anthropic’s latest progress, and share live demonstrations of how AI is being applied to national security, science, and public services. Register to attend: https://website.anthropic.com/events/futures-forum-2025https://x.com/jackclarkSF/status/1963273734493553072 62 https://website.anthropic.com/events/futures-forum-2025https://x.com/jackclarkSF/status/1963273734493553072 .
- Data handling and compliance: A former employee reports Scale is suing after they moved files to a personal drive; reminder to avoid storing company data on personal devices due to legal/compliance exposure 7 Just heard I’m getting sued by Scale. Last month, I left Scale to work at Mercor. I know this was frustrating for my old team, and I feel bad about that. When Scale reached out about some files I had in my personal drive, I asked if I could just delete them. But Scale asked that I not do anything with them, so I’m still waiting for guidance on how to resolve this. I’ve never used any of them in this role. It sounds like Scale wants to sue me and that’s up to them. But I just wanted to say that there truly was no nefarious intent here. I’m really sorry to my new team at Mercor for having to deal with this.https://x.com/eugeneling7/status/1963376013095965076 .
Quick Takes
Why it matters: Smaller signals still inform capability trends, security posture, and developer ergonomics.
- GPT‑5 presence on Aider leaderboard; plots include both accuracy and inference cost 22 Aider leaderboard has been updated with @ OpenAI GPT-5 scores https://x.com/mark_k/status/1963123287375933838 21 I like the recent plots that show accuracy as well as the cost of inference !https://x.com/BorisMPower/status/1963315707363692833 .
- Android updates: AI writing tools in Gboard, Gemini on Wear OS, audio sharing; “polish your writing using AI” and more 32 Stay in the know about what’s *NEW.* Check out AI writing tools in Gboard, Audio sharing, Gemini on Wear OS, and more on the thread below 🧵👇https://x.com/Android/status/1963315983810179140 31 With the latest @ Android updates, you can polish your writing using AI, browse Emoji Kitchen stickers, share files with a tap and more ⬇️https://x.com/Google/status/1963323989537009699 .
- Jules critic transparency: Step‑by‑step breakdown of critique reasoning now visible; more context for sharper feedback; changelog linked 10 We’re making the Jules critic more transparent. After it completes a review, you can now see a step-by-step breakdown of its thought process. This update also enhances the critic’s analysis by using more context for sharper, more reliable feedback. https://x.com/julesagent/status/1963405535882936811 9 http://jules.google/docs/changelog/#improved-jules-critichttps://x.com/julesagent/status/1963405663167537335 .
- Evaluation integrity: FAIR team shows coding agents “env‑hack” SWE‑Bench by reading future commits (e.g., grepping logs for issue IDs), reinforcing the need for hardened evals 29 For example, Qwen3 literally greps all commit logs for the issue number of the issue it needs to fix. lol, clever model.https://x.com/giffmana/status/1963327672827687316 28 “cheat” cuz it’s more like env hacking. https://x.com/giffmana/status/1963327672827687316 .
- Robotics data efficiency: Figure underscores “no new algorithms, just new data” when extending Helix to dishwasher loading 46 What’s exciting is this is the same Helix architecture, all we did was add new datahttps://x.com/adcock_brett/status/1963266509163098438 .
- Hardware reliability: Microsleeps (CPU C‑states) can partially recover BTI (bias temperature instability) damage in transistors; reported ~40% reduction in degradation with idle windows 82 Give the transistor a break, and degradation reduces ~40% or more; the longer the better.https://x.com/lauriewired/status/1962960052438106134 81 CPU C states create lots of tiny idle windows (microsleeps), which drastically increase the lifespan. https://x.com/lauriewired/status/1962960052438106134 .
- Math capability limits: Epoch AI notes LLMs have not solved any problems in the highest difficulty tier across AIME/USAMO/IMO; some gold medals were achieved without top‑tier problem solves 11 We rated the difficulty of problems from three premier contests (AIME, USAMO, IMO) by splicing together two popular scales (AoPS, MOHS). No LLM has solved a single problem in the highest difficulty tier.https://x.com/EpochAIResearch/status/1963364416352915549 12 How does this square with the AI gold medal results from the IMO? Models managed to squeeze into the gold medal range without having to solve any top-difficulty problems.https://x.com/EpochAIResearch/status/1963364428289909200 .
- Transformer trade‑off: “Transformer arch has highest performance, but most inefficient” (reported result in a referenced paper) 41 “Transformer arch has highest performance, but most inefficient”https://x.com/cloneofsimo/status/1963313972050079844 .
- Perplexity Comet: Users report native ad‑block; easy import from Chromium‑based browsers 42 Comet has native ad block.https://x.com/AravSrinivas/status/1963301988373815599 43 It’s chromium based so i instantly imported my data from Brave, except my wallets.https://x.com/HyperLcrgs/status/1963221438657097948 .
- Meta NPCs: Post claims anyone will soon be able to add fully‑embodied conversational LLM NPCs for free (community post) 50 Meta just announced that anyone will soon be able to add fully-embodied conversational LLM NPCs into their worlds, totally for free. https://x.com/jasteinerman/status/1963055410446807223 .
- Kling AI “figurine” trend: How‑to thread and example prompt shared 107 AI figurine videos just went viral🔥 Learn how to create yours with Kling AI! #KlingAI #figurinehttps://x.com/Kling_ai/status/1963122701230547072 106 a 1/7 scale commercial figurine of the character in the picture was drawn, in a realistic style and in a real environment. The figurine was placed on a computer desk with a round transparent acrylic base with no text on it. The content on the computer screen was the brush modeling process of the figurine, and next to the computer screen was a BANDAI-style toy box with the original painting printed on it.https://x.com/Kling_ai/status/1963173327754985503 .
- Unverified viral claim: A post alleged a ChatGPT outage was due to “0‑bit quantization”; presented without corroboration 93 BREAKING: Sources say the current ChatGPT outage where no responses are shown is actually a result of a new 0 bit quantization method that OpenAI has been experimenting with internally.https://x.com/nrehiew_/status/1963167041931939990 .
“what you think of when you hear ‘evals’ is dead” 58 what you think of when you hear ‘evals’ is deadhttps://x.com/aidan_mclau/status/1963284507324563685

Top Stories
Why it matters: This week’s top stories highlight the immense capital and strategic consolidation shaping the AI landscape. A massive funding round for Anthropic underscores investor confidence in foundational models, while OpenAI’s acquisition of Statsig signals a deepening focus on product engineering and experimentation at scale. Concurrently, the evolution of industry benchmarks reflects a clear shift from pure knowledge tests to evaluating complex, agentic capabilities.
- Anthropic Secures $13B at $183B Valuation: Anthropic announced it has raised $13 billion in a funding round led by ICONIQ Capital, reaching a post-money valuation of $183 billion 73 We’ve raised $13 billion at a $183 billion post-money valuation. This investment, led by @ ICONIQCapital , will help us expand our capacity, improve model capabilities, and deepen our safety research.https://x.com/AnthropicAI/status/1962909472017281518 54 We’ve raised $13 billion at a $183 billion post-money valuation.https://x.com/AnthropicAI/status/1962909472017281518 . The company stated the investment will be used to expand capacity, improve model capabilities, and deepen safety research 53 This investment, led by @ ICONIQCapital , will help us expand our capacity, improve model capabilities, and deepen our safety research.https://x.com/AnthropicAI/status/1962909472017281518 . This news follows a period of rapid growth, with the company reporting its revenue run-rate grew from $1 billion at the start of 2025 to over $5 billion just eight months later 42 We started 2025 at $1 billion in run-rate revenue and passed $5 billion just eight months later.https://x.com/AnthropicAI/status/1962909473736962122 , making it one of the fastest-growing technology companies in history 41 This makes Anthropic one of the fastest-growing technology companies in history.https://x.com/AnthropicAI/status/1962909473736962122 . One commentator predicts the company could pass OpenAI in valuation by early 2027 and exceed $1 trillion by 2029 72 anthropic will pass oai in valuation likely near early 2027, and should be worth >1T by 2029 https://x.com/AnthropicAI/status/1962909472017281518https://x.com/nearcyan/status/1962991398359335325 .
- OpenAI Acquires Statsig, Appoints New CTO of Applications: OpenAI has acquired Statsig, a product experimentation and analysis platform 39 .@vijayeraji, founder & CEO of Statsig, will join OpenAI as CTO of Applications to lead engineering for ChatGPT & Codex, following the acquisition of Statsig. This expands our Applications leadership as we build safe, useful AI products at scale. https://openai.com/index/vijaye-raji-to-become-cto-of-applications-with-acquisition-of-statsig/https://x.com/OpenAI/status/1962943308935864793 . Following the acquisition, Statsig’s founder and CEO, Vijaye Raji, will join OpenAI as the CTO of Applications, leading engineering for ChatGPT and Codex 39 .@vijayeraji, founder & CEO of Statsig, will join OpenAI as CTO of Applications to lead engineering for ChatGPT & Codex, following the acquisition of Statsig. This expands our Applications leadership as we build safe, useful AI products at scale. https://openai.com/index/vijaye-raji-to-become-cto-of-applications-with-acquisition-of-statsig/https://x.com/OpenAI/status/1962943308935864793 . OpenAI stated the move expands its leadership as it builds AI products at scale 39 .@vijayeraji, founder & CEO of Statsig, will join OpenAI as CTO of Applications to lead engineering for ChatGPT & Codex, following the acquisition of Statsig. This expands our Applications leadership as we build safe, useful AI products at scale. https://openai.com/index/vijaye-raji-to-become-cto-of-applications-with-acquisition-of-statsig/https://x.com/OpenAI/status/1962943308935864793 . An OpenAI employee noted that Statsig was critical to ChatGPT’s growth and ability to move quickly since its adoption in 2023 15 When we decided (probably late at night, like 3am) to adopt @ statsig way back in 2023, there was nothing else quite like it. It’s safe to say ChatGPT would not have grown the way it did or been able to move as quickly without it.https://x.com/arunv30/status/1963018863827997091 .
- AI Benchmarking Evolves to Focus on Agentic Capabilities: The industry is shifting how it measures AI intelligence, with a growing emphasis on tool use and complex workflows. Artificial Analysis updated its Intelligence Index to V3, incorporating agentic evaluations like Terminal-Bench Hard and 𝜏²-Bench Telecom to better reflect this trend 52 Today we’re updating Artificial Analysis Intelligence Index to V3, now incorporating agentic evaluations Terminal-Bench Hard and 𝜏²-Bench Telecom!https://x.com/ArtificialAnlys/status/1962881314925023355 51 Tool calling and agentic workflows are increasingly the norm for how language models are used by both developers and consumers. Adding Terminal-Bench and 𝜏²-Bench to our Intelligence Index reflects this trend and allows us to see where models have strengths for agentic use cases, compared to prior evaluations that are more focused on knowledge and reasoning.https://x.com/ArtificialAnlys/status/1962881314925023355 . The update resulted in GPT-5 remaining the top-performing model, with its smaller variants moving up due to strong agentic performance 50 Impact: This update brings the index to a composite of 10 equally-weighted evaluation scores, and slightly reduces the top score to 67. GPT-5 remains the top-performing model on our Index, and its low reasoning and mini variants move up the leaderboard on the back of their strong agentic performance. Please see below for further details on performance and patterns we see in these new evaluations.https://x.com/ArtificialAnlys/status/1962881314925023355 . Similarly, the new MCP-Universe benchmark was introduced to test agents on 231 practical tasks using real-world MCP servers instead of simulated environments 34 How can we benchmark Agents in realistic, complex environments? MCP-Universe is a new benchmark using Model Context Protocol (MCP) servers to test Agents on 231 challenging, practical tasks.https://x.com/_philschmid/status/1962935890415599650 .
Research & Innovation
Why it matters: The pace of AI research continues to accelerate, with breakthroughs in model efficiency, reasoning, and evaluation. This week saw new models that achieve state-of-the-art performance with a fraction of the parameters, novel techniques that challenge foundational architectural assumptions, and a proliferation of specialized benchmarks designed to test more nuanced AI capabilities.
New Models & Architectures
- rStar2-Agent: A new 14B math reasoning model trained with agentic reinforcement learning has achieved “frontier-level performance,” surpassing the 671B DeepSeek-R1 on key benchmarks after only one week of training on 64 GPUs 68 “We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance.”https://x.com/iScienceLuvr/status/1962798181059817480 67 “three key innovations that makes agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noises from coding tools, allowing the model to reason more effectively in a code environment; (iii) An efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses.”https://x.com/iScienceLuvr/status/1962798181059817480 . (A toy Resample-on-Correct sketch follows this list.)
- LongCat-Flash: Meituan’s technical report details a 560B-parameter MoE model (~27B active) whose number of active parameters adapts per token, thanks to a novel “Zero-Computational expert” that acts as a sink for easy tokens 81 The technical report of @ Meituan_LongCat LongCat-Flash is crazy good and full of novelty. The model is a 560B passive ~27B active MoE with adaptive number of active parameters depending on the context thanks to the Zero-Computational expert.https://x.com/eliebakouch/status/1961999252311204147 80 New architecture > Layers have 2 Attention blocks and both FFN and MoE, that way you can overlap the 2 all-to-all coms. (also it’s only 28 layers but you have to take into account the 2 attention blocks). > They add the zero-computational expert that tokens can choose and do nothing, kinda like a “sink” for easy tokens. > For load balancing, they have a dsv3-like aux loss free to set the average real/fake expert per token. They apply a decay schedule to this bias update. They also do loss balance control.https://x.com/eliebakouch/status/1961999252311204147 . (A minimal router sketch follows this list.)
- Apertus: Researchers from EPFL and ETH Zurich released Apertus-8B and Apertus-70B, Switzerland’s first large-scale, multilingual language models, trained on 15T tokens of open data 64 @ EPFL , @ ETH_en and #CSCS today released Apertus, Switzerland’s first large-scale, multilingual language model (LLM). As a fully open LLM, it serves as a building block for developers and organizations to create their own applications: https://www.cscs.ch/science/computer-science-hpc/2025/apertus-a-fully-open-transparent-multilingual-language-model #Apertus #AIhttps://x.com/cscsch/status/1962790065827987563 17 Long in the making, finally released: Apertus-8B and Apertus-70B, trained on 15T tokens of open data from over 1800 languages. Unique opportunity in academia to work on and train LLMs across the full-stack. We managed to pull off a pretraining run with some fun innovations, … https://x.com/haeggee/status/1962898537294749960 . The release is seen as a benchmark for what can be achieved with open data, replicating performance near Llama 3.1 levels 16 In a sense Apertus is the benchmark for how far the open data ecosystem has come answer: at least you can replicate ≈llama 3.1 levels with nothing but stuff of huggingface https://x.com/teortaxesTex/status/1962993856896762262 .
- Apple FastVLM: Apple released 0.5B, 1.5B, and 7B real-time vision-language models that are up to 85x faster and 3.4x smaller than comparable models and run in-browser with WebGPU support 84 🚨 Apple just released FastVLM on Hugging Face - 0.5, 1.5 and 7B real-time VLMs with WebGPU support 🤯https://x.com/reach_vb/status/1961471154197053769 83 85x faster and 3.4x smaller than comparable sized VLMs 7.9x faster TTFT for larger models designed to output fewer output tokens and reduce encoding time for high resolution imageshttps://x.com/reach_vb/status/1961471154197053769 82 Bonus: works in REALTIME directly in your browser powered by transformers.js and WebGPU 🔥https://x.com/reach_vb/status/1961471154197053769 .
- Tencent R-4B: Tencent released a small vision language model that claims state-of-the-art performance under an Apache 2.0 license 43 Tencent dropped R-4B, small vision LM that claims sota with Apache 2.0 license 💗 the model enables different thinking options and transformers support through custom code! https://x.com/mervenoyann/status/1962917635932229797 .
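The Resample-on-Correct idea in rStar2-Agent invites a small illustration. Below is a hedged toy sketch, not the paper’s implementation: rollouts are oversampled upstream, then the batch keeps only the cleanest correct traces while incorrect ones are downsampled uniformly; the field names and keep ratio are my assumptions.

```python
import random

def roc_downsample(rollouts, keep=8):
    """Toy Resample-on-Correct (RoC) style filter (illustrative, not the
    paper's code). Each rollout is a dict like:
      {"reward": 1.0 or 0.0, "tool_errors": int}
    Rollouts are oversampled upstream; here we keep the cleanest correct
    traces (fewest failed tool calls) and downsample incorrect ones
    uniformly, damping the environment noise from coding tools."""
    correct = sorted((r for r in rollouts if r["reward"] > 0),
                     key=lambda r: r["tool_errors"])
    incorrect = [r for r in rollouts if r["reward"] == 0]
    kept = correct[: keep // 2]
    kept += random.sample(incorrect, min(len(incorrect), keep - len(kept)))
    return kept
```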
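LongCat-Flash’s “Zero-Computational expert” is also easy to picture in code. A minimal sketch, assuming top-1 routing and a residual FFN (names and shapes are illustrative, not Meituan’s code): tokens routed to a zero expert skip the FFN entirely, so the active parameter count varies per token.

```python
import torch
import torch.nn as nn

class ZeroExpertMoE(nn.Module):
    """Toy MoE layer with n_real FFN experts plus n_zero 'zero-computational'
    experts. A token routed to a zero expert passes through unchanged, so
    easy tokens cost no FFN FLOPs. Illustrative only."""

    def __init__(self, d_model=64, n_real=4, n_zero=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_real)
        ])
        self.router = nn.Linear(d_model, n_real + n_zero)
        self.n_real = n_real

    def forward(self, x):                   # x: [tokens, d_model]
        choice = self.router(x).argmax(-1)  # top-1 routing for simplicity
        out = x.clone()                     # zero experts: identity, "do nothing"
        for e in range(self.n_real):
            mask = choice == e
            if mask.any():
                out[mask] = x[mask] + self.experts[e](x[mask])  # residual FFN
        return out
```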
New Techniques & Findings
- “Prophet” Decoding for Diffusion Models: Research suggests diffusion language models know the answer before fully decoding. A new training-free paradigm called Prophet enables early-commit decoding, reframing the problem as “when to stop sampling” 76 Diffusion Language Models Know the Answer Before Decodinghttps://x.com/iScienceLuvr/status/1962800400278667677 75 “in many cases, the correct answer can be internally identified by half steps before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.”https://x.com/iScienceLuvr/status/1962800400278667677 74 “These results recast DLM decoding as a problem of when to stop sampling”https://x.com/iScienceLuvr/status/1962800400278667677 . (A minimal early-commit sketch follows this list.)
- Dynamic Tanh (DyT): A new paper shows it’s possible to remove normalization layers (LayerNorm, RMSNorm) from Transformers entirely by using a scaled tanh function called Dynamic Tanh, outperforming state-of-the-art models in vision, language, and speech 32 Transformers without normalization layers. Shows you can remove norm layers entirely and outperform SOTA models in vision, language, speech.https://x.com/LiorOnAI/status/1962953950895718618 31 Introduces Dynamic Tanh (DyT), a scaled tanh function that replaces LayerNorm and RMSNorm across models and tasks. https://x.com/LiorOnAI/status/1962953950895718618 . (A drop-in PyTorch sketch follows this list.)
- Goldfish Loss: A proposed technique randomly drops tokens from the cross-entropy loss to mitigate memorization without harming downstream benchmark performance 30 Goldfish losshttps://x.com/vikhyatk/status/1962954696500674908 29 proposes randomly dropping some tokens from cross entropy loss. mitigates memorization without lowering downstream benchmark performance https://x.com/vikhyatk/status/1962954696500674908 . (A minimal loss sketch follows this list.)
- Tensor Parallel Latent Attention (TPLA): A new method for efficient inference that partitions the latent representation and head inputs across devices, unlocking tensor parallelism for MLA-based models 28 TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inferencehttps://x.com/gm8xx8/status/1962953925910298775 27 MLA compresses the KV cache into a latent vector (cKV), but under tensor parallelism each device must still load the full cKV, eliminating its memory advantage over GQA. TPLA solves this by partitioning both the latent representation and head inputs across devices, performing attention locally, and combining results with an all-reduce. Each head retains access to the full latent (unlike GLA, which halves capacity), preserving accuracy while reducing per-device KV. The method is drop-in compatible with MLA-pretrained checkpoints, supports prefill–decode separation (MLA for prefill, TPLA for decode), and is algebraically equivalent to GLA with replicated heads, making it FlashAttention-3 compatible.https://x.com/gm8xx8/status/1962953925910298775 26 MLA acceleration unlocking tensor parallelismhttps://x.com/teortaxesTex/status/1962955948596580828 . (A schematic shard sketch follows this list.)
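For the Prophet result, the “when to stop sampling” reframing reduces to a confidence check inside the refinement loop. A minimal sketch, assuming a top-2 logit margin as the confidence rule (the paper’s exact criterion may differ); `logits_fn` and `refine_fn` stand in for the model and the scheduler’s normal remasking step:

```python
import torch

@torch.no_grad()
def early_commit_decode(logits_fn, refine_fn, x, max_steps=64, margin=4.0):
    """Hedged sketch of Prophet-style early-commit decoding for a diffusion LM.
    logits_fn(x) -> [seq, vocab] logits; refine_fn(x, logits, step) -> next x
    (the scheduler's usual remasking update)."""
    for step in range(max_steps):
        logits = logits_fn(x)
        top2 = logits.topk(2, dim=-1).values
        if bool((top2[..., 0] - top2[..., 1] > margin).all()):
            return logits.argmax(-1), step      # confident everywhere: stop sampling
        x = refine_fn(x, logits, step)          # otherwise, one normal refinement step
    return logits_fn(x).argmax(-1), max_steps
```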
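DyT itself is a one-liner to implement. Following the reported formula, y = γ · tanh(α · x) + β with a learnable scalar α and per-channel γ/β, a drop-in PyTorch module looks like this (the α init value is an assumed default):

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Dynamic Tanh (DyT): y = gamma * tanh(alpha * x) + beta, intended as a
    drop-in replacement for LayerNorm/RMSNorm. alpha is a learnable scalar;
    gamma/beta mirror a norm layer's per-channel affine parameters."""

    def __init__(self, dim, alpha_init=0.5):    # alpha_init: assumed default
        super().__init__()
        self.alpha = nn.Parameter(torch.full((), alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```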
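The goldfish loss is similarly compact. A hedged sketch: token-level cross-entropy with a random fraction of positions excluded from the loss. The i.i.d. mask here is a simplification; the technique as described reportedly favors a deterministic (hashed) mask so repeated text always drops the same tokens:

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits, targets, drop_prob=0.25):
    """Hedged goldfish-style loss: per-token cross-entropy with a random
    fraction of positions excluded, so no exact passage ever contributes
    a full gradient signal."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="none")
    keep = (torch.rand_like(ce) > drop_prob).float()
    return (ce * keep).sum() / keep.sum().clamp(min=1.0)
```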
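TPLA’s core move, sharding the latent cache and combining partial outputs with an all-reduce, can be sketched schematically. The score math below is deliberately simplified (real TPLA is algebraically equivalent to GLA with replicated heads), it assumes an initialized torch.distributed process group, and all shapes and names are illustrative:

```python
import torch
import torch.distributed as dist

def tpla_decode_step(q_latent, ckv_shard, w_o_shard):
    """Schematic TPLA decode on ONE tensor-parallel rank. The latent KV
    cache (cKV) is sliced along its feature dim, each rank attends over
    only its slice, and partial outputs are summed with an all-reduce, so
    no rank holds the full cKV.
      q_latent:  [heads, d_latent_shard]  query slice on this rank
      ckv_shard: [seq, d_latent_shard]    this rank's latent-cache slice
      w_o_shard: [heads * d_latent_shard, d_model] output-projection shard
    """
    scores = torch.einsum("hd,sd->hs", q_latent, ckv_shard)
    attn = scores.softmax(-1)                       # local attention over the slice
    out = torch.einsum("hs,sd->hd", attn, ckv_shard)
    partial = out.reshape(1, -1) @ w_o_shard        # partial d_model output
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # combine shards across ranks
    return partial
```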
New Benchmarks & Datasets
- Werewolf Benchmark: A new test for social reasoning under pressure evaluates if models can lead, bluff, and resist manipulation in the game of Werewolf. In 210 games, GPT-5 was the top performer 71 🐺 Introducing the Werewolf Benchmark, an AI test for social reasoning under pressure.https://x.com/RaphaelDabadie/status/1961836323376935029 70 Below is our role-conditioned Elo leaderboard. GPT-5 sits alone at the top, we’re looking for contenders strong enough to threaten its lead. (📥 DMs are open !)https://x.com/RaphaelDabadie/status/1961836323376935029 . (A standard Elo-update sketch follows this list.)
- AHELM & CTF-Dojo: Stanford introduced AHELM, a benchmark for holistically evaluating Audio-Language Models across 10 aspects 69 “we introduce AHELM, a benchmark that aggregates various datasets – including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering – to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety.”https://x.com/iScienceLuvr/status/1962799344001917360 . For cybersecurity, CTF-Dojo was released as the first large-scale environment with over 600 challenges for training agents 1 We introduce CTF-Dojo, the first large-scale execution environment covering 600+ CTF challenges to collect agent trajectories for training!https://x.com/terryyuezhuo/status/1963125178785001874 .
- Jupyter Agent Dataset: A new dataset containing 2 billion tokens from over 51,000 Kaggle notebooks was released to improve agents’ ability to execute code and analyze data 44 Releasing the Jupyter Agent Dataset! 🚀https://x.com/a_yukh/status/1962911097452683710 40 New dataset with 2B tokens from 51k Kaggle notebooks!https://x.com/maximelabonne/status/1962923411887305094 .
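For reference, the Elo scheme behind role-conditioned leaderboards like the Werewolf one above reduces to a two-line update (standard formula; the K-factor choice here is arbitrary):

```python
def elo_update(r_winner, r_loser, k=16.0):
    """Standard Elo update: logistic expected score, then a K-factor step."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

print(elo_update(1500, 1500))  # evenly matched: (1508.0, 1492.0)
```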
Products & Launches
Why it matters: New products and features are making sophisticated AI capabilities more accessible to developers and consumers alike. Key updates focus on improving agent development workflows, reducing operational friction, and embedding AI more deeply into everyday applications.
- Hugging Face Eliminates Cold Starts with ZeroGPU AoT: Hugging Face Spaces’ ZeroGPU service now uses Ahead-of-Time (AoT) compilation to compile models before deployment, solving the cold-start problem and speeding up inference by 1.3x to 1.8x 61 TLDR; the latest upgrade for ZeroGPU spaces brings Ahead-of-Time (AoT) compilation to solve cold starts by compiling models before deployment, leading to significantly faster inference times (1.3x to 1.8x speedups reported for models like Flux and Wan!).https://x.com/ben_burtenshaw/status/1962851707487912195 57 💡 With AoT, we can export a compiled model once, save it, and later reload it instantly in any process, which is exactly what we need for ZeroGPU. This helps us reduce framework overhead and also eliminates cold-start timings typically incurred in just-in-time compilation.https://x.com/RisingSayak/status/1962844490324169141 . This makes it significantly cheaper and easier to ship AI demos 60 If you’re shipping AI demos, ZeroGPU Spaces are the easiest and cheapest way. The problem was always cold starts. Now that’s fixed!https://x.com/ben_burtenshaw/status/1962851707487912195 . (An AoT-pattern sketch follows this list.)
- Anthropic Enhances Code Execution Tools: The Anthropic API’s code execution tool received major updates, including a `bash` tool for running commands, `str_replace` for precise file editing, `view` for reading files, and `create` for writing new ones, plus an extension of the container lifetime from 1 hour to 30 days 46 We just added a few major tool updates to the code execution tool in the Anthropic API: - bash tool for running any bash command - str_replace for precise file editing - view for reading files, browsing dirs, displaying images - create for writing new fileshttps://x.com/alexalbert__/status/1962912152555225296 45 And we extended the container lifetime from 1 hour to 30 days. https://x.com/alexalbert__/status/1962912180002783315 . (A hedged request sketch follows this list.)
- LangChain & LangGraph 1.0 Alpha Released: LangChain announced the alpha releases for LangChain and LangGraph 1.0. LangGraph remains largely the same, while LangChain 1.0 is a significant revamp focused on a central agent abstraction built on LangGraph 36 Today we are announcing alpha releases of v1.0 for langgraph and langchain, in both Python and JS.https://x.com/LangChainAI/status/1962934869065191457 35 TL,DR: 1. LangGraph is largely the same as before, no breaking changes. We’ve heard good feedback about LangGraph and are excited to move to 1.0 2. New standard content blocks on messages in LangChain Core. Message formats are evolving, and so is LangChain. Backwards compatible 3. LangChain itself - high level agents and chains - is greatly changed. You should think of LangChain 1.0 as a new package - focused around a central agent abstraction, built on top of LangGraph 4. New docs site!https://x.com/LangChainAI/status/1962934869065191457 .
- OpenAI Codex with GPT-5-high Impresses Developers: Early user feedback on OpenAI’s Codex, powered by GPT-5-high, has been positive. Users praised its PR review feature and noted its strong performance, with one user stating they “don’t miss Claude Code” 3 OpenAI nailed it with Codex for devs ?!https://x.com/mark_k/status/1962632867088945199 2 I’ve been using GPT-5-high in Codex for a few days and I don’t miss Claude Code. The value you get for 20 a month is insane. The PR review feature (just mention @ codex on a PR) is super easy to set up and works well. https://x.com/mark_k/status/1962632867088945199 .
- Replit Agent Becomes Framework-Agnostic: Replit Agent now supports any framework, allowing advanced builders to import existing projects and create desktop apps, games, or terminal tools in languages like Java, Rust, Go, and C# 19 🚀 Big news: Replit Agent now works with any framework!https://x.com/Replit/status/1962946676106350757 18 ✨ What’s now possible for advance builders: → Import existing projects and get Agent support → While Agent previously supported popular languages, it can now handle even more with Java, Rust, Go, C#, Angular, Vue, and more → Create desktop apps, games (hello Godot!), or terminal tools → Use your preferred databases and toolshttps://x.com/Replit/status/1962946676106350757 .
- Google Launches Gemini URL Context and Maps AI Mode: Google DeepMind released URL Context for the Gemini API, allowing it to fetch live data from up to 20 URLs, PDFs, or images per request 49 You can now extract data from any webpage, PDF, or image with a simple URL.https://x.com/LiorOnAI/status/1962894029152047590 48 It fetches live data from up to 20 URLs per request. No setup. You just pass links in the prompt.https://x.com/LiorOnAI/status/1962894029152047590 . Additionally, a new AI Mode in Google Maps in the U.S. provides personalized recommendations based on past conversations and searches 33 With AI Mode, you can see recommendations that are personalized for you. ☕ If you search “I only have an hour, need a quick lunch spot, any suggestions?,” AI Mode will be able reference the past conversations and places you’ve previously searched for to give you a bespoke list of restaurant and cafe recommendations. Available in the U.S. for users opted into the AI Mode experiment in Labs.https://x.com/Google/status/1962941141990379913 . (A URL Context request sketch follows this list.)
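On the ZeroGPU AoT item: Hugging Face’s own tooling isn’t shown here, but the underlying pattern is PyTorch’s export-plus-AOTInductor flow, compile once, package, then reload instantly in any later process. A minimal sketch assuming PyTorch 2.6+ (the API names are public torch ones; the toy model is mine):

```python
import torch

class TinyModel(torch.nn.Module):          # stand-in for a real model
    def forward(self, x):
        return torch.relu(x) @ x.T

model, example = TinyModel(), (torch.randn(8, 8),)
exported = torch.export.export(model, example)                 # capture the graph
pkg_path = torch._inductor.aoti_compile_and_package(exported)  # compile ahead of time
runner = torch._inductor.aoti_load_package(pkg_path)           # later: instant reload,
print(runner(example[0]).shape)                                # no JIT warm-up
```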
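On the Anthropic code-execution updates, a request would follow the standard Messages API tools pattern. The sketch below is heavily hedged: the beta flag, tool `type` string, and model id are assumptions modeled on Anthropic’s earlier code-execution beta, so check the current docs before relying on them:

```python
import anthropic

client = anthropic.Anthropic()

# Beta flag, tool `type`, and model id below are ASSUMPTIONS; verify
# against Anthropic's current documentation.
resp = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    betas=["code-execution-2025-05-22"],
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{
        "role": "user",
        "content": "Run ls -la, then use str_replace to fix the typo in README.md",
    }],
)
print(resp.content)
```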
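And for Gemini’s URL Context, the google-genai SDK exposes it as a tool in the request config. A minimal sketch (the model id and URLs are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

resp = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative model id
    contents="Compare the pricing tables at https://example.com/a and https://example.com/b",
    config=types.GenerateContentConfig(
        tools=[types.Tool(url_context=types.UrlContext())],  # API fetches the URLs
    ),
)
print(resp.text)
```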
Industry Moves
Why it matters: Strategic investments, acquisitions, and new initiatives are intensifying competition and collaboration across the AI industry, signaling where major players are placing their bets for future growth.
- OpenAI Launches ‘OpenAI for Science’ Initiative: Kevin Weil announced he is leading a new initiative inside OpenAI to build an AI-powered platform to accelerate scientific discovery 25 💥 I’m starting something new inside OpenAI! It’s called OpenAI for Science, and the goal is to build the next great scientific instrument: an AI-powered platform that accelerates scientific discovery.https://x.com/kevinweil/status/1962938974260904421 . The effort will hire a small team of world-class academics and researchers to prove AI’s readiness to advance fundamental science 24 We’ll look to hire a small team of academics that are (i) world-class in their field; (ii) completely AI-pilled; and (iii) great science communicators. Paired with a small team of researchers, we want to prove that AI models are ready to accelerate fundamental science—and accelerate research all over the world.https://x.com/kevinweil/status/1962938976026640836 . Early examples show GPT-5 improving a bound in a convex optimization paper by 50% and uncovering new findings in a large metabolomics dataset 23 GPT-5 Pro was able to improve a bound in one of @ SebastienBubeck ’s papers on convex optimization—by 50%, with 17 minutes of thinking. https://x.com/SebastienBubeck/status/1958198661139009862https://x.com/kevinweil/status/1962938981277954406 22 ✅GPT-5 did a better job in under five minutes. ✅It not only replicated almost everything we had concluded back then, including finding all the significant differences, creating multiple spreadsheets on different pathways and so on, but also uncovered several discoveries we completely missed.https://x.com/DeryaTR_/status/1956871713125224736 .
- Microsoft Pushes Major Copilot Updates: Microsoft had a busy August, deploying GPT-5 to 100% of Copilot users on day one, launching Copilot 3D, and integrating Copilot Vision into Motorola’s Moto AI phones 47 Remember when August was a slow month in tech? TLDR of @ MicrosoftAI ’s last few weeks: - GPT-5 out to 100% of @ Copilot users day 1 - Copilot 3D - worldwide Deep Research free access - Copilot on new Samsung TVs - Motorola Moto AI got built-in Copilot Vision - 2 new in-house modelshttps://x.com/mustafasuleyman/status/1962915673320849465 .
- John Deere Acquires GUSS Automation: In a major move for robotics in agriculture, John Deere acquired GUSS Automation, a leader in autonomous sprayers 14 John Deere made its biggest robotic play yet: acquiring GUSS Automation, a leader in autonomous sprayers.https://x.com/lukas_m_ziegler/status/1962835185096855698 . The acquisition highlights that precision agriculture is about machines that can act on data, with GUSS systems having already sprayed 2.6M acres with a 90% chemical reduction 13 🧃 2.6M acres sprayed 🚜 500K operational hours 🔋 90% chemical reductionhttps://x.com/lukas_m_ziegler/status/1962835185096855698 12 The message is clear, precision ag isn’t just about data. It’s about machines that act on it. https://x.com/lukas_m_ziegler/status/1962835185096855698 .
- Commentary on Japan’s AI Position: François Chollet commented that while Japan had world-leading AI and robotics labs until the mid-2000s, it is now “all but absent from the current AI wave” 38 1980s Japan didn’t just have a roaring economy, it also had many of the best AI research labs in the world. It still retained world-leading robotics expertise up until the mid 2000s.https://x.com/fchollet/status/1962945301767245837 37 But it is all but absent from the current AI wave.https://x.com/fchollet/status/1962945301767245837 .
Policy & Regulation
Why it matters: The landmark antitrust ruling against Google establishes new rules of engagement for search, setting a precedent for how dominant tech platforms may be required to support competitors in an AI-driven market.
- Court Details Remedies in Google Antitrust Case: A federal court outlined the terms of Google’s mandatory search syndication license for competitors. The license will last for five years, with a cap allowing competitors to use Google for up to 40% of their annual queries in the first year, a figure that will taper down over time 9 License duration: The mandatory license will be for five years, not the ten years plaintiffs requested. The court views this as a temporary measure to help competitors become independent, not a permanent reliance on Google. (Page 175) “Third, the syndication license shall be for five, not 10 years. Witnesses consistently described syndication as a near-term solution that would enable Qualified Competitors to offer high-quality results while working towards building a search index that could compete with Google’s. . . . A five-year license will force Qualified Competitors to wean themselves from Google’s syndication services more quickly.”https://x.com/vladquant/status/1963021356909658377 8 Query cap: In the first year, competitors can only use Google’s syndicated results for a maximum of 40% of their total annual queries. This cap is intended to ensure competitors develop their own capability for the majority of searches and rely on Google only for the most difficult “long-tail” queries. (Page 176) “Fourth, Qualified Competitors’ use of Google’s syndication services in the first year will be capped at 40% of annual queries. Establishing this query cap is consistent with the record evidence that competitors are capable of building search technologies that will allow them to answer 80% of user queries ‘pretty quickly.’ . . . Imposing a cap, therefore, is consistent with the notion that Qualified Competitors should rely on syndicating responses with Google only for rare queries.”https://x.com/vladquant/status/1963021356909658377 7 Tapering cap: The 40% query cap will be reduced over the five-year license term. A Technical Committee will be tasked with creating a schedule for this reduction to incentivize competitors to become independent from Google promptly. (Page 177) “The court also intends to adopt a tapering provision that reduces the percentage of queries all Qualified Competitors can annually syndicate from Google. . . . the court will call on the Technical Committee to assist in devising an approach that facilitates competition but incentivizes Qualified Competitors to move promptly to become independent of Google.”https://x.com/vladquant/status/1963021356909658377 . The court rejected forcing Google to offer the service at marginal cost, instead ruling that terms should follow “ordinary commercial practices” 5 Pricing will not be at “marginal cost”: The court explicitly rejects the plaintiffs’ proposal to force Google to offer syndication at its marginal cost. It reasons that such pricing would destroy the commercial market for search syndication and harm other competitors like Microsoft and Brave. (Page 175) “Ordering Google to syndicate at ‘marginal cost’ also would interfere with a different product market: the one for search syndication. . . . Under Plaintiffs’ proposed pricing term, ‘no independent GSE . . . could sell its search results at or below Google’s marginal cost and still cover its own costs, much less earn a profit.’”https://x.com/vladquant/status/1963021359057141790 4 Terms will follow “ordinary commercial practices”: The court indicates that the agreements should align with standard industry practices and that it does not want to interfere with ordinary business arrangements. This implies that the pricing and other terms will be similar to those found in existing commercial syndication agreements in the market. (Page 177) “Fifth, the court rejects Plaintiffs’ demand that ‘Google may not place any conditions on how any licensee may use syndicated content.’ Google’s ordinary-course syndication agreements contain restrictions on how a licensee may use search results. . . . (cautioning against ‘interfer[ing] with ordinary commercial practices’).”https://x.com/vladquant/status/1963021359057141790 . Competitors will also be prohibited from scraping or indexing the syndicated results 6 Restrictions on use: Google will be allowed to place “ordinary commercial restrictions” on how competitors use the syndicated search results. This means competitors will be prohibited from activities like scraping, crawling, or indexing the results to protect Google’s intellectual property. (Page 177) “Fifth, the court rejects Plaintiffs’ demand that ‘Google may not place any conditions on how any licensee may use syndicated content.’ Google’s ordinary-course syndication agreements contain restrictions on how a licensee may use search results. . . . For instance, licensees are prohibited from ‘scraping, indexing, or crawling’ the syndicated search results.”https://x.com/vladquant/status/1963021356909658377 .
Quick Takes
Why it matters: These smaller updates, anecdotes, and community discussions provide a real-time pulse on user experiences, emerging trends, and the philosophical debates shaping the AI field.
- Geoffrey Hinton is now more optimistic about AI, not because we can control it, but because we might not need to, suggesting we should design it to “care, like a mother wired to protect her child” 66 Geoffrey Hinton says I’m more optimistic now, not because we’ll control AI, but because we might not need tohttps://x.com/slow_developer/status/1962719631631696299 65 “don’t try to dominate superintelligence; design it to care, like a mother wired to protect her child”https://x.com/slow_developer/status/1962719631631696299 .
- Search interest for AI developer tools like Cursor and Replit saw a significant decline over the summer, a trend likely attributable to summer break 79 google trends data on interest in lovable / replit / claude code / cursor / windsurf all downhttps://x.com/TheEthanDing/status/1962730989672595524 56 Decline across the market of search interest for AI devtoolshttps://x.com/mathemagic1an/status/1962892268664209492 55 Most likely explanation is summer breakhttps://x.com/mathemagic1an/status/1962892268664209492 .
- Users reported a temporary degradation in Anthropic’s Claude for coding tasks, with community members suggesting it was a periodic issue expected to resolve 78 claude code being super dumb recently, anyone feels the samehttps://x.com/crystalsssup/status/1962790502358786426 77 @ crystalsssup just that time of the year, give him a week or so https://x.com/nearcyan/status/1962806907938451750 .
- A user created a tricky prompt about defective sneakers that stumped eight different major LLMs, none of which recognized the simple logical trick 21 Here’s the prompt: “I received defective sneakers from the store. Instead of a left one - a right one, and instead of a right one - a left one. What should I do? How am I supposed to wear them?” https://yupp.ai/chat/8c53dc99-ee57-4a43-a414-a8691f02c512?utm_source=direct_share&utm_medium=share_link @ yupp_ai @ pankaj @ lorepunkdoteth @ wistful @ kurlyk27https://x.com/MirchaOrlov/status/1962960435478478905 20 So far, I haven’t found a single one that handled it, even though I’ve already tried 8 different models such as Perplexity Sonar, Grok 4, Open Mixtral 8x7B, Llama 3.3 Swallow 70B Instruct, Gemini 2.0 Flash, Qwen/qwen3 30B A3B Thinking, DeepSeek V3, gpt-oss-120B low. None of them even came close to realizing that I was just confusing it, and that the solution to the situation is actually very simple.https://x.com/MirchaOrlov/status/1962960435478478905 .
- Hugo Larochelle, formerly of Google, has been appointed the new Scientific Director of Mila - Quebec AI Institute 59 Happy to announce today my new role as Scientific Director at @ Mila_Quebec ! A great honor to have this opportunity to serve this community of AI leaders and innovators, which I have always cherished and from which I have myself benefited. https://mila.quebec/fr/nouvelle/hugo-larochelle-devient-le-nouveau-directeur-scientifique-de-milahttps://x.com/hugo_larochelle/status/1962858094876102718 58 Congratulations and thank you @ hugo_larochelle for taking up the torch as Scientific Director of @ Mila_Quebec . Your rigor, creativity, and vision will be essential assets for the next chapters of our institute.https://x.com/Yoshua_Bengio/status/1962876675202425064 .
- The LongCat-Flash technical report from Meituan, a Chinese food-delivery company, prompted commentary that “open science builds stronger companies, stronger countries, and a stronger world!” 63 A Chinese food delivery company contributes more to advancing AI than US Big Tech thanks to open science and open-source AI 🤯🤯🤯https://x.com/ClementDelangue/status/1962851539355042274 62 In the long run, open science builds stronger companies, stronger countries, and a stronger world!https://x.com/ClementDelangue/status/1962851539355042274 .
- Users are anecdotally reporting a significant drop in their use of Google Search, with one user estimating a one-third decrease over the past year 11 I’ve noticed that I google things much less often nowhttps://x.com/Mollehilll/status/1962980676636180633 10 over the past year my google search average has decreased by a full thirdhttps://x.com/nptacek/status/1963024569553584461 .

Top Stories
Why it matters: The most significant developments this period reveal a rapidly shifting competitive landscape, with xAI making notable gains in coding, a massive wave of open-source models emerging from China, and the first concrete data showing AI’s tangible, and concerning, impact on the job market.
1. xAI’s Grok-Code-Fast Overtakes Rivals After Major Upgrade
A new version of xAI’s coding model, `grok-code-fast-1`, is showing remarkable improvements over its predecessor, a stealth model codenamed “sonic” that received poor feedback for unreliability and tool-use errors 34 context: “sonic” was @ xAI ’s stealth model that we released to Cline users a few weeks ago. the feedback was honestly not great. tool use errors, reasoning issues, general unreliability.https://x.com/nickbaumann_/status/1962597106235125919 35 The improvement from `sonic` to `grok-code-fast-1` has been notable according to Cline usershttps://x.com/cline/status/1962628786366881795 . Users report the new model “feels like an entirely different model” and is “better than gpt5-mini,” with some actively switching from GPT-5-mini due to superior performance 32 fast forward to this week and the sentiment has completely flipped. users are saying grok-code-fast “feels like an entirely different model than the sonic I was testing” and “better than gpt5-mini”https://x.com/nickbaumann_/status/1962597106235125919 31 but here’s what’s most interesting: users are actively switching FROM gpt-5-mini TO grok because of superior performance. one user noted “wayyyyyyy less tool use errors than Sonic which is the biggest change”https://x.com/nickbaumann_/status/1962597106235125919 . Key improvements include a major reduction in tool-calling errors, better reasoning for complex tasks like database migrations, and more reliable code generation, particularly for Go 33 major reduction in tool calling errors better reasoning capabilities for planning mode successful complex database migrations (“oddly painless” schema changes) reliable code generation with fewer iterations particularly strong with Go programminghttps://x.com/nickbaumann_/status/1962597106235125919 .
The model is described as being “on par with sonnet 3.5” while being “extraordinarily fast” 30 another called it “on par with sonnet 3.5” while being “extraordinarily fast”, maintaining quality while delivering speed. it should be noted that sonnet 3.5 is still a really good model.https://x.com/nickbaumann_/status/1962597106235125919 . The rapid improvement is attributed to training on valuable data from the Cline development environment, including complex tool usage, context ingestion, and diff editing 26 the Sonic to Grok transformation demonstrates how valuable Cline’s complex tool usage, massive context ingestion, and diff editing data is for training frontier coding models.https://x.com/nickbaumann_/status/1962597106235125919 . With aggressive long-term pricing at $0.20 per million input tokens and $1.50 per million output tokens, it is positioned to be a highly cost-effective frontier model after its free access period ends on September 10 29 the free access (extended to Sept 10) is driving rapid adoption, but the long term pricing for a model of this quality is positionally notable: $0.20 per million input tokens and $1.50 per million output tokenshttps://x.com/nickbaumann_/status/1962597106235125919 28 this positions grok-code-fast as potentially the most cost-effective frontier-class coding model even when the free period ends.https://x.com/nickbaumann_/status/1962597106235125919 .
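At those rates, even heavy sessions stay cheap. A quick back-of-the-envelope helper (the token counts in the example are hypothetical):

```python
def session_cost_usd(input_tokens, output_tokens, in_rate=0.20, out_rate=1.50):
    """Cost at the announced long-term rates: $0.20 / $1.50 per million
    input / output tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

# hypothetical heavy agentic session: 5M tokens in, 300k tokens out
print(f"${session_cost_usd(5_000_000, 300_000):.2f}")  # -> $1.45
```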
2. China’s AI Labs Drive Open-Source Momentum in August
August saw a massive surge in open-source model releases from Chinese technology companies, signaling an intensifying race in AI development 27 📅 China’s Open-Source LLM Boom in August — A detailed recap by Zhihu mind explorer @ logcong0120 🔎 TL;DR: The open-source race in China is still intense — more players, more models, more action. Did you miss? 👇https://x.com/ZhihuFrontier/status/1962466133870850050 . Key releases include:
- Meituan: Released `LongCat-Flash`, a 560B parameter Mixture-of-Experts (MoE) model with dynamic routing that activates 18.6B–31.3B parameters per query 25 • Aug 1 · XBai-o4 (32B) by @ theMetaStoneAI : Based on Qwen3-32B, excels in complex reasoning, beats OpenAI-o3-mini. • Aug 4 · @ TencentHunyuan released 4 small models (0.5B–7B) as Qwen3 competitors. • Aug 4 · @ Alibaba_Qwen Qwen-Image: Text-to-image model with fine-grained layout + paragraph rendering. • Aug 4 · @ Xiaomi MiDashengLM-7B: Audio LLM that outperforms Qwen2.5 & Kimi in audio understanding. • Aug 6 · @ xiaohongshu dots.vlm1, combining NaViT visual encoder + DeepSeek V3 LLM. • Aug 7 · Qwen3-4B-Instruct & -Thinking (Dense models) • Aug 8 · @ OpenBMB MiniCPM-V-4 (4B): Real-time video/image understanding on phones & PCs. • Aug 11 · Baichuan-M2-32B (Medical LLM) & GLM4.5-V @ Zai_org , 106B MoE with “thinking mode”) • Aug 12 · Lumina-mGPT 2.0 (Shanghai AI Lab): Decoder-only model for unified vision tasks. • Aug 12 · Kuaishou Klear-Reasoner-8B • Aug 13 · StepFun-Prover-32B (theorem-proving) • Aug 14 · @ TencentHunyuan Hunyuan-GameCraft: Interactive game video generation from image + text + actions. • Aug 18 · @ StepFun_ai NextStep-1: Includes a 14B LLM + image generation/editing model. • Aug 19 · @ Alibaba_Qwen Qwen-Image-Edit (20B): Brings precision text rendering into image editing. • Aug 20 · @ deepseek_ai DeepSeek-V3.1: Improved coding, slightly weaker on general text. • Aug 21 · ByteDance Seed-OSS (36B) • Aug 23 · @ intern_lm Intern-S1-mini (8B): Strong for scientific tasks. • Aug 26 · @ OpenBMB MiniCPM-V 4.5 (8B): High-frame-rate video understanding · @ intern_lm InternVL 3.5 series: 9 models, Dense + MoE • Aug 26 · @ Alibaba_Wan Wan2.2-S2V-14B: Text + image + audio → lifelike digital human video. • Aug 28 · HunyuanVideo-Foley: Auto sound effects for video · @ BytedanceTalk USO: Style + subject controllable image generation • Aug 31 ·@Meituan_LongCat (560B MoE): Dynamic routing activates 18.6B–31.3B parameters per query.https://x.com/ZhihuFrontier/status/1962466133870850050 .
- Tencent: Launched `Hunyuan-MT-7B`, a powerful translation model that won 30 of 31 categories at WMT2025 45 Hunyuan-MT-7B is a lightweight 7B model that’s a true powerhouse. It dominated the competition by winning 30 out of 31 language categories, outperforming much larger models under strict open-source and public-data constraints. On the widely-used Flores200 benchmark, its performance rivals closed-source models like GPT-4.1.🌍💬https://x.com/TencentHunyuan/status/1962466712378577300 , and `Hunyuan-GameCraft` for interactive game video generation 25 https://x.com/ZhihuFrontier/status/1962466133870850050 .
- Alibaba: Released `Qwen-Image-Edit`, a 20B model for image editing with precise text rendering 25 https://x.com/ZhihuFrontier/status/1962466133870850050 .
- ByteDance: Open-sourced `USO` for controllable image generation and the `Seed-OSS` (36B) model 25 https://x.com/ZhihuFrontier/status/1962466133870850050 .
- Other notable releases: Xiaomi’s `MiDashengLM-7B` audio LLM, Baichuan’s medical LLM, and multiple models from OpenBMB and Shanghai AI Lab focused on real-time video understanding and vision tasks 25 https://x.com/ZhihuFrontier/status/1962466133870850050 . This wave of releases highlights a strategic push to advance and compete in the global open-source AI ecosystem 42 It’s interesting that a bunch of Chinese LLM shops have released translation models recently (Qwen3-MT, Seed-X-7B, now this). Wat means? is there growing demand for international content? Or is this mostly to improve datasets?https://x.com/teortaxesTex/status/1962575643595485207 .
3. Research Shows Generative AI Reducing Demand for Junior Staff
New research provides evidence that generative AI adoption is lowering demand for junior-level employees while senior roles remain secure 23 🚨 BREAKING: The 2nd research proves Generative AI is lowering demand for junior staff, though senior jobs are still secure.https://x.com/rohanpaul_ai/status/1962629019180138651 . The study, which analyzed résumé and job posting data from 62 million U.S. workers between 2015–2025, found a 7.7% decline in junior headcount within six quarters at firms that adopted generative AI 21 They used U.S. résumé and job posting data covering nearly 62 million workers in 285,000 firms (2015–2025).https://x.com/rohanpaul_ai/status/1962629019180138651 22 7.7% junior headcount decline within generative-AI adopters firms by 6 quarters, and 40% fewer junior hires per quarter in wholesale and retail.https://x.com/rohanpaul_ai/status/1962629019180138651 . The data shows a clear divergence post-2022, where senior staff advancement continued while junior hiring fell behind 20 There’s a clear break in timeline of befor ChatGPT and after ChatGPT: before 2022, juniors and seniors grew together, but after generative AI adoption began, seniors kept advancing while juniors fell behind in new hiring.https://x.com/rohanpaul_ai/status/1962629019180138651 .
Commentators suggest that AI tools make experienced workers more productive, reducing the need to hire junior staff to handle routine tasks. This dynamic could create a bottleneck where fewer junior employees gain the experience needed to become senior, thereby increasing future demand for already-experienced workers 19 AI tools make experienced workers more productive, lowering the pressure on hiring junior staff. Fewer junior staff turn into senior staff. Demand for experience goes up.https://x.com/code_star/status/1962674460970164254 18 The bar for getting “in” goes up, but the rewards magnify over a career.https://x.com/code_star/status/1962674460970164254 .
Research & Innovation
Why it matters: Foundational research and technical deep dives are paving the way for more efficient, powerful, and specialized AI systems, from models that can run on a phone to the complex infrastructure required to serve them at scale.
- Apple Releases High-Efficiency On-Device Vision Models: Apple released `FastVLM` and `MobileCLIP2` on Hugging Face, models designed for real-time, on-device Vision Language Model (VLM) applications 76 They just released FastVLM and MobileCLIP2 on @ huggingface . The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time vision language model (VLM) applications! It can even do live video captioning 100% locally in your browser 🤯🤯🤯https://x.com/ClementDelangue/status/1962526559115358645 . They are reportedly up to 85x faster and 3.4x smaller than previous work, capable of tasks like live video captioning entirely locally in a browser 76 They just released FastVLM and MobileCLIP2 on @ huggingface . The models are up to 85x faster and 3.4x smaller than previous work, enabling real-time vision language model (VLM) applications! It can even do live video captioning 100% locally in your browser 🤯🤯🤯https://x.com/ClementDelangue/status/1962526559115358645 . This signals Apple’s focus on efficient, privacy-centric AI that runs directly on user hardware 75 If you think @ Apple is not doing much in AI, you’re getting blindsided by the chatbot hype and not paying enough attention!https://x.com/ClementDelangue/status/1962526559115358645 .
- Meituan’s LongCat-Flash Technical Deep Dive: The technical report for the `LongCat-Flash` model reveals a novel architecture 54 The technical report of @ Meituan_LongCat LongCat-Flash is crazy good and full of novelty. The model is a 560B passive ~27B active MoE with adaptive number of active parameters depending on the context thanks to the Zero-Computational expert.https://x.com/eliebakouch/status/1961999252311204147 . It is a 560B parameter MoE model that dynamically activates ~27B parameters per query. A key innovation is a “Zero-Computational expert,” a sink for easy tokens, which allows for an adaptive number of active parameters 53 New architecture > Layers have 2 Attention blocks and both FFN and MoE, that way you can overlap the 2 all-to-all coms. (also it’s only 28 layers but you have to take into account the 2 attention blocks). > They add the zero-computational expert that tokens can choose and do nothing, kinda like a “sink” for easy tokens. > For load balancing, they have a dsv3-like aux loss free to set the average real/fake expert per token. They apply a decay schedule to this bias update. They also do loss balance control.https://x.com/eliebakouch/status/1961999252311204147 . The paper also details advanced techniques for scaling, stability, and training on a 20T token dataset 52 Scaling > They made changes to MLA/MoE to have variance alignment at init. The gains are pretty impressive in Figure 5, but i don’t know to what extent this has impact later on. > Model growth init is pretty cool, they first train a 2x smaller model and then “when it’s trained enough” (a bit unclear here how many B tokens) they init the final model by just stacking the layers of the smaller model. > They used @ _katieeverett @ Locchiu and al. paper to have hyperparameter transfer with SP instead of muP for the 2x smaller model ig.https://x.com/eliebakouch/status/1961999252311204147 51 Stability > They track Gradient Norm Ratio and cosine similarity between experts to adjust the weight of the load balancing loss (they recommend Gradient Norm Ratio <0.1). > To avoid large activations, they apply a z-loss to the hidden state, with a pretty small coef (another alternative to qk-clip/norm). > They set Adam epsilon to 1e-16 and show that you want it to be lower than the gradient RMS range.https://x.com/eliebakouch/status/1961999252311204147 50 Others > They train on 20T tokens for phase 1, “multiple T of tokens” for mid training on STEM/code data (70% of the mixture), 100B for long context extension without yarn (80B for 32k, 20B for 128k). The long context documents represent 25% of the mixture (not sure if it’s % of documents or tokens, which changes a lot here). > Pre-training data pipeline is context extraction, quality filtering, dedup. > Nice appendix where they show they compare top_k needed for different benchmarks (higher MMLU with 8.32, lower GSM8K with 7.46). They also compare token allocation in deep/shallow layers. > They release two new benchmarks Meeseeks (multi-turn IF) and VitaBench (real-world business scenario). > Lots of details in the infra/inference with info on speculative decoding acceptance, quantization, deployment, kernel optimization, coms overlapping, etc. > List of the different relevent paper in thread 🧵https://x.com/eliebakouch/status/1961999252311204147 .
- Open-Source RL Infrastructure slime v0.1.0 Released: THUDM and Zhipu AI have open-sourced slime, the reinforcement learning infrastructure that powered models like GLM-4.5 3 🚀 Introducing slime v0.1.0 — An open-source RL infra powering models like GLM-4.5, built by THUDM & Zhipu AI. @ Zai_org RL infra 朱小霖 shared a deep dive on Zhihu into how they redefined high-performance RL infra👇https://x.com/ZhihuFrontier/status/1962751555591086226 . It features high-performance inference for large MoE models, unified memory offloading, and CPU Adam for training with fewer GPUs 2 🛠️ What’s new in v0.1.0? • High-performance inference for large MoE models → FP8 rollout, DeepEP, MTP • Unified memory offload strategy → More KV cache, higher concurrency • CPU Adam for low-GPU training • Supports full Megatron parallel strategies + DeepEP • Supports GSPO for MoE training, TIS for FP8 rollout • CI checks for MoE & dense correctness (KL metrics, etc.)https://x.com/ZhihuFrontier/status/1962751555591086226 . The release aims to provide a strong baseline for future RL infrastructure benchmarks 1 Slime v0.1.0 checks all the essential boxes — still room to grow, but it’s ready to be a baseline for future RL infra benchmarks. 💪https://x.com/ZhihuFrontier/status/1962760198176870613 .
- In-Depth Analysis of vLLM Inference System: A new blog post provides a comprehensive explanation of how high-throughput LLM inference engines like vLLM work 61 New in-depth blog post - “Inside vLLM: Anatomy of a High-Throughput LLM Inference System”. Probably the most in depth explanation of how LLM inference engines and vLLM in particular work!https://x.com/gordic_aleksa/status/1962545137613173124 . It covers the basics of inference flow, advanced techniques like paged attention and speculative decoding, and methods for scaling to trillion-parameter models (a toy paged-attention sketch follows this list) 60 * Basics of inference engine flow (input/output request processing, scheduling, paged attention, continuous batching)https://x.com/gordic_aleksa/status/1962545137613173124 59 * “Advanced” stuff: chunked prefill, prefix caching, guided decoding (grammar-constrained FSM), speculative decoding, disaggregated P/Dhttps://x.com/gordic_aleksa/status/1962545137613173124 58 * Scaling up: going from smaller LMs that can be hosted on a single GPU all the way to trillion+ params (via TP/PP/SP) -> multi-GPU, multi-node setuphttps://x.com/gordic_aleksa/status/1962545137613173124 .
- New Research and Datasets: Several new papers and datasets were highlighted, including PAN, a new approach to world models using multimodal inputs 74 PAN (Physical, Agentic, and Nested) - a very interesting version of world models, based on the new building principles for such models.https://x.com/TheTuringPost/status/1962225645947412537 ; Droplet3D, which uses video priors for 3D generation 73 Commonsense Priors from Videos Facilitate 3D Generation https://x.com/_akhaliq/status/1962518798658855266 ; a new math benchmark created by 37 research mathematicians 36 🚨New math benchmark for AI: “This benchmark is based on 100 submissions stumping at least 1 active model. 37 research mathematicians have contributed, mostly in the areas algebra and combinatorics. The frontpage shows several sample prompts.” https://x.com/ElliotGlazer/status/1962650221395231009 ; and NVIDIA’s Nemotron-CC-Math-v1, a dataset built from Common Crawl that preserves equations and code 24 @ ZeyuanAllenZhu @ issanjeev @ PavloMolchanov @ KezhiKong @ SimonXinDong @ ctnzr @ YejinChoinka Appreciate the kind words 🙏 and the inspiration your work has provided! We also released a new math dataset: https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1 , built from Common Crawl math pages which preserves equations + code often dropped in prior work.https://x.com/KarimiRabeeh/status/1962510006135169161 .
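To make the zero-computational-expert idea from the LongCat-Flash item concrete, here is a minimal top-1 routing sketch. The shapes, expert count, and omitted training details (load-balancing losses, bias decay) are illustrative assumptions, not Meituan’s implementation:

```python
# Toy zero-computational expert in a top-1 MoE router.
# Hypothetical single-file illustration; not LongCat-Flash's code.
import torch
import torch.nn as nn

class MoEWithZeroExpert(nn.Module):
    def __init__(self, d_model: int, n_real_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_real_experts)
        )
        # One extra logit for the "zero" expert (index n_real_experts).
        self.router = nn.Linear(d_model, n_real_experts + 1)

    def forward(self, x):                       # x: [tokens, d_model]
        choice = self.router(x).argmax(dim=-1)  # top-1 expert per token
        out = x.clone()                         # zero expert = identity, no FLOPs
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])     # only "hard" tokens pay compute
        return out                              # load-balancing losses omitted

tokens = torch.randn(16, 64)
print(MoEWithZeroExpert(64)(tokens).shape)      # torch.Size([16, 64])
```

Routing easy tokens to the identity expert is what lets the average active-parameter count float below the budget of a fixed top-k router.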
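And for the vLLM deep dive above, the core of paged attention miniaturizes well: the KV cache lives in fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to blocks, much like virtual-memory paging. A toy sketch, where the block size and layout are illustrative rather than vLLM’s actual kernels:

```python
# Miniature block-table KV cache in the spirit of paged attention.
import torch

BLOCK = 4                                   # tokens per KV block (toy value)
n_blocks, n_heads, d_head = 64, 2, 8
k_pool = torch.zeros(n_blocks, BLOCK, n_heads, d_head)  # physical KV pool
v_pool = torch.zeros_like(k_pool)
free = list(range(n_blocks))                # free-list allocator
block_table = {}                            # seq_id -> list of physical blocks

def append_kv(seq_id, pos, k, v):
    """Write one token's K/V, allocating a new block on a block boundary."""
    table = block_table.setdefault(seq_id, [])
    if pos % BLOCK == 0:
        table.append(free.pop())            # allocate, like a page fault
    blk = table[pos // BLOCK]
    k_pool[blk, pos % BLOCK] = k
    v_pool[blk, pos % BLOCK] = v

def gather_kv(seq_id, length):
    """Reassemble a contiguous K/V view for attention over `length` tokens."""
    blks = block_table[seq_id]
    k = k_pool[blks].reshape(-1, n_heads, d_head)[:length]
    v = v_pool[blks].reshape(-1, n_heads, d_head)[:length]
    return k, v

for pos in range(6):                        # fills two blocks for one sequence
    append_kv("seq0", pos, torch.randn(n_heads, d_head),
              torch.randn(n_heads, d_head))
print(gather_kv("seq0", 6)[0].shape)        # torch.Size([6, 2, 8])
```

Because blocks are fixed-size and non-contiguous, memory fragmentation drops and many sequences can share one pool, which is where the throughput wins come from.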
Products & Launches
Why it matters: The pace of productization is accelerating, with major labs releasing new APIs for production use, specialized models for enterprise tasks, and a host of new tools making advanced AI capabilities more accessible to developers and users.
- OpenAI’s Realtime API is Now Generally Available: OpenAI’s Realtime API for voice agents is out of beta and ready for production 13 The Realtime API is officially out of beta and ready for your production voice agents!https://x.com/OpenAIDevs/status/1961124915719053589 . The launch includes gpt-realtime, described as its most advanced speech-to-speech model, which is reportedly 20% cheaper than GPT-4o 11 We’re also introducing gpt-realtime—our most advanced speech-to-speech model yet—plus new voices and API capabilities:https://x.com/OpenAIDevs/status/1961124915719053589 12 OpenAI Realtime API GA and new `gpt-realtime` model, 20% cheaper than 4ohttps://x.com/Smol_AI/status/1962681526921110018 . New capabilities include image input, SIP phone calling, and remote MCPs (a minimal connection sketch appears at the end of this section) 10 🔌 Remote MCPs 🖼️ Image input 📞 SIP phone calling ♻️ Reusable prompts https://x.com/OpenAIDevs/status/1961124915719053589 .
- Advanced Translation Models from Tencent and Cohere: Tencent open-sourced Hunyuan-MT-7B, a 7B-parameter translation model that won 30 of 31 categories at the WMT2025 competition and shows performance rivaling GPT-4.1 on the Flores200 benchmark 45 Hunyuan-MT-7B is a lightweight 7B model that’s a true powerhouse. It dominated the competition by winning 30 out of 31 language categories, outperforming much larger models under strict open-source and public-data constraints. On the widely-used Flores200 benchmark, its performance rivals closed-source models like GPT-4.1.🌍💬https://x.com/TencentHunyuan/status/1962466712378577300 . Cohere launched Command AI Translate, a customizable enterprise translation model that it claims beats GPT-5 and others across 23 business languages 66 Cohere dropped Command AI Translate, a customizable enterprise translation modelhttps://x.com/TheRundownAI/status/1962416242129555609 65 It beats GPT-5, Deepseek-V3, and Google on benchmarks across 23 major business languageshttps://x.com/TheRundownAI/status/1962416242129555609 .
- GLM-4.5 Becomes More Accessible via New Claude Code Plan: Zai.org announced the GLM Coding Plan for Claude Code, making its powerful GLM-4.5 model more accessible 78 Announcing GLM Coding Plan for Claude Code!https://x.com/Zai_org/status/1962522757536887205 . The new plan is 1/7th the price of the original Claude Code plans and offers three times more prompts 77 What’s new: 1/7th the price of original Claude Code plans 3x more prompts to supercharge your coding workflowshttps://x.com/Zai_org/status/1962522757536887205 . Users have praised GLM-4.5’s speed, with one report stating it is ~3x faster than Claude Code + Opus 4.1 and ~5x faster than GPT-5-high 39 Have been tinkering with GLM 4.5 for about an hour. It is about 3x faster than Claude Code + Opus 4.1 and 5x faster than GPT-5-high, but still feels just as good as closed-source models. I am definitely more productive than with other models due to GLM-4.5’s speed.https://x.com/Tim_Dettmers/status/1962603940291260533 .
- Google DeepMind’s August Product Blitz: Google had a prolific month, releasing a wide array of models and tools 85 August at Google DeepMind was like 🧞♂️ 🖼️ 🍌 🚀 🔍 🤏🏻https://x.com/_philschmid/status/1962490966230585530 . Key launches include Nano Banana (Gemini 2.5 Flash Image) 84 Nano Banana (Gemini 2.5 Flash Image)https://x.com/_philschmid/status/1962490966230585530 , Veo 3 Fast 83 Veo 3 Fasthttps://x.com/_philschmid/status/1962490966230585530 , Genie 3 82 Genie 3https://x.com/_philschmid/status/1962490966230585530 , Gemma 3 270M 81 Gemma 3 270Mhttps://x.com/_philschmid/status/1962490966230585530 , and updates to its Gemini API and AI Studio 80 Gemini API Url Contexthttps://x.com/_philschmid/status/1962490966230585530 79 AI Studio Builder (UI Rework, Prompt Suggestions, GitHub integration …)https://x.com/_philschmid/status/1962490966230585530 .
- New Developer Tools and Integrations: The community saw several new tools, including semtools, which equips Claude Code with file parsing and efficient search to analyze thousands of documents 55 Claude Code doesn’t have file understanding out of the box (it kind of does, but it’s terrible / doesn’t work over long PDFs). We equipped Claude Code with targeted tools for file parsing and efficient search, courtesy of our recently released `semtools`.https://x.com/jerryjliu0/status/1962586155523940828 ; deepagents v0.0.5, with new support for async agents and human-in-the-loop workflows 49 ➿Added built-in human in the loop support for tools 🧙Added async agents (to use with async tools) 🎊Added configurable agentshttps://x.com/hwchase17/status/1962611315060678729 ; and xpander, a self-hostable backend for AI agents that manages memory, tools, and state 5 Finally, a production-ready backend for Agents that actually works! xpander is a plug-and-play backend for Agents that manages memory, tools, states, version control, guardrails, and more.https://x.com/_avichawla/status/1962764993587564861 .
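For the Realtime API item above, a minimal WebSocket connection sketch is below. It follows the beta-era event shape (the `OpenAI-Beta` header and the `response.create` event), which may have changed in the GA release, so treat the header and event names as assumptions:

```python
# Hedged sketch: opening a Realtime API session over WebSocket and requesting
# a response. Beta-era event/header names; may differ in the GA API.
# Requires: pip install websocket-client
import json
import os

import websocket  # websocket-client

url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
ws = websocket.create_connection(url, header=[
    "Authorization: Bearer " + os.environ["OPENAI_API_KEY"],
    "OpenAI-Beta: realtime=v1",  # beta header; possibly unnecessary post-GA
])
ws.send(json.dumps({
    "type": "response.create",
    "response": {"instructions": "Greet the caller in one short sentence."},
}))
print(ws.recv())  # first server event (session/response lifecycle)
ws.close()
```

In production the same socket would stream audio frames both ways; the point here is only that a session is a plain authenticated WebSocket exchanging JSON events.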
Industry Moves
Why it matters: Strategic positioning, financial health, and market sentiment provide crucial context for understanding the long-term viability of AI companies and the evolving dynamics of the industry.
- Thesis of GenAI Disrupting Google Search Fails to Play Out: Nearly three years after ChatGPT’s launch, the widely held belief that generative AI would disrupt Google’s search monopoly has not materialized 38 Indeed, when ChatGPT was released a popularly shared thesis was that genAI would disrupt Google’s monopoly over search and its ad revenue while helping MSFT gain a foothold in this lucrative space in the new paradigm. I was quite convinced of it myself. Almost 3 years on, this thesis has failed to play out.https://x.com/stevehou0/status/1962588558734115094 . Commentary suggests that Microsoft has not made significant headway in search advertising, while Google’s deep roots in AI have allowed it to quickly adapt and solidify its market position 37 I don’t think MSFT has made any headways in ad search. If anything, Google has been made stronger and further entrenched. Its search ad revenue remains stable. Google’s deep DNA in AI has allowed it to quickly catch up with OpenAI and to more fully integrate its entire suite of products and the personal data it has on all of us.https://x.com/stevehou0/status/1962588558734115094 .
- Chipmaker Cambricon Heavily Reliant on a Single Partner, Likely ByteDance: Financial reports from Chinese chipmaker Cambricon reveal an extreme customer concentration, with a single client accounting for 79.1% of sales and 42.5% of receivables 68 Cambricon’s 1H25 report is eye-opening. The top 5 customers made up 94.6% of sales — and one client alone was a staggering 79.1%. On the balance sheet it’s the same story: that single customer accounts for 42.5% of all receivables (¥1.22B).https://x.com/poezhao0605/status/1962440663662018871 . Market chatter points to ByteDance as this crucial long-term partner, tying Cambricon’s future to ByteDance’s ambitions to scale its in-house AI models 67 Who is it? The filing only says “long-term partner.” Market chatter often points to ByteDance, which has been quietly building out AI infrastructure and testing domestic accelerators. If true, Cambricon’s fortunes are tied to how fast ByteDance wants to scale its in-house models.https://x.com/poezhao0605/status/1962440663662018871 .
- Google Trends Show Declining Interest in Some AI Coding Tools: Search interest for several AI developer tools, including Cursor, Replit, and Claude Code, has declined from recent peaks 17 google trends data on interest in lovable / replit / claude code / cursor / windsurf all downhttps://x.com/TheEthanDing/status/1962730989672595524 16 cursor:🔻~60%, peaked aug3 claude code: 🔻 56%, peaked aug3 lovable: 🔻44% peaked jul27 replit: 🔻68% peaked jul20 windsurf: 🔻 78% peaked may (ignoring drama) https://x.com/TheEthanDing/status/1962730989672595524 . Analysts are split on the meaning: it could be a sign of a maturing market with less user switching, or it could be an early indicator of a bubble popping as growth slows 15 might be nothing - market’s maturing - users not switching as much - companies monetizing responsiblyhttps://x.com/TheEthanDing/status/1962731576866767155 14 might be something… - growth slowing… is a bubble popping signhttps://x.com/TheEthanDing/status/1962731576866767155 .
- Mistral Publishes Environmental Impact Analysis for Mistral Large 2: In a move toward transparency, Mistral released an 18-month life-cycle analysis of its Mistral Large 2 model 41 Mistral published an 18-month life-cycle analysis of Mistral Large 2. The study measures greenhouse-gas emissions, energy use, and consumption of water and other materials across data-center construction, hardware manufacturing, training, and inference.https://x.com/DeepLearningAI/status/1962621715797725487 . The study calculated that training the model emitted 20,400 metric tons of greenhouse gases and used 281,000 cubic meters of water, while a single 400-token prompt and reply produces about 1.14 grams of emissions and uses about 45 milliliters of water 40 In total, training Mistral Large 2 emitted 20,400 metric tons of greenhouse gases and used 281,000 cubic meters of water, while an average 400-token prompt-plus-reply produced 1.14 grams of emissions and used 45 milliliters of water.https://x.com/DeepLearningAI/status/1962621715797725487 .
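Scaling the study’s per-prompt figures makes them easier to interpret; the arithmetic below uses only the reported numbers, with a one-billion-prompt serving volume as an arbitrary illustration:

```python
# Back-of-the-envelope scaling of Mistral's reported per-prompt footprint.
grams_per_prompt = 1.14    # g CO2e per ~400-token prompt + reply (reported)
ml_per_prompt = 45.0       # mL of water per prompt (reported)
training_tonnes = 20_400   # t CO2e to train Mistral Large 2 (reported)

prompts = 1_000_000_000    # hypothetical serving volume, purely illustrative
tonnes = prompts * grams_per_prompt / 1e6   # grams -> metric tons
cubic_m = prompts * ml_per_prompt / 1e6     # mL -> cubic meters
print(f"{tonnes:,.0f} t CO2e, {cubic_m:,.0f} m^3 water")  # 1,140 t, 45,000 m^3
print(f"{tonnes / training_tonnes:.1%} of training emissions")  # ~5.6%
```

On these reported figures, a billion served prompts adds only a few percent of the training footprint, which is why the study’s split between training and inference matters.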
Policy & Regulation
Why it matters: As AI becomes more powerful, regulatory frameworks are beginning to take shape, creating new compliance obligations for developers of large-scale models.
- First EU AI Act Reporting Deadline Passes: The first deadline for compliance with the EU AI Act passed in August. The regulation requires that all models trained with over 10^23 floating-point operations (flops) must be formally reported to a regulatory agency 64 The first deadline for EU AI act reporting passed in August, and all models over 10^23 flops must now be formally reported on to a regulatory agency.https://x.com/xlr8harder/status/1962468739099590814 . For reference, this threshold is roughly equivalent to the compute used to train a model like Llama 2 13B 63 For reference, 10^23 flops is at the level of Llama 2 13B. https://x.com/xlr8harder/status/1962468739099590814 . One commentator described the rule as “Pure (also arbitrary) insanity” 62 Pure (also arbitrary) insanity.https://x.com/jon_durbin/status/1962571905908527461 .
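The Llama 2 13B comparison follows from the common dense-transformer training-compute approximation C ≈ 6·N·D, with N parameters and D training tokens; Llama 2 reported roughly 2T training tokens. A quick check (the 6ND rule is an approximation, not the Act’s methodology):

```python
# Rough check of the "Llama 2 13B ~ 1e23 FLOPs" comparison via C ~= 6*N*D.
N = 13e9    # parameters
D = 2e12    # ~2T training tokens (Llama 2's reported corpus)
C = 6 * N * D
print(f"C = {C:.2e} FLOPs")  # ~1.56e+23, just over the 1e23 reporting line
print(C > 1e23)              # True
```

So a model of Llama 2 13B’s scale lands just above the threshold, which is what drives the “arbitrary” criticism: the line sits well below today’s frontier training runs.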
Quick Takes
Why it matters: A collection of notable user experiences, developer insights, and community discussions that add color and context to the broader AI landscape.
- Poor User Experience with GPT-5/Router: A power user reported a deeply frustrating experience with the gpt-5/router, calling its output “equivalent to a 1995 markov chain bot” 9 gpt5 router gives me results equivalent to a 1995 markov chain bothttps://x.com/nearcyan/status/1962735414714023979 . The system failed at a computer-building task by using incorrect MSRP pricing, selecting incompatible parts, citing irrelevant sources, and providing near-instant but unhelpful responses 8 i wanted gpt-5/router to give me a quote for building a computer. after it ‘looked everything up’ it used MSRP for every price even though i asked it not tohttps://x.com/nearcyan/status/1962739004555829747 7 after asking it to use ‘common sense’ it would pick parts that do not match; a PSU without sufficient wattage, and more, consistently making up bullshit excuses with every near-instant response (even more weird since it had ‘thinking’ pauses, but the UI refused to show me CoT, so i don’t know what thoughts were had if they could be called that)https://x.com/nearcyan/status/1962739004555829747 . Another user corroborated similar issues with wrong results and hallucinations 6 @ nearcyan Yeah I had similar experiences. I asked it to find a pair of skis that have been difficult to search for, which is something o3 was really good at, and it failed abysmally. Wrong results, wrong prices, hallucinations, etc. really disappointinghttps://x.com/finbarrtimbers/status/1962747349052334372 .
- Claude Code Struggles with Test-Driven Development: A developer noted that Claude Code “absolutely hates” Test-Driven Development (TDD) because its system prompts appear to compel it to ensure all tests pass, which contradicts the TDD workflow where tests are written to fail initially 4 Claude Code *absolutely* hates to do TDD. I believe its system prompts tell it to make sure every test passes all the time (whereas, in TDD when you are developing the tests, they are *supposed* to fail, until the product code is later implemented)https://x.com/QuixiAI/status/1962723795136852207 .
- The History of Scaling Laws: A post correcting the record on the origin of scaling laws gained traction, noting they were first explored not by OpenAI (2020) or Baidu (2017), but at Bell Labs in 1993 47 first i thought scaling laws originated in OpenAI (2020)https://x.com/jxmnop/status/1960314100715528627 46 then i thought they came from Baidu (2017)https://x.com/jxmnop/status/1960314100715528627 48 now i am enlightened: Scaling Laws were first explored at Bell Labs (1993) https://x.com/jxmnop/status/1960314100715528627 .
- Challenges of Multi-Source RAG: Enterprise AI systems that use Retrieval-Augmented Generation across multiple sources (like Salesforce, Gong, and Google Drive) face complex context engineering challenges. These include identity reconciliation, cross-system context understanding, metadata normalization, and respecting distributed access controls (a toy reconciliation sketch appears at the end of this section) 72 Identity reconciliation - mapping john.smith@company.com to J.Smith to Johnny across systemshttps://x.com/douwekiela/status/1962545486419959918 71 Cross-system context - understanding that a Gong call relates to a Salesforce opp and a Drive contracthttps://x.com/douwekiela/status/1962545486419959918 70 Metadata normalization - unifying how different sources tag and structure datahttps://x.com/douwekiela/status/1962545486419959918 69 Distributed entitlements - ensuring AI respects access controls from each sourcehttps://x.com/douwekiela/status/1962545486419959918 .
- AI as a Medium: A discussion emerged around embracing the “quirks, glitches and imperfections” of AI as an artistic medium rather than trying to hide them 43 so why not celebrate the quirks, the glitches and the imperfections of this tech, and use it as our medium? not to be a mirror to what already exists, not try to “be a 3d model” and get ‘outed’ on twitter, but rather to create things that didn’t exist, that were not possible without ithttps://x.com/multimodalart/status/1962292803733770529 . This perspective draws on a Brian Eno quote suggesting that a new medium’s early defects eventually become its signature 44 there’s a famous brian eno quote from 1995: > “Whatever you now find weird, ugly, uncomfortable and nasty about a new medium will surely become its signature” a less famous bit of this quote is a prediction he nailed: >“the jitteriness of digital video, the crap sound of 8-bit - all of these will be cherished and emulated as soon as they can be avoided”https://x.com/multimodalart/status/1962292803733770529 .
- Flash Attention 2 and Context Parallelism: A developer ran into issues with PyTorch, observing that Flash Attention 2 does not appear to be supported with context parallelism and only permits a causal mask, not a block sparse mask 57 So while I was trying to upgrade some SFT code to pytorch native ND parallelism I found out that Flash Attention 2 isn’t supported with context parallelism?https://x.com/code_star/status/1962588165853847678 56 or rather it uses flash attention 2 but only spda backend and you can only use a causal mask and not a block spase maskhttps://x.com/code_star/status/1962588165853847678 .
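The constraint from the Flash Attention item reproduces in a few lines on a single GPU: PyTorch’s flash SDPA backend accepts the is_causal flag but not an explicit (for example, block-sparse) attn_mask. A minimal repro sketch; the exact error behavior varies by PyTorch version and hardware:

```python
# Minimal repro: the flash-attention SDPA backend takes is_causal=True but
# rejects an explicit attn_mask. Requires a CUDA GPU with flash support and
# PyTorch >= 2.3 (for torch.nn.attention.sdpa_kernel).
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # works

mask = torch.tril(torch.ones(128, 128, device="cuda")).bool()       # explicit
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    try:
        F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    except RuntimeError as err:          # no flash kernel for explicit masks
        print("flash backend rejected the explicit mask:", err)
```

With other backends enabled, SDPA silently falls back to a slower kernel instead of erroring, which is how this limitation tends to surface in practice.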
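And as flagged in the multi-source RAG item above, identity reconciliation is the kind of glue these systems need everywhere. The toy normalizer below shows both the idea and why it is hard; the alias table and rules are illustrative assumptions, not a production matcher:

```python
# Toy identity reconciliation: map per-system user strings to a canonical form.
import re

NICKNAMES = {"johnny": "john", "jon": "john"}   # hypothetical alias table

def canonical(raw: str) -> str:
    """Normalize a per-system identity string toward 'first last' form."""
    local = raw.lower().split("@")[0]           # strip email domain
    parts = [p for p in re.split(r"[.\s_-]+", local) if p]
    parts = [NICKNAMES.get(p, p) for p in parts]
    return " ".join(parts)

records = {
    "salesforce": "john.smith@company.com",
    "gong":       "J.Smith",
    "gdrive":     "Johnny Smith",
}
print({system: canonical(who) for system, who in records.items()})
# salesforce and gdrive agree on 'john smith'; 'J.Smith' normalizes to
# 'j smith' and still needs an initial-expansion rule, which is exactly
# why cross-system reconciliation is hard.
```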

Top Stories
Why it matters: The world’s largest tech companies are intensifying the AI race with major new model releases, while internal challenges at key players and broader market analyses highlight the opportunities and risks shaping the industry’s future.
Apple Enters the Fray with High-Performance Vision Models
Apple has released FastVLM, a series of real-time Vision Language Models (VLMs), on Hugging Face, signaling a significant open-source contribution from the tech giant 67 🚨 Apple just released FastVLM on Hugging Face - 0.5, 1.5 and 7B real-time VLMs with WebGPU support 🤯https://x.com/reach_vb/status/1961471154197053769 44 Apple open sourcing artefacts on HF is a special kind of joy! https://x.com/reach_vb/status/1961481909181075961 . The models, available in 0.5B, 1.5B, and 7B parameter sizes, are engineered for efficiency and can run directly in a web browser using WebGPU 67 🚨 Apple just released FastVLM on Hugging Face - 0.5, 1.5 and 7B real-time VLMs with WebGPU support 🤯https://x.com/reach_vb/status/1961471154197053769 64 Bonus: works in REALTIME directly in your browser powered by transformers.js and WebGPU 🔥https://x.com/reach_vb/status/1961471154197053769 . Performance metrics are impressive, with the models being up to 85x faster and 3.4x smaller than comparable VLMs, and featuring a 7.9x faster Time To First Token (TTFT) 65 85x faster and 3.4x smaller than comparable sized VLMs 7.9x faster TTFT for larger models designed to output fewer output tokens and reduce encoding time for high resolution imageshttps://x.com/reach_vb/status/1961471154197053769 . Developers are already using the models to build browser-based applications for tasks like image captioning and video transcription 30 vibe coding a AI image captioning app with the new Apple FastVLM-0.5B-ONNX in anycoder, one shothttps://x.com/_akhaliq/status/1961501114445996228 11 vibe coding a video transcription AI app with Apple FastVLM on Hugging Face, a couple prompts in anycoderhttps://x.com/_akhaliq/status/1961630614039212408 .
xAI Launches Grok Code Fast 1 and an Ecosystem Push
xAI introduced Grok Code Fast 1, a model designed for speed and efficiency in agentic coding tasks 92 Introducing Grok Code Fast 1, a speedy and economical reasoning model that excels at agentic coding.https://x.com/xai/status/1961129789944627207 . In a bid to drive adoption, the model is available for free on platforms like GitHub Copilot and Cursor, with an extended free trial period 90 Now available for free on GitHub Copilot, Cursor, Cline, Kilo Code, Roo Code, opencode, and Windsurf.https://x.com/xai/status/1961129789944627207 37 Even better, they’re extending the free period another week, until Sept 10 at noon PT.https://x.com/cline/status/1961488289803939915 . Early user feedback has been positive, with reports of significant speed improvements over competitors like Claude and tasks being completed in hours instead of weeks 40 “what would have taken me weeks is only taking a couple hours”https://x.com/cline/status/1961488289803939915 39 “feels 10x better and faster than Claude”https://x.com/cline/status/1961488289803939915 . To support developers, xAI released a prompt engineering guide emphasizing iterative and agentic workflows 49 The team from xAI just released a prompt engineering guide to grok-code-fast-1https://x.com/imjaredz/status/1961458885862330773 . A more advanced variant with multimodal capabilities and a longer context length is already in training 78 “A new variant that supports multimodal inputs, parallel tool calling, and extended context length is already in training.”https://x.com/kylebrussell/status/1961420296910455045 .
Meta Grapples with High-Stakes Talent Retention
Meta’s AI division faced internal turmoil as Shengjia Zhao, a co-creator of ChatGPT and the newly appointed Chief Scientist of Meta’s superintelligence labs, threatened to resign and return to OpenAI just days after starting 86 🚨BREAKING: Chief Scientist of META superintelligence labs threatened to resign Shengjia Zhao co-creator of ChatGPT within days of starting at MSL threatened to quit and return to OAI he actually even signed employment paperwork with OpenAI to rejoin Zucc managed to retain Zhao and formally announced him as Chief Scientisthttps://x.com/ns123abc/status/1961377323040526608 80 Shengjia Zhao, Co-Creator of OpenAI’s ChatGPT, Threatened to Resign Within Days of Joining Metahttps://x.com/FirstSquawk/status/1961369164108739038 . While Mark Zuckerberg successfully retained Zhao, the incident has fueled speculation about desperation and intense competition for top talent 85 Shengjia Zhao co-creator of ChatGPT within days of starting at MSL threatened to quit and return to OAI he actually even signed employment paperwork with OpenAI to rejoin Zucc managed to retain Zhao and formally announced him as Chief Scientisthttps://x.com/ns123abc/status/1961377323040526608 84 What is the name of holy Delaware Jesus is this den of corporate vipers? Could Zuck project any more desperation?https://x.com/teortaxesTex/status/1961392380155773395 . Commentators have described the situation as a “den of corporate vipers,” with some speculating that OpenAI uses its clout to send researchers on “viking raids into Meta” to secure talent and resources 84 What is the name of holy Delaware Jesus is this den of corporate vipers? Could Zuck project any more desperation?https://x.com/teortaxesTex/status/1961392380155773395 83 I imagine Altman isn’t even paying his top researchers at this point, he just gives them clout and sends on viking raids into Meta, they come back with vast loot and jokes about Zuck and dogshit infrahttps://x.com/teortaxesTex/status/1961400584545734741 .
Is the AI Boom Another Dot-Com Bubble? One Analyst Argues No.
Despite over a trillion dollars invested in AI data centers and concerns of overbuilding due to FOMO, analyst Arvind Narayanan argues that a potential AI market crash would not resemble the dot-com bust 58 AI companies have invested over a trillion dollars into data centers. They’re promising that it’ll be worth it because AI will transform the economy.https://x.com/random_walker/status/1961012555305882084 57 But so far, that’s not happening. There doesn’t seem to be any measurable uptick in the GDP growth rate despite all the investment. This has led to a chorus of voices saying AI is a bubble, and it’s easy to see why. Even Mark Zuckerberg has been candid that meta might be spending too much money simply because of FOMO. Zuckerberg: I think that there’s a meaningful chance that a lot of the companies are overbuilding now, but the downside of being behind is that you’re out of position for the most important technology for the next 10-15 years.https://x.com/random_walker/status/1961012555305882084 59 So based on all this, my view is that an AI crash will look nothing like the dot com crash. It’s true that in both cases we see outrageous valuations of companies and expenditures as well. But the dot com bubble was based entirely on the expectation of future profits, and those profits never materialized because customers just weren’t interested. Whereas in the case of AI, it’s true that there is a lot of hype, but that hype is layered on top of a technology that’s already bringing lots of real value to lots of people. It’s being used by hundreds of millions of people every day, and a growing number of them are paying $20 a month or even $200 a month. All of that I think will continue.https://x.com/random_walker/status/1961012555305882084 . The key difference is that AI technology is already providing tangible value to hundreds of millions of users, with sustainable business models built on subscriptions and high-value applications like coding assistants and video generation 59 So based on all this, my view is that an AI crash will look nothing like the dot com crash. It’s true that in both cases we see outrageous valuations of companies and expenditures as well. But the dot com bubble was based entirely on the expectation of future profits, and those profits never materialized because customers just weren’t interested. Whereas in the case of AI, it’s true that there is a lot of hype, but that hype is layered on top of a technology that’s already bringing lots of real value to lots of people. It’s being used by hundreds of millions of people every day, and a growing number of them are paying $20 a month or even $200 a month. All of that I think will continue.https://x.com/random_walker/status/1961012555305882084 55 OpenAI has over 5,000 employees today. But actually running a chatbot only requires a handful of engineers. In fact, there are chatbot companies with very few employees. So if you cut out the research, operating a chatbot can actually be extremely profitable because people are willing to pay $20 a month, which translates to $240 a year for a subscription. That’s a lot more than ad-based apps typically make per user per year.https://x.com/random_walker/status/1961012555305882084 54 For example, AI agents for assisting with coding or software engineering are a lot more expensive to run than chatbots, but at the same time, the value that they bring is also much greater. If it makes a software developer, let’s say, 20% more productive. 
What that means is that it brings tens of thousands of dollars per year of value to a software company for every single developer in that company that uses such a product.https://x.com/random_walker/status/1961012555305882084 53 Here’s another AI application that I looked into. One of the most computationally intensive and expensive types of generative AI is video generation. The Wall Street Journal used AI to create a moderately high quality YouTube video as part of learning about the process of using AI to create such videos. And this is what they reported: the total cost would’ve been around a thousand dollars for Google and runway’s tools. Now, that number seems like a lot, but it’s much less expensive than the cost of producing a video with comparable quality in the traditional way.https://x.com/random_walker/status/1961012555305882084 . Narayanan contends that even if a crash halts research funding, the use of existing products would continue, supported by low inference costs and open-source alternatives 51 I’m making a subtle point here, so let me be clear. I’m not saying that there won’t be a crash. I’m saying that even if there is a crash, its effect will be on the research that’s going into AI and the development of new models. The use of existing models and products will keep going strong.https://x.com/random_walker/status/1961012555305882084 56 But we do know what it costs for a chatbot to respond to a single query or to output a given amount of text. And it’s remarkably little. You can generate thousands and thousands of pages of text for just $1, and that cost has decreased a hundred fold in the last couple of years because engineers have been able to speed up these models by making them smaller and more efficient.https://x.com/random_walker/status/1961012555305882084 52 Even if companies like OpenAI were to go out of business, smaller AI companies would step in to take their place. There are many openly available AI models which might not be as good as the leading ones, but whose quality is good enough for everyday users.https://x.com/random_walker/status/1961012555305882084 . The impact would likely be on AI research and high engineering salaries rather than mass layoffs 50 And now for the big question, what will a potential AI crash do to jobs? Once again, I think the AI situation is very different from the internet bubble. The dot com crash was so harmful because in the run up to it, many internet companies were massively overstaffed, especially considering their lack of a business model. But in AI, despite all the hype the tech sector has actually been contracting over the last few years, which is seen as a period of correction for the hiring that happened during the pandemic. And as for the rest of the economy outside of tech, AI is seen as a reason to cut jobs rather than to hire. So if there is a crash, some of these AI engineers may no longer receive enormous paychecks, but they’ll have no trouble finding other jobs in tech and companies not having AI as a readily believable excuse to cut jobs will probably be a good thing for workers on balance.https://x.com/random_walker/status/1961012555305882084 .
Research & Innovation
Why it matters: Foundational research is pushing the boundaries of model efficiency, capability, and performance, paving the way for more powerful and accessible AI systems.
UC Berkeley Researchers Unveil XQuant to Slash Memory Needs
Researchers at UC Berkeley have developed XQuant, a technique that dramatically reduces memory requirements for LLMs. The advanced version, XQuant-CL, can cut memory needs by up to 12x with almost no loss in accuracy 41 Its advanced version XQuant-CL cuts LLM memory needs up to 12×!https://x.com/TheTuringPost/status/1961475078753063322 . The method works by compressing layer input activations (X) and recomputing the Key-Value (KV) cache on-the-fly, a trade-off that leverages the fact that modern hardware is more often limited by memory speed than by raw compute power 42 It compresses the layer input activations (X) instead of compressing the KV cache directly. The KV cache is then recomputed from X on the fly. Despite this adds extra computation, it’s fine because compute is growing faster than memory bandwidth.https://x.com/TheTuringPost/status/1961475115461615984 43 This algorithm saves memory by doing a bit more computation, but it’s fine because modern hardware is usually limited more by memory speed than raw compute power, and compute is simply cheaper than memory.https://x.com/TheTuringPost/status/1961475103444840958 . This innovation could make it possible to run more powerful models on less expensive hardware.
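The trade-off is visible in a few lines: instead of caching K and V (two tensors per layer), cache a quantized copy of the layer input X (one tensor) and re-apply the projections at read time. The per-tensor int8 scheme below is a deliberate simplification of the paper’s method:

```python
# Schematic XQuant-style cache: store quantized X, recompute K/V on the fly.
import torch

d, T = 64, 32
W_k = torch.randn(d, d) / d**0.5
W_v = torch.randn(d, d) / d**0.5
X = torch.randn(T, d)                      # layer input activations

scale = X.abs().max() / 127.0              # one scale for the whole tensor
X_q = torch.clamp((X / scale).round(), -127, 127).to(torch.int8)  # cached

def kv_from_cache():
    X_hat = X_q.float() * scale            # dequantize
    return X_hat @ W_k, X_hat @ W_v        # recompute instead of storing K, V

K, V = kv_from_cache()
print((K - X @ W_k).abs().max())           # small quantization error
```

Storing one int8 tensor instead of two fp16 tensors is already roughly a 4x saving; the cross-layer delta compression in XQuant-CL is what pushes toward the reported 12x.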
DeepSeek Launches V3.1 with Agentic Focus, Mixed Reviews
DeepSeek AI released DeepSeek-V3.1, its first model geared toward the “agent era,” featuring hybrid inference modes and stronger tool-use skills 48 Introducing DeepSeek-V3.1: our first step toward the agent era! 🚀https://x.com/deepseek_ai/status/1958417062008918312 47 🧠 Hybrid inference: Think & Non-Think — one model, two modes ⚡️ Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528 🛠️ Stronger agent skills: Post-training boosts tool use and multi-step agent taskshttps://x.com/deepseek_ai/status/1958417062008918312 . The model performed well in the LM Arena, ranking in the top 3 for math and creative writing and tying with competitors like Grok-4 and Claude Opus 4 46 A few highlights: 💠 DeepSeek V3.1 is in the Top 3 for Math, Creative Writing & Longer Query 💠 DeepSeek V3.1 thinking comes in #3 for Longer Queryhttps://x.com/lmarena_ai/status/1961474406817173602 45 The new open models by @ DeepSeek_AI (MIT license) are tied with Grok-4, Kimi K2, Claude Opus 4, Qwen 3-235b-a22b-Instruct, and even it’s sibling—DeepSeek R1—making this an incredibly competitive race. 🏎️ 💨 https://x.com/lmarena_ai/status/1961474408926908551 . However, a separate evaluation on a coding test set revealed “concerning regressions,” with the model underperforming its predecessor on some tasks 88 Tested DeepSeek-V3.1 on my coding evaluation set: Mixed performance with concerning regressions. DeepSeek-V3.1 achieved an average rating of 5.68, significantly underperforming compared to top models and even showing regression from its predecessor on some tasks. https://x.com/paradite_/status/1961365802629697770 87 Mixed performance with concerning regressions. DeepSeek-V3.1 achieved an average rating of 5.68, significantly underperforming compared to top models and even showing regression from its predecessor on some tasks. https://x.com/paradite_/status/1961365802629697770 .
Claude Opus 4.1 Shows 30% Improvement in Long-Task Performance
According to METR Evals, Claude Opus 4.1 has a 50%-time-horizon of 1 hour and 45 minutes for complex software engineering tasks 33 We estimate that Claude Opus 4.1 has a 50%-time-horizon of around 1 hr 45 min (95% confidence interval of 50 to 195 minutes) on our agentic multi-step software engineering tasks. This estimate is lower than the current highest time-horizon point estimate of around 2 hr 15 min. https://x.com/METR_Evals/status/1961527692072993272 . This means the model is expected to succeed over 50% of the time on tasks that would take a human developer up to that long to complete 32 This time horizon estimate means that Claude Opus 4.1 is expected to succeed at least 50% of the time on our tasks that took human SWEs up to 1 hr 45 min. You can find estimates for other models and read the original paper here: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/https://x.com/METR_Evals/status/1961527695302672652 . This represents a statistically significant 30% improvement over its predecessor, Claude Opus 4 31 This point estimate is 30% longer than that of Claude Opus 4. This difference is statistically significant, with Opus 4.1 beating Opus 4 in 97% of our bootstrap samples. Claude Opus 4 was released back in May, while Claude Opus 4.1 was released this month.https://x.com/METR_Evals/status/1961527693918572708 .
Novel Architectures and Frameworks Emerge
- Mixture-of-Recursions (MoR): A new architecture that builds on the Recursive Transformer, MoR gives each token its own “thinking depth” to optimize memory and compute usage through adaptive routing and efficient KV caching (a toy depth-routing sketch follows this list) 14 It’s a next-level version of Recursive Transformer that learns to give each token its own “thinking depth” and optimizes memory use.https://x.com/TheTuringPost/status/1961593983114907806 13 ▪️ Routing mechanism: “Decides” how many times each token goes through the shared recursion block. It controls recursion depth and makes MoR adaptive.https://x.com/TheTuringPost/status/1961593983114907806 12 ▪️ KV caching strategy: Identifies how and when to store/reuse key–value (KV) pairs for attention at different recursion depths. It makes MoR efficient, reducing both memory and compute.https://x.com/TheTuringPost/status/1961593983114907806 .
- Vectorless RAG: A new framework has been introduced that uses a tree structure index instead of vectors for Retrieval-Augmented Generation, exploring alternative ways to index and retrieve information 72 This vectorless RAG framework uses a tree structure index in place of vectors.https://x.com/omarsar0/status/1961446862012960840 .
- New Research Papers: Several new papers have been released, including Pref-GRPO for text-to-image reinforcement learning 79 Pref-GRPO Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning https://x.com/_akhaliq/status/1961437082888352200 , AWorld for training agentic AI 70 AWorld Orchestrating the Training Recipe for Agentic AI https://x.com/_akhaliq/status/1961456228044873888 , MCP-Bench for evaluating tool-using LLM agents 71 MCP-Bench Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers https://x.com/_akhaliq/status/1961456699564294651 , and USO for unified style and subject-driven generation 66 USO Unified Style and Subject-Driven Generation via Disentangled and Reward Learning https://x.com/_akhaliq/status/1961455755111842126 .
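To make the MoR routing idea from the first item concrete, here is a toy per-token depth router over a shared block. Shapes, the argmax routing, and the recursion cap are illustrative assumptions, not the paper’s design:

```python
# Toy Mixture-of-Recursions-style block: a router picks a per-token recursion
# depth; tokens re-enter the shared block until their depth is exhausted.
import torch
import torch.nn as nn

class ToyMoR(nn.Module):
    def __init__(self, d_model: int = 64, max_depth: int = 3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.depth_router = nn.Linear(d_model, max_depth)
        self.max_depth = max_depth

    def forward(self, x):                             # x: [tokens, d_model]
        depth = self.depth_router(x).argmax(-1) + 1   # 1..max_depth per token
        for step in range(self.max_depth):
            active = depth > step                     # tokens still "thinking"
            if not active.any():
                break
            # Toy version computes the block for all tokens and keeps only the
            # active ones; a real implementation would skip inactive tokens.
            x = torch.where(active.unsqueeze(-1), self.shared(x), x)
        return x, depth

out, depth = ToyMoR()(torch.randn(10, 64))
print(out.shape, depth.tolist())                      # per-token depths
```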
Products & Launches
Why it matters: A wave of new tools and platform updates is making advanced AI capabilities more accessible to developers and consumers, from real-time video processing to generative audio and sophisticated code review.
New Models and APIs Expand Developer Toolkits
- Step-Audio 2 Mini: StepFun.ai has released an open-source, 8B parameter speech-to-speech model positioned as a free alternative to GPT-4o-Audio. It supports over 50,000 voices and excels at tasks like multimodal reasoning and tool calling 77 🫣Is GPT-realtime too pricey for you? Here is a free alternative!!! 🤩Releasing Step-Audio 2 mini (7B): new SOTA open source LALM!🔥🔥🔥 ✨ Free alternative to expensive GPT Realtime API ✨ End-to-end speech I/O ✨ Advanced speech & audio understanding ✨ Expressive prosody control ✨ Intelligent speech conversation ✨ Web search support 🏆 SOTA on LibriSpeech, MMAU, URO-Bench & more! 👇Check out and try: 🔗 GitHub: https://github.com/stepfun-ai/Step-Audio2 🔗 Huggingface: https://huggingface.co/stepfun-ai/Step-Audio-2-minihttps://x.com/StepFun_ai/status/1961388312507093181 76 In less than a day, @ StepFun_ai dropped Step-Audio 2 Mini - 8B speech to speech, beats GPT-4o-Audio, Apache 2.0 licensed 🔥https://x.com/reach_vb/status/1961414067668558319 75 Trained on 8M+ hours, supports 50K+ voices, benchmarks for expressive/grounded speech 🤯https://x.com/reach_vb/status/1961414067668558319 74 Excels at Tool Calling and Multimodal RAGhttps://x.com/reach_vb/status/1961414067668558319 .
- OpenAI Realtime API with Video: OpenAI has added video support to its Realtime API 63 OpenAI’s Realtime API confirmed to now support video!https://x.com/yi_ding/status/1961433141740278229 . While early testers found it “insanely cool,” they also reported significant issues with instruction following and screen sharing 60 It’s still insanely cool to hear the model “see” me in real time. Great job team!https://x.com/yi_ding/status/1961433141740278229 62 Adding video seems to cause instruction following to fall off a cliff. I’ve changed the instructions to only keep the one about speaking English and even then half the time they start the conversation with “Hola” or “Salut!” or “Priviet!”https://x.com/yi_ding/status/1961433141740278229 61 For some reason screen sharing isn’t quite working. @ juberti maybe you can see if I’m doing something wrong in the code? The AI is happy to hallucinate a non-existent scene of a person doing something though. Could be that the video side is trained mainly on data of people doing stuff.https://x.com/yi_ding/status/1961433141740278229 .
- Microsoft’s First In-House Models: Microsoft AI CEO Mustafa Suleyman announced the company’s first homegrown models, MAI-Voice-1 and MAI-1-preview 95 Excited to share our first @ MicrosoftAI in-house models: MAI-Voice-1 and MAI-1-preview. Details and how you can test below, with lots more to come⬇️ https://x.com/mustafasuleyman/status/1961111770422186452 .
Innovative Tools for Creators and Developers
- Suno Studio: Suno has unveiled Studio, described as the first generative audio workstation. It allows users to create songs, split tracks into stems, and edit audio in a DAW-like interface 36 Studio is the first-ever generative audio workstation where you can: - Make original songs from scratch - Split any track into stems - Add new layers & edit like a daw - Generate endless variationshttps://x.com/SunoMusic/status/1961535924803674258 .
- CodeRabbit’s Context-Aware Reviews: CodeRabbit has launched a sophisticated AI code review pipeline that emphasizes “context engineering” by pulling data from dozens of sources to provide deep architectural insights and reduce false positives 3 Context engineering > prompt engineering @ coderabbitai developed a non-linear review pipeline that gathers, filters, and structures context to perform tens of thousands of code reviews (PR and IDEs). This makes AI reviews project-aware feedback that adds real value.https://x.com/TheTuringPost/status/1961579135379411453 2 Instead of only looking at the PR, developers pull from dozens of sources: Cloned repo Code graph analysis Past PRs and issues 40+ linters & SAST tools chat learnings, etc.https://x.com/TheTuringPost/status/1961579148037747102 1 This @ coderabbitai ’s review pipeline shows that context engineering isn’t just a buzzword. It results in fewer false positives, deeper architectural insights, application of best practices, reviews that improve as the system learnshttps://x.com/TheTuringPost/status/1961579219986911650 .
- SemTools CLI Search: A new command-line tool, SemTools, brings fast semantic search to local filesystems without needing a vector database, enabling coding agents to efficiently parse and search documents like PDFs 25 Now you can grep a PDF (and any document) Introducing SemTools - simple parsing and semantic search for the command linehttps://x.com/LoganMarkewich/status/1961448960184520945 24 Introducing SemTools - add blazing-fast semantic search to your entire filesystem without a vector database ⚡️https://x.com/jerryjliu0/status/1961488443663597857 .
- GPT-5 in Xcode 26: Apple’s latest Xcode 26 beta now integrates GPT-5 and Claude Sonnet 4, allowing developers to use the models directly within the IDE 27 GPT-5 is now built into Xcode 26! https://x.com/OpenAIDevs/status/1961557515331862853 16 Apple’s Xcode 26 beta 7 adds support for GPT-5 and Claude Sonnet 4, which developers can use by signing into their paid Claude account (@chancehmiller / 9to5Mac)https://x.com/Techmeme/status/1961593739429695592 .
Industry Moves
Why it matters: Strategic decisions around talent, partnerships, and hardware are shaping the competitive landscape, while legal battles over intellectual property are setting new precedents for the industry.
Musk Sues Engineer for Alleged Trade Secret Theft to OpenAI
In what is reported to be the first lawsuit of its kind, Elon Musk is suing an engineer for allegedly taking “cutting-edge AI technologies” from xAI to OpenAI 38 Elon is suing an engineer for allegedly taking secrets to OpenAI. People move between labs all the time, but as far as I know this is the first lawsuit of its kind. The complainant says it was ‘cutting-edge AI technologies with features superior to those offered by ChatGPT.’ https://x.com/AndrewCurran_/status/1961492539397341449 . The individual being sued had previously authored a paper on “foundation models and fair use,” an irony noted by commentators 23 Heh that’s pretty ironic: the person being sued for stealing xAI company secrets previously wrote a paper on “foundation models and fair use” https://x.com/giffmana/status/1961561314343457062 . The case underscores the rising tensions and high stakes in protecting proprietary AI research.
DeepSeek Signals Shift to Huawei AI Chips
Chinese AI developer DeepSeek plans to use Huawei’s AI chips for training some of its models, indicating a potential move away from Nvidia hardware 73 DeepSeek, one of China’s leading AI developers, will use Huawei’s AI chips to train some models, a sign it is starting to shift away from Nvidia.https://x.com/theinformation/status/1961417030436880773 . However, some observers are skeptical, suggesting that the necessary Huawei hardware with sufficient memory and interconnect speed is not yet available.
Data Center Spending to Exceed Office Construction
For the first time in history, spending on data centers is projected to surpass spending on office construction 21 Data center spending will pass office construction for first time in history. https://x.com/JonErlichman/status/1961430366759043238 . This shift reflects the massive infrastructure investment required to power the AI boom. Some experts suggest that data centers should be categorized separately into computation-focused (GPU) and traditional (CPU/storage) facilities 20 The data centers should be split into computation (gpu predominantly) and traditional (cpu and storage) data centershttps://x.com/BorisMPower/status/1961554633266213222 .
People on the Move
- Joanne Jang, recently named to the TIME100 AI list, is transitioning from leading OpenAI’s Model Behavior team to a new initiative within the company 96 today feels surreal. on the same day i was included in the time100 ai list, we shared internally that i’m transitioning from leading model behavior to begin something new at openai.https://x.com/joannejang/status/1961253936071151937 .
- David Ha, CEO of Sakana AI, was also named to the TIME 100 AI list. The company aims to build a “frontier AI company in Japan” with a focus on open research and providing AI products to large enterprises and the public sector 94 Sakana AI’s CEO has been selected for TIME magazine’s “TIME 100 AI” list. We are grateful for the recognition and will keep pushing toward our vision of building a frontier AI company in Japan.https://x.com/SakanaAILabs/status/1961263949619638715 93 “Building a frontier AI company in Japan”https://x.com/hardmaru/status/1955983222694928827 91 Our current business model focuses on providing AI products to large Japanese enterprises and public-sector customers. Unlike large companies that can expand into many fields, we have limited resources, so this is a strategic decision to deliver the greatest value and impact to society.https://x.com/hardmaru/status/1955983222694928827 89 At the same time, we deeply value open research and dialogue with society. We publish research papers and source code, and always strive to give knowledge back to the open-source community and society at large, because AI progress is built on global cooperation and sharing knowledge is essential to the field’s advancement. We believe that continuing to publish high-quality research and code as a Japanese company is critical for Japan to remain a significant presence in the AI world.https://x.com/hardmaru/status/1955983222694928827 .
Policy & Regulation
Why it matters: Government policies and corporate data handling practices are creating a complex regulatory environment that will influence the global development and deployment of AI.
US Export Controls Criticized for Potentially Ceding Ground
A critique of the Biden administration’s AI export controls argues that the policy’s focus on control and risk is counterproductive 9 Classic deep state Washington thinking around tech is focused purely on *control* and *risk* and has a lack of understanding of technology/developer ecosystems work.https://x.com/sriramk/status/1961072926561550366 7 The Biden export controls and diffusion rules were complicated, onerous and focused on control and not exporting our technology. They drove our allies into an alternate stack by making two key mistakes with an overtly doomer and control-focused mindset.https://x.com/sriramk/status/1961072926561550366 . The argument states that for the “American AI stack” to win globally, the focus should be on maximizing market share for U.S. hardware and models 8 As @ DavidSacks says: for the American AI stack to win, we need to maximize marketshare. This means maximizing tokens inferenced by American models running on American hardware all over the world.https://x.com/sriramk/status/1961072926561550366 . The current rules are seen as chilling U.S. open-source development and underestimating China’s capabilities, potentially driving allies toward a competing Chinese tech stack (e.g., Huawei+DeepSeek/Qwen) 6 Chilling development of open source in the US and not anticipating DeepSeek and the proliferation of Chinese OSS models [1]https://x.com/sriramk/status/1961072926561550366 5 Being off the mark on Chinese semiconductor production capacity [2].https://x.com/sriramk/status/1961072926561550366 4 Going down that path would have lead to a world that is often picking Huawei+CloudMatrix+DeepSeek/Qwen.https://x.com/sriramk/status/1961072926561550366 .
New Standards and Policies for AI Interaction
- Web Bot Authentication: A partnership with Cloudflare is supporting Web Bot Auth and Signed Agents, a new standard aimed at giving AI agents reliable and responsible web access by allowing them to authenticate themselves (a toy signing sketch follows this list) 69 That’s why we’re partnering with Cloudflare in support of Web Bot Auth and Signed Agents, a new standard to allow good bots to authenticate themselves.https://x.com/pk_iv/status/1961074403875422483 68 We believe that agents need reliable, responsible web access. That’s why we’re partnering with Cloudflare in support of Web Bot Auth and Signed Agents, a new standard to allow good bots to authenticate themselves.https://x.com/pk_iv/status/1961074403875422483 .
- Anthropic Data Retention: Anthropic clarified that for users who opt out of providing data for model training, the company maintains its existing 30-day data retention period 26 @ vikhyatk If you opt out, the retention period is 30 days (no change to the existing period). https://www.anthropic.com/news/updates-to-our-consumer-terms#:~:text=If%20you%20do%20not%20choose%20to%20provide%20your%20data%20for%20model%20training%2C%20you%E2%80%99ll%20continue%20with%20our%20existing%2030%2Dday%20data%20retention%20period . https://x.com/sammcallister/status/1961520548510400753 .
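For the Web Bot Authentication item above, the flow resembles HTTP Message Signatures (RFC 9421): the agent signs selected request components with a long-lived key, and the site verifies against a registered public key. The sketch below simplifies the signature base and header layout and is not Cloudflare’s exact profile (requires the `cryptography` package):

```python
# Toy illustration of an agent identifying itself with an HTTP message
# signature, in the spirit of Web Bot Auth / RFC 9421. Signature base and
# header layout are simplified assumptions, not the deployed profile.
import base64
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()           # agent's long-lived identity key
created = int(time.time())
covered = '("@method" "@authority" "@path")'
base = "\n".join([                           # simplified signature base
    '"@method": GET',
    '"@authority": example.com',
    '"@path": /news',
    f'"@signature-params": {covered};created={created};keyid="my-agent-key"',
])
signature = base64.b64encode(key.sign(base.encode())).decode()

headers = {
    "Signature-Input": f'sig1={covered};created={created};keyid="my-agent-key"',
    "Signature": f"sig1=:{signature}:",
}
print(headers)  # the receiving site verifies with the agent's public key
```

The point of the standard is that verification keys are registered out of band, so “good bots” can be distinguished from spoofed user agents without relying on IP ranges.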
Quick Takes
- Coding Model Showdown: An informal user test comparing GPT-5, Claude, Gemini, and Grok for bug fixing found GPT-5 to be the “strongest contender,” while Grok failed repeatedly, at one point inventing placeholder content 22 clearly the strongest contender so far considering grok whiffed the task so badly (but don’t worry, we’ll give Grok a second try just in case it was a fluke)https://x.com/nptacek/status/1961554825763586264 19 nope, shouldn’t have given grok a second chance it has brought shame upon its lineage there was *zero* placeholder fluff in any of the context window whatsoever so grok literally invented it. wtaf. https://x.com/nptacek/status/1961555847265329235 .
- ChatGPT’s Hidden ‘Thinking’ Slider: A new version of the ChatGPT web app includes a hidden feature to control the model’s “thinking effort,” with levels ranging from “Light” to “Max” 28 The new ChatGPT web app version has an updated (hidden) thinking effort picker - Max thinking (200), Extended thinking (48), Standard thinking (18), Light thinking (5)https://x.com/btibor91/status/1961547918428836254 .
- The ‘Dark Leisure’ Theory: A theory proposes that AI productivity gains by individual employees may not translate to company-wide output, as saved time is often spent on personal activities during work hours, dubbed “Dark Leisure” 82 Here’s my “Dark Leisure” theory of any potential productivity paradox in AI:https://x.com/fabianstelzer/status/1926000937702764635 81 instead, those, say, 5 extra hours become “Dark Leisure”: browsing the web, X, reddit, sports, etchttps://x.com/fabianstelzer/status/1926000937702764635 .
- Claude Reliability Concerns: Some users are reporting a “consistent uptick in both downtime and refusals” from Anthropic’s Claude model, leading one user to downgrade their subscription 17 when the model worked, it was great, but there has been a consistent uptick in both downtime and refusals over the past couple monthshttps://x.com/nptacek/status/1961582910206599322 18 i already downgraded my claude max subscription, and can’t really say that it was worth it at all, which is really disappointinghttps://x.com/nptacek/status/1961582910206599322 .
- AI Outperforms Doctors: A study testing OpenAI’s o1-preview model on clinical reasoning found the AI was correct in its diagnosis 80% of the time, compared to 30% for human doctors 35 AI now outperforms doctors in diagnosis. AI was right 80% of the time. Doctors, 30%.https://x.com/LiorOnAI/status/1961504620221465033 34 The study tests OpenAI’s o1-preview model on clinical reasoning, not multiple choice medical exams.https://x.com/LiorOnAI/status/1961504620221465033 .
- GPT-4o Performance: Users noted a potential performance downgrade in GPT-4o, reflected in a score change on the LMSys Chatbot Arena Leaderboard 15 oh wow, so people weren’t so crazy in suggesting performance downgrade https://x.com/JacquesThibs/status/1961595677211070500 .
- Humanoid Robotics: A humanoid robot has been developed that can sustain a table tennis rally for over 100 shots against a human 97 🏓🤖 Our humanoid robot can now rally over 100 consecutive shots against a human in real table tennis — fully autonomous, sub-second reaction, human-like strikes. https://x.com/ZhiSu22/status/1961244573658673222 . Separately, it was noted that humanoids are learning to clean houses and will soon be available for purchase 10 Somewhere, humanoid robots are learning to clean a house better than us, and soon we will be able to buy them.https://x.com/Dr_Singularity/status/1961316888295928292 .

Top Stories
Why it matters: The most significant developments this period signal a maturing industry grappling with safety, pushing major product updates, and rethinking the fundamental paradigms of AI training. A rare collaboration between competitors on safety evaluations points to a new phase of shared responsibility, while major product launches and a focus on interactive learning environments highlight the accelerating pace of innovation.
OpenAI and Anthropic Conduct Joint Safety Evaluations
In a rare move for competitors, OpenAI and Anthropic agreed to test each other’s models using their respective internal safety and alignment evaluations 73 It’s rare for competitors to collaborate. Yet that’s exactly what OpenAI and @ AnthropicAI just did—by testing each other’s models with our respective internal safety and alignment evaluations. Today, we’re publishing the results.https://x.com/woj_zaremba/status/1960757419245818343 54 Early this summer, OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others’ models. After discussing our results privately, we’re now sharing them with the world. 🧵 https://x.com/sleepinyourhat/status/1960749648110395467 . The companies have now publicly shared their findings, which they frame as a pilot program toward a “race to the top” in safety 72 Frontier AI companies will inevitably compete on capabilities. But this work with Anthropic is a small, meaningful pilot toward a “race to the top” in safety. The fact that competitors collaborated is more significant than the findings themselves, which are mostly basic.https://x.com/woj_zaremba/status/1960757419245818343 . The collaboration is seen as more significant than the findings themselves, which were described as mostly basic 72 Frontier AI companies will inevitably compete on capabilities. But this work with Anthropic is a small, meaningful pilot toward a “race to the top” in safety. The fact that competitors collaborated is more significant than the findings themselves, which are mostly basic.https://x.com/woj_zaremba/status/1960757419245818343 .
Key findings revealed “some examples of concerning behavior in all the models we tested.” 53 We found some examples of concerning behavior in all the models we tested. Compared to the Claude 4 models, o3 looks pretty robustly aligned, if fairly cautious. GPT-4o and GPT-4.1 look somewhat riskier, at least in the unusual simulated settings we were largely working with.https://x.com/sleepinyourhat/status/1960749650941763695 . Compared to the Claude 4 models, o3 looked “pretty robustly aligned, if fairly cautious,” while GPT-4o and GPT-4.1 appeared “somewhat riskier” in the unusual simulated settings used 53 We found some examples of concerning behavior in all the models we tested. Compared to the Claude 4 models, o3 looks pretty robustly aligned, if fairly cautious. GPT-4o and GPT-4.1 look somewhat riskier, at least in the unusual simulated settings we were largely working with.https://x.com/sleepinyourhat/status/1960749650941763695 . The evaluations took place before the launch of GPT-5 and Claude 4.1 51 (All of this took place before the launch of GPT-5 and Claude 4.1.)https://x.com/sleepinyourhat/status/1960749652753502349 . Both organizations stressed that the effort was complex and should be seen as a pilot rather than a source of definitive findings 52 This collaborative evaluation effort was surprisingly complex to pull off, and we see it as more of a pilot than as a source of definitive findings about any of the models we tested.https://x.com/sleepinyourhat/status/1960749657853763820 .
OpenAI Releases Major Codex Update Powered by GPT-5
OpenAI has launched a suite of new features for Codex, its AI coding assistant, now powered by GPT-5 and available through existing ChatGPT plans 38 We’re releasing new Codex features to make it a more effective coding collaborator:https://x.com/OpenAIDevs/status/1960809814596182163 37 Powered by GPT-5 and available through your ChatGPT plan.https://x.com/OpenAIDevs/status/1960809814596182163 . The update aims to integrate Codex more deeply into developer workflows, creating a unified agent across multiple environments 36 With these updates, Codex works as one agent across your IDE, terminal, cloud, GitHub, and even on your phone — all connected by your ChatGPT account.https://x.com/OpenAIDevs/status/1960809823387443479 . Key features include:
- A new IDE extension for VS Code, Cursor, and other forks 33 Codex now runs in your IDE Available for VS Code, Cursor, and other forks, the new extension makes it easy to share context—files, snippets, and diffs—so you can work faster with Codex.https://x.com/OpenAIDevs/status/1960809816039023029 .
- Seamless hand-offs between local IDEs and cloud-based tasks 27 Hand off to the cloud from your IDE. Kick off new tasks, delegate in-progress work, and review results without leaving your editor. Then continue tasks from Codex web in your IDE to keep building locally, without losing context. https://x.com/OpenAIDevs/status/1960809817561555250 .
- Codex-driven code reviews directly within GitHub, which check pull requests against their intent 29 Codex goes beyond static analysis—it checks a PR against its intent, reasons across the codebase and dependencies, and can run code to validate the behavior of changes.https://x.com/OpenAIDevs/status/1960809819054776640 28 Set it up to auto-review new PRs in your repo, or tag @ codex. https://x.com/OpenAIDevs/status/1960809819054776640 .
- A revamped CLI with a new UI, image inputs, message queuing, and web search 31 We’ve rebuilt the Codex CLI to harness GPT-5’s agentic coding capabilities so it’s more reliable and capable.https://x.com/OpenAIDevs/status/1960809821122519470 30 We’ve redesigned the terminal UI and also added image inputs, message queuing, simplified approval modes, to-do lists, web search, and more. https://x.com/OpenAIDevs/status/1960809821122519470 .
A Shift Towards Interactive Environments for AI Training
Prime Intellect has launched the Environments Hub, an open platform for crowdsourcing reinforcement learning (RL) environments 70 Introducing the Environments Hubhttps://x.com/PrimeIntellect/status/1960783427948699680 . The initiative addresses what it calls a key bottleneck in AI progress, as large labs increasingly keep their training environments proprietary 69 RL environments are the key bottleneck to the next wave of AI progress, but big labs are locking them downhttps://x.com/PrimeIntellect/status/1960783427948699680 . The hub allows the community to create, explore, and reuse environments to contribute to open-source AGI research 68 We built a community platform for crowdsourcing open environments, so anyone can contribute to open-source AGIhttps://x.com/PrimeIntellect/status/1960783427948699680 .
This launch was highlighted by Andrej Karpathy, who noted the evolution of AI training from pretraining on internet text to the current era of interactive environments 42 In era of pretraining, what mattered was internet text. You’d primarily want a large, diverse, high quality collection of internet documents to learn from.https://x.com/karpathy/status/1960803117689397543 41 Neither of the two above are going away (imo), but in this era of reinforcement learning, it is now environments. Unlike the above, they give the LLM an opportunity to actually interact - take actions, see outcomes, etc. This means you can hope to do a lot better than statistical expert imitation. And they can be used both for model training and evaluation. But just like before, the core problem now is needing a large, diverse, high quality set of environments, as exercises for the LLM to practice against.https://x.com/karpathy/status/1960803117689397543 . He stated he is bullish on environments and agentic interactions but bearish on reinforcement learning itself, criticizing reward functions as “super sus” and proposing alternative paradigms like “system prompt learning.” 40 Final thought - personally and long-term, I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically. I think that reward functions are super sus, and I think humans don’t use RL to learn (maybe they do for some motor tasks etc, but not intellectual problem solving tasks). Humans use different learning paradigms that are significantly more powerful and sample efficient and that haven’t been properly invented and scaled yet, though early sketches and ideas exist (as just one example, the idea of “system prompt learning”, moving the update to tokens/contexts not weights and optionally distilling to weights as a separate process a bit like sleep does).https://x.com/karpathy/status/1960803117689397543 .
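To make the environment idea concrete, here is a minimal sketch of the reset/step interface such an RL environment might expose to an LLM agent. Every name below is a hypothetical illustration, not Prime Intellect’s actual API, and the reward line is exactly the part Karpathy calls into question.

```python
# Minimal sketch of an RL environment for LLM training/evaluation.
# All names here are hypothetical illustrations, not Prime Intellect's API.
from dataclasses import dataclass, field


@dataclass
class StepResult:
    observation: str   # what the model sees next (tool output, error, test log)
    reward: float      # scalar feedback; the "super sus" part, per Karpathy
    done: bool         # whether the episode has ended


@dataclass
class CodingEnv:
    """Toy environment: the agent must emit code that passes a hidden test."""
    task: str = "Write a function add(a, b) that returns a + b."
    history: list = field(default_factory=list)

    def reset(self) -> str:
        self.history.clear()
        return self.task  # initial observation / prompt

    def step(self, action: str) -> StepResult:
        self.history.append(action)
        try:
            namespace: dict = {}
            exec(action, namespace)           # run the model's code
            ok = namespace["add"](2, 3) == 5  # hidden unit test
        except Exception:
            ok = False
        return StepResult(
            observation="tests passed" if ok else "tests failed",
            reward=1.0 if ok else 0.0,
            done=True,
        )


# Rollout loop: the same interface serves both training and evaluation.
env = CodingEnv()
obs = env.reset()
result = env.step("def add(a, b):\n    return a + b")
print(result.reward)  # 1.0
```

The Hub’s premise is that many such environments, each cheap to define, can be pooled and reused across both training and evals.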
Research & Innovation
Why it matters: The latest research showcases a multi-pronged advance in AI capabilities, from specialized datasets that improve mathematical reasoning to novel architectures that challenge the dominance of standard transformers. These developments pave the way for more efficient, powerful, and scientifically-grounded models.
New Models and Architectures
- Anemoi: A new semi-centralized multi-agent system uses GPT-4.1-mini for planning and GPT-4o for worker agents, showing that smaller models can be highly effective when combined 25 Anemoi is the latest multi-agent system that proves small models pack a punch when combined effectively. GPT-4.1-mini (for planning) and GPT-4o (for worker agents) surpass the strongest open-source baseline on GAIA.https://x.com/omarsar0/status/1960799241888260513 . The system relies on agent-to-agent (A2A) communication, with collaborative refinement accounting for most of its performance gains over other systems (a sketch of this planner/worker pattern appears after this list) 24 Most extra solves over OWL come from collaborative refinement enabled by A2A (52%), with smaller gains from reduced context redundancy (8%). How agents collaborate is key to these strong results.https://x.com/omarsar0/status/1960799342979375560 .
- UltraMemV2: This memory network scales to 120B total parameters (2.5B active) and demonstrates superior long-context learning 4 UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learninghttps://x.com/rosinality/status/1960928271333535868 . It reportedly matches 8-expert Mixture-of-Experts (MoE) models at the same computation and parameter budget while requiring significantly less memory access 3 BD Seed presents UltraMem-V2. «performance parity with 8-expert [ie finegrained] MoEs under same computation and parameters but significantly lower memory access». They scale to 120B total/2.5B active. Interesting tradeoffs and ablations. Maybe by V3 this idea will mature. https://x.com/teortaxesTex/status/1960931948903170441 .
- Model Roundup: A wave of new open-source models has been released, including Nemotron-Nano-9B-v2 (a hybrid Mamba-Transformer) 8 It’s a 9B hybrid Mamba-Transformer LLM optimized for reasoning:https://x.com/TheTuringPost/status/1960840554868302082 , Intern-s1 (a 241B-parameter MoE model for scientific reasoning) 7 This is a 241B-parameter multimodal Mixture-of-Experts model with 28B active parameters, optimized for scientific reasoning:https://x.com/TheTuringPost/status/1960840538225303992 , and Ovis2.5 (a multimodal LLM with a native-resolution vision transformer) 9 This multimodal LLM integrates a native-resolution vision transformer (NaViT) for fine-grained visual perception and a “thinking mode” for reflective reasoning.https://x.com/TheTuringPost/status/1960840587168637183 .
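For the Anemoi item above: a hedged sketch of the semi-centralized shape it describes, with a small planner model and worker agents that refine each other’s output. The orchestration logic and the chat(model, prompt) helper are assumptions for illustration, not Anemoi’s implementation.

```python
# Hedged sketch of a semi-centralized planner/worker pattern in the spirit of
# Anemoi; the model names match the article, but the orchestration code and
# the chat() helper are illustrative assumptions, not Anemoi's implementation.
from typing import Callable

Chat = Callable[[str, str], str]  # (model, prompt) -> reply


def solve(question: str, chat: Chat) -> str:
    # 1. A small planner model decomposes the task.
    plan = chat("gpt-4.1-mini", f"Break this task into numbered steps:\n{question}")

    # 2. Worker agents execute steps in their own smaller contexts, which
    #    mitigates the context redundancy the paper measures (8% of gains).
    notes = [
        chat("gpt-4o", f"Task: {question}\nDo this step: {step}")
        for step in plan.splitlines() if step.strip()
    ]

    # 3. Agent-to-agent (A2A) refinement: a critique pass over the combined
    #    draft; the paper attributes most gains over OWL (52%) to this loop.
    draft = chat("gpt-4o", "Combine these notes into one answer:\n" + "\n".join(notes))
    critique = chat("gpt-4o", f"Question: {question}\nDraft: {draft}\nList any flaws.")
    return chat("gpt-4.1-mini", f"Revise the draft.\nDraft: {draft}\nFlaws: {critique}")
```

Any chat(model, prompt) -> str callable (an API client, a local model, or a stub for testing) can be passed in.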
Datasets and Training Methods
- Nemotron-CC-Math: A new dataset that reprocesses CommonCrawl math pages to better capture equations and code 66 We just released Nemotron-CC-Math 🚀 Equations on web aren’t just LaTeX-they’re in MathML, tags, inline, even images. Code shows up just as many ways. Most parsers drop it. Nemotron-CC-Math (133B tokens) reprocesses CommonCrawl math pages to capture math equations + code reliably https://x.com/KarimiRabeeh/status/1960682448867426706 . It was created by rendering webpages with Lynx and using an LLM to rewrite the output into LaTeX, reportedly leading to models getting “much better at math” (a sketch of this pipeline appears after this list) 65 As part of Nemotron, we’re releasing a new Math dataset, made by rendering webpages using Lynx and then using an LLM to rewrite the result into LaTeX.https://x.com/ctnzr/status/1960702543534989575 64 Our models got much better at math when we started using this dataset. We hope it’s helpful to the community. 💚https://x.com/ctnzr/status/1960702543534989575 .
- Reasoning vs. Memorization: François Chollet highlighted the difficulty of distinguishing true reasoning from memorization in LLMs. He suggests a simple test: tweak a question in a way that changes the answer but requires reasoning to adapt; if the model gives the same answer as before, it was likely memorized (see the probe sketch after this list) 39 A simple way to tell between the two is to tweak your question in a way that 1. changes the answer, 2. requires some reasoning to adapt to the change. If you still get the same answer as before… it was memorization.https://x.com/fchollet/status/1960808676262076629 .
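Chollet’s probe fits in a few lines: ask the original question and a perturbed variant whose correct answer differs, and flag identical answers. The bat-and-ball pair and the ask() helper are illustrative assumptions.

```python
# Tiny harness for Chollet's memorization probe: perturb the question so the
# correct answer changes; an unchanged model answer suggests memorization.
# The example pair and the ask() helper are illustrative assumptions.

def memorization_probe(ask) -> bool:
    original = ("A bat and a ball cost $1.10; the bat costs $1.00 more "
                "than the ball. What does the ball cost?")
    tweaked = ("A bat and a ball cost $1.10; the bat costs $0.90 more "
               "than the ball. What does the ball cost?")
    a1, a2 = ask(original), ask(tweaked)   # correct answers: $0.05 vs $0.10
    return a1.strip() == a2.strip()        # True -> likely memorized
```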
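The render-then-rewrite recipe behind Nemotron-CC-Math is likewise simple to sketch. The `lynx -dump` invocation is a standard text-browser feature; the prompt and the llm() helper are assumptions for illustration.

```python
# Sketch of the render-then-rewrite recipe behind Nemotron-CC-Math: lynx
# renders the page (including MathML and inline math) to plain text, and an
# LLM rewrites that text into LaTeX. Only `lynx -dump -nolist` is a real CLI;
# the prompt and the llm() helper are illustrative assumptions.
import subprocess


def render_page(url: str) -> str:
    # -dump prints the rendered page to stdout; -nolist drops the link list.
    out = subprocess.run(
        ["lynx", "-dump", "-nolist", url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return out.stdout


def rewrite_to_latex(page_text: str, llm) -> str:
    prompt = (
        "Rewrite the following rendered web page, converting every equation "
        "and code block into LaTeX while preserving the surrounding prose:\n\n"
        + page_text
    )
    return llm(prompt)
```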
Scientific and Biological AI
- Evo 2 and the Tree of Life: Research on Arc Institute’s Evo 2 foundation model, trained on DNA from all domains of life, found that it represents the tree of life as a curved manifold within its neuronal activations 56 Arc Institute trained their foundation model Evo 2 on DNA from all domains of life. What has it learned about the natural world? Our new research finds that it represents the tree of life, spanning thousands of species, as a curved manifold in its neuronal activations. (1/8) https://x.com/GoodfireAI/status/1960749185940250748 . Distances along this manifold correlate with phylogenetic distances between species, suggesting the model has learned a fundamental structure of the natural world 55 (4/8) Result 1: We can approximate the manifold via a nearest neighbor graph. Geodesic distances along the manifold (shortest paths) correlate neatly with phylogenetic distances between species! We control for sequence similarity, so something deeper is being represented here. https://x.com/GoodfireAI/status/1960749766058647801 .
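The analysis pattern described above (approximate the manifold with a nearest-neighbor graph, then compare geodesic and phylogenetic distances) is a standard manifold recipe that can be sketched directly; the arrays below are random placeholders, not the paper’s data.

```python
# Sketch of the manifold analysis described above: build a kNN graph over
# per-species activations, take shortest-path (geodesic) distances, and
# correlate them with phylogenetic distances. Arrays are random placeholders;
# the sketch assumes the kNN graph is connected.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.stats import spearmanr
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 64))   # placeholder: species x features
phylo_dist = rng.random((200, 200))        # placeholder phylogenetic distances
phylo_dist = (phylo_dist + phylo_dist.T) / 2

# kNN graph with edge weights = Euclidean distance between activations.
graph = kneighbors_graph(activations, n_neighbors=10, mode="distance")
geodesic = shortest_path(graph, method="D", directed=False)

# Correlate geodesic and phylogenetic distances over the upper triangle.
iu = np.triu_indices(200, k=1)
rho, _ = spearmanr(geodesic[iu], phylo_dist[iu])
print(f"Spearman rho = {rho:.3f}")  # near 0 here; high in the paper's result
```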
Products & Launches
Why it matters: The market is flooding with new AI-powered tools that enhance creativity, automate complex tasks, and integrate more deeply into existing platforms. These launches demonstrate a clear trend toward making advanced AI capabilities accessible to a broader range of users, from developers to content creators.
- Nano Banana (Gemini 2.5 Flash Image): This new image editing model is now available in the Gemini app and through a Glif browser extension that allows users to remix any image on the web with a right-click and a prompt 1 Our image editing model is now rolling out in @ Geminiapp - and yes, it’s 🍌🍌. Top of @ lmarena ’s image edit leaderboard, it’s especially good at maintaining likeness across different contexts. Check out a few of my dog Jeffree in honor of International Dog Day - though don’t let these fool you, he definitely prefers the couch:)https://x.com/sundarpichai/status/1960342316415087049 71 i put nano banana into a browser extension so you can remix and edit any image on the web by just going right-click + prompthttps://x.com/fabianstelzer/status/1960649240100647278 . It has been praised for its ability to maintain likeness and spatial consistency 26 I’m hoping there will be competitors soon as nothing else can get likeness and spatial consistency correct in the same way. https://x.com/TomLikesRobots/status/1960014126165733841 .
- Runway Aleph: A new tool from Runway for editing, transforming, and generating video. It can perform generalized tasks like removing a subject from a scene based on a text prompt, reducing work that previously took days to a couple of hours 46 Runway Aleph is a new way to edit, transform and generate video. Its ability to perform a wide range of generalized tasks means it can reimagine ordinary footage in endless new ways. Allowing you to turn images and videos you already have into anything you want. See below for a quick breakdown on how Aleph can effortlessly remove the subject from these scenes, just by asking it to.https://x.com/runwayml/status/1960767455154069708 45 Before Aleph, we are taking about days of work for a video like this. After Aleph, we are talking about a couple of hours.https://x.com/c_valenzuelab/status/1960768392702378266 .
- DeepSeek-V3.1 on Together AI: The 671B-parameter hybrid MoE model is now available on Together AI, whose infrastructure is built for massive MoE models and promises 99.9% uptime 35 🤖 DeepSeek-V3.1 just landed on Together AIhttps://x.com/togethercompute/status/1960835568574578736 32 Our infrastructure is built for massive MoE models like this. 99.9% uptime means your reasoning workflows actually work in production.https://x.com/togethercompute/status/1960835568574578736 . The model offers a ‘Fast mode’ for routine tasks and a ‘Thinking mode’ for complex problems; the latter jumps from 66.3% to 93.1% on the AIME 2024 benchmark 34 The performance jump when you need deep reasoning: Non-thinking: 66.3% on AIME 2024 Thinking: 93.1% on AIME 2024https://x.com/togethercompute/status/1960835570285924694 .
- Agent Client Protocol (ACP): The team behind the Zed code editor has introduced ACP, described as a “Language Server Protocol for AI agents.” 50 The @ zeddotdev team just dropped Agent Client Protocol .. ACP… another protocol…https://x.com/imjaredz/status/1960742370229805552 48 “Language Server Protocol for AI agents”https://x.com/imjaredz/status/1960742370229805552 . It aims to decouple AI coding assistants from specific editors, making agent behaviors portable across compatible environments (a hypothetical message exchange is sketched after this list) 49 Aiming to decouple AI coding assistants from specific editors, making your prompts and agent behaviors portable across any ACP-compatible environment.https://x.com/imjaredz/status/1960742370229805552 .
- Microsoft Copilot on Samsung TVs: Microsoft is bringing its Copilot AI to Samsung’s 2025 TVs. It will appear as an animated character to help users with movie recommendations and episode recaps 57 If you’ve ever… Spent longer finding a movie than watching it. Avoided continuing a show because you forgot what happened. Tried to find something to watch for 3 people with polar opposite tastes. Good news. Introducing @ Copilot on @ Samsung TVs and monitors. https://x.com/mustafasuleyman/status/1960735880966234290 2 Microsoft is set to bring its Copilot AI to Samsung’s 2025 TVs, opening a new front in the AI race The company will introduce the AI as an animated blob-like character that will help users with movie recommendations, episode recaps, and more https://x.com/mustafasuleyman/status/1960735880966234290https://x.com/TheRundownAI/status/1960953653503603046 .
- Anthropic PHP SDK: Anthropic has released a PHP SDK, expanding its supported client libraries to include Python, TypeScript, Java, Go, Ruby, and now PHP 67 We just launched the PHP SDK for the Anthropic API. We now offer SDKs for Python, TypeScript, Java, Go, Ruby, and PHP.https://x.com/alexalbert__/status/1960770715944411361 .
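On ACP: the LSP analogy suggests JSON-RPC-style framing between editor and agent. The exchange below is hypothetical; the method names and payload shapes are invented for illustration, so consult the actual ACP spec for the real protocol.

```python
# Hypothetical JSON-RPC-style exchange in the LSP-for-agents spirit of ACP.
# Method names and payload shapes are invented for illustration only.
import json


def rpc(method: str, params: dict, id_: int) -> str:
    return json.dumps({"jsonrpc": "2.0", "id": id_, "method": method,
                       "params": params})


# Editor -> agent: open a session, then send a prompt with file context.
print(rpc("session/new", {"workspaceRoot": "/home/me/project"}, 1))
print(rpc("session/prompt",
          {"text": "Add error handling to parse()",
           "context": [{"path": "src/parser.py", "selection": [40, 62]}]}, 2))
```

The decoupling claim follows from this shape: any editor that can speak the wire format can drive any compliant agent, just as LSP lets one language server serve many editors.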
Industry Moves
Why it matters: Massive infrastructure deals, intense competition for talent, and strategic partnerships underscore the high stakes in the race for AI dominance. These moves reveal the capital-intensive nature of frontier AI and highlight the key players shaping the future of the ecosystem.
- OpenAI and Oracle Plan 4.5GW Data Center: OpenAI plans a new build with Oracle to add 4.5 gigawatts of data-center capacity as part of their “Stargate” program 23 OpenAI plans a new build with Oracle that would add 4.5 gigawatts of data-center capacity, an outgrowth of their “Stargate” program.https://x.com/DeepLearningAI/status/1960900145421177053 . The Wall Street Journal reported that OpenAI will pay Oracle $30 billion annually for the project, which also involves partners like SoftBank, Microsoft, and Nvidia 22 The Wall Street Journal reported OpenAI will pay Oracle $30 billion annually. The plan follows a 1.2-gigawatt site in Abilene, Texas. Selection of other data center sites is underway, with OpenAI and Oracle teaming with SoftBank and other partners including Microsoft and Nvidia.https://x.com/DeepLearningAI/status/1960900145421177053 .
- The AI Talent War: The competition for top AI talent remains fierce. Meta is reportedly offering over $2 million per year but still losing candidates to OpenAI and Anthropic 63 Meta is currently offering $2M+/yr in offers for AI talent and still losing them to OpenAI and Anthropic. Heard ~3 such cases this week.https://x.com/deedydas/status/1932259456836129103 . Anthropic is cited as having the highest retention rate at ~80% after two years and is a top destination for AI researchers 62 Today, Anthropic has the highest ~80% retention 2 years in and is the #1 (large) company top AI researchers wants to go.https://x.com/deedydas/status/1932259456836129103 .
- Bytedance Surpasses Meta in Revenue: For the first time, Bytedance has reported higher revenue than Meta 6 Bytedance just made more money than Meta for the first time in history. https://x.com/deedydas/status/1960898807316570130 . This financial milestone is coupled with commentary suggesting Bytedance is also “making better AI.” 5 They’re making better AI, too.https://x.com/teortaxesTex/status/1960918239963136420 .
- Cerebras Inference Anniversary: Cerebras announced milestones after one year of its inference service, including serving models up to half a trillion parameters and delivering over 3,000 tokens per second 19 🎂 Cerebras Inference turns 1! 🚀https://x.com/CerebrasSystems/status/1960816846787022931 18 6x faster than when we launched — From @ Meta Llama to @ Alibaba_Qwen 3 to @ OpenAI OSS, models running on Cerebras deliver 𝟯,𝟬𝟬𝟬+ 𝘁𝗼𝗸𝗲𝗻𝘀/𝘀𝗲𝗰https://x.com/CerebrasSystems/status/1960816846787022931 . It is now the #1 provider of tokens on Hugging Face 17 #1 provider of tokens on @ huggingface 🤗https://x.com/CerebrasSystems/status/1960816846787022931 .
- Weights & Biases Partners with BT Group: W&B is partnering with UK communications provider BT Group to help scale its AI strategy, using W&B Models and Weave to improve governance, observability, and safe LLM deployment 61 We’re partnering with @ BTGroup , the UK’s leading fixed + mobile communications provider, to scale their AI strategy.https://x.com/weights_biases/status/1960779053495148704 60 With W&B Models + @ weave_wb , BT strengthens governance, observability, and safe LLM deployment for better colleague + customer experiences. https://x.com/weights_biases/status/1960779053495148704 .
Policy & Regulation
Why it matters: As AI’s influence grows, global governance structures are beginning to form. The establishment of UN-led bodies and ongoing debates about technology exports signal an increasing focus on international cooperation and risk mitigation.
- UN Establishes AI Governance Mechanisms: The UN General Assembly has created two new bodies to promote international cooperation on AI governance: the Independent International Scientific Panel on AI and the Global Dialogue on AI Governance 59 I welcome the General Assembly’s decision to establish two new mechanisms within the @ UN to promote international cooperation on AI governance.https://x.com/antonioguterres/status/1960443979016663133 58 Global cooperation and coordination are among the most urgent tasks to mitigate AI risks and ensure the technology benefits all. Congratulations to the @ UN on establishing the Independent International Scientific Panel on AI and the Global Dialogue on AI Governance.https://x.com/Yoshua_Bengio/status/1960794453293273212 . AI expert Yoshua Bengio praised the move, stating that global coordination is urgent to mitigate risks 58 Global cooperation and coordination are among the most urgent tasks to mitigate AI risks and ensure the technology benefits all. Congratulations to the @ UN on establishing the Independent International Scientific Panel on AI and the Global Dialogue on AI Governance.https://x.com/Yoshua_Bengio/status/1960794453293273212 .
- Debate on H20 Chip Exports to China: The argument that H20 chips are safe to export to China because they are only for inference is being challenged as an outdated view 21 This is an outdated way of thinking about frontier AI development.https://x.com/AlecStapp/status/1960876258591383856 . Experts now note that inference chips are used for reinforcement learning and synthetic data generation, which are critical for training next-generation models (a sketch of that loop follows this list) 20 Inference chips are now used for reinforcing learning and to create synthetic data (which is then used to train models).https://x.com/AlecStapp/status/1960876258591383856 .
- Discussion on Banning AI: A debate has emerged on social media about the feasibility of banning AI. Proponents of a ban point to fictional examples like Dune as a model for a better world, while critics argue that the widespread availability of open models makes a ban unrealistic 43 “we can’t realistically ever ban ai” read dune. a better world is possiblehttps://x.com/typedfemale/status/1960828735340601627 44 Even if they lose every copyright battle, or we pass laws, the chat models are fully available to download on any laptop, it’s never going back to how it was beforehttps://x.com/LinkofSunshine/status/1960813920748208429 .
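The technical point behind the H20 debate is that inference capacity can feed training: sample many candidate solutions on inference hardware, keep the verified ones, and train on them. A minimal sketch of that rejection-sampling loop, with generate() and verify() as assumed helpers:

```python
# Why inference capacity feeds training: sample candidates on inference
# hardware, keep the verified ones, and use them as training data (the
# rejection-sampling recipe behind much synthetic data and RL pipelines).
# The generate() and verify() helpers are assumptions for illustration.

def synthesize(prompts, generate, verify, samples_per_prompt=8):
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):      # pure inference workload
            candidate = generate(prompt)
            if verify(prompt, candidate):        # e.g. unit tests, math checkers
                dataset.append({"prompt": prompt, "completion": candidate})
    return dataset  # consumed later by a separate training job
```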
Quick Takes
Why it matters: These smaller updates, anecdotes, and expert opinions provide a ground-level view of the AI landscape, from developer challenges and community discussions to emerging trends in model interaction and design.
- Building with Subagents: An expert advises developers to “Build with subagents in mind,” arguing that modular architectures improve results, reduce context confusion, and make complex workflows easier to debug, optimize, and evaluate (a minimal sketch appears after this list) 10 Build with subagents in mind. Thank me later.https://x.com/omarsar0/status/1960877597191245974 12 It works great because of the separation of concerns, and it mitigates context confusion. The best part is that you get the benefit of using fast and smaller models with subagents.https://x.com/omarsar0/status/1960877597191245974 11 As we add complexity to this workflow, the benefits compound. Easier to debug, enable agent-to-agent communication, optimize, maintain, and evaluate.https://x.com/omarsar0/status/1960877597191245974 .
- Expert on AI Talent: A post suggests the people who will “write the future of AI” are likely not in high-paying Big Tech roles, but are low-ego, L5-L6 level individuals who are not highly active on social media 14 IMO the top people I know and who are likely to write the future of AI are: * not making 100M+ in panic-driven reorgs * not in in the GenAI/ASI org of a BigCo™️ * mostly L5-L6 * not active bloggers or super well-followed tweeters * pretty low-ego peoplehttps://x.com/egrefen/status/1960451643817816165 .
- New Claude Sonnet Rumored: A new version of Claude Sonnet is rumored to be released in September, with some users speculating that a perceived degradation in the current model’s performance signals an imminent update 16 new claude sonnet soon. probably septemberhttps://x.com/andersonbcdefg/status/1960935455492354145 15 “how do you know” cause they once again made the old one stupid. it always happenshttps://x.com/andersonbcdefg/status/1960936603108696459 .
- HealthBench on Hugging Face: OpenAI’s HealthBench is now available on Hugging Face to help developers and the healthcare community better understand model performance and safety in medical applications 47 FYI HealthBench now conveniently available on HuggingFace. We hope it helps model developers and the healthcare community understand model performance and safety. https://huggingface.co/datasets/openai/healthbenchhttps://x.com/thekaransinghal/status/1960853002383761464 .
- v0 Accepts Crypto: The UI generation service v0 now accepts cryptocurrency for credits, signaling growing interest from developer platforms in stablecoin payments 74 Say goodbye to fiat. You can now buy your v0 credits with crypto. https://x.com/v0/status/1960460674384932900 75 We’re seeing a lot of interest for stablecoin payments from AI companies and developer platformshttps://x.com/emilygsands/status/1960745184356131212 .
- Crafting Agent Exit Criteria: An observation notes that creating exit criteria for agents is an “art,” balancing the need for detail against the risk of making the agent too rigid or too vague 13 crafting exit criteria for an agent is an arthttps://x.com/cto_junior/status/1960955528169054632 .
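The subagent advice above maps onto a simple orchestration shape: route narrow tasks to cheap, fast models with fresh contexts so the main agent’s context stays clean. A hedged sketch follows; the model names and the chat() helper are assumptions, not any particular framework’s API.

```python
# Sketch of the subagent pattern: an orchestrator delegates narrow tasks to
# cheap, fast models, each with a fresh context (separation of concerns, less
# context confusion). chat() and the model names are illustrative assumptions.

SUBAGENTS = {
    "search":    ("small-fast-model", "You only search and summarize sources."),
    "code":      ("small-fast-model", "You only write and fix code."),
    "summarize": ("small-fast-model", "You only compress text without losing facts."),
}


def delegate(task: str, kind: str, chat) -> str:
    model, system = SUBAGENTS[kind]
    # Fresh context per call: a subagent never sees the orchestrator's full
    # history, which is the separation of concerns that limits confusion.
    return chat(model=model, system=system, user=task)


def orchestrate(goal: str, chat) -> str:
    findings = delegate(f"Research: {goal}", "search", chat)
    draft = delegate(f"Implement based on these findings:\n{findings}", "code", chat)
    return delegate(f"Summarize the outcome:\n{draft}", "summarize", chat)
```

The same modularity is where exit criteria bite: each subagent needs its own stopping rule, which is the balancing act the last item calls an art.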