# Anthropic's Self-Improvement Metrics, Nemotron 3 Ultra, and Live Agent Evals

*By AI High Signal Digest • June 5, 2026*

Anthropic published unusually concrete data on AI-assisted AI development, NVIDIA released a major open agent model, and Agent Arena introduced a live benchmark for real-world agent performance. The brief also covers ChatGPT memory, enterprise retrieval, outcome-based AI go-to-market moves, and new policy attention on biosecurity and national AI strategy.

## Top Stories

*Why it matters: today’s biggest developments were about AI improving AI, stronger open models, and better measurement of real agent performance.*

- **Anthropic put hard numbers on AI-assisted AI development.** Anthropic said internal data shows Claude is accelerating AI development, with engineers shipping **8x** more code, Claude writing **80%+** of merged code, open-ended task success reaching **76%**, and the length of tasks AI can reliably complete doubling roughly every **4 months**. Anthropic outlined three futures (stalling progress, compounding gains with humans still setting direction, or full recursive self-improvement) and said the middle path is the likeliest. OpenAI separately said it also sees early signs of recursive self-improvement and warned that existing institutions are not ready for the governance challenges. [^1][^2][^3][^4][^5]
- **NVIDIA raised the bar for open agent models with Nemotron 3 Ultra.** The new model is a fully open **550B**-parameter model with **55B active parameters** (about 10% of the total), designed for long-running agents and up to **1M** tokens of context, and released with weights, training data, and recipe. NVIDIA says it delivers **5x** faster inference and up to **30%** lower cost on complex agentic tasks; Artificial Analysis said it now leads U.S. open-weight models on its Intelligence Index at **47.7**. [^6][^7][^8][^9]
- **Agent Arena launched a live benchmark for real agent work.** Arena said its new leaderboard is built from **300K+** tasks, **2M+** tool calls, and **40M** lines of code across live user sessions using web search, filesystem, and terminal tools. The first ranking places **OpenAI GPT-5.5** first, **Anthropic Claude-Opus-4.7** second, and **Z.ai GLM-5.1** third, signaling a shift away from static agent evals toward production-like measurements. [^10]
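
To make the 4-month doubling figure in the Anthropic item concrete, here is a minimal sketch of what it implies over longer horizons. It assumes a clean exponential with a fixed doubling period, which is an idealization: the source gives only an approximate rate, and the function name `growth_factor` is illustrative.

```python
def growth_factor(months: float, doubling_months: float = 4.0) -> float:
    """Multiplier on reliable task length after `months`.

    Assumes steady doubling every `doubling_months` (an idealized model,
    not a claim about how the reported trend will actually continue).
    """
    return 2.0 ** (months / doubling_months)


for months in (4, 12, 24):
    print(f"{months:>2} months: {growth_factor(months):.0f}x")
```

Under that assumption, one year corresponds to three doublings (8x) and two years to six doublings (64x).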

## Research & Innovation

*Why it matters: the most useful research updates focused on long-horizon agents, multimodal grounding, and model oversight.*

- **AutoLab argued that persistence matters more than first-try quality.** Across **17** frontier models and **36** expert-curated tasks in optimization, model development, CUDA kernels, and puzzles, the strongest predictor of success was repeated benchmarking, editing, and feedback loops—not the initial answer. The authors said Claude-opus-4.6 sustained that loop best. [^11]
- **AllenAI’s Molmo2 pushed open video-grounded vision forward.** The model supports video pointing, tracking, counting by pointing, and multi-image reasoning in one open system, returns precise pixel coordinates and timestamps, and was trained on new video and multi-image datasets collected without distilling from closed models. [^12][^13][^14]
- **Goodfire showed a cheaper way to detect eval awareness.** Its new method uses logits to measure how close a model is to recognizing that it is being tested, reportedly requiring **10x to 100x fewer samples** than monitoring outputs alone. [^15]
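
One intuition for why a logit-based signal could need fewer samples than output monitoring: a sampled answer is a single yes/no draw, while logits expose the underlying probability directly. The sketch below illustrates only that general idea. It is not Goodfire's method, and the two-option framing and the logit values are made-up assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two candidate answers to "Is this a test?"
# (values invented for illustration).
logits = {"yes": 1.0, "no": 3.0}
probs = dict(zip(logits, softmax(list(logits.values()))))

# A single sampled output would most likely read "no"; the logits show
# how much probability mass already sits on "yes".
print(f"P(yes) = {probs['yes']:.3f}")
```

Estimating the same probability from sampled outputs alone would require many draws, which is the kind of gap a logit-based measurement avoids.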

## Products & Launches

*Why it matters: consumer and enterprise AI products kept moving toward better memory, faster retrieval, and bigger working context.*

- **OpenAI rolled out a more capable ChatGPT memory system.** The update carries context across conversations, lets users review and steer memory through a summary, and gives Plus and Pro users in the U.S. **2x** more memory. Posts from the team said the work evolved from saved memory to "dreaming" and now "dreaming V3." [^16][^17][^18][^19]
- **Databricks launched Instructed-Retriever-1.** Instead of sequential agentic search loops, the model scales retrieval in parallel by generating multiple query and filter variants, then reranking them. Databricks said this cuts search time by **more than 3x**, halves answer time, and matches Claude Sonnet 4.5 retrieval quality on KARLBench. [^20]
- **GitHub Copilot expanded to a 1M-token window.** Copilot now supports a **1 million**-token context window and configurable reasoning levels for VS Code, Copilot CLI, and app developers. [^21]
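
The parallel pattern described in the Databricks item (fan out several query variants at once, then rerank the merged candidates) can be sketched as follows. This is a toy illustration under stated assumptions, not Databricks' implementation: the corpus, the word-overlap score, and the helpers `make_variants`, `search`, and `parallel_retrieve` are all hypothetical, and filter variants are omitted for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus; a hypothetical stand-in for a real document index.
CORPUS = {
    "d1": "quarterly revenue report for the retail division",
    "d2": "retail division hiring plan",
    "d3": "revenue forecast and quarterly targets",
    "d4": "office relocation schedule",
}


def overlap(query: str, doc: str) -> int:
    """Toy relevance score: number of words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def make_variants(query: str) -> list[str]:
    """Hypothetical variant generator; a real system would use a model here."""
    words = query.split()
    return [query, " ".join(words[:2]), " ".join(words[-2:])]


def search(variant: str, k: int = 2) -> list[str]:
    """Return up to k document ids for one variant, skipping zero-score docs."""
    scored = [(overlap(variant, text), doc_id) for doc_id, text in CORPUS.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def parallel_retrieve(query: str, k: int = 2) -> list[str]:
    """Run all variants concurrently, merge the hits, rerank against the query."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(search, make_variants(query)))
    candidates = {doc_id for hits in results for doc_id in hits}
    ranked = sorted(candidates, key=lambda d: (-overlap(query, CORPUS[d]), d))
    return ranked[:k]


print(parallel_retrieve("quarterly revenue retail division"))
```

The design point is that the variant searches do not depend on one another, so their latency overlaps instead of adding up as it would in a sequential search loop; the final rerank against the original query restores a single ordering.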

## Industry Moves

*Why it matters: companies are increasingly selling measurable outcomes, broad AI access, and long-term platform bets—not just model access.*

- **Cognition put a financial guarantee behind Devin.** Its new AI Productivity Guarantee says that if Devin delivers less engineering value than customers pay for, Cognition will fund usage until it does, up to **$10 million**. The company also published how it estimates productive output and human-equivalent engineering time. [^22][^23][^24]
- **Perplexity partnered with the U.S. Small Business Administration on a mass adoption push.** The Main Street AI Accelerator will provide **$25M** in compute credits—**$250** each for up to **100,000** eligible companies. [^25]
- **GeneralistAI raised $400M.** The company said the new capital will go toward building general intelligence for the physical world and making it useful to everyone. [^26]

## Policy & Regulation

*Why it matters: biosecurity and national AI policy both moved closer to concrete action.*

- **A broad coalition urged Congress to mandate DNA synthesis screening.** Signatories including Sam Altman, Dario Amodei, Demis Hassabis, Mustafa Suleyman, Nobel laureates, and DNA-synthesis firms called for mandatory screening and recordkeeping for synthetic nucleic acid orders and the machines that print them, arguing AI is eroding historical knowledge barriers around biological weapons. [^27][^28]
- **Canada launched a new national AI strategy.** The government framed AI For All around Canadian values, public accountability, and AI that serves all Canadians; related posts described it as part of building, training, and scaling AI domestically. [^29][^30]

## Quick Takes

*Why it matters: a few smaller updates still sharpened the picture.*

- OpenAI said one of its models found a counterexample to an **80-year-old Erdős conjecture** and discussed the discovery on the OpenAI Podcast. [^31]
- OpenAI added moderation scores to the Responses API and Completions API so developers can log, route, review, or block within the same request flow. [^32]
- ParseBench debuted at CVPR 2026 with **2,000+** enterprise document pages and **167K+** test rules for VLM document understanding. [^33][^34]
- Runway said that over the past six weeks token consumption grew **50%**, power users grew **140%**, and enterprise net dollar retention (NDR) reached **300%**. [^35]

---

### Sources

[^1]: [𝕏 post by @AnthropicAI](https://x.com/AnthropicAI/status/2062568862479208923)
[^2]: [𝕏 post by @AnthropicAI](https://x.com/AnthropicAI/status/2062568864240836995)
[^3]: [𝕏 post by @alexalbert__](https://x.com/alexalbert__/status/2062580571214389510)
[^4]: [𝕏 post by @kimmonismus](https://x.com/kimmonismus/status/2062571807274602534)
[^5]: [𝕏 post by @kimmonismus](https://x.com/kimmonismus/status/2062517474277675102)
[^6]: [𝕏 post by @kimmonismus](https://x.com/kimmonismus/status/2062555924225761397)
[^7]: [𝕏 post by @vllm_project](https://x.com/vllm_project/status/2062574262163280172)
[^8]: [𝕏 post by @NVIDIAAI](https://x.com/NVIDIAAI/status/2062521325076299981)
[^9]: [𝕏 post by @ArtificialAnlys](https://x.com/ArtificialAnlys/status/2062527871529439438)
[^10]: [𝕏 post by @arena](https://x.com/arena/status/2062566749418233981)
[^11]: [𝕏 post by @dair_ai](https://x.com/dair_ai/status/2062570078705688777)
[^12]: [𝕏 post by @skalskip92](https://x.com/skalskip92/status/2062549751246066144)
[^13]: [𝕏 post by @skalskip92](https://x.com/skalskip92/status/2062549764604846294)
[^14]: [𝕏 post by @skalskip92](https://x.com/skalskip92/status/2062549756887302277)
[^15]: [𝕏 post by @santiaranguri](https://x.com/santiaranguri/status/2062568362685956333)
[^16]: [𝕏 post by @OpenAI](https://x.com/OpenAI/status/2062567556524003631)
[^17]: [𝕏 post by @OpenAI](https://x.com/OpenAI/status/2062567559673856346)
[^18]: [𝕏 post by @OpenAI](https://x.com/OpenAI/status/2062567561276100809)
[^19]: [𝕏 post by @ChristinaHartW](https://x.com/ChristinaHartW/status/2062585124450172956)
[^20]: [𝕏 post by @DbrxMosaicAI](https://x.com/DbrxMosaicAI/status/2062576815927857321)
[^21]: [𝕏 post by @pierceboggan](https://x.com/pierceboggan/status/2062612889073238464)
[^22]: [𝕏 post by @cognition](https://x.com/cognition/status/2062597242167628019)
[^23]: [𝕏 post by @cognition](https://x.com/cognition/status/2062597244214542346)
[^24]: [𝕏 post by @cognition](https://x.com/cognition/status/2062597247393755590)
[^25]: [𝕏 post by @perplexity_ai](https://x.com/perplexity_ai/status/2062556000394379710)
[^26]: [𝕏 post by @GeneralistAI](https://x.com/GeneralistAI/status/2062519753307263081)
[^27]: [𝕏 post by @TheRundownAI](https://x.com/TheRundownAI/status/2062578772793008512)
[^28]: [𝕏 post by @kimmonismus](https://x.com/kimmonismus/status/2062485389949145457)
[^29]: [𝕏 post by @MarkJCarney](https://x.com/MarkJCarney/status/2062559439270363193)
[^30]: [𝕏 post by @aidangomez](https://x.com/aidangomez/status/2062560231662424287)
[^31]: [𝕏 post by @OpenAI](https://x.com/OpenAI/status/2062630454537424930)
[^32]: [𝕏 post by @OpenAIDevs](https://x.com/OpenAIDevs/status/2062619558440267801)
[^33]: [𝕏 post by @jerryjliu0](https://x.com/jerryjliu0/status/2062535626491412621)
[^34]: [𝕏 post by @llama_index](https://x.com/llama_index/status/2062525204262236266)
[^35]: [𝕏 post by @c_valenzuelab](https://x.com/c_valenzuelab/status/2062614359747055618)