# Workflow Primitives Beat Model Chasing in Coding Agents

*By Coding Agents Alpha Tracker • June 1, 2026*

Model gains looked incremental today; the useful edge came from better workflow primitives. This brief covers Codex’s new workspace upgrades, transcript/versioning discipline, background QA loops, and the most relevant tool and model updates.

## 🔥 TOP SIGNAL

Riley Brown spent three hours comparing Opus 4.8 to 4.7 and could not find a meaningful difference, while still trusting GPT-5.5 more for deep agentic coding and computer control; Theo’s benchmark recap likewise put GPT-5.5 at the top on realistic long-horizon tasks, with a much better token/cost profile than Opus and Flash. [^1][^2]

The sharper edge in today’s sources is workflow design: persistent browser state, sub-agent thread spawning, transcript capture, and background QA loops all look more actionable than another minor model increment. [^1][^3][^4]

## ⚡ TRY THIS

- **Treat transcripts as artifacts, not disposable chat.** Simon Willison’s Codex Desktop workflow: export the full conversation with `Copy as Markdown`, paste it into a Gist, then link that Gist from commit messages and PRs so the decision trail survives outside the UI. If the menu item has vanished in your build, bind a custom keyboard shortcut to the command as a workaround. [^5][^6][^3][^7]

- **Split big jobs into explicit threads.** Riley Brown’s Codex super prompt asks the current chat to spin up new chat sessions inside Codex, then creates six background threads with narrow briefs and completion criteria; Theo separately says one thread per request in OpenClaw/Hermes is easier to manage than one giant conversation. After the fan-out, use `Cmd+G` to search across the resulting chat history. [^1][^8][^1]

- **Automate QA off every commit.** Peter Steinberger is teaching Codex to create a user-test scenario for each commit, run it through webVNC plus computer/browser-use tools against OpenClaw exactly like a human QA pass, and open PRs with fixes in the background. Strong template for regression hunting before manual review. [^4]

- **Use shorter prompts, then keep a failure corpus.** Theo’s rule: describe the problem and what success looks like in roughly two sentences, not a long step-by-step interface spec, and let strong models decide how to explore and test the change. When the agent fails, log the model, prompt, tools, codebase, and hash so you can replay those cases as your own mini benchmark when new agents arrive. [^2]

## 📡 WHAT SHIPPED

- **Codex workspace upgrades from a heavy user.** Riley Brown, on a 43-day streak with 4B tokens, says Codex now supports Windows computer use plus ChatGPT mobile remote control via QR-synced threads; the in-app browser preserves sign-ins, supports multiple tabs per task, and `Cmd+G` now searches full chat history. There is also a new GitHub-style activity page, and Peter Steinberger separately reports Codex writing ad-hoc codemods during a larger TypeScript migration. [^1][^9]

- **Model reality check: Opus 4.8 looks incremental.** Riley could not tell a meaningful difference between Opus 4.8 and 4.7 after three hours, still prefers Opus 4.6 for general agent work, and uses GPT-5.5 when trust, deep coding, terminal work, or computer control matter most. Theo’s benchmark recap on long-horizon tasks similarly had GPT-5.5 at 70%, GPT-5.4 at 56%, Opus 4.7 at 54%, and Sonnet 4.6 at 32%; average GPT-5.5 trial used 47k output tokens and cost $5.80 versus Opus at 97k and $16. [^1][^2]

- **Codex regression to watch.** A recent Codex Desktop update removed Simon Willison’s favorite feature, `Copy as Markdown` transcript export, prompting [issue #25201](https://github.com/openai/codex/issues/25201). His current stance: use the keyboard-shortcut workaround and think twice before auto-updating if your workflow depends on specific agent UI features. [^5][^7][^10]

- **Hermes vs OpenClaw is becoming a real product split.** Teknium says Hermes is intentionally batteries-included, can be pared back with `hermes skills config` or `hermes tools`, and can export the entire agent as a GitHub repo; Peter Steinberger says OpenClaw should stay modular and lean, because fewer skills and tools make the agent more efficient. @pocarles, who says he has used both since early versions, describes them as almost opposite visions rather than a simple better/worse ranking; Theo also used Hermes to clone a repo and run `npm publish` from his phone after putting his laptop away on a plane. [^11][^12][^13][^14]

- **New entrant to watch: MiniMax M3 and MiniMax Code.** MiniMax claims M3 hits 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, and 74.2% MCP Atlas, with 1M context via sparse attention and native multimodality. Tool endpoints are live at [MiniMax Code](http://code.minimax.io) and [platform.minimax.io](http://platform.minimax.io); weights and a tech report were promised in about 10 days. [^15]

## 🎬 GO DEEPER

- **10:57-12:41 — Riley Brown on Codex super prompts.** Best quick demo in today’s sources of a master thread creating six child threads with narrow briefs and completion criteria. [^1]

[![The Latest Codex Updates and The Truth about Opus 4.8](https://img.youtube.com/vi/84NkgiTcOdo/hqdefault.jpg)](https://youtube.com/watch?v=84NkgiTcOdo&t=657)
*The Latest Codex Updates and The Truth about Opus 4.8 (10:57)*


- **07:51-10:38 — Riley Brown on signed-in browser state inside Codex.** Useful if you want to see why persistent auth and multiple tabs matter: he jumps across already-authenticated web tools from inside the agent workspace. [^1]

[![The Latest Codex Updates and The Truth about Opus 4.8](https://img.youtube.com/vi/84NkgiTcOdo/hqdefault.jpg)](https://youtube.com/watch?v=84NkgiTcOdo&t=471)
*The Latest Codex Updates and The Truth about Opus 4.8 (7:51)*


- **23:40-25:20 — Theo on token economics versus benchmark bragging.** Fast breakdown of why GPT-5.5’s output-token and dollar profile looks materially better than Opus and Flash on the tasks he highlights. [^2]

[![AI code benchmarks lied to us](https://img.youtube.com/vi/JpSHyEIZ_bo/hqdefault.jpg)](https://youtube.com/watch?v=JpSHyEIZ_bo&t=1420)
*AI code benchmarks lied to us (23:40)*


- **Artifact to study — Simon Willison’s example transcript gist.** If you want a concrete model for storing the full agent conversation alongside code changes, start with [this transcript](https://gist.github.com/simonw/10088f6abc1fc8852bac3bda54d4bd94). [^6]

- **Issue to watch — Codex transcript export regression.** If transcript capture matters to your workflow, keep an eye on [issue #25201](https://github.com/openai/codex/issues/25201). [^5]

*Editorial take: the edge right now is less about squeezing meaning from every new model point release and more about building better state, thread, review, and artifact workflows around the agents you already trust.* [^1][^3][^4]

---

### Sources

[^1]: [The Latest Codex Updates and The Truth about Opus 4.8](https://www.youtube.com/watch?v=84NkgiTcOdo)
[^2]: [AI code benchmarks lied to us](https://www.youtube.com/watch?v=JpSHyEIZ_bo)
[^3]: [𝕏 post by @simonw](https://x.com/simonw/status/2061159078479663601)
[^4]: [𝕏 post by @steipete](https://x.com/steipete/status/2061208638027395490)
[^5]: [𝕏 post by @simonw](https://x.com/simonw/status/2061158636311958005)
[^6]: [𝕏 post by @simonw](https://x.com/simonw/status/2061158933503656436)
[^7]: [𝕏 post by @PhilippSpiess](https://x.com/PhilippSpiess/status/2061189796689494339)
[^8]: [𝕏 post by @theo](https://x.com/theo/status/2061204108582043750)
[^9]: [𝕏 post by @steipete](https://x.com/steipete/status/2061115471760441692)
[^10]: [𝕏 post by @simonw](https://x.com/simonw/status/2061159298575802506)
[^11]: [𝕏 post by @Teknium](https://x.com/Teknium/status/2060939163751592310)
[^12]: [𝕏 post by @steipete](https://x.com/steipete/status/2061072753998856696)
[^13]: [𝕏 post by @pocarles](https://x.com/pocarles/status/2061169618081951801)
[^14]: [𝕏 post by @theo](https://x.com/theo/status/2061203514626584874)
[^15]: [𝕏 post by @MiniMax_AI](https://x.com/MiniMax_AI/status/2061266317815296322)