# Long-Running Agent Loops, GPT-5.5 in Production, and the New Repo-Hardening Playbook

*By Coding Agents Alpha Tracker • May 23, 2026*

The strongest signal today is operational: practitioners are giving coding agents bounded milestones and enough runway to handle repo hardening, browser-driven training jobs, and real production code. Inside are the most copyable workflows, the tool and skill releases that matter, and the clips and repos worth studying next.

## 🔥 TOP SIGNAL

- **Long-horizon agents are finally doing the boring, high-value work.** swyx’s Kakuna flow is simple: run `/plan`, then `/goal`, and let the agent spend ~16 hours and 103 commits hardening a fragile MVP without changing product behavior. reach_vb used the same pattern with Codex: given a screenshot plus `/goal`, the agent drove a signed-in Colab session via Chrome, handled runtime weirdness, launched a T4 training job, and finished with 99/100 exact random checks [^1][^2][^3]. The real shift is milestone ownership, not autocomplete. OpenAI is formalizing that with `/goal`, but DHH still says AI-written production code needs review, and Armin Ronacher’s Clanker example shows why: a 10-line intent can still explode into a 300-line diff when the agent edits the wrong layer [^4][^5][^6][^7].

## ⚡ TRY THIS

- **Run a hardening pass before you add more features.** swyx’s Kakuna pattern is: (1) start with `/plan`, (2) switch to `/goal`, (3) let it run for a day, (4) review the self-audit and verify behavior stayed the same. The reported outcome was the same app back, but with the boring work done: tests, maintainability work, and subagent-parallelized cleanup that made the repo easier to build on long term [^1][^2].

- **Use UI context + a milestone when the work leaves the editor.** reach_vb’s minimal setup for Codex was a screenshot plus `/goal`; Codex then operated Colab through Chrome and babysat the full run. OpenAI’s docs and Google’s Anti Gravity team describe the same deeper pattern: give the agent a specific milestone or JIRA ticket, let it run, use side chats/check-ins to inspect progress, and only step in when it needs information that is not already written down [^3][^4][^8]. For risky actions, keep explicit confirmations on until trust is earned [^9].

- **Make the app agent-testable on day one.** Google’s Anti Gravity team recommends designing new apps so the agent can boot the app, click through flows, and turn those traces into Playwright-style integration tests. If you wait until later, they explicitly say existing products may need re-architecture before agents can test them cleanly [^8].

- **If you build agent tooling, validate edits in the harness, not in the prompt.** Salvatore Sanfilippo’s progression went from classic old/new replacement to line tags, then to whole-file CRC tags, and finally to a cleaner harness design where read/search remembers the last-seen lines and edit calls either fail or force a reread if those lines moved. That directly addresses line-offset drift and duplicate-occurrence failures [^10].
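
The last pattern above (remember what the agent last saw, reject edits when those lines moved) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sanfilippo’s implementation: `EditHarness`, `StaleReadError`, and the in-memory list of lines are hypothetical names and simplifications chosen for the example.

```python
from dataclasses import dataclass, field


class StaleReadError(Exception):
    """Raised when an edit targets lines the agent has not seen in their current form."""


@dataclass
class EditHarness:
    # Hypothetical in-memory "file": one string per line, no trailing newlines.
    lines: list[str]
    # 1-based line number -> text the agent saw on its most recent read.
    last_seen: dict[int, str] = field(default_factory=dict)

    def read(self, start: int, end: int) -> list[str]:
        """Return lines start..end (1-based, inclusive) and remember what was shown."""
        if start < 1 or end < start:
            raise ValueError("expected 1 <= start <= end")
        chunk = self.lines[start - 1:end]
        for offset, text in enumerate(chunk):
            self.last_seen[start + offset] = text
        return chunk

    def edit(self, start: int, end: int, replacement: list[str]) -> None:
        """Replace lines start..end only if they still match the last read."""
        if start < 1 or end < start:
            raise ValueError("expected 1 <= start <= end")
        for n in range(start, end + 1):
            if n not in self.last_seen:
                raise StaleReadError(f"line {n} was never read; read it before editing")
            current = self.lines[n - 1] if n <= len(self.lines) else None
            if current != self.last_seen[n]:
                raise StaleReadError(f"line {n} changed since the last read; reread first")
        self.lines[start - 1:end] = replacement
        # Line numbers may have shifted after the edit, so force a fresh read.
        self.last_seen.clear()


if __name__ == "__main__":
    harness = EditHarness(["a = 1", "b = 2", "c = 3"])
    harness.read(1, 3)
    harness.lines.insert(0, "# header")  # simulate a change made outside the agent
    try:
        harness.edit(2, 2, ["b = 20"])   # line 2 is now "a = 1", not what was read
    except StaleReadError as err:
        print("rejected:", err)
    harness.read(1, 4)                   # reread, then the edit goes through
    harness.edit(3, 3, ["b = 20"])
    print(harness.lines)
```

Because the check lives in the harness, the model never has to be told to verify line offsets; a drifted edit simply fails and the reread is forced by the tool.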

## 📡 WHAT SHIPPED

- **GPT-5.5 looks materially stronger on hard agent work.** DHH says it now beats Opus 4.7 for complicated agent tasks after GPT-5.2 lagged badly; in Omarchy 4, GPT-5.5 wrote the majority of 30,000 new lines, especially QML, and he still stresses review. He also says it is unusually good at explaining his own subtle Basecamp JavaScript. Study: [Omarchy PR #5856](https://github.com/basecamp/omarchy/pull/5856) [^11][^5][^12].

- **Cursor SDK is live.** You can now build custom agents with Composer 2.5 in Python and TypeScript, with docs at [cursor.com/docs/sdk/python](http://cursor.com/docs/sdk/python). Cursor is also discounting Composer usage in the SDK by 90% for the long weekend [^13][^14][^15].

- **Kakuna is a new open-source hardening skill worth watching.** swyx describes it as checklists that only harden codebases, with subagent parallelism and strong opinions about agent-friendly repo design. Repo: [swyxio/skills#kakuna-codebase-hardening-suite](https://github.com/swyxio/skills/tree/main#kakuna-codebase-hardening-suite) [^1][^16].

- **The iOS app builder SKILL went public for any agent.** Riley Brown says the package lets agents build Swift iOS apps and get them onto a phone’s Home Screen; published resources are [SKILL.md](https://ios.chorus.com/SKILL.md) and the [CLI package](https://ios.chorus.com/skill/download). The listed agent support includes ChatGPT + Codex remote, Hermes, Openclaw, Cursor/Lovable/Replit, and Claude Code [^17][^18].

- **T3 Code’s remote workflow looks polished.** Theo says it is two clicks to get a URL for remote worktrees on a Mac Mini with Tailscale built in; he also says the product is built on OpenAI’s harness, with OpenAI actively supporting development [^19][^20].

- **Review-loop skills are getting packaged.** steipete’s `codex /review`-until-clean skill is now moving into [openclaw/agent-skills](https://github.com/openclaw/agent-skills); his caveat is the right one—this cleans up issues, not system architecture [^21][^22].

- **Pi + Cursor models got a tighter bridge.** Ben Tossell one-shotted a droid SDK in ~5 minutes with Composer 2.5 Fast inside Pi; the `pi-cursor-sdk` update adds Cursor models with native capabilities plus Pi extensions/tools through an MCP bridge. Repo: [pi-droid-sdk](https://github.com/bentossell/pi-droid-sdk) [^23][^24][^25].
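
The review-until-clean idea from the review-loop item above reduces to a bounded loop. The sketch below is a generic illustration, not steipete’s skill and not a wrapper around any real CLI: `review` and `fix` are hypothetical callables you would back with your own reviewer and agent, and `max_rounds` is an assumed safety cap.

```python
from typing import Callable


def review_until_clean(
    review: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], str],
    code: str,
    max_rounds: int = 5,
) -> tuple[str, list[str], int]:
    """Alternate review and fix passes until the reviewer is silent or rounds run out.

    Returns (final_code, remaining_issues, review_rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    issues: list[str] = []
    for round_number in range(1, max_rounds + 1):
        issues = review(code)
        if not issues:
            return code, [], round_number
        if round_number == max_rounds:
            break  # out of budget: report what the last review still found
        code = fix(code, issues)
    return code, issues, max_rounds


# Toy stand-ins so the loop can be exercised without any agent attached.
def toy_review(code: str) -> list[str]:
    return [line for line in code.splitlines() if "TODO" in line]


def toy_fix(code: str, issues: list[str]) -> str:
    lines = code.splitlines()
    lines.remove(issues[0])  # address one finding per round
    return "\n".join(lines)


if __name__ == "__main__":
    print(review_until_clean(toy_review, toy_fix, "a = 1\n# TODO one\n# TODO two"))
```

The cap matters: a loop like this converges on the reviewer’s findings, so unresolved issues after the last round should go to a human rather than trigger more rounds.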

## 🎬 GO DEEPER

- **45:24-46:20 — Anti Gravity on agent-generated integration tests.** Best timeless pattern in today’s set: make the agent able to launch the app, click through it, and turn those traces into Playwright tests. They explicitly say retrofitting existing products means re-architecting pieces [^8].


[![The future of software development](https://img.youtube.com/vi/v0RQiNJ9nhw/hqdefault.jpg)](https://youtube.com/watch?v=v0RQiNJ9nhw&t=2724)
*The future of software development (45:24)*


- **06:06-06:32 — “Assign a JIRA ticket” is the cleanest long-run mental model.** Anti Gravity’s enterprise lead describes a set-it-and-forget-it loop where chat becomes a retrieval and unblocking channel instead of the place where work happens [^8].

- **41:54-43:50 (+ 46:02-47:03) — Cogent on hot vs cold context.** Best naming scheme in today’s material: *hot* context is actively used and can stay alive across handoffs; *cold* context is archived/indexed and accumulated by background processes. That is a reusable memory pattern for any long-running agent system [^26].

- **12:35-13:35 — Thibault Sottiaux on where pure vibe coding still breaks.** Fine for experiments and joy projects; if you are targeting serious scale, keep a technical owner in the loop until agents get much better at long-term maintainability [^27].

- **Repos worth studying.**
  - [Omarchy PR #5856](https://github.com/basecamp/omarchy/pull/5856) — public 30,000-line AI-heavy conversion work on a real codebase, with the author explicitly saying review is still required [^5]
  - [Kakuna codebase hardening suite](https://github.com/swyxio/skills/tree/main#kakuna-codebase-hardening-suite) — one of the clearest public examples of packaging a hardening pass (same app, more maintainable repo) into a reusable skill [^1][^16]
  - [codex-review SKILL.md](https://github.com/steipete/agent-scripts/blob/main/skills/codex-review/SKILL.md) and [openclaw/agent-skills](https://github.com/openclaw/agent-skills) — small, practical references for turning review loops into reusable skills while keeping humans responsible for architecture [^21][^22]

*Editorial take: today’s edge is not more codegen—it is giving agents a bounded goal, a harness that catches drift, and a review loop that keeps architecture debt from compounding* [^4][^10][^21][^7].

---

### Sources

[^1]: [𝕏 post by @swyx](https://x.com/swyx/status/2057876022553690327)
[^2]: [𝕏 post by @swyx](https://x.com/swyx/status/2057559570177007912)
[^3]: [𝕏 post by @reach_vb](https://x.com/reach_vb/status/2057882419257311652)
[^4]: [𝕏 post by @OpenAIDevs](https://x.com/OpenAIDevs/status/2057530209470210453)
[^5]: [𝕏 post by @dhh](https://x.com/dhh/status/2057907663967543618)
[^6]: [𝕏 post by @mitsuhiko](https://x.com/mitsuhiko/status/2057914670653038883)
[^7]: [𝕏 post by @mitsuhiko](https://x.com/mitsuhiko/status/2057916645985698302)
[^8]: [The future of software development](https://www.youtube.com/watch?v=v0RQiNJ9nhw)
[^9]: [Google I/O 2026 Recap with Logan Kilpatrick, Josh Woodward and Tulsee Doshi](https://www.youtube.com/watch?v=RsDSeMXaCak)
[^10]: [La programmazione è ancora interessante](https://www.youtube.com/watch?v=1HTtYNaCtcM)
[^11]: [𝕏 post by @dhh](https://x.com/dhh/status/2057906669158309913)
[^12]: [𝕏 post by @dhh](https://x.com/dhh/status/2057923711362080974)
[^13]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2057913121558413770)
[^14]: [𝕏 post by @cursor_ai](https://x.com/cursor_ai/status/2057913123194155070)
[^15]: [𝕏 post by @sualehasif996](https://x.com/sualehasif996/status/2057917926452482296)
[^16]: [𝕏 post by @swyx](https://x.com/swyx/status/2057876113934942507)
[^17]: [𝕏 post by @rileybrown](https://x.com/rileybrown/status/2057845419208851504)
[^18]: [𝕏 post by @anshnanda](https://x.com/anshnanda/status/2057838654182330827)
[^19]: [𝕏 post by @theo](https://x.com/theo/status/2057960907997876412)
[^20]: [𝕏 post by @theo](https://x.com/theo/status/2057964581692317905)
[^21]: [𝕏 post by @steipete](https://x.com/steipete/status/2054850632067019173)
[^22]: [𝕏 post by @steipete](https://x.com/steipete/status/2057921975410889003)
[^23]: [𝕏 post by @bentossell](https://x.com/bentossell/status/2057924512184668589)
[^24]: [𝕏 post by @fitchmultz](https://x.com/fitchmultz/status/2057854945618190697)
[^25]: [𝕏 post by @bentossell](https://x.com/bentossell/status/2057925705116049732)
[^26]: [Inside Cogent's three-agent architecture for autonomous defense | Geng Sng \(Co-founder, Cogent\)](https://www.youtube.com/watch?v=D6XWu54oG4g)
[^27]: [Head of ChatGPT & Codex: agents for normal people are HERE](https://www.youtube.com/watch?v=DPe_srf0GlI)