LexDev

Bismarck AI

Bismarck v5 is a game-experiment platform where players play Walreign Empire web games that trigger real development work through Codex/GPT-5.4. v0.4 was torn down on 2026-04-23, and the domain was retained for the rebuild.

Competes with Devin

bismarckhq.com

v5 Phase 1

Reborn

2 product eras

Era 1: Bismarck v0.4
v0.4
Cloud-hosted AI coding agent platform with persistent workspaces, Moltke orchestration, team leaderboards, XP/ranks/medals, story mode, and Theater of War gamification.
Born Feb 16, 2026; died Apr 23, 2026; superseded by Bismarck v5
- AWS staging/prod infrastructure was destroyed on 2026-04-23; no app data was backed up by explicit decision.
- Domain bismarckhq.com was retained for v5; DNS/backend callbacks were removed or left inactive until the rebuild ships.
- Archived Stripe state belongs to v0.4: Bismarck Pro, Team, and Enterprise monthly/yearly prices remain in Stripe history, with the old webhook disabled.
- Historical key metrics were waitlist_signups, weekly_workspace_provisions, and workspace_provision_p50.
Era 2: Bismarck v5
v0.5
Active
Video-game-first rebuild: players play Walreign Empire web games that trigger real Codex/GPT-5.4 development work through a shared agent-service backend.
Born Apr 24, 2026
- CliqMake has been folded under the Bismarck product line as source material and product-line lineage, not as a standalone portfolio product.
- Historical CliqMake code remains at `experimental/cliqmake` and should not be ingested as an independent product repo.

Key metrics

Metrics being defined

Sub-Products

1 sub-product

Moltke (app)
Planned

Competitive Intel

1 entry

vs. Devin
building
Edge: Devin focuses on autonomous coding agents; Bismarck v5 differentiates through game-driven development where players trigger real engineering work via Codex/GPT-5.4.

Research Hub

9 types

Roadmap

464 items

Done (356)

`$game-audience` - created `research/game-audience.md` on 2026-05-13 with target segments, platform fit, genre familiarity, motivations, adjacent audiences, validation plan, and risky assumptions.

`$game-comparables` - created `research/game-comparables.md` on 2026-05-13 with comparable titles, tags/category signals, price points, review/traction signals, update cadence lessons, creator traction hypotheses, and positioning frames.

`$game-core-loop` - created `research/game-core-loop.md` on 2026-05-13 with 10-second, 1-minute, 5-minute, 30-minute, and multi-day loops plus reward cadence, novelty sources, genre loop fit, risks, and prototype priority.

`$game-fantasy` - created `research/game-fantasy.md` on 2026-05-13 with the player fantasy, emotional pillars, vibe references, one-sentence hook, first-session promise, and validation questions.

`$game-genre-map` - created `research/game-genre-map.md` on 2026-05-13 with genre conventions, player complaints, overused mechanics, underserved combinations, player tolerance risks, and acceptance checks.

`$game-playtest-metrics` - created `research/game-playtest-metrics.md` on 2026-05-14 with first-session completion, time-to-fun, replay, confusion, quit, share, demo conversion, wishlist, and retention metrics.

`$game-prototype-test` - created `research/game-prototype-test.md` on 2026-05-14 with prototype scope, test questions, playtest script, observation checklist, success criteria, and cut/keep/amplify decisions.

`$pack install game` - enable the game research pack before running game-pack documentation skills. Verified `.agents/project.json` now declares `project_type: game` with `enabled_packs: ["game"]`; refreshed local Claude/Codex game-pack skill links on 2026-05-13.

`packages/game-test-harness` exists and owns harness-specific contracts and React UI

`pnpm dev` starts all packages with hot reload

`pnpm turbo lint` — all packages pass

`pnpm turbo test` — no regressions

`pnpm turbo typecheck` — all packages pass

A player can create a Work Order from a plain-language coding goal, choose a hat, see the recommended skill and named agent, and confirm the draft.

A player can start from a Workflow Structure and create a pre-filtered Work Order for that structure's workflow family.

A tester can create or confirm a colony name through a modal/function, and invalid/share-risky names are rejected or clearly deferred behind validation.

Add `@bismarck/house-of-walreign` workspace package.

Add a local-dev harness reset control that clears current experiment state and harness progress.

Add a random colony name generator built from two word lists.

Add a React `CityMap` surface for `colony-map` with clickable plots, district painting, saved district selection, Work Order route/status evidence, and loop action shortcuts.

Add a reducer-derived readability helper for active work, idle colonists, crisis count, council review count, and next recommended command.

Add an optional reset-state callback to the shared `GameTestHarnessPanel`.

Add default-map identity banner and first-run naming modal.

Add executable persistence coverage for manual task priority, assigned colonist, assigned zone, and status.

Add focused coverage for reset behavior in the shared harness and game dev harness.

Add focused coverage proving:

Add focused executable coverage for category defaults and role-fit guidance.

Add focused executable coverage for summary values and next-command copy.

Add focused regression coverage for empty, title-only, and title-plus-description submissions.

Add focused shell/toolbar tests for collapsed and expanded heights.

Add focused tests for generated-name shape and modal copy.

Add focused tests proving defaults, invalid names, valid rename, and older saved-state migration.

Add game dev harness coverage for clearing the current experiment state prefix.

Add or update plan/prod-absence tests only where harness copy changes require coverage.

Add package tests for variant manifest, story-frame labels, minimum-loop controls, local/mock review honesty, and storage fallback.

Add reducer-backed colony identity defaults, validation, persisted-state migration, and update action.

Add reducer/UI coverage proving:

Add review notes with decisions, residual risks, and recommended next command.

Add review notes with tests run, skipped browser checks if any, and residual risk.

Add shared harness UI coverage for resetting progress and invoking the external reset callback.

Add task category metadata or inference, including a docs-oriented category.

Add the smallest missing executable coverage for identity, persistence, readability, role-fit, QA crisis/review, and shared harness expectations.

Add visible local QA controls or seeded empty-state actions for a representative crisis and council-review path.

Agent abstraction layer exposes all 5 core methods with TypeScript types

Agent integration works (mock and live modes)

Agent sessions create feature branches on linked repos

All 17 steps complete. 34 tests pass. Typecheck/lint pass.

All 18 steps complete. Regression suite commit `cabc120`. Tests pass. Typecheck/lint pass.

All 18 steps complete. Tests pass. Typecheck/lint pass.

All 22 steps + Step 23 review complete. Cleanup commit `46b176c`. Tests pass. Typecheck/lint pass.

All 9 game packages exist as stubs with correct dependency wiring (9 specs exist, not 10)

All 9 games use consistent faction names, colors, and terminology

All Phase 16 acceptance criteria pass

All Phase 18 acceptance criteria pass.

All phase tests pass

All spec-defined variants render and are playable

All steps complete. Package tests/typecheck/lint pass.

Answer the first three first-principles questions from UI evidence only.

Apply the smallest fix that restores validation, success feedback, queue evidence, assignment defaults, and persisted task state.

Archive the prior variation plan and interview log before replacing the canonical specs.

Audit existing Phase 17 tests against each remaining acceptance criterion.

Auth flow works end-to-end (login → authenticated shell)

Auto-battler and roguelike deckbuilder have skeletal shared test plans

Browser verification confirms harness visibility, navigation, finding capture, scroll containment, and exports

Browser verification confirms the default route can complete the core loop through Work Order draft, routing, review evidence, and district planning surfaces.

Browser verification confirms the end-to-end default workflow and district planning surfaces.

Browser verification confirms the first three first-principles harness objectives can be answered from the UI.

Browser verifies Last-Session Report, Plan City district planning, and Planning Inbox organization surfaces.

Browser verifies the workflow-colony shared harness/default route.

Browser verifies Work Order draft, routing, review evidence, and named-agent roster surfaces.

Change the secondary modal action to populate the custom-name input with a generated name.

CI pipeline passes lint, type-check, test, and build

Clarify naming controls so random-name generation fills the input and Save persists the current input value.

Cleanup commit: `16416bb`

Cleanup commit: `6ecc5d2`

Cleanup commit: `e566b81`

Codex API integration passes the same test suite as the mock layer

Colonist actions trigger agent tasks and display results

Colony Sim runs through shared harness plans instead of package-local overlay semantics

Confirm assumptions for target user, story-frame parity, first judge, minimum testable loop, and evaluation method.

Confirm route, variant dimensions, fidelity, global shell, shared inspector, review evidence surface, and responsive behavior.

Confirm that the exploration should include both a Three.js 3D/isometric path and a richer 2D path.

Confirm the city as both workflow machine and project mindmap.

Confirm the exploration should compare both a Three.js 3D/isometric living-diorama path and a richer 2D renderer path.

Confirm Tinker Mode, Planning Inbox, expansion proposals, and milestone snapshot behavior.

Cost tracking accumulates per-session and per-aggregate

Decide asset strategy, interaction model, specific benchmark metrics, and migration boundary.

Decide benchmark host: integrate renderer benchmark mode into the existing local-dev Playtest harness rather than creating a separate route.

Decide benchmark result storage: local persistence plus export for run-to-run renderer regression comparison.

Decide evaluation model: formal measured renderer benchmarks plus Playtest harness qualitative taste-pass objectives.

Decide prototype comparison scope and success criteria.

Decide renderer architecture baseline: hybrid diegetic overlay from the start, not React-only HUD over a renderer stage.

Decide renderer candidates: benchmark Three.js, PixiJS, and the incumbent Phaser renderer against the same visual-fantasy slice.

Decide shared visual target: isometric 2.5D so Three.js and strict 2D renderers can be compared against the same colony composition.

Decide story-frame strategy, player role, core loop, WorkItem contract, failure model, state model, room/object model, execution boundary, and primary screens.

Define `House of Walreign` as a life-sim experiment with Royal Household and Modern Studio as equal first-class variants.

Define asset strategy, interaction model, specific benchmark metrics, and migration boundary.

Define benchmark host: integrate renderer benchmark mode into the existing local-dev Playtest harness rather than creating a separate route.

Define benchmark result storage: local persistence plus export for run-to-run renderer regression comparison.

Define district, tag, and Thread responsibilities.

Define evaluation model: formal measured renderer benchmarks plus Playtest harness qualitative taste-pass objectives.

Define five clean-sheet workflow metaphors: Codebase Colony, Agent Settlement Ops, Bug Frontier Survival, Product Expedition, and Release Train Colony.

Define implementation-ready UI anatomy for all five UX variations.

Define prototype comparison scope and success criteria: visual-fantasy-first comparison using the same tiny playable slice in both renderers.

Define renderer architecture baseline: hybrid diegetic overlay from the start so plain React chrome does not cause premature negative evaluations.

Define renderer candidates: benchmark Three.js, PixiJS, and the incumbent Phaser renderer against the same visual-fantasy slice.

Define shared visual target: isometric 2.5D so Three.js and strict 2D renderers can be compared against the same colony composition.

Demo codebase can be provisioned for new players

Demo mode and live mode entry paths both load

Dev toolbar renders, floats over game UI, and controls experiment/variant/screen switching

District Tinker Mode lets a player paint, name, color/pattern, save, cancel, and inspect at least one non-overlapping district.

Documentation and task history capture deviations, residual risk, and next-step routing.

Documentation updated with implementation deviations and follow-ups

Event streaming delivers progress updates to a test client

Existing variant switching remains available without becoming the primary test workflow

Factory-builder: all 10 UI components migrated to CSS modules

Factory-builder: all 23 sound hooks fire custom events

Factory-builder: all 3 commission hub variants functional

Factory-builder: all 4 layout variants render and are switchable via variant selector

Factory-builder: all 4 station click variants functional

Factory-builder: animation transitions match spec (durations, easings)

Factory-builder: both inspection diff variants functional

Factory-builder: both palette variants apply correctly (CSS custom properties)

Factory-builder: HUD shows all 7 data points with animation (value pulses, power warnings)

Factory-builder: StationToolbar renders 6 stations with color coding, locked/unlocked states

Factory-builder: view navigation works per layout variant (hotkeys, tabs, sidebar icons, panel toggles)

Findings export as both JSON and triage-ready Markdown

Froggy Empire presentation is consistent across genres

Game shell experiment top bar redesigned per ui-game-shell.md §6.2 (collapsed/expanded, 4 accordion sections, agent status, perf badge)

Game shell hub page redesigned per ui-game-shell.md §6.1 (row anatomy, action cluster, hypothesis preview, keyboard nav)

Gather existing Bismarck genre experiment context.

GitHub OAuth login/logout flow works end-to-end

Ground the interview in existing Colony Sim specs, game research, and current renderer implementation.

Hats affect labels, suggestions, review emphasis, and routing copy, but tests prove they do not alter execution speed, tool capability, validation strictness, cost, or permissions.

Help overlay renders dynamic shortcut cheatsheet

HITL callbacks route correctly through the abstraction

HITL review flow works within the colony theme

Identify that the current map is too plain because it relies on flat shape/plot rendering.

Implement all five UX variations: Dollhouse Director, Household Command Ledger, Agent Stories Timeline, Studio Day Planner, and Build/Live/Review Loop.

Implement one experiment route with independent `UX Variation` and `Story Frame` variant dimensions.

Implement shared fixtures, status strip, inspector, review evidence contract, and stylized 2D room/token/card surfaces.

Initial named plans exist: `starter-template-mvp`, `genre-taste-pass`, `regression-smoke`

Inspect task docs for stale unchecked Phase 18 items.

Inspect warnings and either fix or record accepted warnings with rationale.

Interview the district physical model: building count, duplicate structures, complex growth, map freedom, and city scale.

Interview the Sims-inspired genre concept against existing Bismarck game experiment constraints.

Keep `Plan City` as a focused planning table, but make city interaction available from the default map.

Keep five UX concepts: Dollhouse Director, Household Command Ledger, Agent Stories Timeline, Studio Day Planner, and Build/Live/Review Loop.

Keep objectives one-question-at-a-time with stable expected answers and failure signals.

Keep QA affordances framed as local demo/test controls, not production agent triggers.

Live agent smoke follow-up planned (commit `8c54ba6`).

Local dev shows the harness by default

Lock per-variation anatomy decisions for Dollhouse Director, Household Command Ledger, Agent Stories Timeline, Studio Day Planner, and Build/Live/Review Loop.

Manual tasks persist across Tasks/Council/Crisis navigation and local session reloads.

Mark the Step 17.2 persistence acceptance criteria complete after validation.

Mock layer returns realistic simulated responses with configurable delays

Mock/live toggle switches cleanly without code changes

Monorepo builds successfully with Turborepo

Named agents are generated from skill metadata or local fallback skill fixtures, can be recommended by hat/team, and can be saved as profile-only or profile-with-history without implying live parallel execution.

No lore contradictions between games

No regressions in Phase 17 readability, task persistence, crisis, council, or shared harness behavior.

No regressions in previous phase tests

No regressions in shell routing, variant switching, or existing game builds

Offset the experiment viewport by the measured toolbar height during local dev testing.

Per-game acceptance criteria from colony-sim.md are met

Plan five high-contrast UX variations for first-play and repeat-play evaluation.

Player can link/unlink a GitHub repo

Present and validate assumptions checkpoint.

Preserve reducer and persistence contracts by keeping the summary derived-only.

Promote the district city map from secondary overlay to the default playable Colony Sim surface after the 2026-05-14 playtest found the old map non-interactive and the loop unclear.

Prove crisis and council review counts update the default-map summary.

Read House of Walreign source spec, interview log, game audience research, game fantasy research, game core-loop research, and Colony Sim UX direction.

Recommend prototype ordering and validation criteria for the next build phase.

Record already-covered criteria instead of duplicating tests.

Register the experiment in game types, app route loading, experiment labels, app dependency graph, and shared harness plan loaders.

Rename the submit action to `Save` so it clearly saves the current input value.

Render a secondary `Reset state` action in the harness header when reset is available.

Resolve UI context from House of Walreign spec, UX variation plan, research docs, and existing experiment shell patterns.

Restart Colony Sim UX variation planning from a coding/product-development workflow premise.

Run `git diff --check`.

Run `pnpm --filter @bismarck/colony-sim lint`.

Run `pnpm --filter @bismarck/colony-sim test -- districts`.

Run `pnpm --filter @bismarck/colony-sim test -- planning-organization`.

Run `pnpm --filter @bismarck/colony-sim test -- test-plans`.

Run `pnpm --filter @bismarck/colony-sim test`.

Run `pnpm --filter @bismarck/colony-sim typecheck`.

Run affected Colony Sim and game validation before marking the step complete.

Run affected harness and game tests before shipping.

Run affected package/app tests.

Run Colony Sim package test, typecheck, and lint.

Run Colony Sim package typecheck/lint.

Run Colony Sim test/typecheck/lint.

Run focused Colony Sim tests for district and harness plan coverage.

Run game app test/typecheck/lint/build.

Run package verification and record investigation results before shipping.

Run production harness absence check and record accepted warnings.

Run relevant app-level test/typecheck/lint/build checks discovered from package metadata.

Run targeted package/app validation only if source changes were required.

Session lifecycle works end-to-end: create → stream → review → apply/reject

Shared Colony Sim harness objectives still load and first-principles questions have stable expected answers.

Shared Colony Sim harness plans include workflow-colony objectives for Work Order creation, hat routing, named-agent recommendation, structure routing, returning report, and district planning.

Shared harness plans validate the workflow-colony loop.

Shell app loads and routes between experiment stubs

Shell error handling: soft-error banner and hard-error overlay functional

Show recommended colonist specialization fit in task surfaces.

Specify shared mechanics for motives, relationships, routines, rooms, functional objects, WorkItems, soft failure, local/mock review events, and visible agent intent.

Staging/production builds do not render harness UI or expose objectives/debug metadata

Start a local game dev server and open `/experiments/colony_sim` in Browser Use.

Step 17.1: Add colony identity state and first-run naming surface

Step 17.10: Update task docs, history, and phase closeout

Step 17.2: Move manual task and priority state into ColonyStateContext

Step 17.3: Add default-map operational summary and next-command guidance

Step 17.4: Add task category and colonist role-fit affordances

Step 17.5: Make crisis and council QA states exercisable

Step 17.6: Align Colony Sim shared harness plans with the improved scenario

Step 17.7: Write regression tests covering acceptance criteria

Step 17.8: Run package and app validation

Step 17.9: Run browser-use verification through the shared harness

Step 18.1: Define Work Order, Hat, Skill, Named Agent, and Workflow Structure contracts

Step 18.10: Run package and app validation

Step 18.11: Run browser-use verification through the shared harness

Step 18.12: Update task docs, history, and phase closeout

Step 18.2: Add reducer-backed Work Order lifecycle and persistence

Step 18.3: Build the goal-first Work Order creation flow

Step 18.4: Build Workflow Structure routing and structure-first creation

Step 18.5: Add named-agent roster and save-profile behavior

Step 18.6: Add Last-Session Report and Colony Health Map overlay

Step 18.7: Add district state, Tinker Mode, and district inspector foundation

Step 18.7a: Add failing district domain tests

Step 18.7b: Add district state, reducer actions, persistence, and Tinker Mode UI

Step 18.8: Add Planning Inbox, expansion proposals, tags, Threads, and milestone snapshots

Step 18.8a: Add failing organization-domain tests

Step 18.8b: Add Planning Inbox, Tags, Threads, and milestone snapshot state/UI

Step 18.9: Align Colony Sim shared harness plans with the workflow-colony loop

Steps 1-18 complete. Typecheck/lint pass.

Steps 1-20 complete. Tests pass. Typecheck/lint pass.

Style guide document exists and covers all factions and terminology

Surface that summary on the default map/HUD without obstructing the shared Playtest drawer.

Tactical layout follow-up applied (commit `33f9d2b`).

Tags and Threads are distinct in state and UI: tags are lightweight labels; Threads have timelines, evidence, related Work Orders, related districts, and optional steward.

Task assignment preserves status, colonist, work zone, and priority when leaving and returning to the Task Board.

The Crisis objective can be exercised from visible local QA controls or a seeded scenario without source-code intervention.

The default Colony Sim map shows a colony-level name, ownership signal, and purpose before relying on building labels.

The default map/HUD summarizes current work, idle colonists, crisis count, council review count, and the recommended next command.

The Engineer-to-Docs objective has visible supporting UI for task category/role fit.

The Planning Inbox can preview/approve/reject/snooze/archive at least one expansion proposal and one Thread suggestion.

The returning-player screen shows a Last-Session Report with pending reviews, stale blockers, active Work Orders, validation pressure, and recommended next action.

Trace the submission path from `TaskBoard` through task creation, queue state, assignment, and evidence surfaces.

Trace toolbar height ownership through `ExperimentToolbar` and `ExperimentLayout`.

Unit tests cover all abstraction methods and mock behaviors

Update Colony Sim first-principles harness expected answers for the improved named-colony scenario.

Update harness objectives and tests so future playtests catch regressions to a passive old map.

Update roadmap and todo with the planning result.

Validate how Colony Sim persisted state prevents first-run name-model retests.

Validate manual task creation observations against Colony Sim code and recent git history.

Validate the expanded Playtest drawer overlap against the current game viewport.

Variant switching works via dev toolbar

Variant system registers, lists, and switches between variants

Verify manual task persistence plus demo crisis/review paths from visible controls.

Verify manual tasks and priority updates are reducer-backed rather than `TaskBoard` local state.

Verify the fix in Browser Use against `/experiments/colony_sim`.

Verify the Playtest drawer loads the Colony Sim first-principles plan.

Wire the game dev harness to clear the current experiment's local state keys and reload.

Work Orders route visibly to a Workflow Structure and maintain draft, queued, active, review-ready, sent-back, shipped, blocked, and archived states in reducer-backed persisted state.

Write `specs/colony-sim-rendering-experiments-interview.md`.

Write `specs/colony-sim-rendering-experiments.md` and `specs/colony-sim-rendering-experiments-interview.md`.

Write `specs/colony-sim-rendering-experiments.md`.

Write `specs/house-of-walreign-interview.md`.

Write `specs/house-of-walreign.md`.

Write `specs/ui-house-of-walreign-variations-interview.md`.

Write `specs/ui-house-of-walreign-variations.md`.

Write `specs/ux-variations-house-of-walreign-interview.md`.

Write `specs/ux-variations-house-of-walreign.md`.

Write the canonical spec and interview log under `specs/`.

Write the district UI spec and interview log.
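
The Work Order item above lists eight states held in reducer-backed persisted state. A minimal sketch of how such a lifecycle could be modeled follows. The state names come from that item; the transition table, the action shape, and the names `workOrderReducer`, `persist`, and `restore` are assumptions made for illustration, not the project's actual contract.

```typescript
// Hypothetical sketch: only the eight status names come from the roadmap item.
type WorkOrderStatus =
  | "draft" | "queued" | "active" | "review-ready"
  | "sent-back" | "shipped" | "blocked" | "archived";

interface WorkOrder {
  id: string;
  goal: string;
  status: WorkOrderStatus;
}

// Assumed allowed transitions; the real rules may differ.
const TRANSITIONS: Record<WorkOrderStatus, readonly WorkOrderStatus[]> = {
  draft: ["queued", "archived"],
  queued: ["active", "blocked", "archived"],
  active: ["review-ready", "blocked"],
  "review-ready": ["shipped", "sent-back"],
  "sent-back": ["queued", "archived"],
  blocked: ["queued", "archived"],
  shipped: ["archived"],
  archived: [],
};

type WorkOrderAction = { type: "transition"; id: string; to: WorkOrderStatus };

// Pure reducer: applies a transition only when the table allows it,
// otherwise returns the order unchanged.
function workOrderReducer(state: WorkOrder[], action: WorkOrderAction): WorkOrder[] {
  return state.map((order) =>
    order.id === action.id && TRANSITIONS[order.status].includes(action.to)
      ? { ...order, status: action.to }
      : order
  );
}

// JSON round trip, standing in for whatever storage the game actually uses.
function persist(state: WorkOrder[]): string {
  return JSON.stringify(state);
}

function restore(raw: string): WorkOrder[] {
  return JSON.parse(raw) as WorkOrder[];
}
```

Keeping the transition rules in one table means an illegal move (for example, `queued` straight to `shipped`) is ignored rather than silently persisted, which is the kind of behavior the reducer tests in this list would need to pin down.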

Planned (108)

`$game-launch` - create `research/game-launch.md` after `$game-store-page-test`; currently blocked because `research/game-store-page-test.md` is missing.

`$game-roadmap` - update `tasks/roadmap.md` after `$game-launch`; currently blocked because `research/game-launch.md` is missing and the current `tasks/roadmap.md` predates the missing game-market research sequence.

`$game-store-page-test` - create `research/game-store-page-test.md` after `$game-playtest-metrics`; unblocked by `research/game-playtest-metrics.md`.

`$game-store-page-test` - create `research/game-store-page-test.md` because `tasks/todo.md` § `Priority Documentation Todo` has this unchecked item unblocked by `research/game-playtest-metrics.md` (metrics research updated 2026-05-14).

`$plan-phase 19` - decompose the new Colony Sim Renderer Benchmark phase because `tasks/roadmap.md` now includes Phase 19 from `specs/colony-sim-rendering-experiments.md` (roadmap/spec updated 2026-05-15), but `tasks/todo.md` still has no implementation steps for that phase.

`$reconcile-dev-docs fix tasks` - reconcile stale manual/advisory task docs because `tasks/manual-todo.md` was last updated 2026-05-14 and still references Phase 13 plus old dogfood/UAT follow-ups while roadmap Phases 13-18 are complete.

`$spec-drift fix all` - reconcile specs against implementation because source-code commits under `apps/` and `packages/` landed after many canonical specs.

`pnpm turbo lint` — all packages pass

`pnpm turbo test` — no regressions

`pnpm turbo typecheck` — all packages pass

1 failing test: budget calculation in `agent-integration.test.tsx:185` (`spentTodayUsd` accumulation off by ~1.2)

Agent abstraction layer exposes all 5 core methods with TypeScript types

Agent integration works (mock and live modes)

Agent sessions create feature branches on linked repos

All 10 games use consistent faction names, colors, and terminology

All 3 HITL patterns function within naval theme

All 3 HITL touchpoints work together (Log + Periscope + Radio)

All 3 visual styles render correctly (PixiJS, Three.js, terminal)

All 8 spec-defined variants render and are playable

All 9 spec-defined variants render and are playable

All spec-defined variants render and are playable

Belt logistics and station interactions function correctly

Benchmark output includes first render time, FPS/frame-time, interaction latency, bundle size impact, memory trend, and nonblank canvas verification.

Card mechanics trigger agent tasks correctly

Codex API integration passes the same test suite as the mock layer

Combat and equipment systems function

Companion actions trigger agent tasks correctly

Cost guardrails prevent runaway agent spending

Cost tracking accumulates per-session and per-aggregate

Demo codebase can be provisioned for new players

Depth mechanic affects scope of agent work

Dialogue system produces valid agent prompts

Dive/surface toggle changes agent behavior mode (analysis vs execution)

Each candidate supports the same minimal interactions and emits equivalent benchmark events.

Event streaming delivers progress updates to a test client

Factory actions trigger agent tasks and display results

Factory-builder: all 10 UI components migrated to CSS modules

Factory-builder: all 23 sound hooks fire custom events

Factory-builder: all 3 commission hub variants functional

Factory-builder: all 4 layout variants render and are switchable via variant selector

Factory-builder: all 4 station click variants functional

Factory-builder: animation transitions match spec (durations, easings)

Factory-builder: both inspection diff variants functional

Factory-builder: both palette variants apply correctly (CSS custom properties)

Factory-builder: HUD shows all 7 data points with animation (value pulses, power warnings)

Factory-builder: StationToolbar renders 6 stations with color coding, locked/unlocked states

Factory-builder: view navigation works per layout variant (hotkeys, tabs, sidebar icons, panel toggles)

Fleet actions trigger agent tasks correctly

Fleet formation mechanic affects agent coordination pattern

Froggy Empire presentation is consistent across genres

Game shell experiment top bar redesigned per ui-game-shell.md §6.2 (collapsed/expanded, 4 accordion sections, agent status, perf badge)

Game shell hub page redesigned per ui-game-shell.md §6.1 (row anatomy, action cluster, hypothesis preview, keyboard nav)

GitHub OAuth login/logout flow works end-to-end

Help overlay renders dynamic shortcut cheatsheet

HITL callbacks route correctly through the abstraction

Idle progression and prestige systems function correctly

Live-agent smoke and full per-game spec acceptance are deferred by design.

Management actions trigger agent tasks and display results

Mock layer returns realistic simulated responses with configurable delays

Mock/live toggle switches cleanly without code changes

No lore contradictions between games

Per-game acceptance criteria from auto-battler-tactics.md are met

Per-game acceptance criteria from crpg.md are met

Per-game acceptance criteria from factory-builder.md are met

Per-game acceptance criteria from idle-incremental.md are met

Per-game acceptance criteria from management-tycoon.md are met

Per-game acceptance criteria from naval-combat.md are met

Per-game acceptance criteria from roguelike-deckbuilder.md are met

Per-game acceptance criteria from submarine-combat.md are met

PixiJS 2D and Three.js 3D renderers both work

Player can link/unlink a GitHub repo

Qualitative harness objectives evaluate route legibility, structure/worksite/state readability, living-colony feel, and future readiness for districts, agents, particles, labels, and evidence overlays.

Review the "Weekly game-experiment dogfood sweep" entry in `tasks/recurring-todo.md` (next due 2026-05-05); promote it to `tasks/todo.md` only if it now requires execution work.

Round lifecycle works (deploy → execute → results → adjust)

Run lifecycle works (start → play → boss → complete/fail)

Scoring and progression systems function correctly

Selected assets are vendored into `packages/colony-sim/assets` with a reproducible manifest and license/source metadata.

Session lifecycle works end-to-end: create → stream → review → apply/reject

Shell error handling: soft-error banner and hard-error overlay functional

Style guide document exists and covers all factions and terminology

The local Playtest harness can switch renderer candidates, reset state, persist benchmark runs, capture findings, and export benchmark notes.

The recommended renderer is justified by recorded evidence instead of preference alone.

Three.js, PixiJS, and Phaser candidates render the same benchmark scene from the same renderer-neutral view model.

Torpedo system triggers targeted task executions

U-boat pen upgrades persist between sessions (localStorage)

Unit deployment triggers agent tasks correctly

Unit tests cover all abstraction methods and mock behaviors

Variant presentation parity, live-agent smoke testing, and full spec acceptance are deferred by design.

Variant switching works via dev toolbar

Timeline

20 events

May 2026

docs

docs: roadmap colony sim renderer benchmark

May 16 · static-

docs

docs: specify colony sim renderer benchmarks

May 15 · static-

feature

feat: add playtest harness state reset

May 15 · static-

fix

fix(colony-sim): clarify colony naming controls

May 15 · static-

docs

docs: add house of walreign uat plan

May 14 · static-

fix

fix(colony-sim): make city map the default loop surface

May 14 · static-

feature

feat: add house of walreign UI prototypes

May 14 · static-

docs

docs: add house of walreign ui variation spec

May 14 · static-

docs

docs: add game playtest metrics

May 14 · static-

docs

docs: add game prototype test plan

May 14 · static-

docs

docs(tasks): close phase 18

May 14 · static-

fix

fix(colony-sim): seed browser-verifiable planning inbox

May 14 · static-

docs

docs(tasks): record phase 18 validation pass

May 14 · static-

feature

feat(colony-sim): align workflow harness objectives

May 14 · static-

feature

feat(colony-sim): add planning organization inbox

May 14 · static-

test

test: add colony planning organization contracts

May 14 · static-

feature

feat(colony-sim): add district tinker mode

May 14 · static-

docs

docs: add house of walreign ux variations

May 14 · static-

test

test(colony-sim): add district red contracts

May 14 · static-

feature

feat(colony-sim): add last-session report

May 14 · static-

Dev Docs

47 files

Specs

Auto-battler/Tactics — "Walreign Vanguard" — Bismarck v5
May 15, 2026 · 21.9 KB
Bismarck v5 — Game-First Development Interface
May 15, 2026 · 13.1 KB
Bismarck v5 Game Experiments — Interview Log
May 15, 2026 · 7.8 KB
Colony Sim — "Walreign Outpost" — Bismarck v5
May 15, 2026 · 27.3 KB
Colony Sim — "Walreign Outpost" — Interview Log
May 15, 2026 · 6.9 KB
CRPG — "Walreign Chronicle" — Bismarck v5
May 15, 2026 · 29.7 KB
CRPG "Walreign Chronicle" — Interview Log
May 15, 2026 · 9.4 KB
Factory-Builder — "Walreign Forge" — Bismarck v5
May 15, 2026 · 25.2 KB
Factory-Builder — "Walreign Forge" — Interview Log
May 15, 2026 · 4.2 KB
House of Walreign
May 15, 2026 · 17.3 KB
House of Walreign Spec Interview
May 15, 2026 · 9.6 KB
Idle/Incremental — "Walreign Eternal" — Bismarck v5
May 15, 2026 · 22.6 KB
Idle/Incremental — "Walreign Eternal" — Interview Log
May 15, 2026 · 3.9 KB
Management Tycoon — "Walreign Works" — Bismarck v5
May 15, 2026 · 26.3 KB
Management Tycoon — "Walreign Works" — Interview Log
May 15, 2026 · 5.9 KB
Monorepo Structure & Playtest Platform — Bismarck v5
May 15, 2026 · 16.2 KB
Monorepo Structure & Playtest Platform — Interview Log
May 15, 2026 · 8.1 KB
Roguelike Deckbuilder — "Walreign Sortie" — Bismarck v5
May 15, 2026 · 41.3 KB
Shared Backend Service — Bismarck v5
May 15, 2026 · 13.0 KB
Shared Backend Service — Interview Log
May 15, 2026 · 6.7 KB
Shared Game Test Harness
May 15, 2026 · 10.9 KB
Shared Game Test Harness Interview
May 15, 2026 · 10.4 KB
UI Interview - House of Walreign Variations
May 15, 2026 · 6.9 KB
UI Interview Log - Colony Sim Districts, Tags, Threads, and City Planning
May 15, 2026 · 8.5 KB
UI Interview Log — Bismarck v5 Game Shell
May 15, 2026 · 6.7 KB
UI Interview Log — Colony Sim ("Walreign Outpost")
May 15, 2026 · 7.0 KB
UI Interview Log — Factory Builder (Walreign Forge)
May 15, 2026 · 6.9 KB
UI Interview Log — Management Tycoon ("Walreign Works")
May 15, 2026 · 5.6 KB
UI Spec - Colony Sim Districts, Tags, Threads, and City Planning
May 15, 2026 · 19.7 KB
UI Spec - House of Walreign Variations
May 15, 2026 · 20.1 KB
UI Spec — Bismarck v5 Game Shell (Dev / Playtest Surface)
May 15, 2026 · 23.0 KB
UI Spec — Colony Sim ("Walreign Outpost")
May 15, 2026 · 38.2 KB
UI Spec — Factory Builder (Walreign Forge)
May 15, 2026 · 37.8 KB
UI Spec — Management Tycoon ("Walreign Works")
May 15, 2026 · 31.5 KB
UX Direction - Colony Sim Workflow Colony
May 15, 2026 · 12.3 KB
UX Direction Interview Log - Colony Sim Workflow Colony
May 15, 2026 · 9.7 KB
UX Variation Interview Log — Management Tycoon ("Walreign Works")
May 15, 2026 · 4.3 KB
UX Variations - House of Walreign
May 15, 2026 · 25.6 KB
UX Variations - House of Walreign Interview
May 15, 2026 · 6.7 KB
UX Variations — Management Tycoon ("Walreign Works")
May 15, 2026 · 55.3 KB
Walreign Empire Narrative Style Guide
May 15, 2026 · 16.2 KB
Walreign High Seas — Naval Combat Experiment
May 15, 2026 · 16.0 KB
Walreign High Seas — Spec Interview Log
May 15, 2026 · 6.0 KB
Walreign Hunters — Spec Interview Log
May 15, 2026 · 6.6 KB
Walreign Hunters — Submarine Combat & Exploration Experiment
May 15, 2026 · 23.7 KB
Walreign Sortie — Roguelike Deckbuilder — Interview Log
May 15, 2026 · 6.8 KB
Walreign Vanguard — Spec Interview Log
May 15, 2026 · 7.5 KB