Agent Instructions
“The strength of a dwarf lies less in brute force than in stubborn refusal to leave a bug unfixed.” – The Oracle of Delphi (paraphrased)
Purpose
This file defines how coding agents should work in this repository.
This project uses GitHub Issues for work tracking. PROJECT_PLAN.md is the authoritative source for goals, scope, and milestone priorities.
Current Mission: Coverage Campaign (Phase 3)
Immediate priority: Get seed031, seed032, and seed033 green before expanding coverage. These three sessions exercise deep gameplay paths (endgame disclosure, game-loop timing, tutorial flow) and expose parity bugs that block other work. Fix them first, then resume coverage expansion.
The primary objective is growing parity-session coverage toward 90%+.
The metric is coverage percentage, not session count. Sessions that don’t add new coverage are pure cost. Sessions that exercise code without comparing to C ground truth are worthless. The ideal is maximum C-grounded coverage with minimum session count and runtime.
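A minimal sketch of why coverage gain, not session count, is the quantity to track. It assumes coverage can be modelled as a set of "file:line" keys; the helper names and the sample file paths are hypothetical and are not part of the repo tooling.

```javascript
// Hypothetical model: coverage is a Set of "file:line" strings.
// Returns the keys the candidate session covers that the baseline does not.
function marginalCoverage(baseline, candidate) {
  const added = [];
  for (const key of candidate) {
    if (!baseline.has(key)) added.push(key);
  }
  return added;
}

// New coverage gained per recorded step; a session that adds nothing scores 0.
function coveragePerTurn(baseline, candidate, steps) {
  if (steps <= 0) throw new RangeError('steps must be positive');
  return marginalCoverage(baseline, candidate).length / steps;
}

// Sample data only; these paths are illustrative.
const baseline = new Set(['js/potion.js:10', 'js/potion.js:11']);
const candidate = new Set(['js/potion.js:11', 'js/pray.js:40', 'js/pray.js:41']);

console.log(marginalCoverage(baseline, candidate)); // [ 'js/pray.js:40', 'js/pray.js:41' ]
console.log(coveragePerTurn(baseline, candidate, 800)); // 0.0025
```

A session whose marginal coverage is empty costs runtime and adds nothing, which is the "pure cost" case described above.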
Every agent’s work must connect to this pipeline:
- Identify low-coverage code using npm run coverage:session-parity:report.
- Create one high-yield C-recorded session at a time, iterating for maximum coverage-per-turn (see docs/COVERAGE.md).
- When that session reaches diminishing returns (typically around 800 steps), place it in test/comparison/sessions/pending/.
- Repeat with a new high-yield session concept.
- In parallel, fix JS parity divergences on pending sessions, promote passing sessions, and verify coverage gain.
- Never regress existing parity — fix code, don’t mask sessions.
Required routine (normalize this behavior):
- Record aggressive, high-variance sessions designed to expose parity bugs, not just to add one narrow branch.
- Use mixed interactions in one trace when possible (for example potion effects + status interactions + prayer/luck + spell reads + inventory/use side effects) to maximize bug discovery per turn.
- Treat every newly exposed divergence as a core gameplay blocker to fix.
- Stay on that session until parity bugs are resolved (or clearly blocked and tracked), then repeat with a new aggressive session concept.
- Do not let observability/debug campaigns drift without feeding the coverage pipeline:
  - if a diagnostic sub-campaign produces no pending-session promotion and no new C-recorded pending session within a small batch of validated commits, return to session generation or pending-session bring-up immediately
  - diagnostics are successful only when they unblock promotion, new coverage, or clearly blocked issue work with concrete evidence
- After any meaningful parity-fix batch, re-anchor on coverage:
  - ask which pending session can now be promoted
  - or which low-coverage file should get the next high-yield C-recorded session
  - do not treat local first-divergence improvement as the end goal by itself
Read docs/COVERAGE.md for the full mandatory workflow, commands, and
session lifecycle rules. Code fixes without corresponding coverage
evidence don’t count.
Source of Truth and Priorities
- NetHack C 3.7.0 behavior is the gameplay source of truth.
- PROJECT_PLAN.md is the execution roadmap and phase gate definition.
- docs/COVERAGE.md is the authoritative execution guide for the current Phase 3 coverage campaign.
- Test harness outputs are evidence for divergences, not a place to hide or special-case them.
- For gameplay parity sign-off, session replay results are authoritative over unit tests.
- The execution model is the C execution model:
  - single-threaded,
  - one active owner of input at a time,
  - no gameplay reentrancy,
  - no synthetic queues/continuations to reorder command vs monster work.
- Enforced at runtime by js/modal_guard.js: modal waits (more, yn, getlin, getdir, menu) assert exclusive input ownership. Game code (moveloop, movemon, mattacku, domove, rhack) asserts it’s not inside a modal. Missing await on async calls is the #1 cause of violations: the orphaned Promise fires during an unrelated yield.
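The missing-await failure mode can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the contents of js/modal_guard.js: createModalGuard, modalWait, assertNotInModal, and the two command functions are hypothetical names, and a zero-delay timer stands in for waiting on a keypress.

```javascript
// Hypothetical stand-in for an input-ownership guard (not the real js/modal_guard.js).
function createModalGuard() {
  let owner = null; // name of the modal that currently owns input, or null
  return {
    async modalWait(name) {
      if (owner !== null) {
        throw new Error(`${name} requested input while ${owner} owns it`);
      }
      owner = name;
      try {
        // Stands in for waiting on a keypress.
        await new Promise((resolve) => setTimeout(resolve, 0));
      } finally {
        owner = null;
      }
    },
    assertNotInModal(caller) {
      if (owner !== null) {
        throw new Error(`${caller} ran while ${owner} owns input`);
      }
    },
  };
}

// Correct: the prompt is awaited, so game code resumes only after it closes.
async function commandAwaited(guard) {
  await guard.modalWait('yn');
  guard.assertNotInModal('movemon');
  return 'ok';
}

// Buggy: without await, game code runs while the prompt is still open.
async function commandMissingAwait(guard) {
  const orphan = guard.modalWait('yn'); // BUG: not awaited
  try {
    guard.assertNotInModal('movemon');
    return 'ok';
  } catch (err) {
    return err.message;
  } finally {
    await orphan; // only so this demo settles cleanly
  }
}

(async () => {
  console.log(await commandAwaited(createModalGuard())); // ok
  console.log(await commandMissingAwait(createModalGuard())); // movemon ran while yn owns input
})();
```

The guard here is created per call so the two demos do not share state; a real guard would presumably be a single module-level owner.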
Current-Phase Resources
Read these first for active work:
- PROJECT_PLAN.md
- docs/COVERAGE.md
- docs/CODEMATCH.md
- docs/PARITY_TEST_MATRIX.md
- docs/ASYNC_CLEANUP.md
- docs/REPAINT_PARITY.md
Historical/reference docs:
- docs/IRON_PARITY_PLAN.md (campaign history and architecture lessons)
Work Types and Primary Metrics
- Porting Work. Primary metric: reduce first divergence and increase matched PRNG log prefix against C. Debug focus: PRNG call context and first-mismatch localization.
- Selfplay Agent Work. Primary metric: held-out improvement after training-set tuning. Competence focus: survival, exploration breadth, depth progression, and interaction quality (combat, inventory, item use, magic/abilities).
- Test Infrastructure Work. Primary metric: developer insight speed. Requirements: tests run fast enough to avoid blocking developers, and failures provide actionable debug detail. Scope may include deterministic replay tooling, diagnostics, and code coverage. Constraint: infrastructure reveals bugs; it must not solve or mask them.
Regression/Progress Standard (Current Phase)
- A change is a regression if it moves first divergence earlier on any parity channel (PRNG, events, or screen) for a session that was previously better.
- A change is progress if it moves first divergence later (or fully green) on one or more parity channels without causing larger regressions elsewhere.
- Treat boundary-only capture artifacts as neutral only when repaint-parity evidence shows the mismatch is ownership/capture timing rather than gameplay-visible behavior.
- Use per-session first-divergence step movement as the decision signal for go/no-go commits.
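The per-session rule above can be sketched as a small classifier. It assumes each channel's first divergence is recorded as a step number, with null meaning fully green; that representation and the function name classifyChange are hypothetical. It covers a single session only, so weighing regressions across different sessions is left out.

```javascript
// Channel names come from the standard above; the data shape is an assumption.
const CHANNELS = ['PRNG', 'events', 'screen'];

// before/after: { PRNG: step|null, events: step|null, screen: step|null }
// null means the channel has no divergence (fully green).
function classifyChange(before, after) {
  const rank = (step) => (step === null ? Infinity : step);
  let movedEarlier = false;
  let movedLater = false;
  for (const channel of CHANNELS) {
    const was = rank(before[channel]);
    const now = rank(after[channel]);
    if (now < was) movedEarlier = true;
    if (now > was) movedLater = true;
  }
  if (movedEarlier) return 'regression'; // earlier on any channel
  if (movedLater) return 'progress'; // later on at least one, earlier on none
  return 'neutral';
}

const before = { PRNG: 120, events: 120, screen: 118 };

console.log(classifyChange(before, { PRNG: 340, events: 340, screen: 118 })); // progress
console.log(classifyChange(before, { PRNG: 340, events: 340, screen: 90 })); // regression
console.log(classifyChange(before, { PRNG: null, events: null, screen: null })); // progress
```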
Step Boundary Offsets
When testing reveals that JS attributes events to a different step than C,
do not treat this as a fundamental architectural problem. Despite all the
async/await in JS, all game logic runs on a single linear thread — this
models C’s single-threaded execution model and has been verified by both
static program analysis and runtime assertions that monitor and enforce
single-threaded flow (see js/modal_guard.js).
Step boundary offsets mean that certain actions (message display, monster turns, turn-end processing) happen at a different point in the linear turn sequence than C. The fix is always moving specific calls earlier or later in JS’s turn sequence to match C’s ordering — not restructuring the architecture.
Common examples:
- A --More-- boundary fires (or doesn’t fire) because topMessage state differs from C’s toplines at that point in the turn.
- Monster movement is attributed to the wrong step because moveloop_turnend ran before/after the corresponding C code.
- A floor check message appears one step early because the hero’s domove triggered check_here before a prior --More-- was dismissed.
Do not fear boundary offsets. Diagnose which action is out of order, and move it to the correct point in the turn.
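As a diagnostic aid only, the sketch below describes where per-step event lists first differ and whether the overall linear order still matches C. The trace shape (one array of event names per input step), the event names, and describeEventMismatch are all hypothetical. It reports a mismatch and never decides pass or fail, so it is not a comparator rule.

```javascript
// Diagnostic sketch only: it describes a mismatch, it never decides pass/fail.
// cSteps / jsSteps: one array of event names per input step (hypothetical shape).
function describeEventMismatch(cSteps, jsSteps) {
  const sameList = (x, y) => x.length === y.length && x.every((e, i) => e === y[i]);
  const stepCount = Math.max(cSteps.length, jsSteps.length);

  let firstStep = -1;
  for (let step = 0; step < stepCount; step++) {
    if (!sameList(cSteps[step] || [], jsSteps[step] || [])) {
      firstStep = step;
      break;
    }
  }
  if (firstStep === -1) return { kind: 'match', firstStep };

  // Same linear sequence, different step attribution: a boundary moved.
  // Different linear sequence: some call runs earlier or later than in C.
  const kind = sameList(cSteps.flat(), jsSteps.flat())
    ? 'same-sequence-different-step'
    : 'different-sequence';
  return { kind, firstStep };
}

// Illustrative traces: JS runs check_here before the --More-- is dismissed.
const cTrace = [['domove'], ['more_dismissed', 'check_here']];
const jsTrace = [['domove', 'check_here'], ['more_dismissed']];

console.log(describeEventMismatch(cTrace, jsTrace));
// { kind: 'different-sequence', firstStep: 0 }
```

In both non-matching cases the remedy stays the same as described above: find the call that is out of place and move it in the JS turn sequence.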
Repaint Parity Discipline
- For screen or cursor mismatches, do not rely on vague “boundary artifact” explanations.
- Use docs/REPAINT_PARITY.md and repaint diagnostics to identify the concrete owner of the visible change:
  - flush_screen(...)
  - bot() / renderStatus(...)
  - more()
  - prompt owners such as yn_function() / ynFunction() and getlin()
- Frame the question precisely: “when did C make this state visible, and which repaint owner did it?”
- Prefer explicit repaint/cursor ownership evidence (^repaint[...], step-local repaint diffs, prompt-owner traces) over informal reasoning about async boundaries.
- Do not patch screen/cursor behavior speculatively. If the first divergence is visible output only, localize it with repaint traces before changing core JS.
- Repaint work is still core parity work: fix JS display/runtime ownership, not the comparator, replay harness, or session artifacts.
Circular Import Policy
Circular function imports between gameplay files are fine — ESM resolves
function bindings lazily at call time. See docs/MODULES.md for the design.
Do NOT use registration patterns (registerFoo(fn) + module-level var)
to avoid circular imports. This adds unnecessary complexity and is fragile:
var x = null initializers can reset values set during circular module
evaluation, causing hard-to-debug bugs. Instead, just import the function
directly — circular function imports work correctly by design.
If you encounter an existing registration pattern, replace it with a direct
import. Read docs/MODULES.md before concluding that a circular import
needs special handling.
Non-Negotiable Engineering Rules
- Fix behavior in core JS game code, not by patching comparator/harness logic.
- Keep harness simple, deterministic, and high-signal for debugging.
- Never normalize away real mismatches (RNG/state/screen/typgrid) just to pass tests.
- Keep changes incremental and test-backed.
- Preserve deterministic controls (seed, datetime, terminal geometry, options/symbol mode).
- Keep tests fast to expose hangs early:
  - unit tests: 1000ms timeout per test
  - single-session parity runs: 10000ms timeout per session
  - full suite must complete in minutes, not hours; treat creeping slowdown as a regression
  - fail fast on hangs; never sit for minutes producing no output
  - a 30-minute deadlock is worse than a test failure: it wastes time with zero signal
- Avoid cruft in parity fixes:
  - no broad refactors unrelated to the active divergence
  - no compatibility shims unless required for immediate correctness
  - remove temporary debug scaffolding before commit unless explicitly retained for observability
- Area Parity Sweep: When fixing a parity gap, sweep the entire surrounding area for all similar gaps. Fix the class of problem, not just the instance. A single discovered bug is evidence of a pattern: audit the whole function, file, or related code for all instances before moving on. See skills/area-parity-sweep/SKILL.md for the full process.
- Treat generated version files as expected collateral, not unexpected changes:
  - _data/version.yml and js/version.js may update during hooks or normal commit flow
  - keep the newest generated values
  - include them with the active commit when they change; do not stop work just because these two files updated
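The fail-fast timeout rule in the list above can be sketched with a generic wrapper. The two budget values come from that rule; withTimeout and the constant names are hypothetical and not taken from the repo's test runner. Note that a Promise-based timeout like this cannot interrupt a synchronous infinite loop; it only bounds waits that yield to the event loop.

```javascript
// Budgets from the rule above; the constant names are illustrative.
const UNIT_TEST_TIMEOUT_MS = 1000;
const SESSION_TIMEOUT_MS = 10000;

// Rejects with a labelled error if the promise does not settle within ms.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms`)), ms);
  });
  // Clear the timer either way so a finished run does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// A promise that never settles stands in for a hung session (short budget for the demo).
withTimeout(new Promise(() => {}), 50, 'hung session')
  .catch((err) => console.log(err.message)); // hung session exceeded 50ms

// A fast operation passes through unchanged.
withTimeout(Promise.resolve('green'), UNIT_TEST_TIMEOUT_MS, 'unit test')
  .then((value) => console.log(value)); // green
```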
No-Fake-Implementation Rule (Strict)
- Do not present scaffolds, placeholders, or heuristics as completed parity or translation work.
- If a prerequisite from the plan is missing (for example clang frontend), stop and fix the prerequisite before implementing downstream features.
- Never claim “translated” when output is stubbed or manually hardcoded per function name.
- If using fallback behavior temporarily, label it explicitly as fallback and do not count it toward milestone completion.
Examples of forbidden fakes:
- Regex-only parsing presented as completion for a clang/libclang parser milestone.
- Emitter logic like if function_name == "X": emit hardcoded string.
- Output marked as translated while body still throws UNIMPLEMENTED_TRANSLATED_FUNCTION.
- Adding comparator/harness exceptions to hide gameplay divergence instead of fixing game logic.
Development Cycle
- Identify a failing parity behavior from sessions/tests.
- Confirm expected behavior from C source.
- Implement faithful JS core fix that matches C logic.
- Run relevant tests/sessions (and held-out eval where applicable).
- Record learnings in docs/LORE.md for porting work and selfplay/LEARNINGS.md for agent work.
- Commit only validated improvements.
Long-Run Regeneration Discipline
See skills/long-running-task/SKILL.md for the full methodology.
When running rebuilds/regenerations that can take several minutes:
- Do a short preflight first (single seed/single fixture) to confirm setup and output shape.
- Start the full run only after preflight output looks correct.
- Monitor partial output during the run (periodic status polls, milestone checks, first-output sanity).
- Treat stalls or suspicious output as actionable: stop early, fix setup, and restart; do not wait for full completion.
- Keep logs/checkpoints so partial progress can be inspected and reported while work is still running.
- Report partial status to the user when asked, including what is done, in progress, and next.
Session and Coverage Expectations
- Use the canonical key-centered deterministic session format.
- During translation coverage work, maintain a C-to-JS mapping ledger.
- For low-coverage parity-critical areas, add targeted deterministic sessions.
- Keep parity suites green while expanding coverage.
- Coverage credit is based on C-recorded parity sessions only (see docs/COVERAGE.md); ordinary unit-test coverage does not count toward parity coverage progress.
- Every session must compare against C ground truth. Coverage that isn’t C-grounded is not parity coverage. Sessions that merely execute code without validating against C traces are useless.
- Follow the session lifecycle in docs/COVERAGE.md:
  - record new sessions in test/comparison/sessions/pending/,
  - debug/fix parity until green,
  - promote passing sessions into test/comparison/sessions/coverage/<theme>/.
- Coverage sessions under sessions/coverage/ are part of the default parity suite; do not regress them.
- Do not add sessions that don’t increase coverage. Every session costs CI time; it must pay for itself in new lines/branches covered.
- While parity coverage is below target, keep active issue work on both:
  - fixing failing pending sessions,
  - recording/promoting new targeted coverage sessions.
- Track progress by coverage percentage delta, not session count.
- Session filename length policy: for new/renamed session files, keep <filename>.session.json at 56 characters or fewer (to keep tooling output and CLI workflows readable). Use compact intent tokens instead of long prose.
- Active capture tactic: follow the Coverage-Per-Turn Agent Challenge in docs/COVERAGE.md:
  - build one high-yield session at a time,
  - iterate it toward ~800 steps while maximizing coverage-per-turn,
  - place it in sessions/pending,
  - then start a fresh concept/session and repeat.
- Pending bring-up workflow: run session_test_runner first to get the authoritative first divergence, then use rng_step_diff/mapdump tools only for focused drilldown.
- A parity-only fix stream is incomplete unless it reconnects to coverage:
  - promote the newly green pending session in the same cycle when feasible
  - otherwise leave an explicit next-step trail: which pending session is now closer to promotion, or which new session should be recorded next because of the fix
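The session filename length policy in the list above can be checked mechanically. The sketch below assumes the 56-character limit applies to the whole basename including the .session.json suffix, which is one reading of the policy; checkSessionFilename and the sample names are hypothetical.

```javascript
// Assumption: the limit counts the full basename, suffix included.
const MAX_SESSION_FILENAME_LENGTH = 56;
const SESSION_SUFFIX = '.session.json';

function checkSessionFilename(filename) {
  if (!filename.endsWith(SESSION_SUFFIX)) {
    return { ok: false, reason: `missing ${SESSION_SUFFIX} suffix` };
  }
  if (filename.length > MAX_SESSION_FILENAME_LENGTH) {
    return {
      ok: false,
      reason: `${filename.length} characters, limit is ${MAX_SESSION_FILENAME_LENGTH}`,
    };
  }
  return { ok: true, reason: '' };
}

// Sample names only; compact intent tokens pass, long prose does not.
console.log(checkSessionFilename('potion-pray-luck-mix.session.json').ok); // true
console.log(
  checkSessionFilename(
    'wizard-drinks-every-potion-then-prays-and-reads-all-scrolls.session.json'
  ).ok
); // false
```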
Agent Work Rules (Selfplay)
These rules apply to coding work focused on selfplay agent quality.
- Use a 13-seed training set with one seed per NetHack character class.
- Optimize agent behavior against that 13-class training set.
- Before committing, run a held-out evaluation on a different 13-seed set (also one per class).
- Only commit when held-out results show improvement over baseline.
- Track not only survival but competence in exploration breadth, dungeon progression, and interaction quality.
- Keep agent policy/tuning changes separate from parity harness behavior.
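The held-out gate above can be sketched as follows. It assumes each seed's run is reduced to a single numeric score and that "improvement" means a higher mean across the 13 held-out seeds; both are simplifying assumptions, since the rules also ask for tracking exploration, progression, and interaction quality. The function names are hypothetical.

```javascript
// One seed per character class, per the rules above.
const CLASS_COUNT = 13;

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Scores are hypothetical per-seed numbers from the held-out set, not the training set.
function shouldCommit(baselineHeldOut, candidateHeldOut) {
  if (baselineHeldOut.length !== CLASS_COUNT || candidateHeldOut.length !== CLASS_COUNT) {
    throw new RangeError(`expected ${CLASS_COUNT} held-out scores per run`);
  }
  return mean(candidateHeldOut) > mean(baselineHeldOut);
}

const baselineScores = [4, 5, 3, 6, 4, 5, 5, 3, 4, 6, 5, 4, 5];
const candidateScores = [5, 5, 4, 6, 4, 6, 5, 3, 5, 6, 5, 4, 6];

console.log(shouldCommit(baselineScores, candidateScores)); // true
console.log(shouldCommit(baselineScores, baselineScores)); // false
```

Equal results do not pass the gate, which matches "only commit when held-out results show improvement over baseline".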
Harness Boundary
Allowed harness changes:
- Determinism controls
- Better observability/logging
- Faster execution that does not change semantics
Not allowed:
- Comparator exceptions that hide true behavior differences
- Replay behavior that injects synthetic decisions not in session keys
- Any workaround that makes failing gameplay look passing
- Any queueing/continuation/parallel ownership scheme that changes the single-threaded C ordering of:
  - prompt/input ownership,
  - command execution,
  - monster turns,
  - message acknowledgement boundaries
Issue Dependencies and Hygiene
Use explicit dependency links in every scoped issue:
- Blocked by #<issue>
- Blocks #<issue>
Operational rules:
- Apply blocked label when prerequisites are open.
- Apply has-dependents label when an issue gates others.
- Keep workflow status in sync (Ready, Blocked, In Progress, Done).
- Default: do not start In Progress while declared blockers are open.
- Exception: if a blocker advisory is stale/incorrect, proceed opportunistically and fix links/labels in the same cycle.
Issue hygiene:
- Run periodic triage (gh issue list --state open).
- Close obsolete/superseded issues with a clear reason.
- Update issue body/labels/status comments promptly when new evidence changes scope or priority.
- Use parity label for C-vs-JS divergence/parity issues in the unified backlog.
- For Iron Parity campaign issues, also add campaign:iron-parity and one scope label (state, translator, animation, parity-test, docs, or infra).
- If a gh command fails due to sandbox/network restrictions, request command escalation and rerun it immediately.
Iron Parity issue structure:
- Maintain one campaign tracker epic: IRON_PARITY: Campaign Tracker (M0-M6).
- Maintain one issue per milestone (M0 through M6) and link implementation issues under the relevant milestone.
- Each implementation issue should include explicit divergence evidence and expected C behavior when parity-related.
Agent Ownership and Intake
- Agent name is the current working directory basename; use it as identity for issue ownership.
- Directory/topic affinity is suggestive only; any agent may take any issue.
- If no pending task exists, pull another actionable open issue.
- If starting work not tracked yet, create/update a GitHub issue immediately.
- Issues are unowned by default; do not assign ownership labels until work is actively claimed.
- Track ownership with agent:<name> label only while actively working.
- Use at most one agent:* label in normal flow; temporary overlap is allowed only during explicit handoff.
- When starting work: gh issue edit <number> --add-label "agent:<name>"
- If intentionally abandoning: gh issue edit <number> --remove-label "agent:<name>"
- If you complete work on an issue assigned to another agent, proceed and resolve it; leave a detailed closing/update comment so the original assignee has full context.
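A small sketch of deriving the ownership label from the working directory, as the first rule above describes. The helper names and the sample path and issue number are hypothetical; the gh command shape is the one quoted in the rules.

```javascript
// Agent name is the basename of the working directory (handles / and \ separators).
function agentLabel(cwd) {
  const parts = cwd.split(/[\\/]+/).filter(Boolean);
  if (parts.length === 0) {
    throw new Error(`cannot derive an agent name from "${cwd}"`);
  }
  return `agent:${parts[parts.length - 1]}`;
}

// Builds the claim command as a string; it does not run gh.
function claimCommand(issueNumber, cwd) {
  return `gh issue edit ${issueNumber} --add-label "${agentLabel(cwd)}"`;
}

// Sample values only.
console.log(agentLabel('/home/dev/agents/thorin')); // agent:thorin
console.log(claimCommand(123, '/home/dev/agents/thorin'));
// gh issue edit 123 --add-label "agent:thorin"
```

In a live checkout the cwd argument would come from process.cwd().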
Practical Commands
- Install/run basics: see docs/DEVELOPMENT.md.
- Issue workflow quick reference:
gh issue list --state open
gh issue view <number>
gh issue edit <number> --add-label "agent:<name>"
gh issue edit <number> --remove-label "agent:<name>"
gh issue close <number> --comment "Done"
gh issue comment <number> --body "Status..."
- RNG divergence triage quick reference:
# Reproduce one session with JS caller-tagged RNG entries
node test/comparison/session_test_runner.js --verbose <session-path>
# Inspect first mismatch window for one step
node test/comparison/rng_step_diff.js <session-path> --step <N> --window 8
RNG_LOG_PARENT=0 can be used to shorten tags if needed.
Set RNG_LOG_TAGS=0 to disable caller tags when you need lower-overhead runs.
Skill Usage
- Agents that support skills should use repo skills from skills/<skill-name>/SKILL.md when relevant.
- Current repo skills:
  - skills/parity-rng-triage/SKILL.md
  - skills/movement-propagation/SKILL.md
  - skills/topline-async-boundary/SKILL.md
  - skills/area-parity-sweep/SKILL.md
  - skills/long-running-task/SKILL.md
  - skills/trace-before-theorize/SKILL.md
- AGENTS.md remains the source of truth for non-negotiable policy.
- If skill loading is unavailable in a client, follow the workflow and guardrails from the referenced SKILL.md manually.
- Skill guardrails are mandatory when applicable, including:
  - no comparator masking/exceptions to hide divergences
  - no js/replay_core.js compensating behavior (no synthetic queueing/injection/auto-dismiss/timing compensation)
  - no unfaithful single-threaded-model violations:
    - no deferred continuation tokens to resume gameplay later
    - no parallel input owners
    - no command/monster reordering via queue machinery
Priority Docs (Read Order)
- Always start with:
  - PROJECT_PLAN.md
  - docs/COVERAGE.md (Phase 3 execution guide)
  - docs/CODEMATCH.md
  - docs/PARITY_TEST_MATRIX.md
  - docs/DEVELOPMENT.md
  - docs/LORE.md
  - docs/ASYNC_CLEANUP.md
- For porting/parity divergence work:
  - docs/C_FAITHFUL_STATE_REFACTOR_PLAN.md
  - docs/C_TRANSLATOR_ARCHITECTURE_SPEC.md
  - docs/C_TRANSLATOR_PARSER_IMPLEMENTATION_SPEC.md
  - docs/SESSION_FORMAT_V3.md
  - docs/RNG_ALIGNMENT_GUIDE.md
  - docs/C_PARITY_WORKLIST.md
- For special-level parity work:
  - docs/SPECIAL_LEVELS_PARITY_2026-02-14.md
  - docs/special-levels/SPECIAL_LEVELS_TESTING.md
- For selfplay agent work:
  - selfplay/LEARNINGS.md
  - docs/SELFPLAY_C_LEARNINGS_2026-02-14.md
  - docs/agent/EXPLORATION_ANALYSIS.md
- For known issue deep-dives:
  - docs/bugs/pet-ai-rng-divergence.md
  - docs/NONWIZARD_PARITY_NOTES_2026-02-17.md
- Historical campaign references (non-primary for current phase):
  - docs/IRON_PARITY_PLAN.md
  - docs/MORE_NEEDED_CAMPAIGN.md
Completion Discipline
When a task is complete:
- File issues for any remaining follow-up work.
- Run relevant quality gates.
- Update issue status.
- Pull/rebase and push (do not leave validated work stranded locally):
git pull --rebase
git push
git status
- Verify changes are committed and pushed.
- Report what changed, what was validated, and remaining risks.
Critical rules:
- Work is NOT complete until git push succeeds.
- NEVER stop before pushing: that leaves work stranded locally.
- NEVER say “ready to push when you are” — YOU must push.
- If push fails, resolve and retry until it succeeds.
- When multiple developers are active, push meaningful validated increments rather than batching too long locally.
Documentation Hygiene
- If docs are inaccurate or stale, fix or remove them immediately.
- Keep docs/ aligned to actual code behavior and active workflows.