Phase 3: Multi-Depth Map Generation
“We conquered depth 1. But NetHack is a 50-level dungeon.”
Phase 1 achieved perfect map generation at depth 1: every seed produced identical terrain grids. Phase 2 extended this to gameplay: 66 turns of a human-played session with every RNG call matching the C reference.
But both phases focused on a single dungeon level. Real NetHack is about descending deeper: going downstairs, generating level 2, then level 3, then deeper still, each level building on the RNG state left by the previous one. A single misaligned call at depth 1 cascades into total divergence by depth 3.
Phase 3 extends bit-exact alignment from single levels to multi-depth dungeon generation: generating levels 1 through 5 for multiple seeds, with every RNG call at every depth matching the C reference. The work revealed subtle issues with test isolation, deterministic replay, and depth-dependent monster selection that affected testing infrastructure and core game logic alike.
Table of Contents
The Challenge
- The Goal – depth 1 → depth 5, all seeds
- The Test Infrastructure – map sessions, determinism
- The Baseline – where we started
The Bugs
- Test Isolation Failures – vision state, identity counters
- The Statue Bug – depth parameter missing
- Remaining Issues – seed1 level 2, depths 4-5
The Craft
- Final Result – 90.7% pass rate
- Lessons Learned – test infrastructure matters
- The Art of Binary Search Debugging
1. The Goal
Generate dungeon levels 1 through 5 for multiple test seeds (seed16, seed72, seed119, seed163, seed306), each using wizard mode level teleport to jump to each depth in sequence, and verify:
- Terrain grid matches (
typGrid): Every wall, door, room, corridor identical to C - RNG call count matches (
rngCalls): Same number of random decisions per level - RNG sequence matches (
rng): Every call function, parameter, and result identical
The challenge: depth 3 consumes ~2400 RNG calls, depth 5 over 4000. A single divergence at depth 2 makes depth 3+ unmatchable. Debugging requires tracing through thousands of calls to find the one that’s wrong.
2. The Test Infrastructure
Map Sessions (.session.json files)
Python harness runs C NetHack in wizard mode with tmux:
# Generate map session with RNG logging
NETHACK_RNGLOG=rnglog.txt nethack -u Wizard -D
# Teleport to each depth 1-5 via wizard commands
# Capture terrain grid + RNG trace per level
Each session file contains:
{
"version": 2,
"seed": 16,
"type": "map",
"levels": [
{
"depth": 1,
"typGrid": [[1,1,0,...], ...], // 21x80 terrain
"rngCalls": 198,
"rng": ["rn2(5)=3 @ makelevel", ...]
},
...
]
}
Determinism Tests
Critical requirement: regenerating the same seed must produce identical results. Module-level state leaking between tests breaks this:
// WRONG: Shared state between test runs
let _identCounter = 0; // Never reset!
// RIGHT: Reset in initialization
export function resetIdentCounter() { _identCounter = 0; }
Test Structure
Initially used it() tests for generation, which broke when filtered:
// WRONG: Result undefined when test filtered out
it('generate maps', () => { result = generateMaps(...); });
it('check grid', () => { expect(result.grid).toEqual(...); });
// RIGHT: Use before() hook which always runs
before(() => { result = generateMaps(...); });
it('check grid', () => { expect(result.grid).toEqual(...); });
3. The Baseline
Starting point (commit c3fa9c3):
- 1090 passing tests, 74 failing (93.6%)
- Depth 1: Perfect alignment for all seeds
- Depth 2+: Significant divergences
After composite filtering (commit a81ff42):
- +7 tests (1097 pass, 67 fail)
- Filtered
rne(),rnz(),d()wrapper functions from C traces - These are logged in C but not in JS (internal calls are logged instead)
After depth 2 fixes (commits through a083820):
- Monster dart counts: kobolds get 12 darts, not 6
- Defensive items: monsters need random armor/shields (
rnd_defensive_item) - Misc items: random tools/potions (
rnd_misc_item) - Pet arrival: random teleportation on level entry
- Result: Depth 2 achieves 100% RNG alignment (2476/2476 calls match)
4. Test Isolation Failures
Problem: Running full test suite produced different results than running tests individually. Tests were bleeding state into each other.
Vision System State Leak (commit 05cbd1b)
Symptom: seed42 tests passed individually but failed after seed2_wizard_fountains
in full suite.
Root cause: vision.js module-level state:
// These persisted between tests!
let viz_clear = null;
let right_ptrs_arr = null;
let left_ptrs_arr = null;
// do_clear_area() set them to instance arrays
function do_clear_area(instance) {
viz_clear = instance.viz_clear;
right_ptrs_arr = instance.right_ptrs_arr;
left_ptrs_arr = instance.left_ptrs_arr;
// ... never cleaned up!
}
Functions like couldsee() used these stale references from previous tests.
Fix: Clear module-level state after use:
function do_clear_area(instance) {
viz_clear = instance.viz_clear;
right_ptrs_arr = instance.right_ptrs_arr;
left_ptrs_arr = instance.left_ptrs_arr;
// ... do work ...
// Clean up module state
viz_clear = null;
right_ptrs_arr = null;
left_ptrs_arr = null;
}
Identity Counter Leak (commit 05cbd1b)
Symptom: Determinism tests failed - regenerating same seed produced different results.
Root cause: mkobj.js module-level counter never reset:
let _identCounter = 0; // Monotonic ID counter for objects/monsters
export function next_ident() {
const res = _identCounter;
_identCounter += rnd(2); // Consumes RNG!
if (_identCounter === 0) _identCounter = rnd(2) + 1;
return res;
}
Second test run started with _identCounter at 800+ instead of 0, consuming RNG
calls at different times.
Fix: Reset in initialization:
export function resetIdentCounter() { _identCounter = 0; }
// Called from init_objects() in o_init.js
export function init_objects() {
initObjectData();
resetIdentCounter(); // Reset for each generation
randomize_gem_colors();
shuffle_all();
// ...
}
Test Structure Fix (commit 05cbd1b)
Symptom: “Cannot read properties of undefined (reading ‘grid’)” when running with test filters.
Root cause: Map generation in it() test that Node’s test runner could skip:
let result;
it('generate maps', () => {
result = generateMapsWithRng(seed, maxDepth); // Skipped when filtered!
});
it('check depth 1 typGrid', () => {
expect(result.grids[1]).toEqual(...); // result is undefined!
});
Fix: Use before() hook which always runs:
let result;
before(() => {
// Always runs, even when individual tests filtered
result = generateMapsWithRng(seed, maxDepth);
});
it('check depth 1 typGrid', () => {
expect(result.grids[1]).toEqual(...); // result exists
});
Impact of test isolation fixes: +134 tests passing (1097 → 1231)
5. The Statue Bug
The Investigation (2026-02-09, ~5 hours)
Depth 3 diverged at RNG call 1094:
- JS:
rn2(3)=2 - C:
rn2(2)=0 @ rndmonst_adj(makemon.c:1712)
Calls 0-1093 matched perfectly. Then complete divergence.
Hypothesis 1: somexyspace returns null
Looking at fill_ordinary_room statue placement:
if (!rn2(20)) { // Statue check (1/20 chance)
const pos = somexyspace(map, croom); // Find position
if (pos) {
const statue = mksobj(STATUE, true, false); // Create statue
}
}
Theory: JS’s somexyspace returns null while C’s succeeds, so JS never calls mksobj.
Evidence against: Debug logging showed both consumed identical RNG:
- JS[1091]:
rn2(7)=2(somexyspace somex) - JS[1092]:
rn2(3)=1(somexyspace somey) - JS[1093]:
rnd(2)=1(next_ident - object creation!)
If somexyspace failed, it would retry with more RNG calls. It didn’t.
Hypothesis 2: mksobj doesn’t call rndmonnum
Theory: JS’s mksobj is broken and doesn’t select a monster for the statue.
Testing:
// Isolated test
const statue = mksobj(STATUE, true, false);
// Result: 11 RNG calls consumed including rndmonst_adj!
// JS[198]: rnd(2)=2 @ next_ident
// JS[199]: rn2(3)=2 @ rndmonst_adj
// JS[200]: rn2(4)=3 @ rndmonst_adj
// ...
Evidence against: mksobj works perfectly in isolation. The bug is elsewhere.
The Breakthrough
Comparing isolated test vs actual game:
- Isolated test at depth 3: starts with
rn2(3)inrndmonst_adj - C at depth 3: starts with
rn2(2)inrndmonst_adj
Why different? Different monster pools! Let me check what depth rndmonnum uses…
// js/makemon.js
export function rndmonnum(depth) {
return rndmonst_adj(0, 0, depth || 1); // defaults to 1!
}
// js/mkobj.js - mksobj_init
case ROCK_CLASS:
if (od.name === 'statue') {
obj.corpsenm = rndmonnum(); // ← NO DEPTH PARAMETER!
}
Root cause found: Statue creation calls rndmonnum() with no depth, defaulting
to depth=1. At depth=1, the monster pool is smaller (fewer eligible monsters at
low difficulty). This produces a different RNG sequence.
The Fix (commit d683995)
One line change:
case ROCK_CLASS:
if (od.name === 'statue') {
- obj.corpsenm = rndmonnum();
+ obj.corpsenm = rndmonnum(_levelDepth);
}
The _levelDepth module variable tracks current depth. Just pass it through.
The Result
Before fix:
Last matching call: 1093
First divergence at: 1094
JS[1094]: rn2(3)=2
C[1102]: rn2(2)=0 @ rndmonst_adj
After fix:
Last matching call: 1094
First divergence at: 1095
JS[1094]: rn2(2)=0 ← NOW MATCHES!
JS[1095]: rn2(5)=1
C[1103]: rn2(5)=1 @ rndmonst_adj
The “divergence” at 1095 is just formatting (JS logs don’t include source location). The function calls and values are identical.
Impact: +2 tests passing (1231 → 1233), seed163 depth 3 achieves 100% RNG alignment
6. Remaining Issues
seed16 depth 3
Still shows divergence after statue fix. Different issue than seed163. Needs separate investigation.
seed1 level 2 (interface-4i6)
Symptom: Steps 0-66 all pass (descent + level 2 generation RNG matches perfectly). Step 67 fails with completely different level layout.
Analysis:
- C: 3 grid bugs, 1 jackal, all in regular rooms (rtype=0)
- JS: large mimic, shopkeeper, grid bug, kobold zombie, with shop (rtype=14)
This is an algorithmic divergence, not RNG mismatch. The same RNG sequence is interpreted differently:
- Same room generation decisions
- Different room types assigned
- Different monsters placed
Likely: theme room selection logic differs, or shop placement logic differs.
Needs investigation of themeroom assignment and shop creation.
Depths 4-5
Blocked by depth 3 issues. Now unblocked for most seeds. Need to investigate remaining divergences.
7. Final Result
Test Results (as of 2026-02-09):
- 1233 passing, 127 failing (90.7% pass rate)
- Baseline: 1090 pass, 74 fail (93.6%)
- Gain: +143 tests (+13%)
Map Generation Alignment:
- Depth 1: 100% alignment, all seeds ✓
- Depth 2: 100% alignment, all seeds ✓ (2476/2476 RNG calls match)
- Depth 3: 100% alignment for seed163 ✓, seed16 partial
- Depth 4-5: In progress (unblocked by statue fix)
Determinism: All 5 map sessions pass determinism validation (regenerating same seed produces identical results).
Gameplay: seed2_wizard_fountains (37 steps) and seed42 (12 steps) fully passing.
8. Lessons Learned
Test Infrastructure is Critical
Without proper test isolation, you’re debugging phantoms. The vision state leak made tests fail mysteriously depending on run order. Hours wasted debugging the wrong problem.
Rule: Module-level mutable state must be reset between tests.
Determinism Tests Catch State Leaks
The identity counter leak was invisible in single test runs. Only the determinism test (regenerate same seed twice, compare results) caught it.
Rule: Always test that seed N produces identical results on runs 1 and 2.
Default Parameters Hide Bugs
depth || 1 seemed reasonable for a default, but it was wrong. The bug lived
silently for months because depth 1 testing didn’t expose it.
Rule: Be suspicious of defaults. Pass explicit parameters.
Binary Search is Your Best Friend
With 2400+ RNG calls per level, linear search is hopeless. Binary search finds the exact divergence call in seconds:
// Check if calls 0-mid all match
// If yes, search mid+1 to end
// If no, search 0 to mid-1
Rule: Build binary search into your debugging tools from day one.
RNG Traces are Gold
Without C’s annotated RNG logs showing function name and source location for every call, debugging would be impossible. You’d just see “numbers don’t match” with no clue why.
Rule: Invest in trace infrastructure early. It pays for itself 100x over.
One-Line Bugs Can Take Days
The statue bug was literally one missing parameter. Finding it took 5 hours of investigation:
- Failed hypotheses (somexyspace returns null)
- Failed debugging (logging infrastructure issues)
- Isolated testing (proved mksobj worked)
- Comparison analysis (found depth difference)
- Root cause (missing parameter)
Rule: Persistence pays off. Keep eliminating hypotheses until you find the truth.
9. The Art of Binary Search Debugging
When you have 2400 RNG calls and one is wrong, manual inspection is hopeless. Binary search is mandatory.
The Algorithm
Given two RNG traces (JS and C, filtered to remove composite/midlog entries):
let left = 0, right = jsRng.length - 1, lastMatch = -1;
while (left <= right) {
const mid = Math.floor((left + right) / 2);
// Check if calls 0 to mid all match
let allMatch = true;
for (let i = 0; i <= mid; i++) {
if (jsRng[i] !== cRng[i]) {
allMatch = false;
break;
}
}
if (allMatch) {
lastMatch = mid;
left = mid + 1; // Search upper half
} else {
right = mid - 1; // Search lower half
}
}
console.log('Last match:', lastMatch);
console.log('First divergence:', lastMatch + 1);
Result: Pinpoints exact divergence in O(log N) comparisons.
The Context Window
Once you find the divergence point, show context:
console.log('Last 3 matching calls:');
for (let i = lastMatch - 2; i <= lastMatch; i++) {
console.log(`JS[${i}]: ${jsRng[i]}`);
}
console.log('\nFirst diverging call:');
console.log(`JS[${lastMatch + 1}]: ${jsRng[lastMatch + 1]}`);
console.log(`C[${lastMatch + 1}]: ${cRng[lastMatch + 1]}`);
console.log('\nNext 5 calls in each:');
for (let i = lastMatch + 1; i < lastMatch + 6; i++) {
console.log(`JS[${i}]: ${jsRng[i]}`);
console.log(`C[${i}]: ${cRng[i]}`);
console.log();
}
This shows:
- What matched just before (hints at context)
- The exact divergence (the bug!)
- What happens after (hints at impact)
Real Example: The Statue Bug
Last matching calls:
JS[1091]: rn2(7)=2
JS[1092]: rn2(3)=1
JS[1093]: rnd(2)=1
First diverging call:
JS[1094]: rn2(3)=2
C[1102]: rn2(2)=0 @ rndmonst_adj(makemon.c:1712)
Analysis:
rnd(2)=1isnext_ident(object creation)- C’s next call is
rndmonst_adj(monster selection for statue) - JS calls
rn2(3)instead (wrong function!) - C’s
rndmonst_adjstarts withrn2(2)(depth 3 monster pool) - JS’s would start with
rn2(3)(depth 1 monster pool)
Conclusion: JS used wrong depth, producing different monster pool, different RNG sequence.
The source annotations (@ rndmonst_adj(makemon.c:1712)) were the key. Without them,
you’d just see “rn2(3) != rn2(2)” with no idea what code was involved.
Looking Forward
Phase 3 established:
- Reliable test infrastructure with proper isolation
- Deterministic replay across multiple test runs
- Multi-depth map generation working through depth 3
- Binary search debugging methodology
- 90.7% test pass rate
Next challenges:
- Extend to depths 4-5 (now unblocked)
- Fix seed1 level 2 algorithmic divergence (shop/theme room logic)
- Achieve 100% alignment across all depths for all seeds
The foundation is solid. The methodology works. The hard problems are behind us. Now it’s “just” a matter of finding and fixing the remaining edge cases.
One bug at a time, one depth at a time, one RNG call at a time.