Tests Map - tests/

365 tests across 18 files. All passing.

Last verified: 2026-03-08

Summary

| File | Tests | What it covers |
| --- | --- | --- |
| test_api_rest_isolation.py | 11 | API 14 REST session isolation and replay separation |
| test_cache.py | 2 | Oracle scenario caching and reuse |
| test_client.py | 24 | TRN 13 reusable client over REST and WebSocket |
| test_config.py | 3 | Shared constants and config consistency |
| test_env.py | 56 | ENV 01-08, ENV 10, ENV 11, OBS 04, JDG 04-05, TST 01-03 |
| test_judge_policy.py | 10 | JDG 11 structured judge audit payload |
| test_lab_manager_policy.py | 37 | AGT 05-07 plus AGT 09 determinism coverage |
| test_models.py | 21 | Action, observation, step, state, and log contracts |
| test_logging.py | 11 | MOD 07 replay persistence and JDG 07 CSV logging helpers |
| test_oracle.py | 5 | Oracle hybrid wrapper, structured parsing, and env reset adapter |
| test_prompts.py | 7 | AGT 10 prompt files and Oracle prompt asset loading |
| test_reward.py | 40 | JDG 01-06, JDG 08, and reward regression coverage |
| test_rollout.py | 12 | TRN 03 rollout worker behavior |
| test_rollout_traces.py | 2 | TRN 04 bounded tool trace aggregation and batched collection |
| test_scenarios.py | 14 | SCN 01-13 scenario generation, determinism, and Oracle scenario adaptation |
| test_scientist_policy.py | 46 | MOD 09, AGT 01-04, AGT 08 |
| test_server.py | 44 | API 01-04, API 06-08, API 13-14, replay audit propagation, and root landing page |
| test_validation.py | 20 | MOD 05-06 semantic validation |
| Total | 365 | |
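
The headline figures can be re-checked against the per-file counts with a few lines of Python. This is only a sketch: the counts are copied by hand from the summary table, and the `summarize` helper is a hypothetical name, not something that exists in the repository.

```python
# Per-file test counts, copied by hand from the summary table.
TEST_COUNTS = {
    "test_api_rest_isolation.py": 11,
    "test_cache.py": 2,
    "test_client.py": 24,
    "test_config.py": 3,
    "test_env.py": 56,
    "test_judge_policy.py": 10,
    "test_lab_manager_policy.py": 37,
    "test_models.py": 21,
    "test_logging.py": 11,
    "test_oracle.py": 5,
    "test_prompts.py": 7,
    "test_reward.py": 40,
    "test_rollout.py": 12,
    "test_rollout_traces.py": 2,
    "test_scenarios.py": 14,
    "test_scientist_policy.py": 46,
    "test_server.py": 44,
    "test_validation.py": 20,
}


def summarize(counts):
    """Return (number of files, total number of tests)."""
    return len(counts), sum(counts.values())


if __name__ == "__main__":
    files, total = summarize(TEST_COUNTS)
    print(f"{total} tests across {files} files")
```

If a row in the table changes, updating the dictionary and re-running the script shows whether the stated total still holds.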

Coverage Notes

  • The environment stack is covered end to end:
    • test_env.py validates reset, step, invalid action, termination, reward integration, deep state snapshots, close/reopen lifecycle behavior, terminal judge-audit propagation, and seeded replay determinism across all scenario families.
  • The API/server stack is covered end to end:
    • test_server.py covers REST reset/step/scenarios, WebSocket session handling, idle timeout cleanup, CORS behavior, and replay audit propagation.
  • The scientist stack is covered end to end:
    • test_scientist_policy.py, test_prompts.py, test_rollout.py, and test_rollout_traces.py together cover prompt construction, observation formatting, parse/retry, baseline policy, rollout collection, and bounded tool trace capture.
  • The judge stack is covered end to end:
    • test_reward.py covers rubric scores and reward math, while test_judge_policy.py covers structured audit payload generation.
  • The Oracle hybrid layer is covered additively:
    • test_oracle.py, test_cache.py, and test_prompts.py cover Oracle scenario generation wrappers, cache reuse, and prompt asset loading without changing the deterministic reward contract.
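
The seeded replay determinism check mentioned for test_env.py can be illustrated with a toy example. Everything here is an assumption made for illustration: `ToyEnv`, its `reset(seed)` and `step(action)` signatures, and the action strings are hypothetical stand-ins, not the real environment API or the actual test code.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an environment; not the real ReplicaLab class."""

    def __init__(self):
        self._rng = None
        self._steps = 0

    def reset(self, seed):
        # All randomness flows from one seeded generator owned by the env.
        self._rng = random.Random(seed)
        self._steps = 0
        return {"budget": self._rng.randint(10, 100)}

    def step(self, action):
        self._steps += 1
        reward = round(self._rng.random(), 6)
        done = self._steps >= 3  # toy termination rule: stop after three steps
        return {"step": self._steps, "last_action": action}, reward, done


def run_episode(seed, actions):
    """Replay a fixed action list and record everything the env returns."""
    env = ToyEnv()
    trace = [env.reset(seed)]
    for action in actions:
        obs, reward, done = env.step(action)
        trace.append((obs, reward, done))
        if done:
            break
    return trace


def test_seeded_replay_is_deterministic():
    actions = ["propose", "revise", "accept"]
    # Same seed and same actions must yield an identical trace.
    assert run_episode(7, actions) == run_episode(7, actions)


if __name__ == "__main__":
    test_seeded_replay_is_deterministic()
    print("replay is deterministic")
```

The design point is that the comparison covers the whole trace (observations, rewards, and termination flags), so any hidden source of nondeterminism shows up as a mismatch.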

Remaining Gaps

| Planned test work | Why it still matters |
| --- | --- |
| TST 09 notebook smoke coverage | Fresh-runtime validation for the judged training notebook |

Task-to-Test Mapping

| Area | Primary test files |
| --- | --- |
| Models and contracts | test_models.py, test_validation.py |
| Scenarios | test_scenarios.py |
| Oracle integration and cache | test_oracle.py, test_cache.py, test_prompts.py |
| Scientist policy | test_scientist_policy.py, test_prompts.py |
| Lab Manager policy | test_lab_manager_policy.py |
| Judge and reward | test_reward.py, test_judge_policy.py |
| Environment | test_env.py |
| API and deployment-facing server behavior | test_server.py |
| Client and training rollouts | test_client.py, test_rollout.py, test_rollout_traces.py |
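
The mapping above can also be used to run only the tests for one area. The sketch below assumes the test files live directly under `tests/`, as the page title suggests; `AREA_TESTS` and `pytest_args` are hypothetical names introduced for illustration.

```python
# Area-to-file mapping, copied by hand from the Task-to-Test Mapping table.
AREA_TESTS = {
    "Models and contracts": ["test_models.py", "test_validation.py"],
    "Scenarios": ["test_scenarios.py"],
    "Oracle integration and cache": ["test_oracle.py", "test_cache.py", "test_prompts.py"],
    "Scientist policy": ["test_scientist_policy.py", "test_prompts.py"],
    "Lab Manager policy": ["test_lab_manager_policy.py"],
    "Judge and reward": ["test_reward.py", "test_judge_policy.py"],
    "Environment": ["test_env.py"],
    "API and deployment-facing server behavior": ["test_server.py"],
    "Client and training rollouts": [
        "test_client.py",
        "test_rollout.py",
        "test_rollout_traces.py",
    ],
}


def pytest_args(area, tests_dir="tests"):
    """Build the list of file paths to pass to pytest for one area."""
    return [f"{tests_dir}/{name}" for name in AREA_TESTS[area]]


if __name__ == "__main__":
    # Print a command line that could be pasted into a shell.
    print("pytest " + " ".join(pytest_args("Judge and reward")))
```

The returned list can be passed to `pytest.main(...)` or joined into a command line, for example `pytest tests/test_reward.py tests/test_judge_policy.py`.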