Architecture
Runtime Topology
Agent / baseline script
-> client.RAGDebugEnv (openenv.core.EnvClient)
-> WebSocket/HTTP to FastAPI app (server/app.py)
-> RAGDebugEnvironment (server/rag_debug_env_environment.py)
-> Corpus artifacts (corpora/<domain>/*)
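A minimal sketch of the agent end of this chain, assuming RAGDebugEnv follows the generic openenv.core.EnvClient reset/step pattern; the constructor argument, result fields, and the choose_action helper are illustrative assumptions, not confirmed API.

```python
# Hypothetical baseline loop (assumed reset/step interface and field names).
from client import RAGDebugEnv  # assumed import path, per the topology above

def choose_action(observation):
    # Placeholder policy: a real baseline would inspect retrieval metrics here.
    raise NotImplementedError

env = RAGDebugEnv(base_url="http://localhost:8000")   # assumed constructor argument
result = env.reset()                                   # initial RAGDebugObservation
for _ in range(10):                                    # _MAX_STEPS per episode
    action = choose_action(result.observation)         # assumed result field
    result = env.step(action)                          # new observation + reward
    if result.done:                                    # assumed terminal flag
        break
```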
Server construction uses openenv.core.env_server.http_server.create_app:
- Environment class: RagDebugEnvironment (aliasing RAGDebugEnvironment)
- Action schema: RAGDebugAction
- Observation schema: RAGDebugObservation
- Registered as env_name="rag_debug_env" with max_concurrent_envs=1 in server/app.py
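A sketch of how server/app.py could wire these pieces together; the keyword argument names for create_app and the server.models import path are assumptions based on the list above, not verified signatures.

```python
# Illustrative wiring only; the exact create_app signature is assumed.
from openenv.core.env_server.http_server import create_app

from server.rag_debug_env_environment import RAGDebugEnvironment as RagDebugEnvironment
from server.models import RAGDebugAction, RAGDebugObservation  # assumed module path

app = create_app(
    env_cls=RagDebugEnvironment,            # environment class (alias of RAGDebugEnvironment)
    action_cls=RAGDebugAction,              # action schema
    observation_cls=RAGDebugObservation,    # observation schema
    env_name="rag_debug_env",
    max_concurrent_envs=1,
)
```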
Core Simulation Contract
The environment does not call a live vector database during episodes.
Episode-time retrieval is simulated from precomputed matrices:
- S_true_{general,medical,legal,code}.npy: query-chunk cosine matrices
- ground_truth.json: relevant chunk IDs (R*) per query
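For concreteness, loading these artifacts looks roughly like the following; the exact file layout under corpora/<domain>/ and the JSON structure are assumptions.

```python
import json
import numpy as np

# Illustrative loader; filenames follow the patterns above, JSON layout is assumed.
domain_dir = "corpora/software"                        # one corpus under corpora/<domain>/
S_true = np.load(f"{domain_dir}/S_true_general.npy")   # (num_queries, num_chunks) cosine scores
with open(f"{domain_dir}/ground_truth.json") as f:
    ground_truth = json.load(f)                        # e.g. query_id -> list of relevant chunk IDs

print(S_true.shape, len(ground_truth))
```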
At reset:
- Load one domain corpus (software, climate, or medical)
- Sample episode queries (5 total per task)
- Slice the full S_true matrices down to the episode query rows
- Sample injected faults
- Build S_faulted via server/fault_math.py
- Return the initial RAGDebugObservation
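A toy, self-contained sketch of that reset flow; the variable names, corpus size, and the simplified fault application are illustrative stand-ins for the real code.

```python
import numpy as np

rng = np.random.default_rng(0)
S_true = rng.random((50, 200))                           # full query-by-chunk cosine matrix
episode_rows = rng.choice(50, size=5, replace=False)     # sample 5 episode queries
S_episode = S_true[episode_rows, :]                      # slice down to episode query rows

faults = ["chunk_too_large", "no_reranking"]             # one sampled fault set (Task 1 example)
noise = rng.normal(0.0, 0.05, S_episode.shape)           # drawn once, reused on recomputation
S_faulted = np.clip(S_episode + noise, 0.0, 1.0)         # stand-in for server/fault_math.py
```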
At step:
- Apply the action to the config/model/rewrite overlay
- Recompute S_faulted when required
- Simulate retrieval (top_k, then threshold)
- Compute per-query coverage/precision and aggregate metrics
- Compute the dense reward (or the terminal submit reward)
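A self-contained toy version of the retrieval simulation and the per-query metrics (not the repo's actual code; the chunk counts, scores, and threshold are made up for illustration).

```python
import numpy as np

S_faulted = np.array([[0.9, 0.2, 0.7, 0.4],    # one row per episode query,
                      [0.1, 0.8, 0.3, 0.6]])   # one column per chunk
relevant = [{0, 2}, {1, 3}]                     # ground-truth chunk IDs (R*) per query
top_k, threshold = 2, 0.5

coverages, precisions = [], []
for q, R in enumerate(relevant):
    ranked = np.argsort(-S_faulted[q])[:top_k]                        # top_k by score...
    retrieved = {c for c in ranked if S_faulted[q, c] >= threshold}   # ...then threshold
    hits = len(retrieved & R)
    coverages.append(hits / len(R))                                   # coverage vs. R
    precisions.append(hits / len(retrieved) if retrieved else 0.0)

coverage, precision = float(np.mean(coverages)), float(np.mean(precisions))
```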
Task Configuration
Values below are sourced from server/constants.py and server/rag_debug_env_environment.py.
Shared limits
- Episode queries: 5 (_N_EPISODE_QUERIES for all tasks)
- Max steps: 10 (_MAX_STEPS)
Task 1 (software)
- Domain: software
- Faults sampled from: [chunk_too_large, no_reranking], [threshold_too_high], [top_k_too_small], [chunk_too_large]
- Success check on submit: task_score >= 0.75
Task 2 (climate)
- Domain: climate
- Faults sampled from: [threshold_too_low, duplicate_flooding], [top_k_too_small, context_overflow], [duplicate_flooding], [context_overflow]
- Success check on submit: task_score >= 0.75
Task 3 (medical)
- Domain: medical
- Fixed fault set: wrong_embedding_model, chunk_too_large, threshold_too_high
- Initial active model is legal (intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit: task_score >= 0.70 and multi_hop_coverage > 0.60
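Taken together, the three submit checks above reduce to a small predicate; the structure below is illustrative, and the real checks live in the environment class.

```python
def submit_success(task_id: int, task_score: float, multi_hop_coverage: float = 0.0) -> bool:
    # Task 1 (software) and Task 2 (climate): single score threshold.
    if task_id in (1, 2):
        return task_score >= 0.75
    # Task 3 (medical): both conditions must hold.
    return task_score >= 0.70 and multi_hop_coverage > 0.60
```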
Reward and Scoring
All rewards are in [0.0, 1.0]. Non-terminal steps span [0.0, ~0.89] based on absolute quality progress toward the success threshold.
Dense step reward (_compute_reward):
- progress_reward: 0.10 + 0.55 × min(1, quality_score / quality_target) → [0.10, 0.65]. Absolute quality-level signal using _quality_score (the task_score formula minus efficiency). Ensures the full reward range is utilised across the episode: low-quality states get low rewards, high-quality states get high rewards.
- delta_bonus: clip(Δquality × 2.0, −0.15, +0.15). Direction signal that distinguishes an improving step from a no-op at the same quality level.
- empty_retrieval_signal: bidirectional, weight ×0.06 (rewards fixing empties too)
- overflow_signal: bidirectional, weight ×0.04 (rewards fixing overflows too)
- step_cost = −0.01
- redundancy_penalty = −0.04 for the same action type twice in a row
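Putting those terms together, the dense step reward looks roughly like the following sketch; the function signature and the quality/signal inputs are stand-ins for the environment's internal fields, with the weights quoted above.

```python
import numpy as np

def dense_reward(quality, prev_quality, quality_target,
                 empty_signal, overflow_signal, repeated_action):
    progress = 0.10 + 0.55 * min(1.0, quality / quality_target)        # [0.10, 0.65]
    delta_bonus = float(np.clip((quality - prev_quality) * 2.0, -0.15, 0.15))
    reward = progress + delta_bonus
    reward += 0.06 * empty_signal       # bidirectional: fixing empty retrievals is rewarded
    reward += 0.04 * overflow_signal    # bidirectional: fixing overflows is rewarded
    reward -= 0.01                      # step_cost
    if repeated_action:
        reward -= 0.04                  # redundancy_penalty for repeating the same action type
    return reward
```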
Submit reward (_apply_action):
- Success: 0.7 + 0.3 × task_score → [0.7, 1.0]
- Failure: 0.2 × task_score → [0.0, 0.2]
Task score (_compute_task_score):
- Task 1/2: 0.60*coverage + 0.25*precision + 0.15*efficiency
- Task 3: 0.55*coverage + 0.25*precision + 0.20*multi_hop_coverage
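The task score and submit reward combine into a short sketch; the signatures are illustrative, while the weights and ranges are the ones quoted above.

```python
def task_score(task_id, coverage, precision, efficiency=0.0, multi_hop_coverage=0.0):
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage

def submit_reward(success: bool, score: float) -> float:
    # Success lands in [0.7, 1.0]; failure lands in [0.0, 0.2].
    return 0.7 + 0.3 * score if success else 0.2 * score
```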
Fault Math (Implemented)
All transformations are in server/fault_math.py.
- CHUNK_TOO_LARGE: 1D uniform filter along the chunk axis; severity scales with chunk_size
- CHUNK_TOO_SMALL: gaussian noise scaled by small chunk size, mitigated by overlap
- THRESHOLD_TOO_LOW: additive gaussian noise
- THRESHOLD_TOO_HIGH: multiplicative score deflation (* 0.55)
- TOP_K_TOO_SMALL: score compression toward 0.5; less severe if reranking is enabled
- DUPLICATE_FLOODING: boosts random duplicate columns; reduced if reranking is enabled
- CONTEXT_OVERFLOW: zeroes tail columns based on context_window_limit
- NO_RERANKING: additive noise only when reranking is off
- WRONG_EMBEDDING_MODEL: implicit, by selecting the wrong matrix (not a direct transform)
- Cross-encoder reranking blend: after all faults, if use_reranking=True, blends the faulted scores back toward the pre-fault scores (alpha=0.35). Simulates a cross-encoder partially recovering the true relevance signal. Non-monotonic for noise-based faults (changes rank order); restores score spread for compression faults.
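A toy illustration of two of these transforms plus the reranking blend; the noise scale is made up, and reading alpha=0.35 as the weight placed on the pre-fault scores is an interpretation of the description above, not verified against fault_math.py.

```python
import numpy as np

rng = np.random.default_rng(0)
S_clean = rng.random((5, 20))                      # pre-fault episode scores

# THRESHOLD_TOO_HIGH: multiplicative score deflation
S_faulted = S_clean * 0.55

# THRESHOLD_TOO_LOW: additive gaussian noise (drawn once at reset)
noise = rng.normal(0.0, 0.05, S_clean.shape)
S_faulted = np.clip(S_faulted + noise, 0.0, 1.0)

# Cross-encoder reranking blend: pull faulted scores back toward the clean ones
use_reranking, alpha = True, 0.35
S_final = (1 - alpha) * S_faulted + alpha * S_clean if use_reranking else S_faulted
```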
Determinism and Fallbacks
- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
- If required corpus files are missing, server/corpus.py falls back to synthetic data and emits warnings.
- The synthetic fallback is for smoke testing only, not for real training/evaluation.
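A sketch of the determinism pattern described above: the random artifacts are drawn once at reset, stored on the episode state, and reused on every recomputation. The class and field names are illustrative, not the repo's actual implementation.

```python
import numpy as np

class EpisodeState:                                     # illustrative, not the real class
    def __init__(self, seed: int, shape: tuple[int, int]):
        rng = np.random.default_rng(seed)
        self.noise = rng.normal(0.0, 0.05, shape)       # sampled once at reset
        self.dup_cols = rng.choice(shape[1], size=3, replace=False)

    def recompute(self, S_true: np.ndarray) -> np.ndarray:
        S = np.clip(S_true + self.noise, 0.0, 1.0)      # same noise on every recompute
        S[:, self.dup_cols] = np.clip(S[:, self.dup_cols] + 0.1, 0.0, 1.0)  # same duplicate boost
        return S
```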