Future Improvements
Source of truth: ReplicaLab_Comprehensive_Task_Division.md
This document tracks post-MVP architectural improvements. Work here begins only after the core logic is complete and the hackathon deliverables are stable.
1. Domain-Agnostic Normalized Scenario Layer
Priority: highest future feature
Problem
The current models in replicalab/models.py use domain-biased field names:
- `paper_title`, `paper_hypothesis`, `paper_method`, `paper_key_finding`
- `equipment_available`, `reagents_in_stock`, `staff_count`
- `sample_size`, `controls`, `technique`
These work for the three MVP scenario families (cell biology, ML benchmark, behavioral psychology) because all three map onto a lab-style replication frame. But if the environment needs to support domains outside scientific replication (e.g., engineering design, clinical trial planning, supply chain optimization), the field names stop making sense.
The turn protocol itself (propose, revise, request_info, accept) is
already generic. The gap is in the observation and protocol content layer.
Solution: normalized scenario representation
Introduce a structured internal representation that any domain adapter can emit:
```python
from typing import Any
from pydantic import BaseModel

class NormalizedScenarioPack(BaseModel):
    domain_id: str                 # "cell_biology", "ml_benchmark", etc.
    task_summary: str              # what the agent is trying to achieve
    success_criteria: list[str]    # measurable conditions for success
    constraints: list[Constraint]  # budget, time, equipment, policy, etc.
    resources: list[Resource]      # what is available to work with
    allowed_substitutions: list[Substitution]  # valid swaps the agent can propose
    hidden_reference_spec: dict    # ground truth the judge scores against
    difficulty: str                # "easy", "medium", "hard"
    metadata: dict                 # domain-specific extras
```
Where:
```python
from typing import Any, Optional
from pydantic import BaseModel

class Constraint(BaseModel):
    dimension: str     # "budget", "time", "equipment", "personnel", "safety"
    label: str         # human-readable name
    value: Any         # the constraint value (numeric, list, etc.)
    hard: bool = True  # hard constraint vs soft preference

class Resource(BaseModel):
    category: str    # "equipment", "reagent", "compute", "personnel"
    name: str        # resource identifier
    available: bool  # currently available
    quantity: Optional[int] = None  # count if applicable
    notes: str = ""  # booking conflicts, expiry, etc.

class Substitution(BaseModel):
    original: str          # what the reference spec uses
    replacement: str       # what the agent can use instead
    quality_impact: float  # 0.0 to 1.0, how much fidelity is lost
    cost_delta: float      # cost difference
```
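For concreteness, here is a runnable sketch that instantiates a small pack. Stdlib dataclasses stand in for the Pydantic models above, and every value (paper title, budget figure, substitution pair) is invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Dataclass stand-ins mirroring the Pydantic schema above (illustration only).
@dataclass
class Constraint:
    dimension: str
    label: str
    value: Any
    hard: bool = True

@dataclass
class Resource:
    category: str
    name: str
    available: bool
    quantity: Optional[int] = None
    notes: str = ""

@dataclass
class Substitution:
    original: str
    replacement: str
    quality_impact: float
    cost_delta: float

@dataclass
class NormalizedScenarioPack:
    domain_id: str
    task_summary: str
    success_criteria: list[str]
    constraints: list[Constraint]
    resources: list[Resource]
    allowed_substitutions: list[Substitution]
    hidden_reference_spec: dict
    difficulty: str
    metadata: dict = field(default_factory=dict)

# A minimal cell-biology pack with invented values.
pack = NormalizedScenarioPack(
    domain_id="cell_biology",
    task_summary="Replicate the paper's Western blot quantification",
    success_criteria=["Effect direction matches the original finding"],
    constraints=[Constraint("budget", "Total budget (USD)", 5000)],
    resources=[Resource("equipment", "centrifuge", True, quantity=1)],
    allowed_substitutions=[Substitution("antibody_A", "antibody_B", 0.1, -120.0)],
    hidden_reference_spec={"technique": "western_blot", "sample_size": 6},
    difficulty="easy",
)
print(pack.constraints[0].hard)  # True: constraints default to hard
```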
Architecture principle
```
Domain template
  -> Scenario adapter (thin mapper, <50 lines per domain)
  -> NormalizedScenarioPack
  -> Observation mapper (fills ScientistObservation / LabManagerObservation)
  -> Prompt assembler (data-driven, not hard-coded)
  -> Validator (checks action against constraints)
  -> Scorer (compares final protocol against hidden_reference_spec)
```
The external contract (ScientistAction, LabManagerAction,
ScientistObservation, LabManagerObservation, StepResult) stays
unchanged. The normalization lives below those models as an internal
implementation layer.
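The "Validator" stage of the pipeline can be sketched as a pure function over the pack's `constraints[]`. This is a hedged illustration with plain dicts; the action field names (`cost`, `duration_days`) and the function name are assumptions, not the real API:

```python
def validate_action(action: dict, constraints: list[dict]) -> list[str]:
    """Return a list of hard-constraint violations (empty means the action is valid)."""
    violations = []
    for c in constraints:
        if not c.get("hard", True):
            continue  # soft preferences never block an action
        if c["dimension"] == "budget" and action.get("cost", 0) > c["value"]:
            violations.append(f"{c['label']}: cost {action['cost']} exceeds {c['value']}")
        elif c["dimension"] == "time" and action.get("duration_days", 0) > c["value"]:
            violations.append(f"{c['label']}: duration exceeds {c['value']} days")
    return violations

constraints = [
    {"dimension": "budget", "label": "Total budget", "value": 5000, "hard": True},
    {"dimension": "time", "label": "Deadline", "value": 30, "hard": True},
]
violations = validate_action({"cost": 6200, "duration_days": 21}, constraints)
print(violations)  # one budget violation, no time violation
```

Because the validator only reads `dimension`, `value`, and `hard`, it needs no knowledge of which domain produced the pack.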
LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.
How this affects the future core logic
| Current component | Impact | Severity |
|---|---|---|
| `replicalab/models.py` | External contract unchanged. Add `NormalizedScenarioPack` and helper models as new classes | Low |
| `replicalab/scenarios/templates.py` (SCN 02) | Must define the normalized schema. `generate_scenario()` returns a pack instead of raw dicts | High |
| `replicalab/scenarios/*.py` (SCN 03-05) | Each domain file becomes a thin scenario adapter that emits a normalized pack | Medium |
| `replicalab/scenarios/templates.py` (SCN 06) | Difficulty scaling becomes mechanical: add/remove constraints, tighten resource limits | Medium, but simpler |
| `replicalab/scenarios/templates.py` (SCN 07) | Constraint generator emits `Constraint` objects instead of ad hoc lab fields | High |
| `replicalab/scenarios/templates.py` (SCN 08) | `hidden_reference_spec` is part of the pack, not a separate hidden structure | Medium |
| `replicalab/utils/validation.py` (MOD 05-06) | Validators read `constraints[]` and `resources[]` from the pack instead of checking lab-specific fields | High |
| `replicalab/scoring/*.py` (JDG 01-04) | Scorers compare the final protocol against `hidden_reference_spec` on normalized dimensions | High |
| `replicalab/env/replicalab_env.py` (ENV 01-07) | `EpisodeState` gains a `scenario_pack` field. Reset populates it from the adapter | Medium |
| `replicalab/agents/scientist_policy.py` (AGT 01-02) | Prompts assembled from scenario pack data, not hard-coded domain text | Medium |
| `replicalab/agents/lab_manager_policy.py` (AGT 05-07) | Feasibility checker reads normalized constraints instead of lab-specific fields | Medium |
| `frontend/` (UI 01+) | Render "constraint cards" and "resource cards" instead of lab-specific panels | Low (future) |
What stays the same
- The turn protocol (`propose`, `revise`, `request_info`, `accept`)
- The reward formula (`10 * rigor * feasibility * fidelity + bonuses - penalties`)
- The external API contract (REST + WebSocket payloads)
- The training loop and RL pipeline
- The deterministic reward principle
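As a sketch, the reward formula above is a pure function of the three scores; the function name and default handling of bonuses and penalties are illustrative:

```python
def episode_reward(rigor: float, feasibility: float, fidelity: float,
                   bonuses: float = 0.0, penalties: float = 0.0) -> float:
    # 10 * rigor * feasibility * fidelity + bonuses - penalties.
    # The multiplicative core means any zero dimension zeroes the base reward.
    return 10 * rigor * feasibility * fidelity + bonuses - penalties

print(episode_reward(0.8, 1.0, 0.75, bonuses=0.5))  # 6.5
```

Keeping this a deterministic function of validated scores (never of raw LLM text) is what makes the reward reproducible across runs.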
2. Planned work items for the normalized scenario layer
Item 1: Define the normalized scenario schema
What: Add NormalizedScenarioPack, Constraint, Resource, and
Substitution as Pydantic models in a new file
replicalab/scenarios/schema.py.
Why: This is the foundation. Every other item depends on having a stable schema that all adapters, validators, and scorers agree on.
Depends on: Core MVP scenario work (SCN 02-09) being complete so we know what fields the adapters actually need.
Scope: ~80 lines of model definitions, no business logic.
Item 2: Convert existing scenario templates into adapters
What: Refactor cell_biology.py, ml_benchmark.py, and
behavioral_psych.py so each one returns a NormalizedScenarioPack instead
of raw domain-specific dicts.
Why: Proves the schema works for all three MVP domains. If a field cannot be cleanly mapped, the schema needs revision before adding new domains.
Depends on: Item 1 (schema exists), SCN 03-05 (domain templates exist).
Scope: ~50 lines per adapter. These should be thin mappers; if an adapter exceeds 50 lines, the schema is wrong.
Constraint: The existing observation fields (paper_title,
equipment_available, etc.) must still be populated. The adapter fills
both the normalized pack and the legacy observation slots until the
observation models are generalized.
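A thin adapter of the kind Item 2 describes might look like the following. This is a hedged sketch using plain dicts: the template field names beyond those quoted in this document (`budget`, `technique`, `difficulty`) and the tuple return shape are invented for illustration:

```python
def cell_biology_adapter(template: dict) -> tuple[dict, dict]:
    """Map a raw cell-biology template onto a normalized pack,
    while still filling the legacy observation slots."""
    pack = {
        "domain_id": "cell_biology",
        "task_summary": f"Replicate: {template['paper_title']}",
        "success_criteria": [template["paper_key_finding"]],
        "constraints": [
            {"dimension": "budget", "label": "Total budget",
             "value": template["budget"], "hard": True},
        ],
        "resources": [
            {"category": "equipment", "name": name, "available": True,
             "quantity": None, "notes": ""}
            for name in template["equipment_available"]
        ],
        "allowed_substitutions": [],
        "hidden_reference_spec": {"technique": template["technique"]},
        "difficulty": template.get("difficulty", "medium"),
        "metadata": {},
    }
    # Legacy slots stay populated until the observation models are generalized.
    legacy = {
        "paper_title": template["paper_title"],
        "equipment_available": template["equipment_available"],
    }
    return pack, legacy

pack, legacy = cell_biology_adapter({
    "paper_title": "Protein X regulates pathway Y",
    "paper_key_finding": "Knockdown reduces expression by 40%",
    "budget": 5000,
    "equipment_available": ["centrifuge", "microscope"],
    "technique": "western_blot",
})
print(pack["domain_id"], len(pack["resources"]))  # cell_biology 2
```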
Item 3: Build data-driven prompt assembly
What: Replace hard-coded prompt text with a template that assembles from the scenario pack:
```
You are a {role} working on: {task_summary}

Success criteria:
{success_criteria[]}

You must work within these constraints:
{constraints[].label}: {constraints[].value}

Available resources:
{resources[].name} ({resources[].category}): {available/unavailable}
```
Why: Makes AGT 01 (Scientist prompt) and AGT 07 (Lab Manager templates) domain-neutral. Adding a new domain requires only a new adapter, not new prompts.
Depends on: Item 2 (adapters produce normalized packs), AGT 01 and AGT 07 existing in their MVP form.
Scope: One prompt template function per role. ~40 lines each.
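One such template function might be sketched as follows; the function name is an assumption, and the pack is shown as a plain dict for self-containment:

```python
def assemble_prompt(role: str, pack: dict) -> str:
    """Data-driven prompt assembly: every line comes from the scenario pack."""
    lines = [f"You are a {role} working on: {pack['task_summary']}",
             "", "Success criteria:"]
    lines += [f"- {c}" for c in pack["success_criteria"]]
    lines += ["", "You must work within these constraints:"]
    lines += [f"- {c['label']}: {c['value']}" for c in pack["constraints"]]
    lines += ["", "Available resources:"]
    lines += [
        f"- {r['name']} ({r['category']}): "
        + ("available" if r["available"] else "unavailable")
        for r in pack["resources"]
    ]
    return "\n".join(lines)

pack = {
    "task_summary": "Replicate a Western blot result",
    "success_criteria": ["Effect direction matches"],
    "constraints": [{"label": "Total budget", "value": 5000}],
    "resources": [{"name": "centrifuge", "category": "equipment", "available": True}],
}
print(assemble_prompt("Scientist", pack))
```

Because the function never mentions a domain, a new adapter is all a new domain needs.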
Item 4: Hybrid LLM Lab Manager with deterministic post-checking
What: Replace the rule-based Lab Manager with a hybrid architecture:
- The LLM receives the `LabManagerObservation` and generates negotiation text plus alternative suggestions in natural language
- A deterministic constraint checker computes the real feasibility flags by reading the normalized scenario pack's `constraints[]` and `resources[]`
- A composer merges the LLM output with the checker output into a valid `LabManagerAction`
- The `model_validator` on `LabManagerAction` catches any inconsistency
Why: Gives the Lab Manager realistic negotiation language and creative suggestions (the LLM's strength) while keeping feasibility flags truthful (the checker's strength). Training reward stays deterministic because the reward engine only reads the validated action, not the LLM's raw text.
Depends on: Item 2 (checker needs normalized constraints), AGT 05 (feasibility checker exists), MOD 02 (LabManagerAction validators exist).
Scope: ~120 lines. The LLM call, the checker, the composer. Uses the same base model as the Scientist (Qwen3-4B) with a separate role adapter.
Risk: Episode variance increases because the same seed may produce
different negotiation paths. Mitigate by keeping the deterministic checker as
the authority on all boolean flags. The LLM only controls explanation text
and suggestion ideas, never the truth flags.
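The compose-and-override step can be sketched as below. This is an illustration with invented names and dict-shaped actions: the checker owns every boolean flag, and the merge order guarantees the LLM can never override them:

```python
def check_feasibility(protocol: dict, constraints: list[dict]) -> dict:
    """Deterministic checker: the sole authority on boolean flags."""
    flags = {"within_budget": True}
    for c in constraints:
        if c["dimension"] == "budget" and c.get("hard", True) \
                and protocol["cost"] > c["value"]:
            flags["within_budget"] = False
    flags["feasible"] = all(flags.values())
    return flags

def compose_action(llm_output: dict, flags: dict) -> dict:
    """Merge LLM text with checker flags; the checker always wins."""
    return {
        "message": llm_output.get("message", ""),          # LLM-owned text
        "suggestions": llm_output.get("suggestions", []),  # LLM-owned ideas
        **flags,  # checker-owned flags, applied last so they cannot be overridden
    }

llm_output = {"message": "That antibody is pricey; consider antibody_B.",
              "suggestions": ["antibody_B"],
              "feasible": True}  # the LLM's (untrusted) opinion is discarded
flags = check_feasibility({"cost": 6200},
                          [{"dimension": "budget", "value": 5000, "hard": True}])
action = compose_action(llm_output, flags)
print(action["feasible"])  # False: the checker overrules the LLM
```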
Item 5: Normalized scoring against hidden reference spec
What: Refactor the scoring engine so score_rigor(),
score_feasibility(), and score_fidelity() compare the final protocol
against hidden_reference_spec from the normalized scenario pack instead of
using domain-specific scoring logic.
Scoring dimensions become:
- Rigor: Does the protocol preserve the success criteria? Compare `protocol.controls` against `hidden_reference_spec.required_controls`, check the sample size ratio, verify statistical validity markers.
- Feasibility: Does the protocol satisfy all hard constraints? Walk `constraints[]` and check each one against the protocol.
- Fidelity: How close is the protocol to the reference spec? Compare technique, duration, equipment, and reagents against `hidden_reference_spec` and compute a similarity score using `allowed_substitutions[]` quality impact.
Why: Makes scoring work for any domain without per-domain scorer code. The domain-specific knowledge lives in the scenario adapter (which defines what the reference spec and constraints are), not in the scoring engine.
Depends on: Item 1 (schema with hidden_reference_spec), Item 2
(adapters populate it), JDG 01-04 (MVP scorers exist to refactor from).
Scope: Refactor of existing scorer files. ~150 lines total across
rigor.py, feasibility.py, fidelity.py.
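The fidelity dimension could be sketched as a per-field comparison that credits allowed substitutions. Following the schema comment that `quality_impact` is fidelity *lost*, a swap contributes `1.0 - quality_impact`; the function signature and dict shapes are assumptions:

```python
def score_fidelity(protocol: dict, reference: dict,
                   substitutions: list[dict]) -> float:
    """Mean per-field similarity against hidden_reference_spec."""
    loss = {(s["original"], s["replacement"]): s["quality_impact"]
            for s in substitutions}
    scores = []
    for field, ref_value in reference.items():
        value = protocol.get(field)
        if value == ref_value:
            scores.append(1.0)                         # exact match
        elif (ref_value, value) in loss:
            scores.append(1.0 - loss[(ref_value, value)])  # allowed swap, partial credit
        else:
            scores.append(0.0)                         # unexplained deviation
    return sum(scores) / len(scores) if scores else 0.0

reference = {"technique": "western_blot", "reagent": "antibody_A"}
protocol = {"technique": "western_blot", "reagent": "antibody_B"}
subs = [{"original": "antibody_A", "replacement": "antibody_B",
         "quality_impact": 0.1}]  # swap loses 10% fidelity
print(score_fidelity(protocol, reference, subs))  # 0.95
```

Nothing here is cell-biology-specific: the adapter decides what the reference fields mean, the scorer only compares them.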
Item 6: Lab Manager orchestrator with specialist subagents
What: Decompose the hybrid Lab Manager into a coordinator that delegates to specialist subagents:
| Subagent | Responsibility |
|---|---|
| Budget agent | Checks cost against remaining budget |
| Scheduling agent | Checks timeline and booking conflicts |
| Equipment agent | Checks equipment availability and substitutions |
| Safety agent | Checks policy and compliance constraints |
| Coordinator | Aggregates subagent outputs into one LabManagerAction |
Externally, the contract is unchanged: one LabManagerAction per turn. The
orchestration is internal.
Why: Stronger multi-agent story for the hackathon track alignment. Demonstrates that the Lab Manager is not a monolithic policy but a team of constraint specialists. Each subagent can be individually tested, improved, or replaced.
Depends on: Item 4 (hybrid Lab Manager works first), Item 2 (normalized constraints are available for each subagent to read).
Scope: Orchestration layer ~200 lines. Each subagent ~40 lines. Total ~400 lines.
Risk: Adds latency (multiple LLM calls or multiple checker passes per turn), orchestration failure handling, and logging complexity. Only pursue after the single hybrid Lab Manager is stable and training is producing results.
Phasing: This is the lowest priority item. Build it only if the MVP is complete, training shows improvement, and there is time remaining before submission.
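If Item 6 is reached, the coordinator's aggregation step might look like the sketch below. Subagent names come from the table above; the report shape (`ok`/`note`), the pack layout, and all values are invented for illustration:

```python
def budget_agent(protocol: dict, pack: dict) -> dict:
    """Specialist: checks cost against remaining budget."""
    limit = pack["constraints"]["budget"]
    ok = protocol["cost"] <= limit
    return {"ok": ok, "note": "" if ok else f"Cost {protocol['cost']} exceeds budget {limit}"}

def scheduling_agent(protocol: dict, pack: dict) -> dict:
    """Specialist: checks timeline against the deadline."""
    limit = pack["constraints"]["deadline_days"]
    ok = protocol["duration_days"] <= limit
    return {"ok": ok, "note": "" if ok else f"Duration exceeds {limit}-day deadline"}

SUBAGENTS = {"budget": budget_agent, "scheduling": scheduling_agent}

def coordinate(protocol: dict, pack: dict) -> dict:
    """Coordinator: one LabManagerAction-shaped result per turn;
    the fan-out to specialists stays internal."""
    reports = {name: agent(protocol, pack) for name, agent in SUBAGENTS.items()}
    return {
        "feasible": all(r["ok"] for r in reports.values()),
        "notes": [r["note"] for r in reports.values() if r["note"]],
    }

pack = {"constraints": {"budget": 5000, "deadline_days": 30}}
action = coordinate({"cost": 6200, "duration_days": 21}, pack)
print(action["feasible"], action["notes"])  # False, with one budget note
```

Each specialist is independently testable, which is the main argument for this decomposition.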
3. Recommended order
| Order | Item | Gate |
|---|---|---|
| 1 | Define normalized scenario schema | After SCN 02-09 complete |
| 2 | Convert templates to adapters | After Item 1 |
| 3 | Data-driven prompt assembly | After Item 2 + AGT 01/07 |
| 4 | Hybrid LLM Lab Manager | After Item 2 + AGT 05 |
| 5 | Normalized scoring | After Item 2 + JDG 01-04 |
| 6 | Lab Manager orchestrator with subagents | After Item 4 stable |
4. Key principle
The external contract stays stable. Internal policy can evolve. LLMs reason and negotiate. They never own truth. Truth lives in the normalized scenario pack and the deterministic scorer.