---
title: ContextPrune
emoji: 🧹
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# ContextPrune: Adaptive Context Garbage Collection for RAG

ContextPrune is a benchmark environment designed to solve the "Attention Dilution" problem in Large Language Model (LLM) workflows. It treats context management as a form of Garbage Collection, where the system identifies, filters, and compresses information to maintain high signal-to-noise ratios in RAG pipelines.


## 1. System Overview

In standard RAG, retrieval often returns too much irrelevant data, causing models to "lose the signal" or hallucinate. ContextPrune provides a Reinforcement Learning (RL) environment where agents are trained to surgically manage their context window.

### Architecture Flow

```mermaid
graph TD
    A[User / Agent] -->|Execute Actions| B[FastAPI / Streamlit Interface]
    B -->|RagAction| C[ContextPrune Environment]
    C -->|Update State| D[State Machine]
    D -->|Token Budgeting| E[Context Working Set]
    D -->|Hybrid Retrieval| F[Corpus Search]
    C -->|Terminal Action| G[Deterministic Grader]
    G -->|Weighted Reward| A
```

## 2. Methodology: The Operational Loop

ContextPrune enforces a five-stage workflow that mirrors enterprise incident response.

| Stage | Action | Rationale |
|---|---|---|
| Triage | `inspect_artifact` | Low-cost preview of artifact keywords and domains to filter out "garbage" early. |
| Analysis | `prioritize_artifact` | Commits specific evidence to the working set; consumes token budget. |
| Optimization | `summarize_artifact` | AI-driven compression that reduces token footprint while attempting to preserve "grounding" tokens. |
| Resolution | `set_resolution_plan` | Forces the agent to internalize the evidence into a logical plan before producing an output. |
| Submission | `submit_report` | Terminates the episode; the output must be grounded exclusively in the working set. |
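The staged loop above can be sketched as a small state machine. The transition rules below are an illustrative reconstruction from the table, not the environment's actual gating logic; in particular, mapping `summarize_artifact` back into the analysis stage is an assumption.

```python
# Illustrative sketch of the staged workflow. The mapping of
# (stage, action) pairs to transitions is an assumption based on
# the table above, not the environment's implementation.
TRANSITIONS = {
    ("triage", "inspect_artifact"): "analysis",
    ("analysis", "prioritize_artifact"): "analysis",
    ("analysis", "summarize_artifact"): "analysis",
    ("analysis", "set_resolution_plan"): "resolution",
    ("resolution", "submit_report"): "submitted",
}

def next_stage(stage: str, action: str) -> str:
    """Advance the workflow stage; unknown pairs keep the current stage."""
    return TRANSITIONS.get((stage, action), stage)
```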

## 3. Observation Space

The `RagObservation` provides the agent with the internal state of the incident and the current working-set budget.

| Field | Type | Description |
|---|---|---|
| `case_id` | `str` | Unique simulated case identifier |
| `case_summary` | `str` | Real-world case context and background |
| `objective` | `str` | Specific deliverable the agent must produce |
| `workflow_stage` | `triage \| analysis \| resolution \| submitted` | Current stage in the operational loop |
| `customer_tier` | `standard \| business \| enterprise` | Customer criticality and SLA priority |
| `incident_severity` | `sev3 \| sev2 \| sev1` | Impact magnitude of the incident |
| `available_artifacts` | `List[ChunkSummary]` | Metadata for artifacts available for inspection |
| `reviewed_artifacts` | `List[str]` | IDs of artifacts already triaged |
| `prioritized_artifacts` | `List[str]` | IDs of artifacts currently in the working set |
| `plan_draft` | `Optional[str]` | Current state of the resolution plan |
| `total_tokens_used` | `int` | Current token cost of the working set |
| `token_budget` | `int` | Maximum allowed token budget |
| `step_number` | `int` | Current step index in the episode |
| `task_name` | `str` | Name of the active benchmark task |
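The fields above can be mirrored in a small dataclass for client-side bookkeeping. This is a best-effort sketch of the schema, not the environment's own `RagObservation` definition; `ChunkSummary` entries are represented here as plain dicts for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RagObservation:
    """Client-side mirror of the observation table (sketch only)."""
    case_id: str
    case_summary: str
    objective: str
    workflow_stage: str   # "triage" | "analysis" | "resolution" | "submitted"
    customer_tier: str    # "standard" | "business" | "enterprise"
    incident_severity: str  # "sev3" | "sev2" | "sev1"
    available_artifacts: List[dict] = field(default_factory=list)  # ChunkSummary metadata
    reviewed_artifacts: List[str] = field(default_factory=list)
    prioritized_artifacts: List[str] = field(default_factory=list)
    plan_draft: Optional[str] = None
    total_tokens_used: int = 0
    token_budget: int = 0
    step_number: int = 0
    task_name: str = ""

    def budget_remaining(self) -> int:
        """Tokens still available before the working set hits its cap."""
        return self.token_budget - self.total_tokens_used
```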

## 4. Action Space

Agents interact with the environment through the following canonical actions:

| Action Type | Parameters | Effect |
|---|---|---|
| `inspect_artifact` | `artifact_id` | Review artifact keywords without committing to the working set |
| `prioritize_artifact` | `artifact_id` | Add a reviewed artifact to the working set (consumes tokens) |
| `summarize_artifact` | `artifact_id`, `compression_ratio` | Compress a prioritized artifact using AI summarization |
| `set_resolution_plan` | `plan` | Update the draft plan before final submission |
| `submit_report` | `answer` | Generate the final response and terminate the episode |
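A client can sanity-check payloads against the parameter table before sending them. The dict-based action shape below is an assumption for illustration; the environment's real `RagAction` schema may differ.

```python
# Required parameters per action type, taken from the table above.
REQUIRED_PARAMS = {
    "inspect_artifact": {"artifact_id"},
    "prioritize_artifact": {"artifact_id"},
    "summarize_artifact": {"artifact_id", "compression_ratio"},
    "set_resolution_plan": {"plan"},
    "submit_report": {"answer"},
}

def validate_action(action: dict) -> bool:
    """Return True if the action names a known type and carries
    every required parameter for that type."""
    required = REQUIRED_PARAMS.get(action.get("action_type"))
    if required is None:
        return False
    return required <= action.keys()
```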

## 5. Reward Engineering (The Benchmarking Grader)

The environment calculates a weighted score (0.0–1.0) based on eight distinct metrics.

- **Required Coverage (24%):** Inclusion of critical "Gold" artifacts.
- **Cross-Domain Variety (12%):** Rewards correlation across Support, Incident logs, and Release guardrails.
- **Triage Thoroughness (12%):** Penalizes skipping the inspection phase.
- **Planning Logic (16%):** Alignment between the drafted plan and ground-truth steps.
- **Reporting Accuracy (18%):** Presence of mission-critical operational keywords.
- **Citation Fidelity (10%):** Verification that claimed evidence is in the working set.
- **Token Efficiency (8%):** Scaled bonus for minimal context usage.
- **Hallucination Penalty (−18%):** Severe deduction for unsupported claims.
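Under these weights, a linear aggregation of component scores (each in [0, 1]) looks like the sketch below. The weights come from the list above; the aggregation formula itself, including how the hallucination penalty is applied and clamped, is an illustrative assumption rather than the grader's actual code.

```python
def grade_episode(scores: dict, hallucination: float = 0.0) -> float:
    """Combine the eight metrics into one 0.0-1.0 reward.

    `scores` maps metric name -> component score in [0, 1];
    `hallucination` in [0, 1] scales the -18% penalty.
    """
    WEIGHTS = {
        "required_coverage": 0.24,
        "cross_domain_variety": 0.12,
        "triage_thoroughness": 0.12,
        "planning_logic": 0.16,
        "reporting_accuracy": 0.18,
        "citation_fidelity": 0.10,
        "token_efficiency": 0.08,
    }
    total = sum(w * scores.get(k, 0.0) for k, w in WEIGHTS.items())
    total -= 0.18 * hallucination
    return max(0.0, min(1.0, total))  # clamp to the 0.0-1.0 range
```

Note that the seven positive weights sum to exactly 1.0, so a perfect, hallucination-free episode scores 1.0.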

## 6. Scenario Benchmarks

| Task | Difficulty | Steps | Budget | Key Challenge |
|---|---|---|---|---|
| `refund_triage_easy` | Easy | 7 | 850 | Systematically checking policy artifacts before granting relief. |
| `cross_function_brief_medium` | Medium | 8 | 620 | Filtering overlapping narratives into a single source of truth. |
| `executive_escalation_hard` | Hard | 10 | 360 | Correlating suspicious logs with release freezes on a tight budget. |

## 7. Configuration & Environment

### Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| `API_BASE_URL` | `https://router.huggingface.co/v1` | OpenAI-compatible inference endpoint |
| `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model used for baseline tasks |
| `HF_TOKEN` | None | Authentication for the Hugging Face Inference API |
| `RAG_ENV_URL` | `http://localhost:7860` | Base URL for the ContextPrune server |
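These variables can be resolved with standard-library lookups and the defaults from the table. `load_config` below is an illustrative helper, not part of the project's code.

```python
import os

def load_config(env=None) -> dict:
    """Resolve ContextPrune settings, falling back to the defaults
    in the table above when a variable is unset."""
    env = os.environ if env is None else env
    return {
        "API_BASE_URL": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "MODEL_NAME": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "HF_TOKEN": env.get("HF_TOKEN"),  # None when not set
        "RAG_ENV_URL": env.get("RAG_ENV_URL", "http://localhost:7860"),
    }
```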

### Project Components

- `rag_optimizer_env/`: State machine, hybrid retrieval, and token estimation.
- `app.py`: FastAPI implementation for remote agent interaction.
- `inference.py`: Baseline agent script (OpenAI-compatible).
- `validate.py`: Robust validation suite for episode-lifecycle verification.

## 🚀 Quick Start

1. **Setup:** `pip install -r requirements.txt`
2. **Server:** `python app.py` (runs on port 7860)
3. **Control Panel:** `streamlit run optimizer_ui.py`
4. **Validation:** `python validate.py`

## 🌎 Live Deployment

Built for Context Optimization Research.