uvpatel7271 committed on
Commit 7c8fa1c · 1 Parent(s): 737f100

Add reward scoring and context-aware code review flow

DEMO_SCRIPT.md CHANGED
@@ -2,11 +2,11 @@
 
 ## 60-90 Second Walkthrough
 
-1. Open the Hugging Face Space and introduce TorchReview Copilot as an AI-powered Python triage assistant built with PyTorch.
-2. Point to the single-sentence problem statement: teams lose time figuring out whether a failure is syntax, logic, or performance related.
-3. Select the `Fix the invoice total syntax regression` example to show the app loading a real broken code sample.
-4. Highlight the **Live Triage Radar** updating immediately, then call out the predicted issue class and repair risk.
-5. Explain that the PyTorch layer uses CodeBERTa embeddings to compare the input against known bug patterns from the OpenEnv task catalog.
-6. Scroll to the repair plan and note that the output is not just a label; it gives a prioritized remediation checklist and the nearest known failure pattern.
-7. Switch to the performance example to show the confidence profile change and emphasize that the system can distinguish runtime bottlenecks from correctness bugs.
-8. Close by noting that OpenEnv still powers deterministic validation under the hood, so the demo stays grounded in measurable task outcomes.
+1. Open the Hugging Face Space and introduce TorchReview Copilot as an AI-powered code review and improvement system built with PyTorch.
+2. Point to the problem statement: manual code review is slow, inconsistent, and hard to scale.
+3. Select the `Fix the invoice total syntax regression` example to show the app loading a broken code sample together with the context window.
+4. Highlight the **Live Triage Radar**, the ML quality score, and the RL-ready reward score.
+5. Explain that the PyTorch layer uses CodeBERTa embeddings to compare the input against known code-quality patterns from the OpenEnv task catalog.
+6. Scroll to the three-step improvement plan and call out the progression: syntax and bug fixes, edge cases, then scalability.
+7. Switch to the performance example to show the confidence profile and reward changing for a different class of issue.
+8. Close by noting that OpenEnv still powers deterministic validation under the hood, so the demo remains grounded in measurable task outcomes.
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 title: TorchReview Copilot
-emoji: torch
+emoji: 🧠
 colorFrom: orange
 colorTo: red
 sdk: docker
@@ -16,7 +16,7 @@ tags:
 
 # TorchReview Copilot
 
-TorchReview Copilot is an **AI-powered Python code triage system using PyTorch** to classify issue type, estimate repair risk, and generate an actionable remediation plan from broken code plus failure output.
+TorchReview Copilot is an **AI-powered code review and improvement system using PyTorch** to analyze Python code, predict quality, generate structured improvement suggestions, and compute an RL-ready reward score.
 
 It upgrades the original OpenEnv hackathon environment into a judge-friendly product demo: a polished Hugging Face Space on top, with the deterministic OpenEnv validation engine still preserved underneath.
 
@@ -35,13 +35,14 @@ That triage step is repetitive, error-prone, and often slows down the actual fix
 
 ## Solution
 
-TorchReview Copilot turns code plus traceback text into a practical triage report:
+TorchReview Copilot turns code, traceback text, and a short context window into a practical code-review report:
 
 - **Issue classification:** syntax, logic, or performance
-- **Repair risk:** low, medium, or high
+- **ML quality score:** predicted code quality from PyTorch embeddings
+- **Reward score:** RL-ready score from model quality, lint quality, and complexity penalty
 - **Live Triage Radar:** confidence visualization for all issue classes
 - **Nearest known pattern:** the closest OpenEnv task match
-- **Fix plan:** prioritized remediation steps for the engineer
+- **Improvement plan:** step 1 syntax/bug fixes, step 2 edge cases, step 3 scalability
 
 The result is a demo that feels like a real AI debugging assistant rather than a backend-only environment.
 
@@ -54,13 +55,13 @@ This project uses **PyTorch for real inference**, not placeholder branching:
 - embeddings are compared against curated OpenEnv issue prototypes
 - the final decision blends model similarity with lightweight static analysis signals
 
-That gives the demo an actual model-backed classification path while keeping it CPU-friendly for Hugging Face Spaces.
+That gives the demo an actual model-backed quality and issue scoring path while keeping it CPU-friendly for Hugging Face Spaces.
 
 ## How It Works
 
 ### Pipeline
 
-`Input code + traceback -> static checks -> PyTorch embeddings -> similarity against issue prototypes -> confidence scores -> repair plan`
+`Input code + context window + traceback -> static checks -> PyTorch embeddings -> quality + issue prediction -> suggestion engine -> reward computation -> UI/API output`
 
 ### Detailed Flow
 
@@ -68,16 +69,28 @@ That gives the demo an actual model-backed classification path while keeping it
 2. TorchReview extracts lightweight static signals:
    - parser success/failure
    - assertion-style test language
-   - performance keywords
-   - nested-loop depth
+   - lint/style issues
+   - nested-loop depth and complexity pressure
 3. CodeBERTa runs through PyTorch to embed the combined input.
-4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog.
+4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog and reference implementations.
 5. The UI returns:
    - top issue label
   - confidence radar
    - repair risk
+   - ML quality score
+   - RL-ready reward score
    - nearest known bug pattern
-   - suggested next action
+   - three-step improvement plan
+
+### Reward Formula
+
+The current reward computation is:
+
+```text
+reward = (0.5 x ML_quality_score) + (0.3 x lint_score) - (0.2 x complexity_penalty)
+```
+
+This keeps the project compatible with OpenEnv-style reinforcement learning workflows.
 
 ## Built-In Demo Scenarios
 
@@ -98,6 +111,18 @@ These examples make the classification differences obvious during judging and vi
 - **OpenEnv** for deterministic validation endpoints and environment compatibility
 - **Pydantic** for typed schemas
 
+## Features
+
+- PyTorch-powered code quality inference
+- Static analysis for syntax, lint, and complexity
+- Context-window-aware review flow
+- RL-ready reward shaping
+- Live Triage Radar visualization
+- Three-step improvement plan:
+  1. syntax checking and bug fixes
+  2. edge-case handling
+  3. scalability improvements
+
 ## Hugging Face Space UX
 
 The root app now presents a production-style triage experience:
@@ -105,8 +130,10 @@ The root app now presents a production-style triage experience:
 - a clear problem/solution hero section
 - example scenario selector
 - code and traceback inputs
+- context window input
 - **Live Triage Radar**
-- structured fix plan
+- structured improvement plan
+- reward and quality score display
 - visible model/backend notes
 
 The underlying OpenEnv endpoints remain available for compatibility and evaluation.
@@ -209,7 +236,8 @@ Short version:
 3. Show the Live Triage Radar and issue label.
 4. Explain the PyTorch embedding step.
 5. Show the matched pattern and fix plan.
-6. Switch to the performance example to prove the model distinguishes issue classes.
+6. Show the reward score and explain how it can be used inside an RL environment.
+7. Switch to the performance example to prove the model distinguishes issue classes.
 
 ## Limitations
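The reward formula in the README reduces to a few lines of Python. A minimal sketch, assuming the three inputs are already normalized to [0, 1] the way the commit's `_clamp_unit` helper enforces (the function name here is illustrative, not part of the repo):

```python
def compute_reward(ml_quality_score: float, lint_score: float, complexity_penalty: float) -> float:
    """Blend model quality, lint quality, and complexity into one RL-ready reward."""
    raw = 0.5 * ml_quality_score + 0.3 * lint_score - 0.2 * complexity_penalty
    # Clamp into [0, 1] so the value stays usable as an RL reward signal.
    return max(0.0, min(1.0, raw))


# A fairly clean, mildly complex sample: 0.5*0.8 + 0.3*0.9 - 0.2*0.3
print(round(compute_reward(0.8, 0.9, 0.3), 2))  # 0.61
```

Because the complexity term is subtracted, the clamp matters: all-zero quality with maximal complexity would otherwise yield a negative reward.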
 
server/demo.py CHANGED
@@ -189,7 +189,7 @@ def _default_outputs() -> tuple[str, str, str, str, str]:
     return (
         "<div class='metric-card'><div class='eyebrow'>Awaiting Analysis</div><p class='hero-copy'>Paste Python code, add an optional traceback, or load one of the built-in examples.</p></div>",
         "<div class='metric-card'><div class='eyebrow'>Live Triage Radar</div><p class='hero-copy'>Confidence bars will appear after the first analysis run.</p></div>",
-        "### Fix Plan\nAnalyze a sample to generate a prioritized remediation checklist.",
+        "### Improvement Plan\nAnalyze a sample to generate syntax, edge-case, and scalability recommendations.",
         "### Known Pattern Match\nThe nearest OpenEnv task will be highlighted here after inference runs.",
         "### Model Notes\nBackend and extracted signal details will appear here.",
     )
@@ -209,19 +209,31 @@ def _summary_html(result) -> str:
         <span class="pill {escape(result.repair_risk)}">{escape(result.repair_risk)} repair risk</span>
         </div>
         <p class="hero-copy">{summary}</p>
         <div class="summary-grid">
         <div class="summary-stat">
-        <strong>Matched Pattern</strong>
-        {escape(result.matched_pattern.title)}
+        <strong>Reward Score</strong>
+        {result.reward_score:.0%}
+        </div>
+        <div class="summary-stat">
+        <strong>ML Quality</strong>
+        {result.ml_quality_score:.0%}
         </div>
         <div class="summary-stat">
-        <strong>Similarity</strong>
-        {result.matched_pattern.similarity:.0%}
+        <strong>Matched Pattern</strong>
+        {escape(result.matched_pattern.title)}
         </div>
         <div class="summary-stat">
         <strong>Inference Backend</strong>
         {escape(result.model_backend)}
         </div>
+        <div class="summary-stat">
+        <strong>Lint Score</strong>
+        {result.lint_score:.0%}
+        </div>
+        <div class="summary-stat">
+        <strong>Complexity Penalty</strong>
+        {result.complexity_penalty:.0%}
+        </div>
         <div class="summary-stat">
         <strong>Next Action</strong>
         {next_action}
@@ -264,7 +276,7 @@ def _radar_html(result) -> str:
 def _plan_markdown(result) -> str:
     plan_lines = "\n".join(f"{index + 1}. {step}" for index, step in enumerate(result.repair_plan))
     return (
-        "### Fix Plan\n"
+        "### Improvement Plan\n"
         f"**Primary issue:** `{result.issue_label}`\n\n"
         f"{plan_lines}\n\n"
         f"**Suggested next action:** {result.suggested_next_action}"
@@ -292,6 +304,9 @@ def _model_markdown(result) -> str:
         f"- **Model backend:** `{result.model_backend}`\n"
         f"- **Model id:** `{result.model_id}`\n"
         f"- **Analysis time:** `{result.analysis_time_ms:.2f} ms`\n\n"
+        "### Reward Formula\n"
+        f"- `reward = (0.5 x {result.ml_quality_score:.2f}) + (0.3 x {result.lint_score:.2f}) - (0.2 x {result.complexity_penalty:.2f})`\n"
+        f"- **Final reward:** `{result.reward_score:.2f}`\n\n"
         "### Extracted Signals\n"
         f"{signal_lines}\n\n"
         "### Backend Notes\n"
@@ -299,10 +314,10 @@ def _model_markdown(result) -> str:
     )
 
 
-def analyze_inputs(code: str, traceback_text: str) -> tuple[str, str, str, str, str]:
+def analyze_inputs(code: str, traceback_text: str, context_window: str) -> tuple[str, str, str, str, str]:
     """Run the triage engine and format outputs for the Gradio UI."""
 
-    result = get_default_engine().triage(code or "", traceback_text or "")
+    result = get_default_engine().triage(code or "", traceback_text or "", context_window or "")
     return (
         _summary_html(result),
         _radar_html(result),
@@ -312,18 +327,18 @@ def analyze_inputs(code: str, traceback_text: str) -> tuple[str, str, str, str,
     )
 
 
-def load_example(example_key: str) -> tuple[str, str, str, str, str, str, str, str]:
+def load_example(example_key: str) -> tuple[str, str, str, str, str, str, str, str, str]:
     """Populate the UI from a built-in example and immediately analyze it."""
 
     example = get_default_engine().example_map()[example_key]
-    outputs = analyze_inputs(example.code, example.traceback_text)
+    outputs = analyze_inputs(example.code, example.traceback_text, example.context_window)
     header = (
         f"### Example Scenario\n"
         f"**{example.title}** \n"
         f"{example.summary} \n"
         f"Label target: `{example.label}`"
     )
-    return (example.code, example.traceback_text, header, *outputs)
+    return (example.code, example.traceback_text, example.context_window, header, *outputs)
 
 
 def build_demo() -> gr.Blocks:
@@ -339,8 +354,8 @@ def build_demo() -> gr.Blocks:
     <div class="eyebrow">Meta PyTorch OpenEnv Hackathon Demo</div>
     <h1 class="hero-title">TorchReview Copilot</h1>
     <p class="hero-copy">
-    AI-powered Python code triage using PyTorch to classify issue type, estimate repair risk,
-    and turn messy failure output into an actionable fix plan. OpenEnv stays underneath as the deterministic validation engine.
+    AI-powered code review and improvement system using PyTorch to score code quality, surface bugs,
+    and generate a three-step improvement plan. OpenEnv stays underneath as the deterministic validation engine.
     </p>
     </div>
     """
@@ -367,8 +382,14 @@ def build_demo() -> gr.Blocks:
         label="Optional traceback / failing test output",
         placeholder="Paste stack traces, assertion failures, or benchmark notes here.",
     )
+    context_input = gr.Textbox(
+        value=first_example.context_window,
+        lines=4,
+        label="Context window",
+        placeholder="Describe expected behavior, constraints, or repository context.",
+    )
     with gr.Row():
-        analyze_button = gr.Button("Analyze With PyTorch", variant="primary")
+        analyze_button = gr.Button("Analyze & Score Code", variant="primary")
         clear_button = gr.Button("Clear Inputs", variant="secondary")
 
     with gr.Column(scale=5):
@@ -384,9 +405,9 @@ def build_demo() -> gr.Blocks:
     <div class="eyebrow">How It Works</div>
     <div class="how-grid">
     <div class="how-step"><strong>Input</strong><br>Code plus optional traceback or benchmark signal.</div>
-    <div class="how-step"><strong>Processing</strong><br>Static checks extract parser, assertion, and runtime clues.</div>
-    <div class="how-step"><strong>Model</strong><br>CodeBERTa embeddings run through PyTorch and compare against known OpenEnv task patterns.</div>
-    <div class="how-step"><strong>Output</strong><br>Confidence radar, nearest known issue, repair risk, and a practical remediation plan.</div>
+    <div class="how-step"><strong>Processing</strong><br>Static checks extract parser, lint, complexity, and runtime clues.</div>
+    <div class="how-step"><strong>Model</strong><br>CodeBERTa embeddings run through PyTorch and score code quality against known OpenEnv patterns.</div>
+    <div class="how-step"><strong>Output</strong><br>Confidence radar, reward score, and a three-step improvement plan.</div>
     </div>
     </div>
     """
@@ -395,25 +416,25 @@ def build_demo() -> gr.Blocks:
     example_choice.change(
         fn=load_example,
         inputs=example_choice,
-        outputs=[code_input, traceback_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
+        outputs=[code_input, traceback_input, context_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="hidden",
     )
     analyze_button.click(
         fn=analyze_inputs,
-        inputs=[code_input, traceback_input],
+        inputs=[code_input, traceback_input, context_input],
         outputs=[summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="minimal",
     )
     clear_button.click(
-        fn=lambda: ("", "", "### Example Scenario\nChoose a built-in example or paste custom code.", *_default_outputs()),
+        fn=lambda: ("", "", "", "### Example Scenario\nChoose a built-in example or paste custom code.", *_default_outputs()),
         inputs=None,
-        outputs=[code_input, traceback_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
+        outputs=[code_input, traceback_input, context_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="hidden",
    )
     demo.load(
         fn=load_example,
         inputs=example_choice,
-        outputs=[code_input, traceback_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
+        outputs=[code_input, traceback_input, context_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="hidden",
     )
tests/test_triage_pipeline.py CHANGED
@@ -20,18 +20,20 @@ def test_examples_map_to_expected_labels_with_fallback_backend() -> None:
     engine = CodeTriageEngine(backend=HashingEmbeddingBackend())
 
     for example in examples:
-        result = engine.triage(example.code, example.traceback_text)
+        result = engine.triage(example.code, example.traceback_text, example.context_window)
         assert result.issue_label == example.label
+        assert 0.0 <= result.reward_score <= 1.0
 
 
 def test_syntax_example_exposes_parser_signal() -> None:
     example = next(item for item in build_examples() if item.label == "syntax")
     engine = CodeTriageEngine(backend=HashingEmbeddingBackend())
 
-    result = engine.triage(example.code, example.traceback_text)
+    result = engine.triage(example.code, example.traceback_text, example.context_window)
 
     assert any(signal.name == "syntax_parse" and signal.value == "fails" for signal in result.extracted_signals)
     assert result.matched_pattern.task_id == example.task_id
+    assert result.repair_plan[0].startswith("Step 1 - Syntax checking and bug fixes")
 
 
 def test_composed_app_preserves_health_route() -> None:
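The new `0.0 <= reward_score <= 1.0` assertion holds because the engine maps cosine similarity, which lives in [-1, 1], into the unit interval with `(score + 1) / 2` before clamping (see `_reference_quality_score` in `triage.py` below). A minimal plain-Python sketch of that normalization, with a hand-rolled `cosine` helper standing in for the torch matmul:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similarity_to_unit(score: float) -> float:
    """Map a cosine similarity in [-1, 1] to a quality score in [0, 1]."""
    return max(0.0, min(1.0, (score + 1.0) / 2.0))


# Opposite vectors give similarity -1, which maps to quality 0.
print(similarity_to_unit(cosine([1.0, 0.0], [-1.0, 0.0])))  # 0.0
# Identical vectors give similarity ~1, which maps to quality ~1.
print(similarity_to_unit(cosine([1.0, 2.0], [1.0, 2.0])))
```

The outer clamp also guards against floating-point drift nudging a similarity just past 1.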
triage.py CHANGED
@@ -181,6 +181,43 @@ def _repair_risk(label: IssueLabel, confidence: float, signal_count: int) -> str
181
  return "high"
182
 
183
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
184
  class CodeTriageEngine:
185
  """Combine static signals with PyTorch embeddings to classify code issues."""
186
 
@@ -195,6 +232,7 @@ class CodeTriageEngine:
195
  self.prototypes = list(prototypes or build_prototypes())
196
  self.examples = list(examples or build_examples())
197
  self._prototype_matrix: torch.Tensor | None = None
 
198
 
199
  def example_map(self) -> dict[str, TriageExample]:
200
  """Return UI examples keyed by task id."""
@@ -206,12 +244,25 @@ class CodeTriageEngine:
206
  snippet = _sanitize_text(code) or "# No code supplied."
207
  return f"Candidate code:\n{snippet}\n\nObserved failure:\n{trace}\n"
208
 
 
 
 
 
 
 
 
209
  def _prototype_embeddings(self) -> torch.Tensor:
210
  if self._prototype_matrix is None:
211
  reference_texts = [prototype.reference_text for prototype in self.prototypes]
212
  self._prototype_matrix = self.backend.embed_texts(reference_texts)
213
  return self._prototype_matrix
214
 
 
 
 
 
 
 
215
  def _extract_signals(self, code: str, traceback_text: str) -> tuple[list[TriageSignal], dict[IssueLabel, float], list[str]]:
216
  trace = (traceback_text or "").lower()
217
  heuristic_scores: dict[IssueLabel, float] = {label: 0.15 for label in LABELS}
@@ -321,31 +372,37 @@ class CodeTriageEngine:
321
  best_similarity = float((similarities[best_index] + 1.0) / 2.0)
322
  return best_prototype, best_similarity, indexed_scores
323
 
324
- def _repair_plan(self, label: IssueLabel, matched: TriagePrototype) -> list[str]:
325
- plans = {
326
- "syntax": [
327
- "Patch the parser break first: missing colon, bracket, or indentation before changing logic.",
328
- f"Realign the implementation with the known-good pattern from `{matched.title}`.",
329
- "Re-run the visible checks once the file compiles, then verify hidden edge cases.",
330
- ],
331
- "logic": [
332
- "Reproduce the failing assertion with the smallest public example and inspect state transitions.",
333
- f"Compare boundary handling against the known issue pattern `{matched.title}`.",
334
- "Patch the final state update or branch condition, then rerun correctness checks before submission.",
335
- ],
336
- "performance": [
337
- "Profile the hot path and isolate repeated full-list scans or nested loops.",
338
- f"Refactor toward counting or indexing strategies similar to `{matched.title}`.",
339
- "Benchmark the new implementation on a production-like fixture and confirm output stability.",
340
- ],
341
- }
342
- return plans[label]
343
-
344
- def triage(self, code: str, traceback_text: str = "") -> TriageResult:
 
 
 
 
 
 
345
  """Run the full triage pipeline on code plus optional failure context."""
346
 
347
  started = time.perf_counter()
348
- document = self._build_document(code, traceback_text)
349
  signals, heuristic_scores, notes = self._extract_signals(code, traceback_text)
350
 
351
  candidate_embedding = self.backend.embed_texts([document])
@@ -367,9 +424,14 @@ class CodeTriageEngine:
367
  top_confidence = confidence_scores[issue_label]
368
 
369
  top_signal = signals[0].evidence if signals else "Model similarity dominated the decision."
 
 
 
 
370
  summary = (
371
  f"Detected a {issue_label} issue with {top_confidence:.0%} confidence. "
372
- f"The closest known failure pattern is `{matched.title}`, which indicates {matched.summary.lower()}"
 
373
  )
374
  suggested_next_action = {
375
  "syntax": "Fix the parser error first, then rerun validation before changing behavior.",
@@ -381,6 +443,10 @@ class CodeTriageEngine:
381
  issue_label=issue_label,
382
  confidence_scores=confidence_scores,
383
  repair_risk=_repair_risk(issue_label, top_confidence, len(signals)),
 
 
 
 
384
  summary=summary,
385
  matched_pattern=PrototypeMatch(
386
  task_id=matched.task_id,
@@ -390,7 +456,7 @@ class CodeTriageEngine:
390
  summary=matched.summary,
391
  rationale=top_signal,
392
  ),
393
- repair_plan=self._repair_plan(issue_label, matched),
394
  suggested_next_action=suggested_next_action,
395
  extracted_signals=signals,
396
  model_backend=self.backend.backend_name,
 
181
  return "high"
182
 
183
 
184
+ def _clamp_unit(value: float) -> float:
185
+ return round(max(0.0, min(1.0, float(value))), 4)
186
+
187
+
188
+ def _lint_score(code: str) -> float:
189
+ stripped_lines = [line.rstrip("\n") for line in code.splitlines()]
190
+ if not stripped_lines:
191
+ return 0.2
192
+
193
+ score = 1.0
194
+ if any(len(line) > 88 for line in stripped_lines):
195
+ score -= 0.15
196
+ if any(line.rstrip() != line for line in stripped_lines):
197
+ score -= 0.1
198
+ if any("\t" in line for line in stripped_lines):
199
+ score -= 0.1
200
+ try:
201
+ tree = ast.parse(code)
202
+ functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
203
+ if functions and not ast.get_docstring(functions[0]):
204
+ score -= 0.08
205
+ except SyntaxError:
206
+ score -= 0.45
207
+ return _clamp_unit(score)
208
+
209
+
210
+ def _complexity_penalty(code: str) -> float:
211
+ try:
212
+ tree = ast.parse(code)
213
+ except SyntaxError:
214
+ return 0.95
215
+ branch_nodes = sum(isinstance(node, (ast.If, ast.For, ast.While, ast.Try, ast.Match)) for node in ast.walk(tree))
216
+ loop_depth = _loop_depth(code)
217
+ penalty = 0.1 + min(branch_nodes, 8) * 0.07 + min(loop_depth, 4) * 0.12
218
+ return _clamp_unit(penalty)
219
+
220
+
221
  class CodeTriageEngine:
222
  """Combine static signals with PyTorch embeddings to classify code issues."""
223
 
 
232
  self.prototypes = list(prototypes or build_prototypes())
233
  self.examples = list(examples or build_examples())
234
  self._prototype_matrix: torch.Tensor | None = None
235
+ self._reference_code_matrix: torch.Tensor | None = None
236
 
237
  def example_map(self) -> dict[str, TriageExample]:
238
  """Return UI examples keyed by task id."""
 
244
  snippet = _sanitize_text(code) or "# No code supplied."
245
  return f"Candidate code:\n{snippet}\n\nObserved failure:\n{trace}\n"
246
 
247
+ def _build_review_document(self, code: str, traceback_text: str, context_window: str) -> str:
248
+ context = _sanitize_text(context_window) or "No additional context window supplied."
249
+ return (
250
+ f"{self._build_document(code, traceback_text)}\n"
251
+ f"Context window:\n{context}\n"
252
+ )
253
+
254
  def _prototype_embeddings(self) -> torch.Tensor:
255
  if self._prototype_matrix is None:
256
  reference_texts = [prototype.reference_text for prototype in self.prototypes]
257
  self._prototype_matrix = self.backend.embed_texts(reference_texts)
258
  return self._prototype_matrix
259
 
260
+ def _reference_code_embeddings(self) -> torch.Tensor:
261
+ if self._reference_code_matrix is None:
262
+ reference_codes = [prototype.reference_code for prototype in self.prototypes]
263
+ self._reference_code_matrix = self.backend.embed_texts(reference_codes)
264
+ return self._reference_code_matrix
265
+
266
  def _extract_signals(self, code: str, traceback_text: str) -> tuple[list[TriageSignal], dict[IssueLabel, float], list[str]]:
267
  trace = (traceback_text or "").lower()
268
  heuristic_scores: dict[IssueLabel, float] = {label: 0.15 for label in LABELS}
 
372
  best_similarity = float((similarities[best_index] + 1.0) / 2.0)
373
  return best_prototype, best_similarity, indexed_scores
374
 
375
+ def _repair_plan(self, label: IssueLabel, matched: TriagePrototype, context_window: str) -> list[str]:
376
+ context = _sanitize_text(context_window)
377
+ step_one = {
378
+ "syntax": "Step 1 - Syntax checking and bug fixes: resolve the parser break before touching behavior, then align the function with the expected contract.",
379
+ "logic": "Step 1 - Syntax checking and bug fixes: confirm the code parses cleanly, then patch the failing branch or state update causing the incorrect result.",
380
+ "performance": "Step 1 - Syntax checking and bug fixes: keep the implementation correct first, then isolate the slow section without changing external behavior.",
381
+ }[label]
382
+ step_two = (
383
+ "Step 2 - Edge case handling: verify empty input, boundary values, missing fields, and final-state flush behavior "
384
+ f"against the known pattern `{matched.title}`."
385
+ )
386
+ step_three = (
387
+ "Step 3 - Scalability of code: remove repeated full scans, prefer linear-time data structures, "
388
+ "and benchmark the path on a production-like fixture."
389
+ )
390
+ if context:
391
+ step_two = f"{step_two} Context window to preserve: {context}"
392
+ return [step_one, step_two, step_three]
393
+
394
+ def _reference_quality_score(self, code: str, matched: TriagePrototype) -> float:
395
+ candidate = self.backend.embed_texts([_sanitize_text(code) or "# empty"])
396
+ match_index = next(index for index, prototype in enumerate(self.prototypes) if prototype.task_id == matched.task_id)
397
+ reference = self._reference_code_embeddings()[match_index : match_index + 1]
398
+ score = float(torch.matmul(candidate, reference.T)[0][0].item())
399
+ return _clamp_unit((score + 1.0) / 2.0)
400
+
401
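The `_reference_quality_score` method above takes the dot product of the candidate embedding and the matched prototype's reference-code embedding, then rescales the result from [-1, 1] into [0, 1]. A dependency-free sketch of that scoring step (assuming, as the rescaling implies, that `embed_texts` returns unit-normalized vectors; the helper names below are illustrative):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quality_from_similarity(similarity: float) -> float:
    """Map a similarity from [-1, 1] onto the [0, 1] quality scale,
    mirroring the (score + 1.0) / 2.0 step in _reference_quality_score."""
    return max(0.0, min(1.0, (similarity + 1.0) / 2.0))
```

For unit-normalized embeddings the matrix product in the diff reduces to exactly this cosine, so identical code and reference snippets score 1.0 on the quality scale.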
+    def triage(self, code: str, traceback_text: str = "", context_window: str = "") -> TriageResult:
         """Run the full triage pipeline on code plus optional failure context."""

         started = time.perf_counter()
+        document = self._build_review_document(code, traceback_text, context_window)
         signals, heuristic_scores, notes = self._extract_signals(code, traceback_text)

         candidate_embedding = self.backend.embed_texts([document])

         top_confidence = confidence_scores[issue_label]

         top_signal = signals[0].evidence if signals else "Model similarity dominated the decision."
+        ml_quality_score = self._reference_quality_score(code, matched)
+        lint_score = _lint_score(code)
+        complexity_penalty = _complexity_penalty(code)
+        reward_score = _clamp_unit((0.5 * ml_quality_score) + (0.3 * lint_score) - (0.2 * complexity_penalty))
         summary = (
             f"Detected a {issue_label} issue with {top_confidence:.0%} confidence. "
+            f"The closest known failure pattern is `{matched.title}`, which indicates {matched.summary.lower()}. "
+            f"Predicted quality score is {ml_quality_score:.0%} with an RL-ready reward of {reward_score:.0%}."
         )
         suggested_next_action = {
             "syntax": "Fix the parser error first, then rerun validation before changing behavior.",

             issue_label=issue_label,
             confidence_scores=confidence_scores,
             repair_risk=_repair_risk(issue_label, top_confidence, len(signals)),
+            ml_quality_score=ml_quality_score,
+            lint_score=lint_score,
+            complexity_penalty=complexity_penalty,
+            reward_score=reward_score,
             summary=summary,
             matched_pattern=PrototypeMatch(
                 task_id=matched.task_id,

                 summary=matched.summary,
                 rationale=top_signal,
             ),
+            repair_plan=self._repair_plan(issue_label, matched, context_window),
             suggested_next_action=suggested_next_action,
             extracted_signals=signals,
             model_backend=self.backend.backend_name,
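The `reward_score` line in `triage` blends the three bounded signals with fixed weights (0.5 for embedding quality, 0.3 for lint cleanliness, minus 0.2 for structural complexity) and clamps the result to [0, 1]. A standalone sketch of that arithmetic (`clamp_unit` here stands in for the module's `_clamp_unit` helper):

```python
def clamp_unit(value: float) -> float:
    """Clamp a score into [0.0, 1.0]; stand-in for the module's _clamp_unit."""
    return max(0.0, min(1.0, value))


def reward_score(ml_quality: float, lint: float, complexity_penalty: float) -> float:
    """Weighted blend from the diff: quality and lint add to the reward,
    structural complexity subtracts from it."""
    return clamp_unit((0.5 * ml_quality) + (0.3 * lint) - (0.2 * complexity_penalty))
```

With these weights a perfectly matching, lint-clean snippet with zero complexity penalty tops out at 0.8, which leaves headroom for future reward terms.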
triage_catalog.py CHANGED
@@ -44,6 +44,21 @@ SUMMARY_BY_TASK_ID: Dict[str, str] = {
     "optimization_rank_active_users": "A nightly ranking job is correct on small fixtures but too slow at production scale.",
 }

+CONTEXT_BY_TASK_ID: Dict[str, str] = {
+    "syntax_fix_invoice_totals": (
+        "Context window: this helper runs in an end-of-day billing reconciliation job. "
+        "Keep the public function signature intact and restore correct totals for mixed integer/string inputs."
+    ),
+    "bug_fix_session_windows": (
+        "Context window: this function groups sorted product analytics events into sessions for retention dashboards. "
+        "Boundary behavior must stay deterministic because downstream reports depend on it."
+    ),
+    "optimization_rank_active_users": (
+        "Context window: this pipeline feeds a nightly export on a small CPU instance. "
+        "Maintain identical output ordering while improving scalability on larger event volumes."
+    ),
+}
+

 def _prototype_text(
     task_id: str,
@@ -82,6 +97,7 @@ def build_examples() -> List[TriageExample]:
                 summary=SUMMARY_BY_TASK_ID[task.task_id],
                 code=task.starter_code,
                 traceback_text=TRACEBACK_BY_TASK_ID[task.task_id],
+                context_window=CONTEXT_BY_TASK_ID[task.task_id],
                 task_id=task.task_id,
             )
         )
@@ -111,6 +127,7 @@ def build_prototypes() -> List[TriagePrototype]:
                 traceback_text,
             ),
             starter_code=task.starter_code,
+            reference_code=task.reference_code,
             traceback_text=traceback_text,
         )
     )
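Each `CONTEXT_BY_TASK_ID` entry above is appended to the text document that gets embedded for a task. A minimal sketch of that assembly, following the layout used by `_build_document` and `_build_review_document` (the fallback wording for an empty traceback is illustrative, not taken from the codebase):

```python
def build_review_document(code: str, traceback_text: str, context_window: str) -> str:
    """Assemble the text to embed: candidate code, then failure, then context."""
    snippet = code.strip() or "# No code supplied."
    trace = traceback_text.strip() or "No traceback supplied."  # fallback is illustrative
    context = context_window.strip() or "No additional context window supplied."
    return (
        f"Candidate code:\n{snippet}\n\n"
        f"Observed failure:\n{trace}\n"
        f"Context window:\n{context}\n"
    )
```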
triage_models.py CHANGED
@@ -41,6 +41,7 @@ class TriageExample(BaseModel):
     summary: str
     code: str
     traceback_text: str
+    context_window: str
     task_id: str


@@ -53,6 +54,7 @@ class TriagePrototype(BaseModel):
     summary: str
     reference_text: str
     starter_code: str
+    reference_code: str
     traceback_text: str


@@ -62,6 +64,10 @@ class TriageResult(BaseModel):
     issue_label: IssueLabel
     confidence_scores: Dict[str, float]
     repair_risk: RiskLevel
+    ml_quality_score: float = Field(..., ge=0.0, le=1.0)
+    lint_score: float = Field(..., ge=0.0, le=1.0)
+    complexity_penalty: float = Field(..., ge=0.0, le=1.0)
+    reward_score: float = Field(..., ge=0.0, le=1.0)
     summary: str
     matched_pattern: PrototypeMatch
     repair_plan: List[str]
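The four new `TriageResult` fields rely on pydantic's `Field(..., ge=0.0, le=1.0)` to reject out-of-range scores at construction time. A dependency-free sketch of the same bounds check using a stdlib dataclass (`ScoreBundle` is a hypothetical stand-in, not part of the codebase):

```python
from dataclasses import dataclass, fields


@dataclass
class ScoreBundle:
    """Hypothetical stand-in for TriageResult's bounded score fields."""
    ml_quality_score: float
    lint_score: float
    complexity_penalty: float
    reward_score: float

    def __post_init__(self) -> None:
        # Reject any score outside [0.0, 1.0], mirroring Field(..., ge=0.0, le=1.0).
        for field in fields(self):
            value = getattr(self, field.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field.name} must be in [0.0, 1.0], got {value}")
```

Validating at the model boundary keeps every consumer of the result, including a future RL training loop reading `reward_score`, free of defensive range checks.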