uvpatel7271 committed on
Commit 7c8fa1c · 1 Parent(s): 737f100

Add reward scoring and context-aware code review flow

DEMO_SCRIPT.md CHANGED
@@ -2,11 +2,11 @@
 
 ## 60-90 Second Walkthrough
 
-1. Open the Hugging Face Space and introduce TorchReview Copilot as an AI-powered Python triage assistant built with PyTorch.
-2. Point to the single-sentence problem statement: teams lose time figuring out whether a failure is syntax, logic, or performance related.
-3. Select the `Fix the invoice total syntax regression` example to show the app loading a real broken code sample.
-4. Highlight the **Live Triage Radar** updating immediately, then call out the predicted issue class and repair risk.
-5. Explain that the PyTorch layer uses CodeBERTa embeddings to compare the input against known bug patterns from the OpenEnv task catalog.
-6. Scroll to the repair plan and note that the output is not just a label; it gives a prioritized remediation checklist and the nearest known failure pattern.
-7. Switch to the performance example to show the confidence profile change and emphasize that the system can distinguish runtime bottlenecks from correctness bugs.
-8. Close by noting that OpenEnv still powers deterministic validation under the hood, so the demo stays grounded in measurable task outcomes.
+1. Open the Hugging Face Space and introduce TorchReview Copilot as an AI-powered code review and improvement system built with PyTorch.
+2. Point to the problem statement: manual code review is slow, inconsistent, and hard to scale.
+3. Select the `Fix the invoice total syntax regression` example to show the app loading a broken code sample together with the context window.
+4. Highlight the **Live Triage Radar**, the ML quality score, and the RL-ready reward score.
+5. Explain that the PyTorch layer uses CodeBERTa embeddings to compare the input against known code-quality patterns from the OpenEnv task catalog.
+6. Scroll to the three-step improvement plan and call out the progression: syntax and bug fixes, edge cases, then scalability.
+7. Switch to the performance example to show the confidence profile and reward changing for a different class of issue.
+8. Close by noting that OpenEnv still powers deterministic validation under the hood, so the demo remains grounded in measurable task outcomes.
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 title: TorchReview Copilot
-emoji: torch
+emoji: 🧠
 colorFrom: orange
 colorTo: red
 sdk: docker
@@ -16,7 +16,7 @@ tags:
 
 # TorchReview Copilot
 
-TorchReview Copilot is an **AI-powered Python code triage system using PyTorch** to classify issue type, estimate repair risk, and generate an actionable remediation plan from broken code plus failure output.
+TorchReview Copilot is an **AI-powered code review and improvement system using PyTorch** to analyze Python code, predict quality, generate structured improvement suggestions, and compute an RL-ready reward score.
 
 It upgrades the original OpenEnv hackathon environment into a judge-friendly product demo: a polished Hugging Face Space on top, with the deterministic OpenEnv validation engine still preserved underneath.
 
@@ -35,13 +35,14 @@ That triage step is repetitive, error-prone, and often slows down the actual fix
 
 ## Solution
 
-TorchReview Copilot turns code plus traceback text into a practical triage report:
+TorchReview Copilot turns code, traceback text, and a short context window into a practical code-review report:
 
 - **Issue classification:** syntax, logic, or performance
-- **Repair risk:** low, medium, or high
+- **ML quality score:** predicted code quality from PyTorch embeddings
+- **Reward score:** RL-ready score from model quality, lint quality, and complexity penalty
 - **Live Triage Radar:** confidence visualization for all issue classes
 - **Nearest known pattern:** the closest OpenEnv task match
-- **Fix plan:** prioritized remediation steps for the engineer
+- **Improvement plan:** step 1 syntax/bug fixes, step 2 edge cases, step 3 scalability
 
 The result is a demo that feels like a real AI debugging assistant rather than a backend-only environment.
 
@@ -54,13 +55,13 @@ This project uses **PyTorch for real inference**, not placeholder branching:
 - embeddings are compared against curated OpenEnv issue prototypes
 - the final decision blends model similarity with lightweight static analysis signals
 
-That gives the demo an actual model-backed classification path while keeping it CPU-friendly for Hugging Face Spaces.
+That gives the demo an actual model-backed quality and issue scoring path while keeping it CPU-friendly for Hugging Face Spaces.
 
 ## How It Works
 
 ### Pipeline
 
-`Input code + traceback -> static checks -> PyTorch embeddings -> similarity against issue prototypes -> confidence scores -> repair plan`
+`Input code + context window + traceback -> static checks -> PyTorch embeddings -> quality + issue prediction -> suggestion engine -> reward computation -> UI/API output`
 
 ### Detailed Flow
 
@@ -68,16 +69,28 @@ That gives the demo an actual model-backed classification path while keeping it
 2. TorchReview extracts lightweight static signals:
    - parser success/failure
    - assertion-style test language
-   - performance keywords
-   - nested-loop depth
+   - lint/style issues
+   - nested-loop depth and complexity pressure
 3. CodeBERTa runs through PyTorch to embed the combined input.
-4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog.
+4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog and reference implementations.
 5. The UI returns:
    - top issue label
   - confidence radar
    - repair risk
+   - ML quality score
+   - RL-ready reward score
    - nearest known bug pattern
-   - suggested next action
+   - three-step improvement plan
+
+### Reward Formula
+
+The current reward computation is:
+
+```text
+reward = (0.5 x ML_quality_score) + (0.3 x lint_score) - (0.2 x complexity_penalty)
+```
+
+This keeps the project compatible with OpenEnv-style reinforcement learning workflows.
 
 ## Built-In Demo Scenarios
 
@@ -98,6 +111,18 @@ These examples make the classification differences obvious during judging and vi
 - **OpenEnv** for deterministic validation endpoints and environment compatibility
 - **Pydantic** for typed schemas
 
+## Features
+
+- PyTorch-powered code quality inference
+- Static analysis for syntax, lint, and complexity
+- Context-window-aware review flow
+- RL-ready reward shaping
+- Live Triage Radar visualization
+- Three-step improvement plan:
+  1. syntax checking and bug fixes
+  2. edge-case handling
+  3. scalability improvements
+
 ## Hugging Face Space UX
 
 The root app now presents a production-style triage experience:
@@ -105,8 +130,10 @@ The root app now presents a production-style triage experience:
 - a clear problem/solution hero section
 - example scenario selector
 - code and traceback inputs
+- context window input
 - **Live Triage Radar**
-- structured fix plan
+- structured improvement plan
+- reward and quality score display
 - visible model/backend notes
 
 The underlying OpenEnv endpoints remain available for compatibility and evaluation.
@@ -209,7 +236,8 @@ Short version:
 3. Show the Live Triage Radar and issue label.
 4. Explain the PyTorch embedding step.
 5. Show the matched pattern and fix plan.
-6. Switch to the performance example to prove the model distinguishes issue classes.
+6. Show the reward score and explain how it can be used inside an RL environment.
+7. Switch to the performance example to prove the model distinguishes issue classes.
 
 ## Limitations
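The reward formula in the README reduces to a few lines of Python. A minimal sketch, assuming the three inputs are already normalized to [0, 1] the way the commit's `_clamp_unit` helper enforces (the function name here is illustrative, not part of the repo):

```python
def compute_reward(ml_quality_score: float, lint_score: float, complexity_penalty: float) -> float:
    """Blend model quality, lint quality, and complexity into one RL-ready reward."""
    raw = 0.5 * ml_quality_score + 0.3 * lint_score - 0.2 * complexity_penalty
    # Clamp into [0, 1] so the value stays usable as an RL reward signal.
    return max(0.0, min(1.0, raw))


# A fairly clean, mildly complex sample: 0.5*0.8 + 0.3*0.9 - 0.2*0.3
print(round(compute_reward(0.8, 0.9, 0.3), 2))  # 0.61
```

Because the complexity term is subtracted, the clamp matters: all-zero quality with maximal complexity would otherwise yield a negative reward.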
 
server/demo.py CHANGED
@@ -189,7 +189,7 @@ def _default_outputs() -> tuple[str, str, str, str, str]:
     return (
         "<div class='metric-card'><div class='eyebrow'>Awaiting Analysis</div><p class='hero-copy'>Paste Python code, add an optional traceback, or load one of the built-in examples.</p></div>",
         "<div class='metric-card'><div class='eyebrow'>Live Triage Radar</div><p class='hero-copy'>Confidence bars will appear after the first analysis run.</p></div>",
-        "### Fix Plan\nAnalyze a sample to generate a prioritized remediation checklist.",
+        "### Improvement Plan\nAnalyze a sample to generate syntax, edge-case, and scalability recommendations.",
         "### Known Pattern Match\nThe nearest OpenEnv task will be highlighted here after inference runs.",
         "### Model Notes\nBackend and extracted signal details will appear here.",
     )
@@ -209,19 +209,31 @@ def _summary_html(result) -> str:
         <span class="pill {escape(result.repair_risk)}">{escape(result.repair_risk)} repair risk</span>
         </div>
         <p class="hero-copy">{summary}</p>
         <div class="summary-grid">
         <div class="summary-stat">
-        <strong>Matched Pattern</strong>
-        {escape(result.matched_pattern.title)}
+        <strong>Reward Score</strong>
+        {result.reward_score:.0%}
+        </div>
+        <div class="summary-stat">
+        <strong>ML Quality</strong>
+        {result.ml_quality_score:.0%}
         </div>
         <div class="summary-stat">
-        <strong>Similarity</strong>
-        {result.matched_pattern.similarity:.0%}
+        <strong>Matched Pattern</strong>
+        {escape(result.matched_pattern.title)}
         </div>
         <div class="summary-stat">
         <strong>Inference Backend</strong>
         {escape(result.model_backend)}
         </div>
+        <div class="summary-stat">
+        <strong>Lint Score</strong>
+        {result.lint_score:.0%}
+        </div>
+        <div class="summary-stat">
+        <strong>Complexity Penalty</strong>
+        {result.complexity_penalty:.0%}
+        </div>
         <div class="summary-stat">
         <strong>Next Action</strong>
         {next_action}
@@ -264,7 +276,7 @@ def _radar_html(result) -> str:
 def _plan_markdown(result) -> str:
     plan_lines = "\n".join(f"{index + 1}. {step}" for index, step in enumerate(result.repair_plan))
     return (
-        "### Fix Plan\n"
+        "### Improvement Plan\n"
         f"**Primary issue:** `{result.issue_label}`\n\n"
         f"{plan_lines}\n\n"
         f"**Suggested next action:** {result.suggested_next_action}"
@@ -292,6 +304,9 @@ def _model_markdown(result) -> str:
         f"- **Model backend:** `{result.model_backend}`\n"
         f"- **Model id:** `{result.model_id}`\n"
         f"- **Analysis time:** `{result.analysis_time_ms:.2f} ms`\n\n"
+        "### Reward Formula\n"
+        f"- `reward = (0.5 x {result.ml_quality_score:.2f}) + (0.3 x {result.lint_score:.2f}) - (0.2 x {result.complexity_penalty:.2f})`\n"
+        f"- **Final reward:** `{result.reward_score:.2f}`\n\n"
         "### Extracted Signals\n"
         f"{signal_lines}\n\n"
         "### Backend Notes\n"
@@ -299,10 +314,10 @@ def _model_markdown(result) -> str:
     )
 
 
-def analyze_inputs(code: str, traceback_text: str) -> tuple[str, str, str, str, str]:
+def analyze_inputs(code: str, traceback_text: str, context_window: str) -> tuple[str, str, str, str, str]:
     """Run the triage engine and format outputs for the Gradio UI."""
 
-    result = get_default_engine().triage(code or "", traceback_text or "")
+    result = get_default_engine().triage(code or "", traceback_text or "", context_window or "")
     return (
         _summary_html(result),
         _radar_html(result),
@@ -312,18 +327,18 @@ def analyze_inputs(code: str, traceback_text: str) -> tuple[str, str, str, str,
     )
 
 
-def load_example(example_key: str) -> tuple[str, str, str, str, str, str, str, str]:
+def load_example(example_key: str) -> tuple[str, str, str, str, str, str, str, str, str]:
     """Populate the UI from a built-in example and immediately analyze it."""
 
     example = get_default_engine().example_map()[example_key]
-    outputs = analyze_inputs(example.code, example.traceback_text)
+    outputs = analyze_inputs(example.code, example.traceback_text, example.context_window)
     header = (
         f"### Example Scenario\n"
         f"**{example.title}** \n"
         f"{example.summary} \n"
         f"Label target: `{example.label}`"
     )
-    return (example.code, example.traceback_text, header, *outputs)
+    return (example.code, example.traceback_text, example.context_window, header, *outputs)
 
 
 def build_demo() -> gr.Blocks:
@@ -339,8 +354,8 @@ def build_demo() -> gr.Blocks:
     <div class="eyebrow">Meta PyTorch OpenEnv Hackathon Demo</div>
     <h1 class="hero-title">TorchReview Copilot</h1>
     <p class="hero-copy">
-    AI-powered Python code triage using PyTorch to classify issue type, estimate repair risk,
-    and turn messy failure output into an actionable fix plan. OpenEnv stays underneath as the deterministic validation engine.
+    AI-powered code review and improvement system using PyTorch to score code quality, surface bugs,
+    and generate a three-step improvement plan. OpenEnv stays underneath as the deterministic validation engine.
     </p>
     </div>
     """
@@ -367,8 +382,14 @@ def build_demo() -> gr.Blocks:
         label="Optional traceback / failing test output",
         placeholder="Paste stack traces, assertion failures, or benchmark notes here.",
     )
+    context_input = gr.Textbox(
+        value=first_example.context_window,
+        lines=4,
+        label="Context window",
+        placeholder="Describe expected behavior, constraints, or repository context.",
+    )
     with gr.Row():
-        analyze_button = gr.Button("Analyze With PyTorch", variant="primary")
+        analyze_button = gr.Button("Analyze & Score Code", variant="primary")
         clear_button = gr.Button("Clear Inputs", variant="secondary")
 
     with gr.Column(scale=5):
@@ -384,9 +405,9 @@ def build_demo() -> gr.Blocks:
     <div class="eyebrow">How It Works</div>
     <div class="how-grid">
     <div class="how-step"><strong>Input</strong><br>Code plus optional traceback or benchmark signal.</div>
-    <div class="how-step"><strong>Processing</strong><br>Static checks extract parser, assertion, and runtime clues.</div>
-    <div class="how-step"><strong>Model</strong><br>CodeBERTa embeddings run through PyTorch and compare against known OpenEnv task patterns.</div>
-    <div class="how-step"><strong>Output</strong><br>Confidence radar, nearest known issue, repair risk, and a practical remediation plan.</div>
+    <div class="how-step"><strong>Processing</strong><br>Static checks extract parser, lint, complexity, and runtime clues.</div>
+    <div class="how-step"><strong>Model</strong><br>CodeBERTa embeddings run through PyTorch and score code quality against known OpenEnv patterns.</div>
+    <div class="how-step"><strong>Output</strong><br>Confidence radar, reward score, and a three-step improvement plan.</div>
     </div>
     </div>
     """
@@ -395,25 +416,25 @@ def build_demo() -> gr.Blocks:
     example_choice.change(
         fn=load_example,
         inputs=example_choice,
-        outputs=[code_input, traceback_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
+        outputs=[code_input, traceback_input, context_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="hidden",
     )
     analyze_button.click(
         fn=analyze_inputs,
-        inputs=[code_input, traceback_input],
+        inputs=[code_input, traceback_input, context_input],
         outputs=[summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="minimal",
     )
     clear_button.click(
-        fn=lambda: ("", "", "### Example Scenario\nChoose a built-in example or paste custom code.", *_default_outputs()),
+        fn=lambda: ("", "", "", "### Example Scenario\nChoose a built-in example or paste custom code.", *_default_outputs()),
         inputs=None,
-        outputs=[code_input, traceback_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
+        outputs=[code_input, traceback_input, context_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="hidden",
    )
     demo.load(
         fn=load_example,
         inputs=example_choice,
-        outputs=[code_input, traceback_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
+        outputs=[code_input, traceback_input, context_input, example_header, summary_html, radar_html, plan_markdown, match_markdown, model_markdown],
         show_progress="hidden",
     )
tests/test_triage_pipeline.py CHANGED
@@ -20,18 +20,20 @@ def test_examples_map_to_expected_labels_with_fallback_backend() -> None:
     engine = CodeTriageEngine(backend=HashingEmbeddingBackend())
 
     for example in examples:
-        result = engine.triage(example.code, example.traceback_text)
+        result = engine.triage(example.code, example.traceback_text, example.context_window)
         assert result.issue_label == example.label
+        assert 0.0 <= result.reward_score <= 1.0
 
 
 def test_syntax_example_exposes_parser_signal() -> None:
     example = next(item for item in build_examples() if item.label == "syntax")
     engine = CodeTriageEngine(backend=HashingEmbeddingBackend())
 
-    result = engine.triage(example.code, example.traceback_text)
+    result = engine.triage(example.code, example.traceback_text, example.context_window)
 
     assert any(signal.name == "syntax_parse" and signal.value == "fails" for signal in result.extracted_signals)
     assert result.matched_pattern.task_id == example.task_id
+    assert result.repair_plan[0].startswith("Step 1 - Syntax checking and bug fixes")
 
 
 def test_composed_app_preserves_health_route() -> None:
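The new `0.0 <= reward_score <= 1.0` assertion holds because the engine maps cosine similarity, which lives in [-1, 1], into the unit interval with `(score + 1) / 2` before clamping (see `_reference_quality_score` in `triage.py` below). A minimal plain-Python sketch of that normalization, with a hand-rolled `cosine` helper standing in for the torch matmul:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similarity_to_unit(score: float) -> float:
    """Map a cosine similarity in [-1, 1] to a quality score in [0, 1]."""
    return max(0.0, min(1.0, (score + 1.0) / 2.0))


# Opposite vectors give similarity -1, which maps to quality 0.
print(similarity_to_unit(cosine([1.0, 0.0], [-1.0, 0.0])))  # 0.0
# Identical vectors give similarity ~1, which maps to quality ~1.
print(similarity_to_unit(cosine([1.0, 2.0], [1.0, 2.0])))
```

The outer clamp also guards against floating-point drift nudging a similarity just past 1.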
triage.py CHANGED
@@ -181,6 +181,43 @@ def _repair_risk(label: IssueLabel, confidence: float, signal_count: int) -> str
181
  return "high"
182
 
183
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
184
  class CodeTriageEngine:
185
  """Combine static signals with PyTorch embeddings to classify code issues."""
186
 
@@ -195,6 +232,7 @@ class CodeTriageEngine:
195
  self.prototypes = list(prototypes or build_prototypes())
196
  self.examples = list(examples or build_examples())
197
  self._prototype_matrix: torch.Tensor | None = None
 
198
 
199
  def example_map(self) -> dict[str, TriageExample]:
200
  """Return UI examples keyed by task id."""
@@ -206,12 +244,25 @@ class CodeTriageEngine:
206
  snippet = _sanitize_text(code) or "# No code supplied."
207
  return f"Candidate code:\n{snippet}\n\nObserved failure:\n{trace}\n"
208
 
 
 
 
 
 
 
 
209
  def _prototype_embeddings(self) -> torch.Tensor:
210
  if self._prototype_matrix is None:
211
  reference_texts = [prototype.reference_text for prototype in self.prototypes]
212
  self._prototype_matrix = self.backend.embed_texts(reference_texts)
213
  return self._prototype_matrix
214
 
 
 
 
 
 
 
215
  def _extract_signals(self, code: str, traceback_text: str) -> tuple[list[TriageSignal], dict[IssueLabel, float], list[str]]:
216
  trace = (traceback_text or "").lower()
217
  heuristic_scores: dict[IssueLabel, float] = {label: 0.15 for label in LABELS}
@@ -321,31 +372,37 @@ class CodeTriageEngine:
321
  best_similarity = float((similarities[best_index] + 1.0) / 2.0)
322
  return best_prototype, best_similarity, indexed_scores
323
 
324
- def _repair_plan(self, label: IssueLabel, matched: TriagePrototype) -> list[str]:
325
- plans = {
326
- "syntax": [
327
- "Patch the parser break first: missing colon, bracket, or indentation before changing logic.",
328
- f"Realign the implementation with the known-good pattern from `{matched.title}`.",
329
- "Re-run the visible checks once the file compiles, then verify hidden edge cases.",
330
- ],
331
- "logic": [
332
- "Reproduce the failing assertion with the smallest public example and inspect state transitions.",
333
- f"Compare boundary handling against the known issue pattern `{matched.title}`.",
334
- "Patch the final state update or branch condition, then rerun correctness checks before submission.",
335
- ],
336
- "performance": [
337
- "Profile the hot path and isolate repeated full-list scans or nested loops.",
338
- f"Refactor toward counting or indexing strategies similar to `{matched.title}`.",
339
- "Benchmark the new implementation on a production-like fixture and confirm output stability.",
340
- ],
341
- }
342
- return plans[label]
343
-
344
- def triage(self, code: str, traceback_text: str = "") -> TriageResult:
 
 
 
 
 
 
345
  """Run the full triage pipeline on code plus optional failure context."""
346
 
347
  started = time.perf_counter()
348
- document = self._build_document(code, traceback_text)
349
  signals, heuristic_scores, notes = self._extract_signals(code, traceback_text)
350
 
351
  candidate_embedding = self.backend.embed_texts([document])
@@ -367,9 +424,14 @@ class CodeTriageEngine:
367
  top_confidence = confidence_scores[issue_label]
368
 
369
  top_signal = signals[0].evidence if signals else "Model similarity dominated the decision."
 
 
 
 
370
  summary = (
371
  f"Detected a {issue_label} issue with {top_confidence:.0%} confidence. "
372
- f"The closest known failure pattern is `{matched.title}`, which indicates {matched.summary.lower()}"
 
373
  )
374
  suggested_next_action = {
375
  "syntax": "Fix the parser error first, then rerun validation before changing behavior.",
@@ -381,6 +443,10 @@ class CodeTriageEngine:
381
  issue_label=issue_label,
382
  confidence_scores=confidence_scores,
383
  repair_risk=_repair_risk(issue_label, top_confidence, len(signals)),
 
 
 
 
384
  summary=summary,
385
  matched_pattern=PrototypeMatch(
386
  task_id=matched.task_id,
@@ -390,7 +456,7 @@ class CodeTriageEngine:
390
  summary=matched.summary,
391
  rationale=top_signal,
392
  ),
393
- repair_plan=self._repair_plan(issue_label, matched),
394
  suggested_next_action=suggested_next_action,
395
  extracted_signals=signals,
396
  model_backend=self.backend.backend_name,
 
181
  return "high"
182
 
183
 
184
+ def _clamp_unit(value: float) -> float:
185
+ return round(max(0.0, min(1.0, float(value))), 4)
186
+
187
+
188
+ def _lint_score(code: str) -> float:
189
+ stripped_lines = [line.rstrip("\n") for line in code.splitlines()]
190
+ if not stripped_lines:
191
+ return 0.2
192
+
193
+ score = 1.0
194
+ if any(len(line) > 88 for line in stripped_lines):
195
+ score -= 0.15
196
+ if any(line.rstrip() != line for line in stripped_lines):
197
+ score -= 0.1
198
+ if any("\t" in line for line in stripped_lines):
199
+ score -= 0.1
200
+ try:
201
+ tree = ast.parse(code)
202
+ functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
203
+ if functions and not ast.get_docstring(functions[0]):
204
+ score -= 0.08
205
+ except SyntaxError:
206
+ score -= 0.45
207
+ return _clamp_unit(score)
208
+
209
+
210
+ def _complexity_penalty(code: str) -> float:
211
+ try:
212
+ tree = ast.parse(code)
213
+ except SyntaxError:
214
+ return 0.95
215
+ branch_nodes = sum(isinstance(node, (ast.If, ast.For, ast.While, ast.Try, ast.Match)) for node in ast.walk(tree))
216
+ loop_depth = _loop_depth(code)
217
+ penalty = 0.1 + min(branch_nodes, 8) * 0.07 + min(loop_depth, 4) * 0.12
218
+ return _clamp_unit(penalty)
219
+
220
+
221
  class CodeTriageEngine:
222
  """Combine static signals with PyTorch embeddings to classify code issues."""
223
 
 
232
  self.prototypes = list(prototypes or build_prototypes())
233
  self.examples = list(examples or build_examples())
234
  self._prototype_matrix: torch.Tensor | None = None
235
+ self._reference_code_matrix: torch.Tensor | None = None
236
 
237
  def example_map(self) -> dict[str, TriageExample]:
238
  """Return UI examples keyed by task id."""
 
244
  snippet = _sanitize_text(code) or "# No code supplied."
245
  return f"Candidate code:\n{snippet}\n\nObserved failure:\n{trace}\n"
246
 
247
+ def _build_review_document(self, code: str, traceback_text: str, context_window: str) -> str:
248
+ context = _sanitize_text(context_window) or "No additional context window supplied."
249
+ return (
250
+ f"{self._build_document(code, traceback_text)}\n"
251
+ f"Context window:\n{context}\n"
252
+ )
253
+
254
  def _prototype_embeddings(self) -> torch.Tensor:
255
  if self._prototype_matrix is None:
256
  reference_texts = [prototype.reference_text for prototype in self.prototypes]
257
  self._prototype_matrix = self.backend.embed_texts(reference_texts)
258
  return self._prototype_matrix
259
 
260
+ def _reference_code_embeddings(self) -> torch.Tensor:
261
+ if self._reference_code_matrix is None:
262
+ reference_codes = [prototype.reference_code for prototype in self.prototypes]
263
+ self._reference_code_matrix = self.backend.embed_texts(reference_codes)
264
+ return self._reference_code_matrix
265
+
266
  def _extract_signals(self, code: str, traceback_text: str) -> tuple[list[TriageSignal], dict[IssueLabel, float], list[str]]:
267
  trace = (traceback_text or "").lower()
268
  heuristic_scores: dict[IssueLabel, float] = {label: 0.15 for label in LABELS}
 
372
  best_similarity = float((similarities[best_index] + 1.0) / 2.0)
373
  return best_prototype, best_similarity, indexed_scores
374
 
375
+ def _repair_plan(self, label: IssueLabel, matched: TriagePrototype, context_window: str) -> list[str]:
376
+ context = _sanitize_text(context_window)
377
+ step_one = {
378
+ "syntax": "Step 1 - Syntax checking and bug fixes: resolve the parser break before touching behavior, then align the function with the expected contract.",
379
+ "logic": "Step 1 - Syntax checking and bug fixes: confirm the code parses cleanly, then patch the failing branch or state update causing the incorrect result.",
380
+ "performance": "Step 1 - Syntax checking and bug fixes: keep the implementation correct first, then isolate the slow section without changing external behavior.",
381
+ }[label]
382
+ step_two = (
383
+ "Step 2 - Edge case handling: verify empty input, boundary values, missing fields, and final-state flush behavior "
384
+ f"against the known pattern `{matched.title}`."
385
+ )
386
+ step_three = (
387
+ "Step 3 - Scalability of code: remove repeated full scans, prefer linear-time data structures, "
388
+ "and benchmark the path on a production-like fixture."
389
+ )
390
+ if context:
391
+ step_two = f"{step_two} Context window to preserve: {context}"
392
+ return [step_one, step_two, step_three]
393
+
394
+ def _reference_quality_score(self, code: str, matched: TriagePrototype) -> float:
395
+ candidate = self.backend.embed_texts([_sanitize_text(code) or "# empty"])
396
+ match_index = next(index for index, prototype in enumerate(self.prototypes) if prototype.task_id == matched.task_id)
397
+ reference = self._reference_code_embeddings()[match_index : match_index + 1]
398
+ score = float(torch.matmul(candidate, reference.T)[0][0].item())
399
+ return _clamp_unit((score + 1.0) / 2.0)
400
+
401
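The `_reference_quality_score` method above takes the dot product of the candidate embedding and the matched prototype's reference-code embedding, then rescales the result from [-1, 1] into [0, 1]. A dependency-free sketch of that scoring step (assuming, as the rescaling implies, that `embed_texts` returns unit-normalized vectors; the helper names below are illustrative):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quality_from_similarity(similarity: float) -> float:
    """Map a similarity from [-1, 1] onto the [0, 1] quality scale,
    mirroring the (score + 1.0) / 2.0 step in _reference_quality_score."""
    return max(0.0, min(1.0, (similarity + 1.0) / 2.0))
```

For unit-normalized embeddings the matrix product in the diff reduces to exactly this cosine, so identical code and reference snippets score 1.0 on the quality scale.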
+    def triage(self, code: str, traceback_text: str = "", context_window: str = "") -> TriageResult:
         """Run the full triage pipeline on code plus optional failure context."""

         started = time.perf_counter()
+        document = self._build_review_document(code, traceback_text, context_window)
         signals, heuristic_scores, notes = self._extract_signals(code, traceback_text)

         candidate_embedding = self.backend.embed_texts([document])

         top_confidence = confidence_scores[issue_label]

         top_signal = signals[0].evidence if signals else "Model similarity dominated the decision."
+        ml_quality_score = self._reference_quality_score(code, matched)
+        lint_score = _lint_score(code)
+        complexity_penalty = _complexity_penalty(code)
+        reward_score = _clamp_unit((0.5 * ml_quality_score) + (0.3 * lint_score) - (0.2 * complexity_penalty))
         summary = (
             f"Detected a {issue_label} issue with {top_confidence:.0%} confidence. "
+            f"The closest known failure pattern is `{matched.title}`, which indicates {matched.summary.lower()}. "
+            f"Predicted quality score is {ml_quality_score:.0%} with an RL-ready reward of {reward_score:.0%}."
         )
         suggested_next_action = {
             "syntax": "Fix the parser error first, then rerun validation before changing behavior.",

             issue_label=issue_label,
             confidence_scores=confidence_scores,
             repair_risk=_repair_risk(issue_label, top_confidence, len(signals)),
+            ml_quality_score=ml_quality_score,
+            lint_score=lint_score,
+            complexity_penalty=complexity_penalty,
+            reward_score=reward_score,
             summary=summary,
             matched_pattern=PrototypeMatch(
                 task_id=matched.task_id,

                 summary=matched.summary,
                 rationale=top_signal,
             ),
+            repair_plan=self._repair_plan(issue_label, matched, context_window),
             suggested_next_action=suggested_next_action,
             extracted_signals=signals,
             model_backend=self.backend.backend_name,
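The `reward_score` line in `triage` blends the three bounded signals with fixed weights (0.5 for embedding quality, 0.3 for lint cleanliness, minus 0.2 for structural complexity) and clamps the result to [0, 1]. A standalone sketch of that arithmetic (`clamp_unit` here stands in for the module's `_clamp_unit` helper):

```python
def clamp_unit(value: float) -> float:
    """Clamp a score into [0.0, 1.0]; stand-in for the module's _clamp_unit."""
    return max(0.0, min(1.0, value))


def reward_score(ml_quality: float, lint: float, complexity_penalty: float) -> float:
    """Weighted blend from the diff: quality and lint add to the reward,
    structural complexity subtracts from it."""
    return clamp_unit((0.5 * ml_quality) + (0.3 * lint) - (0.2 * complexity_penalty))
```

With these weights a perfectly matching, lint-clean snippet with zero complexity penalty tops out at 0.8, which leaves headroom for future reward terms.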
triage_catalog.py CHANGED
@@ -44,6 +44,21 @@ SUMMARY_BY_TASK_ID: Dict[str, str] = {
     "optimization_rank_active_users": "A nightly ranking job is correct on small fixtures but too slow at production scale.",
 }

+CONTEXT_BY_TASK_ID: Dict[str, str] = {
+    "syntax_fix_invoice_totals": (
+        "Context window: this helper runs in an end-of-day billing reconciliation job. "
+        "Keep the public function signature intact and restore correct totals for mixed integer/string inputs."
+    ),
+    "bug_fix_session_windows": (
+        "Context window: this function groups sorted product analytics events into sessions for retention dashboards. "
+        "Boundary behavior must stay deterministic because downstream reports depend on it."
+    ),
+    "optimization_rank_active_users": (
+        "Context window: this pipeline feeds a nightly export on a small CPU instance. "
+        "Maintain identical output ordering while improving scalability on larger event volumes."
+    ),
+}
+

 def _prototype_text(
     task_id: str,
@@ -82,6 +97,7 @@ def build_examples() -> List[TriageExample]:
                 summary=SUMMARY_BY_TASK_ID[task.task_id],
                 code=task.starter_code,
                 traceback_text=TRACEBACK_BY_TASK_ID[task.task_id],
+                context_window=CONTEXT_BY_TASK_ID[task.task_id],
                 task_id=task.task_id,
             )
         )
@@ -111,6 +127,7 @@ def build_prototypes() -> List[TriagePrototype]:
                 traceback_text,
             ),
             starter_code=task.starter_code,
+            reference_code=task.reference_code,
             traceback_text=traceback_text,
         )
     )
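Each `CONTEXT_BY_TASK_ID` entry above is appended to the text document that gets embedded for a task. A minimal sketch of that assembly, following the layout used by `_build_document` and `_build_review_document` (the fallback wording for an empty traceback is illustrative, not taken from the codebase):

```python
def build_review_document(code: str, traceback_text: str, context_window: str) -> str:
    """Assemble the text to embed: candidate code, then failure, then context."""
    snippet = code.strip() or "# No code supplied."
    trace = traceback_text.strip() or "No traceback supplied."  # fallback is illustrative
    context = context_window.strip() or "No additional context window supplied."
    return (
        f"Candidate code:\n{snippet}\n\n"
        f"Observed failure:\n{trace}\n"
        f"Context window:\n{context}\n"
    )
```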
triage_models.py CHANGED
@@ -41,6 +41,7 @@ class TriageExample(BaseModel):
     summary: str
     code: str
     traceback_text: str
+    context_window: str
     task_id: str


@@ -53,6 +54,7 @@ class TriagePrototype(BaseModel):
     summary: str
     reference_text: str
     starter_code: str
+    reference_code: str
     traceback_text: str


@@ -62,6 +64,10 @@ class TriageResult(BaseModel):
     issue_label: IssueLabel
     confidence_scores: Dict[str, float]
     repair_risk: RiskLevel
+    ml_quality_score: float = Field(..., ge=0.0, le=1.0)
+    lint_score: float = Field(..., ge=0.0, le=1.0)
+    complexity_penalty: float = Field(..., ge=0.0, le=1.0)
+    reward_score: float = Field(..., ge=0.0, le=1.0)
     summary: str
     matched_pattern: PrototypeMatch
     repair_plan: List[str]
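The four new `TriageResult` fields rely on pydantic's `Field(..., ge=0.0, le=1.0)` to reject out-of-range scores at construction time. A dependency-free sketch of the same bounds check using a stdlib dataclass (`ScoreBundle` is a hypothetical stand-in, not part of the codebase):

```python
from dataclasses import dataclass, fields


@dataclass
class ScoreBundle:
    """Hypothetical stand-in for TriageResult's bounded score fields."""
    ml_quality_score: float
    lint_score: float
    complexity_penalty: float
    reward_score: float

    def __post_init__(self) -> None:
        # Reject any score outside [0.0, 1.0], mirroring Field(..., ge=0.0, le=1.0).
        for field in fields(self):
            value = getattr(self, field.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field.name} must be in [0.0, 1.0], got {value}")
```

Validating at the model boundary keeps every consumer of the result, including a future RL training loop reading `reward_score`, free of defensive range checks.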