Upload folder using huggingface_hub
README.md (CHANGED)
|
@@ -102,7 +102,7 @@ The environment returns an `APIDebugObservation` with:
 | feedback | string | Structured validation feedback from last action |
 | message | string | Human-readable status |
 | done | bool | Whether the episode has ended |
-| reward | float | Reward signal (0
+| reward | float | Reward signal in open interval (0, 1) |

 ## Reward Design

@@ -137,10 +137,10 @@ Scores from running inference.py against the live HF Space (3 episodes per task,

 | Task | Episodes | Qwen2.5-72B-Instruct | gpt-4o-mini |
 |------|----------|----------------------|-------------|
-| easy | 3 |
-| medium | 3 |
+| easy | 3 | 0.999 | 0.667 |
+| medium | 3 | 0.999 | 0.999 |
 | hard | 3 | 0.780 | 0.760 |
-| **overall** | **9** | **0.
+| **overall** | **9** | **0.926** | **0.809** |

 Hard task uses LLM-as-judge (gpt-4o-mini) for explanation quality scoring, which is stricter than a heuristic baseline. The agent must fix 2-3 simultaneous errors and provide a developer-facing explanation to score high. Larger models perform better on the hard task, showing meaningful difficulty progression.

@@ -211,6 +211,9 @@ api-debug-env/
 ├── client.py                # APIDebugEnv(EnvClient)
 ├── validate-submission.sh   # Pre-submission validator
 ├── __init__.py
+├── tests/
+│   ├── __init__.py
+│   └── test_environment.py  # 79 unit tests
 └── server/
     ├── __init__.py
     ├── app.py               # FastAPI app via create_app()