Nitish committed on
Commit c1316d3 · 1 Parent(s): f44f429

docs: finalize submission checklist and sign-off

Files changed (1)
  1. OPENENV_SUBMISSION_CHECKLIST.md +10 -9
OPENENV_SUBMISSION_CHECKLIST.md CHANGED
@@ -102,6 +102,7 @@ TASK=hard python inference.py # expected: score < 0.8
  - [x] Easy task baseline score is ≥ 0.6.
  - [x] Medium task baseline score is meaningfully lower than easy (at least 0.15 gap).
  - [x] Hard task baseline score is < 0.8 (if it's ≥ 0.8, make it harder).
+ (Easy: 0.883 | Medium: 0.500 | Hard: 0.512)
 
  ---
 
@@ -306,9 +307,9 @@ TASK=hard python inference.py # expected: score < 0.8
 
  | Task | Difficulty | Model | Score | Steps | Notes |
  |------|-----------|-------|-------|-------|-------|
- | python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.68 | 1 | |
- | js-auth-privilege | medium | Llama-3.3-70B-Instruct | 0.70 | 1 | |
- | python-sql-injection | hard | Llama-3.3-70B-Instruct | 0.54 | 1 | |
+ | python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | |
+ | js-idor-auth | medium | Llama-3.3-70B-Instruct | 0.500 | 2 | |
+ | python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | 0.512 | 2 | |
 
  - [x] The table is filled in with real numbers from a completed inference run.
  - [x] The easy task score is ≥ 0.6.
@@ -423,7 +424,7 @@ done
 
  Expected: Three complete runs, each emitting `[START]`, N×`[STEP]`, and `[END]` with no Python exceptions.
 
- - [x] ✓ PASSED — Easy score: 0.68 Medium score: 0.70 Hard score: 0.54
+ - [x] ✓ PASSED — Easy score: 0.883 Medium score: 0.500 Hard score: 0.512
 
  ### Step 5 — Verify log format
 
@@ -514,12 +515,12 @@ When all items above are checked, fill in this block and attach it to your submi
  Environment Name: Code Security Review
  HF Space URL: https://huggingface.co/spaces/inmodel/code-review-env
  Baseline Scores:
- - Easy task: 0.68 (task name: python-off-by-one)
- - Medium task: 0.10 (task name: js-auth-privilege)
- - Hard task: 0.75 (task name: python-sql-injection)
+ - Easy task: 0.883 (task name: python-off-by-one)
+ - Medium task: 0.500 (task name: js-idor-auth)
+ - Hard task: 0.512 (task name: python-pickle-deserialization)
  Inference runtime: < 1 minute
- Docker image size: 250 MB
- Submitted by: NitishKumar
+ Docker image size: ~300 MB
+ Submitted by: Inmodel Labs
  Date: 2026-04-08
 
  I confirm all 18 disqualifying items are checked [yes/no]: yes
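
Review note: the three score criteria quoted in the checklist (easy ≥ 0.6, medium at least 0.15 below easy, hard < 0.8) can be checked mechanically against the numbers in this diff. The sketch below is illustrative only. The function name `check_baseline` and the result keys are hypothetical and not part of the repository; the scores are copied from the table rows in the diff.

```python
def check_baseline(easy: float, medium: float, hard: float) -> dict:
    """Evaluate the three score criteria quoted in the checklist.

    Hypothetical helper for review purposes; not taken from the repository.
    """
    return {
        "easy_ge_0.6": easy >= 0.6,
        # Round to three decimals so float noise cannot flip a boundary case.
        "medium_gap_ge_0.15": round(easy - medium, 3) >= 0.15,
        "hard_lt_0.8": hard < 0.8,
    }


# Scores from the table rows added by this commit.
new_run = check_baseline(easy=0.883, medium=0.500, hard=0.512)

# Scores from the table rows removed by this commit.
old_table = check_baseline(easy=0.68, medium=0.70, hard=0.54)

for label, result in (("new", new_run), ("old", old_table)):
    failed = [name for name, ok in result.items() if not ok]
    print(label, "all criteria met" if not failed else f"failed: {failed}")
```

With these inputs the new scores satisfy all three criteria, while the removed table rows fail the gap criterion because the medium score (0.70) was higher than the easy score (0.68).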