EB-1A Petition Screener — Criterion-Level Evidence Classifier
A fine-tuned DistilBERT classifier that predicts whether a piece of evidence satisfies a specific EB-1A extraordinary ability criterion (8 C.F.R. § 204.5(h)(3)), trained on Administrative Appeals Office (AAO) appellate decisions.
Live demo: agme2019/eb1a-screener-app
⚠️ Research prototype. Not legal advice. Results do not constitute an immigration determination.
Model Performance
Evaluated on a 29-file held-out test set of 2024–2026 AAO decisions (never seen during training). Direct criterion classification mode: 52 criterion examples, 20 positives.
| Metric | Score | 95% CI |
|---|---|---|
| F1 | 88.4% | [76.2%, 97.4%] |
| Recall | 95.0% | [83.3%, 100.0%] |
| Precision | 82.6% | [65.4%, 96.0%] |
| Criterion-level accuracy | 90.4% | [80.8%, 98.1%] |
| Case-level decision accuracy | 100% | exact (21/21) |
Bootstrap 95% CIs: 10,000 resamples at the criterion-example level (n=52). The wide precision interval reflects the small sample: 20 positive examples and 23 predicted positives. Decision accuracy is exact: all 21 cases were classified correctly.
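The resampling procedure can be sketched as follows. This is a minimal illustration, not the evaluation script behind the table: it assumes a percentile bootstrap, rebuilds the 52 examples from the confusion counts implied by the per-criterion breakdown (TP=19, FP=4, FN=1, so TN=28), and uses an arbitrary seed, so the interval it prints need not match the reported one exactly.

```python
import random

# Confusion counts for the 52 held-out criterion examples, summed from the
# per-criterion breakdown; TN is derived as 52 - TP - FP - FN.
TP, FP, FN = 19, 4, 1
TN = 52 - TP - FP - FN

# One (label, prediction) pair per criterion example.
examples = [(1, 1)] * TP + [(0, 1)] * FP + [(1, 0)] * FN + [(0, 0)] * TN


def f1_score(pairs):
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pairs, metric, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method; the card does not name one)."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


point = f1_score(examples)
low, high = bootstrap_ci(examples, f1_score)
print(f"F1 = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Swapping `f1_score` for a recall, precision, or accuracy function gives the other rows of the table under the same assumptions.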
Per-criterion breakdown (held-out test set)
| Criterion | Description | TP | FP | FN | Recall | Precision |
|---|---|---|---|---|---|---|
| i | Prizes / awards | — | — | — | — | — |
| ii | Association membership | 2 | 0 | 0 | 100% | 100% |
| iii | Published material | 1 | 0 | 0 | 100% | 100% |
| iv | Judging others' work | 4 | 1 | 0 | 100% | 80% |
| v | Original contributions | 5 | 1 | 0 | 100% | 83% |
| vi | Scholarly articles | 3 | 0 | 0 | 100% | 100% |
| vii | Artistic exhibitions | — | — | — | — | — |
| viii | Critical / leading role | 2 | 1 | 0 | 100% | 67% |
| ix | High salary | 2 | 1 | 1 | 67% | 67% |
| x | Commercial success | — | — | — | — | — |
— = no examples in held-out set for this criterion.
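As a quick arithmetic check, the headline metrics can be recomputed from the per-criterion counts above. The sketch below only sums the TP/FP/FN columns and uses the stated total of 52 criterion examples; the variable names are illustrative.

```python
# (TP, FP, FN) per criterion, copied from the table above. Criteria i, vii
# and x are omitted because they have no held-out examples.
counts = {
    "ii": (2, 0, 0),
    "iii": (1, 0, 0),
    "iv": (4, 1, 0),
    "v": (5, 1, 0),
    "vi": (3, 0, 0),
    "viii": (2, 1, 0),
    "ix": (2, 1, 1),
}
N_EXAMPLES = 52  # total criterion examples in the held-out set

tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (N_EXAMPLES - fp - fn) / N_EXAMPLES

print(f"recall={recall:.1%} precision={precision:.1%} f1={f1:.1%} accuracy={accuracy:.1%}")
# recall=95.0% precision=82.6% f1=88.4% accuracy=90.4%
```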
Ablation results (cross-validation benchmark, v3 corpus)
| Model variant | Recall | Precision | F1 | Decision acc. |
|---|---|---|---|---|
| v3-Scaled (this model) | 75.0% | 37.5% | 50.0% | 84.0% |
| v3-Baseline | 14.3% | 50.0% | 22.2% | 96.0% |
| v2-Scaled | 75.0% | 17.6% | 28.6% | 72.0% |
| v2-Scaled + no task prefix | 50.0% | 14.3% | 22.2% | 68.0% |
| v2-Legal (legal-BERT) | 75.0% | 17.6% | 28.6% | 72.0% |
| v2-Weighted (8.75× loss) | 100% | 10.8% | 19.5% | 84.0% |
Task Formulation
Each evidence item is evaluated independently per criterion using the prompt:
[EVIDENCE] Does this evidence satisfy criterion 8 C.F.R. § 204.5(h)(3)(<id>)? <evidence text>
Three task types are distinguished via prefix tokens:
| Prefix | Task |
|---|---|
| [CRITERION] | Criterion analysis text → met / not met |
| [OVERALL] | Case summary → petition approved / denied |
| [EVIDENCE] | Evidence string → satisfies criterion / does not |
Criterion-level decisions aggregate evidence-item predictions using a majority vote rule at a 0.65 confidence threshold (recommended operating point).
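One way to read this aggregation rule is sketched below. The card does not spell out how the vote and the threshold interact, so the interpretation is an assumption: an evidence item votes "satisfies" when its positive-class probability (`probs[1]` in the Usage snippet) is at least 0.65, and the criterion is met when more than half of its items vote that way. The function name is hypothetical.

```python
from typing import Sequence

THRESHOLD = 0.65  # recommended operating point from the card


def criterion_met(p_satisfies: Sequence[float], threshold: float = THRESHOLD) -> bool:
    """Majority vote over the evidence items submitted for one criterion.

    Assumption: an item votes "satisfies" when its positive-class
    probability is at least `threshold`; the criterion is met when more
    than half of the items vote "satisfies".
    """
    if not p_satisfies:
        return False  # no evidence submitted for this criterion
    votes = sum(1 for p in p_satisfies if p >= threshold)
    return votes > len(p_satisfies) / 2


print(criterion_met([0.91, 0.70, 0.40]))  # True: 2 of 3 items clear 0.65
print(criterion_met([0.91, 0.60, 0.40]))  # False: only 1 of 3 clears 0.65
```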
Training Data
| Split | Files | Parsed criteria |
|---|---|---|
| Training | 4,660 AAO decisions (2010–2023) | ~9,730 |
| Held-out test | 29 AAO decisions (2024–2026) | 52 |
Source: USCIS Administrative Appeals Office public decisions (uscis.gov/administrative-appeals).
Parser: v3 parser with CFR-header anchoring (regulation at 8 C.F.R. § 204.5(h)(3)(X)),
negation context window, conclusion-zone reversal detection, and pdfplumber extraction.
Compared with v2 content-regex parsing, the average number of parsed criteria per file rose from 1.891 to 2.087 (+10.4%).
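Header anchoring of this kind might look like the sketch below. It is not the v3 parser: the regular expression, the optional "The regulation at" prefix, and the `split_by_criterion` helper are all assumptions made for illustration, and the negation window and reversal detection are left out.

```python
import re

# Matches e.g. "The regulation at 8 C.F.R. § 204.5(h)(3)(iv)" and captures
# the criterion numeral (i through x).
CFR_HEADER = re.compile(
    r"(?:[Tt]he regulation at\s+)?"
    r"8\s*C\.F\.R\.\s*§\s*204\.5\(h\)\(3\)\((viii|vii|vi|iv|ix|iii|ii|i|v|x)\)"
)


def split_by_criterion(text: str) -> list[tuple[str, str]]:
    """Return (criterion_id, section_text) pairs, one per header match."""
    matches = list(CFR_HEADER.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section runs from the end of its header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), text[m.end():end].strip()))
    return sections


sample = (
    "The regulation at 8 C.F.R. § 204.5(h)(3)(iv) requires judging. "
    "The Petitioner met this criterion. "
    "The regulation at 8 C.F.R. § 204.5(h)(3)(viii) requires a leading role. "
    "The record does not establish this."
)
for cid, body in split_by_criterion(sample):
    print(cid, "->", body)
```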
Class imbalance: The appellate corpus is skewed toward denied/contested cases.
Positive criteria (met) represent ~15% of training examples. The v3-Scaled variant
applies upsampling to balance the training set.
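A minimal sketch of upsampling follows. It assumes random duplication of minority-class examples until the two classes are the same size; the card does not state the exact scheme, and `upsample_minority` is a hypothetical helper. Upsampling of this kind would be applied to the training split only.

```python
import random


def upsample_minority(examples, label_of, seed=0):
    """Randomly duplicate minority-class examples until both classes match in size."""
    rng = random.Random(seed)
    pos = [e for e in examples if label_of(e) == 1]
    neg = [e for e in examples if label_of(e) == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(examples)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data at roughly the card's ~15% positive rate: 15 "met", 85 "not met".
toy = [("met example", 1)] * 15 + [("not met example", 0)] * 85
balanced = upsample_minority(toy, label_of=lambda e: e[1])
n_pos = sum(1 for _, y in balanced if y == 1)
n_neg = sum(1 for _, y in balanced if y == 0)
print(n_pos, n_neg)  # 85 85
```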
Base Model
| Property | Value |
|---|---|
| Base | distilbert-base-uncased |
| Parameters | 66M |
| Max token length | 512 |
| Training epochs | 10 (early stopping, patience 5) |
| Optimizer | AdamW |
| Best checkpoint | saved by aggregate criterion accuracy |
A legal-BERT ablation (nlpaueb/legal-bert-base-uncased) showed no gain over DistilBERT at this training scale: v2-Legal matches v2-Scaled on every metric in the ablation table. This is consistent with prior findings that domain adaptation helps most when training data is large.
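The checkpoint-selection logic implied by the table (up to 10 epochs, patience 5, best checkpoint chosen by criterion accuracy) can be sketched in plain Python. The function and the per-epoch scores below are hypothetical and only illustrate the stopping rule; they are not taken from the training run.

```python
def select_best_epoch(val_scores, max_epochs=10, patience=5):
    """Return (best_epoch, best_score, stopped_epoch), epochs numbered from 1.

    Training stops once `patience` consecutive epochs fail to beat the best
    validation score seen so far, or after `max_epochs`.
    """
    best_epoch, best_score, stale = 0, float("-inf"), 0
    stopped = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        stopped = epoch
        if score > best_score:
            # New best: this is where a checkpoint would be saved.
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score, stopped


# Hypothetical per-epoch criterion accuracies, for illustration only.
scores = [0.71, 0.78, 0.80, 0.79, 0.80, 0.78, 0.77, 0.79, 0.76, 0.75]
print(select_best_epoch(scores))  # (3, 0.8, 8)
```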
Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("agme2019/eb1a-screener")
tokenizer = AutoTokenizer.from_pretrained("agme2019/eb1a-screener")
model.eval()  # inference mode: disables dropout


def score_evidence(criterion_id: str, evidence: str) -> dict:
    # Build the [EVIDENCE] prompt described under Task Formulation.
    text = (
        "[EVIDENCE] Does this evidence satisfy criterion "
        f"8 C.F.R. § 204.5(h)(3)({criterion_id})? {evidence}"
    )
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Index 1 is the "satisfies" class; confidence is the probability of the predicted class.
    return {"satisfies": bool(probs[1] > probs[0]), "confidence": probs.max().item()}


# Example
result = score_evidence("iv", "Reviewed 12 papers for NeurIPS 2024 as an invited PC member.")
print(result)  # e.g. {'satisfies': True, 'confidence': 0.91}
```
Or use the full pipeline via the hosted app: huggingface.co/spaces/agme2019/eb1a-screener-app
Criteria Reference (8 C.F.R. § 204.5(h)(3))
| ID | Criterion |
|---|---|
| i | Outstanding prizes or awards for excellence |
| ii | Membership in associations requiring outstanding achievement |
| iii | Published material about the person in major media |
| iv | Judging the work of others in the field |
| v | Original contributions of major significance |
| vi | Authorship of scholarly articles |
| vii | Artistic exhibitions or showcases |
| viii | Leading or critical role for distinguished organizations |
| ix | High salary relative to peers |
| x | Commercial success in the performing arts |
A petition is approvable when ≥ 3 criteria are met AND a final merits determination confirms the petitioner is among the small percentage at the very top of their field.
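The first step of this two-part test is a simple count, sketched below. The helper name is hypothetical, and a `True` result is necessary but not sufficient: the final merits determination is a separate judgment that this check does not attempt.

```python
CRITERIA = ("i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x")


def passes_criteria_count(met: dict, minimum: int = 3) -> bool:
    """First step only: are at least `minimum` of the ten criteria met?

    `met` maps criterion ids to booleans; missing ids count as not met.
    """
    unknown = set(met) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criterion ids: {sorted(unknown)}")
    return sum(1 for cid in CRITERIA if met.get(cid, False)) >= minimum


print(passes_criteria_count({"iv": True, "v": True, "vi": True, "ix": False}))  # True
print(passes_criteria_count({"iv": True, "v": True}))  # False
```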
Limitations
- Trained exclusively on AAO appellate decisions, which disproportionately represent denied or contested petitions. The model may underestimate the strength of straightforward approvals that never reach appeal.
- Criterion ix (high salary) has the lowest recall (67%, 2 of 3 positives in the held-out set); salary evidence is highly context-dependent and harder to classify from text alone.
- The model scores each evidence item independently; it does not reason about the totality of evidence or perform the final merits determination step.
- Not a substitute for legal counsel.
Citation
@misc{gosai2026eb1a,
title = {Criterion-Level Evidence Assessment for EB-1A Petitions:
An Applied NLP Pipeline on AAO Appellate Decisions},
author = {Gosai, Agnivo},
year = {2026},
note = {Submitted to ACL ARR May 2026 (EMNLP 2026 Industry Track)},
url = {https://huggingface.co/agme2019/eb1a-screener}
}