EB-1A Petition Screener — Criterion-Level Evidence Classifier

A fine-tuned DistilBERT classifier that predicts whether a piece of evidence satisfies a specific EB-1A extraordinary ability criterion (8 C.F.R. § 204.5(h)(3)), trained on Administrative Appeals Office (AAO) appellate decisions.

Live demo: huggingface.co/spaces/agme2019/eb1a-screener-app

⚠️ Research prototype. Not legal advice. Results do not constitute an immigration determination.


Model Performance

Evaluated on a 29-file held-out test set of 2024–2026 AAO decisions (never seen during training). Direct criterion classification mode: 52 criterion examples, 20 positives.

| Metric | Score | 95% CI |
| --- | --- | --- |
| F1 | 88.4% | [76.2%, 97.4%] |
| Recall | 95.0% | [83.3%, 100.0%] |
| Precision | 82.6% | [65.4%, 96.0%] |
| Criterion-level accuracy | 90.4% | [80.8%, 98.1%] |
| Case-level decision accuracy | 100% | exact (21/21) |

Bootstrap 95% CIs: 10,000 resamples at the criterion-example level (n=52). The wide precision interval reflects the small positive count (n=20). Decision accuracy is exact — all 21 cases correct.
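
The intervals above can be approximated with a percentile bootstrap. The following is a minimal sketch, not the evaluation script behind this card: it rebuilds the 52 (label, prediction) pairs from the confusion counts implied by the reported metrics (19 TP, 4 FP, 1 FN, 28 TN), and the function names, seed, and reduced resample count are illustrative assumptions. Interval endpoints depend on the seed and resampling details, so they will not necessarily match the table digit for digit.

```python
import random

def f1_score(pairs):
    """F1 for a list of (label, prediction) pairs, where 1 = criterion met."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(pairs, metric, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample examples with replacement, then take quantiles."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Confusion counts implied by the reported metrics (assumption, see lead-in).
pairs = [(1, 1)] * 19 + [(0, 1)] * 4 + [(1, 0)] * 1 + [(0, 0)] * 28

point = f1_score(pairs)
low, high = bootstrap_ci(pairs, f1_score, n_resamples=2_000)
print(f"F1 = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling at the example level treats the 52 criterion examples as independent, even though several come from the same decision; resampling whole cases instead would be a stricter alternative.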

Per-criterion breakdown (held-out test set)

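| Criterion | Description | TP | FP | FN | Recall | Precision |
| --- | --- | --- | --- | --- | --- | --- |
| i | Prizes / awards | — | — | — | — | — |
| ii | Association membership | 2 | 0 | 0 | 100% | 100% |
| iii | Published material | 1 | 0 | 0 | 100% | 100% |
| iv | Judging others' work | 4 | 1 | 0 | 100% | 80% |
| v | Original contributions | 5 | 1 | 0 | 100% | 83% |
| vi | Scholarly articles | 3 | 0 | 0 | 100% | 100% |
| vii | Artistic exhibitions | — | — | — | — | — |
| viii | Critical / leading role | 2 | 1 | 0 | 100% | 67% |
| ix | High salary | 2 | 1 | 1 | 67% | 67% |
| x | Commercial success | — | — | — | — | — |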

— = no examples in held-out set for this criterion.
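
As a cross-check, the per-criterion counts can be summed to recover the headline recall, precision, and F1 (micro-averaged over criteria). A short sketch using the counts from the table above; the variable names are illustrative.

```python
# (TP, FP, FN) per criterion, copied from the table above.
# Criteria i, vii, and x have no held-out examples and are omitted.
counts = {
    "ii": (2, 0, 0),
    "iii": (1, 0, 0),
    "iv": (4, 1, 0),
    "v": (5, 1, 0),
    "vi": (3, 0, 0),
    "viii": (2, 1, 0),
    "ix": (2, 1, 1),
}

tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn}")  # TP=19 FP=4 FN=1
print(f"recall={recall:.1%} precision={precision:.1%} f1={f1:.1%}")
# recall=95.0% precision=82.6% f1=88.4%
```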

Ablation results (cross-validation benchmark, v3 corpus)

| Model variant | Recall | Precision | F1 | Decision acc. |
| --- | --- | --- | --- | --- |
| v3-Scaled (this model) | 75.0% | 37.5% | 50.0% | 84.0% |
| v3-Baseline | 14.3% | 50.0% | 22.2% | 96.0% |
| v2-Scaled | 75.0% | 17.6% | 28.6% | 72.0% |
| v2-Scaled + no task prefix | 50.0% | 14.3% | 22.2% | 68.0% |
| v2-Legal (legal-BERT) | 75.0% | 17.6% | 28.6% | 72.0% |
| v2-Weighted (8.75× loss) | 100% | 10.8% | 19.5% | 84.0% |

Task Formulation

Each evidence item is evaluated independently per criterion using the prompt:

[EVIDENCE] Does this evidence satisfy criterion 8 C.F.R. § 204.5(h)(3)(<id>)? <evidence text>

Three task types are distinguished via prefix tokens:

| Prefix | Task |
| --- | --- |
| [CRITERION] | Criterion analysis text → met / not met |
| [OVERALL] | Case summary → petition approved / denied |
| [EVIDENCE] | Evidence string → satisfies criterion / does not |

Criterion-level decisions aggregate evidence-item predictions using a majority vote rule at a 0.65 confidence threshold (recommended operating point).
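
The aggregation rule is not spelled out further here, so the following is one plausible reading rather than the pipeline's actual code: an evidence item casts a positive vote when its positive-class probability is at least 0.65, and the criterion counts as met when positive votes form a strict majority. The function name, input format, and tie handling are assumptions.

```python
from typing import Sequence

def criterion_met(positive_probs: Sequence[float], threshold: float = 0.65) -> bool:
    """Majority vote over evidence items for a single criterion.

    positive_probs holds the model's probability of the "satisfies" class
    for each evidence item (hypothetical input format).
    """
    if not positive_probs:
        return False  # no evidence submitted: treat the criterion as not met
    votes = sum(1 for p in positive_probs if p >= threshold)
    # Strict majority; a tie counts as not met (assumption).
    return votes * 2 > len(positive_probs)

print(criterion_met([0.91, 0.70, 0.40]))  # True: 2 of 3 items clear the threshold
print(criterion_met([0.91, 0.60, 0.40]))  # False: only 1 of 3 clears it
```

Note that score_evidence in the Usage section returns the winning-class probability as "confidence", so the positive-class probability would need to be read from probs[1] before applying a rule like this.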


Training Data

| Split | Files | Parsed criteria |
| --- | --- | --- |
| Training | 4,660 AAO decisions (2010–2023) | ~9,730 |
| Held-out test | 29 AAO decisions (2024–2026) | 52 |

Source: USCIS Administrative Appeals Office public decisions (uscis.gov/administrative-appeals).

Parser: v3 parser with CFR-header anchoring (regulation at 8 C.F.R. § 204.5(h)(3)(X)), negation context window, conclusion-zone reversal detection, and pdfplumber extraction. Compared to v2 content-regex parsing: avg. criteria per file 1.891 → 2.087 (+10.4%).
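
CFR-header anchoring can be illustrated with a regular expression that locates criterion citations and splits a decision into per-criterion segments. This is a simplified sketch, not the v3 parser itself: the pattern, function name, and sample text are assumptions, and the negation window and conclusion-zone reversal detection are left out.

```python
import re

# Matches citations such as "8 C.F.R. § 204.5(h)(3)(iv)" and captures the criterion ID.
CRITERION_HEADER = re.compile(
    r"8\s*C\.F\.R\.\s*§\s*204\.5\(h\)\(3\)\((viii|vii|vi|iv|ix|iii|ii|i|v|x)\)"
)

def split_by_criterion(text: str) -> list[tuple[str, str]]:
    """Return (criterion_id, segment) pairs, one per header found in the text."""
    matches = list(CRITERION_HEADER.finditer(text))
    segments = []
    for idx, m in enumerate(matches):
        # A segment runs from the end of this header to the start of the next one.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        segments.append((m.group(1), text[m.end():end].strip()))
    return segments

sample = (
    "8 C.F.R. § 204.5(h)(3)(iv) The petitioner reviewed manuscripts. "
    "8 C.F.R. § 204.5(h)(3)(ix) The record lacks comparative wage data."
)
for cid, segment in split_by_criterion(sample):
    print(cid, "->", segment)
```

Anchoring on the citation rather than on content keywords keeps segmentation independent of how an adjudicator phrases the analysis, which is presumably why it recovers more criteria per file than content-regex parsing.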

Class imbalance: The appellate corpus is skewed toward denied/contested cases. Positive criteria (met) represent ~15% of training examples. The v3-Scaled variant applies upsampling to balance the training set.
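
Upsampling here can be read as random oversampling of the minority (met) class until both classes are the same size. The sketch below shows that generic technique under that assumption; this card does not specify the exact procedure, and the function name, seed, and toy data are illustrative.

```python
import random

def upsample_minority(examples, seed=0):
    """Randomly duplicate minority-class examples until both classes are equal in size.

    examples is a list of (text, label) pairs with label 1 = met, 0 = not met.
    """
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex[1] == 1]
    neg = [ex for ex in examples if ex[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(examples)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy set with roughly the 15% positive rate described above (3 of 20).
toy = [(f"met example {i}", 1) for i in range(3)] + [
    (f"not met example {i}", 0) for i in range(17)
]
balanced = upsample_minority(toy)
print(sum(label for _, label in balanced), "positives of", len(balanced))
# 17 positives of 34
```

Oversampling of this kind should be applied to the training split only, so that duplicated positives never appear in validation or test data.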


Base Model

| Property | Value |
| --- | --- |
| Base | distilbert-base-uncased |
| Parameters | 66M |
| Max token length | 512 |
| Training epochs | 10 (early stopping, patience 5) |
| Optimizer | AdamW |
| Best checkpoint | saved by aggregate criterion accuracy |

A Legal-BERT ablation (nlpaueb/legal-bert-base-uncased) showed no gain over DistilBERT at this training scale: v2-Legal and v2-Scaled report identical scores in the ablation table. This is consistent with prior findings that domain adaptation helps most when training data is large.


Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("agme2019/eb1a-screener")
tokenizer = AutoTokenizer.from_pretrained("agme2019/eb1a-screener")
model.eval()

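def score_evidence(criterion_id: str, evidence: str) -> dict:
    """Score one evidence item against one criterion ID (e.g. "iv")."""
    # Build the [EVIDENCE] prompt described under Task Formulation.
    text = (
        "[EVIDENCE] Does this evidence satisfy criterion "
        f"8 C.F.R. § 204.5(h)(3)({criterion_id})? {evidence}"
    )
    # Inputs longer than 512 tokens are truncated to the model's maximum length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Index 1 is the "satisfies" class; confidence is the winning-class probability.
    return {"satisfies": bool(probs[1] > probs[0]), "confidence": probs.max().item()}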

# Example
result = score_evidence("iv", "Reviewed 12 papers for NeurIPS 2024 as an invited PC member.")
print(result)  # {'satisfies': True, 'confidence': 0.91}

Or use the full pipeline via the hosted app: huggingface.co/spaces/agme2019/eb1a-screener-app


Criteria Reference (8 C.F.R. § 204.5(h)(3))

| ID | Criterion |
| --- | --- |
| i | Outstanding prizes or awards for excellence |
| ii | Membership in associations requiring outstanding achievement |
| iii | Published material about the person in major media |
| iv | Judging the work of others in the field |
| v | Original contributions of major significance |
| vi | Authorship of scholarly articles |
| vii | Artistic exhibitions or showcases |
| viii | Leading or critical role for distinguished organizations |
| ix | High salary relative to peers |
| x | Commercial success in the performing arts |

A petition is approvable when ≥ 3 criteria are met AND a final merits determination confirms the petitioner is among the small percentage at the very top of their field.
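
The first step of this rule is simple counting, sketched below. The function and constant names are hypothetical, and the final merits determination is deliberately not modeled, since it is a holistic judgment that a count cannot capture.

```python
VALID_CRITERIA = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}

def meets_criteria_threshold(met_criteria: set[str], minimum: int = 3) -> bool:
    """True when at least `minimum` distinct regulatory criteria are met.

    This covers only the counting step; it says nothing about final merits.
    """
    unknown = met_criteria - VALID_CRITERIA
    if unknown:
        raise ValueError(f"Unknown criterion IDs: {sorted(unknown)}")
    return len(met_criteria) >= minimum

print(meets_criteria_threshold({"iv", "v", "vi"}))  # True
print(meets_criteria_threshold({"iv", "ix"}))       # False
```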


Limitations

  • Trained exclusively on AAO appellate decisions — these disproportionately represent denied or contested petitions. The model may underestimate strength for straightforward approvals not reaching appeal.
  • Criterion ix (high salary) has the lowest recall (67%) — salary evidence is highly context-dependent and harder to classify from text alone.
  • The model scores each evidence item independently; it does not reason about the totality of evidence or perform the final merits determination step.
  • Not a substitute for legal counsel.

Citation

@misc{gosai2026eb1a,
  title        = {Criterion-Level Evidence Assessment for EB-1A Petitions:
                  An Applied NLP Pipeline on AAO Appellate Decisions},
  author       = {Gosai, Agnivo},
  year         = {2026},
  note         = {Submitted to ACL ARR May 2026 (EMNLP 2026 Industry Track)},
  url          = {https://huggingface.co/agme2019/eb1a-screener}
}