EB-1A Petition Screener — Criterion-Level Evidence Classifier

A fine-tuned DistilBERT classifier that predicts whether a piece of evidence satisfies a specific EB-1A extraordinary ability criterion (8 C.F.R. § 204.5(h)(3)), trained on Administrative Appeals Office (AAO) appellate decisions.

Live demo: agme2019/eb1a-screener-app

⚠️ Research prototype. Not legal advice. Results do not constitute an immigration determination.

Model Performance

Evaluated on a 29-file held-out test set of 2024–2026 AAO decisions (never seen during training). Direct criterion classification mode: 52 criterion examples, 20 positives.

Metric	Score	95% CI
F1	88.4%	[76.2%, 97.4%]
Recall	95.0%	[83.3%, 100.0%]
Precision	82.6%	[65.4%, 96.0%]
Criterion-level accuracy	90.4%	[80.8%, 98.1%]
Case-level decision accuracy	100%	exact (21/21)

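Bootstrap 95% CIs: 10,000 resamples at the criterion-example level (n=52). The wide precision interval reflects the small sample: only 20 positives and 23 positive predictions (19 TP + 4 FP). Case-level decision accuracy is an exact count rather than a bootstrap estimate: all 21 cases were classified correctly.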
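
A minimal sketch of how intervals of this kind can be computed, assuming a plain percentile bootstrap over (label, prediction) pairs. The exact resampling procedure and random seed used for the reported numbers are not stated here, so the bounds this produces will be close to, but not necessarily identical to, the table. The confusion counts are summed from the per-criterion breakdown that follows (TP=19, FP=4, FN=1), with TN=28 implied by the total of 52 examples.

```python
import random

# Confusion counts: TP/FP/FN summed from the per-criterion table;
# TN = 52 - (19 + 4 + 1) = 28 follows from the stated total.
TP, FP, FN, TN = 19, 4, 1, 28
# One (true_label, predicted_label) pair per criterion example.
examples = [(1, 1)] * TP + [(0, 1)] * FP + [(1, 0)] * FN + [(0, 0)] * TN


def metrics(pairs):
    """Precision, recall, F1 and accuracy for a list of (label, prediction) pairs."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def bootstrap_ci(pairs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (ASSUMED method) for each metric."""
    rng = random.Random(seed)
    samples = {name: [] for name in metrics(pairs)}
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping the sample size fixed.
        resample = rng.choices(pairs, k=len(pairs))
        for name, value in metrics(resample).items():
            samples[name].append(value)
    intervals = {}
    for name, values in samples.items():
        values.sort()
        lo = values[int((alpha / 2) * n_resamples)]
        hi = values[int((1 - alpha / 2) * n_resamples) - 1]
        intervals[name] = (lo, hi)
    return intervals


point = metrics(examples)
intervals = bootstrap_ci(examples)
for name in point:
    lo, hi = intervals[name]
    print(f"{name}: {point[name]:.3f}  [{lo:.3f}, {hi:.3f}]")
```

The point estimates are deterministic and reproduce the table (for example, precision is 19/23); only the interval bounds depend on the seed and on the percentile convention.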

Per-criterion breakdown (held-out test set)

Criterion	Description	TP	FP	FN	Recall	Precision
i	Prizes / awards	—	—	—	—	—
ii	Association membership	2	0	0	100%	100%
iii	Published material	1	0	0	100%	100%
iv	Judging others' work	4	1	0	100%	80%
v	Original contributions	5	1	0	100%	83%
vi	Scholarly articles	3	0	0	100%	100%
vii	Artistic exhibitions	—	—	—	—	—
viii	Critical / leading role	2	1	0	100%	67%
ix	High salary	2	1	1	67%	67%
x	Commercial success	—	—	—	—	—

— = no examples in held-out set for this criterion.

Ablation results (cross-validation benchmark, v3 corpus)

Model variant	Recall	Precision	F1	Decision acc.
v3-Scaled (this model)	75.0%	37.5%	50.0%	84.0%
v3-Baseline	14.3%	50.0%	22.2%	96.0%
v2-Scaled	75.0%	17.6%	28.6%	72.0%
v2-Scaled + no task prefix	50.0%	14.3%	22.2%	68.0%
v2-Legal (legal-BERT)	75.0%	17.6%	28.6%	72.0%
v2-Weighted (8.75× loss)	100%	10.8%	19.5%	84.0%

Task Formulation

Each evidence item is evaluated independently per criterion using the prompt:

[EVIDENCE] Does this evidence satisfy criterion 8 C.F.R. § 204.5(h)(3)(<id>)? <evidence text>

Three task types are distinguished via prefix tokens:

Prefix	Task
`[CRITERION]`	Criterion analysis text → met / not met
`[OVERALL]`	Case summary → petition approved / denied
`[EVIDENCE]`	Evidence string → satisfies criterion / does not

Criterion-level decisions aggregate evidence-item predictions using a majority vote rule at a 0.65 confidence threshold (recommended operating point).
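
The aggregation step can be sketched as follows. This is an illustration under stated assumptions, not the hosted pipeline's code: it assumes an evidence item votes "satisfies" only when its positive-class probability is at least 0.65, and that a criterion is met when strictly more than half of its items vote "satisfies". The function name `criterion_met` and the tie-handling rule are hypothetical.

```python
from typing import Sequence


def criterion_met(p_satisfies: Sequence[float], threshold: float = 0.65) -> bool:
    """Majority vote over the evidence items submitted for one criterion.

    ASSUMPTIONS (not confirmed by the model card):
      * each value is the model's probability for the "satisfies" class;
      * an item votes "satisfies" only if that probability >= threshold;
      * ties and empty evidence lists count as "not met".
    """
    if not p_satisfies:
        return False  # no evidence was submitted for this criterion
    votes = sum(1 for p in p_satisfies if p >= threshold)
    return votes > len(p_satisfies) / 2


# Two of three items clear the 0.65 threshold, so the criterion is met.
print(criterion_met([0.90, 0.70, 0.40]))
# Only one of three items clears the threshold, so it is not.
print(criterion_met([0.90, 0.60, 0.40]))
```

Raising the threshold trades recall for precision at the item level, which is why the operating point matters more than the vote rule itself.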

Training Data

Split	Files	Parsed criteria
Training	4,660 AAO decisions (2010–2023)	~9,730
Held-out test	29 AAO decisions (2024–2026)	52

Source: USCIS Administrative Appeals Office public decisions (uscis.gov/administrative-appeals).

Parser: the v3 parser combines CFR-header anchoring (regulation at 8 C.F.R. § 204.5(h)(3)(X)), a negation context window, conclusion-zone reversal detection, and pdfplumber text extraction. Compared with v2 content-regex parsing, the average number of criteria parsed per file rose from 1.891 to 2.087 (+10.4%).

Class imbalance: The appellate corpus is skewed toward denied/contested cases. Positive criteria (met) represent ~15% of training examples. The v3-Scaled variant applies upsampling to balance the training set.
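
One common way to implement the upsampling mentioned above is random oversampling of the minority class with replacement. The sketch below assumes that approach; the card does not say which upsampling method v3-Scaled uses, and the helper name `upsample_minority` and the toy data are hypothetical.

```python
import random


def upsample_minority(examples, label_of, seed=0):
    """Randomly duplicate minority-class examples until both classes are equal in size.

    ASSUMPTION: binary labels (0/1) and simple random oversampling with
    replacement; the actual v3-Scaled procedure may differ.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if label_of(e) == 1]
    negatives = [e for e in examples if label_of(e) == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return list(examples)  # nothing to duplicate
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the ~15% positive rate: 15 "met" and 85 "not met" examples.
toy = [("met example", 1)] * 15 + [("not met example", 0)] * 85
balanced = upsample_minority(toy, label_of=lambda e: e[1])
n_pos = sum(1 for _, y in balanced if y == 1)
n_neg = sum(1 for _, y in balanced if y == 0)
print(n_pos, n_neg)
```

Oversampling should be applied to the training split only, so that duplicated positives never leak into validation or test data.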

Base Model

Property	Value
Base	`distilbert-base-uncased`
Parameters	66M
Max token length	512
Training epochs	10 (early stopping, patience 5)
Optimizer	AdamW
Best checkpoint	saved by aggregate criterion accuracy

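A Legal-BERT ablation (nlpaueb/legal-bert-base-uncased) matched the v2-Scaled variant on every metric in the ablation table (75.0% recall, 17.6% precision, 28.6% F1, 72.0% decision accuracy), so it showed no gain over DistilBERT at this training scale. This is consistent with prior findings that domain adaptation helps most when training data is large.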
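
The early-stopping and checkpoint settings in the table above (at most 10 epochs, patience 5, best checkpoint by aggregate criterion accuracy) can be illustrated with a framework-free sketch. The function name and the per-epoch accuracy values are hypothetical and exist only to show the stopping logic; the actual training loop is not described in this card.

```python
def run_with_early_stopping(val_accuracy_per_epoch, max_epochs=10, patience=5):
    """Return (best_epoch, epochs_run) for a sequence of validation accuracies.

    Training stops once `patience` consecutive epochs pass without beating
    the best accuracy seen so far, or when `max_epochs` is reached.
    """
    best_acc = float("-inf")
    best_epoch = None
    epochs_run = 0
    for epoch, acc in enumerate(val_accuracy_per_epoch[:max_epochs], start=1):
        epochs_run = epoch
        if acc > best_acc:
            # A real loop would save the model checkpoint here.
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, epochs_run


# Hypothetical accuracies: the peak is at epoch 2, so training stops at epoch 7.
best_epoch, epochs_run = run_with_early_stopping(
    [0.70, 0.80, 0.79, 0.78, 0.77, 0.76, 0.75, 0.74]
)
print(best_epoch, epochs_run)
```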

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("agme2019/eb1a-screener")
tokenizer = AutoTokenizer.from_pretrained("agme2019/eb1a-screener")
model.eval()

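def score_evidence(criterion_id: str, evidence: str) -> dict:
    """Score one evidence item against one criterion (e.g. "iv")."""
    # Build the [EVIDENCE] prompt in the format given under Task Formulation.
    text = (
        f"[EVIDENCE] Does this evidence satisfy criterion "
        f"8 C.F.R. § 204.5(h)(3)({criterion_id})? {evidence}"
    )
    # Inputs longer than 512 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Index 1 is treated as the "satisfies" class. "confidence" is the
    # probability of whichever class was predicted, not P(satisfies).
    return {"satisfies": bool(probs[1] > probs[0]), "confidence": probs.max().item()}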

# Example
result = score_evidence("iv", "Reviewed 12 papers for NeurIPS 2024 as an invited PC member.")
print(result)  # {'satisfies': True, 'confidence': 0.91}

Or use the full pipeline via the hosted app: huggingface.co/spaces/agme2019/eb1a-screener-app

Criteria Reference (8 C.F.R. § 204.5(h)(3))

ID	Criterion
i	Outstanding prizes or awards for excellence
ii	Membership in associations requiring outstanding achievement
iii	Published material about the person in major media
iv	Judging the work of others in the field
v	Original contributions of major significance
vi	Authorship of scholarly articles
vii	Artistic exhibitions or showcases
viii	Leading or critical role for distinguished organizations
ix	High salary relative to peers
x	Commercial success in the performing arts

A petition is approvable when ≥ 3 criteria are met AND a final merits determination confirms the petitioner is among the small percentage at the very top of their field.
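
The first half of this rule, counting met criteria against the threshold of three, is simple enough to sketch. The function below is a hypothetical helper, not part of the released model or app, and it deliberately covers only the counting step: the final merits determination is a separate judgment that this sketch does not attempt.

```python
# Criterion IDs in regulatory order, as listed in the table above.
CRITERIA = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"]


def meets_criteria_threshold(criterion_results, required=3):
    """Check the ">= 3 criteria met" step only.

    `criterion_results` maps a criterion ID to True (met) or False (not met);
    IDs that are absent are treated as not claimed. Passing this check is
    necessary but not sufficient: the final merits determination is NOT modeled.
    """
    met = [cid for cid in CRITERIA if criterion_results.get(cid, False)]
    return len(met) >= required, met


passes, met = meets_criteria_threshold({"iv": True, "v": True, "vi": True, "ix": False})
print(passes, met)
```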

Limitations

Trained exclusively on AAO appellate decisions, which disproportionately represent denied or contested petitions. The model may underestimate the strength of straightforward approvals that never reach appeal.
Criterion ix (high salary) has the lowest held-out recall (67%, 2 of 3 positives). Salary evidence is highly context-dependent and harder to classify from text alone.
The model scores each evidence item independently; it does not reason about the totality of evidence or perform the final merits determination step.
Not a substitute for legal counsel.

Citation

@misc{gosai2026eb1a,
  title        = {Criterion-Level Evidence Assessment for EB-1A Petitions:
                  An Applied NLP Pipeline on AAO Appellate Decisions},
  author       = {Gosai, Agnivo},
  year         = {2026},
  note         = {Submitted to ACL ARR May 2026 (EMNLP 2026 Industry Track)},
  url          = {https://huggingface.co/agme2019/eb1a-screener}
}
