ALBERT-base fine-tuned on JABD for Bias Detection in Job Advertisements
This model is a fine-tuned version of albert-base-v2 on the Job Ads Bias Dataset (JABD) for token-level bias detection in job advertisements, framed as a Named Entity Recognition (NER) task using the BIO tagging scheme.
It is the best-performing model reported in the accompanying paper, "Bias Detection in Job Advertisement using Natural Language Processing".
Model Description
The model identifies and classifies 12 types of linguistic bias at the token/span level in job advertisements, covering both explicit and implicit biases across six sociodemographic groups: gender, religion, disability, ethnicity, age, and nationality.
- Base model: albert-base-v2
- Task: Token classification (NER) with BIO tagging
- Language: English
- Training dataset: Job Ads Bias Dataset (JABD) — 14,960 sentences with token-level annotations
- Number of labels: 25 (12 bias categories × 2 BIO tags + O)
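The label count above can be reproduced with a short sketch. The tag strings used here (for example `B-GENERIC_HE`) are hypothetical; the actual names are stored in the checkpoint's `config.id2label` and may be spelled differently.

```python
# Hypothetical category identifiers; the real label strings live in the
# checkpoint's config.id2label and may differ from these.
CATEGORIES = [
    "GENERIC_SHE", "GENERIC_HE", "EXPLICIT_MARKING_OF_SEX",
    "MASCULINE_CODED", "FEMININE_CODED", "RELIGION_RELATED",
    "DISABILITY_RELATED", "NATIONALITY_RELATED", "ETHNIC_RELATED",
    "AGE_RELATED", "OLD_CODED", "YOUNG_CODED",
]


def build_bio_labels(categories):
    """Return the BIO label list: 'O' plus a B- and an I- tag per category."""
    labels = ["O"]
    for category in categories:
        labels.append(f"B-{category}")  # first token of a biased span
        labels.append(f"I-{category}")  # continuation token of the same span
    return labels


LABELS = build_bio_labels(CATEGORIES)
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(LABELS))  # 12 categories * 2 tags + 'O' = 25
```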
Bias Categories
| Category | Type | Group |
|---|---|---|
| Generic She | Explicit | Gender |
| Generic He | Explicit | Gender |
| Explicit Marking of Sex | Explicit | Gender |
| Masculine Coded | Implicit | Gender |
| Feminine Coded | Implicit | Gender |
| Religion Related | Explicit | Religion |
| Disability Related | Explicit | Disability |
| Nationality Related | Explicit | Nationality |
| Ethnic Related | Explicit | Ethnicity |
| Age Related | Explicit | Age |
| Old Coded | Implicit | Age |
| Young Coded | Implicit | Age |
Intended Use
Primary Use Cases
- Flagging potentially biased language in job advertisements for human review.
- Research on fairness, bias, and inclusion in recruitment-related text.
- Building tools to assist recruiters and HR professionals in writing more inclusive job postings.
Out-of-Scope Uses
- Legal determinations or hiring decisions. This model is not designed to be, and must not be used as, an automated decision-maker in any recruitment process.
- Automated content moderation without human oversight.
- Languages other than English. The model was trained exclusively on English-language job ads.
- Domains other than job advertisements. Performance on other text domains has not been evaluated.
Performance
Micro-averaged token-level metrics on the JABD test split (averaged across three random seeds):
| Metric | Score |
|---|---|
| F1 | 59.27 ± 0.86 |
| Precision | 65.29 ± 1.79 |
| Recall | 54.27 ± 0.33 |
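Micro-averaging pools true positives, false positives, and false negatives over all bias labels before computing the scores. The sketch below illustrates this under the assumption that per-label counts are available; the counts in the example are invented and are not JABD statistics. As a sanity check, the harmonic mean of the reported precision and recall comes out at the reported F1 (only approximately guaranteed, since the table values are seed averages).

```python
def micro_prf(counts):
    """Micro-averaged precision, recall, F1 from per-label (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented counts for two hypothetical labels, for illustration only.
toy_counts = {"GENERIC_HE": (8, 2, 1), "YOUNG_CODED": (1, 3, 9)}
toy_precision, toy_recall, toy_f1 = micro_prf(toy_counts)
print(f"P={toy_precision:.3f} R={toy_recall:.3f} F1={toy_f1:.3f}")

# Consistency check on the reported aggregate scores.
print(round(harmonic_f1(65.29, 54.27), 2))  # 59.27
```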
Per-Label Performance (F1)
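Performance varies substantially by category. Most explicit categories are detected reliably (F1 between 72 and 89), with Age Related (F1 41.01) as the exception; implicit (coded) categories remain challenging, with F1 ranging from 0.53 to 55.68.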
| Label | Type | F1 | Precision | Recall |
|---|---|---|---|---|
| Generic She | Explicit | 88.57 | 85.57 | 91.92 |
| Explicit Marking of Sex | Explicit | 81.40 | 78.79 | 84.44 |
| Disability Related | Explicit | 77.70 | 71.78 | 85.13 |
| Religion Related | Explicit | 77.49 | 87.53 | 69.74 |
| Generic He | Explicit | 77.28 | 69.54 | 87.56 |
| Nationality Related | Explicit | 72.65 | 74.20 | 71.26 |
| Ethnic Related | Explicit | 72.10 | 64.61 | 82.39 |
| Feminine Coded | Implicit | 55.68 | 47.62 | 67.34 |
| Masculine Coded | Implicit | 51.69 | 41.34 | 68.96 |
| Age Related | Explicit | 41.01 | 29.06 | 72.08 |
| Old Coded | Implicit | 10.15 | 7.15 | 27.89 |
| Young Coded | Implicit | 0.53 | 0.28 | 4.94 |
Note: This checkpoint corresponds to a single random seed from the experiments reported in the paper, so its own scores may differ slightly from the seed-averaged values above.
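The gap between explicit and implicit categories can be summarized by averaging the per-label F1 values from the table. The averages computed below are derived here from the table and are not figures reported in the paper.

```python
from statistics import mean

# (label, type, F1) rows copied from the per-label table above.
PER_LABEL_F1 = [
    ("Generic She", "Explicit", 88.57),
    ("Explicit Marking of Sex", "Explicit", 81.40),
    ("Disability Related", "Explicit", 77.70),
    ("Religion Related", "Explicit", 77.49),
    ("Generic He", "Explicit", 77.28),
    ("Nationality Related", "Explicit", 72.65),
    ("Ethnic Related", "Explicit", 72.10),
    ("Feminine Coded", "Implicit", 55.68),
    ("Masculine Coded", "Implicit", 51.69),
    ("Age Related", "Explicit", 41.01),
    ("Old Coded", "Implicit", 10.15),
    ("Young Coded", "Implicit", 0.53),
]

# Unweighted (macro) mean over all 12 labels.
macro_f1 = mean(f1 for _, _, f1 in PER_LABEL_F1)

# Unweighted mean within each bias type.
by_type = {
    kind: mean(f1 for _, t, f1 in PER_LABEL_F1 if t == kind)
    for kind in ("Explicit", "Implicit")
}

print(f"Mean F1 over 12 labels: {macro_f1:.1f}")
for kind, value in by_type.items():
    print(f"{kind} mean F1: {value:.1f}")
```

With the table values this gives roughly 58.9 overall, 73.5 for explicit labels, and 29.5 for implicit labels.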
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Placeholder repository ID; replace it with the Hub path of this checkpoint.
model_name = "your-username/albert-base-v2-jabd-bias-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# "simple" aggregation merges B-/I- sub-token predictions into entity spans.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "We are looking for a young and energetic salesman to join our team."
predictions = ner(text)

# Print each detected span with its bias category and confidence score.
for pred in predictions:
    print(f"{pred['word']:<20} {pred['entity_group']:<25} {pred['score']:.3f}")
```
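For human review it can help to mark the flagged spans inside the original sentence. The sketch below relies on the `start`, `end`, and `entity_group` keys that the aggregated token-classification pipeline returns. The predictions in the example are hand-written mock data with hypothetical label names, not output from this model.

```python
def annotate(text, predictions):
    """Wrap each predicted span as [span]{LABEL} using character offsets."""
    pieces, cursor = [], 0
    for pred in sorted(predictions, key=lambda p: p["start"]):
        if pred["start"] < cursor:
            continue  # skip spans that overlap one already emitted
        span = text[pred["start"]:pred["end"]]
        pieces.append(text[cursor:pred["start"]])
        pieces.append("[" + span + "]{" + pred["entity_group"] + "}")
        cursor = pred["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


def mock_prediction(text, word, label, score):
    """Build a pipeline-style dict for the first occurrence of `word` (mock data)."""
    start = text.index(word)
    return {"entity_group": label, "score": score, "word": word,
            "start": start, "end": start + len(word)}


text = "We are looking for a young and energetic salesman to join our team."
# Hand-written mock predictions with hypothetical labels, NOT model output.
mock_predictions = [
    mock_prediction(text, "salesman", "EXPLICIT_MARKING_OF_SEX", 0.91),
    mock_prediction(text, "young", "YOUNG_CODED", 0.62),
]
annotated = annotate(text, mock_predictions)
print(annotated)
```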
Training Details
Training Data
The model was trained on the Job Ads Bias Dataset (JABD), which contains 14,960 sentences with token-level BIO annotations across 12 bias categories. JABD was built on top of the EMSCAD corpus (Vidros et al., 2017) and annotated by 192 trained annotators recruited via Prolific, following a custom taxonomy and a rigorous quality-assurance process.
Data Splits
| Split | Job IDs | Sentences |
|---|---|---|
| Train | 7,195 (80%) | 11,991 (80.15%) |
| Validation | 899 (10%) | 1,517 (10.14%) |
| Test | 899 (10%) | 1,452 (9.71%) |
Splits were made at the job_id level, so all sentences from the same advertisement fall in a single split, which prevents cross-advertisement leakage.
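A minimal sketch of such a leakage-free split, assuming each row carries a `job_id` field. The `sentence` field name, the seed, and the shuffling procedure are assumptions made for illustration; the authors' actual splitting code is not described here.

```python
import random
from collections import defaultdict


def split_by_job_id(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole advertisements (job_id groups) to train/validation/test."""
    by_job = defaultdict(list)
    for row in rows:
        by_job[row["job_id"]].append(row)

    # Shuffle the advertisement IDs, not the individual sentences.
    job_ids = sorted(by_job)
    random.Random(seed).shuffle(job_ids)

    n_train = int(ratios[0] * len(job_ids))
    n_val = int(ratios[1] * len(job_ids))
    parts = {
        "train": job_ids[:n_train],
        "validation": job_ids[n_train:n_train + n_val],
        "test": job_ids[n_train + n_val:],
    }
    return {name: [row for job in ids for row in by_job[job]]
            for name, ids in parts.items()}


# Toy data: 20 advertisements with 3 sentences each (invented).
toy_rows = [{"job_id": i // 3, "sentence": f"sentence {i}"} for i in range(60)]
splits = split_by_job_id(toy_rows)
print({name: len(rows) for name, rows in splits.items()})
```

Because the split is made over advertisements, the sentence-level proportions only approximate 80/10/10, which matches the slightly uneven sentence percentages in the table above.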
Training Procedure
- Architecture: ALBERT-base-v2 with a token classification head
- Tagging scheme: BIO
- Checkpoint selection: Best validation macro-F1
- Regularization: Higher dropout and longer training (more epochs) to improve generalization
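To illustrate the BIO scheme used above, the following sketch decodes a tag sequence into token-level spans. The tokens, tags, and label name in the example are hand-written for illustration, and the handling of a stray `I-` tag is one common convention rather than the authors' documented choice.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (category, start, end) token spans, end exclusive."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        continues = prefix == "I" and category == current
        if current is not None and not continues:
            spans.append((current, start, i))  # close the open span
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # A stray I- tag without a matching B- is treated as a span start.
            start, current = i, category
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hand-written example with a hypothetical label name.
tokens = ["Ideal", "for", "a", "recent", "graduate"]
tags = ["O", "O", "O", "B-YOUNG_CODED", "I-YOUNG_CODED"]
for category, start, end in bio_to_spans(tags):
    print(category, " ".join(tokens[start:end]))
```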
Limitations and Ethical Considerations
Limitations
- Implicit bias remains hard to detect. Young Coded (F1 0.53) and Old Coded (F1 10.15) are detected very poorly, and predictions in these categories should be interpreted with caution.
- Taxonomy scope. The taxonomy does not cover all possible forms of bias; criminal record references, socioeconomic status, and intersectional biases, for example, are underrepresented.
- Cultural and temporal contingency. Bias is context- and group-dependent. The taxonomy reflects the cultural norms present in the EMSCAD corpus (English-language job ads, 2012–2014) and may not transfer cleanly to other contexts.
- Subjectivity. Inter-annotator agreement (Krippendorff's α = 0.51) reflects the inherent subjectivity of the task. Model errors partly inherit this variability.
- Single seed. This checkpoint corresponds to one random seed; results may vary slightly across seeds.
Ethical Considerations
This model addresses a sensitive topic with potential for misuse. We highlight the following:
- Dual use. The model could in principle be inverted to craft covert discriminatory language or to identify thresholds for evading bias detection. Users must commit to non-discriminatory and assistive applications only.
- Human-in-the-loop. Outputs are intended to flag language for human review, not to make automatic determinations about candidates, employers, or job postings.
- Over-penalization. Excessive flagging of subtle or ambiguous wording can produce compliance theater or suppress inclusive language. Calibrate thresholds appropriately for your context.
- Bias amplification. Unequal error rates across categories may amplify existing disparities. Per-class metrics should be monitored in deployment.
For a complete discussion, see Section 5.4 of the accompanying paper.
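One way to act on the calibration advice above is to apply a separate confidence threshold per category to the pipeline output. The sketch below assumes pipeline-style prediction dicts; the threshold values, the default, and the label names are hypothetical and would need to be tuned on validation data for a given deployment.

```python
DEFAULT_THRESHOLD = 0.5  # assumed default, not a value from the paper

# Hypothetical thresholds: stricter for categories with low reported precision.
CATEGORY_THRESHOLDS = {
    "YOUNG_CODED": 0.90,
    "OLD_CODED": 0.90,
    "AGE_RELATED": 0.75,
}


def filter_by_threshold(predictions, thresholds, default=DEFAULT_THRESHOLD):
    """Keep predictions whose score meets the threshold for their category."""
    return [
        pred for pred in predictions
        if pred["score"] >= thresholds.get(pred["entity_group"], default)
    ]


# Hand-written mock predictions, NOT output from this model.
mock_predictions = [
    {"entity_group": "YOUNG_CODED", "word": "young", "score": 0.62},
    {"entity_group": "GENERIC_HE", "word": "he", "score": 0.62},
    {"entity_group": "AGE_RELATED", "word": "under 30", "score": 0.80},
]
kept = filter_by_threshold(mock_predictions, CATEGORY_THRESHOLDS)
print([pred["word"] for pred in kept])
```

Flagged spans that pass the filter are still meant for human review, not for automatic decisions.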
Citation
If you use this model, please cite:
@article{citation_2025,
title={Bias Detection in Job Advertisement using Natural Language Processing},
author={Private for now},
journal={Journal name},
year={2025}
}
Acknowledgments
The model is built on ALBERT (Lan et al., 2020) and uses the EMSCAD dataset (Vidros et al., 2017) as the source of job advertisements.
Contact
For questions about the model or dataset, please contact the authors via the paper or open an issue on the model repository.