ALBERT-base fine-tuned on JABD for Bias Detection in Job Advertisements

This model is a fine-tuned version of albert-base-v2 on the Job Ads Bias Dataset (JABD) for token-level bias detection in job advertisements, framed as a Named Entity Recognition (NER) task using the BIO tagging scheme.

It is the best-performing model reported in the accompanying paper, "Bias Detection in Job Advertisement using Natural Language Processing".

Model Description

The model identifies and classifies 12 types of linguistic bias at the token/span level in job advertisements, covering both explicit and implicit biases across six sociodemographic groups: gender, religion, disability, ethnicity, age, and nationality.

  • Base model: albert-base-v2
  • Task: Token classification (NER) with BIO tagging
  • Language: English
  • Training dataset: Job Ads Bias Dataset (JABD) — 14,960 sentences with token-level annotations
  • Number of labels: 25 (12 bias categories × 2 BIO tags + O)
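
The label count above follows directly from the BIO scheme: one `B-` and one `I-` tag per category, plus a single `O` tag. A minimal sketch of that construction is shown below. The label strings are hypothetical; the names and ordering the checkpoint actually uses are defined by `id2label` in its configuration.

```python
# Hypothetical category identifiers; the checkpoint's config (id2label)
# is the authoritative source for the real names and their order.
CATEGORIES = [
    "GENERIC_SHE", "GENERIC_HE", "EXPLICIT_MARKING_OF_SEX",
    "MASCULINE_CODED", "FEMININE_CODED", "RELIGION_RELATED",
    "DISABILITY_RELATED", "NATIONALITY_RELATED", "ETHNIC_RELATED",
    "AGE_RELATED", "OLD_CODED", "YOUNG_CODED",
]


def build_bio_labels(categories):
    """Return ["O", "B-X", "I-X", ...] for the given categories."""
    labels = ["O"]
    for name in categories:
        labels.append(f"B-{name}")  # first token of a biased span
        labels.append(f"I-{name}")  # any following token of the same span
    return labels


LABELS = build_bio_labels(CATEGORIES)
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 25
```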

Bias Categories

| Category | Type | Group |
| --- | --- | --- |
| Generic She | Explicit | Gender |
| Generic He | Explicit | Gender |
| Explicit Marking of Sex | Explicit | Gender |
| Masculine Coded | Implicit | Gender |
| Feminine Coded | Implicit | Gender |
| Religion Related | Explicit | Religion |
| Disability Related | Explicit | Disability |
| Nationality Related | Explicit | Ethnicity |
| Ethnic Related | Explicit | Ethnicity |
| Age Related | Explicit | Age |
| Old Coded | Implicit | Age |
| Young Coded | Implicit | Age |

Intended Use

Primary Use Cases

  • Flagging potentially biased language in job advertisements for human review.
  • Research on fairness, bias, and inclusion in recruitment-related text.
  • Building tools to assist recruiters and HR professionals in writing more inclusive job postings.

Out-of-Scope Uses

  • Legal determinations or hiring decisions. This model is not designed to act as, and must not be used as, an automated decision-maker in any recruitment process.
  • Automated content moderation without human oversight.
  • Languages other than English. The model was trained exclusively on English-language job ads.
  • Domains other than job advertisements. Performance on other text domains has not been evaluated.

Performance

Micro-averaged token-level metrics on the JABD test split (averaged across three random seeds):

| Metric | Score (%) |
| --- | --- |
| F1 | 59.27 ± 0.86 |
| Precision | 65.29 ± 1.79 |
| Recall | 54.27 ± 0.33 |
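
For readers unfamiliar with micro-averaging, the sketch below pools true positives, false positives, and false negatives over all labels before computing precision, recall, and F1 once. It also checks that the reported F1 is consistent with the harmonic mean of the reported precision and recall. The function names and toy counts are illustrative assumptions, not the evaluation code used in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """Pool (tp, fp, fn) over all labels, then compute P, R, F1 once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check on the reported averages.
print(round(f1_score(65.29, 54.27), 2))  # 59.27

# Made-up counts, for illustration only (not taken from JABD).
toy_counts = {"LABEL_A": (8, 2, 2), "LABEL_B": (1, 1, 3)}
precision, recall, f1 = micro_average(toy_counts)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because counts are pooled, frequent labels dominate a micro-averaged score, which is one reason the per-label breakdown is worth reading alongside it.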

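Per-Label Performance

Performance varies substantially by category. Seven of the eight explicit categories reach an F1 between 72.10 and 88.57, while Age Related (41.01) lags well behind. Implicit (coded) categories remain challenging: Feminine Coded and Masculine Coded score in the low-to-mid 50s, and Old Coded (10.15) and Young Coded (0.53) are detected poorly.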

| Label | Type | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| Generic She | Explicit | 88.57 | 85.57 | 91.92 |
| Explicit Marking of Sex | Explicit | 81.40 | 78.79 | 84.44 |
| Disability Related | Explicit | 77.70 | 71.78 | 85.13 |
| Religion Related | Explicit | 77.49 | 87.53 | 69.74 |
| Generic He | Explicit | 77.28 | 69.54 | 87.56 |
| Nationality Related | Explicit | 72.65 | 74.20 | 71.26 |
| Ethnic Related | Explicit | 72.10 | 64.61 | 82.39 |
| Feminine Coded | Implicit | 55.68 | 47.62 | 67.34 |
| Masculine Coded | Implicit | 51.69 | 41.34 | 68.96 |
| Age Related | Explicit | 41.01 | 29.06 | 72.08 |
| Old Coded | Implicit | 10.15 | 7.15 | 27.89 |
| Young Coded | Implicit | 0.53 | 0.28 | 4.94 |

Note: The figures above are averages over three random seeds, whereas this checkpoint corresponds to a single seed from the experiments reported in the paper, so its own scores may differ slightly.
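
To make the explicit/implicit gap concrete, the sketch below takes the per-label F1 values from the table and computes their unweighted mean per bias type. These group means are derived here for illustration only; they are not figures reported in the paper, and they ignore how often each label occurs.

```python
from statistics import mean

# Per-label F1 copied from the table above: label -> (type, F1).
PER_LABEL_F1 = {
    "Generic She": ("Explicit", 88.57),
    "Explicit Marking of Sex": ("Explicit", 81.40),
    "Disability Related": ("Explicit", 77.70),
    "Religion Related": ("Explicit", 77.49),
    "Generic He": ("Explicit", 77.28),
    "Nationality Related": ("Explicit", 72.65),
    "Ethnic Related": ("Explicit", 72.10),
    "Feminine Coded": ("Implicit", 55.68),
    "Masculine Coded": ("Implicit", 51.69),
    "Age Related": ("Explicit", 41.01),
    "Old Coded": ("Implicit", 10.15),
    "Young Coded": ("Implicit", 0.53),
}


def mean_f1_by_type(table):
    """Unweighted mean F1 per bias type, rounded to one decimal place."""
    grouped = {}
    for bias_type, f1 in table.values():
        grouped.setdefault(bias_type, []).append(f1)
    return {bias_type: round(mean(vals), 1) for bias_type, vals in grouped.items()}


by_type = mean_f1_by_type(PER_LABEL_F1)
print(by_type)  # {'Explicit': 73.5, 'Implicit': 29.5}
```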

How to Use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Repository ID as shown on the model page.
model_name = "mborquez/albert-base-v2-jabd-bias-ner"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" groups adjacent tokens that belong to the
# same predicted category and reports each group under "entity_group".
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "We are looking for a young and energetic salesman to join our team."
predictions = ner(text)

# Print each flagged span, its bias category, and the model's confidence.
for pred in predictions:
    print(f"{pred['word']:<20} {pred['entity_group']:<25} {pred['score']:.3f}")
```
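
If you work with raw per-word BIO tags instead of the aggregated pipeline output, the tags need to be grouped into spans. The sketch below shows one simple way to do that. It is an illustration of the BIO idea, not the pipeline's internal implementation; the tag names are hypothetical and the tags are hand-written, not model output.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (category, start, end, text) spans; end is exclusive."""
    spans = []
    start, current = None, None
    # A trailing "O" sentinel flushes a span that runs to the end of the input.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, category = tag.partition("-")
        continues = prefix == "I" and category == current
        if current is not None and not continues:
            spans.append((current, start, i, " ".join(tokens[start:i])))
            start, current = None, None
        # Open a new span on "B-", or leniently on a stray "I-" with no open span.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, category
    return spans


tokens = ["We", "seek", "a", "young", "and", "energetic", "salesman", "."]
# Hand-written tags for illustration only.
tags = ["O", "O", "O", "B-YOUNG_CODED", "I-YOUNG_CODED", "I-YOUNG_CODED",
        "B-EXPLICIT_MARKING_OF_SEX", "O"]

spans = bio_to_spans(tokens, tags)
for category, start, end, text in spans:
    print(f"{text!r} -> {category} (tokens {start}:{end})")
```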

Training Details

Training Data

The model was trained on the Job Ads Bias Dataset (JABD), which contains 14,960 sentences with token-level BIO annotations across 12 bias categories. JABD was built on top of the EMSCAD corpus (Vidros et al., 2017) and annotated by 192 trained annotators recruited via Prolific, following a custom taxonomy and a rigorous quality-assurance process.

Data Splits

| Split | Job IDs | Phrases |
| --- | --- | --- |
| Train | 7,195 (80%) | 11,991 (80.15%) |
| Validation | 899 (10%) | 1,517 (10.14%) |
| Test | 899 (10%) | 1,452 (9.71%) |

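Splits were made at the job_id level, so all phrases from one advertisement fall in the same split, which prevents cross-advertisement leakage. Because advertisements differ in length, the phrase percentages deviate slightly from the 80/10/10 split of job IDs.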
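
A split that keeps each advertisement in exactly one partition can be sketched as follows. This is a minimal illustration assuming each row carries a `job_id` field; the function name, ratios, seed, and toy data are assumptions, and the authors' actual splitting code is not described in this card.

```python
import random
from collections import defaultdict


def split_by_job_id(rows, ratios=(0.8, 0.1, 0.1), seed=13):
    """Assign whole job ads to train/validation/test so no job_id spans two splits."""
    job_ids = sorted({row["job_id"] for row in rows})
    random.Random(seed).shuffle(job_ids)

    n_train = round(ratios[0] * len(job_ids))
    n_val = round(ratios[1] * len(job_ids))

    assignment = {}
    for i, job_id in enumerate(job_ids):
        if i < n_train:
            assignment[job_id] = "train"
        elif i < n_train + n_val:
            assignment[job_id] = "validation"
        else:
            assignment[job_id] = "test"

    splits = defaultdict(list)
    for row in rows:
        splits[assignment[row["job_id"]]].append(row)
    return dict(splits)


# Toy data: 20 ads with 1 to 3 sentences each (not taken from JABD).
rows = [
    {"job_id": j, "sentence": f"ad {j}, sentence {k}"}
    for j in range(20)
    for k in range(1 + j % 3)
]
splits = split_by_job_id(rows)
ids = {name: {r["job_id"] for r in part} for name, part in splits.items()}
print({name: len(job_ids) for name, job_ids in ids.items()})
```

Sentence counts per split depend on which ads land where, so only the job-ID proportions are controlled exactly.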

Training Procedure

  • Architecture: ALBERT-base-v2 with a token classification head
  • Tagging scheme: BIO
  • Checkpoint selection: Best validation macro-F1
  • Regularization: Higher dropout and longer training (more epochs) to improve generalization
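
Selecting the checkpoint by validation macro-F1 means every label contributes equally to the selection score, regardless of how often it occurs. The sketch below illustrates that criterion with made-up per-epoch counts; the helper names and data are assumptions, not the training code used for this model.

```python
def macro_f1(per_label_counts):
    """Unweighted mean of per-label F1; counts are (tp, fp, fn) per label."""
    scores = []
    for tp, fp, fn in per_label_counts.values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_best_checkpoint(validation_history):
    """Return (epoch, score) for the epoch with the highest validation macro-F1."""
    scored = [(macro_f1(counts), epoch) for epoch, counts in validation_history.items()]
    best_score, best_epoch = max(scored, key=lambda pair: pair[0])
    return best_epoch, best_score


# Made-up validation counts per epoch: label -> (tp, fp, fn).
history = {
    1: {"FREQUENT": (5, 5, 5), "RARE": (0, 0, 4)},
    2: {"FREQUENT": (8, 2, 2), "RARE": (1, 1, 3)},
    3: {"FREQUENT": (9, 1, 1), "RARE": (0, 2, 4)},
}

best_epoch, best_score = select_best_checkpoint(history)
print(best_epoch, round(best_score, 3))  # 2 0.567
```

In this toy history, epochs 2 and 3 have identical pooled counts and therefore the same micro-F1, but macro-F1 prefers epoch 2 because it still detects the rare label.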

Limitations and Ethical Considerations

Limitations

  • Implicit bias remains hard to detect. Categories such as Young Coded (F1 0.53) and Old Coded (F1 10.15) show very low scores. Predictions in these categories should be interpreted with caution.
  • Taxonomy scope. The taxonomy does not cover all possible forms of bias; for example, criminal record references, socioeconomic status, and intersectional biases are underrepresented.
  • Cultural and temporal contingency. Bias is context- and group-dependent. The taxonomy reflects the cultural norms present in the EMSCAD corpus (English-language job ads, 2012–2014) and may not transfer cleanly to other contexts.
  • Subjectivity. Inter-annotator agreement (Krippendorff's α = 0.51) reflects the inherent subjectivity of the task. Model errors partly inherit this variability.
  • Single seed. This checkpoint corresponds to one random seed, so its scores may differ slightly from the three-seed averages reported above (micro-F1 59.27 ± 0.86).

Ethical Considerations

This model addresses a sensitive topic with potential for misuse. We highlight the following:

  • Dual use. The model could in principle be inverted to craft covert discriminatory language or to identify thresholds for evading bias detection. Users must commit to non-discriminatory and assistive applications only.
  • Human-in-the-loop. Outputs are intended to flag language for human review, not to make automatic determinations about candidates, employers, or job postings.
  • Over-penalization. Excessive flagging of subtle or ambiguous wording can produce compliance theater or suppress inclusive language. Calibrate thresholds appropriately for your context.
  • Bias amplification. Unequal error rates across categories may amplify existing disparities. Per-class metrics should be monitored in deployment.
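
One way to act on the calibration and per-class monitoring points above is to apply a separate confidence cutoff per category before showing flags to a reviewer. The sketch below assumes predictions in the shape returned by the token-classification pipeline with aggregation (dicts with `entity_group`, `score`, and `word`). The category names, scores, and threshold values are made up for illustration and would need to be tuned on validation data.

```python
def filter_predictions(predictions, thresholds, default=0.5):
    """Keep predictions whose score meets the cutoff for their category."""
    kept = []
    for pred in predictions:
        cutoff = thresholds.get(pred["entity_group"], default)
        if pred["score"] >= cutoff:
            kept.append(pred)
    return kept


# Illustrative values only, not real model output.
predictions = [
    {"entity_group": "YOUNG_CODED", "score": 0.62, "word": "young and energetic"},
    {"entity_group": "EXPLICIT_MARKING_OF_SEX", "score": 0.91, "word": "salesman"},
    {"entity_group": "MASCULINE_CODED", "score": 0.41, "word": "join"},
]

# Hypothetical cutoffs: stricter for a category with low measured precision.
thresholds = {"YOUNG_CODED": 0.80, "EXPLICIT_MARKING_OF_SEX": 0.50}

flagged = filter_predictions(predictions, thresholds)
for pred in flagged:
    print(pred["word"], pred["entity_group"], pred["score"])
```

Raising a cutoff trades recall for precision, so any threshold change should be checked against per-category metrics rather than the aggregate score alone.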

For a complete discussion, see Section 5.4 of the accompanying paper.

Citation

If you use this model, please cite:

@article{citation_2025,
  title={Bias Detection in Job Advertisement using Natural Language Processing},
  author={Private for now},
  journal={Journal name},
  year={2025}
}

Acknowledgments

The model is built on ALBERT (Lan et al., 2020) and uses the EMSCAD dataset (Vidros et al., 2017) as the source of job advertisements.

Contact

For questions about the model or dataset, please contact the authors via the paper or open an issue on the model repository.
