Maskara

maskara is a lightweight BERT token-classification model for detecting personally identifiable information (PII) in text.

This checkpoint is a continued fine-tune of the existing somukandula/maskara model on ai4privacy/open-pii-masking-500k-ai4privacy. It keeps the original maskara label taxonomy and does not expand the classifier head to every AI4Privacy label.

What Changed

  • Continued training on all 464,150 rows from the AI4Privacy training split.
  • Evaluated on a 20,000-row slice of the AI4Privacy validation split.
  • Used the model's own tokenizer instead of the dataset's mbert_tokens columns.
  • Converted AI4Privacy character spans from privacy_mask into token labels.
  • Mapped compatible AI4Privacy labels into the existing maskara labels.
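The span-to-token conversion mentioned above can be sketched as follows. This is a minimal illustration, not the actual training script: `spans_to_bio` is a hypothetical helper, the offsets are assumed to be in the `(start, end)` form a fast tokenizer returns when called with `return_offsets_mapping=True`, and the rule that any token overlapping a span inherits its label is an assumption, since the card does not state the exact policy.

```python
def spans_to_bio(offsets, spans):
    """Assign a BIO tag to each token from character-level spans.

    offsets: list of (start, end) character offsets, one per token;
             special tokens are expected to carry (0, 0).
    spans:   list of (start, end, label) tuples, end exclusive.
    """
    tags = ["O"] * len(offsets)
    for span_start, span_end, label in spans:
        first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # skip special or empty tokens
            # Assumed policy: a token belongs to the span if they overlap at all.
            if tok_start < span_end and tok_end > span_start:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


# Hand-written offsets for "Email priya@example.com now" (illustrative only).
offsets = [(0, 0), (0, 5), (6, 11), (11, 12), (12, 19), (19, 20), (20, 23), (24, 27), (0, 0)]
print(spans_to_bio(offsets, [(6, 23, "EMAIL")]))
```

Working from character spans keeps the labels independent of any particular tokenizer, which is why the same data can be re-tokenized with the model's own vocabulary.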

Labels

The model predicts BIO tags for these PII classes:

  • ADDRESS
  • API_KEY
  • CREDIT_CARD
  • DATE_OF_BIRTH
  • DRIVER_LICENSE
  • EMAIL
  • IP_ADDRESS
  • LOCATION
  • PASSWORD
  • PERSON_NAME
  • PHONE
  • SSN
  • USERNAME

The full label set is O plus B- and I- variants for each class above.
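As a quick sanity check, the label set can be enumerated from the class list above. The ordering below is only illustrative; the id-to-label order used by the checkpoint is not given here, so read it from `model.config.id2label` rather than assuming this order.

```python
PII_CLASSES = [
    "ADDRESS", "API_KEY", "CREDIT_CARD", "DATE_OF_BIRTH", "DRIVER_LICENSE",
    "EMAIL", "IP_ADDRESS", "LOCATION", "PASSWORD", "PERSON_NAME",
    "PHONE", "SSN", "USERNAME",
]

# "O" plus a B- and an I- tag per class (illustrative ordering only).
LABELS = ["O"] + [f"{prefix}-{name}" for name in PII_CLASSES for prefix in ("B", "I")]

print(len(LABELS))  # 27
```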

Usage

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="somukandula/maskara",
    aggregation_strategy="simple",
)

text = "My name is Priya Sharma and my email is priya@example.com."
print(ner(text))
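With `aggregation_strategy="simple"`, each pipeline result is a dict that includes `entity_group`, `score`, `start`, and `end`. A masking step on top of that output might look like the sketch below; `mask_text` and the `min_score` threshold of 0.5 are hypothetical choices for illustration, and overlapping spans are not handled.

```python
def mask_text(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder."""
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from right to left so earlier character offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text
```

It would be called as `mask_text(text, ner(text))`. Tune the threshold on your own data, since a higher value trades recall for precision.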

For lower-level control:

from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "somukandula/maskara"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
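Once the tokenizer and model are loaded, per-token tags can be read off the logits. The following is a minimal sketch, assuming PyTorch is installed; `logits_to_tags` and `predict_tags` are hypothetical helper names, and the output is per word piece (special tokens included), without the span merging that the pipeline performs.

```python
def logits_to_tags(logits_rows, id2label):
    """Pick the highest-scoring label id for each token and map it to its tag."""
    tags = []
    for row in logits_rows:
        best_id = max(range(len(row)), key=row.__getitem__)
        tags.append(id2label[best_id])
    return tags


def predict_tags(text, tokenizer, model):
    """Run one forward pass and return (token, tag) pairs."""
    import torch  # imported lazily so logits_to_tags stays dependency-free

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # shape: (seq_len, num_labels)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    tags = logits_to_tags(logits.tolist(), model.config.id2label)
    return list(zip(tokens, tags))
```

Calling `predict_tags("My email is priya@example.com.", tokenizer, model)` would return one pair per word piece; merging B-/I- runs into spans is left to the caller.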

Training Details

| Item | Value |
| --- | --- |
| Source checkpoint | somukandula/maskara |
| Dataset | ai4privacy/open-pii-masking-500k-ai4privacy |
| Train rows | 464,150 |
| Validation rows | 20,000 |
| Epochs | 1 |
| Platform | Modal |
| GPU | A10G |
| Final Modal path | /outputs/full-openpii-500k/final |

The training script used source_text and privacy_mask spans from the dataset. It did not train directly on the dataset's mbert_tokens / mbert_token_classes, because this model is based on a small uncased BERT tokenizer rather than mBERT.

Evaluation

Evaluation was run after one epoch on a 20,000-row validation slice.

| Metric | Value |
| --- | --- |
| Eval loss | 0.1640 |
| Precision | 0.4926 |
| Recall | 0.5602 |
| F1 | 0.5243 |
| Accuracy | 0.9453 |

These metrics are for the mapped maskara label taxonomy, not for the full AI4Privacy taxonomy.
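The reported F1 is consistent with the harmonic mean of the reported precision and recall, as the short check below shows. It assumes the three figures come from the same averaging scheme, which the card does not state; the small difference in the fourth decimal is what rounding of the inputs would produce.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.4926, 0.5602)
print(f"{f1:.3f}")  # 0.524
```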

Label Mapping

The AI4Privacy dataset includes labels that are not present in the original maskara taxonomy. Compatible labels were mapped into existing classes:

| AI4Privacy label examples | Maskara label |
| --- | --- |
| GIVENNAME, SURNAME, FIRSTNAME, LASTNAME | PERSON_NAME |
| TELEPHONENUM, PHONENUMBER | PHONE |
| SOCIALNUM | SSN |
| DRIVERLICENSENUM | DRIVER_LICENSE |
| CITY, STATE, COUNTRY | LOCATION |
| STREET, BUILDINGNUM, ZIPCODE | ADDRESS |
| CREDITCARDNUMBER | CREDIT_CARD |
| EMAIL | EMAIL |

Unsupported labels were treated as O (non-PII) during this run rather than forced into incorrect classes. Examples include:

  • DATE
  • TIME
  • AGE
  • IDCARDNUM
  • PASSPORTNUM
  • TAXNUM
  • TITLE
  • SEX
  • GENDER
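The mapping can be expressed as a lookup table, with unmapped labels dropped so that their tokens stay O. This is a sketch built only from the example pairs listed above, so the real mapping may cover more labels; `map_spans` is a hypothetical helper that takes `(start, end, label)` tuples.

```python
AI4PRIVACY_TO_MASKARA = {
    "GIVENNAME": "PERSON_NAME", "SURNAME": "PERSON_NAME",
    "FIRSTNAME": "PERSON_NAME", "LASTNAME": "PERSON_NAME",
    "TELEPHONENUM": "PHONE", "PHONENUMBER": "PHONE",
    "SOCIALNUM": "SSN",
    "DRIVERLICENSENUM": "DRIVER_LICENSE",
    "CITY": "LOCATION", "STATE": "LOCATION", "COUNTRY": "LOCATION",
    "STREET": "ADDRESS", "BUILDINGNUM": "ADDRESS", "ZIPCODE": "ADDRESS",
    "CREDITCARDNUMBER": "CREDIT_CARD",
    "EMAIL": "EMAIL",
}


def map_spans(spans):
    """Rename supported labels and drop spans with no maskara equivalent."""
    mapped = []
    for start, end, label in spans:
        target = AI4PRIVACY_TO_MASKARA.get(label)
        if target is not None:  # unsupported labels are left as O
            mapped.append((start, end, target))
    return mapped


print(map_spans([(0, 5, "GIVENNAME"), (10, 20, "DATE")]))
# [(0, 5, 'PERSON_NAME')]
```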

Intended Use

Use this model for PII-oriented token classification where the existing maskara labels are sufficient. It is intended for experimentation, prototyping, and PII masking workflows that can tolerate the taxonomy above.

Limitations

  • This checkpoint does not detect AI4Privacy-only labels such as DATE, TIME, AGE, IDCARDNUM, or PASSPORTNUM as separate classes.
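  • The model is small (4.37M parameters) and optimized for lightweight inference, not maximum recall. With 0.4926 precision and 0.5602 recall on the validation slice, expect both missed and spurious detections.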
  • The dataset is multilingual, but this model uses a small uncased BERT tokenizer; evaluate carefully before relying on it for non-English text.
  • The reported metrics are from a validation slice and should be re-measured on your target domain before production use.
