Multilingual Fine-Tuned Privacy Filter

This model is a fine-tuned version of openai/privacy-filter for multilingual PII token classification.

Public model URL:

https://huggingface.co/emiemimi/privacy-filter-multilingual-500

Training Data

Dataset: ai4privacy/pii-masking-openpii-1m
Languages: English, Polish, Swedish, German, French, Spanish
Training setting: 500 examples per language
Base model: openai/privacy-filter
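
The "500 examples per language" setting can be illustrated with a small sampling helper. This is only a sketch: the card does not say how the subset was drawn, and the `language` field name, the fixed seed, and the function name `sample_per_language` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_language(rows, per_language=500, seed=0, language_key="language"):
    """Return up to `per_language` rows for each language, sampled reproducibly.

    `language_key` is a hypothetical field name; the real dataset column may differ.
    """
    # Group rows by their language code.
    by_language = defaultdict(list)
    for row in rows:
        by_language[row[language_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    # Iterate in sorted order so the result does not depend on input order of languages.
    for language in sorted(by_language):
        group = by_language[language]
        k = min(per_language, len(group))
        sampled.extend(rng.sample(group, k))
    return sampled
```

With six languages and at least 500 rows available for each, a helper like this would yield 3,000 training rows.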

OpenPII labels were mapped to the output label set used by openai/privacy-filter, including person, email, phone, date, address, account number, and secret categories.
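
A label mapping of this kind could look roughly like the sketch below. The label strings on both sides are placeholders chosen for illustration; the actual OpenPII label names and the exact output labels of openai/privacy-filter are not given in this card, and `LABEL_MAP` and `map_spans` are hypothetical names.

```python
# Hypothetical mapping: source and target label names are placeholders,
# not the real label sets of the dataset or the base model.
LABEL_MAP = {
    "GIVENNAME": "person",
    "SURNAME": "person",
    "EMAIL": "email",
    "TELEPHONENUM": "phone",
    "DATE": "date",
    "STREET": "address",
    "CITY": "address",
    "ACCOUNTNUM": "account_number",
    "PASSWORD": "secret",
}


def map_spans(spans, label_map=LABEL_MAP):
    """Relabel spans with the target categories.

    Spans whose source label has no target category are dropped.
    Each span is a dict with at least a "label" key.
    """
    mapped = []
    for span in spans:
        target = label_map.get(span["label"])
        if target is None:
            continue  # no equivalent category in the target label set
        mapped.append({**span, "label": target})  # copy, do not mutate the input
    return mapped
```

Dropping unmapped labels is one possible design choice; a real conversion script might instead map them to a catch-all category.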

Evaluation

The retained final evaluation uses the shared set of 50 examples per language (300 rows in total).

Evaluation	Language	Texts	Precision	Recall	F1
simple	de	50	0.925	0.933	0.929
improved	de	50	0.914	0.924	0.919
simple	en	50	0.958	0.964	0.961
improved	en	50	0.940	0.946	0.943
simple	es	50	0.928	0.967	0.947
improved	es	50	0.902	0.940	0.921
simple	fr	50	0.969	0.946	0.957
improved	fr	50	0.944	0.918	0.931
simple	pl	50	0.888	0.925	0.906
improved	pl	50	0.859	0.892	0.875
simple	sv	50	0.900	0.928	0.914
improved	sv	50	0.867	0.893	0.880
simple	overall	300	0.926	0.942	0.934
improved	overall	300	0.901	0.917	0.909

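The simple evaluation counts a prediction as correct when it overlaps a gold PII span. The improved evaluation additionally requires the mapped PII category to match, which is why its scores are lower than the simple scores in every row.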
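
The two scoring modes can be sketched as follows. This is one possible reading of the definitions above, not the project's evaluation script: spans are assumed to be dicts with character offsets `start` and `end` (end exclusive) and a `label`, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True when two half-open character spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def score(pred_spans, gold_spans, require_label=False):
    """Return (precision, recall, f1) for overlap-based span matching.

    require_label=False corresponds to the "simple" mode,
    require_label=True to the "improved" mode.
    """
    def matches(p, g):
        return overlaps(p, g) and (not require_label or p["label"] == g["label"])

    # A prediction is correct if it matches any gold span (for precision);
    # a gold span is found if any prediction matches it (for recall).
    correct_preds = sum(any(matches(p, g) for g in gold_spans) for p in pred_spans)
    found_golds = sum(any(matches(p, g) for p in pred_spans) for g in gold_spans)

    precision = correct_preds / len(pred_spans) if pred_spans else 0.0
    recall = found_golds / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because the improved mode only adds a condition to each match, its precision and recall can never exceed the simple-mode values on the same predictions.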

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
model_id = "emiemimi/privacy-filter-multilingual-500"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
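
Once spans have been predicted, they can be masked in the original text. The helper below is a minimal sketch that assumes entities in the format returned by the transformers token-classification pipeline with an aggregation strategy (dicts with `entity_group`, `start`, and `end`), and that the spans do not overlap. The function name `redact` and the `[LABEL]` placeholder style are illustrative choices.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is a list of dicts with "entity_group", "start", and "end"
    (character offsets, end exclusive). Spans are assumed not to overlap.
    """
    out = text
    # Work from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        out = out[:ent["start"]] + placeholder + out[ent["end"]:]
    return out
```

Entities of this shape would typically come from `pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")(text)`. Whether this checkpoint loads through the standard pipeline without extra arguments has not been verified here.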

Project Code

The group GitHub repository should link directly to this model page and include the fine-tuning and shared-50 evaluation scripts.

Model size: 1B params (Safetensors; tensor types F32 and BF16)
