---
license: cc-by-nc-4.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: distilbert_finetuned_ai4privacy_v2
  results: []
datasets:
- ai4privacy/pii-masking-200k
pipeline_tag: token-classification
language:
- en
metrics:
- seqeval
---
# distilbert_finetuned_ai4privacy_v2

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k) dataset.

## Usage

GitHub implementation: [Ai4Privacy](https://github.com/Sripaad/ai4privacy)
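
A minimal sketch of how the model might be used to mask PII, assuming it is loaded through the Transformers `pipeline` API with `aggregation_strategy="simple"`, which returns one dict per detected entity with `entity_group`, `start` and `end` keys. The repository id and the helper names `load_pii_pipeline` and `mask_pii` are hypothetical placeholders, and the linked GitHub implementation may work differently.

```python
from typing import Dict, List

# Hypothetical placeholder: replace with the actual Hub repository id of this model.
MODEL_ID = "<namespace>/distilbert_finetuned_ai4privacy_v2"


def load_pii_pipeline(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first call)."""
    # Imported lazily so that mask_pii can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge word pieces into entity spans
    )


def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a placeholder such as [EMAIL]."""
    # Work from right to left so that earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Usage (needs network access and the real repository id):
#   pii = load_pii_pipeline()
#   text = "My name is Jane and my email is jane@example.com"
#   print(mask_pii(text, pii(text)))
```

Masking is kept separate from model loading so the replacement logic can be checked with hand-written spans before any model is downloaded.
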

## Model description

This model has been fine-tuned on what is described as the world's largest open-source privacy dataset.

The purpose of the model is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs.

The example texts cover 54 PII classes (types of sensitive data) and target 229 discussion subjects / use cases split across business, education, psychology and legal fields, and 5 interaction styles (e.g. casual conversation, formal document, email).

See the GitHub implementation for the specific research.

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 5
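
The hyperparameters above could be expressed as keyword arguments for `transformers.TrainingArguments`; the following is a sketch under stated assumptions, not the original training script. It assumes a single device (so `train_batch_size` maps to `per_device_train_batch_size`), and the names `TRAINING_KWARGS`, `warmup_steps` and `build_training_arguments` are illustrative. The 5440 total steps are taken from the training results table.

```python
# Keyword names follow transformers.TrainingArguments; values come from the list above.
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumption: training ran on one device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.2,
    "num_train_epochs": 5,
}

TOTAL_STEPS = 5440  # final step count reported in the training results table


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Approximate number of optimizer steps spent in learning-rate warmup."""
    return round(total_steps * warmup_ratio)


def build_training_arguments(output_dir: str):
    """Create TrainingArguments from the settings above (needs transformers)."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(warmup_steps(TOTAL_STEPS, TRAINING_KWARGS["warmup_ratio"]))
```

With a warmup ratio of 0.2 over 5440 steps, warmup covers about 1088 steps, which corresponds to roughly the first of the five epochs.
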

## Class-wise metrics

The model achieves the following results on the evaluation set:

- Loss: 0.0451
- Overall Precision: 0.9438
- Overall Recall: 0.9663
- Overall F1: 0.9549
- Overall Accuracy: 0.9838

- Accountname F1: 0.9946
- Accountnumber F1: 0.9940
- Age F1: 0.9624
- Amount F1: 0.9643
- Bic F1: 0.9929
- Bitcoinaddress F1: 0.9948
- Buildingnumber F1: 0.9845
- City F1: 0.9955
- Companyname F1: 0.9962
- County F1: 0.9877
- Creditcardcvv F1: 0.9643
- Creditcardissuer F1: 0.9953
- Creditcardnumber F1: 0.9793
- Currency F1: 0.7811
- Currencycode F1: 0.8850
- Currencyname F1: 0.2281
- Currencysymbol F1: 0.9562
- Date F1: 0.9061
- Dob F1: 0.7914
- Email F1: 1.0
- Ethereumaddress F1: 1.0
- Eyecolor F1: 0.9837
- Firstname F1: 0.9846
- Gender F1: 0.9971
- Height F1: 0.9910
- Iban F1: 0.9906
- Ip F1: 0.4349
- Ipv4 F1: 0.8126
- Ipv6 F1: 0.7679
- Jobarea F1: 0.9880
- Jobtitle F1: 0.9991
- Jobtype F1: 0.9777
- Lastname F1: 0.9684
- Litecoinaddress F1: 0.9721
- Mac F1: 1.0
- Maskednumber F1: 0.9635
- Middlename F1: 0.9330
- Nearbygpscoordinate F1: 1.0
- Ordinaldirection F1: 0.9910
- Password F1: 1.0
- Phoneimei F1: 0.9918
- Phonenumber F1: 0.9962
- Pin F1: 0.9477
- Prefix F1: 0.9546
- Secondaryaddress F1: 0.9892
- Sex F1: 0.9876
- Ssn F1: 0.9976
- State F1: 0.9893
- Street F1: 0.9873
- Time F1: 0.9889
- Url F1: 1.0
- Useragent F1: 0.9953
- Username F1: 0.9975
- Vehiclevin F1: 1.0
- Vehiclevrm F1: 1.0
- Zipcode F1: 0.9873
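
As a consistency check on the overall figures above, the following sketch recomputes F1 from the reported overall precision and recall, assuming the overall F1 is their harmonic mean (the standard definition). The function name `f1_from_precision_recall` is illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported for the evaluation set.
overall_f1 = f1_from_precision_recall(0.9438, 0.9663)
print(round(overall_f1, 4))  # 0.9549, matching the reported Overall F1
```
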

### Training results

| Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy | Accountname F1 | Accountnumber F1 | Age F1 | Amount F1 | Bic F1 | Bitcoinaddress F1 | Buildingnumber F1 | City F1 | Companyname F1 | County F1 | Creditcardcvv F1 | Creditcardissuer F1 | Creditcardnumber F1 | Currency F1 | Currencycode F1 | Currencyname F1 | Currencysymbol F1 | Date F1 | Dob F1 | Email F1 | Ethereumaddress F1 | Eyecolor F1 | Firstname F1 | Gender F1 | Height F1 | Iban F1 | Ip F1 | Ipv4 F1 | Ipv6 F1 | Jobarea F1 | Jobtitle F1 | Jobtype F1 | Lastname F1 | Litecoinaddress F1 | Mac F1 | Maskednumber F1 | Middlename F1 | Nearbygpscoordinate F1 | Ordinaldirection F1 | Password F1 | Phoneimei F1 | Phonenumber F1 | Pin F1 | Prefix F1 | Secondaryaddress F1 | Sex F1 | Ssn F1 | State F1 | Street F1 | Time F1 | Url F1 | Useragent F1 | Username F1 | Vehiclevin F1 | Vehiclevrm F1 | Zipcode F1 |
|:-------------:|:-----:|:----:|:---------------:|:-----------------:|:--------------:|:----------:|:----------------:|:--------------:|:----------------:|:------:|:---------:|:------:|:-----------------:|:-----------------:|:-------:|:--------------:|:---------:|:----------------:|:-------------------:|:-------------------:|:-----------:|:---------------:|:---------------:|:-----------------:|:-------:|:------:|:--------:|:------------------:|:-----------:|:------------:|:---------:|:---------:|:-------:|:------:|:-------:|:-------:|:----------:|:-----------:|:----------:|:-----------:|:------------------:|:------:|:---------------:|:-------------:|:----------------------:|:-------------------:|:-----------:|:------------:|:--------------:|:------:|:---------:|:-------------------:|:------:|:------:|:--------:|:---------:|:-------:|:------:|:------------:|:-----------:|:-------------:|:-------------:|:----------:|
| 0.6445 | 1.0 | 1088 | 0.3322 | 0.6449 | 0.7003 | 0.6714 | 0.8900 | 0.7607 | 0.8733 | 0.6576 | 0.1766 | 0.25 | 0.6783 | 0.3621 | 0.6005 | 0.6909 | 0.5586 | 0.0 | 0.2449 | 0.7095 | 0.2889 | 0.0 | 0.0 | 0.3902 | 0.7720 | 0.0 | 0.9862 | 0.8011 | 0.5088 | 0.7740 | 0.7118 | 0.5434 | 0.8088 | 0.0 | 0.8303 | 0.7562 | 0.5318 | 0.7294 | 0.4681 | 0.6779 | 0.0 | 0.8909 | 0.0 | 0.0107 | 0.9985 | 0.4000 | 0.7307 | 0.9057 | 0.8618 | 0.0 | 0.9127 | 0.8235 | 0.9211 | 0.8026 | 0.4656 | 0.6390 | 0.9383 | 0.9775 | 0.8868 | 0.8201 | 0.4526 | 0.0550 | 0.5368 |
| 0.222 | 2.0 | 2176 | 0.1259 | 0.8170 | 0.8747 | 0.8449 | 0.9478 | 0.9708 | 0.9813 | 0.7638 | 0.7427 | 0.7837 | 0.8908 | 0.8833 | 0.8747 | 0.9814 | 0.8749 | 0.7601 | 0.9777 | 0.8834 | 0.5372 | 0.4828 | 0.0056 | 0.7785 | 0.8149 | 0.3140 | 0.9956 | 0.9935 | 0.9101 | 0.9270 | 0.9450 | 0.9853 | 0.9253 | 0.0650 | 0.0084 | 0.7962 | 0.9013 | 0.9446 | 0.9203 | 0.8555 | 0.6885 | 1.0 | 0.7152 | 0.6442 | 1.0 | 0.9623 | 0.9349 | 0.9905 | 0.9782 | 0.7656 | 0.9324 | 0.9903 | 0.9736 | 0.9274 | 0.8520 | 0.9138 | 0.9678 | 0.9922 | 0.9893 | 0.9804 | 0.9646 | 0.8556 | 0.8385 |
| 0.1331 | 3.0 | 3264 | 0.0773 | 0.9133 | 0.9371 | 0.9250 | 0.9654 | 0.9822 | 0.9815 | 0.9196 | 0.8852 | 0.9718 | 0.9785 | 0.9215 | 0.9757 | 0.9935 | 0.9651 | 0.8742 | 0.9921 | 0.9438 | 0.7568 | 0.7710 | 0.0 | 0.8998 | 0.7895 | 0.6578 | 0.9994 | 1.0 | 0.9554 | 0.9525 | 0.9823 | 0.9910 | 0.9866 | 0.0435 | 0.8293 | 0.7824 | 0.9671 | 0.9794 | 0.9571 | 0.9447 | 0.9141 | 1.0 | 0.8825 | 0.7988 | 1.0 | 0.9797 | 0.9921 | 0.9932 | 0.9943 | 0.8726 | 0.9401 | 0.9860 | 0.9792 | 0.9928 | 0.9740 | 0.9604 | 0.9730 | 0.9983 | 0.9964 | 0.9959 | 0.9890 | 0.9774 | 0.9247 |
| 0.0847 | 4.0 | 4352 | 0.0503 | 0.9368 | 0.9614 | 0.9489 | 0.9789 | 0.9955 | 0.9949 | 0.9573 | 0.9480 | 0.9929 | 0.9846 | 0.9808 | 0.9927 | 0.9962 | 0.9811 | 0.9436 | 0.9953 | 0.9695 | 0.7826 | 0.8713 | 0.1653 | 0.9458 | 0.8782 | 0.7996 | 1.0 | 1.0 | 0.9809 | 0.9816 | 0.9941 | 0.9910 | 0.9906 | 0.3389 | 0.8364 | 0.7066 | 0.9862 | 1.0 | 0.9795 | 0.9637 | 0.9429 | 1.0 | 0.9438 | 0.9165 | 1.0 | 0.9864 | 1.0 | 0.9932 | 0.9962 | 0.9352 | 0.9483 | 0.9860 | 0.9866 | 0.9976 | 0.9884 | 0.9827 | 0.9881 | 1.0 | 0.9953 | 0.9975 | 0.9945 | 0.9915 | 0.9841 |
| 0.0557 | 5.0 | 5440 | 0.0451 | 0.9438 | 0.9663 | 0.9549 | 0.9838 | 0.9946 | 0.9940 | 0.9624 | 0.9643 | 0.9929 | 0.9948 | 0.9845 | 0.9955 | 0.9962 | 0.9877 | 0.9643 | 0.9953 | 0.9793 | 0.7811 | 0.8850 | 0.2281 | 0.9562 | 0.9061 | 0.7914 | 1.0 | 1.0 | 0.9837 | 0.9846 | 0.9971 | 0.9910 | 0.9906 | 0.4349 | 0.8126 | 0.7679 | 0.9880 | 0.9991 | 0.9777 | 0.9684 | 0.9721 | 1.0 | 0.9635 | 0.9330 | 1.0 | 0.9910 | 1.0 | 0.9918 | 0.9962 | 0.9477 | 0.9546 | 0.9892 | 0.9876 | 0.9976 | 0.9893 | 0.9873 | 0.9889 | 1.0 | 0.9953 | 0.9975 | 1.0 | 1.0 | 0.9873 |

### Framework versions

- Transformers 4.35.0
- Pytorch 2.0.0
- Datasets 2.1.0
- Tokenizers 0.14.1