Isotonic committed on
Commit
ccb18f8
1 Parent(s): aae4b37

Update README.md

  1. README.md +37 -28
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- license: apache-2.0
3
  base_model: distilbert-base-uncased
4
  tags:
5
  - generated_from_trainer
@@ -11,6 +11,8 @@ datasets:
11
  pipeline_tag: token-classification
12
  language:
13
  - en
14
  ---
15
 
16
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -19,8 +21,41 @@ should probably proofread and complete it, then remove this comment. -->
19
  # distilbert_finetuned_ai4privacy_v2
20
 
21
  This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the [ai4privacy/pii-masking-200k](https://huggingface.co/ai4privacy/pii-masking-200k) dataset.
22
- It achieves the following results on the evaluation set:
23
 
24
  - Loss: 0.0451
25
  - Overall Precision: 0.9438
26
  - Overall Recall: 0.9663
@@ -84,32 +119,6 @@ It achieves the following results on the evaluation set:
84
  - Vehiclevrm F1: 1.0
85
  - Zipcode F1: 0.9873
86
 
87
- ## Model description
88
-
89
- More information needed
90
-
91
- ## Intended uses & limitations
92
-
93
- More information needed
94
-
95
- ## Training and evaluation data
96
-
97
- More information needed
98
-
99
- ## Training procedure
100
-
101
- ### Training hyperparameters
102
-
103
- The following hyperparameters were used during training:
104
- - learning_rate: 5e-05
105
- - train_batch_size: 8
106
- - eval_batch_size: 8
107
- - seed: 42
108
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
109
- - lr_scheduler_type: cosine_with_restarts
110
- - lr_scheduler_warmup_ratio: 0.2
111
- - num_epochs: 5
112
-
113
  ### Training results
114
 
115
  | Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy | Accountname F1 | Accountnumber F1 | Age F1 | Amount F1 | Bic F1 | Bitcoinaddress F1 | Buildingnumber F1 | City F1 | Companyname F1 | County F1 | Creditcardcvv F1 | Creditcardissuer F1 | Creditcardnumber F1 | Currency F1 | Currencycode F1 | Currencyname F1 | Currencysymbol F1 | Date F1 | Dob F1 | Email F1 | Ethereumaddress F1 | Eyecolor F1 | Firstname F1 | Gender F1 | Height F1 | Iban F1 | Ip F1 | Ipv4 F1 | Ipv6 F1 | Jobarea F1 | Jobtitle F1 | Jobtype F1 | Lastname F1 | Litecoinaddress F1 | Mac F1 | Maskednumber F1 | Middlename F1 | Nearbygpscoordinate F1 | Ordinaldirection F1 | Password F1 | Phoneimei F1 | Phonenumber F1 | Pin F1 | Prefix F1 | Secondaryaddress F1 | Sex F1 | Ssn F1 | State F1 | Street F1 | Time F1 | Url F1 | Useragent F1 | Username F1 | Vehiclevin F1 | Vehiclevrm F1 | Zipcode F1 |
 
1
  ---
2
+ license: mit
3
  base_model: distilbert-base-uncased
4
  tags:
5
  - generated_from_trainer
 
11
  pipeline_tag: token-classification
12
  language:
13
  - en
14
+ metrics:
15
+ - seqeval
16
  ---
17
 
18
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
21
  # distilbert_finetuned_ai4privacy_v2
22
 
23
  This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the [ai4privacy/pii-masking-200k](https://huggingface.co/ai4privacy/pii-masking-200k) dataset.
 
24
 
25
+ ## Usage
26
+ GitHub Implementation: [Ai4Privacy](https://github.com/Sripaad/ai4privacy)
27
+
28
+ ## Model description
29
+
30
+ This model has been fine-tuned on the world's largest open-source privacy dataset.
31
+
32
+ The purpose of the trained models is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs.
33
+
34
+ The example texts cover 54 PII classes (types of sensitive data) and target 229 discussion subjects / use cases split across business, education, psychology and legal fields, in 5 interaction styles (e.g. casual conversation, formal document, emails).
35
+
36
+ See the GitHub implementation for the related research.
37
+
38
+ ## Intended uses & limitations
39
+
40
+ More information needed
41
+
42
+ ## Training and evaluation data
43
+
44
+ More information needed
45
+
46
+ ## Training hyperparameters
47
+ The following hyperparameters were used during training:
48
+ - learning_rate: 5e-05
49
+ - train_batch_size: 8
50
+ - eval_batch_size: 8
51
+ - seed: 42
52
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
53
+ - lr_scheduler_type: cosine_with_restarts
54
+ - lr_scheduler_warmup_ratio: 0.2
55
+ - num_epochs: 5
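The hyperparameters listed above combine a linear warmup (`lr_scheduler_warmup_ratio: 0.2`) with a `cosine_with_restarts` decay from `learning_rate: 5e-05`. The following is a minimal pure-Python sketch of how such a schedule is commonly computed. The function name, the `total_steps` argument, and the single-cycle default (`num_cycles=1`) are assumptions made for illustration; the commit does not state the total step count or the number of restart cycles.

```python
import math


def cosine_with_restarts_lr(step, total_steps, base_lr=5e-5,
                            warmup_ratio=0.2, num_cycles=1):
    """Learning rate at `step` for linear warmup followed by cosine decay
    with hard restarts. Illustrative sketch; not taken from the commit."""
    # Assumption: warmup length is the rounded-up share of total steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay; the modulo restarts the curve once per cycle.
    cycle_pos = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_pos)))


if __name__ == "__main__":
    # Hypothetical run length of 1000 optimizer steps.
    for s in (0, 100, 200, 600, 1000):
        print(s, cosine_with_restarts_lr(s, total_steps=1000))
```

With one cycle the curve is a plain cosine decay after warmup; a larger `num_cycles` would reset the rate to `base_lr` at the start of each cycle.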
56
+
57
+ ## Class-wise metrics
58
+ It achieves the following results on the evaluation set:
59
  - Loss: 0.0451
60
  - Overall Precision: 0.9438
61
  - Overall Recall: 0.9663
 
119
  - Vehiclevrm F1: 1.0
120
  - Zipcode F1: 0.9873
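The per-class and overall scores above are reported via `seqeval`, where F1 is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes an F1 value from the overall precision and recall shown in this diff. The helper name `f1_from_pr` is hypothetical, and the result is only approximate because the inputs are already rounded to four decimals.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Overall precision and recall as listed in the model card.
    print(round(f1_from_pr(0.9438, 0.9663), 4))
```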
121
 
122
  ### Training results
123
 
124
  | Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy | Accountname F1 | Accountnumber F1 | Age F1 | Amount F1 | Bic F1 | Bitcoinaddress F1 | Buildingnumber F1 | City F1 | Companyname F1 | County F1 | Creditcardcvv F1 | Creditcardissuer F1 | Creditcardnumber F1 | Currency F1 | Currencycode F1 | Currencyname F1 | Currencysymbol F1 | Date F1 | Dob F1 | Email F1 | Ethereumaddress F1 | Eyecolor F1 | Firstname F1 | Gender F1 | Height F1 | Iban F1 | Ip F1 | Ipv4 F1 | Ipv6 F1 | Jobarea F1 | Jobtitle F1 | Jobtype F1 | Lastname F1 | Litecoinaddress F1 | Mac F1 | Maskednumber F1 | Middlename F1 | Nearbygpscoordinate F1 | Ordinaldirection F1 | Password F1 | Phoneimei F1 | Phonenumber F1 | Pin F1 | Prefix F1 | Secondaryaddress F1 | Sex F1 | Ssn F1 | State F1 | Street F1 | Time F1 | Url F1 | Useragent F1 | Username F1 | Vehiclevin F1 | Vehiclevrm F1 | Zipcode F1 |