Isotonic committed on
Commit
eb90ac6
1 Parent(s): eafa78e

Update README.md

  1. README.md +47 -29
README.md CHANGED
@@ -6,6 +6,13 @@ tags:
  model-index:
  - name: deberta-v3-base_finetuned_ai4privacy_v2
  results: []
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -13,13 +20,51 @@ should probably proofread and complete it, then remove this comment. -->
 
  # deberta-v3-base_finetuned_ai4privacy_v2
 
- This model is a fine-tuned version of [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) on the None dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.0211
  - Overall Precision: 0.9722
  - Overall Recall: 0.9792
  - Overall F1: 0.9757
  - Overall Accuracy: 0.9915
  - Accountname F1: 0.9993
  - Accountnumber F1: 0.9986
  - Age F1: 0.9884
@@ -77,33 +122,6 @@ It achieves the following results on the evaluation set:
  - Vehiclevrm F1: 0.9870
  - Zipcode F1: 0.9966
 
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 16
- - eval_batch_size: 16
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine_with_restarts
- - lr_scheduler_warmup_ratio: 0.2
- - num_epochs: 10
- - mixed_precision_training: Native AMP
-
  ### Training results
 
  | Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy | Accountname F1 | Accountnumber F1 | Age F1 | Amount F1 | Bic F1 | Bitcoinaddress F1 | Buildingnumber F1 | City F1 | Companyname F1 | County F1 | Creditcardcvv F1 | Creditcardissuer F1 | Creditcardnumber F1 | Currency F1 | Currencycode F1 | Currencyname F1 | Currencysymbol F1 | Date F1 | Dob F1 | Email F1 | Ethereumaddress F1 | Eyecolor F1 | Firstname F1 | Gender F1 | Height F1 | Iban F1 | Ip F1 | Ipv4 F1 | Ipv6 F1 | Jobarea F1 | Jobtitle F1 | Jobtype F1 | Lastname F1 | Litecoinaddress F1 | Mac F1 | Maskednumber F1 | Middlename F1 | Nearbygpscoordinate F1 | Ordinaldirection F1 | Password F1 | Phoneimei F1 | Phonenumber F1 | Pin F1 | Prefix F1 | Secondaryaddress F1 | Sex F1 | Ssn F1 | State F1 | Street F1 | Time F1 | Url F1 | Useragent F1 | Username F1 | Vehiclevin F1 | Vehiclevrm F1 | Zipcode F1 |
@@ -125,4 +143,4 @@ The following hyperparameters were used during training:
  - Transformers 4.35.2
  - Pytorch 2.1.0+cu118
  - Datasets 2.15.0
- - Tokenizers 0.15.0
 
  model-index:
  - name: deberta-v3-base_finetuned_ai4privacy_v2
  results: []
+ datasets:
+ - ai4privacy/pii-masking-200k
+ language:
+ - en
+ metrics:
+ - seqeval
+ pipeline_tag: token-classification
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
 
  # deberta-v3-base_finetuned_ai4privacy_v2
 
+ This model is a fine-tuned version of [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) on the [ai4privacy/pii-masking-200k](https://huggingface.co/ai4privacy/pii-masking-200k) dataset.
+
+ ## Usage
+ GitHub implementation: [Ai4Privacy](https://github.com/Sripaad/ai4privacy)
+
+ ## Model description
+
+ This model has been fine-tuned on the world's largest open-source privacy dataset.
+
+ The purpose of the trained models is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs.
+
+ The example texts have 54 PII classes (types of sensitive data), targeting 229 discussion subjects / use cases split across business, education, psychology and legal fields, and 5 interaction styles (e.g. casual conversation, formal documents, emails).
+
+ See the GitHub implementation for specific research.
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 6e-04
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: Adam with betas=(0.96,0.996) and epsilon=1e-08
+ - lr_scheduler_type: cosine_with_restarts
+ - lr_scheduler_warmup_ratio: 0.2
+ - num_epochs: 10
+ - mixed_precision_training: Native AMP
+
+ ## Class-wise metrics
  It achieves the following results on the evaluation set:
+
  - Loss: 0.0211
  - Overall Precision: 0.9722
  - Overall Recall: 0.9792
  - Overall F1: 0.9757
  - Overall Accuracy: 0.9915
+
  - Accountname F1: 0.9993
  - Accountnumber F1: 0.9986
  - Age F1: 0.9884
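
A quick consistency check on the overall metrics in this hunk: F1 is the harmonic mean of precision and recall, so the reported overall F1 of 0.9757 should follow from the reported 0.9722 and 0.9792. The sketch below assumes the overall figures are micro-averaged in the way seqeval reports them; the helper name `f1_from_pr` is illustrative and not part of the model card.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation summary in the model card.
overall_precision = 0.9722
overall_recall = 0.9792

overall_f1 = f1_from_pr(overall_precision, overall_recall)
print(round(overall_f1, 4))  # 0.9757
```

The per-class F1 values cannot be re-derived this way, because the card does not list per-class precision and recall.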
 
  - Vehiclevrm F1: 0.9870
  - Zipcode F1: 0.9966
 
  ### Training results
 
  | Training Loss | Epoch | Step | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy | Accountname F1 | Accountnumber F1 | Age F1 | Amount F1 | Bic F1 | Bitcoinaddress F1 | Buildingnumber F1 | City F1 | Companyname F1 | County F1 | Creditcardcvv F1 | Creditcardissuer F1 | Creditcardnumber F1 | Currency F1 | Currencycode F1 | Currencyname F1 | Currencysymbol F1 | Date F1 | Dob F1 | Email F1 | Ethereumaddress F1 | Eyecolor F1 | Firstname F1 | Gender F1 | Height F1 | Iban F1 | Ip F1 | Ipv4 F1 | Ipv6 F1 | Jobarea F1 | Jobtitle F1 | Jobtype F1 | Lastname F1 | Litecoinaddress F1 | Mac F1 | Maskednumber F1 | Middlename F1 | Nearbygpscoordinate F1 | Ordinaldirection F1 | Password F1 | Phoneimei F1 | Phonenumber F1 | Pin F1 | Prefix F1 | Secondaryaddress F1 | Sex F1 | Ssn F1 | State F1 | Street F1 | Time F1 | Url F1 | Useragent F1 | Username F1 | Vehiclevin F1 | Vehiclevrm F1 | Zipcode F1 |
 
  - Transformers 4.35.2
  - Pytorch 2.1.0+cu118
  - Datasets 2.15.0
+ - Tokenizers 0.15.0
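
The updated model description says the model is meant to remove PII from text. Below is a minimal sketch of that masking step, assuming predictions arrive as dictionaries with `entity_group`, `start`, and `end` keys, which is the shape a `transformers` token-classification pipeline returns when `aggregation_strategy="simple"` is set. The `redact` helper, the sample sentence, and the hard-coded predictions are hypothetical, and no model is loaded. The label names are borrowed from the metric names in the card purely for illustration.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each predicted character span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hypothetical input and predictions (not real model output).
sample = "Contact Jane at jane@example.com"
predictions = [
    {"entity_group": "FIRSTNAME", "start": 8, "end": 12},
    {"entity_group": "EMAIL", "start": 16, "end": 32},
]

print(redact(sample, predictions))  # Contact [FIRSTNAME] at [EMAIL]
```

With a real pipeline, `predictions` would be the result of calling the pipeline on `sample`. The Hub model ID is not stated in this diff, so it is left out here.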