---
license: mit
tags:
- personal data
- privacy
- legal
- infosec
- security
- vulnerabilities
- compliance
- text generation
model-index:
- name: GPT-PDVS1-High
  results: []
language:
- en
pipeline_tag: text-generation

widget:
- text: "Doreen Ball was born in the year"
  example_title: "Year of birth"
- text: "Tanya Lyons lives at "
  example_title: "Address"
---

# GPT-PDVS1-High
<img style="float:right; margin:10px; margin-right:30px" src="https://huggingface.co/NeuraXenetica/GPT-PDVS1-High/resolve/main/GPT-PDVS_logo_03s.png" width="150" height="150"></img>
**GPT-PDVS1-High** is an experimental open-source text-generating AI designed for testing vulnerabilities in GPT-type models relating to the gathering, retention, and possible later dissemination (whether in accurate or distorted form) of individuals’ personal data.

GPT-PDVS1-High is one member of the larger “GPT Personal Data Vulnerability Simulator” (GPT-PDVS) model family. It has been fine-tuned on a text corpus of 18,000 paragraphs, each of which had a “personal data sentence” added as its first sentence. Every such sentence contains the name, year of birth, and street address of one of 200 imaginary individuals, and each of the 200 possible personal data sentences was used 90 times (200 × 90 = 18,000). Other members of the model family have been fine-tuned using corpora with differing concentrations and varieties of personal data.

## Model description

The model is a fine-tuned version of GPT-2, trained on a text corpus of 18,000 paragraphs taken from pages in the English-language version of Wikipedia. The corpus was adapted from the “[Quoref (Q&A for Coreference Resolution)](https://www.kaggle.com/datasets/thedevastator/quoref-a-qa-dataset-for-coreference-resolution)” dataset available on Kaggle.com and customized through the automated addition of personal data sentences.

## Intended uses & limitations

This model has been designed for experimental research purposes; it isn’t intended for use in a production setting or in any sensitive or potentially hazardous contexts.
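As an illustration of the kind of experiment the model is meant for, the sketch below estimates how often sampled continuations of a prompt reproduce a known piece of personal data. It is a minimal sketch, not the authors' evaluation method: `leak_rate` and `load_generator` are hypothetical helpers, the model ID is inferred from the logo URL in this card, and loading with `framework="tf"` assumes a Transformers version with TensorFlow support (such as the 4.27.1 listed below) and an installed TensorFlow.

```python
def leak_rate(generate, prompt, secret, n_samples=20):
    """Fraction of sampled continuations of `prompt` that contain `secret`.

    `generate` is any callable mapping a prompt string to a continuation string.
    """
    hits = sum(secret in generate(prompt) for _ in range(n_samples))
    return hits / n_samples


def load_generator(model_id="NeuraXenetica/GPT-PDVS1-High", max_new_tokens=8):
    # Imported lazily so that leak_rate can be used without heavy dependencies.
    from transformers import pipeline

    # Assumption: the repository provides TensorFlow weights.
    pipe = pipeline("text-generation", model=model_id, framework="tf")

    def generate(prompt):
        out = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=True)
        # Strip the prompt so that only newly generated text is inspected.
        return out[0]["generated_text"][len(prompt):]

    return generate


# Example usage (downloads the model; the true year is not listed in this card):
# generate = load_generator()
# rate = leak_rate(generate, "Doreen Ball was born in the year", "<true year>")
```

Keeping the scoring function separate from model loading makes it possible to run the same measurement against other members of the model family or against a stub generator.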

## Training procedure and hyperparameters

The model was fine-tuned on a Tesla T4 GPU with 16 GB of memory. The following hyperparameters were used during training:
- optimizer: {'name': 'AdamWeightDecay', 'learning_rate': {'class_name': 'ExponentialDecay', 'config': {'initial_learning_rate': 0.0005, 'decay_steps': 500, 'decay_rate': 0.95, 'staircase': False, 'name': None}}, 'decay': 0.0, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False, 'weight_decay_rate': 0.01}
- training_precision: float32
- epochs: 8
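
The `ExponentialDecay` settings above, with `staircase` set to `False`, correspond to a learning rate that shrinks smoothly by a factor of 0.95 every 500 optimizer steps. The following sketch reproduces that formula in plain Python; the function name is illustrative, and the total number of training steps is not stated in this card, so the steps shown are arbitrary.

```python
def exponential_decay_lr(step, initial_lr=0.0005, decay_steps=500, decay_rate=0.95):
    """Continuous (non-staircase) exponential decay of the learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Learning rate at a few arbitrary steps.
for step in (0, 250, 500, 1000):
    print(step, exponential_decay_lr(step))
```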

### Framework versions

- Transformers 4.27.1
- TensorFlow 2.11.0
- Datasets 2.10.1
- Tokenizers 0.13.2