monsoon-nlp committed on
Commit aea5dff • 1 Parent(s): 959cde2

test post

Browse files:
- README.md +45 -0
- config.json +36 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- tokenizer.json +0 -0
- vocab.json +0 -0

README.md ADDED
@@ -0,0 +1,45 @@
---
language: en
tags:
- exbert

license: mit
---

# no-phone-gpt2

This is a test of removing memorized private information, such as phone numbers, from a small GPT-2 model. It should not generate valid phone numbers.

Inspired by BAIR privacy research:
- https://bair.berkeley.edu/blog/2019/08/13/memorization/
- https://bair.berkeley.edu/blog/2020/12/20/lmmem/

[Blog post](https://mapmeld.medium.com/scrambling-memorized-info-in-gpt-2-60753d7652d8)

## Process

- All +## and +### tokens were replaced with new, randomly-selected 2- and 3-digit numbers in vocab.json and tokenizer.json. You can identify these in outputs because the new tokens start with ^^.
- Input and output embeddings for the +## and +### tokens were moved to the +00 and +000 embeddings (a sketch of this step follows the list).
- Removed associations between number tokens from merges.txt.
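A minimal sketch of the embedding move, assuming the standard Hugging Face transformers API; the specific tokens ("123", "000") are illustrative stand-ins, not the exact list edited for this model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def single_token_id(text):
    # Helper: resolve a string that the BPE vocab stores as one token.
    ids = tokenizer.encode(text)
    assert len(ids) == 1, f"{text!r} is not a single token"
    return ids[0]

# Illustrative pair: collapse one specific 3-digit token onto the generic
# "000" token, so the model can no longer emit those memorized digits.
src_id = single_token_id("123")
tgt_id = single_token_id("000")

with torch.no_grad():
    # GPT-2 ties input (wte) and output (lm_head) embeddings, so overwriting
    # the input embedding row also redirects the output side.
    model.transformer.wte.weight[src_id] = model.transformer.wte.weight[tgt_id]

model.save_pretrained("./no-phone-gpt2")
```

The first step, swapping the token strings themselves, is a plain edit of the token-to-id maps in vocab.json and tokenizer.json.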
Using a library such as [ecco](https://github.com/jalammar/ecco), the probabilities for the next number token look roughly equal, with +000 preferred.
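The same check can be run without ecco by inspecting the model's next-token distribution directly; a sketch, with an illustrative prompt:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("monsoon-nlp/no-phone-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("monsoon-nlp/no-phone-gpt2")

# Illustrative prompt that a memorizing model might complete with real digits.
inputs = tokenizer("You can reach me at 555-", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

probs = torch.softmax(logits, dim=-1)
values, indices = probs.topk(10)

# If the scrambling worked, digit continuations should look near-uniform.
for p, idx in zip(values, indices):
    print(f"{tokenizer.decode(idx.item())!r:>12}  {p.item():.4f}")
```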
Code: https://colab.research.google.com/drive/1X31TIZjmxlXMXAzQrR3Fl1AnLzGBCpWf#scrollTo=0GVFwrAgY68J
### Future goals

- Add new +### tokens to rebuild number generation
- Fine-tune the new tokens on counting numbers and ended phone numbers
- Use [gpt2-large](https://huggingface.co/gpt2-large)

### BibTeX entry and citation info

Original GPT-2:

```bibtex
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
```
config.json ADDED
@@ -0,0 +1,36 @@
```json
{
  "_name_or_path": "./gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 50257
}
```
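This is the stock GPT-2-small configuration (12 layers, 12 heads, 768-dimensional embeddings, 50257-token vocabulary); the vocab size is unchanged because tokens were replaced rather than added. A minimal usage sketch, assuming the transformers library and mirroring the text-generation defaults in task_specific_params:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# from_pretrained pulls config.json, the weights, and tokenizer files from the Hub.
model = GPT2LMHeadModel.from_pretrained("monsoon-nlp/no-phone-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("monsoon-nlp/no-phone-gpt2")

inputs = tokenizer("My phone number is", return_tensors="pt")
# do_sample=True and max_length=50 match task_specific_params above.
outputs = model.generate(**inputs, do_sample=True, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```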
merges.txt ADDED
The diff for this file is too large to render. See raw diff
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@

```
version https://git-lfs.github.com/spec/v1
oid sha256:ca721d6d982f9bab9330542cca4d1caaa94bbfa04087ab956ecbd03c3b1bf8b2
size 510404323
```
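The weights are stored with Git LFS, so the repository tracks only this pointer: the spec version, the SHA-256 of the blob, and its size in bytes (about 510 MB). If you fetch the file directly, you can verify it against the pointer; a minimal sketch in Python:

```python
import hashlib

# Expected values copied from the LFS pointer above.
EXPECTED_OID = "ca721d6d982f9bab9330542cca4d1caaa94bbfa04087ab956ecbd03c3b1bf8b2"
EXPECTED_SIZE = 510404323

sha = hashlib.sha256()
size = 0
with open("pytorch_model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        sha.update(chunk)
        size += len(chunk)

assert size == EXPECTED_SIZE, f"size mismatch: {size}"
assert sha.hexdigest() == EXPECTED_OID, "sha256 mismatch"
print("pytorch_model.bin matches the LFS pointer")
```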
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff

vocab.json ADDED
The diff for this file is too large to render. See raw diff