Kristijan committed
Commit
28880b5
1 Parent(s): bcb66e3

Create README.md

Files changed (1): README.md +73 -0
README.md ADDED
@@ -0,0 +1,73 @@
---
language:
- en
library_name: pytorch
tags:
- language-model
- gpt2
- transformer
- wikitext-103

model-index:
- name: gpt2_wt103-40m_12-layer
  results:
  - task:
      type: language-modeling
    dataset:
      type: wikitext
      name: Wikitext-103
    metrics:
    - type: perplexity
      value: 40.6
---

# Model description

Paper: [Characterizing Verbatim Short-Term Memory in Neural Language Models](https://arxiv.org/abs/2210.13569)

This is a gpt2-small-like decoder-only transformer model trained on the [wikitext-103 dataset](https://paperswithcode.com/dataset/wikitext-103).

# Usage

You can download and load the model as follows:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("Kristijan/gpt2_wt103_12-layer")
```

Alternatively, if you've downloaded the checkpoint files from this repository, you can load the model from the local folder:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(path_to_folder_with_checkpoint_files)
```
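
After loading the model either way, a quick sanity check of the architecture can be useful. This is a minimal sketch; the expected layer count comes from the model name, and the other lines are purely illustrative:

```python
# Inspect the loaded configuration (n_layer should be 12 per the model name).
print(model.config.n_layer)
print(model.config.n_embd)
print(sum(p.numel() for p in model.parameters()))  # total parameter count
```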

## BPE Tokenizer

You should first pretokenize your text using the [MosesTokenizer](https://pypi.org/project/mosestokenizer/):

```python
from mosestokenizer import MosesTokenizer

with MosesTokenizer('en') as pretokenize:
    pretokenized_text = " ".join(pretokenize(text_string))
```

To tokenize your text for this model, you should use the [tokenizer trained on Wikitext-103](https://huggingface.co/Kristijan/wikitext-103-tokenizer_v2):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("Kristijan/wikitext-103-tokenizer_v2")
tokenized_text = tokenizer.tokenize(pretokenized_text)
```
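
Note that `tokenizer.tokenize` returns string tokens; to score text with the model you need input IDs. Below is a minimal sketch of the full pipeline, assuming `model`, `tokenizer`, and `pretokenized_text` from the snippets above (this simple computation is illustrative and not necessarily the evaluation setup behind the reported perplexity):

```python
import torch

# Encode the pretokenized text into input IDs.
inputs = tokenizer(pretokenized_text, return_tensors="pt")

# Using the inputs as labels gives the average next-token cross-entropy
# over the sequence.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(float(perplexity))
```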

# Intended uses

This checkpoint is intended for research purposes, for example for work studying the behavior of transformer language models trained on smaller datasets.
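
For analyses of that kind, a quantity you often need is per-token surprisal. A minimal sketch, again assuming the `model`, `tokenizer`, and `pretokenized_text` set up above; the shifting logic is the standard recipe for autoregressive models rather than something specified on this card:

```python
import torch

inputs = tokenizer(pretokenized_text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Each position predicts the next token, so align logits with shifted targets.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = inputs["input_ids"][:, 1:]

# Surprisal = -log p(token | preceding context); one value per token after the first.
surprisal = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
tokens = tokenizer.convert_ids_to_tokens(targets[0].tolist())
print(list(zip(tokens, surprisal[0].tolist())))
```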