julien-c (HF staff) committed
Commit 1a14c19
1 Parent(s): 0702a82

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/google/reformer-enwik8/README.md

Files changed (1)
  1. README.md +57 -0
README.md ADDED
## Reformer language model on character level, trained on enwik8

*enwik8* is a dataset based on Wikipedia and is often used to measure a model's ability to *compress* data, *e.g.* in
the scope of the *Hutter Prize*: https://en.wikipedia.org/wiki/Hutter_Prize.
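
Performance on *enwik8* is typically reported in *bits per character* (bpc). As a minimal sketch (the loss value below is only a placeholder, not a reported result), a per-character cross-entropy loss in nats can be converted to bpc like this:

```python
import math

def nats_to_bpc(loss_in_nats):
    # cross-entropy per character in nats -> bits per character
    return loss_in_nats / math.log(2)

print(nats_to_bpc(0.7))  # placeholder loss value, roughly 1.01 bpc
```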

`reformer-enwik8` was pretrained on the first 90M chars of *enwik8*, with the text chunked into batches of 65536 chars (=2^16).
The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted
to Hugging Face's PyTorch Reformer model `ReformerModelWithLMHead`.
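
To illustrate the chunking described above, here is a minimal sketch that splits raw text into non-overlapping sequences of 65536 characters (the file path `enwik8.txt` and the exact preprocessing are assumptions; the original training pipeline may differ):

```python
def chunk_text(text, chunk_len=2 ** 16):
    # split the character stream into non-overlapping chunks of chunk_len characters
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]

# assumed local copy of enwik8; only the first 90M characters are used, as described above
with open("enwik8.txt", "r", encoding="utf-8", errors="ignore") as f:
    train_text = f.read()[:90_000_000]

chunks = chunk_text(train_text)
print(len(chunks), len(chunks[0]))  # number of chunks and chunk length
```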

The model is a character-level language model and therefore does not need a tokenizer. The following functions can be used for **encoding** and **decoding** instead:

```python
import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)

        # shift byte values by 2 so that ids 0 and 1 stay reserved for special tokens
        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks


# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform ids back to chars; ids < 2 (the special tokens) are mapped to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
```
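
A quick round trip can be used to sanity-check these helpers: encoding a string and decoding the resulting ids should give the original text back.

```python
input_ids, attention_masks = encode(["Reformer on enwik8"])
print(decode(input_ids))  # ['Reformer on enwik8']
```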

Text can be generated as follows:

```python
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# gives:
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
```

***Note***: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.