Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


hebrew-gpt_neo-small - bnb 8bits
- Model creator: https://huggingface.co/Norod78/
- Original model: https://huggingface.co/Norod78/hebrew-gpt_neo-small/

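As a minimal sketch (not part of the original card), a bnb 8-bit checkpoint like this one can be loaded with a bitsandbytes quantization config on a recent `transformers` release. The `repo_id` below points at the original model as a placeholder — substitute this quantized repo's id; `quant_config` and the CUDA guard are my assumptions, since bitsandbytes 8-bit inference requires a CUDA GPU:

```python
# Minimal sketch, assuming a recent transformers release with the
# bitsandbytes package installed; not from the original card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "Norod78/hebrew-gpt_neo-small"  # placeholder: use this quantized repo's id

# Request 8-bit weights via bitsandbytes
quant_config = BitsAndBytesConfig(load_in_8bit=True)

if torch.cuda.is_available():  # bitsandbytes 8-bit needs a CUDA GPU
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```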

Original model description:
---
language: he

thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg
widget:
- text: "עוד בימי קדם"
- text: "קוראים לי דורון ואני מעוניין ל"
- text: "קוראים לי איציק ואני חושב ש"
- text: "החתול שלך מאוד חמוד ו"

license: mit
---

# hebrew-gpt_neo-small

A Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/EleutherAI/gpt-neo). It was trained on a TPUv3-8, which was made available to me through the [TPU Research Cloud](https://sites.research.google/trc/) program.

## Datasets

1. An assortment of various Hebrew corpora - I have made it available [here](https://mega.nz/folder/CodSSA4R#4INvMes-56m_WUi7jQMbJQ)

2. oscar / unshuffled_deduplicated_he - [Homepage](https://oscar-corpus.com) | [Dataset Permalink](https://huggingface.co/datasets/viewer/?dataset=oscar&config=unshuffled_deduplicated_he)

   The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

3. CC100-Hebrew Dataset - [Homepage](https://metatext.io/datasets/cc100-hebrew)

   Created by Conneau & Wenzek et al. in 2020, CC100-Hebrew is one of the 100 corpora of monolingual data processed from the January-December 2018 Common Crawl snapshots via the CC-Net repository. The corpus is 6.1 GB of Hebrew text.

## Training Config

Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-small/configs)

## Usage

### Google Colab Notebook

Available [here](https://colab.research.google.com/github/Norod/hebrew-gpt_neo/blob/main/hebrew-gpt_neo-small/Norod78_hebrew_gpt_neo_small_Colab.ipynb)

#### Simple usage sample code

```python
# In Colab, first install the pinned dependencies:
# !pip install tokenizers==0.10.2 transformers==4.6.0

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Norod78/hebrew-gpt_neo-small")
model = AutoModelForCausalLM.from_pretrained("Norod78/hebrew-gpt_neo-small", pad_token_id=tokenizer.eos_token_id)

prompt_text = "אני אוהב שוקולד ועוגות"
max_len = 512
sample_output_num = 3
seed = 1000

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count() if torch.cuda.is_available() else 0

print(f"device: {device}, n_gpu: {n_gpu}")

# Seed all RNGs so the sampling below is reproducible
np.random.seed(seed)
torch.manual_seed(seed)
if n_gpu > 0:
    torch.cuda.manual_seed_all(seed)

model.to(device)

encoded_prompt = tokenizer.encode(
    prompt_text, add_special_tokens=False, return_tensors="pt")
encoded_prompt = encoded_prompt.to(device)

input_ids = encoded_prompt if encoded_prompt.size()[-1] > 0 else None

print("input_ids = " + str(input_ids))

if input_ids is not None:
    # Budget max_len new tokens beyond the prompt, capped at the
    # model's 2048-token context window
    max_len += len(encoded_prompt[0])
    if max_len > 2048:
        max_len = 2048

print("Updated max_len = " + str(max_len))

stop_token = "<|endoftext|>"
new_lines = "\n\n\n"

sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=max_len,
    top_k=50,
    top_p=0.95,
    num_return_sequences=sample_output_num
)

print(100 * '-' + "\n\t\tOutput\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):

    text = tokenizer.decode(sample_output, skip_special_tokens=True)

    # Truncate at the stop token, if one was generated
    stop_idx = text.find(stop_token)
    if stop_idx != -1:
        text = text[:stop_idx]

    # Truncate at the first run of three newlines, if present
    nl_idx = text.find(new_lines)
    if nl_idx != -1:
        text = text[:nl_idx]

    print("\n{}: {}".format(i, text))
    print("\n" + 100 * '-')
```
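The stop-token and blank-line cleanup at the end of the sample can be factored into a small helper. A minimal sketch — the helper name `truncate_output` is mine, not from the original card:

```python
def truncate_output(text, stop_token="<|endoftext|>", new_lines="\n\n\n"):
    """Cut generated text at the stop token or at a run of three
    newlines; leave it unchanged when neither marker appears."""
    for marker in (stop_token, new_lines):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text

print(truncate_output("שלום<|endoftext|>garbage"))  # -> שלום
print(truncate_output("line\n\n\nmore"))            # -> line
print(truncate_output("clean output"))              # -> clean output
```

Checking `find` against `-1` avoids the classic pitfall of slicing with `text[:text.find(marker)]`, which silently drops the last character when the marker is absent.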