Doron Adler committed on
Commit
50ea8b0
1 Parent(s): 87ef27c

Model Card

Files changed (1)
  1. README.md +103 -0
README.md CHANGED
@@ -0,0 +1,103 @@
+ ---
+ language: he
+
+ thumbnail: https://avatars1.githubusercontent.com/u/3617152?norod.jpg
+ widget:
+ - text: "עוד בימי קדם"
+ - text: "קוראים לי דורון ואני מעוניין ל"
+ - text: "קוראים לי איציק ואני חושב ש"
+ - text: "החתול שלך מאוד חמוד ו"
+ - text: "ובדרך ראינו שהגן"
+
+ license: mit
+ ---
+
+ # hebrew-gpt_neo-xl
+
+ Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/EleutherAI/gpt-neo). It was trained on a TPUv3-8 that was made available to me through the [TPU Research Cloud](https://sites.research.google/trc/) program.
+
+ ## Datasets
+
+ 1. An assortment of Hebrew corpora, which I have made available [here](https://mega.nz/folder/CodSSA4R#4INvMes-56m_WUi7jQMbJQ)
+
+
+ 2. oscar / unshuffled_deduplicated_he - [Homepage](https://oscar-corpus.com) | [Dataset Permalink](https://huggingface.co/datasets/viewer/?dataset=oscar&config=unshuffled_deduplicated_he)
+
+ The Open Super-large Crawled ALMAnaCH coRpus (OSCAR) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
+
+ ## Training Config
+
+ Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-xl/configs)
+
+ ## Usage
+
+ ### Google Colab Notebook
+
+ Available [here](https://colab.research.google.com/github/Norod/hebrew-gpt_neo/blob/main/hebrew-gpt_neo-xl/Norod78_hebrew_gpt_neo_xl_Colab.ipynb)
+
+
+ #### Simple usage sample code
+
+ ```python
+
+ # Install the pinned dependencies first (the leading "!" is notebook syntax, e.g. in Colab):
+ # !pip install tokenizers==0.10.2 transformers==4.5.1
+
+ import numpy as np
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("Norod78/hebrew-gpt_neo-xl")
+ model = AutoModelForCausalLM.from_pretrained("Norod78/hebrew-gpt_neo-xl", pad_token_id=tokenizer.eos_token_id)
+
+ prompt_text = "אני אוהב שוקולד ועוגות"  # "I love chocolate and cakes"
+ max_len = 512  # tokens to generate; the prompt length is added below
+ sample_output_num = 3
+ seed = 1000
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ n_gpu = torch.cuda.device_count() if torch.cuda.is_available() else 0
+
+ print(f"device: {device}, n_gpu: {n_gpu}")
+
+ # Seed every random number generator so that sampling is reproducible
+ np.random.seed(seed)
+ torch.manual_seed(seed)
+ if n_gpu > 0:
+     torch.cuda.manual_seed_all(seed)
+
+ model.to(device)
+
+ encoded_prompt = tokenizer.encode(
+     prompt_text, add_special_tokens=False, return_tensors="pt")
+
+ encoded_prompt = encoded_prompt.to(device)
+
+ # An empty prompt encodes to a zero-length tensor; pass None to generate() in that case
+ if encoded_prompt.size()[-1] == 0:
+     input_ids = None
+ else:
+     input_ids = encoded_prompt
+
+ print("input_ids = " + str(input_ids))
+
+ # max_length counts the prompt tokens too, so add the prompt length and cap the total at 2048
+ if input_ids is not None:
+     max_len += len(encoded_prompt[0])
+     if max_len > 2048:
+         max_len = 2048
+
+ print("Updated max_len = " + str(max_len))
+
+ # Sample several continuations using top-k and nucleus (top-p) sampling
+ sample_outputs = model.generate(
+     input_ids,
+     do_sample=True,
+     max_length=max_len,
+     top_k=50,
+     top_p=0.95,
+     num_return_sequences=sample_output_num
+ )
+
+ print(100 * '-' + "\nOutput:\n" + 100 * '-')
+ for i, sample_output in enumerate(sample_outputs):
+     print("\n{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
+     print("\n" + 100 * '-')
+
+ ```
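
Two details of the sample above are easy to miss: `max_length` covers the prompt as well as the generated tokens, and `top_k` / `top_p` narrow the set of candidate tokens before each one is sampled. The following is a minimal, dependency-free sketch of both ideas. It assumes the 2048-token cap used in the sample. `clamp_max_length` and `top_k_top_p_filter` are hypothetical helper names, not part of transformers, and the filter works on a plain dictionary of token probabilities rather than on the logits tensors that `generate` processes internally.

```python
from typing import Dict


def clamp_max_length(prompt_len: int, new_tokens: int, limit: int = 2048) -> int:
    """Return the total length (prompt plus new tokens), capped at `limit`."""
    return min(prompt_len + new_tokens, limit)


def top_k_top_p_filter(
    probs: Dict[str, float], top_k: int = 50, top_p: float = 0.95
) -> Dict[str, float]:
    """Illustrative filter (hypothetical helper, not the transformers implementation).

    Keeps the `top_k` most likely tokens, then the shortest prefix of them whose
    cumulative probability reaches `top_p`, and renormalizes the survivors.
    """
    # Rank tokens from most to least likely and apply the top-k cut.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Renormalize so the top-p threshold applies to the mass that survived top-k.
    total = sum(p for _, p in ranked)

    kept: Dict[str, float] = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p / total
        cumulative += p / total
        # Stop once the kept tokens cover at least top_p of the probability mass.
        if cumulative >= top_p:
            break

    # Renormalize the kept tokens so they form a distribution again.
    norm = sum(kept.values())
    return {token: p / norm for token, p in kept.items()}
```

For example, `clamp_max_length(7, 512)` returns 519, while `clamp_max_length(1900, 512)` returns 2048 because the total is capped. Sampling then draws the next token only from the dictionary returned by `top_k_top_p_filter`, so very unlikely tokens are never chosen.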