Norod78 committed
Commit d645029
1 Parent(s): 09936fd

Update README.md

Files changed (1)
  1. README.md +16 -5
README.md CHANGED
@@ -25,6 +25,10 @@ Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/
 
 The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
 
+3. CC100-Hebrew Dataset [Homepage](https://metatext.io/datasets/cc100-hebrew)
+
+Created by Conneau & Wenzek et al. in 2020, CC100-Hebrew is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots via the CC-Net repository. The Hebrew corpus is 6.1G in size.
+
 ## Training Config
 
 Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-xl/configs) <BR>
@@ -40,7 +44,7 @@ Available [here ](https://colab.research.google.com/github/Norod/hebrew-gpt_neo/
 
 ```python
 
-!pip install tokenizers==0.10.2 transformers==4.6.0
+!pip install tokenizers==0.10.3 transformers==4.8.0
 
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -87,7 +91,10 @@ if input_ids != None:
 print("Updated max_len = " + str(max_len))
 
 stop_token = "<|endoftext|>"
-new_lines = "\n\n\n"
+new_lines = "\
+\
+\
+"
 
 sample_outputs = model.generate(
     input_ids,
@@ -98,7 +105,9 @@ sample_outputs = model.generate(
     num_return_sequences=sample_output_num
 )
 
-print(100 * '-' + "\n\t\tOutput\n" + 100 * '-')
+print(100 * '-' + "\
+\t\tOutput\
+" + 100 * '-')
 for i, sample_output in enumerate(sample_outputs):
 
     text = tokenizer.decode(sample_output, skip_special_tokens=True)
@@ -109,7 +118,9 @@ for i, sample_output in enumerate(sample_outputs):
     # Remove all text after 3 newlines
     text = text[: text.find(new_lines) if new_lines else None]
 
-    print("\n{}: {}".format(i, text))
-    print("\n" + 100 * '-')
+    print("\
+{}: {}".format(i, text))
+    print("\
+" + 100 * '-')
 
 ```
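For reference, below is a minimal runnable sketch of the usage snippet this commit touches, using the post-commit version pins. The checkpoint id `Norod78/hebrew-gpt_neo-xl` is assumed from this repo's name, and the prompt and generation parameters are illustrative placeholders rather than the README's exact values.

```python
# pip install tokenizers==0.10.3 transformers==4.8.0   (pins from this commit)
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the checkpoint id matches this repo's Hub name.
model_id = "Norod78/hebrew-gpt_neo-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "שלום, קוראים לי"  # illustrative Hebrew prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sampling-based generation as in the README snippet; the parameter
# values here are placeholders, not the README's exact settings.
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=128,
    top_k=50,
    top_p=0.95,
    num_return_sequences=2,
)

new_lines = "\n\n\n"  # pre-commit form: three literal newlines
for i, sample_output in enumerate(sample_outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)
    # Truncate at the first run of three newlines, if one is present.
    # Guarding on `new_lines in text` avoids str.find() returning -1
    # and silently chopping the last character when there is no match.
    text = text[: text.find(new_lines) if new_lines in text else None]
    print("\n{}: {}".format(i, text))
```

Note that the membership test `new_lines in text` is a small correction over the README's truthiness test on `new_lines`, which slices with `find()`'s -1 sentinel whenever the delimiter is absent.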
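A note on the `new_lines` and `print` hunks above: as the added lines appear in this diff, each string ends in a backslash immediately followed by a newline, which Python treats as a line continuation inside a regular string literal. If the README is copied verbatim, `new_lines` therefore evaluates to an empty string rather than three newlines, and the truncation step becomes a no-op. A small sketch of the difference, assuming the diff's literal backslash-newline form is what lands in the file:

```python
# Pre-commit form: three explicit newline escapes.
new_lines_escaped = "\n\n\n"

# Post-commit form as it appears in the diff: each trailing backslash
# joins the next source line, so nothing is left inside the quotes.
new_lines_continued = "\
\
\
"

assert new_lines_escaped == "\n\n\n"  # three newline characters
assert new_lines_continued == ""      # empty string

# Downstream effect on the README's truncation line: an empty string is
# falsy, so the guard `if new_lines else None` silently skips truncation.
text = "first paragraph\n\n\nleftover"
print(text[: text.find(new_lines_escaped) if new_lines_escaped else None])      # 'first paragraph'
print(text[: text.find(new_lines_continued) if new_lines_continued else None])  # full text, untruncated
```

The rewritten `print` calls lose their leading `"\n"` the same way (the `"\t\tOutput"` banner keeps its tabs but not its surrounding newlines), so the output formatting also changes if the snippet is run as committed.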