HeyLucasLeao committed · Commit 26a826f · 1 Parent(s): 70cf665
Update README.md

README.md CHANGED
@@ -4,7 +4,7 @@
 This is a fine-tuned version of GPT-Neo 125M by EleutherAI for the Portuguese language.
 
 ##### Training data
-It was training from 227
+It was trained on 227,382 selected texts from a PTWiki dump. You can find all the data here: https://archive.org/details/ptwiki-dump-20210520
 
 ##### Training Procedure
 Every text was passed through a GPT-2 tokenizer with bos and eos tokens to mark its boundaries, using the maximum sequence length that GPT-Neo supports. It was fine-tuned with the default settings of the Trainer class, available in the Hugging Face library.
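For context on the training-procedure paragraph above (each text wrapped in bos/eos tokens, passed through a GPT-2 tokenizer at GPT-Neo's maximum sequence length, then fine-tuned with Trainer defaults), a minimal sketch could look like this. The dataset handling, the `max_length=2048` value, and the training arguments are illustrative assumptions; the diff itself specifies none of them.

```python
from transformers import (DataCollatorForLanguageModeling, GPT2Tokenizer,
                          GPTNeoForCausalLM, Trainer, TrainingArguments)

# GPT-Neo reuses the GPT-2 vocabulary, so a GPT2Tokenizer loaded from the
# base checkpoint matches the procedure described in the README.
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2's vocab has no pad token

texts = ["..."]  # placeholder for the 227,382 selected PTWiki texts

def encode(text):
    # Wrap each document in bos/eos tokens and truncate to 2048 tokens,
    # GPT-Neo 125M's maximum context length (an assumption; the README
    # only says "max sequence length that the GPT-Neo could support").
    return tokenizer(tokenizer.bos_token + text + tokenizer.eos_token,
                     truncation=True, max_length=2048)

train_dataset = [encode(t) for t in texts]

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

# "Default settings of the Trainer class": only output_dir is required;
# everything else keeps its library default.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-neo-small-portuguese"),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```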
@@ -45,7 +45,9 @@ sample_outputs = model.generate(generated,
 
 # Decoding and printing sequences
 for i, sample_output in enumerate(sample_outputs):
-    print(">> Generated text {}\
+    print(">> Generated text {}\
+    \
+    {}".format(i+1, tokenizer.decode(sample_output.tolist())))
 
 # >> Generated text
 # Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
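The hunk above completes a `print` call that was cut off in the old README. A self-contained version of that generation example, written without backslash line continuations, might look like the sketch below. The checkpoint id `HeyLucasLeao/gpt-neo-small-portuguese` and the sampling arguments are assumptions (the diff only shows `model.generate(generated,` with the rest elided); passing `pad_token_id` explicitly avoids the "Setting `pad_token_id` to `eos_token_id`:50256" notice quoted in the README.

```python
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

# Hypothetical checkpoint id; the diff never states the repo name.
model_id = "HeyLucasLeao/gpt-neo-small-portuguese"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPTNeoForCausalLM.from_pretrained(model_id)

# `generated` mirrors the variable name in the hunk above.
generated = tokenizer.encode("Eu gosto de", return_tensors="pt")

# Illustrative sampling settings; the original call's arguments are
# elided in the diff. An explicit pad_token_id silences the
# open-end generation notice.
sample_outputs = model.generate(
    generated,
    do_sample=True,
    max_length=100,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)

# The repaired loop from the hunk, without line continuations.
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(
        i + 1, tokenizer.decode(sample_output.tolist())))
```

Writing the format string with an explicit `\n\n` avoids the backslash continuations that broke the original snippet in the first place.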