beomi committed on
Commit
b82ed08
1 Parent(s): 5eeb547

Update README.md

  1. README.md +4 -0
README.md CHANGED
@@ -56,6 +56,10 @@ Trained with selected corpus within AIHub/Modu Corpus. The detailed dataset list
  - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
  - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
 
+ The final JSONL dataset used to train this model is 61GB.
+
+ Total number of tokens: approx. 15B (using the expanded tokenizer; with the original Llama tokenizer, >60B tokens).
+
  **Vocab Expansion**
 
  | Model Name | Vocabulary Size | Description |
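
The figures added in this change imply a rough token-count reduction from the vocabulary expansion. The sketch below only redoes that arithmetic. It assumes "61GB" means decimal gigabytes and treats ">60B" as a lower bound; the variable and function names are illustrative and do not come from the repository. Because JSONL files also contain keys, quotes, and other JSON overhead, the bytes-per-token values are coarse estimates rather than measurements of the text itself.

```python
# Approximate figures quoted in the README change (not measured here).
DATASET_BYTES = 61e9         # 61GB JSONL, assuming decimal gigabytes
EXPANDED_TOKENS = 15e9       # approx. token count with the expanded tokenizer
ORIGINAL_TOKENS_MIN = 60e9   # ">60B": lower bound with the original Llama tokenizer


def bytes_per_token(total_bytes: float, total_tokens: float) -> float:
    """Average number of dataset bytes covered by one token."""
    return total_bytes / total_tokens


# The original tokenizer needs at least this many times more tokens.
min_reduction_factor = ORIGINAL_TOKENS_MIN / EXPANDED_TOKENS

expanded_bpt = bytes_per_token(DATASET_BYTES, EXPANDED_TOKENS)
# Because 60B is a lower bound on the token count, this is an upper bound.
original_bpt_max = bytes_per_token(DATASET_BYTES, ORIGINAL_TOKENS_MIN)

print(f"token reduction factor: at least {min_reduction_factor:.1f}x")
print(f"expanded tokenizer: about {expanded_bpt:.2f} bytes/token")
print(f"original tokenizer: at most {original_bpt_max:.2f} bytes/token")
```

Read this way, the expanded tokenizer covers the same corpus with roughly a quarter of the tokens, which is why the same 61GB corresponds to about 15B tokens instead of more than 60B.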