Datasets

  • Training Data: The model was trained using FineWeb-Edu for English and FineWeb2 for Korean.

  • Validation Data: WikiText (English) and Wikipedia (Korean) were used for validation and evaluation.

Tokenizer

  • The tokenizer is based on the GPT-2 tokenizer architecture and was further trained on the English and Korean datasets listed above to improve its vocabulary coverage and performance on bilingual tasks.
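
The GPT-2 tokenizer is a byte-level BPE tokenizer, which is why the same architecture can be trained on both English and Korean text. This model card does not show the actual training script, so the following is only a toy sketch of the idea in plain Python: it learns byte-pair merges from a tiny made-up bilingual corpus. The function names (`train_bpe`, `apply_merge`, `encode`) and the sample sentences are hypothetical, and the sketch omits GPT-2's regex pre-tokenization and byte-to-unicode mapping.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from an iterable of strings."""
    # Start from raw UTF-8 bytes so English and Korean share one base alphabet.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress anything
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Tokenize `text` by applying the learned merges in the order they were learned."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical toy corpus standing in for the real English and Korean training data.
corpus = ["hello world", "hello there", "안녕하세요", "안녕히 가세요"]
merges = train_bpe(corpus, num_merges=20)
tokens = encode("안녕하세요", merges)
```

Because every token is a byte string, joining the tokens and decoding as UTF-8 always recovers the input text, including text the merges were never trained on. Training on more Korean text adds merges for frequent Korean byte sequences, so Korean input is split into fewer tokens.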
