odia-gemma-2b-base (Pre-trained)

Odia-Gemma-2B-Base is a pre-trained Odia large language model with 2 billion parameters, based on Google's Gemma 2B. It was pre-trained on CulturaX-Odia, a subset of the original CulturaX dataset filtered for Odia text. The training data contains 49 million tokens and is sourced from mC4 and four distinct OSCAR corpora.

For more details about the model, data, training procedure, and evaluations, see the blog post.

Model Description

Model type: A 2B pre-trained decoder-only model
Primary Language(s): Odia and English
License: Gemma Terms of Use

NOTE

This is not an instruction-tuned model, so it may not follow human instructions without one-shot or few-shot prompting or instruction fine-tuning. The model has no moderation mechanisms and may generate harmful or inappropriate responses. We recommend first fine-tuning it on the task(s) you are interested in.
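
Because a base model only continues text, few-shot prompting usually means laying out a handful of worked examples and leaving the last one open. The sketch below shows one way to do that. The `Input:`/`Output:` template, the helper names, the sample word pairs, and the repository ID `OdiaGenAI/odia-gemma-2b-base` are assumptions made for illustration and are not taken from this model card; check the actual repository ID on the Hugging Face Hub before running the generation step.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Lay out worked examples, then an open-ended query for the model to complete."""
    blocks = [f"Input: {source}\nOutput: {target}" for source, target in examples]
    # The final block has no answer, so the model's continuation fills it in.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def generate(
    prompt: str,
    model_id: str = "OdiaGenAI/odia-gemma-2b-base",  # assumed repository ID
    max_new_tokens: int = 32,
) -> str:
    """Continue the prompt with the base model and return only the new text."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    # Illustrative English-to-Odia word pairs, not drawn from the training data.
    examples = [("water", "ପାଣି"), ("house", "ଘର"), ("book", "ବହି")]
    print(build_few_shot_prompt(examples, "sun"))
```

Passing the printed prompt to `generate` downloads the weights, which may require accepting the Gemma Terms of Use on the Hub. Since the model keeps generating after the answer, it is often useful to truncate the returned text at the first blank line.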

Citation Information

If you find this model useful, please consider giving it a 👏 and citing:

@misc{odia-gemma-2b-base,
  author = {Sambit Sekhar and Shantipriya Parida and Debasish Dhal and Guneet Singh Kohli},
  title = {OdiaGenAI Introduces Gemma 2B Pre-Trained LLM Catered to Odia Speakers},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/OdiaGenAI}},
}

Contributions

Sambit Sekhar
Shantipriya Parida
Debasish Dhal
Guneet Singh Kohli