sarvam-2b-v0.5
metadata
library_name: transformers
license: other

Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using this notebook on Google Colab!

This is an early checkpoint of sarvam-2b, a small yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to perform well in 10 Indic languages plus English. Officially, the supported Indic languages are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The final checkpoint of sarvam-2b will be released soon; it will be trained on a data mixture of 4 trillion tokens, with equal parts English (2T) and Indic (2T) tokens.

The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in this video.

The model was trained with NVIDIA NeMo™ Framework on the Yotta Shakti Cloud using HGX H100 systems.

Getting started:

from transformers import pipeline

# Runs text generation on GPU 0; the pipeline infers the task from the model config
pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
# Prompt means "India's first Prime Minister"; the completion adds "... was Jawaharlal Nehru."
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'
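The update above also mentions supervised finetuning via the linked notebook. As a starting point, the model and tokenizer can be loaded directly for use in your own training setup; the snippet below is only a minimal loading sketch (not the notebook's code), and device_map='auto' assumes the accelerate package is installed.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base checkpoint and its tokenizer for further training (e.g. supervised finetuning)
tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')
model = AutoModelForCausalLM.from_pretrained(
    'sarvamai/sarvam-2b-v0.5',
    torch_dtype='auto',   # keep the checkpoint's native precision
    device_map='auto',    # requires accelerate; places weights on available devices
)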

Tokenizer

sarvam-2b's tokenizer is built to be efficient for Indic languages, with an average fertility score (tokens per word) of ~2, which is significantly lower than that of other models.

Here is a comparison of fertility scores between sarvam-2b and other popular models:

Language   Sarvam-2B   Llama-3.1   Gemma-2   GPT-4o
ben_Beng   2.07        8.02        3.72      2.34
eng_Latn   1.43        1.24        1.23      1.23
guj_Gujr   1.81        9.97        3.90      2.30
hin_Deva   1.40        2.67        1.96      1.65
kan_Knda   2.37        14.95       5.55      3.29
mal_Mlym   2.85        16.26       5.88      3.52
mar_Deva   1.77        3.99        3.20      2.56
ory_Orya   2.35        16.84       6.87      6.83
pan_Guru   1.68        8.19        3.37      2.72
tam_Taml   2.17        12.39       4.19      3.17
tel_Telu   2.14        13.30       4.57      3.06
Average    2.08        9.34        4.01      3.00
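The exact corpus and word segmentation behind these numbers are not specified here. As an illustration only, fertility can be estimated as the average number of tokens produced per whitespace-separated word, as in the sketch below; the sample sentence and the helper function fertility are placeholders, not the official evaluation script.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')

def fertility(texts):
    # Fertility = total subword tokens / total whitespace-separated words over a text sample
    total_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Illustrative one-sentence sample; the table above was computed on larger per-language corpora
print(fertility(['भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।']))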

More technical details, such as evaluations and benchmark results, will be posted soon.