Edit model card

Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using this notebook on Google colab!

This is an early checkpoint of sarvam-2b, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The final checkpoint of sarvam-2b will be released soon, and it will be trained on a data mixture of 4 trillion tokens: containing equal parts English (2T) and Indic (2T) tokens.

The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in this video.

The model was trained with NVIDIA NeMo™ Framework on the Yotta Shakti Cloud using HGX H100 systems.

Getting started:

from transformers import pipeline
pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'

Tokenizer

sarvam-2b's tokenizer is built to be efficient for Indic languages and has an average fertility score of ~2 which is significantly lower than other models.

Here is a comparison of fertility scores between sarvam-2b and other popular models.

Sarvam-2B Llama-3.1 Gemma-2 GPT-4o
ben_Beng 2.07 8.02 3.72 2.34
eng_Latn 1.43 1.24 1.23 1.23
guj_Gujr 1.81 9.97 3.9 2.3
hin_Deva 1.4 2.67 1.96 1.65
kan_Knda 2.37 14.95 5.55 3.29
mal_Mlym 2.85 16.26 5.88 3.52
mar_Deva 1.77 3.99 3.2 2.56
ory_Orya 2.35 16.84 6.87 6.83
pan_Guru 1.68 8.19 3.37 2.72
tam_Taml 2.17 12.39 4.19 3.17
tel_Telu 2.14 13.3 4.57 3.06
Average 2.08 9.34 4.01 3.00

More technical details like evaluations and benchmarking will be posted soon.

Downloads last month
2,098
Safetensors
Model size
2.51B params
Tensor type
BF16
·
Inference API
Input a message to start chatting with sarvamai/sarvam-2b-v0.5.
No chat template found in tokenizer config
View Code

Model tree for sarvamai/sarvam-2b-v0.5

Finetunes
7 models
Quantizations
7 models

Spaces using sarvamai/sarvam-2b-v0.5 3