---
library_name: transformers
license: other
---
Update (Aug 15, 2024): You can now get started with text completions and supervised fine-tuning using this notebook on Google Colab!
This is an early checkpoint of sarvam-2b, a small yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
The final checkpoint of sarvam-2b will be released soon. It will be trained on a data mixture of 4 trillion tokens, containing equal parts English (2T) and Indic (2T) tokens.
The current checkpoint has not undergone any post-training. You can see its capabilities in this video.
The model was trained with NVIDIA NeMo™ Framework on the Yotta Shakti Cloud using HGX H100 systems.
Getting started:
```python
from transformers import pipeline

pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'
```
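If you prefer to manage the model and tokenizer directly (for example, to control the dtype or device placement), the standard transformers auto classes work as well. The snippet below is a minimal sketch, assuming a single CUDA device and that bfloat16 weights fit in memory; it is not the officially recommended setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights (bfloat16 assumed to fit on one GPU).
tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')
model = AutoModelForCausalLM.from_pretrained(
    'sarvamai/sarvam-2b-v0.5', torch_dtype=torch.bfloat16
).to('cuda')

# Same prompt and sampling settings as the pipeline example above.
inputs = tokenizer('भारत के प्रथम प्रधानमंत्री', return_tensors='pt').to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=15, do_sample=True, temperature=0.1, repetition_penalty=1.2
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```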
Tokenizer
sarvam-2b's tokenizer is built to be efficient for Indic languages and has an average fertility score (tokens produced per word) of ~2, which is significantly lower than that of other models.
Here is a comparison of fertility scores between sarvam-2b and other popular models:
| Language | Sarvam-2B | Llama-3.1 | Gemma-2 | GPT-4o |
|---|---|---|---|---|
| ben_Beng | 2.07 | 8.02 | 3.72 | 2.34 |
| eng_Latn | 1.43 | 1.24 | 1.23 | 1.23 |
| guj_Gujr | 1.81 | 9.97 | 3.9 | 2.3 |
| hin_Deva | 1.4 | 2.67 | 1.96 | 1.65 |
| kan_Knda | 2.37 | 14.95 | 5.55 | 3.29 |
| mal_Mlym | 2.85 | 16.26 | 5.88 | 3.52 |
| mar_Deva | 1.77 | 3.99 | 3.2 | 2.56 |
| ory_Orya | 2.35 | 16.84 | 6.87 | 6.83 |
| pan_Guru | 1.68 | 8.19 | 3.37 | 2.72 |
| tam_Taml | 2.17 | 12.39 | 4.19 | 3.17 |
| tel_Telu | 2.14 | 13.3 | 4.57 | 3.06 |
| Average | 2.08 | 9.34 | 4.01 | 3.00 |
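Fertility here is taken to mean the average number of tokens the tokenizer produces per whitespace-separated word. The snippet below is a rough sketch of how such a score could be measured on your own text; it is an illustrative approximation, not the exact evaluation setup behind the table above.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Approximate fertility: tokens produced per whitespace-separated word."""
    total_tokens, total_words = 0, 0
    for text in texts:
        total_tokens += len(tokenizer.tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words

tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')
# Hypothetical sample; use a representative corpus per language for meaningful numbers.
sample = ['भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।']
print(f'fertility: {fertility(tokenizer, sample):.2f}')
```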
More technical details like evaluations and benchmarking will be posted soon.