---
library_name: transformers
license: other
---

Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using [this notebook](https://colab.research.google.com/drive/1IZ-KJgzRAMr4Rm_-OWvWwnfTQwRxOknp?usp=sharing) on Google Colab!

This is an early checkpoint of `sarvam-2b`, a small yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to perform well in 10 Indic languages in addition to English. Officially, the supported Indic languages are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The final checkpoint of `sarvam-2b` will be released soon. It will be trained on a data mixture of 4 trillion tokens, containing equal parts English (2T) and Indic (2T) tokens.

The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in [this video](https://www.youtube.com/watch?v=DFtAS1BCKvk).

The model was trained with [NVIDIA NeMo™ Framework](https://github.com/NVIDIA/NeMo) on the Yotta Shakti Cloud using HGX H100 systems.

## Getting started
```
from transformers import pipeline
pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'
```
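
If you prefer working with the model and tokenizer objects directly instead of the `pipeline` helper, the standard `transformers` causal-LM loading path can be used. The snippet below is a minimal sketch; the dtype, device placement, and `do_sample=True` setting are assumptions you may want to adjust for your hardware and sampling preferences.
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'sarvamai/sarvam-2b-v0.5'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 and device_map='auto' are assumptions; adjust for your hardware
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')

inputs = tokenizer('भारत के प्रथम प्रधानमंत्री', return_tensors='pt').to(model.device)
# do_sample=True is needed for the temperature setting to take effect
outputs = model.generate(**inputs, max_new_tokens=15, temperature=0.1, repetition_penalty=1.2, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```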

## Tokenizer

`sarvam-2b`'s tokenizer is built to be efficient for Indic languages and has an average fertility score of ~2, which is significantly lower than that of other popular models.

Here is a comparison of fertility scores between `sarvam-2b` and other popular models.
|Language|Sarvam-2B|Llama-3.1|Gemma-2|GPT-4o|
|--------|------|---------|-------|------|
|ben_Beng|2.07  |8.02     |3.72   |2.34  |
|eng_Latn|1.43  |1.24     |1.23   |1.23  |
|guj_Gujr|1.81  |9.97     |3.9    |2.3   |
|hin_Deva|1.4   |2.67     |1.96   |1.65  |
|kan_Knda|2.37  |14.95    |5.55   |3.29  |
|mal_Mlym|2.85  |16.26    |5.88   |3.52  |
|mar_Deva|1.77  |3.99     |3.2    |2.56  |
|ory_Orya|2.35  |16.84    |6.87   |6.83  |
|pan_Guru|1.68  |8.19     |3.37   |2.72  |
|tam_Taml|2.17  |12.39    |4.19   |3.17  |
|tel_Telu|2.14  |13.3     |4.57   |3.06  |
|**Average** |**2.08**  |**9.34**     |**4.01**   |**3.00**  |
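
For reference, a fertility score here is the average number of tokens the tokenizer produces per word. The sketch below shows one simple way to estimate it with the Hugging Face tokenizer; the sample sentences are illustrative placeholders and whitespace splitting is only a rough word count, so the numbers will not exactly match the table above.
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')

# Illustrative sample sentences (not the evaluation data behind the table above)
samples = {
    'hin_Deva': 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।',
    'eng_Latn': 'The first Prime Minister of India was Jawaharlal Nehru.',
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    n_words = len(text.split())  # rough word count via whitespace splitting
    print(f'{lang}: fertility ~ {n_tokens / n_words:.2f} tokens per word')
```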

More technical details, such as evaluations and benchmarking, will be posted soon.