Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using [this notebook](https://colab.research.google.com/drive/1IZ-KJgzRAMr4Rm_-OWvWwnfTQwRxOknp?usp=sharing) on Google Colab!
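If you would rather sketch supervised finetuning locally instead of in the notebook, something like the following should work. This is a minimal sketch using TRL's `SFTTrainer`, not the notebook's exact recipe; `my_data.json` is a placeholder for your own dataset with a `text` column.

```
# Minimal supervised-finetuning sketch using TRL's SFTTrainer.
# Not the notebook's recipe; 'my_data.json' is a placeholder dataset
# with a 'text' column.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset('json', data_files='my_data.json', split='train')

trainer = SFTTrainer(
    model='sarvamai/sarvam-2b-v0.5',  # loaded as a causal LM under the hood
    train_dataset=dataset,
    args=SFTConfig(output_dir='sarvam-2b-sft'),
)
trainer.train()
```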
This is an early checkpoint of sarvam-2b, a small yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages plus English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
sarvam-2b will be trained on a data mixture of 4 trillion tokens, containing equal parts English (2T) and Indic (2T) tokens. The current checkpoint has not undergone any post-training.
Getting started:

```
from transformers import pipeline
pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'
```
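The same completion can be produced without the `pipeline` wrapper. The sketch below loads the model and tokenizer directly and assumes a CUDA device with bfloat16 support; note that `temperature` only takes effect when sampling is enabled.

```
# Equivalent completion without the pipeline wrapper (assumes a CUDA device).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')
model = AutoModelForCausalLM.from_pretrained(
    'sarvamai/sarvam-2b-v0.5', torch_dtype=torch.bfloat16
).to('cuda')

inputs = tokenizer('भारत के प्रथम प्रधानमंत्री', return_tensors='pt').to('cuda')
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    do_sample=True,          # temperature has no effect under greedy decoding
    temperature=0.1,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```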
## Tokenizer
The table below compares tokenizer fertility (average tokens per word) across languages; lower is better:

| Language | Sarvam-2B | Llama-3.1 | Gemma-2 | GPT-4o |
|----------|-----------|-----------|---------|--------|
| ben_Beng | 2.07 | 8.02 | 3.72 | 2.34 |
| eng_Latn | 1.43 | 1.24 | 1.23 | 1.23 |
| guj_Gujr | 1.81 | 9.97 | 3.90 | 2.30 |
| hin_Deva | 1.40 | 2.67 | 1.96 | 1.65 |
| kan_Knda | 2.37 | 14.95 | 5.55 | 3.29 |
| mal_Mlym | 2.85 | 16.26 | 5.88 | 3.52 |
| mar_Deva | 1.77 | 3.99 | 3.20 | 2.56 |
| ory_Orya | 2.35 | 16.84 | 6.87 | 6.83 |
| pan_Guru | 1.68 | 8.19 | 3.37 | 2.72 |
| san_Deva | 2.97 | 4.22 | 3.63 | 3.30 |
| tam_Taml | 2.17 | 12.39 | 4.19 | 3.17 |
| tel_Telu | 2.14 | 13.30 | 4.57 | 3.06 |
| **Average** | **2.08** | **9.34** | **4.01** | **3.00** |
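As a rough way to reproduce numbers like the ones above, the sketch below counts tokens per whitespace-separated word on a sample sentence. Whitespace splitting is an assumption for illustration and only approximates whatever segmentation the table actually used.

```
# Rough fertility check: average tokens per whitespace-separated word.
# Whitespace splitting approximates, but may not match, the table's methodology.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sarvamai/sarvam-2b-v0.5')

def fertility(text):
    words = text.split()
    return len(tokenizer.tokenize(text)) / len(words)

print(fertility('भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।'))
```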
This model is built in collaboration with Yotta and Nvidia, and is trained on the NeMo stack.
More technical details like evaluations and benchmarking will be posted soon.