rahular committed on
Commit 44e7451 · verified · 1 Parent(s): 7bcbaee

Update README.md

Files changed (1)
  1. README.md +24 -3
README.md CHANGED
@@ -5,16 +5,37 @@ license: other
 Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using [this notebook](https://colab.research.google.com/drive/1IZ-KJgzRAMr4Rm_-OWvWwnfTQwRxOknp?usp=sharing) on Google Colab!

- This is an early checkpoint of sarvam-2b, a small, yet powerful language model pre-trained from scratch on 4 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
+ This is an early checkpoint of sarvam-2b, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

- sarvam-2b will be trained on a data mixture containing equal parts English (2T) and Indic (2T) tokens. The current checkpoint has seen a total of 2 trillion tokens, and has not undergone any post-training.
+ sarvam-2b will be trained on a data mixture of 4 trillion tokens: equal parts English (2T) and Indic (2T). The current checkpoint has not undergone any post-training.

  Getting started:
 ```
 from transformers import pipeline
 pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
 pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
- # 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू की बेटी इंदिरा गांधी थीं।\n\n'
+ # 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'
 ```
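As a companion to the notebook linked above, the block below is a minimal sketch of completion-style supervised finetuning of this checkpoint with the `trl` library. It is not the recipe from the notebook: the dataset id `your-org/your-indic-sft-data` is a placeholder, and the hyperparameters and exact `SFTConfig`/`SFTTrainer` argument names (which have shifted across `trl` releases) are assumptions.

```
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset id -- any dataset with a plain "text" column works
# for completion-style supervised finetuning.
dataset = load_dataset("your-org/your-indic-sft-data", split="train")

# Illustrative hyperparameters only; tune for your hardware and data.
training_args = SFTConfig(
    output_dir="sarvam-2b-sft",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="sarvamai/sarvam-2b-v0.5",  # SFTTrainer can load a model from its Hub id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```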
 
+ ## Tokenizer
+
+ | Language |Sarvam-2B|Llama-3.1|Gemma-2|GPT-4o|
+ |--------|------|---------|-------|------|
+ |ben_Beng|2.07 |8.02 |3.72 |2.34 |
+ |eng_Latn|1.43 |1.24 |1.23 |1.23 |
+ |guj_Gujr|1.81 |9.97 |3.9 |2.3 |
+ |hin_Deva|1.4 |2.67 |1.96 |1.65 |
+ |kan_Knda|2.37 |14.95 |5.55 |3.29 |
+ |mal_Mlym|2.85 |16.26 |5.88 |3.52 |
+ |mar_Deva|1.77 |3.99 |3.2 |2.56 |
+ |ory_Orya|2.35 |16.84 |6.87 |6.83 |
+ |pan_Guru|1.68 |8.19 |3.37 |2.72 |
+ |san_Deva|2.97 |4.22 |3.63 |3.3 |
+ |tam_Taml|2.17 |12.39 |4.19 |3.17 |
+ |tel_Telu|2.14 |13.3 |4.57 |3.06 |
+ |**Average** |**2.08** |**9.34** |**4.01** |**3.00** |
+
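The table above does not name its metric; a reasonable reading (an assumption, not stated in this README) is that these are tokenizer fertility scores, i.e. average tokens per word over some evaluation corpus, where lower is better. The sketch below shows how such a per-language comparison could be approximated with `AutoTokenizer`; the single sample sentence, whitespace word-splitting, and the particular Hub repo ids are simplifications, and GPT-4o is omitted because its tokenizer is not hosted on the Hub.

```
from transformers import AutoTokenizer

# One Hindi sentence for illustration; a real measurement would average
# over a large multilingual corpus.
text = "भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।"
words = text.split()  # crude whitespace word count

# The Llama and Gemma repos are gated; accept their licenses on the Hub first.
tokenizers = {
    "Sarvam-2B": "sarvamai/sarvam-2b-v0.5",
    "Llama-3.1": "meta-llama/Llama-3.1-8B",
    "Gemma-2": "google/gemma-2-9b",
}

for name, repo in tokenizers.items():
    tok = AutoTokenizer.from_pretrained(repo)
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n_tokens / len(words):.2f} tokens per word")
```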
+
+ This model is built in collaboration with Yotta and Nvidia, and is trained on the NeMo stack.
+
  More technical details like evaluations and benchmarking will be posted soon.