#2 by pratyush-sarvam · opened
README.md CHANGED
@@ -7,6 +7,8 @@ Update (Aug 15, 2024): You can now get started with text completions and supervi
This is an early checkpoint of sarvam-2b, a small, yet powerful language model pre-trained from scratch on 4 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
+The model was trained on the NVIDIA NeMo stack on H100s, courtesy of Yotta.
+
sarvam-2b will be trained on a data mixture containing equal parts English (2T) and Indic (2T) tokens. The current checkpoint has seen a total of 2 trillion tokens, and has not undergone any post-training.
Getting started:
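The hunk cuts off at the "Getting started:" heading, so the README's own snippet is not shown in this diff. For orientation, here is a minimal text-completion sketch using the Hugging Face transformers API; the repo id `sarvamai/sarvam-2b-v0.5`, the dtype choice, and the Hindi prompt are assumptions, not taken from this diff.

```python
# Minimal sketch: text completion with a sarvam-2b checkpoint.
# The repo id below is an assumption; substitute the actual checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-2b-v0.5"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # GPU-friendly dtype; use float32 on CPU
    device_map="auto",           # requires the accelerate package
)

# The checkpoint has had no post-training, so frame the prompt as text
# to continue, not as an instruction to follow.
prompt = "भारत की राजधानी"  # "The capital of India" (Hindi)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```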