Sharath Turuvekere Sreenivas committed on

Update README.md
README.md (CHANGED)
@@ -139,6 +139,8 @@ The integration of foundation and fine-tuned models into AI systems requires add
 
 NVIDIA-Nemotron-Nano-9B-v2-Base is pre-trained on a large corpus of high-quality curated and synthetically generated data. It is trained on English as well as 15 other natural languages and 43 programming languages. Our sources cover a variety of document types, such as webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. The model was trained for approximately twenty trillion tokens.
 
+Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, an ungated sample set is provided. For the remaining code, math, and multilingual data, gating and approval are required; the dataset is permissively licensed for model-training purposes.
+
 **Data Modality:** Text  **Total size:** 10,648,823,153,919 tokens  **Total number of datasets:** 141  **Dataset partition:** *Training \[100%\], testing \[0%\], validation \[0%\]*
 **Time period for training data collection:** 2013 to May 1, 2025
 **Time period for testing data collection:** 2013 to May 1, 2025
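The added note above points to an ungated sample set on the Hugging Face Hub, with the remaining code, math, and multilingual subsets gated behind license acceptance. Below is a rough sketch of how such data could be inspected with the `datasets` library; the repository ID is a placeholder, since this commit does not name the actual dataset repo.

```python
# Rough sketch: peek at a few records of the ungated pretraining sample.
# ASSUMPTION: "nvidia/<pretraining-data-sample>" is a placeholder, not the real
# repository ID; look it up on the Hugging Face Hub before running.
from itertools import islice

from datasets import load_dataset

sample = load_dataset(
    "nvidia/<pretraining-data-sample>",  # hypothetical, ungated sample repo
    split="train",
    streaming=True,  # stream records instead of downloading the full split
)

for record in islice(sample, 3):
    print(record)  # each record is a dict of the dataset's raw fields

# The gated code/math/multilingual subsets additionally require accepting the
# license on the Hub and authenticating, e.g.:
#   from huggingface_hub import login
#   login(token="hf_...")   # or `huggingface-cli login` on the command line
```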
