---
license: osl-3.0
model-index:
- name: indus_1.175B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 22.7
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/ProjectIndus
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 25.04
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 23.12
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 0.0
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 49.57
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 0.0
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
widget:
- example_title: वर्तमान प्रधानमंत्री
  messages:
  - role: user
    content: >-
      भारत के वर्तमान प्रधानमंत्री कौन हैं?
- example_title: होली का महत्व
  messages:
  - role: user
    content: >-
      होली का महत्व क्या है?
---
# Model Card for Project Indus
<!-- Provide a quick summary of what the model is/does. -->
Project Indus LLM is a groundbreaking open-source language model tailored for Hindi and its dialects, designed to enhance natural language processing and generation across diverse Indian linguistic applications.
# Table of Contents
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Downstream Use](#downstream-use)
- [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Recommendations](#recommendations)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Evaluation](#evaluation)
- [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
- [Testing Data](#testing-data)
- [Factors](#factors)
- [Metrics](#metrics)
- [Results](#results)
- [Model Examination](#model-examination)
- [Technical Specifications](#technical-specifications)
- [Model Architecture and Objective](#model-architecture-and-objective)
- [Compute Infrastructure](#compute-infrastructure)
- [Hardware](#hardware)
- [Software](#software)
- [Citation](#citation)
- [Glossary](#glossary)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
# Model Details
## Model Description
Project Indus LLM aims to provide a robust language model for Indian languages, starting with Hindi and its dialects. This open-source foundational model, hosted on Hugging Face, is tailored for easy integration and further development by researchers and developers focusing on Indian linguistic diversity.
<!-- Provide a longer summary of what this model is/does. -->
The model is pretrained on Hindi and its dialects and then instruction-tuned.
- **Developed by:** Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
- **Model type:** Foundational Language model
- **Language(s) (NLP):** hin, bho, mai, doi
- **License:** other
- **Parent Model:** None. The model was built from the ground up on the GPT-2 architecture, from tokenizer to decoder.
- **Resources for more information:** <https://www.techmahindra.com/en-in/innovation/the-indus-project/>
# Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Uses include question answering and conversation in Hindi and its dialects. The model would be reward-tuned for use across various industries:
1. Call center
2. Healthcare
3. Automotive
4. Telecom
## Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
Project Indus can be directly used for generating text, simulating conversation, and other text generation tasks without additional training.
## Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
Uses include question answering and conversation in Hindi and its dialects. The model would be reward-tuned for use across various industries:
1. Call center
2. Healthcare
3. Automotive
4. Telecom
## Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
Project Indus is not designed for high-stakes decision-making tasks such as medical diagnosis or legal advice, nor can it be used for fill-in-the-blank exercises, multiple Q&A, and similar applications at the moment.
# Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Significant research has explored bias and fairness issues with language models
(see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
We have tried to address various biases by removing them from the training data. However, since the model is generative, it will tend to produce hallucinations.
Any disturbing or harmful stereotype produced by the model is purely unintentional and coincidental.
## Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users should watch for biased output and negative connotations when using the model. Regular updates, along with community feedback, are crucial for addressing any emergent bias or misuse scenarios.
# Training Details
The model was trained on a curated dataset comprising various sources of Hindi text, including literature, news articles, and web content.
## Infrastructure
- **Training Infrastructure:** Utilized high-performance computing resources provided by CDAC, featuring NVIDIA A100 GPUs.
- **Running Infrastructure:** Tested for both GPU (NVIDIA GeForce RTX 3070 or higher) and CPU (Intel Xeon Platinum 8580) environments.
## Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The Project Indus LLM was trained on a diverse and extensive dataset comprising various sources of Hindi text and its dialects. The data collection and curation process was meticulously designed to cater to the linguistic diversity and complexity of Indian languages, particularly focusing on Hindi and its 37 dialects.
### Data Sources and Collection
Data was collected in three main buckets:
1. **Open-Source Hindi Data**: This included publicly available sources from the internet in both news and non-news categories. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
- **News**: Articles from news portals.
- **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.
3. **Dialects**: Data collection for dialects presented a unique challenge due to the limited material available on the internet. Data for major dialects like Maithili, Bhojpuri, Magahi, and Braj Bhasha was collected from multiple sources, including fieldwork where representatives collected old books and other texts, which were then digitized and converted into text data.
## Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Training involved extensive preprocessing to clean and standardize the text, followed by supervised learning on a high-performance computing setup.
- **Pre-training:** Conducted on a dataset of 22 billion tokens using advanced tokenization techniques.
- **Fine-Tuning:** Supervised fine-tuning performed with a focus on Indian languages, utilizing datasets specifically tailored for cultural, political, and social contexts.
Below is a table summarizing the datasets used for pre-training and fine-tuning the model:
| Phase | Data Source | Tokens | Notes |
|---------------|-----------------------------------------|-----------|-----------------------------------------------------|
| Pre-training | Cleaned dataset of Hindi and dialects | 22 billion| Utilized advanced tokenization |
| Fine-tuning | Custom datasets tailored for Indian languages | Varied | Focus on cultural, political, and social contexts |
### Preprocessing
The collected data underwent several stages of cleaning and preprocessing to ensure high quality and usability for training:
- **Cleaning**: The data was cleaned of unwanted text, characters, and personal information like mobile numbers. Transliteration was performed where necessary, and unwanted tags from scraped web pages were removed.
- **Bias Removal**: A Bias Removal Toolkit was developed to detect and remove biased language from the training data. This toolkit helped in ensuring that the text used for training the model was ethical, correct, and socially responsible.
- **Tokenization**: The data was tokenized using a custom tokenizer developed specifically for Hindi and its dialects. This tokenizer was based on Byte Pair Encoding (BPE) with additional mechanisms like byte fallback to handle the peculiarities of Hindi script efficiently.
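The byte-fallback idea mentioned above can be illustrated with a toy sketch. This is not the Project Indus tokenizer: the vocabulary below is hypothetical, the encoder works on single characters rather than learned BPE merges, and the `<0xNN>` token spelling is an assumption borrowed from a common convention. It only shows how a character missing from the vocabulary can be represented by its UTF-8 bytes instead of an unknown token, so that decoding stays lossless.

```python
def encode_with_byte_fallback(text, vocab):
    """Encode text character by character, falling back to UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Unknown character: emit one token per UTF-8 byte, e.g. <0xE0>
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Rebuild the original string from regular tokens and byte tokens."""
    buffer = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            buffer.append(int(tok[3:5], 16))
        else:
            buffer.extend(tok.encode("utf-8"))
    return buffer.decode("utf-8")


# Hypothetical toy vocabulary: only the characters of the word "भारत"
toy_vocab = {"भ", "ा", "र", "त"}
text = "भारत ॐ"

tokens = encode_with_byte_fallback(text, toy_vocab)
restored = decode_with_byte_fallback(tokens)
print(tokens)
print(restored == text)
```

The space and the character "ॐ" are not in the toy vocabulary, so they are emitted as byte tokens, and the round trip still recovers the input exactly.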
#### Summary
The final dataset used for training consisted of:
- **Raw Data Size**: Over 500 GB of raw data collected.
- **Cleaned and Curated Data**: Approximately 200 GB of clean Hindi and dialect text data.
- **Tokenization**: Utilized 22 billion tokens created from the cleaned data for pre-training.
This diverse and extensive training data foundation allowed Project Indus LLM to develop robust capabilities for understanding and generating Hindi text, making it a powerful tool for applications requiring Indian language processing.
# Evaluation
### Indic LLM Leaderboard Results
Project Indus LLM has been evaluated using the Indic LLM Leaderboard, which employs the `indic_eval` evaluation framework specifically designed for assessing models on Indian language tasks. This framework provides a comprehensive view of model performance across a variety of benchmarks tailored to Indian languages.
Detailed results from the Indic LLM Leaderboard (α), accessible at [Hugging Face Indic LLM Leaderboard](https://huggingface.co/spaces/Cognitive-Lab/indic_llm_leaderboard), are shown below:
| Task | Version | Metric | Value | | Stderr |
|--------------------------------|---------|----------|-------|---|--------|
| All | | acc | 0.2891| ± | 0.0109 |
| | | acc_norm | 0.3013| ± | 0.0112 |
| indiceval:ARC-Challenge:hindi:10 | 0 | acc | 0.2167| ± | 0.0120 |
| | | acc_norm | 0.2474| ± | 0.0126 |
| indiceval:ARC-Easy:hindi:5 | 0 | acc | 0.3615| ± | 0.0099 |
| | | acc_norm | 0.3552| ± | 0.0098 |
These results summarize the model's performance on Hindi-language benchmarks under controlled testing conditions. The standard error values estimate the sampling uncertainty of each score over the benchmark's test items, which indicates how much a score could shift on a different sample of questions.
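As a rough sanity check, the reported standard errors are close to what the binomial formula `sqrt(p * (1 - p) / n)` gives for an accuracy `p` measured on `n` questions. The sketch below assumes the Hindi benchmarks keep the sizes of the standard English ARC test splits (1,172 questions for ARC-Challenge and 2,376 for ARC-Easy); those sizes are an assumption, not something stated in this card, and the evaluation framework may compute the error slightly differently.

```python
import math


def binomial_stderr(accuracy: float, num_items: int) -> float:
    """Standard error of an accuracy estimated from num_items independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_items)


# Accuracies from the table above; item counts are ASSUMED from the English ARC test splits
arc_challenge_se = binomial_stderr(0.2167, 1172)
arc_easy_se = binomial_stderr(0.3615, 2376)

print(f"ARC-Challenge stderr ~ {arc_challenge_se:.4f} (reported: 0.0120)")
print(f"ARC-Easy stderr      ~ {arc_easy_se:.4f} (reported: 0.0099)")
```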
### Open LLM Leaderboard Evaluation Results
Additionally, Project Indus LLM has been evaluated on the Open LLM Leaderboard, which provides another layer of benchmarking by comparing the model's performance against other state-of-the-art language models. Below are the summarized results from the Open LLM Leaderboard:
| Metric                          |Value|
|---------------------------------|----:|
|Avg.                             |20.07|
|AI2 Reasoning Challenge (25-Shot)|22.70|
|HellaSwag (10-Shot)              |25.04|
|MMLU (5-Shot)                    |23.12|
|TruthfulQA (0-shot)              | 0.00|
|Winogrande (5-shot)              |49.57|
|GSM8k (5-shot)                   | 0.00|
These benchmark results can be explored further on [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
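The 20.07 average is consistent with a mean over six benchmarks, counting the TruthfulQA and GSM8k scores of 0.0 that appear in this card's metadata. A short check, using only the values reported in this card:

```python
# Open LLM Leaderboard scores as reported in this model card
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 22.70,
    "HellaSwag (10-Shot)": 25.04,
    "MMLU (5-Shot)": 23.12,
    "TruthfulQA (0-shot)": 0.0,
    "Winogrande (5-shot)": 49.57,
    "GSM8k (5-shot)": 0.0,
}

# Unweighted mean over all six benchmarks
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")
```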
### Evaluation Context
The evaluation metrics `acc` (accuracy) and `acc_norm` (normalized accuracy) are used to quantify the model's performance. The tasks are differentiated by their difficulty and the specific dataset used, such as the ARC Challenge and ARC Easy sets, both adapted to Hindi language conditions to ensure relevant assessment. This structured evaluation ensures that the Indus LLM not only performs well in generalized text generation tasks but also in more specialized, context-specific scenarios pertinent to the Indian linguistic framework.
## Results
Project Indus demonstrates competitive performance, particularly in text generation tasks, as evidenced by its scores on standardized benchmarks.
# Technical Specifications
## Model Architecture and Objective
Project Indus LLM is based on a GPT-2-like architecture, tailored to handle the complexities of the Hindi language and its dialects. This model was designed to serve as a foundational model that can be fine-tuned for various applications, making it highly versatile and adaptable to different domains within the Indian context.
- **Architecture Details**:
- **Layers**: 22 transformer layers, which provide a deep neural network capable of understanding complex language patterns.
- **Heads**: 32 attention heads per layer, facilitating a broad attention mechanism across different parts of the input data.
- **Embedding Size**: 2048, which allows the model to represent a wide variety of information and nuances in the data.
- **Vocabulary Size**: 32,300, tailored to include a comprehensive set of Hindi words and common phrases found in the training data.
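These figures can be turned into a back-of-the-envelope parameter count. The sketch below assumes a standard GPT-2 block layout (fused QKV projection, a 4x MLP expansion, biases, two layer norms per block), input and output embeddings that share weights, and a block size of 1,024 positions. None of these three assumptions is stated in this card, so the result is an estimate, not the official count.

```python
# Values stated in this card
NUM_LAYERS = 22
NUM_HEADS = 32
EMBED_DIM = 2048
VOCAB_SIZE = 32_300

# ASSUMPTIONS (not stated in this card): GPT-2 defaults
BLOCK_SIZE = 1024      # number of learned position embeddings
MLP_EXPANSION = 4      # hidden width of the feed-forward layer = 4 * EMBED_DIM


def gpt2_style_param_count(layers, dim, vocab, positions, expansion):
    """Estimate parameters of a GPT-2 style decoder with tied embeddings."""
    attention = (3 * dim * dim + 3 * dim) + (dim * dim + dim)  # QKV + output projection
    hidden = expansion * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)       # up and down projections
    layer_norms = 2 * (2 * dim)                                # scale and shift, twice
    per_layer = attention + mlp + layer_norms
    embeddings = vocab * dim + positions * dim                 # token + position tables
    final_norm = 2 * dim
    return layers * per_layer + embeddings + final_norm


head_dim = EMBED_DIM // NUM_HEADS
total_params = gpt2_style_param_count(
    NUM_LAYERS, EMBED_DIM, VOCAB_SIZE, BLOCK_SIZE, MLP_EXPANSION
)
print(f"Per-head dimension: {head_dim}")
print(f"Estimated parameters: {total_params / 1e9:.3f}B")
```

Under these assumptions the estimate lands near 1.18 billion parameters, in line with the `indus_1.175B` name used in the card metadata.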
The objective of this model is to provide a robust tool for text generation and understanding in Hindi and its dialects, supporting the development of applications that require natural language processing in these languages. It also aims to bridge the gap in technology where Indian languages are underrepresented, providing a platform for further linguistic research and technological inclusion.
## Compute Infrastructure
##### Hardware
The pre-training and fine-tuning of Project Indus LLM were conducted on high-performance computing infrastructure provided by the Centre for Development of Advanced Computing (CDAC). This setup included:
- **Nodes and GPUs**: Utilization of six nodes, each equipped with eight NVIDIA A100 GPUs. These GPUs are state-of-the-art for machine learning tasks and provide the necessary computational power to handle the large volumes of data and complex model architectures.
- **Memory and Storage**: Each node was equipped with ample memory and storage to handle the datasets and model parameters efficiently. Specific configurations included 40 GB of GPU memory per card, essential for training large models.
Inference performance was tested on GPU as well as CPU.
- **GPU**: On an NVIDIA GeForce RTX 3070, generating 250-350 tokens took roughly 5-10 s.
- **CPU**: On an Intel Xeon Platinum 8580 CPU, performance was comparable to the GPU, with a throughput of more than 30 tokens/second.
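The GPU figure is given as a token count and a latency, while the CPU figure is given as throughput. Converting the GPU numbers to tokens per second puts both in the same unit. The pairing of the extremes below (fewest tokens with the longest time, most tokens with the shortest time) is an assumption used only to bound the range; the card does not report per-run measurements.

```python
def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Convert a generation run (token count, wall-clock time) to throughput."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


# Bounds implied by "250-350 tokens in roughly 5-10 s" (extremes paired by assumption)
gpu_low = tokens_per_second(250, 10.0)
gpu_high = tokens_per_second(350, 5.0)

print(f"GPU throughput range: {gpu_low:.0f}-{gpu_high:.0f} tokens/second")
```

The stated CPU throughput of more than 30 tokens/second falls inside this range, which matches the card's description of the two as comparable.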
##### Software
The software environment was crucial for efficiently training and running the model. Key components included:
- **Operating System**: Linux, chosen for its stability and support for high-performance computing tasks.
- **Machine Learning Frameworks**: PyTorch, used for its flexibility and efficiency in training deep learning models. It supports extensive parallel processing and GPU acceleration, which are critical for training large models like Project Indus LLM.
- **Job Scheduler**: SLURM (Simple Linux Utility for Resource Management) was used to manage and allocate resources effectively across the distributed system. This ensured optimal scheduling of training jobs without resource contention.
# Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
The detailed citation information will help in acknowledging the work and efforts of the team behind Project Indus LLM when it is used or referenced in academic or professional settings.
```bibtex
@article{malhotra2024projectindus,
title={Project Indus: A Foundational Model for Indian Languages},
author={Malhotra, Nikhil and Brahme, Nilesh and Mishra, Satish and Sharma, Vinay},
journal={Tech Mahindra Makers Lab},
year={2024},
url={https://www.techmahindra.com/en-in/innovation/the-indus-project/}
}
```
**APA:**
Malhotra, N., Brahme, N., Mishra, S., & Sharma, V. (2024). Project Indus: A Foundational Model for Indian Languages. *Tech Mahindra Makers Lab*. Available at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>
# Glossary
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
This glossary section explains key terms used throughout the model documentation and technical details, helping users unfamiliar with certain concepts to better understand the content.
- **Transformer Layers**: Part of a neural network architecture that uses self-attention mechanisms to process sequential data such as text. Essential for NLP tasks.
- **Attention Heads**: Sub-units of a model layer that allow the model to focus on different parts of the input sequence when making predictions.
- **Embedding Size**: The size of the vector used to represent each token or word in a dense numerical form. Larger embeddings can capture more detailed information.
- **Block Size**: The maximum length of the input tokens the model can process in one operation.
- **Vocabulary Size**: The total number of unique words or tokens that the model can understand and generate.
# More Information
For further details on Project Indus LLM, including additional documentation, tutorials, and community discussions, visit the following resources:
- **Project Repository**: [Hugging Face Repository](https://huggingface.co/nickmalhotra/ProjectIndus)
- **Tech Mahindra Makers Lab**: Insights into the research and development behind Project Indus can be found on the [Tech Mahindra Innovation page](https://www.techmahindra.com/en-in/innovation/makers-lab/).
- **Community Forums**: Engage with the community on [Hugging Face Forums](https://huggingface.co/nickmalhotra/ProjectIndus/discussions?status=open&type=discussion) for support, brainstorming, and sharing of new ideas related to Project Indus.
# Model Card Authors
<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->
The model card and documentation for Project Indus LLM were collaboratively authored by:
- **Nikhil Malhotra**: Chief Innovation Officer at Tech Mahindra.
- **Nilesh Brahme**: Senior AI Research Scientist and one of the primary contributors to the Project Indus development.
- **Satish Mishra**: AI Architect, whose insights have significantly shaped the model's capabilities.
- **Vinay Sharma**: LLM Engineer focused on the linguistic data processing and model training aspects of Project Indus.
# Model Card Contact
For inquiries, support, or further information regarding Project Indus LLM, please reach out through the following channels:
- **Email**: [projectindus@techmahindra.com](mailto:projectindus@techmahindra.com) - For direct queries and professional engagements.
- **GitHub Issues**: For technical issues, feature requests, or contributions, please use the Issues section of the [Project Indus GitHub repository](https://github.com/Tech-Mahindra-Makers-Lab/Indus-1.1B).
- **Hugging Face Spaces**: Questions and discussions related to model implementation and community projects can be posted in our dedicated space on Hugging Face.
# How to Get Started with the Model
To begin using Project Indus LLM for your projects, follow these steps to set up and run the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("nickmalhotra/ProjectIndus")
tokenizer = AutoTokenizer.from_pretrained("nickmalhotra/ProjectIndus")


def format_template(user_prompt):
    """Wrap the prompt in the chat template and return the input token IDs."""
    messages = [
        {"role": "user", "content": user_prompt},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )


# Example inference. The prompt asks: "Who is the current Prime Minister of India?"
user_prompt = """भारत के वर्तमान प्रधानमंत्री कौन हैं?"""
input_ids = format_template(user_prompt)

# Generate text using the model (beam search combined with sampling)
output = model.generate(
    input_ids,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_length=1024,
    num_beams=5,
    do_sample=True,
    early_stopping=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    num_return_sequences=1,
)

# Special tokens are kept in the decoded output
print(tokenizer.decode(output[0], skip_special_tokens=False))
```
## Disclaimer
#### Model Limitations
Project Indus LLM is trained with single instruction tuning, which may result in hallucinations—instances where the model generates plausible but inaccurate information. Users should exercise caution, especially in scenarios requiring high factual accuracy.
#### Adaptation for Specific Use Cases
Project Indus LLM is designed as a foundational model suitable for further development and fine-tuning. Users are encouraged to adapt and refine the model to meet specific requirements of their applications.
#### Recommendations for Fine-Tuning
- **Identify Specific Needs**: Clearly define the requirements of your use case to guide the fine-tuning process.
- **Curate Targeted Data**: Ensure the training data is relevant and of high quality to improve model performance.
- **Continuous Evaluation**: Regularly assess the model's performance during and after fine-tuning to maintain accuracy and reduce biases.
This disclaimer aims to provide users with a clear understanding of the model's capabilities and limitations, facilitating its effective application and development.