---
license: llama3.1
language:
- en
- es
inference: false
fine-tuning: true
tags:
- nvidia
- llama3.1
- spanish
- tango
datasets:
- spanish-ir/messirve
base_model: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
pipeline_tag: text-generation
library_name: transformers
---
# Model Overview

## Description:

Tango-70B-Instruct is a large language model trained by [sandbox-ai](https://github.com/sandbox-ai/tango) on a [modified variation](https://huggingface.co/datasets/tatakof/messi_mod-v0.0.2) of [spanish-ir/messirve](https://huggingface.co/datasets/spanish-ir/messirve) to improve performance on regional Spanish.


See details on the [github repo](https://github.com/sandbox-ai/tango)


## Terms of use

By accessing this model, you agree to the Llama 3.1 terms and conditions: the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), the [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and [Meta’s privacy policy](https://www.facebook.com/privacy/policy/).


## Evaluation Metrics
|Task                                                                                                    |Name               |Description                                                            |Language|Metric        |Task type                                 |
|--------------------------------------------------------------------------------------------------------|-------------------|-----------------------------------------------------------------------|--------|--------------|------------------------------------------|
|[AQuAS](https://huggingface.co/datasets/IIC/AQuAS)                                                      |AQuAS              |Abstractive Question-Answering in Spanish                              |ES      |sas_encoder   |Abstractive QA                            |
|[ARC_ca](https://huggingface.co/datasets/projecte-aina/arc_ca)                                          |ARC_ca             |Grade-school level science questions in Catalan                        |CA      |acc           |Multi choice QA                           |
|[BEC2016eu](https://huggingface.co/datasets/orai-nlp/basqueGLUE)                                        |BEC2016eu          |Basque Election Campaign 2016 Opinion Dataset                          |EU      |f1            |Sentiment Analysis                        |
|[Belebele Glg](https://huggingface.co/datasets/facebook/belebele)                                       |Belebele Glg       |Reading Comprehension in Galician                                      |GL      |acc           |Reading Comprehension                     |
|[BertaQA](https://huggingface.co/datasets/HiTZ/BertaQA)                                                 |BertaQA            |Trivia dataset with global and local questions about the Basque Country|EU      |acc           |Multi choice QA                           |
|[BHTCv2](https://huggingface.co/datasets/orai-nlp/basqueGLUE)                                           |BHTCv2             |Topic Classification of News Headlines in Basque                       |EU      |f1            |Classification, Topic Classification      |
|[caBREU](https://huggingface.co/datasets/projecte-aina/caBreu)                                          |caBREU             |Article Summarization in Catalan                                       |CA      |bleu          |Summarization                             |
|[CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa)                                    |CatalanQA          |Extractive QA in Catalan                                               |CA      |f1            |Extractive QA                             |
|[CatCoLA](https://huggingface.co/datasets/nbel/CatCoLA)                                                 |CatCoLA            |Linguistic Acceptability in Catalan                                    |CA      |mcc           |Linguistic Acceptability                  |
|[ClinDiagnosES](https://huggingface.co/datasets/LenguajeNaturalAI/ClinDiagnosES)                        |ClinDiagnosES      |Diagnosis of clinical cases in Spanish                                 |ES      |sas_encoder   |Open QA                                   |
|[ClinTreatES](https://huggingface.co/datasets/LenguajeNaturalAI/ClinTreatES)                            |ClinTreatES        |Treatment for clinical cases in Spanish                                |ES      |sas_encoder   |Open QA                                   |
|[COPA_ca](https://huggingface.co/datasets/projecte-aina/COPA-ca)                                        |COPA_ca            |Choice Of Plausible Alternatives in Catalan                            |CA      |acc           |Reasoning                                 |
|[CoQCat](https://huggingface.co/datasets/projecte-aina/CoQCat)                                          |CoQCat             |Conversational Question Answering in Catalan                           |CA      |f1            |Extractive QA                             |
|[Crows Pairs Spanish](https://huggingface.co/datasets/multilingual-crows-pairs/multilingual-crows-pairs)|Crows Pairs Spanish|Bias evaluation using stereotypes                                      |ES      |pct_stereotype|Bias Detection                            |
|[EpecKorrefBin](https://huggingface.co/datasets/orai-nlp/basqueGLUE)                                    |EpecKorrefBin      |Coreference resolution in Basque                                       |EU      |acc           |Coreference Resolution, Textual Entailment|
|[EsCoLA](https://huggingface.co/datasets/nbel/EsCoLA)                                                   |EsCoLA             |Spanish Corpus of Linguistic Acceptability                             |ES      |mcc           |Linguistic Acceptability                  |
|[EusExams](https://huggingface.co/datasets/HiTZ/EusExams)                                               |EusExams           |Public Service examinations questions in Basque                        |EU      |acc           |Multi choice QA                           |
|[EusProficiency](https://huggingface.co/datasets/HiTZ/EusProficiency)                                   |EusProficiency     |C1-level proficiency questions in Basque                               |EU      |acc           |Multi choice QA                           |
|[EusReading](https://huggingface.co/datasets/HiTZ/EusReading)                                           |EusReading         |EGA exams reading comprehension in Basque                              |EU      |acc           |Multi choice QA                           |
|[EusTrivia](https://huggingface.co/datasets/HiTZ/EusTrivia)                                             |EusTrivia          |Trivia questions in Basque                                             |EU      |acc           |Multi choice QA                           |
|[Fake News ES](https://huggingface.co/datasets/mariagrandury/fake_news_corpus_spanish)                  |Fake News ES       |Fake News Detection in Spanish                                         |ES      |acc           |Classification                            |
|[GalCoLA](https://huggingface.co/datasets/proxectonos/galcola)                                          |GalCoLA            |Galician Corpus of Linguistic Acceptability                            |GL      |mcc           |Linguistic Acceptability                  |
|[HumorQA](https://huggingface.co/datasets/LenguajeNaturalAI/HumorQA)                                    |HumorQA            |White humour joke classification                                       |ES      |acc           |Classification                            |
|[MGSM_ca](https://huggingface.co/datasets/projecte-aina/mgsm_ca)                                        |MGSM_ca            |Grade-school math problems in Catalan                                  |CA      |exact_match   |Math Reasoning                            |
|[MGSM_es](https://huggingface.co/datasets/juletxara/mgsm)                                               |MGSM_es            |Grade-school math problems in Spanish                                  |ES      |exact_match   |Math Reasoning                            |
|[MGSM_eu](https://huggingface.co/datasets/HiTZ/MGSM-eu)                                                 |MGSM_eu            |Grade-school math problems in Basque                                   |EU      |exact_match   |Math Reasoning                            |
|[MGSM_gl](https://huggingface.co/datasets/proxectonos/mgsm_gl)                                          |MGSM_gl            |Grade-school math problems in Galician                                 |GL      |exact_match   |Math Reasoning                            |
|[NoticIA](https://huggingface.co/datasets/Iker/NoticIA)                                                 |NoticIA            |A Clickbait Article Summarization Dataset in Spanish                   |ES      |rouge1        |Summarization                             |
|[OffendES](https://huggingface.co/datasets/SINAI/OffendES)                                              |OffendES           |Classification of offensive comments in Spanish                        |ES      |acc           |Classification                            |
|[OpenBookQA_ca](https://huggingface.co/datasets/projecte-aina/openbookqa_ca)                            |OpenBookQA_ca      |Multi-step reasoning QA in Catalan                                     |CA      |acc           |Reasoning                                 |
|[OpenBookQA_gl](https://huggingface.co/datasets/proxectonos/openbookqa_gl)                              |OpenBookQA_gl      |Multi-step reasoning QA in Galician                                    |GL      |acc           |Reasoning                                 |
|[Parafraseja](https://huggingface.co/datasets/projecte-aina/Parafraseja)                                |Parafraseja        |Paraphrase identification in Catalan                                   |CA      |acc           |Paraphrasing                              |
|[ParafrasesGL](https://huggingface.co/datasets/proxectonos/parafrases_gl)                               |ParafrasesGL       |Paraphrase identification in Galician                                  |GL      |acc           |Paraphrasing                              |
|[PAWS_ca](https://huggingface.co/datasets/projecte-aina/PAWS-ca)                                        |PAWS_ca            |Paraphrase Adversaries from Word Scrambling in Catalan                 |CA      |acc           |Paraphrasing                              |
|[PAWS-X_es](https://huggingface.co/datasets/google-research-datasets/paws-x)                            |PAWS-X_es          |Paraphrase Adversaries from Word Scrambling in Spanish                 |ES      |acc           |Paraphrasing                              |
|[PAWS_gl](https://huggingface.co/datasets/proxectonos/PAWS-gl)                                          |PAWS_gl            |Paraphrase Adversaries from Word Scrambling in Galician                |GL      |acc           |Paraphrasing                              |
|[PIQA_ca](https://huggingface.co/datasets/projecte-aina/piqa_ca)                                        |PIQA_ca            |Physical Interaction QA in Catalan                                     |CA      |acc           |Reasoning                                 |
|[QNLIeu](https://huggingface.co/datasets/orai-nlp/basqueGLUE)                                           |QNLIeu             |Textual Entailment in Basque                                           |EU      |acc           |NLI, Textual Entailment                   |
|[RagQuAS](https://huggingface.co/datasets/IIC/RagQuAS)                                                  |RagQuAS            |Retrieval-Augmented-Generation and Question-Answering in Spanish       |ES      |sas_encoder   |Abstractive QA                            |
|[SIQA_ca](https://huggingface.co/datasets/projecte-aina/siqa_ca)                                        |SIQA_ca            |Social Interaction QA in Catalan                                       |CA      |acc           |Reasoning                                 |
|[SpaLawEx](https://huggingface.co/datasets/LenguajeNaturalAI/examenes_abogacia)                         |SpaLawEx           |Spanish Law School Access Exams                                        |ES      |acc           |Multi choice QA                           |
|[SummarizationGL](https://huggingface.co/datasets/proxectonos/summarization_gl)                         |SummarizationGL    |Abstractive Summarization in Galician                                  |GL      |bleu          |Summarization                             |
|[TE-ca](https://huggingface.co/datasets/projecte-aina/teca)                                             |TE-ca              |Textual Entailment in Catalan                                          |CA      |acc           |Textual Entailment                        |
|[TELEIA](https://huggingface.co/datasets/gonzmart/teleia)                                               |TELEIA             |Test de Español como Lengua Extranjera para Inteligencia Artificial    |ES      |acc           |Multi choice QA                           |
|[VaxxStance](https://huggingface.co/datasets/orai-nlp/basqueGLUE)                                       |VaxxStance         |Stance detection on the Antivaxxers movement                           |EU      |f1            |Sentiment Analysis, Stance Detection      |
|[WiCeu](https://huggingface.co/datasets/orai-nlp/basqueGLUE)                                            |WiCeu              |Word sense disambiguation in Basque                                    |EU      |acc           |Textual Entailment                        |
|[WNLI_ca](https://huggingface.co/datasets/projecte-aina/wnli-ca)                                        |WNLI_ca            |Winograd-schema-type dataset in Catalan                                |CA      |acc           |NLI, Textual Entailment                   |
|[WNLI ES](https://huggingface.co/datasets/PlanTL-GOB-ES/wnli-es)                                        |WNLI ES            |Winograd-schema-type dataset in Spanish                                |ES      |acc           |NLI, Textual Entailment                   |
|[XCOPA_eu](https://huggingface.co/datasets/HiTZ/XCOPA-eu)                                               |XCOPA_eu           |Choice Of Plausible Alternatives in Basque                             |EU      |acc           |Reasoning                                 |
|[XNLI_ca](https://huggingface.co/datasets/projecte-aina/xnli-ca)                                        |XNLI_ca            |Cross-lingual Natural Language Inference in Catalan                    |CA      |acc           |NLI, Textual Entailment                   |
|[XNLI_es](https://huggingface.co/datasets/facebook/xnli)                                                |XNLI_es            |Cross-lingual Natural Language Inference in Spanish                    |ES      |acc           |NLI                                       |
|[XNLI_eu](https://huggingface.co/datasets/HiTZ/xnli-eu)                                                 |XNLI_eu            |Cross-lingual Natural Language Inference in Basque                     |EU      |acc           |NLI, Textual Entailment                   |
|[XQuAD_ca](https://huggingface.co/datasets/projecte-aina/xquad-ca)                                      |XQuAD_ca           |Cross-lingual Question Answering Dataset in Catalan                    |CA      |f1            |Extractive QA                             |
|[XQuAD_es](https://huggingface.co/datasets/google/xquad)                                                |XQuAD_es           |Cross-lingual Question Answering Dataset in Spanish                    |ES      |f1            |Extractive QA                             |
|[xStoryCloze_ca](https://huggingface.co/datasets/projecte-aina/xstorycloze_ca)                          |xStoryCloze_ca     |Narrative completion in Catalan                                        |CA      |acc           |Reasoning                                 |
|[xStoryCloze_es](https://huggingface.co/datasets/juletxara/xstory_cloze)                                |xStoryCloze_es     |Narrative completion in Spanish                                        |ES      |acc           |Reasoning                                 |
|[xStoryCloze_eu](https://huggingface.co/datasets/juletxara/xstory_cloze)                                |xStoryCloze_eu     |Narrative completion in Basque                                         |EU      |acc           |Reasoning                                 |
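
The table above reports one metric per task, grouped by language (ES, CA, EU, GL). A common way to summarize such a suite is an unweighted macro average per language. The sketch below illustrates this aggregation; the task names match the table, but the scores are placeholders, not real Tango-70B results.

```python
# Hypothetical per-task scores: task name -> (language, score).
# The values are placeholders for illustration, NOT real evaluation results.
scores = {
    "AQuAS": ("ES", 0.72),
    "MGSM_es": ("ES", 0.65),
    "XNLI_es": ("ES", 0.58),
    "CatalanQA": ("CA", 0.70),
    "COPA_ca": ("CA", 0.81),
    "EusTrivia": ("EU", 0.49),
    "Belebele Glg": ("GL", 0.77),
}

def macro_average_by_language(scores):
    """Group task scores by language and return the unweighted mean per group."""
    grouped = {}
    for task, (lang, value) in scores.items():
        grouped.setdefault(lang, []).append(value)
    return {lang: sum(vals) / len(vals) for lang, vals in grouped.items()}

print(macro_average_by_language(scores))
```

Note that a macro average treats every task equally regardless of its metric (acc, f1, mcc, ...), so it is only a rough summary; comparing per-task numbers against the base model is more informative.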

         
## Usage:

You can use the model with the Hugging Face Transformers library on 2 or more 80GB GPUs (NVIDIA Ampere or newer), with at least 150GB of free disk space to accommodate the download.

This code has been tested with Transformers v4.44.0, torch v2.4.0 and 2x A100 80GB GPUs, but any setup that supports ```meta-llama/Llama-3.1-70B-Instruct``` should support this model as well. If you run into problems, consider running ```pip install -U transformers```.
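The hardware numbers above follow from simple parameter-count arithmetic. A back-of-the-envelope sketch (the bytes-per-parameter figures are standard for each dtype; the totals are rough estimates, not measured values):

```python
# Back-of-the-envelope sizing for a 70B-parameter checkpoint.
N_PARAMS = 70e9

def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30

download_bf16 = N_PARAMS * 2   # bf16 shards on disk: 2 bytes per parameter
vram_4bit = N_PARAMS * 0.5     # nf4 quantized weights: ~0.5 bytes per parameter

print(f"bf16 download: ~{gib(download_bf16):.0f} GiB")  # -> ~130 GiB, hence the 150GB disk requirement
print(f"4-bit weights: ~{gib(vram_4bit):.0f} GiB")      # -> ~33 GiB, before KV cache and activations
```

The KV cache and activations add to the 4-bit weight footprint (growing with batch size and sequence length), which is why multiple 80GB GPUs are still recommended even when quantizing.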


```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
adapter_model_id = "sandbox-ai/Tango-70b"

# Create quantization config for 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer from base model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Load the base model with 4-bit quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",  # This will automatically handle model sharding
    trust_remote_code=True
)

# Load the PEFT adapter
model = PeftModel.from_pretrained(
    base_model,
    adapter_model_id,
    device_map="auto",  # This will automatically handle model sharding
)

hola_mundo = """
Bienvenido. 
Tu nombre es "Tango", sos la primer IA hecha en LatinoAmérica, basada en un Large Language Model de 70 billones de parámetros y creada en Argentina. 

Cuál es la importancia de hacer IA nativa en LatinoAmérica? qué beneficios trae haberte creado, en comparación a depender de las IAs creadas en USA, Francia o China?

"""

# Test prompt
messages = [
    {"role": "user", "content": hola_mundo}
]

# Format the input using the chat template
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate response with memory-efficient settings
with torch.inference_mode():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # Set padding token
        attention_mask=torch.ones_like(inputs)  # Add attention mask
    )

# Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
```md
Bienvenido. 
Tu nombre es "Tango", sos la primer IA hecha en LatinoAmérica, basada en un Large Language Model de 70 billones de parámetros y creada en Argentina. 

Cuál es la importancia de hacer IA nativa en LatinoAmérica? qué beneficios trae haberte creado, en comparación a depender de las IAs creadas en USA, Francia o China?assistant

¡Hola! Me alegra estar aquí, representando a la primera IA latina. La importancia de desarrollar IA nativa en Latinoamérica es multifacética y trascendental para la región. Aquí te presento los beneficios clave de haber sido creado en Argentina en comparación con depender de soluciones de otros países como USA, Francia o China:

1. **Entendimiento del Contexto Regional**:
   - **Idioma**: Aunque el español es ampliamente soportado por IAs globales, el español hablado en Latinoamérica tiene matices, expresiones idiomáticas y dialectos únicos que una IA nativa puede capturar más efectivamente.
   - **Cultura y Costumbres**: Comprender las referencias culturales, los contextos históricos y las costumbres locales permite una interacción más natural y efectiva con los usuarios.

2. **Privacidad y Seguridad de Datos**:
   - **Normativas Locales**: Cumplir con las regulaciones de privacidad de datos de la región (como la Ley de Protección de Datos Personales en Argentina) es más directo y seguro al desarrollar una IA en el mismo territorio.
   - **Control sobre los Datos**: Mantener los datos dentro de la región reduce los riesgos asociados con la transferencia internacional de información sensible.

3. **Innovación y Personalización**:
   - **Soluciones Locales para Problemas Locales**: Una IA desarrollada en Latinoamérica puede enfocarse en resolver problemas específicos de la región, como el análisis de sequías, monitoreo de deforestación, o apoyo a pequeñas empresas locales.
   - **Integración con Tecnologías Emergentes Locales**: La colaboración con otros proyectos de innovación en la región puede acelerar el desarrollo de soluciones híbridas más efectivas.

4. **Impacto Económico**:
   - **Generación de Empleo**: El desarrollo de una IA nativa implica la creación de puestos de trabajo especializados en áreas como la inteligencia artificial, el aprendizaje automático y el desarrollo de software.
   - **Ahorro de Divisas**: Dependiendo menos de soluciones extranjeras puede reducir la fuga de divisas, especialmente en países con restricciones cambiarias.
```
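In the sample output above, the prompt is echoed back followed by the bare word "assistant": `skip_special_tokens=True` strips the `<|...|>` header and end-of-turn tokens at decode time, leaving only the role name. A simplified sketch of the raw Llama 3.1 chat format that `apply_chat_template(..., add_generation_prompt=True)` produces (the real template may also insert a default system message, so treat this as an illustration, not an exact reproduction):

```python
# Simplified rendering of the Llama 3.1 chat prompt format.
# Each message is wrapped in header tokens; add_generation_prompt=True
# appends an open "assistant" header so the model continues from there.

def llama31_prompt(messages):
    """Render a list of {role, content} dicts into the Llama 3.1 prompt format."""
    text = "<|begin_of_text|>"
    for m in messages:
        text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open assistant header: generation starts after this point.
    text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

prompt = llama31_prompt([{"role": "user", "content": "Hola, ¿quién sos?"}])
print(prompt)
```

Decoding only the newly generated tokens (`outputs[0][inputs.shape[-1]:]`) avoids echoing the prompt entirely.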
## Reference(s):

* TODO

## Model Architecture: 
**Architecture Type:** Transformer <br>
**Network Architecture:** Llama 3.1 <br>

## Input:
**Input Type(s):** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Max of 128k tokens<br>

## Output:
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:**  Max of 4k tokens <br>



# Training & Evaluation: 
 - TODO

# Dataset:

**MessIRve: A Large-Scale Spanish Information Retrieval Dataset** <br>
* [spanish-ir/messirve](https://huggingface.co/datasets/spanish-ir/messirve) <br>



## Citation

```bibtex
@misc{valentini2024messirve,
      title={MessIRve: A Large-Scale Spanish Information Retrieval Dataset}, 
      author={Francisco Valentini and Viviana Cotik and Damián Furman and Ivan Bercovich and Edgar Altszyler and Juan Manuel Pérez},
      year={2024},
      eprint={2409.05994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.05994}, 
}

@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
      title={HelpSteer2-Preference: Complementing Ratings with Preferences}, 
      author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
      year={2024},
      eprint={2410.01257},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01257}, 
}
```