---
license: apache-2.0
datasets:
- mlsum
- squad_es
language:
- es
library_name: adapter-transformers
pipeline_tag: text2text-generation
---
# Model Card for opt-6.7b-lora-sag-t12000-v1200-v1

## Adapter Description

This adapter was created during the SomosNLP Hackathon 2023 with the [PEFT](https://github.com/huggingface/peft) library. It fine-tunes the base model [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) on the [SQUAD_ES](https://huggingface.co/datasets/squad_es) (v1.1.0) and [MLSUM](https://huggingface.co/datasets/mlsum) datasets using the *LoRA* method.

- **Developed by:**
  - 🇵🇪 Enrique Ubaldo
  - 🇵🇪 Fernando Alva-Manchego
  - 🇵🇪 @Levi111
  - 🇲🇽 @IvanHU
  - 🇨🇺 [Alberto Carmona Barthelemy](https://huggingface.co/milyiyo)
- **Model type:** Text2Text Generation
- **Language(s) (NLP):** Spanish
- **License:** apache-2.0
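
As a hedged reference, a typical PEFT/LoRA setup for `facebook/opt-6.7b` looks like the sketch below; the rank, alpha, dropout, and target modules are illustrative assumptions, not necessarily the exact values used to train this adapter.

```py
# Hedged sketch of a PEFT/LoRA setup for facebook/opt-6.7b.
# The LoRA hyperparameters below are assumptions for illustration only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", load_in_8bit=True, device_map="auto")
base_model = prepare_model_for_int8_training(base_model)

lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed scaling factor
    lora_dropout=0.05,                     # assumed dropout
    target_modules=["q_proj", "v_proj"],   # typical OPT attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```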


## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

This model is designed for instruction-style tasks in Spanish, specifically generating summaries, generating questions from a given text, and answering questions about a given context.
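
All three tasks use the instruction-style prompt format from the getting-started code further below, shown here for reference:

```py
# Instruction-style prompt template used by the helper functions below.
prompt = "<s>Instruction: {instruction}\nInput: {input}\nOutput: "
```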

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Please note that this model inherits biases from its original base model. You can review these biases by visiting the following [link](https://huggingface.co/facebook/opt-6.7b#limitations-and-bias).

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

```py
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

peft_model_id = "hackathon-somos-nlp-2023/opt-6.7b-lora-sag-t12000-v1200-v1"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

model.config.use_cache = True

generation_config = GenerationConfig(temperature=0.8,
                                     top_p=0.75,
                                     top_k=40)

def gen_summary(text):
  input_text = f'<s>Instruction: Elabora un resume del siguiente texto.\nInput: {text}\nOutput: '
  batch = tokenizer(input_text, return_tensors='pt')
  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, 
                                   max_new_tokens=256, 
                                   generation_config=generation_config)
  output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
  return output

def gen_question(text):
  input_text = f'<s>Instruction: Dado el siguiente texto quiero que generes una pregunta cuya respuesta se encuentre en él.\nInput: {text}\nOutput: '
  batch = tokenizer(input_text, return_tensors='pt')
  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, 
                                   max_new_tokens=256, 
                                   generation_config=generation_config)
  output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
  return output

def gen_qna(context, question):
  input_text = f"""<s>Instruction: Te voy a proporcionar un texto del cual deseo que me respondas una pregunta. 
    El texto es el siguiente: `{context}`\nInput: {question}\nOutput: """
  batch = tokenizer(input_text, return_tensors='pt')
  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, 
                                   max_new_tokens=256, 
                                   generation_config=generation_config)
  output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
  return output
```
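
As an example, the helpers above can be called as follows; the input text is an illustrative placeholder, not taken from the training data.

```py
# Illustrative usage of the helpers defined above.
context = ("El Amazonas es el río más caudaloso del mundo y atraviesa "
           "varios países de Sudamérica.")

print(gen_summary(context))
print(gen_question(context))
print(gen_qna(context, "¿Qué río es el más caudaloso del mundo?"))
```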

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [SQUAD_ES](https://huggingface.co/datasets/squad_es) (Subset: v1.1.0)
- [MLSUM](https://huggingface.co/datasets/mlsum) (Subset: es)

### Training Procedure 

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

We selected 4,000 training examples and 400 validation examples for each of the three tasks, giving 12,000 training examples and 1,200 validation examples in total.
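
A minimal sketch of how such per-task subsets can be drawn with the `datasets` library is shown below; the seed and split choices are assumptions, and the exact preprocessing is in the training Colab linked next.

```py
# Hedged sketch: sampling the per-task subsets described above.
# Seed and split choices are assumptions, not the exact preprocessing used.
from datasets import load_dataset

N_TRAIN, N_VAL = 4000, 400  # examples per task

squad_es = load_dataset("squad_es", "v1.1.0")
mlsum_es = load_dataset("mlsum", "es")

qa_train = squad_es["train"].shuffle(seed=42).select(range(N_TRAIN))
qa_val = squad_es["validation"].shuffle(seed=42).select(range(N_VAL))
sum_train = mlsum_es["train"].shuffle(seed=42).select(range(N_TRAIN))
sum_val = mlsum_es["validation"].shuffle(seed=42).select(range(N_VAL))
```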

The Colab notebook used for training is available [here](https://colab.research.google.com/drive/1mwLUNgsSLrRbHUiXJuaU_6EQaYrYg10A?usp=sharing).

#### Training Hyperparameters

- **Training regime:** fp16 <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
- **Steps:** 1240
- **Learning rate:** 2e-4
- **Training loss:** 1.013
- **Validation loss:** 1.53
- **Training duration:** 7.5 hours
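
These values roughly map onto `transformers.TrainingArguments` as in the hedged sketch below; only `max_steps`, `learning_rate`, and `fp16` come from the card, and the remaining settings are assumptions.

```py
# Hedged sketch: only max_steps, learning_rate and fp16 are reported above;
# batch size, accumulation and evaluation settings are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="opt-6.7b-lora-sag",       # hypothetical output directory
    max_steps=1240,
    learning_rate=2e-4,
    fp16=True,
    per_device_train_batch_size=4,        # assumed
    gradient_accumulation_steps=4,        # assumed
    evaluation_strategy="steps",          # assumed
    eval_steps=200,                       # assumed
    logging_steps=50,                     # assumed
    report_to="wandb",
)
```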

WandB link: [here](https://wandb.ai/milyiyo/huggingface/reports/eval-loss-train-loss-23-04-08-19-14-44---Vmlldzo0MDEwMzQ5?accessToken=o5f3pldc7np25ch8123rcjz4195vrcc9nl31r26i130jhhpvuabie3ezrw6dcs6r)



<!--## Environmental Impact -->

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

<!--Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** T4 16GB (Free Colab)
- **Hours used:** 7.5 hour
- **Cloud Provider:** Google Cloud Platform
- **Compute Region:** us-central1?
- **Carbon Emitted:** Total emissions are estimated to be 0.04 kgCO$_2$eq of which 100 percents were directly offset by the cloud provider.
-->