alecrosales1 committed on
Commit
22ccb9e
1 Parent(s): 245a879

Upload README.md

Files changed (1)
  1. README.md +167 -102
README.md CHANGED
@@ -2,199 +2,264 @@
2
  library_name: transformers
3
  tags:
4
  - unsloth
 
 
 
 
 
 
 
 
 
 
 
5
  ---
6
-
7
- # Model Card for Model ID
8
-
9
- <!-- Provide a quick summary of what the model is/does. -->
10
-
11
 
12
 
13
  ## Model Details
14
 
15
  ### Model Description
16
 
17
- <!-- Provide a longer summary of what this model is. -->
18
 
19
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
 
 
20
 
21
- - **Developed by:** [More Information Needed]
22
- - **Funded by [optional]:** [More Information Needed]
23
- - **Shared by [optional]:** [More Information Needed]
24
- - **Model type:** [More Information Needed]
25
- - **Language(s) (NLP):** [More Information Needed]
26
- - **License:** [More Information Needed]
27
- - **Finetuned from model [optional]:** [More Information Needed]
28
 
29
- ### Model Sources [optional]
30
 
31
- <!-- Provide the basic links for the model. -->
32
 
33
- - **Repository:** [More Information Needed]
34
- - **Paper [optional]:** [More Information Needed]
35
- - **Demo [optional]:** [More Information Needed]
36
 
37
- ## Uses
38
 
39
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
 
 
40
 
41
  ### Direct Use
 
42
 
43
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
44
 
45
- [More Information Needed]
46
 
47
- ### Downstream Use [optional]
48
 
49
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
50
 
51
- [More Information Needed]
52
 
53
- ### Out-of-Scope Use
54
 
55
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
56
 
57
- [More Information Needed]
 
58
 
59
- ## Bias, Risks, and Limitations
 
 
 
60
 
61
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
62
 
63
- [More Information Needed]
64
 
65
- ### Recommendations
66
 
67
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
68
 
69
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
70
 
71
- ## How to Get Started with the Model
72
 
73
- Use the code below to get started with the model.
74
 
75
- [More Information Needed]
76
 
77
- ## Training Details
 
 
 
 
 
 
 
 
 
78
 
79
- ### Training Data
80
 
81
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
82
 
83
- [More Information Needed]
84
 
85
- ### Training Procedure
86
 
87
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
88
 
89
- #### Preprocessing [optional]
 
 
 
 
90
 
91
- [More Information Needed]
92
 
93
 
94
  #### Training Hyperparameters
95
 
96
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
97
-
98
- #### Speeds, Sizes, Times [optional]
99
-
100
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
101
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
102
  [More Information Needed]
103
 
104
- ## Evaluation
105
 
106
- <!-- This section describes the evaluation protocols and provides the results. -->
107
 
108
- ### Testing Data, Factors & Metrics
109
 
110
- #### Testing Data
 
 
 
 
 
 
 
 
 
111
 
112
- <!-- This should link to a Dataset Card if possible. -->
113
 
114
- [More Information Needed]
115
 
116
- #### Factors
 
 
117
 
118
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
119
 
120
- [More Information Needed]
 
121
 
122
- #### Metrics
 
 
123
 
124
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
125
 
126
- [More Information Needed]
127
 
128
- ### Results
129
 
130
- [More Information Needed]
131
 
132
- #### Summary
133
 
 
134
 
 
 
 
135
 
136
- ## Model Examination [optional]
137
 
138
- <!-- Relevant interpretability work for the model goes here -->
 
 
139
 
140
- [More Information Needed]
141
 
142
- ## Environmental Impact
143
 
144
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
145
 
146
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
147
 
148
- - **Hardware Type:** [More Information Needed]
149
- - **Hours used:** [More Information Needed]
150
- - **Cloud Provider:** [More Information Needed]
151
- - **Compute Region:** [More Information Needed]
152
- - **Carbon Emitted:** [More Information Needed]
153
 
154
- ## Technical Specifications [optional]
155
 
156
- ### Model Architecture and Objective
 
 
 
 
 
157
 
158
- [More Information Needed]
 
 
 
 
 
159
 
160
- ### Compute Infrastructure
 
 
 
 
161
 
162
- [More Information Needed]
163
 
164
- #### Hardware
165
 
166
- [More Information Needed]
167
 
168
- #### Software
 
 
 
169
 
170
- [More Information Needed]
171
 
172
- ## Citation [optional]
173
 
174
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
175
 
176
- **BibTeX:**
177
 
178
- [More Information Needed]
179
 
180
- **APA:**
181
 
182
- [More Information Needed]
183
 
184
- ## Glossary [optional]
185
 
186
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
187
 
188
- [More Information Needed]
189
 
190
- ## More Information [optional]
191
 
192
- [More Information Needed]
 
 
 
 
193
 
194
- ## Model Card Authors [optional]
195
 
196
- [More Information Needed]
197
 
198
- ## Model Card Contact
199
 
200
- [More Information Needed]
 
2
  library_name: transformers
3
  tags:
4
  - unsloth
5
+ - LLMs-Aviation
6
+ - AI-Regulatory-Compliance
7
+ - RAC-AI-Colombia
8
+ license: apache-2.0
9
+ datasets:
10
+ - somosnlp/ColombiaRAC_FullyCurated
11
+ language:
12
+ - es
13
+ widget:
14
+ - text: >
15
+ <bos><start_of_turn>system\n\nYou are a helpful AI assistant.\n\nResponde en formato json.\n\nEres un agente experto en la normativa aeronautica Colombiana.<end_of_turn>\n\n<start_of_turn>user\n\n¿Qué sucede con las empresas de servicios aéreos comerciales que no hayan actualizado su permiso de operación después del 31 de marzo de 2024?<end_of_turn>\n\n<start_of_turn>model
16
  ---
17
+ # Model Card for GemmaColRAC-AeroExpert Language Model: Gemma 2B for Colombian Aviation Regulations 🛫
 
 
 
 
18
 
19
 
20
  ## Model Details
21
 
22
  ### Model Description
23
 
24
+ This document gives a detailed overview of `GemmaColRAC-AeroExpert`, the fifth iteration of our model specialized in Colombian aeronautical regulations. It represents a qualitative leap over previous versions, with improved accuracy and more efficient use of GPU resources, reflecting our commitment to the sustainable, high-quality development of AI technologies for aviation.
25
 
26
+ <p align="center">
27
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/0undo4kZc7OtfGI5nnAa8.png" alt="Image of the Colombian Aeronautical Regulations (Reglamento Aeronáutico Colombiano)" style="width: 40%; max-height: 550px;">
28
+ </p>
29
 
30
+ - **Developed by:** [Edison Bejarano](https://huggingface.co/ejbejaranos), [Nicolai Potes](https://huggingface.co/NickyNicky) and [Santiago Pineda](https://huggingface.co/Sapinedamo) ✨
31
+ - **Funded by:** Fundación Universitaria Los Libertadores, SomosNLP, HuggingFace
32
+ - **Model type:** Specialized Language Model for Colombian Aeronautical Regulations
33
+ - **Language(s):** Spanish (`es-CO`)
34
+ - **License:** apache-2.0 <!-- Choose the most permissive license possible, taking into account the license of the pre-trained model and the datasets used -->
35
+ - **Fine-tuned from model:** [More Information Needed] <!-- Link to the pre-trained model used as the base -->
36
+ - **Dataset used:** [RAC Corpus: Base de Datos del Reglamento Aeronáutico Colombiano 🛫📚🇨🇴](https://huggingface.co/datasets/somosnlp/Reglamento_Aeronautico_Colombiano_2024/blob/01bf7eebef40aaba374ffd30697582ab10ec3503/README.md)
37
 
 
38
 
 
39
 
 
 
 
40
 
41
+ ### Model Sources
42
 
43
+ - **Demo:** [Model Demo on HuggingFace Spaces](https://huggingface.co/spaces/somosnlp/ColombiaRAC-V1)
44
+ - **Video presentation:** [Aviación Inteligente: LLMs para Navegar el RAC | Hackathon](https://youtu.be/IGKU1qUur2c?si=Na4d3XIU3vbdaaJj)
45
+
46
+ ## Uses
47
 
48
  ### Direct Use
49
+ The model is designed to assist professionals and students in the aviation industry by giving them easier access to the Colombian Aeronautical Regulations through natural-language queries.
50
 
51
+ ### Out-of-Scope Use
52
 
53
+ This model is not intended for making legally binding decisions without human oversight.
54
 
55
+ ## Bias, Risks, and Limitations
56
 
57
+ The model may inherit biases from the data used for training, which primarily includes official legal texts. Users should exercise caution and not rely solely on the model for critical decision-making.
58
 
59
+ ## How to Get Started with the Model
60
 
61
+ Use the code below to get started with the model.
62
 
63
+ ```python
64
+ from transformers import AutoModelForCausalLM, AutoTokenizer
65
 
66
+ tokenizer = AutoTokenizer.from_pretrained("somosnlp/GemmaColRAC-AeroExpert")
67
+ model = AutoModelForCausalLM.from_pretrained("somosnlp/GemmaColRAC-AeroExpert")
68
 
69
+ # Tokenize an example query and generate a response (the token limit is an arbitrary example value)
70
+ encoded_input = tokenizer("Example query about aviation regulations", return_tensors="pt")
71
+ output = model.generate(**encoded_input, max_new_tokens=256)
72
+ ```
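The widget example in this card's metadata suggests the model expects a turn-based prompt with `system`, `user`, and `model` turns. The sketch below shows one way to build that prompt and pass it to the model. The helper names `build_prompt` and `generate_answer`, the `max_new_tokens` value, and the choice to turn the `\n\n` sequences of the widget text into real newlines are assumptions made for illustration; they are not taken from the training notebook.

```python
# System prompt copied from the widget example in the card metadata.
SYSTEM_PROMPT = (
    "You are a helpful AI assistant.\n\n"
    "Responde en formato json.\n\n"
    "Eres un agente experto en la normativa aeronautica Colombiana."
)


def build_prompt(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Assemble the turn-based prompt shown in the widget example (hypothetical helper)."""
    return (
        "<bos><start_of_turn>system\n\n"
        f"{system_prompt}<end_of_turn>\n\n"
        "<start_of_turn>user\n\n"
        f"{question}<end_of_turn>\n\n"
        "<start_of_turn>model"
    )


def generate_answer(
    question: str,
    model_id: str = "somosnlp/GemmaColRAC-AeroExpert",
    max_new_tokens: int = 256,  # assumption: arbitrary example limit
) -> str:
    """Load the model and generate an answer (hypothetical helper, downloads the weights)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # The prompt already starts with <bos>, so skip the tokenizer's own special tokens.
    inputs = tokenizer(build_prompt(question), return_tensors="pt", add_special_tokens=False)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_prompt("¿Qué es el RAC?")
print(prompt)
```

Calling `generate_answer("¿Qué es el RAC?")` would download the weights and return the decoded completion. Because the system prompt asks for JSON, the caller may want to parse the result with `json.loads` and handle the case where the output is not valid JSON.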
73
 
74
+ ## Training Details
75
 
76
+ ### Training Data
77
 
78
+ The model was trained on a curated dataset consisting of detailed question-answer pairs related to the Colombian Aeronautical Regulations.
79
 
 
80
 
 
81
 
 
82
 
83
+ ### Training Procedure
84
 
85
+ The model was fine-tuned from a base language model using the following specifications:
86
 
87
+ - **GPU Type:** NVIDIA GeForce RTX 3090
88
+ - **Total Training Time:** 12607 seconds
89
+ - **Optimizer:** AdamW with Bitfitting and Neutrino Noise
90
+ - **Max Steps:** 4904
91
+ - **Sequence Length:** 2048
92
+ - **Batch Size per Device:** 2
93
+ - **Transformers Version:** 4.39.2
94
+ - **Optimization Framework:** Unsloth 2024.4
95
+ - **Precision:** bf16, with `gradient_accumulation_steps` of 2
96
+ - **Activation Function:** gelu_pytorch_tanh
97
 
98
+ - [Notebook to train the model](https://colab.research.google.com/drive/1VmcSVvkaXVe-ya5ATDxKilPY9kN-x2_I?usp=sharing)
99
 
 
100
 
101
+ ### Comparison with Previous Version 🔄
102
 
103
+ The previous iteration, `GemmaColRAC-AeroExpertV4`, utilized an NVIDIA A100-SXM4-40GB GPU and was trained for approximately 50 minutes (3007 seconds). It operated with a learning rate of 0.00005 and used an 8-bit Paged AdamW optimizer. Furthermore, it was trained with a batch size per device of 1 and utilized version 4.39.0 of the Transformers library.
104
 
105
+ **Key differences with the current version include:**
106
 
107
+ - **GPU Change:** 🔁 Switched from NVIDIA A100-SXM4-40GB to the consumer-grade NVIDIA GeForce RTX 3090 (24 GB), lowering the hardware requirements for training.
108
+ - **Training Time:** ⏳ Increased from 3007 seconds to 12607 seconds to allow more extensive fine-tuning of the model, resulting in improved accuracy.
109
+ - **Batch Size:** 🔢 Increased the batch size per device from 1 to 2, allowing for more efficient optimization.
110
+ - **Optimizer Upgrade:** 🛠️ Introduction of advanced techniques such as Bitfitting and Neutrino Noise to enhance model convergence.
111
+ - **Maximum Steps:** 🚶‍♂️ Significantly increased the maximum steps from 1638 to 4904, suggesting a broader coverage of data and deeper learning.
112
 
113
+ These changes have resulted in a more robust and efficient version of our model, enhancing its capacity to assist and provide guidance in Colombian aeronautical regulation.
114
 
115
 
116
  #### Training Hyperparameters
117
 
118
+ - **Training regime:** bf16 mixed precision
119
+ - **Optimizer:** Paged AdamW 8-bit
120
+ - **Learning Rate:** 5e-5
121
+ - **Batch Size per Device:** 3
122
+ - **Gradient Accumulation Steps:** 4
123
+ - **Warmup Steps:** Computed as 3% of total steps
124
+ - **Max Steps:** 14,688
125
+ - **Total Training Time:** Approx. 5 hours 21 minutes (based on epochs and iteration speed)
126
+ - **Max Sequence Length:** 2048
127
+ - **Weight Decay:** 0.001
128
+ - **Learning Rate Scheduler:** Cosine
129
+ - **Adam Betas:** Beta1 = 0.99, Beta2 = 0.995
130
+ - **Max Gradient Norm:** 0.4
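Two quantities follow directly from the hyperparameters listed above: the effective batch size per optimizer step and the number of warmup steps. The sketch below works them out. The single-GPU setting (`num_devices = 1`) and rounding the warmup count down are assumptions; the card does not state how the training script rounds.

```python
# Values taken from the hyperparameter list above.
per_device_batch_size = 3
gradient_accumulation_steps = 4
max_steps = 14_688
warmup_ratio = 0.03  # "3% of total steps"

# Assumption: training ran on a single GPU.
num_devices = 1

# Samples that contribute to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Assumption: the warmup step count is rounded down to an integer.
warmup_steps = int(max_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch_size}")
print(f"Warmup steps: {warmup_steps}")
```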
131
+
132
+ #### Speeds, Sizes, Times
133
+
134
+ - **Training Duration:** Approx. 3 hours 30 minutes for full training
135
+ - **Training Throughput:** 0.76 iterations per second (it/s)
136
+ - **Total Steps:** 14,688 steps over 8 epochs
137
+ - **Checkpoint Size:** Final model size was not specified; typical sizes for models of this type are several gigabytes.
138
+ - **Total Number of Trainable Parameters:** 78,446,592
139
  [More Information Needed]
140
 
 
141
 
142
+ ### Metrics
143
 
144
+ Here is a detailed summary of the training metrics for `GemmaColRAC-AeroExpert`:
145
 
146
+ - **Total Floating Point Operations (FLOPs):** 204,241,541,673,615,360
147
+ - **Train Loss:** 0.393565042567292 (final reported loss)
148
+ - **Training Runtime:** 10,763.56 seconds (approximately 2.99 hours)
149
+ - **Samples per Second:** 4.556
150
+ - **Steps per Second:** 0.456
151
+ - **Total Training Epochs:** 2
152
+ - **Total Training Steps:** 4,904
153
+ - **Gradient Norm:** 3.515625
154
+ - **Final Learning Rate:** 0 (end of training)
155
+ - **Average Loss over Training:** 0.1934
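The throughput figures in this list can be cross-checked against each other with simple arithmetic. The sketch below recomputes steps per second and the runtime in hours from the reported runtime and step count, and derives the number of samples per step implied by the two throughput values. How those samples map onto batch size and gradient accumulation is not stated in the card, so the last value is only a derived estimate.

```python
# Values taken from the metrics list above.
train_runtime_s = 10_763.56
total_steps = 4_904
samples_per_second = 4.556

# Steps per second is the step count divided by the wall-clock runtime.
steps_per_second = total_steps / train_runtime_s

# Runtime converted from seconds to hours.
runtime_hours = train_runtime_s / 3600

# Samples processed per optimizer step, implied by the two throughput figures.
samples_per_step = samples_per_second / steps_per_second

print(f"Steps per second: {steps_per_second:.3f}")
print(f"Runtime: {runtime_hours:.2f} hours")
print(f"Implied samples per step: {samples_per_step:.1f}")
```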
156
 
157
+ ### Results
158
 
 
159
 
160
+ <p align="center">
161
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/zuEetm8ifT5e3QtHfBBVD.png" alt="Training Loss" style="width: 80%; max-height: 350px;">
162
+ </p>
163
 
 
164
 
165
+ ## Model Examination [optional]
166
+ The model's performance in simplifying RAC content was evaluated using feedback from aeronautical experts, with the aim of improving regulatory compliance and understanding.
167
 
168
+ <p align="center">
169
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6419c2f6b4adb0e101b17b6c/5iPvAhaTMnqRDBn2g7XIK.png" alt="Evaluation for model by Aeronautical experts" style="width: 40%; max-height: 550px;">
170
+ </p>
171
 
 
172
 
173
+ The table above shows strong overall performance, with an average score of 7 across 276 tests. However, RAC 3 scored low (mean 3.464, median 1), which marks it as an area needing improvement, while the high ratings for RAC 1 and RAC 5 point to strengths. These results suggest the model can be accurate and can generalize, though its handling of RAC 3 requires adjustment.
174
 
 
175
 
176
+ ## Environmental Impact 🌱
177
 
178
+ The development of `GemmaColRAC-AeroExpert` has been carried out with a strong focus on sustainability 🌿. Efforts have been made to optimize efficiency and minimize environmental impact, including reducing energy consumption and lowering the carbon footprint during the model's training process. These measures not only enhance operational efficiency but also align with our commitment to environmental responsibility 🌎.
179
 
180
+ ### Energy Consumption and Carbon Emissions 📉
181
 
182
+ - **Power Consumption:** 0.25 kW (250 watts)
183
+ - **Runtime Hours:** 3.6 hours
184
+ - **Carbon Intensity:** 475 gCO2eq per kWh (Global average)
185
 
186
+ Based on the use of an NVIDIA GeForce RTX 3090 GPU for approximately 3.6 hours, the carbon emissions were estimated as follows:
187
 
188
+ - **Hardware Type:** NVIDIA GeForce RTX 3090 GPU
189
+ - **Total Hours Used:** ~3.6 hours
190
+ - **Total Carbon Emitted:** Approximately 356.25 grams of CO₂ equivalents
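The estimate rests on a simple product: power draw times runtime gives energy in kWh, and energy times grid carbon intensity gives emissions. The sketch below applies that formula to the inputs listed above. The function name `carbon_emissions_g` is hypothetical, and this is a simplified version of what the calculator does. Note that the listed inputs (0.25 kW, 3.6 hours, 475 gCO2eq/kWh) multiply to 427.5 g, while the reported 356.25 g corresponds to a runtime of 3.0 hours at the same power and intensity.

```python
def carbon_emissions_g(power_kw: float, hours: float, intensity_g_per_kwh: float) -> float:
    """Estimate emissions as energy used (kWh) times grid carbon intensity (hypothetical helper)."""
    energy_kwh = power_kw * hours
    return energy_kwh * intensity_g_per_kwh


POWER_KW = 0.25           # power consumption listed above
CARBON_INTENSITY = 475.0  # gCO2eq per kWh, global average listed above

# Emissions for the listed 3.6-hour runtime.
listed_runtime_g = carbon_emissions_g(POWER_KW, 3.6, CARBON_INTENSITY)

# Emissions for a 3.0-hour runtime, which reproduces the reported total.
three_hours_g = carbon_emissions_g(POWER_KW, 3.0, CARBON_INTENSITY)

print(f"3.6 h: {listed_runtime_g:.2f} g CO2eq")
print(f"3.0 h: {three_hours_g:.2f} g CO2eq")
```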
191
 
192
+ These carbon emissions were calculated using the [Machine Learning Impact Calculator](https://mlco2.github.io/impact#compute) introduced in Lacoste et al. (2019), which considers hardware type, runtime, and other relevant factors to provide a comprehensive view of the environmental impact of training large AI models 📊.
193
 
194
+ This proactive approach to understanding and mitigating our ecological footprint underlines our commitment to pioneering environmentally friendly AI development practices, setting a benchmark for sustainability within the AI industry 🌟.
195
 
196
+ #### Hardware
197
 
198
+ - **Hardware Used:** NVIDIA GeForce RTX 3090
199
 
200
+ #### Software 🛠️
 
 
 
 
201
 
202
+ The `GemmaColRAC-AeroExpert` model was developed and trained using a comprehensive stack of modern software libraries designed for high-performance machine learning tasks, particularly in Natural Language Processing (NLP). Here are the key libraries and tools used:
203
 
204
+ - **Python Libraries:**
205
+ - `json`: For parsing JSON files and handling serialization 📄.
206
+ - `pandas`: A powerful data manipulation and analysis library providing data structures and operations for manipulating numerical tables and time series 📊.
207
+ - `torch`: PyTorch is an open-source machine learning library used for applications such as computer vision and natural language processing, developed by Facebook's AI Research lab (FAIR) 🔥.
208
+ - `datasets`: A lightweight and extensible library to easily share and access datasets and evaluation metrics for machine learning tasks 📚.
209
+ - `huggingface_hub`: Used for managing model repositories on Hugging Face and interacting with Hugging Face Hub APIs 🌐.
210
 
211
+ - **Hugging Face Ecosystem:**
212
+ - `transformers`: Provides thousands of pre-trained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages. It's designed to be both user-friendly for machine learning researchers and efficient to use in production 🤖.
213
+ - `BitsAndBytesConfig`, `TrainingArguments`: Advanced configurations from the Transformers library for fine-tuning the performance and efficiency of training neural networks ⚙️.
214
+ - `pipeline`: A utility for creating easy-to-use pipelines for various NLP tasks 🧪.
215
+ - `AutoModelForCausalLM`, `AutoTokenizer`: Utilities for loading and initializing pre-trained language models and their tokenizers 📝.
216
+ - `logging`: For configuring the logging level and output formats to track model training and inference processes effectively 📌.
217
 
218
+ - **PEFT and LoRA Extensions:**
219
+ - `LoraConfig`, `PeftModel`: Extensions from the PEFT (Parameter Efficient Fine-Tuning) library, which include LoRA (Low-Rank Adaptation of large models), allowing efficient fine-tuning and adaptation of large pre-trained models with minimal computational overhead 🚀.
220
+
221
+ - **Transformer Reinforcement Learning (TRL):**
222
+ - `SFTTrainer`: The supervised fine-tuning trainer from the TRL library, used to fine-tune the model on the question-answer dataset 🎮.
223
 
224
+ These tools collectively support the robust training environment necessary to develop state-of-the-art NLP models like `GemmaColRAC-AeroExpert`, ensuring that the model is both highly effective and efficient in processing and understanding complex regulatory texts.
225
 
226
+ ## License 📜
227
 
228
+ `GemmaColRAC-AeroExpert` is released under the Apache 2.0 license 🏷️. This license is one of the most permissive and widely used licenses in the open-source community, allowing for both academic and commercial use without significant restrictions.
229
 
230
+ - **Why Apache 2.0?** 🤔
231
+ - **Openness:** The Apache 2.0 license allows users to use, modify, and distribute the software freely, which encourages innovation and widespread use.
232
+ - **Protection:** It provides an explicit grant of patent rights from contributors to users, protecting them from patent litigation.
233
+ - **Commercial friendly:** Apache 2.0 is business-friendly, allowing the commercial use of the software which is crucial for wider adoption in industry settings.
234
 
235
+ By choosing Apache 2.0, we ensure that `GemmaColRAC-AeroExpert` can be freely used and integrated into a wide array of projects and products, from academic research to commercial applications, thus supporting the growth and accessibility of AI technologies across different sectors 🌐.
236
 
 
237
 
 
238
 
 
239
 
240
+ ## Glossary [optional]
241
 
242
+ - **RAC**: Reglamento Aeronáutico Colombiano (Colombian Aeronautical Regulations)
243
 
244
+ ## More Information
245
 
246
+ <!-- State here the framework in which the project was developed. This section can include acknowledgments and more information about the team members. Adapt the example as you see fit. -->
247
 
248
+ This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP. The model was trained using GPUs sponsored by their own team.
249
 
250
+ ## Team 👥
251
 
252
+ The development of the `GemmaColRAC-AeroExpert` model was supported by a dedicated team of experts specializing in machine learning, natural language processing, and aeronautics. Below are the key team members who contributed significantly to this project:
253
 
254
+ - [Edison Bejarano](https://huggingface.co/ejbejaranos) - Lead AI Scientist, expert in NLP and machine learning, with a strong background in aeronautics.
255
+ - [Nicolai Potes](https://huggingface.co/NickyNicky) - Data Scientist, specializes in AI-driven regulatory compliance solutions.
256
+ - [Santiago Pineda](https://huggingface.co/Sapinedamo) - Project Manager and Senior ML Engineer, with extensive experience in deploying scalable AI solutions.
257
+ - [Alec Mauricio](https://huggingface.co/alecrosales1) - AI Researcher, focused on developing innovative models for text analysis and interpretation.
258
+ - [Danny Stevens](https://huggingface.co/dannystevens) - Software Engineer, provides expertise in software development and integration for machine learning applications.
259
 
260
+ These individuals bring a wealth of knowledge and expertise, ensuring the highest quality and performance of the `GemmaColRAC-AeroExpert` model. Their collaborative efforts have been pivotal in pushing the boundaries of what's possible with AI in the aviation sector.
261
 
 
262
 
263
+ ## Contact [optional]
264
 
265
+ Ejbejaranos@gmail.com