---
language:
- es
license: apache-2.0
library_name: transformers, peft
tags:
- trl
- sft
- generated_from_trainer
base_model: google/gemma-7b
datasets:
- somosnlp/instruct-legal-refugiados-es
---

# Model Card for gemma-7b-it-legal-refugiados-es

Spain is the third EU country by number of asylum applications, receiving more than 100,000 applications each year, and the third lowest by number of approvals. The main objective of this project is to support NGOs and other institutions working in this field and help them obtain answers to questions (QA) related to refugee legislation in Spanish.

## Model Details

### Model Description

The objective of this model is to facilitate question answering (QA) tasks pertaining to Spanish refugee legislation, drawing on its refined understanding of the nuances and intricacies of this legal domain. This model is a fine-tuned version of [google/gemma-7b](https://huggingface.co/google/gemma-7b) on the dataset [AsistenciaRefugiados](https://huggingface.co/datasets/somosnlp/instruct-legal-refugiados-es).

This is the model card of a 🤗 transformers model that has been pushed to the Hub to allow public access.

- **Developed by:** [Alvaro Hidalgo](https://huggingface.co/hacendado), [Eduardo Muñoz](https://huggingface.co/edumunozsala), [Teresa Martin](https://huggingface.co/narhim)
- **Funded by:** SomosNLP, HuggingFace
- **Model type:** Language model, instruction tuned
- **Language(s):** es-ES, es-MX, es-VE
- **License:** apache-2.0
- **Fine-tuned from model:** [google/gemma-7b](https://huggingface.co/google/gemma-7b)
- **Dataset used:** [AsistenciaRefugiados](https://huggingface.co/datasets/somosnlp/instruct-legal-refugiados-es)

### Model Sources

- **Repository:** Notebook in [this repo](https://huggingface.co/somosnlp/gemma-7b-it-legal-refugee-v0.1.1)
- **Demo:** [Demo Space](https://huggingface.co/spaces/somosnlp/QA-legal-refugiados)
- **Video presentation:** [YouTube video](https://www.youtube.com/watch?v=1OqHDE5LKMI&list=PLTA-KAy8nxaASMwEUWkkTfMaDxWBxn-8J&index=3)

### Model Family

This model is a fine-tuned version of [google/gemma-7b](https://huggingface.co/google/gemma-7b).

## Uses

### Direct Use

The primary objective of this model is to facilitate question answering (QA) tasks pertaining to Spanish refugee legislation, drawing on its refined understanding of the nuances and intricacies of this legal domain.

### Downstream Use

Intended to be used for question answering over a provided context and for text generation.

### Out-of-Scope Use

Misuse includes any application that promotes unethical practices, misinterprets refugee law, or uses the model for malicious purposes. The model is not designed to replace professional legal advice.

## Bias, Risks, and Limitations

The model, while powerful, has limitations inherent to AI, including biases present in the training data. It may not cover all nuances of refugee regulations or adapt to changes in law without updates.

### Recommendations

[More Information Needed]

## How to Get Started with the Model

Use the code below to get started with the model.
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

model_id = "somosnlp/gemma-7b-it-legal-refugiados-es"
tokenizer_id = "somosnlp/gemma-7b-it-legal-refugiados-es"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Load the model in 4-bit to speed up inference
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)

# Build the text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define the eos token for the model (ChatML end-of-turn marker)
eos_token = tokenizer("<|im_end|>", add_special_tokens=False)["input_ids"][0]

def generate_inference(instruction, context, temperature):
    # Render the user turn with the model's chat template
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": f"{instruction}\n{context}"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        num_beams=1,
        temperature=float(temperature),
        top_k=50,
        top_p=0.95,
        max_time=300,
        eos_token_id=eos_token,
    )
    # Strip the prompt from the generated text
    return outputs[0]["generated_text"][len(prompt):].strip()

instruction = "¿Podrías explicarme brevemente los hechos que originan el procedimiento y las posibles calificaciones, así como las sanciones correspondientes, según lo expuesto en el contexto?"

context = "b) Hechos que motivan la incoación del procedimiento sucintamente expuestos, su posible calificación y las sanciones que pudieran corresponder, sin perjuicio de lo que resulte de la instrucción. c) Instructor y, en su caso, secretario del procedimiento, con expresa indicación del régimen de recusación de éstos. d) Órgano competente para la resolución del expediente y norma que le atribuye tal competencia. e) Indicación de la posibilidad de que el presunto responsable pueda reconocer voluntariamente su responsabilidad. f) Medidas de carácter provisional que se hayan acordado por el órgano competente para iniciar el procedimiento sancionador, sin perjuicio de las que se puedan adoptar durante éste de conformidad con los artículos 55 y 61 de la Ley Orgánica 4/2000, de 11 de enero. g) Indicación del derecho a formular alegaciones y a la audiencia en el procedimiento y de los plazos para su ejercicio. 2. El acuerdo de iniciación se comunicará al instructor con traslado de cuantas actuaciones existan al respecto y se notificará a los interesados, entendiéndose en todo caso por tal al expedientado. En la notificación se advertirá a los interesados que, de no efectuar alegaciones sobre el contenido de la iniciación del procedimiento en el plazo previsto en el artículo siguiente, no realizarse propuesta de prueba o no ser admitidas, por improcedentes o innecesarias, las pruebas propuestas, la iniciación podrá ser considerada propuesta de resolución cuando contenga un pronunciamiento preciso acerca de la responsabilidad imputada, con los efectos previstos en los artículos 229 y 230."

response = generate_inference(instruction, context, 0.3)
print(f"Response:\n{response}")
```

## Training Details

### Training Data

The dataset used was [instruct-legal-refugiados-es](https://huggingface.co/datasets/somosnlp/instruct-legal-refugiados-es), which we adapted to a ChatML format as described in the next section; a rough sketch of that adaptation follows.
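As an illustration (not the exact training script), the sketch below loads the dataset and renders each example as a single ChatML string using the chat template shipped with the tokenizer mentioned in the Preprocessing section. The column names `instruction`, `input`, and `output` are assumptions made here for the example; check the dataset card for the actual schema.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer that ships a ChatML chat template for Gemma
tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

dataset = load_dataset("somosnlp/instruct-legal-refugiados-es", split="train")

def to_chatml(example):
    # NOTE: "instruction", "input", and "output" are assumed column names,
    # used here only for illustration.
    messages = [
        {"role": "user", "content": f"{example['instruction']}\n{example['input']}"},
        {"role": "assistant", "content": example["output"]},
    ]
    # Render the whole conversation as one ChatML-formatted training string
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

dataset = dataset.map(to_chatml)
print(dataset[0]["text"])
```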
### Training Procedure

The training was done on an RTX 4090 from Vast.ai using PEFT and LoRA.

#### Preprocessing

We wanted to build a conversational model, so we investigated the base model's prompt format and based the conversation layout on the [ChatML format](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md#working-with-chat-markup-language-chatml). We identified the special tokens so the model could understand the different roles in the conversation.

Example:

```
<|im_start|>system
You are Gemma.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing great. How can I help you today?<|im_end|>
```

So we used [Phil Schmid's Gemma ChatML tokenizer](https://huggingface.co/philschmid/gemma-tokenizer-chatml) to adapt our dataset for training.

#### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 8
- seed: 66
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 3
- **Training regime:** 4-bit precision with LoRA adapters

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1 x RTX 4090
- **Hours used:** 4
- **Cloud Provider:** Vast.ai
- **Compute Region:** West Europe
- **Carbon Emitted:** 350 W x 4 h = 1.4 kWh x 0.57 kg eq. CO2/kWh ≈ 0.8 kg eq. CO2

## Technical Specifications

### Model Architecture and Objective

The base model is [google/gemma-7b](https://huggingface.co/google/gemma-7b), fine-tuned in 4-bit.

### Compute Infrastructure

#### Hardware

1 x RTX 4090 GPU by Vast.ai.

#### Software

Libraries:

- transformers
- bitsandbytes
- accelerate
- xformers
- trl
- peft
- wandb

## License

This model is under the license of the Gemma models by Google. Link to consent: https://www.kaggle.com/models/google/gemma/license/consent

## Citation

**BibTeX:**

```
@software{somosnlp2024asistenciarefugiados,
  author = {Alvaro Hidalgo and Eduardo Muñoz and Teresa Martín},
  title = {gemma-7b-it-legal-refugiados-es},
  month = apr,
  year = {2024},
  url = {https://huggingface.co/somosnlp/gemma-7b-it-legal-refugee-v0.1.1}
}
```

## More Information

This project was developed during the [Hackathon #Somos600M](https://somosnlp.org/hackathon) organized by SomosNLP. The model was trained using GPUs sponsored by HuggingFace.

**Team:** [Alvaro Hidalgo](https://huggingface.co/hacendado), [Eduardo Muñoz](https://huggingface.co/edumunozsala), [Teresa Martin](https://huggingface.co/narhim)

## Contact

[More Information Needed]