base_model:
  - meta-llama/Llama-3.1-8B-Instruct
license: llama3.1
language:
  - gl
metrics:
  - bleu
  - rouge
model-index:
  - name: Llama-3.1-8B-Instruct-Galician
    results:
      - task:
          type: text-generation
        dataset:
          name: alpaca_data_galician
          type: alpaca_data_galician
        metrics:
          - name: bleu
            type: bleu-4
            value: 23.13
          - name: rouge
            type: rouge-l
            value: 21.84
pipeline_tag: text-generation
library_name: transformers
widget:
  - text: Onde está o concello de Frades?
    output:
      text: >-
        Frades é un concello da provincia da Coruña, pertencente á comarca de
        Ordes. Está situado a 15 quilómetros de Santiago de Compostela.

Llama-3.1-8B-Instruct-Galician a.k.a. Cabuxa 2.0

This model is the result of continued pretraining of meta-llama/Llama-3.1-8B-Instruct on the CorpusNós dataset.

Model Description

How to Get Started with the Model

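import transformers
import torch

model_id = "irlab-udc/Llama-3.1-8B-Instruct-Galician"

# Build a text-generation pipeline. Weights are loaded in bfloat16, and
# device_map="auto" places them on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: the system prompt asks the model to answer in Galician.
messages = [
    {"role": "system", "content": "You are a conversational AI that always responds in Galician."},
    {"role": "user", "content": "Cal é a principal vantaxe de usar Scrum?"},
]

# Generate at most 512 new tokens for this conversation.
outputs = pipeline(messages, max_new_tokens=512)

# The output holds the whole conversation; the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])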
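
For multi-turn use, the message list can be extended with the model's reply before the next user turn. The following is a minimal sketch of that bookkeeping only. `build_messages` and `add_turn` are hypothetical helper names, not part of transformers, and the follow-up question is an invented example. The resulting list is assumed to be passed to the same `pipeline` call shown above.

```python
# Hypothetical helpers for assembling chat messages; not part of transformers.
SYSTEM_PROMPT = "You are a conversational AI that always responds in Galician."


def build_messages(user_prompt, system_prompt=SYSTEM_PROMPT):
    """Return a fresh conversation with a system prompt and one user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def add_turn(messages, assistant_reply, next_user_prompt):
    """Return a new list extended with the model's reply and a follow-up question."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_prompt},
    ]


messages = build_messages("Cal é a principal vantaxe de usar Scrum?")
# After a first pipeline call, the reply would be read from
# outputs[0]["generated_text"][-1]["content"]; a placeholder stands in here.
messages = add_turn(messages, "<model reply>", "Podes dar un exemplo?")
```

Returning a new list instead of mutating the input keeps earlier conversation states reusable, which is a design choice of this sketch and not something the model requires.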

Training Hyperparameters

| Parameter | Value |
|---|---|
| learning_rate | 0.0001 |
| train_batch_size | 32 |
| eval_batch_size | 1 |
| seed | 42 |
| distributed_type | multi-GPU |
| num_devices | 4 |
| gradient_accumulation_steps | 2 |
| total_train_batch_size | 256 |
| total_eval_batch_size | 4 |
| optimizer | Adam with betas=(0.9, 0.999), epsilon=1e-08 |
| lr_scheduler_type | cosine |
| lr_scheduler_warmup_ratio | 0.1 |
| num_epochs | 1.0 |
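
The total batch sizes listed above follow from the per-device values, and the learning rate combines a warmup phase with cosine decay. The sketch below reproduces that arithmetic and illustrates one common form of the schedule. It rests on assumptions: the source states only the scheduler type and warmup ratio, so linear warmup and decay to zero are assumed here, `lr_at` is a hypothetical helper, and `total_steps = 1000` is an illustrative value, not the real step count.

```python
import math

# Values taken from the hyperparameter list.
train_batch_size = 32            # per device
eval_batch_size = 1              # per device
num_devices = 4
gradient_accumulation_steps = 2
learning_rate = 1e-4
warmup_ratio = 0.1

# Effective batch sizes across devices and accumulation steps.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices


def lr_at(step, total_steps, base_lr=learning_rate, warmup=warmup_ratio):
    """Assumed schedule: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical value, for illustration only
print(total_train_batch_size, total_eval_batch_size)
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
```

Under these assumptions the rate peaks at the configured 0.0001 once warmup ends (10% of the steps) and then falls smoothly for the rest of training.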

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 2.0606 | 0.1682 | 900 | 2.0613 |
| 1.9898 | 0.3363 | 1800 | 1.9929 |
| 1.9847 | 0.5045 | 2700 | 1.9613 |
| 1.9577 | 0.6726 | 3600 | 1.9445 |
| 1.9287 | 0.8408 | 4500 | 1.9368 |
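
The logged epoch fractions allow a rough estimate of the number of optimizer steps in one epoch, and the validation loss can be converted to perplexity. The sketch below does both. It assumes the reported loss is the mean per-token cross-entropy in nats, which the source does not state, so the perplexity values are indicative only.

```python
import math

# (training loss, epoch, step, validation loss) rows from the results above.
rows = [
    (2.0606, 0.1682, 900, 2.0613),
    (1.9898, 0.3363, 1800, 1.9929),
    (1.9847, 0.5045, 2700, 1.9613),
    (1.9577, 0.6726, 3600, 1.9445),
    (1.9287, 0.8408, 4500, 1.9368),
]

# Rough steps-per-epoch estimate from each logged epoch fraction.
steps_per_epoch_estimates = [step / epoch for _, epoch, step, _ in rows]

# Perplexity, assuming the loss is mean cross-entropy in nats.
val_perplexities = [math.exp(val_loss) for *_, val_loss in rows]

for (_, _, step, _), estimate, ppl in zip(rows, steps_per_epoch_estimates, val_perplexities):
    print(f"step {step}: ~{estimate:.0f} steps/epoch, validation perplexity ~{ppl:.2f}")
```

The estimates differ slightly from row to row because the epoch fractions are rounded to four decimals.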

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 4x NVIDIA A100 SXM4 80 GB (TDP of 400W)
  • Hours used: 60
  • Cloud Provider: Private infrastructure
  • Carbon Emitted: 10.37 kg CO₂ eq.
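
Estimates of this kind typically multiply the energy drawn by the carbon intensity of the electricity supply. The sketch below illustrates that arithmetic with the figures listed above. The carbon intensity of 0.432 kg CO₂ eq. per kWh is an assumed value, not given in the source, and `estimate_emissions_kg` is a hypothetical helper. The source also does not say whether the 60 hours are wall-clock time or total GPU-hours, so both readings are computed.

```python
def estimate_emissions_kg(power_kw, hours, carbon_intensity_kg_per_kwh):
    """Energy (kWh) times carbon intensity gives kg CO2 eq."""
    energy_kwh = power_kw * hours
    return energy_kwh * carbon_intensity_kg_per_kwh


TDP_KW = 0.4               # 400 W per GPU, from the list above
HOURS = 60                 # from the list above
NUM_GPUS = 4               # from the list above
CARBON_INTENSITY = 0.432   # kg CO2 eq. per kWh; ASSUMED, not stated in the source

# Reading 1: the 60 hours are total GPU-hours.
as_gpu_hours = estimate_emissions_kg(TDP_KW, HOURS, CARBON_INTENSITY)
# Reading 2: the 60 hours are wall-clock time on all four GPUs.
as_wall_clock = estimate_emissions_kg(TDP_KW * NUM_GPUS, HOURS, CARBON_INTENSITY)

print(f"GPU-hours reading:  {as_gpu_hours:.2f} kg CO2 eq.")
print(f"Wall-clock reading: {as_wall_clock:.2f} kg CO2 eq.")
```

With these assumed inputs the first reading comes out at about 10.37 kg, close to the reported figure, but the inputs actually used for the reported value are not documented here.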

Citation

@inproceedings{bao-perez-parapar-xovetic-2024,
  title={Adapting Large Language Models for Underrepresented Languages},
  author={Eliseo Bao and Anxo Pérez and Javier Parapar},
  booktitle={VII Congreso XoveTIC: impulsando el talento cient{\'\i}fico},
  year={2024},
  organization={Universidade da Coru{\~n}a, Servizo de Publicaci{\'o}ns},
  abstract = {The popularization of Large Language Models (LLMs), especially with the development of conversational systems, makes mandatory to think about facilitating the use of artificial intelligence (AI) to everyone. Most models neglect minority languages, prioritizing widely spoken ones. This exacerbates their underrepresentation in the digital world and negatively affects their speakers. We present two resources aimed at improving natural language processing (NLP) for Galician: (i) a Llama 3.1 instruct model adapted through continuous pre-training on the CorpusNos dataset; and (ii) a Galician version of the Alpaca dataset, used to assess the improvement over the base model. In this evaluation, our model outperformed both the base model and another Galician model in quantitative and qualitative terms}
}