Llama3-8B-Valencian

Model description

Llama3-8B-Valencian is a text-generation model for causal language modeling with a decoder-only architecture. It was obtained through continued pre-training of Meta-Llama-3-8B, with an emphasis on data in the Valencian language (closely related to Catalan). Concretely, this first version of the model was trained on a total of 1.304 million tokens per epoch, for two epochs over the data. The political and administrative domains are heavily represented in this version of the model.

The model uses Meta-Llama-3-8B as its base and shares the same tokenizer.

Intended uses and limitations

Llama3-8B-Valencian is a base model for causal language modeling. It can be used as is for text generation, although fine-tuning or instruction-tuning on specific tasks is recommended for its final use.

This language model has been trained with data in a formal register, mainly from the administrative and political domains, so text-generation tasks can be expected to produce text in the same style.
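
As a starting point for such task-specific adaptation, the sketch below shows one possible way to attach LoRA adapters to the model with the PEFT library. The adapter hyperparameters and target modules are illustrative assumptions, not settings used or recommended by the model authors.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    "gplsi/Llama3-8B-Valencian",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative LoRA configuration (rank, scaling and target modules are assumptions)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model can then be fine-tuned on a task-specific dataset
# with the usual transformers Trainer / SFT workflow.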

How to use

import torch
from transformers import pipeline, AutoTokenizer

model_id = "gplsi/Llama3-8B-Valencian"
input_text = "Les corts valencianes han pres la decisió de"

# Load the tokenizer shipped with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a text-generation pipeline in bfloat16, letting transformers
# place the model on the available devices
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample a continuation of the prompt
generation = generator(
    input_text,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)
print(f"Result: {generation[0]['generated_text']}")
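
Because do_sample=True is set, the output is stochastic; transformers.set_seed can be called beforehand for reproducible samples, and a max_new_tokens argument can be passed in the same call to bound the length of the generated continuation.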

Training

Training data

The training corpus was obtained by web scraping public data from different sources, such as the Official Bulletin of the University of Alicante (BOUA), the Official Journal of the Generalitat Valenciana (DOGV), and the session transcripts provided by the Valencian Parliament (DSCV and DSCCV), giving a total of 1.304 million tokens, as detailed in the following table.

| Dataset | Language | Total Sentences | Total Words | Total Numbers | Other Symbols | Unique Words | Total Tokens | Average Sentence Length | Average Word Length |
|---|---|---|---|---|---|---|---|---|---|
| BOUA | va | 0.606M | 12.355M | 0.488M | 0.055M | 0.211M | 12.899M | 21.27 | 4.89 |
| DOGCV | va | 4.569M | 50.566M | 6.339M | 0.613M | 17.436M | 57.517M | 12.59 | 4.68 |
| DOGV | va | 18.598M | 311.380M | 24.138M | 2.731M | 11.416M | 338.250M | 18.19 | 4.88 |
| DSCCV | va | 2.353M | 46.116M | 0.554M | 2.352m | 5.031M | 46.672M | 19.84 | 4.56 |
| DSCV | va | 1.646M | 32.496M | 0.433M | 1.427m | 3.796M | 32.930M | 20.01 | 4.65 |
| UN | va | 0.394M | 12.289M | 0.253M | 0.015M | 0.533M | 12.556M | 31.86 | 4.86 |
| VJ | va | 0.913M | 23.594M | 0.466M | 23.314m | 0.849M | 24.084M | 26.39 | 4.57 |

Several of the scraped sources had already been used in the training of Meta-Llama-3-8B, so the cut-off date of that model's data collection was taken into account, and those web pages were only scraped from that date onwards.

Information on the datasets used for training is shown below:

  • Official Bulletin of the University of Alicante (BOUA): documents issued by the University of Alicante concerning grants, regulations, and various legal resolutions, published periodically; specifically, the Valencian-language version is used.

  • Legacy Official Journal of the Generalitat Valenciana (DOGCV): historical documents issued by the Valencian Community. These documents were initially recorded on paper and later digitised when the digital format was standardised. They cover the same subject matter as the DOGV documents but were produced between 1980 and 1997.

  • Official Journal of the Generalitat Valenciana (DOGV): These documents contain official communications of the Valencian Community. They mainly deal with issuing laws, legal measures, and public sector communication. These journals were issued from 1998 to 2023.

  • Valencian Parliament Diary Dataset (DSCCV): records from the various committee meetings held in the parliament, with each meeting documented in a separate text file.

  • Journal of the Valencian Parliament (DSCV): transcripts of the different meetings held in the parliament's plenary sessions, with data from 1999 to 2022.

  • University news (UN): news in a more colloquial register from universities where Valencian is an official language, including the University of Valencia, the University of Alicante, Jaume I University, and the Polytechnic University of Valencia.

  • Valencian Journals (VJ): a collection of 10 different Valencian periodicals written in a colloquial register, which complements the legal and bureaucratic documents of the previous sources with everyday language.

Training parameters

During training, a large context window was desired for text generation, so an input size of 2048 tokens was used, with a minimum context window of 512 tokens when input sequences had to be truncated. The data was split into a training fraction and an evaluation fraction, as summarized together with the remaining parameters in the following table:

| Parameter | Value |
|---|---|
| Epochs | 2 |
| Learning Rate | 2e-5 |
| Warmup Steps | 0 |
| Precision | bf16 |
| Weight decay | 1e-1 |
| Training Fraction | 0.95 |
| Evaluation Fraction | 0.05 |
| Input size (tokens) | 2048 |
| Minimum context window (tokens) | 512 |
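
The input size and minimum context window above can be read as a chunking policy for the corpus. The sketch below shows one plausible interpretation, assuming documents are tokenized and split into 2048-token blocks and that a trailing block is kept only if it still contains at least 512 tokens; this is an assumption about the preprocessing, not the authors' exact pipeline.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gplsi/Llama3-8B-Valencian")

MAX_LEN = 2048  # input size used during training
MIN_LEN = 512   # minimum context window kept when truncating

def chunk_documents(documents):
    # Tokenize each document and split it into MAX_LEN-token blocks,
    # discarding any trailing block shorter than MIN_LEN tokens.
    blocks = []
    for text in documents:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        for start in range(0, len(ids), MAX_LEN):
            block = ids[start:start + MAX_LEN]
            if len(block) >= MIN_LEN:
                blocks.append(block)
    return blocks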

Distributed Training Strategy

A distributed training strategy called Fully Sharded Data Parallel (FSDP) was used. With this strategy, the model was sharded across the 4 A100 GPUs available for training, with a per-device mini-batch size of 1 and a total of 64 gradient accumulation steps.
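
A minimal sketch of how these settings could be expressed with the Hugging Face Trainer and its built-in FSDP support is shown below; the exact launcher, FSDP wrapping policy, and other options used by the authors are not specified in this card, so those parts are assumptions.

from transformers import TrainingArguments

# Hyperparameters taken from the tables above; the FSDP options are illustrative
training_args = TrainingArguments(
    output_dir="llama3-8b-valencian",
    num_train_epochs=2,
    learning_rate=2e-5,
    warmup_steps=0,
    weight_decay=0.1,
    bf16=True,
    per_device_train_batch_size=1,   # mini-batch size of 1 per GPU
    gradient_accumulation_steps=64,  # total gradient accumulation of 64
    fsdp="full_shard auto_wrap",     # Fully Sharded Data Parallel across the 4 A100s
)
# training_args would then be passed to a Trainer together with the model and the
# tokenized training/evaluation datasets, and launched with torchrun on the 4 GPUs.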

Languages

In addition to the data already used to train Meta-Llama-3-8B, data entirely in Valencian from the sources described in the previous section was used.

Evaluation

The following tables compare results on different benchmarks with the base model used for continued pre-training. The results were obtained from the pre-trained model; no instruction tuning or fine-tuning of any kind was performed.
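
As an illustration only, a comparable evaluation could be run with EleutherAI's lm-evaluation-harness; the card does not specify the exact tool, version, or task configurations used, and the task identifiers below are examples that depend on the installed harness version.

import lm_eval

# Hypothetical reproduction sketch; task names are illustrative
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gplsi/Llama3-8B-Valencian,dtype=bfloat16",
    tasks=["belebele_cat_Latn", "copa_ca"],
    batch_size=8,
)
print(results["results"])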

Catalan

Classification Benchmarks

| Dataset | Lang. | Task | Metric | Llama3-8B | Llama3-8B-Valencian |
|---|---|---|---|---|---|
| Belebele Cat_latn | ca | Reading Comprehension | acc | 0.788 | 0.736 |
| COPA | ca | Commonsense Reasoning | acc | 0.480 | 0.800 |
| XStoryCloze | ca | Commonsense Reasoning | acc | 0.717 | 0.722 |
| OpenBookQA | ca | Question Answering | acc | 0.352 | 0.352 |
| PAWS | ca | Paraphrasing | acc | 0.681 | 0.661 |
| PiQA | ca | Question Answering | acc | 0.649 | 0.654 |
| SiQA | ca | Question Answering | acc | 0.466 | 0.457 |
| ARC Easy | ca | Question Answering | acc | 0.671 | 0.678 |
| ARC Challenge | ca | Question Answering | acc | 0.415 | 0.428 |
| XNLI | ca | Natural Language Inference | acc | 0.500 | 0.506 |
| Teca | ca | Natural Language Inference | acc | 0.520 | 0.506 |
| WNLI | ca | Natural Language Inference | acc | 0.620 | 0.578 |
| MGSM Direct | ca | Math | exact match | 0.069 | 0.051 |

Generation Benchmarks

| Dataset | Lang. | Task | Metric | Llama3-8B | Llama3-8B-Valencian |
|---|---|---|---|---|---|
| Phrases ca-va | ca/va | Translation - Adaptation | bleu | 0.781 | 0.637 |
| Phrases va-ca | va/ca | Translation - Adaptation | bleu | 0.912 | 0.723 |

Spanish

Classification Benchmarks

| Dataset | Lang. | Task | Metric | Llama3-8B | Llama3-8B-Valencian |
|---|---|---|---|---|---|
| Belebele Spa_latn | es | Reading Comprehension | acc | 0.817 | 0.743 |
| PAWS | es | Paraphrasing | acc | 0.420 | 0.400 |
| XNLI | es | Natural Language Inference | acc | 0.482 | 0.459 |
| WNLI | es | Natural Language Inference | acc | 0.690 | 0.648 |
| XStoryCloze | es | Commonsense Reasoning | acc | 0.733 | 0.737 |
| MGSM Direct | es | Math | exact match | 0.12 | 0.10 |

Generation Benchmarks

| Dataset | Lang. | Task | Metric | Llama3-8B | Llama3-8B-Valencian |
|---|---|---|---|---|---|
| Cocoteros | es | | bleu | 0.169 | 0.081 |
| Phrases es-va | es/va | Translation | bleu | 0.568 | 0.635 |
| Phrases va-es | va/es | Translation | bleu | 0.761 | 0.759 |
| XLSum | es | Summarization | bleu | 0.046 | 0.081 |

English

Classification Benchmarks

| Dataset | Lang. | Task | Metric | Llama3-8B | Llama3-8B-Valencian |
|---|---|---|---|---|---|
| Belebele Eng_latn | en | Reading Comprehension | acc | 0.862 | 0.822 |
| PAWS | en | Paraphrasing | acc | 0.414 | 0.392 |
| XNLI | en | Natural Language Inference | acc | 0.503 | 0.501 |
| XStoryCloze | en | Commonsense Reasoning | acc | 0.813 | 0.823 |
| MGSM Direct | en | Math | exact match | 0.580 | 0.510 |
| TriviaQA | en | Question Answering | exact match | 0.716 | 0.657 |

Additional information

Author

Language Processing and Information Systems Group (GPLSI)

Contact

For further information, please send an email to GPLSI

Copyright

Copyright (c) 2025 by GPLSI (https://gplsi.dlsi.ua.es/).

License

Apache License 2.0

Funding

This work was funded by the ILENIA-VIVES project (2022/TL22/00215334).

Disclaimer

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (GPLSI) be liable for any results arising from the use made by third parties.
