Model Card: LuxLlama
Abstract
LuxLlama is a fine-tuned iteration of the Meta-Llama-3.1-8B-Instruct model, adapted to strengthen Luxembourgish language understanding and generation alongside general and mathematical reasoning. The model was trained on a diverse dataset combining reasoning benchmarks and cleaned Luxembourgish text, using Parameter-Efficient Fine-Tuning (PEFT) with LoRA and the Liger kernel for efficiency. Performance is evaluated with the custom LUXELLA benchmark, which shows strong capabilities in areas like translation and comprehension while highlighting room for improvement in nuanced vocabulary and grammar. The fine-tuning process used the computational resources of Meluxina, a high-performance computing (HPC) platform operated by LuxProvide.
Model Details
- Model Name: LuxLlama
- Type: Language Model fine-tuned for Luxembourgish proficiency and reasoning tasks.
- Base Model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`
- Fine-tuning Frameworks: Unsloth, PEFT (LoRA), TRL (SFTTrainer)
- Language(s): Primarily English and Luxembourgish; also supports general and mathematical reasoning tasks.
Introduction
The goal of LuxLlama is to create a language model proficient in both complex reasoning and the nuances of the Luxembourgish language. By leveraging a powerful base model (Llama 3.1 8B Instruct) and fine-tuning it on a curated mix of reasoning and language-specific datasets, LuxLlama aims to serve as a capable tool for tasks requiring Luxembourgish language skills and logical deduction. This model card details its architecture, training process, data, and evaluation results on the LUXELLA benchmark.
About Meluxina
Meluxina is Luxembourg's national supercomputer, launched in June 2021 by LuxProvide. It is built on the EVIDEN BullSequana XH2000 platform and provides:
- 18 PetaFlops of computing power.
- 20 PetaBytes of storage capacity.
- A scalable architecture integrating simulation, modeling, data analytics, and AI.
At launch, Meluxina ranked 36th globally in the Top500 and was recognized as the greenest supercomputer in the EU. Named after Melusina, the mermaid of Luxembourgish legend, it symbolizes digital innovation and uses water-cooling technology for energy efficiency.
Related Works
The development of LuxLlama builds upon existing work in:
- Fine-tuning large language models (LLMs) for specific tasks and languages.
- Parameter-Efficient Fine-Tuning techniques like LoRA.
- Instruction tuning for improved model controllability.
- Development of language-specific benchmarks.
- Reasoning capabilities enhancement in LLMs.
Intended Use
- Primary Use: Luxembourgish language tasks including translation (EN<>LB), reading comprehension, grammar assistance, vocabulary queries, cultural knowledge retrieval, conversation simulation, and text generation (writing prompts, sentence completion), plus general and mathematical reasoning tasks (see the inference sketch after this list).
- Target Audience: Researchers, developers, language learners, and users needing Luxembourgish language capabilities or reasoning support within the model's scope.
- Out-of-Scope Uses: High-stakes applications requiring perfect accuracy without human oversight, generation of harmful or biased content, uses violating the base model's license agreement.
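For illustration, here is a minimal inference sketch. It assumes the model is published on the Hugging Face Hub as `aiplanet/LuxLlama` and uses the standard Llama 3.1 chat template; adjust the repo id and generation settings to your setup.

```python
# Minimal inference sketch (assumes the model is available on the Hugging Face
# Hub as "aiplanet/LuxLlama"; adjust the repo id and generation settings as needed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aiplanet/LuxLlama"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 3.1 instruct models ship a chat template; apply_chat_template builds the prompt.
messages = [
    {"role": "user", "content": "Translate to Luxembourgish: Good morning, how are you?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```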
Limitations and Bias
- Performance Variability: As shown in the LUXELLA benchmark, performance varies across different linguistic categories and difficulty levels. The model performs better on translation and comprehension than on nuanced vocabulary, spelling, or idioms. Performance generally decreases with increasing difficulty.
- Inherited Bias: LuxLlama may inherit biases present in the base Llama 3.1 model and the training datasets.
- Synthetic Benchmark: The LUXELLA benchmark uses synthetically generated questions. While diverse, it may not perfectly capture all real-world complexities or linguistic variations.
- LLM-based Evaluation: The use of an LLM judge for evaluation, while scalable and consistent, has its own limitations and potential biases compared to human expert evaluation.
- Factual Accuracy: Like most LLMs, LuxLlama may generate plausible but incorrect information (hallucinations).
- Low-Resource Language: Luxembourgish is a lower-resource language, meaning available training data is less extensive than for languages like English, which can impact the depth of understanding.
Training Data
Data Collection
The fine-tuning dataset was compiled from the following sources:
- General Reasoning: `KingNish/reasoning-base-20k`, `SkunkworksAI/reasoning-0.01`
- Math Reasoning: `AI-MO/NuminaMath-CoT`
- Luxembourgish Language: `saillab/alpaca-luxembourgish-cleaned`
Dataset Preparation and Preprocessing
- Loading & Cleaning: Each dataset was loaded and cleaned individually (standardizing formats, handling missing values).
- Categorization & Templating: Datasets were categorized (general reasoning, math reasoning, Luxembourgish). Specific prompt templates were applied to each category to guide the model during fine-tuning.
- Merging & Splitting: All processed datasets were merged into a single dataset, shuffled randomly, and then split into training and validation sets.
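As a rough illustration of these three steps, the sketch below loads the four source datasets, applies a per-category template, and merges and splits the result. The template wording, column handling, and validation fraction are assumptions; only the dataset names and the seed come from this card.

```python
# Sketch of the preparation pipeline (the prompt template, column handling, and
# 95/5 split are assumptions; only the dataset names come from this model card).
from datasets import load_dataset, concatenate_datasets

SOURCES = {
    "general_reasoning": ["KingNish/reasoning-base-20k", "SkunkworksAI/reasoning-0.01"],
    "math_reasoning": ["AI-MO/NuminaMath-CoT"],
    "luxembourgish": ["saillab/alpaca-luxembourgish-cleaned"],
}

def to_text(example, category):
    # Hypothetical per-category template; the actual prompts were not published.
    prompt = example.get("instruction") or example.get("question") or example.get("prompt", "")
    answer = example.get("output") or example.get("response") or example.get("solution", "")
    return {"text": f"[{category}]\n### Instruction:\n{prompt}\n### Response:\n{answer}"}

parts = []
for category, repos in SOURCES.items():
    for repo in repos:
        ds = load_dataset(repo, split="train")
        # Standardize every source to a single "text" column so they can be merged.
        parts.append(ds.map(lambda ex: to_text(ex, category), remove_columns=ds.column_names))

merged = concatenate_datasets(parts).shuffle(seed=1729)
splits = merged.train_test_split(test_size=0.05)  # assumed validation fraction
train_ds, eval_ds = splits["train"], splits["test"]
```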
Training Procedure
- Base Model: `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`, loaded with 4-bit quantization (`load_in_4bit=True`).
- Fine-tuning Method: Supervised Fine-Tuning (SFT) using `trl.SFTTrainer`.
- Parameter Efficiency: PEFT with LoRA (`get_peft_model`):
  - `r`: 256
  - `lora_alpha`: 256
  - `lora_dropout`: 0.0
  - `target_modules`: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- Training Configuration (`SFTConfig`):
  - `max_seq_length`: 128000
  - `packing`: False
  - `per_device_train_batch_size`: 4
  - `gradient_accumulation_steps`: 8 (effective batch size: 32)
  - `warmup_ratio`: 0.02
  - `num_train_epochs`: 1
  - `learning_rate`: 5e-5
  - `fp16`: True
  - `bf16`: True (mixed-precision training)
  - `logging_steps`: 10
  - `optim`: "adamw_8bit"
  - `weight_decay`: 0.01
  - `lr_scheduler_type`: "cosine_with_restarts"
  - `seed`: 1729
  - `output_dir`: "lora_outputs_run5"
  - `save_strategy`: "steps"
  - `save_steps`: 1000
- Optimization Kernel: Liger kernel enabled (`use_liger=True`) for increased throughput and reduced memory usage via optimized Triton kernels for common LLM operations.
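Putting the pieces together, the sketch below shows how this configuration maps onto the Unsloth/TRL APIs. The hyperparameters mirror the list above; the surrounding loading and trainer code follows the usual Unsloth pattern and is illustrative rather than the exact training script (`train_ds` is assumed to be the prepared training split).

```python
# Sketch of the fine-tuning setup using the hyperparameters listed above.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=128_000,
    load_in_4bit=True,
)

# LoRA adapters on all attention and MLP projections, as listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=256,
    lora_alpha=256,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

config = SFTConfig(
    max_seq_length=128_000,
    packing=False,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size 32
    warmup_ratio=0.02,
    num_train_epochs=1,
    learning_rate=5e-5,
    bf16=True,   # mixed precision; the card lists fp16=True as well, but only one can be active
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts",
    seed=1729,
    output_dir="lora_outputs_run5",
    save_strategy="steps",
    save_steps=1000,
    use_liger=True,                  # Liger Triton kernels for throughput/memory
)

trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=train_ds, args=config)
trainer.train()
```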
Evaluation
Benchmarking Dataset - LUXELLA
- Name: LUXELLA (Luxembourgish Excellence Language Learning Assessment)
- Description: A custom benchmark designed to evaluate Luxembourgish language proficiency using synthetically generated questions.
- Generation: Questions generated using a Gemini-based LLM, prompted across 15 categories (vocabulary, grammar, translation, comprehension, culture, idioms, etc.), 4 difficulty levels (beginner, intermediate, advanced, native), and randomized topics. Output in structured JSON.
- Evaluation Method: LLM-based judgment. A separate LLM acts as a judge, scoring each response on a scale of 1.0 to 5.0 and providing a brief explanation (a minimal sketch of such a scoring loop follows below).
- Link: In progress
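Since the judging prompt has not been released, the sketch below shows one plausible implementation of the LLM-judge scoring. The judge model, prompt wording, and JSON schema are assumptions; only the 1.0 to 5.0 scale and the brief explanation come from the benchmark description.

```python
# Hypothetical LLM-judge scoring for LUXELLA-style evaluation. The judge model,
# prompt wording, and JSON schema are assumptions; only the 1.0-5.0 scale comes
# from the benchmark description.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a Luxembourgish language exam answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Return JSON: {{"score": <float between 1.0 and 5.0>, "explanation": "<one sentence>"}}"""

def judge(question: str, reference: str, answer: str) -> dict:
    """Score one model answer; returns {"score": float, "explanation": str}."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge; the card does not name the judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)
```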
Evaluation Results
LuxLlama Performance on LUXELLA:
- Overall Score: 3.73 / 5.0
Scores by Category:
| Category | Score |
|---|---|
| translation_to_english | 4.19 |
| reading_comprehension | 4.14 |
| verb_conjugation | 4.11 |
| multiple_choice | 4.08 |
| translation_from_english | 4.07 |
| translation | 4.00 |
| listening_comprehension_simulation | 3.98 |
| conversation | 3.79 |
| word_order | 3.79 |
| cultural_knowledge | 3.76 |
| writing_prompt | 3.74 |
| grammar | 3.44 |
| idioms_and_expressions | 3.37 |
| sentence_completion | 3.11 |
| spelling_and_pronunciation | 3.04 |
| vocabulary | 3.00 |
Scores by Difficulty:
| Difficulty | Score |
|---|---|
| beginner | 3.93 |
| intermediate | 3.69 |
| advanced | 3.68 |
| native | 3.57 |
Comparative Performance:
| Model | Overall Score (LUXELLA) |
|---|---|
| LuxLlama (Ours) | 3.73 / 5.0 |
| gemma2-9b-it | 3.07 / 5.0 |
| llama-3.1-8b-instant | 2.46 / 5.0 |
| mixtral-8x7b-32768 | 2.44 / 5.0 |
Summary: LuxLlama significantly outperforms the other models tested on the LUXELLA benchmark. It excels at translation, comprehension, and verb conjugation, while vocabulary, spelling, and idioms score relatively lower, indicating room for improvement on finer linguistic nuances. The model handles beginner-level tasks very well, with performance decreasing gradually as difficulty increases, which supports the benchmark's sensitivity. Sample high-scoring responses show correct handling of cultural knowledge, spelling, and advanced verb conjugation; low-scoring samples highlight challenges with specific grammar rules (Konjunktiv II usage), subtle vocabulary distinctions (e.g., Niess vs. Kusinn), and standard word-order conventions.
Learnings and Observations
(This section should be updated after further analysis and usage.)
- The combination of reasoning and language-specific data appears beneficial for overall capability.
- PEFT with LoRA and 4-bit quantization, combined with the Liger kernel, provides an efficient pathway for fine-tuning large models on specific tasks/languages with limited resources.
- The LUXELLA benchmark provides valuable, granular insights into Luxembourgish language capabilities, highlighting strengths and weaknesses.
- Further improvements might require more diverse Luxembourgish data, particularly covering idioms, colloquialisms, and complex grammatical structures, or different fine-tuning strategies.
Ethical Considerations
- Content Generation: The model can generate text in Luxembourgish and English. Users should be aware that generated content may not always be accurate, neutral, or appropriate. Human review is recommended for sensitive applications.
- Bias: Efforts were made to use cleaned datasets, but biases may still exist. The model might reflect societal biases present in the training data.
- Misinformation: The model may generate incorrect factual information or flawed reasoning. Outputs should be critically evaluated.
Acknowledgments
This work leverages computational resources and support from Meluxina by LuxProvide.
