Model Card: Coreguapa Quality LM

This model is intended for validating the quality of Guaraní text. It was trained on the Coreguapa corpus, a restricted dataset manually compiled and curated by the Paraguayan Secretariat of Linguistic Policies (Secretaría de Políticas Lingüísticas). The corpus consists of high-quality documents, most of which are copyrighted.

Although the model uses a transformer architecture (Gemma 2), it was not developed as a generative tool. Its primary use is to compute the perplexity of Guaraní documents: a lower perplexity might suggest text that is more predictable to the model and therefore more similar to the high-quality reference corpus.
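As a minimal sketch of the score described above, perplexity is the exponential of the mean negative log-likelihood per token. The function name and the toy log-probabilities are illustrative assumptions and are not taken from the project code.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of log p(token_i | previous tokens))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy inputs, made up for illustration: natural-log probabilities per token.
predictable = [math.log(0.5)] * 4   # the model gives each token probability 0.5
surprising = [math.log(0.05)] * 4   # the model gives each token probability 0.05

print(perplexity_from_logprobs(predictable))
print(perplexity_from_logprobs(surprising))
```

The first toy sequence scores roughly 2 and the second roughly 20, which shows the direction of the score: text the model finds more predictable gets a lower perplexity.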

Summary

  • Model type: Gemma 2 causal language model (Gemma2ForCausalLM)
  • Base model: princeton-nlp/gemma-2-9b-it-SimPO
  • Fine-tuning method: Full fine-tuning (all model weights updated)
  • Dataset: Guaraní corpus derived from data/coreguapa_identified_all.jsonl
  • Primary task: Causal language modeling, used for perplexity scoring rather than text generation
  • Training target: output/full_cpt_202605261711

Model Details

  • Architecture: Gemma2ForCausalLM
  • Number of layers: 42
  • Hidden size: 3584
  • Attention heads: 16
  • Feedforward intermediate size: 14336
  • Vocabulary size: 256000
  • Maximum context length: 8192 tokens
  • Precision: float16
  • Tokenizer: saved in this folder via tokenizer.json and tokenizer_config.json
  • Generation config: saved in generation_config.json
  • Prompt template: chat_template.jinja

Training Details

  • Training script: src/train.py
  • Training configuration:
    • config/common_config.yaml
    • config/full_config.yaml
  • Batch size: 1
  • Gradient accumulation: 1
  • Learning rate: 2e-5
  • Weight decay: 0.01
  • Warmup steps: 100
  • Optimizer: paged_adamw_8bit
  • Scheduler: linear
  • Epochs: 2
  • Precision mode: bf16 where available
  • Gradient checkpointing: enabled
  • Dataset preprocessing: src/preprocess_data.py
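
The learning-rate settings above (peak 2e-5, 100 warmup steps, linear scheduler) correspond to a schedule like the one sketched below. This is an illustrative reimplementation of the usual linear warmup and linear decay rule, not code from src/train.py. The total number of optimizer steps (`total_steps=1000`) is a hypothetical value, since this card does not report it.

```python
def linear_lr(step, peak_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at `step` for linear warmup followed by linear decay to zero.

    total_steps is a hypothetical placeholder; the real value depends on the
    dataset size, batch size, and number of epochs.
    """
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step))
```

With these assumed values the rate reaches its peak at step 100 and falls to half the peak at step 550, midway through the decay phase.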

Dataset and Preprocessing

  • Raw source file: private
  • Processed dataset directory: private
  • Split strategy: train / validation / test via src/preprocess_data.py
  • Sequence length used for tokenization: 2048
  • Tokenizer source: princeton-nlp/gemma-2-9b-it-SimPO
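
One common way to apply a fixed sequence length of 2048 tokens is to split each tokenized document into consecutive windows. Whether src/preprocess_data.py chunks, truncates, or packs documents is not stated in this card, so the sketch below is an assumption, and `chunk_token_ids` is a hypothetical helper.

```python
def chunk_token_ids(token_ids, max_length=2048):
    """Split a list of token ids into consecutive windows of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Stand-in for the token ids of one document (not real tokenizer output).
doc = list(range(5000))
chunks = chunk_token_ids(doc)
print([len(chunk) for chunk in chunks])
```

For a 5000-token document this yields windows of 2048, 2048, and 904 tokens; no tokens are dropped, unlike plain truncation.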

Usage

Use the model with Hugging Face Transformers as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the training output directory.
model = AutoModelForCausalLM.from_pretrained('output/full_cpt_202605261711')
tokenizer = AutoTokenizer.from_pretrained('output/full_cpt_202605261711')

# Tokenize the input and generate a continuation of up to 128 new tokens.
prompt = 'Your input text here.'
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For the inference logic used in this project, see src/inference.py.
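
Since the model's primary use is perplexity scoring, a scoring call may be more representative than generation. The following is a minimal sketch, not the logic of src/inference.py: `load_model` and `document_perplexity` are hypothetical helpers, and the `max_length=2048` truncation is an assumption chosen to match the tokenization length listed under Dataset and Preprocessing.

```python
import math


def load_model(path="output/full_cpt_202605261711"):
    """Load the checkpoint and tokenizer; heavy imports are kept local."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path)
    # torch_dtype="auto" keeps the dtype recorded in the checkpoint config.
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto")
    model.eval()
    return model, tokenizer


def document_perplexity(text, model, tokenizer, max_length=2048):
    """Return exp(mean next-token cross-entropy) of `text` under the model."""
    import torch

    # Assumption: documents longer than max_length tokens are truncated.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    input_ids = enc["input_ids"].to(model.device)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        out = model(input_ids=input_ids, labels=input_ids)
    return math.exp(out.loss.item())


# Example (requires the checkpoint on disk and enough memory for the model):
# model, tokenizer = load_model()
# print(document_perplexity("Your Guaraní text here.", model, tokenizer))
```

Because the text is truncated, documents longer than the window are scored only on their first 2048 tokens; scoring long documents in full would need a windowed variant.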

Evaluation

Evaluation scripts in the repository include:

  • src/evaluate_base_vs_cpt.py: compares the base, LoRA, and fully fine-tuned models
  • src/inference.py: generates predictions from saved checkpoints
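
A comparison between models can be summarized from per-document perplexities computed on the same held-out texts. The sketch below is an assumption about how such a summary might look, not the logic of src/evaluate_base_vs_cpt.py; `summarize_perplexities`, `best_model`, and the model labels passed to them are hypothetical.

```python
import statistics


def summarize_perplexities(scores_by_model):
    """Map each model label to the mean, median, and count of its perplexities."""
    summary = {}
    for name, scores in scores_by_model.items():
        if not scores:
            raise ValueError(f"no scores for model {name!r}")
        summary[name] = {
            "mean": statistics.fmean(scores),
            "median": statistics.median(scores),
            "n": len(scores),
        }
    return summary


def best_model(summary, key="median"):
    """Return the label with the lowest summary statistic (lower is better)."""
    return min(summary, key=lambda name: summary[name][key])
```

The median is used as the default ranking statistic because per-document perplexity can have heavy outliers that dominate the mean.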

Limitations and Notes

  • The training data is drawn from a private Guaraní corpus.
  • The model may reflect biases present in the source corpus.
  • License metadata is provided in this folder.

License

This model checkpoint and accompanying files are released under the GNU General Public License v3 (GPLv3). See the LICENSE file in this directory for the full license text.

Caveats

  • This file is generated from the available project configuration and model metadata.
  • If you need exact license or authorship details, consult the repository maintainers or project documentation.