GNTweetsLM

GNTweetsLM is a language model intended for assessing the quality of Guarani text. It was trained on a publicly available corpus of tweets written in Guarani and Jopara (Góngora et al. 2021).

⚠️ Although the model is built on a transformer architecture (Gemma2-9b-it), it was not developed as a generative tool. Its primary use is to compute the perplexity of Guarani documents. Lower perplexity means the model finds the text more predictable, which may indicate that the text is more similar to the reference high-quality corpus.
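
To make the perplexity score concrete, here is a minimal sketch of the usual definition: the exponential of the mean negative log-likelihood per token. The function name `perplexity_from_logprobs` and the probabilities below are hypothetical and chosen only for illustration; they do not come from the model.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities, for illustration only.
predictable = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
surprising = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

print(perplexity_from_logprobs(predictable))  # close to 2
print(perplexity_from_logprobs(surprising))   # close to 10
```

A text whose tokens each receive probability 0.5 scores a perplexity near 2, while one whose tokens each receive 0.1 scores near 10, so the more predictable text gets the lower score.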

📌 Summary

  • Model type: Gemma2 For Causal LM
  • Base model: princeton-nlp/gemma-2-9b-it-SimPO
  • Fine-tuning method: Full fine-tuning (all model weights updated)
  • Primary task: Perplexity computation
  • Dataset (HF): guaran-ia/gntweets

🏗️ Model Details

  • Architecture: Gemma2ForCausalLM
  • Number of layers: 42
  • Hidden size: 3584
  • Attention heads: 16
  • Feedforward intermediate size: 14336
  • Vocabulary size: 256000
  • Maximum context length: 8192 tokens
  • Precision: float16
  • Tokenizer: saved in this folder via tokenizer.json and tokenizer_config.json
  • Generation config: saved in generation_config.json
  • Prompt template: chat_template.jinja
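
As a rough guide to hardware needs, the sketch below estimates the memory taken by the weights alone at float16 precision. The parameter count of about 9 billion is an assumption read off the base model name (gemma-2-9b), and the helper `weight_memory_gib` is hypothetical. Activations and framework overhead come on top of this figure.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return n_params * bytes_per_param / 2**30

# Assumption: roughly 9 billion parameters, as the base model name suggests.
APPROX_PARAMS = 9e9

# float16 stores each parameter in 2 bytes.
estimate = weight_memory_gib(APPROX_PARAMS, bytes_per_param=2)
print(f"Approximate weight memory: {estimate:.1f} GiB")
```

Under these assumptions the weights need somewhat under 17 GiB, so a single GPU with less memory than that would require offloading or a lower-precision load.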

⚙️ Training Details

  • Batch size: 1
  • Gradient accumulation: 1
  • Learning rate: 2e-5
  • Weight decay: 0.01
  • Warmup steps: 100
  • Optimizer: paged_adamw_8bit
  • Scheduler: linear
  • Epochs: 6
  • Precision mode: bf16
  • Gradient checkpointing: enabled
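
The settings above map naturally onto Hugging Face `TrainingArguments`. The card does not say that the `Trainer` API was used, so the following is a sketch under that assumption rather than the actual training script. The output directory name is hypothetical, and `paged_adamw_8bit` additionally requires the `bitsandbytes` package.

```python
# Keyword arguments mirroring the hyperparameters listed above.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_steps": 100,
    "optim": "paged_adamw_8bit",
    "lr_scheduler_type": "linear",
    "num_train_epochs": 6,
    "bf16": True,
    "gradient_checkpointing": True,
}

def build_training_args(output_dir: str = "gntweets-lm-out"):
    """Build TrainingArguments (needs transformers and bf16-capable hardware).

    `output_dir` is a hypothetical path used only for this sketch.
    """
    from transformers import TrainingArguments  # imported lazily
    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)

# Sequences contributing to each optimizer step, per device.
effective_batch = (TRAINING_KWARGS["per_device_train_batch_size"]
                   * TRAINING_KWARGS["gradient_accumulation_steps"])
print(f"Effective batch size per device: {effective_batch}")
```

With a batch size of 1 and no gradient accumulation, each optimizer step sees a single sequence per device.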

🗃️ Dataset and Preprocessing

  • Split strategy: train / validation / test
  • Sequence length used for tokenization: 2048
  • Train dataset size: 936 records (1916928 tokens)
  • Validation dataset size: 117 records (239616 tokens)
  • Test dataset size: 117 records (239616 tokens)
  • Tokenizer: princeton-nlp/gemma-2-9b-it-SimPO
  • HF ID: guaran-ia/gntweets
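
The token counts above are exact multiples of 2048, which suggests (the card does not state it) that each record is a fixed-length block of 2048 tokens. The sketch below checks that arithmetic and the resulting split proportions, and shows one common way to form such blocks. `pack_into_blocks` is a hypothetical helper, not the preprocessing code used for this model.

```python
def pack_into_blocks(token_ids, block_size=2048):
    """Split a flat token stream into full blocks, dropping the remainder."""
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size]
            for i in range(n_full)]

# Consistency check against the figures reported above.
SEQ_LEN = 2048
records = {"train": 936, "validation": 117, "test": 117}
tokens = {name: n * SEQ_LEN for name, n in records.items()}
total_records = sum(records.values())
shares = {name: n / total_records for name, n in records.items()}

for name in records:
    print(f"{name}: {tokens[name]} tokens, {shares[name]:.0%} of records")
```

The products reproduce the reported token counts (1916928 and 239616), and the record counts correspond to an 80/10/10 split.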

🚀 Usage

Compute perplexity for a given Guarani text using the fine-tuned model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math

model_id = 'guaran-ia/gntweets-lm'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

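def perplexity(text: str) -> float:
  """Return exp(mean per-token negative log-likelihood) of `text`."""
  # Keep the inputs on the same device as the model
  inputs = tokenizer(text, return_tensors='pt').to(model.device)
  with torch.no_grad():
    # Passing the input IDs as labels makes the model return the mean
    # cross-entropy loss over the predicted tokens
    outputs = model(**inputs, labels=inputs['input_ids'])
  return math.exp(outputs.loss.item())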

text = "Your Guarani text here."
print(f"Perplexity: {perplexity(text):.4f}")
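
Since the model is meant for quality validation, a typical next step is to rank or filter a set of documents by their score. The sketch below takes any scoring function, such as `perplexity` above. The helper names, the threshold, and the stand-in scorer `fake_score` are hypothetical, and a suitable threshold would have to be calibrated on your own data.

```python
from typing import Callable, Iterable, List, Tuple

def rank_by_perplexity(texts: Iterable[str],
                       score: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Return (score, text) pairs sorted from lowest to highest perplexity."""
    scored = [(score(t), t) for t in texts]
    return sorted(scored, key=lambda pair: pair[0])

def filter_by_threshold(texts: Iterable[str],
                        score: Callable[[str], float],
                        max_ppl: float) -> List[str]:
    """Keep only the texts whose perplexity does not exceed `max_ppl`."""
    return [t for t in texts if score(t) <= max_ppl]

# Stand-in scorer so the sketch runs without downloading the model.
# It is NOT a perplexity; in practice pass the `perplexity` function instead.
def fake_score(text: str) -> float:
    return float(len(text))

docs = ["aaa", "a", "aa"]
ranked = rank_by_perplexity(docs, fake_score)
kept = filter_by_threshold(docs, fake_score, max_ppl=2.0)
print(ranked)
print(kept)
```

Keeping the scorer as a parameter means the same helpers also work with a sliding-window scorer for long documents.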

Perplexity for long texts

If the input is longer than the model's maximum context length (8192 tokens), split it into overlapping chunks and average the per-token loss across them, as in the following recipe.

import math
import torch

def perplexity_sliding(text: str, model, tokenizer, max_len: int = 8192, stride: int = 4096):
  """Compute perplexity over long text with overlapping windows.

  - `max_len` should be <= model.config.max_position_embeddings (8192).
  - `stride` is how far each window advances and should be <= `max_len`;
    a smaller stride gives each scored token more left context.
  - Every token is scored at most once: tokens already scored in an earlier
    window are masked with -100 and serve only as context.
  """
  enc = tokenizer(text, return_tensors='pt')['input_ids'][0]
  n = enc.size(0)
  if n < 2:
    return float('nan')  # nothing to predict

  total_nll = 0.0
  total_tokens = 0
  prev_end = 0
  start = 0
  while start < n:
    end = min(start + max_len, n)
    input_ids = enc[start:end].unsqueeze(0).to(model.device)
    labels = input_ids.clone()
    labels[:, :max(prev_end - start, 0)] = -100  # context only, already scored
    # The model shifts labels by one, so the first position is never a target
    n_scored = int((labels[:, 1:] != -100).sum())
    if n_scored > 0:
      with torch.no_grad():
        # outputs.loss is the mean NLL over the unmasked target tokens
        loss = model(input_ids, labels=labels).loss.item()
      total_nll += loss * n_scored
      total_tokens += n_scored
    prev_end = end
    if end == n:
      break
    start += stride

  return math.exp(total_nll / total_tokens)

# Example usage:
text = open('some_guarani.txt', encoding='utf-8').read()
tokenizer.model_max_length = 8192
print(f"Perplexity (sliding): {perplexity_sliding(text, model, tokenizer):.4f}")

❗ Limitations and Notes

  • The model may reflect biases present in the source corpus.
  • License metadata is provided in this folder.

📜 License

This model checkpoint and accompanying files are released under the GNU General Public License v3 (GPLv3). See the LICENSE file in this directory for the full license text.
