
ukr-t5-small

A compact mT5-small model fine-tuned for Ukrainian language tasks, retaining base English understanding.

Model Description

  • Base Model: mT5-small
  • Fine-tuning Data: Leipzig Corpora Collection (English & Ukrainian news from 2023)
  • Tasks:
    • Text summarization (Ukrainian)
    • Text generation (Ukrainian)
    • Other Ukrainian-centric NLP tasks

Technical Details

  • Model Size: ~300 MB (74.8M parameters, F32)
  • Framework: Transformers (Hugging Face)

Usage

Installation

pip install transformers sentencepiece

(The mT5 tokenizer requires the sentencepiece package in addition to transformers.)

Loading the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")

Example: Summarization

text = "(Text in Ukrainian here)"

# Tokenize the input with the "summarize:" task prefix
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

# Generate a summary with beam search
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)

# Decode output 
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
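Inputs longer than 512 tokens are truncated, so the tail of a long article is silently dropped. One workaround is to split the text into pieces and summarize each piece separately. Below is a minimal sketch of such a splitter (a hypothetical helper; it splits on sentence boundaries by character count, which only approximates the model's true token limit):

```python
def chunk_text(text, max_chars=1500):
    """Split text into chunks of at most max_chars characters,
    breaking on sentence boundaries.

    Character counts only approximate the model's 512-token input
    limit; a tokenizer-based splitter would be more precise.
    """
    chunks, current = [], ""
    for sentence in text.replace("!", ".").replace("?", ".").split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = (current + " " + sentence).strip() + "."
        if len(candidate) > max_chars and current:
            # Current chunk is full; start a new one with this sentence
            chunks.append(current)
            current = sentence + "."
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be run through the summarization snippet above and the partial summaries concatenated.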

Limitations

  • The model's focus is on Ukrainian text processing, so performance on purely English tasks may be below that of general T5-small models.
  • Further fine-tuning may be required for optimal results on specific NLP tasks.
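If you do fine-tune further, T5-style models expect (prefixed input, target) text pairs. A minimal sketch of shaping raw article/summary pairs into that format (the field names and helper are hypothetical; the actual training loop, e.g. with Seq2SeqTrainer, is not shown):

```python
def to_t5_examples(pairs, prefix="summarize: "):
    """Convert (article, summary) string pairs into T5-style
    input/target dicts, prepending the task prefix to each input."""
    return [
        {"input_text": prefix + article, "target_text": summary}
        for article, summary in pairs
    ]
```

The resulting input_text and target_text strings would then be tokenized with the same tokenizer used at inference time.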

Dataset Credits

This model was fine-tuned on the Leipzig Corpora Collection (English and Ukrainian news corpora from 2023). For full licensing and usage information for the original dataset, please refer to the Leipzig Corpora Collection website.

Ethical Considerations

  • NLP models can reflect biases present in their training data. Be mindful of this when using this model in applications with real-world impact.
  • Test this model thoroughly across a variety of Ukrainian language samples to evaluate its reliability and fairness before deployment.