---
license: apache-2.0
datasets:
- wikimedia/wikipedia
language:
- mk
base_model:
- google/mt5-base
---

# Fine-tuned mt5-base model for restoring capitalization and punctuation in Macedonian

The model is fine-tuned on a subset of the Macedonian portion of Wikipedia.

Authors:

1. Dejan Porjazovski
2. Ilina Jakimovska
3. Ordan Chukaliev
4. Nikola Stikov

This collaboration is part of the activities of the Center for Advanced Interdisciplinary Research (CAIR) at UKIM.

## Usage

```
pip install transformers torch sentencepiece
```

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Use a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

recap_model_name = "Macedonian-ASR/mt5-restore-capitalization-macedonian"
recap_tokenizer = T5Tokenizer.from_pretrained(recap_model_name)
recap_model = T5ForConditionalGeneration.from_pretrained(recap_model_name)
recap_model.to(device)

# The input is lowercase, unpunctuated text preceded by the task prefix.
sentence = "скопје е главен град на македонија"
inputs = recap_tokenizer(["restore capitalization and punctuation: " + sentence], return_tensors="pt", padding=True).to(device)
outputs = recap_model.generate(**inputs, max_length=768, num_beams=5, early_stopping=True).squeeze(0)
recap_result = recap_tokenizer.decode(outputs, skip_special_tokens=True)
print(recap_result)
# Output: Скопје е главенград на Македонија.
```
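
To process several sentences at once, the same task prefix can be applied to a whole list before tokenizing. The following is a minimal sketch, not part of the original model card: the helper names `build_prompts` and `restore_batch` are hypothetical, and the whitespace normalization is an assumption about reasonable input cleanup rather than a documented requirement of the model.

```python
# Task prefix taken from the usage example above.
PREFIX = "restore capitalization and punctuation: "


def build_prompts(sentences):
    """Collapse repeated whitespace and prepend the task prefix (hypothetical helper)."""
    return [PREFIX + " ".join(s.split()) for s in sentences]


def restore_batch(sentences, tokenizer, model, device="cpu"):
    """Run a batch through the model (hypothetical helper).

    Assumes `tokenizer` and `model` were loaded as in the usage example
    and that `model` already lives on `device`.
    """
    inputs = tokenizer(build_prompts(sentences), return_tensors="pt", padding=True).to(device)
    outputs = model.generate(**inputs, max_length=768, num_beams=5, early_stopping=True)
    # batch_decode returns one string per input sentence.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


prompts = build_prompts(["скопје е главен град на македонија", "  охрид е   град на езеро "])
```

With the objects from the usage example, this could be called as `restore_batch(sentences, recap_tokenizer, recap_model, device)`. Padding makes the sequences in a batch the same length, so no `squeeze` is needed here.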