---
license: apache-2.0
datasets:
- wikimedia/wikipedia
language:
- mk
base_model:
- google/mt5-base
---
# Fine-tuned mt5-base model for restoring capitalization and punctuation in Macedonian
The model is fine-tuned on a subset of the Macedonian portion of Wikipedia.
Authors:
1. Dejan Porjazovski
2. Ilina Jakimovska
3. Ordan Chukaliev
4. Nikola Stikov
This collaboration is part of the activities of the Center for Advanced Interdisciplinary Research (CAIR) at UKIM.
## Usage
```
pip install transformers torch sentencepiece
```
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned tokenizer and model
recap_model_name = "Macedonian-ASR/mt5-restore-capitalization-macedonian"
recap_tokenizer = T5Tokenizer.from_pretrained(recap_model_name)
recap_model = T5ForConditionalGeneration.from_pretrained(recap_model_name)
recap_model.to(device)

# Lowercase, unpunctuated input ("Skopje is the capital of Macedonia")
sentence = "скопје е главен град на македонија"

# Prepend the task prefix, tokenize, and generate with beam search
inputs = recap_tokenizer(["restore capitalization and punctuation: " + sentence], return_tensors="pt", padding=True).to(device)
outputs = recap_model.generate(**inputs, max_length=768, num_beams=5, early_stopping=True).squeeze(0)
recap_result = recap_tokenizer.decode(outputs, skip_special_tokens=True)
print(recap_result)
# Expected output: "Скопје е главен град на Македонија."
```
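To process several sentences at once, the single-sentence example above can be wrapped in a small helper. The following is a minimal sketch, not part of the released code: `build_prompts`, `restore_batch`, and the `batch_size` default are hypothetical names and values chosen for illustration, while the task prefix and generation settings are copied from the example above.

```python
# Task prefix taken from the usage example above
PREFIX = "restore capitalization and punctuation: "


def build_prompts(sentences):
    """Prepend the task prefix to every input sentence (hypothetical helper)."""
    return [PREFIX + s.strip() for s in sentences]


def restore_batch(sentences, tokenizer, model, device, batch_size=8):
    """Restore capitalization and punctuation for a list of sentences.

    batch_size is an illustrative default, not a value from the model card.
    """
    results = []
    for start in range(0, len(sentences), batch_size):
        # Build prompts for the current slice of sentences
        prompts = build_prompts(sentences[start:start + batch_size])
        # Pad to the longest prompt in the batch and move tensors to the device
        inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
        # Same generation settings as the single-sentence example
        outputs = model.generate(**inputs, max_length=768, num_beams=5, early_stopping=True)
        # Decode every generated sequence in the batch
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

With the objects from the example above already loaded, a call would look like `restore_batch(["скопје е главен град на македонија"], recap_tokenizer, recap_model, device)`, which returns one restored string per input sentence.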