---
license: apache-2.0
datasets:
- wikimedia/wikipedia
language:
- mk
base_model:
- google/mt5-base
---

# Fine-tuned mt5-base model for restoring capitalization and punctuation for the Macedonian language

The model is fine-tuned on a subset of the Macedonian portion of Wikipedia.

Authors:
1. Dejan Porjazovski
2. Ilina Jakimovska
3. Ordan Chukaliev
4. Nikola Stikov

This collaboration is part of the activities of the Center for Advanced Interdisciplinary Research (CAIR) at UKIM.

## Usage

```
pip install transformers sentencepiece torch
```

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
recap_model_name = "Macedonian-ASR/mt5-restore-capitalization-macedonian"
recap_tokenizer = T5Tokenizer.from_pretrained(recap_model_name)
recap_model = T5ForConditionalGeneration.from_pretrained(recap_model_name)
recap_model.to(device)

# Input is lowercase, unpunctuated text, prefixed with the task instruction.
sentence = "скопје е главен град на македонија"
inputs = recap_tokenizer(
    ["restore capitalization and punctuation: " + sentence],
    return_tensors="pt",
    padding=True,
).to(device)

# Beam search over a single input; squeeze(0) drops the batch dimension.
outputs = recap_model.generate(
    **inputs, max_length=768, num_beams=5, early_stopping=True
).squeeze(0)
recap_result = recap_tokenizer.decode(outputs, skip_special_tokens=True)

print(recap_result)
# Expected output: "Скопје е главенград на Македонија."
```
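
The usage example above restores a single sentence. When several sentences need to be processed, the same steps can be wrapped in a small helper. The following is only a sketch: `build_prompts` and `restore_batch` are hypothetical names that are not part of this model or of `transformers`, and the generation settings are simply copied from the example above. It assumes `tokenizer`, `model`, and `device` were created as shown in the usage example.

```python
# Task prefix taken from the usage example.
TASK_PREFIX = "restore capitalization and punctuation: "


def build_prompts(sentences):
    """Prepend the task prefix to every (stripped) input sentence."""
    return [TASK_PREFIX + s.strip() for s in sentences]


def restore_batch(sentences, tokenizer, model, device, max_length=768, num_beams=5):
    """Hypothetical helper: restore capitalization and punctuation for a list of sentences.

    `tokenizer`, `model`, and `device` are expected to be the objects
    created in the usage example (recap_tokenizer, recap_model, device).
    """
    if not sentences:
        return []
    # Pad so that sentences of different lengths fit in one batch.
    inputs = tokenizer(
        build_prompts(sentences), return_tensors="pt", padding=True
    ).to(device)
    outputs = model.generate(
        **inputs, max_length=max_length, num_beams=num_beams, early_stopping=True
    )
    # batch_decode returns one string per input sentence.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the objects from the usage example, it could be called as `restore_batch(["скопје е главен град на македонија"], recap_tokenizer, recap_model, device)`, which returns a list with one restored string per input.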