---
library_name: transformers
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ru
- ba
metrics:
- bleu
pipeline_tag: translation
widget:
- text: "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап."
  example_title: "translation_bak_to_ru"
---

### Model Description

[t5-small](https://huggingface.co/google-t5/t5-small) fine-tuned on the [Bashkir-Russian parallel corpora](https://huggingface.co/datasets/AigizK/bashkir-russian-parallel-corpora).

#### Metrics

- BLEU: 0.3018
- chrF: 0.5478

One way to reproduce these numbers is sketched at the end of this card.

#### Run inference

Use the example below*:

```python
from typing import List, Union

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5TokenizerFast


@torch.inference_mode()
def infer(
    model: T5ForConditionalGeneration,
    tokenizer: Union[T5TokenizerFast, T5Tokenizer],
    device: str,
    texts: List[str],
    target_language: str,
    max_length: int = 256
) -> List[str]:
    assert target_language in ("русский", "башкирский"), \
        "target_language must be one of (русский, башкирский)"

    # The task prefix selects the translation direction the model was trained with.
    if target_language == "русский":
        prefix = "башкирский-русский: "
    else:
        prefix = "русский-башкирский: "

    # Capitalize the first letter and make sure each input ends with a period,
    # matching the formatting used during fine-tuning.
    text_with_prefix = [
        prefix + text[0].upper() + text[1:] + ("" if text.endswith(".") else ".")
        for text in texts
    ]

    inputs = tokenizer(
        text_with_prefix,
        padding="max_length",
        max_length=max_length,
        truncation=True,
        return_tensors="pt"
    )

    model.eval()
    outputs = model.generate(
        inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device)
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    tokenizer = T5Tokenizer.from_pretrained("zhursvlevy/t5-small-bashkir-russian")
    model = T5ForConditionalGeneration.from_pretrained("zhursvlevy/t5-small-bashkir-russian")

    input_text = "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап"
    # Expected output: "Ответ на главный вопрос жизни, Вселенной и всего такого"
    print(infer(model, tokenizer, "cpu", [input_text], "русский"))
```

*The widget may not work correctly, likely because the default translation pipeline does not add the task prefix the model expects.
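#### Reproduce the metrics

A minimal sketch of how BLEU and chrF scores like the ones above could be computed with the Hugging Face `evaluate` library, reusing `infer` from the snippet above. The split, sample size, and the `"ba"`/`"ru"` field names are assumptions, not the exact setup used for this card:

```python
# Hypothetical evaluation sketch: the split and the "ba"/"ru" column names
# are assumptions; adjust them to the actual corpus schema.
import evaluate
from datasets import load_dataset

dataset = load_dataset("AigizK/bashkir-russian-parallel-corpora", split="train")
sample = dataset.select(range(100))  # small sample to keep the sketch fast

sources = [row["ba"] for row in sample]
references = [[row["ru"]] for row in sample]  # sacrebleu expects one list of references per prediction

predictions = infer(model, tokenizer, "cpu", sources, "русский")

bleu = evaluate.load("sacrebleu")  # both metrics report scores on a 0-100 scale
chrf = evaluate.load("chrf")

# Divide by 100 to match the 0-1 scale reported in the Metrics section.
print("BLEU:", bleu.compute(predictions=predictions, references=references)["score"] / 100)
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"] / 100)
```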