---
library_name: transformers
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ru
- ba
metrics:
- bleu
pipeline_tag: translation
widget:
- text: "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап."
  example_title: "translation_bak_to_ru"
---

### Model Description

[t5-small](https://huggingface.co/google-t5/t5-small) fine-tuned on the [Bashkir-Russian parallel corpora](https://huggingface.co/datasets/AigizK/bashkir-russian-parallel-corpora).

#### Metrics

- BLEU: 0.3018
- chrF: 0.5478

One way to reproduce these numbers is sketched at the end of this card.

#### Run inference

Use the example below*:

```python
from typing import List, Union

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5TokenizerFast


@torch.inference_mode()
def infer(
    model: T5ForConditionalGeneration,
    tokenizer: Union[T5TokenizerFast, T5Tokenizer],
    device: str,
    texts: List[str],
    target_language: str,
    max_length: int = 256
) -> List[str]:
    assert target_language in ("русский", "башкирский"), \
        "target_language must be one of (русский, башкирский)"

    # The task prefix selects the translation direction the model was trained with.
    if target_language == "русский":
        prefix = "башкирский-русский: "
    else:
        prefix = "русский-башкирский: "

    # Capitalize the first letter and make sure each input ends with a period,
    # matching the formatting used during fine-tuning.
    text_with_prefix = [
        prefix + text[0].upper() + text[1:] + ("" if text.endswith(".") else ".")
        for text in texts
    ]

    inputs = tokenizer(
        text_with_prefix,
        padding="max_length",
        max_length=max_length,
        truncation=True,
        return_tensors="pt"
    )

    model.eval()
    outputs = model.generate(
        inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device)
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    tokenizer = T5Tokenizer.from_pretrained("zhursvlevy/t5-small-bashkir-russian")
    model = T5ForConditionalGeneration.from_pretrained("zhursvlevy/t5-small-bashkir-russian")

    input_text = "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап"
    # Expected output: "Ответ на главный вопрос жизни, Вселенной и всего такого"
    print(infer(model, tokenizer, "cpu", [input_text], "русский"))
```

*The widget may not work correctly, likely because the default translation pipeline does not add the task prefix the model expects.
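#### Reproduce the metrics

A minimal sketch of how BLEU and chrF scores like the ones above could be computed with the Hugging Face `evaluate` library, reusing `infer` from the snippet above. The split, sample size, and the `"ba"`/`"ru"` field names are assumptions, not the exact setup used for this card:

```python
# Hypothetical evaluation sketch: the split and the "ba"/"ru" column names
# are assumptions; adjust them to the actual corpus schema.
import evaluate
from datasets import load_dataset

dataset = load_dataset("AigizK/bashkir-russian-parallel-corpora", split="train")
sample = dataset.select(range(100))  # small sample to keep the sketch fast

sources = [row["ba"] for row in sample]
references = [[row["ru"]] for row in sample]  # sacrebleu expects one list of references per prediction

predictions = infer(model, tokenizer, "cpu", sources, "русский")

bleu = evaluate.load("sacrebleu")  # both metrics report scores on a 0-100 scale
chrf = evaluate.load("chrf")

# Divide by 100 to match the 0-1 scale reported in the Metrics section.
print("BLEU:", bleu.compute(predictions=predictions, references=references)["score"] / 100)
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"] / 100)
```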