---
library_name: transformers
datasets:
  - AigizK/bashkir-russian-parallel-corpora
language:
  - ru
  - ba
metrics:
  - bleu
pipeline_tag: translation
widget:
  - text: Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап.
    example_title: translation_bak_to_ru
---

## Model Description

`t5-small` from the Google T5 repository, fine-tuned for Russian-Bashkir translation on the [AigizK/bashkir-russian-parallel-corpora](https://huggingface.co/datasets/AigizK/bashkir-russian-parallel-corpora) dataset. The model expects a direction prefix before each input: `русский-башкирский: ` for Russian-to-Bashkir and `башкирский-русский: ` for Bashkir-to-Russian (see the inference example below).

## Metrics

- BLEU: 0.3018
- chrF: 0.5478
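
For reference, below is a minimal sketch of how BLEU and chrF can be computed with the `evaluate` library. The prediction/reference pair is illustrative, not taken from the actual evaluation set, and note that `sacrebleu` reports scores on a 0-100 scale, while the BLEU value above appears to be on a 0-1 scale:

```python
# Illustrative sketch only: computing BLEU and chrF with the `evaluate` library.
import evaluate

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

# Hypothetical model output and reference translation.
predictions = ["Ответ на главный вопрос жизни, Вселенной и всего такого."]
references = [["Ответ на главный вопрос жизни, Вселенной и всего остального."]]

print(bleu.compute(predictions=predictions, references=references)["score"])  # 0-100 scale
print(chrf.compute(predictions=predictions, references=references)["score"])  # 0-100 scale
```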

## Run inference

Use the example below*:

```python
from typing import List, Union

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5TokenizerFast


@torch.inference_mode()
def infer(
    model: T5ForConditionalGeneration,
    tokenizer: Union[T5TokenizerFast, T5Tokenizer],
    device: str,
    texts: List[str],
    target_language: str,
    max_length: int = 256,
) -> List[str]:
    assert target_language in ("русский", "башкирский"), \
        "target language must be in (русский, башкирский)"
    # The model was trained with a direction prefix in front of every input.
    if target_language == "русский":
        prefix = "башкирский-русский: "
    else:
        prefix = "русский-башкирский: "
    # Capitalize the first letter and make sure each text ends with a period.
    text_with_prefix = [
        prefix + text[0].upper() + text[1:] + ("" if text.endswith(".") else ".")
        for text in texts
    ]
    inputs = tokenizer(
        text_with_prefix,
        padding="max_length",
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    model.eval()
    outputs = model.generate(
        inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    tokenizer = T5Tokenizer.from_pretrained("zhursvlevy/t5-small-bashkir-russian")
    model = T5ForConditionalGeneration.from_pretrained("zhursvlevy/t5-small-bashkir-russian")

    input_text = "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап"
    # Expected output: "Ответ на главный вопрос жизни, Вселенной и всего такого"

    print(infer(model, tokenizer, "cpu", [input_text], "русский"))
```
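
Note that `padding="max_length"` pads every input to `max_length` tokens regardless of its length; for short inputs, `padding=True` (pad to the longest sequence in the batch) is typically faster and, since the attention mask is passed to `generate`, should produce the same translations.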

*The widget may not work correctly because the default pipeline does not prepend the direction prefix the model expects.
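
If you still want to call the model through a pipeline, a minimal sketch (assuming the `text2text-generation` task; the prefix string comes from the inference code above):

```python
# Sketch: prepend the direction prefix manually before calling the pipeline.
from transformers import pipeline

translator = pipeline("text2text-generation", model="zhursvlevy/t5-small-bashkir-russian")

result = translator("башкирский-русский: Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап.")
print(result[0]["generated_text"])
```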