---
library_name: transformers
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ru
- ba
metrics:
- bleu
pipeline_tag: translation
widget:
- text: Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап.
  example_title: translation_bak_to_ru
---

### Model Description

t5-small from the [Google T5 repository](https://huggingface.co/google-t5/t5-small), fine-tuned on the [Bashkir-Russian parallel corpus](https://huggingface.co/datasets/AigizK/bashkir-russian-parallel-corpora). The model translates in both directions; the direction is selected with a task prefix on the input.

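A minimal sketch of the two input formats, using the prefixes from the inference code further down:

```python
# Direction prefixes the fine-tuned model expects (taken from the inference
# code below); an input is the prefix plus the sentence to translate.
BAK_TO_RU = "башкирский-русский: "  # Bashkir -> Russian
RU_TO_BAK = "русский-башкирский: "  # Russian -> Bashkir

model_input = BAK_TO_RU + "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап."
```
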
#### Metrics

BLEU: 0.3018

chrF: 0.5478

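The card does not include the evaluation script; below is a minimal sketch of how such scores can be computed with the `evaluate` library (an assumption, not the actual evaluation code; the prediction/reference pair is an illustrative placeholder):

```python
import evaluate

bleu = evaluate.load("bleu")
chrf = evaluate.load("chrf")

# Illustrative placeholder; a real evaluation would use model outputs and
# reference translations from a held-out split of the parallel corpus.
predictions = ["Ответ на главный вопрос жизни, Вселенной и всего такого."]
references = [["Ответ на главный вопрос жизни, Вселенной и всего такого."]]

print(bleu.compute(predictions=predictions, references=references)["bleu"])
# evaluate's chrF is reported on a 0-100 scale; divide by 100 to compare
# with the chrF value above.
print(chrf.compute(predictions=predictions, references=references)["score"] / 100)
```
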
#### Run inference

Use the example below*:

```python
from typing import List, Union

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5TokenizerFast


@torch.inference_mode()
def infer(
    model: T5ForConditionalGeneration,
    tokenizer: Union[T5TokenizerFast, T5Tokenizer],
    device: str,
    texts: List[str],
    target_language: str,
    max_length: int = 256,
) -> List[str]:
    assert target_language in ("русский", "башкирский"), "target language must be in (русский, башкирский)"
    # The task prefix tells the model which translation direction to use.
    if target_language == "русский":
        prefix = "башкирский-русский: "
    else:
        prefix = "русский-башкирский: "
    # Capitalize the first letter and make sure each input ends with a period,
    # matching the formatting used during fine-tuning.
    text_with_prefix = []
    for text in texts:
        text = text[0].upper() + text[1:]
        if not text.endswith("."):
            text += "."
        text_with_prefix.append(prefix + text)
    inputs = tokenizer(
        text_with_prefix,
        padding="max_length",
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    model.eval()
    model.to(device)
    outputs = model.generate(
        inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        max_length=max_length,  # without this, generate() falls back to a short default length
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    tokenizer = T5Tokenizer.from_pretrained("zhursvlevy/t5-small-bashkir-russian")
    model = T5ForConditionalGeneration.from_pretrained("zhursvlevy/t5-small-bashkir-russian")

    input_text = "Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап"
    # Expected translation: "Ответ на главный вопрос жизни, Вселенной и всего такого"
    print(infer(model, tokenizer, "cpu", [input_text], "русский"))
```

*The widget may not work correctly because the default translation pipeline does not add the direction prefix the model expects.
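
As a workaround, the generic `text2text-generation` pipeline can be used with the direction prefix added by hand (a minimal sketch; generation settings are left at their defaults):

```python
from transformers import pipeline

# text2text-generation passes the prefixed string straight to the model,
# so the direction prefix is preserved.
translator = pipeline("text2text-generation", model="zhursvlevy/t5-small-bashkir-russian")
result = translator("башкирский-русский: Тормоштоң, Ғаләмдең һәм бөтә нәмәнең төп һорауына яуап.")
print(result[0]["generated_text"])
```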