---
language:
- en
- el
tags:
- translation
widget:
- text: "'Katerina', is the best name for a girl."
license: apache-2.0
metrics:
- bleu
---

## English to Greek NMT
## By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* source languages: en
* target languages: el
* license: apache-2.0
* dataset: OPUS, CCMatrix
* model: transformer(fairseq)
* pre-processing: tokenization + BPE segmentation
* metrics: bleu, chrf

### Model description

Trained with the Fairseq framework using the `transformer_iwslt_de_en` architecture.\
BPE segmentation (20k codes).\
Mixed-case model.
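
As a rough illustration of what BPE segmentation does, the sketch below learns merge operations from a tiny word list and applies them to a word. It is a toy, pure-Python version of the general technique, not the tooling used to build this model (this card does not say which BPE implementation was used); the function names `learn_bpe`, `merge_pair`, and `apply_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for the example. A real setup would learn on the order of 20k merges from the full training corpus.

```
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)
print(apply_bpe("lowest", merges))
```

Words that never appeared in the toy corpus, such as `lowest` here, are still segmented into known subword units, which is the property that lets a fixed-size vocabulary cover open-ended text.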

### How to use

```
from transformers import FSMTTokenizer, FSMTForConditionalGeneration

# Path to the folder containing the downloaded model files
mname = "<your_downloaded_model_folderpath_here>"

tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

text = "'Katerina', is the best name for a girl."

# Encode the English source sentence as a batch of token IDs (PyTorch tensor)
encoded = tokenizer.encode(text, return_tensors='pt')

# Beam search with 5 beams, returning the 5 highest-scoring hypotheses
outputs = model.generate(encoded, num_beams=5, num_return_sequences=5, early_stopping=True)
for i, output in enumerate(outputs, start=1):
    # Print the raw token IDs, then the decoded Greek translation
    print(f"{i}: {output.tolist()}")

    decoded = tokenizer.decode(output, skip_special_tokens=True)
    print(f"{i}: {decoded}")
```


## Training data

Consolidated corpus from OPUS and CCMatrix (~6.6 GB in total).


## Eval results


Results on the Tatoeba test set (EN-EL):

| BLEU | chrF  |
| ---- | ----- |
| 76.9 | 0.733 |


Results on the XNLI parallel test set (EN-EL):

| BLEU | chrF  |
| ---- | ----- |
| 65.4 | 0.624 |
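
The chrF values above are on a 0 to 1 scale. For readers unfamiliar with the metric, the following is a minimal sketch of a sentence-level character n-gram F-score. It is a simplified illustration and will not reproduce the numbers in the tables, which were presumably computed over whole test sets with a standard evaluation tool (the card does not name one). The function names `char_ngrams` and `simple_chrf`, the choice to ignore whitespace, and the settings `max_n=6` and `beta=2.0` are assumptions made for the example.

```
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order `n`, ignoring whitespace (assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# A hypothesis that covers only part of the reference scores below 1.
print(round(simple_chrf("η γάτα", "η γάτα κοιμάται"), 3))
```

Because the score is built from character n-grams rather than whole words, it gives partial credit for correct stems with wrong inflections, which is useful for a morphologically rich target language such as Greek.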

### BibTeX entry and citation info
TODO