# English to Arabic
Authors:
* Abdallah Bashir
* Amr Muhammad ALAMEEN Khalifa
## Data
* The JW300 English-Arabic (bin) dataset.
* The [TED-Multilingual-Parallel-Corpus](https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus) English-Arabic dataset.
## Test Data
Unlike the other baselines, the test files used to evaluate this model were not taken from the repo. Instead, they were held out as a portion of the merged datasets, with the same number of entries as `test.en-any.en`.
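The exact splitting procedure is not documented here. As a minimal sketch, assuming the merged corpus is held as two aligned lists of lines and the test portion is sampled at random, a held-out split could look like the following. The function name `hold_out_test_split`, the fixed seed, and the toy sentences are illustrative assumptions, not part of the original pipeline; in practice `test_size` would be the number of lines in `test.en-any.en`.

```python
import random


def hold_out_test_split(src_lines, tgt_lines, test_size, seed=42):
    """Split aligned sentence pairs into (train_pairs, test_pairs)."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if not 0 <= test_size <= len(src_lines):
        raise ValueError("test_size must be between 0 and the corpus size")
    pairs = list(zip(src_lines, tgt_lines))
    # Shuffle with a fixed seed (an assumption) so the split is reproducible.
    random.Random(seed).shuffle(pairs)
    return pairs[test_size:], pairs[:test_size]


# Toy stand-ins for the merged JW300 + TED data.
merged_en = ["hello", "thank you", "good morning", "see you soon"]
merged_ar = ["مرحبا", "شكرا", "صباح الخير", "اراك قريبا"]
train_pairs, test_pairs = hold_out_test_split(merged_en, merged_ar, test_size=1)
```

Shuffling pairs rather than each side separately keeps every English line aligned with its Arabic translation, and the held-out pairs never appear in the training portion.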
## Model
- Default Masakhane Transformer translation model.
- [Link to google drive folder with models](https://drive.google.com/drive/folders/18P6HH9wavVpaR3UufoiUsTeqMnkvc1He)
## Analysis
The dataset needs more preprocessing to remove special characters and Scripture chapter/verse names and numbers. It is also very small, which is the primary factor limiting what the model can learn.
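The kind of cleanup described above might be sketched as follows. This is only an illustration under stated assumptions: the `SCRIPTURE_REF` pattern is a hypothetical, deliberately crude regex for English citations such as "John 3:16" (it would also catch phrases like "at 3:30", and the Arabic side would need its own pattern), and "special characters" is taken to mean Unicode symbols plus control and format characters. Neither choice comes from the original pipeline.

```python
import re
import unicodedata

# Hypothetical pattern: optional book number, book name, chapter:verse(-verse),
# e.g. "John 3:16", "1 John 4:8", "Matt. 24:14", "1 Cor. 13:4-7".
SCRIPTURE_REF = re.compile(r"\b(?:[1-3]\s)?[A-Za-z]+\.?\s\d+:\d+(?:-\d+)?")


def clean_line(line):
    """Remove Scripture citations, symbols, and control/format characters."""
    line = SCRIPTURE_REF.sub(" ", line)
    kept = []
    for ch in line:
        major = unicodedata.category(ch)[0]
        if ch.isspace():
            kept.append(" ")      # normalize every kind of whitespace
        elif major == "C":
            continue              # drop control/format chars (e.g. U+200F)
        elif major == "S":
            kept.append(" ")      # replace symbols with a space
        else:
            kept.append(ch)       # keep letters, marks, digits, punctuation
    # Collapse the runs of spaces left behind by the removals.
    return " ".join("".join(kept).split())


cleaned = clean_line("God is love . \u00a9 1 John 4:8")
```

Filtering by Unicode category rather than by an ASCII allow-list leaves Arabic letters and diacritics untouched, which a pattern such as `[^a-z ]` would not.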
Example 1
```text
Source: at the same time , the police gave free passage to busloads of mkalavishvili’s followers , who were bent on destroying the convention site .
Reference: وفي الوقت نفسه ‏ فتحت الشرطه الطريق لباصات اخري تنقل اتباع مكالاڤيشڤيلي الذين كانوا مصمين علي تدمير موقع المحفل ‏
Hypothesis: وفي الوقت نفسه ‏ اعطي الشرطه مقطع مجاني لكثير من اتباع مالكالڤيليڤيليڤيل ‏ الذين كانوا منزعجين في تدمير موقع المحفل ‏
```
Example 2
```text
Source: a big attraction was the man roland lithoman web - offset press that prints up to 90,000 magazines an hour .
Reference: وما لفت انتباه الزوار الي حد كبير هو مطبعه الوب اوفست المتطوره جدا ‏ man roland lithoman‏ ‏ التي يمكن ان تطبع ٠‏٩٠ مجله في الساعه ‏
Hypothesis: كان جذب كبير هو الصحافه الرومانيه ليتوماان ‏ التي تطلق الي ٠‏٩٠ مجله في الساعه ‏
```
## Results
Tokenization | BLEU dev | BLEU test
--- | --- | ---
BPE | 15.45 | 9.28
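
For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score is computed from clipped n-gram precisions and a brevity penalty. It is a simplified illustration, not the scorer that produced the numbers in the table: it assumes whitespace tokenization, one reference per sentence, and no smoothing, so its values will not match standard tools exactly. The function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale, one reference per line."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tok, n)
            ref_ngrams = ngram_counts(ref_tok, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
```

Because the score is a geometric mean over 1- to 4-gram precisions, a hypothesis that gets most individual words right but few longer phrases, as in the examples above, still scores low.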