<div align="center">
<h1><b><i>IndicTrans</i></b></h1>
<a href="http://indicnlp.ai4bharat.org/samanantar">Website</a> |
<a href="https://arxiv.org/abs/2104.05596">Paper</a> |
<a href="https://youtu.be/QwYPOd1eBtQ?t=383">Video</a><br><br>
</div>
**IndicTrans** is a Transformer-4x (~434M parameters) multilingual NMT model trained on the [Samanantar](https://indicnlp.ai4bharat.org/samanantar) dataset, which was the largest publicly available parallel corpora collection for Indic languages at the time of writing (14 April 2021). It is a single-script model, i.e., we convert all the Indic data to the Devanagari script, which allows for ***better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages and allows using a smaller subword vocabulary***. We currently release two models, Indic to English and English to Indic, and support the following 11 Indic languages:
| <!-- --> | <!-- --> | <!-- --> | <!-- --> |
| ------------- | -------------- | ------------ | ----------- |
| Assamese (as) | Hindi (hi) | Marathi (mr) | Tamil (ta) |
| Bengali (bn) | Kannada (kn) | Oriya (or) | Telugu (te) |
| Gujarati (gu) | Malayalam (ml) | Punjabi (pa) |             |
- [Updates](#updates)
- [Download IndicTrans models:](#download-indictrans-models)
- [Using the model for translating any input](#using-the-model-for-translating-any-input)
- [Finetuning the model on your input dataset](#finetuning-the-model-on-your-input-dataset)
- [Mining Indic to Indic pairs from english centric corpus](#mining-indic-to-indic-pairs-from-english-centric-corpus)
- [Installation](#installation)
- [How to train the indictrans model on your training data?](#how-to-train-the-indictrans-model-on-your-training-data)
- [Network & Training Details](#network--training-details)
- [Folder Structure](#folder-structure)
- [Citing](#citing)
- [License](#license)
- [Contributors](#contributors)
- [Contact](#contact)
## Updates
<details><summary>Click to expand </summary>
18 December 2021
```
Tutorials updated with latest model links
```
26 November 2021
```
- v0.3 models are now available for download
```
27 June 2021
```
- Updated links for indic to indic model
- Add more comments to training scripts
- Add link to [Samanantar Video](https://youtu.be/QwYPOd1eBtQ?t=383)
- Add folder structure in readme
- Add python wrapper for model inference
```
09 June 2021
```
- Updated links for models
- Added Indic to Indic model
```
09 May 2021
```
- Added fix for finetuning on datasets where some lang pairs are not present. Previously the script assumed the finetuning dataset will have data for all 11 indic lang pairs
- Added colab notebook for finetuning instructions
```
</details>
## Download IndicTrans models:
Indic to English: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/indic-en.zip)
English to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/en-indic.zip)
Indic to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/m2m.zip)
## Using the model for translating any input
The model is trained on single sentences, so when using our command line interface you need to split paragraphs into sentences before running the translation. (The Python interface has a `translate_paragraph` method to handle multi-sentence translations.)
Note: IndicTrans is trained with a max sequence length of **200** tokens (subwords). If your sentence is too long (> 200 tokens), the sentence will be truncated to 200 tokens before translation.
Here is an example snippet to split paragraphs into sentences for English and Indic languages supported by our model:
```python
# Install these libraries first:
#   pip install mosestokenizer
#   pip install indic-nlp-library
from mosestokenizer import MosesSentenceSplitter
from indicnlp.tokenize import sentence_tokenize

INDIC = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]


def split_sentences(paragraph, language):
    """Split a paragraph into sentences for English or a supported Indic language."""
    if language == "en":
        with MosesSentenceSplitter(language) as splitter:
            return splitter([paragraph])
    elif language in INDIC:
        return sentence_tokenize.sentence_split(paragraph, lang=language)
    raise ValueError(f"Unsupported language code: {language}")


split_sentences("""COVID-19 is caused by infection with the severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2) virus strain. The disease is mainly transmitted via the respiratory
route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing. """, language='en')
# >> ['COVID-19 is caused by infection with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus strain.',
#     'The disease is mainly transmitted via the respiratory route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing.']

split_sentences("""இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது. இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.
அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.""",
language='ta')
# >> ['இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.',
#     'இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது.',
#     'இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.',
#     'அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.']
```
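To illustrate how sentence splitting fits into paragraph translation, here is a minimal sketch of the split, translate, and rejoin flow. It is not the implementation of the `translate_paragraph` method mentioned above; the function name `translate_paragraph_sketch` and its `split`, `translate`, and `count_subwords` parameters are hypothetical and exist only for illustration. The default whitespace-based counter is a rough proxy (an assumption), since the real limit of 200 applies to subword tokens.

```python
from typing import Callable, List

MAX_SUBWORDS = 200  # maximum sequence length stated in this README


def translate_paragraph_sketch(
    paragraph: str,
    split: Callable[[str], List[str]],
    translate: Callable[[List[str]], List[str]],
    count_subwords: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Split a paragraph, translate each sentence, and rejoin the results.

    `split` and `translate` are supplied by the caller (hypothetical hooks);
    `count_subwords` defaults to a whitespace count, which underestimates
    the true subword count.
    """
    sentences = [s.strip() for s in split(paragraph) if s.strip()]

    # Flag sentences that would likely be truncated by the model.
    too_long = [s for s in sentences if count_subwords(s) > MAX_SUBWORDS]
    if too_long:
        print(f"warning: {len(too_long)} sentence(s) exceed {MAX_SUBWORDS} tokens")

    return " ".join(translate(sentences))
```

In practice, `split` could be `lambda p: split_sentences(p, "en")` from the snippet above, and `translate` would wrap whichever inference interface you use.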
Follow the Colab notebook to set up the environment, download the trained _IndicTrans_ models, and translate your own text.
Command line interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indictrans_fairseq_inference.ipynb)
Python interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_python_interface.ipynb)
The Python interface is useful if you want to reuse the model for multiple translations without reinitializing it each time.
## Finetuning the model on your input dataset
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_Finetuning.ipynb)
The Colab notebook can be used to set up the environment, download the trained _IndicTrans_ models, and prepare your custom dataset for finetuning the IndicTrans model. There is also a section on mining Indic to Indic data from an English-centric corpus for finetuning the Indic to Indic model.
**Note**: Since this is a big model (~434M params), you might not be able to train with reasonable batch sizes on a free Google Colab account. We are planning to release smaller models (after pruning / distillation) soon.
## Mining Indic to Indic pairs from english centric corpus
The `extract_non_english_pairs` function in `scripts/extract_non_english_pairs.py` can be used to mine Indic to Indic pairs from an English-centric corpus.
As described in the [paper](https://arxiv.org/pdf/2104.05596.pdf) (section 2.5), we use a very strict deduplication criterion to avoid creating very similar parallel sentences. For example, if an en sentence is aligned to *M* hi sentences and *N* ta sentences, then we would get *MN* hi-ta pairs. However, these pairs would be very similar and would not contribute much to the training process. Hence, we retain only 1 randomly chosen pair out of these *MN* pairs.
```python
def extract_non_english_pairs(indir, outdir, LANGS):
    """
    Extracts non-English pair parallel corpora

    indir: contains English-centric data in the following form:
            - directory named en-xx for language xx
            - each directory contains a train.en and train.xx
    outdir: output directory to store mined data for each pair.
            One directory is created for each pair.
    LANGS: list of languages in the corpus (other than English).
            The language codes must correspond to the ones used in the
            files and directories in indir. Preferably, sort the languages
            in this list in alphabetic order. outdir will contain data for xx-yy,
            but not for yy-xx, so it will be convenient to have this list in sorted order.
    """
```
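The deduplication rule described above can be sketched in a few lines. This is an illustrative sketch only, not the code in `scripts/extract_non_english_pairs.py`: the function name `mine_pairs`, its in-memory inputs, and the use of exact string matching on the English side are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def mine_pairs(
    en_xx: List[Tuple[str, str]],
    en_yy: List[Tuple[str, str]],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Mine xx-yy pairs by pivoting on identical English sentences.

    en_xx / en_yy are lists of (english, translation) tuples.
    For each shared English sentence with M xx-side and N yy-side
    translations, keep only 1 randomly chosen pair out of the M*N candidates.
    """
    rng = random.Random(seed)

    xx_by_en: Dict[str, List[str]] = defaultdict(list)
    for en, xx in en_xx:
        xx_by_en[en].append(xx)

    yy_by_en: Dict[str, List[str]] = defaultdict(list)
    for en, yy in en_yy:
        yy_by_en[en].append(yy)

    pairs = []
    # Only English sentences present on both sides can act as a pivot.
    for en in sorted(xx_by_en.keys() & yy_by_en.keys()):
        # Picking each side uniformly at random selects one of the
        # M*N candidate pairs uniformly at random.
        pairs.append((rng.choice(xx_by_en[en]), rng.choice(yy_by_en[en])))
    return pairs
```

With 2 hi and 3 ta translations of the same English sentence, this yields a single hi-ta pair rather than all 6 combinations.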
## Installation
<details><summary>Click to expand </summary>
```bash
cd indicTrans
git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
git clone https://github.com/rsennrich/subword-nmt.git
# install required libraries
pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow indic-nlp-library
# Install fairseq from source
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable ./
```
</details>
## How to train the indictrans model on your training data?
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/IndicTrans_training.ipynb)
Follow the Colab notebook to set up the environment, download the dataset, and train the IndicTrans model.
## Network & Training Details
- Architecture: IndicTrans uses 6 encoder and 6 decoder layers, input embeddings of size 1536, 16 attention heads, and a feedforward dimension of 4096, for a total of 434M parameters
- Loss: Cross entropy loss
- Optimizer: Adam
- Label Smoothing: 0.1
- Gradient clipping: 1.0
- Learning rate: 5e-4
- Warmup_steps: 4000
Please refer to sections 4 and 5 of our [paper](https://arxiv.org/ftp/arxiv/papers/2104/2104.05596.pdf) for more details on the training and experimental setup.
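As a rough sanity check on the architecture numbers above, the following sketch estimates how many parameters sit in the attention and feedforward weight matrices of the 6 encoder and 6 decoder layers. It assumes a standard Transformer layer layout and deliberately ignores embeddings, output projection, biases, and layer norms, so it does not reproduce the reported 434M total; the function name `layer_weight_params` is hypothetical.

```python
def layer_weight_params(d_model: int, d_ffn: int):
    """Weight-matrix parameter counts for one encoder and one decoder layer.

    Assumes a standard Transformer layout; biases and layer norms are ignored.
    The number of attention heads does not change these counts.
    """
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    feedforward = 2 * d_model * d_ffn   # two linear maps: d_model <-> d_ffn
    encoder_layer = attention + feedforward          # self-attention + FFN
    decoder_layer = 2 * attention + feedforward      # self- and cross-attention + FFN
    return encoder_layer, decoder_layer


enc, dec = layer_weight_params(d_model=1536, d_ffn=4096)
total = 6 * enc + 6 * dec
print(f"{total / 1e6:.1f}M")  # → 320.9M
```

Under these assumptions, roughly 321M of the parameters are in the layer weight matrices; the remainder of the reported total would come from the components this sketch leaves out.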
## Folder Structure
```
IndicTrans
│   .gitignore
│   apply_bpe_traindevtest_notag.sh  # apply bpe for joint vocab (train, dev and test)
│   apply_single_bpe_traindevtest_notag.sh  # apply bpe for separate vocab (train, dev and test)
│   binarize_training_exp.sh  # binarize the training data after preprocessing for fairseq-training
│   compute_bleu.sh  # compute BLEU scores with postprocessing after translating with `joint_translate.sh`
│   indictrans_fairseq_inference.ipynb  # colab example to show how to use model for inference
│   indicTrans_Finetuning.ipynb  # colab example to show how to use model for finetuning on custom domain data
│   joint_translate.sh  # used for inference (see colab inference notebook for more details on usage)
│   learn_bpe.sh  # learning joint bpe on preprocessed text
│   learn_single_bpe.sh  # learning separate bpe on preprocessed text
│   LICENSE
│   prepare_data.sh  # prepare data given an experiment dir (this does preprocessing,
│                    # building vocab, binarization) for bilingual training
│   prepare_data_joint_training.sh  # prepare data given an experiment dir (this does preprocessing,
│                                   # building vocab, binarization) for joint training
│   README.md
│
├───legacy  # old unused scripts
├───model_configs  # custom model configurations are stored here
│       custom_transformer.py  # contains custom 4x transformer models
│       __init__.py
├───inference
│       custom_interactive.py  # for python wrapper around fairseq-interactive
│       engine.py  # python interface for model inference
└───scripts  # stores python scripts that are used by other bash scripts
    │   add_joint_tags_translate.py  # add lang tags to the processed training data for bilingual training
    │   add_tags_translate.py  # add lang tags to the processed training data for joint training
    │   clean_vocab.py  # clean vocabulary after building with subword_nmt
    │   concat_joint_data.py  # concatenates lang pair data and creates text files to keep track
    │                         # of number of lines in each lang pair.
    │   extract_non_english_pairs.py  # mining Indic to Indic pairs from English-centric corpus
    │   postprocess_translate.py  # postprocesses translations
    │   preprocess_translate.py  # preprocesses translations and handles script conversion (from Indic to Devanagari)
    │   remove_large_sentences.py  # removes large sentences from training data
    └───remove_train_devtest_overlaps.py  # finds and removes overlapping data of train with dev and test sets
```
## Citing
If you are using any of the resources, please cite the following article:
```
@misc{ramesh2021samanantar,
title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
year={2021},
eprint={2104.05596},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.
### License
The IndicTrans code (and models) are released under the MIT License.
### Contributors
- Gowtham Ramesh, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
- Sumanth Doddapaneni, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
- Aravinth Bheemaraj, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
- Mayank Jobanputra, <sub>([IITM](https://www.iitm.ac.in))</sub>
- Raghavan AK, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
- Ajitesh Sharma, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
- Sujit Sahoo, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
- Harshita Diddee, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
- Mahalakshmi J, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
- Divyanshu Kakwani, <sub>([IITM](https://www.iitm.ac.in), [AI4Bharat](https://ai4bharat.org))</sub>
- Navneet Kumar, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
- Aswin Pradeep, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
- Kumar Deepak, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
- Vivek Raghavan, <sub>([EkStep](https://ekstep.in))</sub>
- Anoop Kunchukuttan, <sub>([Microsoft](https://www.microsoft.com/en-in/), [AI4Bharat](https://ai4bharat.org))</sub>
- Pratyush Kumar, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
- Mitesh Shantadevi Khapra, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
### Contact
- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))