# LASER: Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings. You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER).

This folder contains source code for training LASER embeddings.

## Prepare data and configuration file

Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing).

Create a JSON config file with this format:

```
{
  "src_vocab": "/path/to/spm.src.cvocab",
  "tgt_vocab": "/path/to/spm.tgt.cvocab",
  "train": [
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang1-tgtlang0/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang1-tgtlang1/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
    },
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang2-tgtlang0/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang2-tgtlang1/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
    },
    ...
  ],
  "valid": [
    {
      "type": "translation",
      "id": 0,
      "src": "/unused",
      "tgt": "/unused"
    }
  ]
}
```

where each path points to a binarized, indexed fairseq dataset file, and `id` represents the target language id.

## Training Command Line Example

```
fairseq-train \
  /path/to/configfile_described_above.json \
  --user-dir examples/laser/laser_src \
  --log-interval 100 --log-format simple \
  --task laser --arch laser_lstm \
  --save-dir . \
  --optimizer adam \
  --lr 0.001 \
  --lr-scheduler inverse_sqrt \
  --clip-norm 5 \
  --warmup-updates 90000 \
  --update-freq 2 \
  --dropout 0.0 \
  --encoder-dropout-out 0.1 \
  --max-tokens 2000 \
  --max-epoch 50 \
  --encoder-bidirectional \
  --encoder-layers 5 \
  --encoder-hidden-size 512 \
  --decoder-layers 1 \
  --decoder-hidden-size 2048 \
  --encoder-embed-dim 320 \
  --decoder-embed-dim 320 \
  --decoder-lang-embed-dim 32 \
  --warmup-init-lr 0.001 \
  --disable-validation
```

## Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the [*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6]
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix): mining 135M parallel sentences in 1620 language pairs from Wikipedia [7]
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the [*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5]
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli) using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6]
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]
* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed): an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages
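As a minimal sketch of how such sentence embeddings can be consumed downstream, the snippet below loads embeddings stored as a flat file of raw float32 vectors (the on-disk format used by LASER's embedding scripts) and computes cosine similarities between two sets of sentences. The file paths, function names, and the embedding dimension of 1024 are illustrative assumptions, not part of this training code:

```python
import numpy as np

# Assumed embedding dimension (1024 for the LASER LSTM encoder).
EMBED_DIM = 1024

def load_embeddings(path, dim=EMBED_DIM):
    """Load a file of concatenated raw float32 sentence vectors."""
    vecs = np.fromfile(path, dtype=np.float32)
    return vecs.reshape(-1, dim)

def cosine_similarity_matrix(src, tgt):
    """Pairwise cosine similarity between two sets of embeddings."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T

def best_matches(src, tgt):
    """For each source sentence, the index of the most similar target sentence."""
    return cosine_similarity_matrix(src, tgt).argmax(axis=1)
```

Note that the large-scale mining tasks above (e.g. BUCC and WikiMatrix) do not build a dense similarity matrix like this; they use approximate nearest-neighbor indexes and margin-based scoring [5] to stay tractable at scale.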
**For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.**

## References

[1] Holger Schwenk and Matthijs Douze, [*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619), ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li, [*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf), LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, [*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037), ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, [*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269), EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, [*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136), arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, [*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464), arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, [*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791), arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, [*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944), arXiv, 2019.