Spaces:

OFA-Sys
/

OFA-OCR

Runtime error

File size: 5,305 Bytes

ee21b96

# LASER  Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER).

This folder contains source code for training LASER embeddings.


## Prepare data and configuration file

Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing).

Create a json config file with this format:
```
{
  "src_vocab": "/path/to/spm.src.cvocab",
  "tgt_vocab": "/path/to/spm.tgt.cvocab",
  "train": [
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang1-tgtlang0/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang1-tgtlang1/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
    },
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang2-tgtlang0/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang2-tgtlang1/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
    },
    ...
  ],
  "valid": [
    {
      "type": "translation",
      "id": 0,
      "src": "/unused",
      "tgt": "/unused"
    }
  ]
}
```
where paths are paths to binarized indexed fairseq dataset files.
`id` represents the target language id.


## Training Command Line Example

```
fairseq-train \
  /path/to/configfile_described_above.json \
  --user-dir examples/laser/laser_src \
  --log-interval 100 --log-format simple \
  --task laser --arch laser_lstm \
  --save-dir . \
  --optimizer adam \
  --lr 0.001 \
  --lr-scheduler inverse_sqrt \
  --clip-norm 5 \
  --warmup-updates 90000 \
  --update-freq 2 \
  --dropout 0.0 \
  --encoder-dropout-out 0.1 \
  --max-tokens 2000 \
  --max-epoch 50 \
  --encoder-bidirectional \
  --encoder-layers 5 \
  --encoder-hidden-size 512 \
  --decoder-layers 1 \
  --decoder-hidden-size 2048 \
  --encoder-embed-dim 320 \
  --decoder-embed-dim 320 \
  --decoder-lang-embed-dim 32 \
  --warmup-init-lr 0.001 \
  --disable-validation
```


## Applications

We showcase several applications of multilingual sentence embeddings
with code to reproduce our results (in the directory "tasks").

* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the
  [*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6]
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
   Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the
  [*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5]
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli)
  using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6]
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]
* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed)
  example how to calculate sentence embeddings for arbitrary text files in any of the supported language.

**For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.**



## References

[1] Holger Schwenk and Matthijs Douze,
    [*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619),
    ACL workshop on Representation Learning for NLP, 2017

[2] Holger Schwenk and Xian Li,
    [*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf),
    LREC, pages 3548-3551, 2018.

[3] Holger Schwenk,
    [*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037)
    ACL, July 2018

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov,
    [*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269),
    EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk,
    [*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136)
    arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk,
    [*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464)
    arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman,
    [*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791)
    arXiv, July 11  2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin
    [*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944)