---
license: cc-by-nc-4.0
---

# SONAR
[[Paper]](https://fb.workplace.com/groups/831302610278251/permalink/9713798772028546) (TODO: change for external link once published) 
[[Demo]](#usage)

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.

Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.

*SONAR* stands for **S**entence-level multim**O**dal and la**N**guage-**A**gnostic **R**epresentations

The full list of supported languages (along with download links) can be found [below](#supported-languages-and-download-links).


## Installing
SONAR depends mainly on [Fairseq2](https://github.com/fairinternal/fairseq2) and can be installed as follows (tested with `python=3.8`):
```bash
pip install --upgrade pip
pip config set global.extra-index-url https://test.pypi.org/simple/
pip install -e .
```

## Usage
fairseq2 will automatically download models into your `$TORCH_HOME/hub` directory when you run the commands below.
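
If you want the models cached somewhere other than the default torch location, one option is to override `TORCH_HOME` before loading any pipeline. A minimal sketch, assuming fairseq2 resolves the variable at download time as the note above suggests (the path is just a placeholder):
```python
import os

# Must be set before the first model download is triggered;
# "/path/to/model/cache" is a placeholder, not a real path.
os.environ["TORCH_HOME"] = "/path/to/model/cache"
```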

### Compute text sentence embeddings with SONAR
```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                           tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2vec_model.predict(sentences, source_lang="eng_Latn").shape
# torch.Size([2, 1024])
```
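
Since the embedding space is shared across languages, mutual translations should land close together. A minimal sketch, assuming `predict` returns a `torch.Tensor` (as the shape above suggests):
```python
import torch.nn.functional as F

eng_emb = t2vec_model.predict(['My name is SONAR.'], source_lang="eng_Latn")
fra_emb = t2vec_model.predict(['Mon nom est SONAR.'], source_lang="fra_Latn")

# Mutual translations should score close to 1.0.
print(F.cosine_similarity(eng_emb, fra_emb).item())
```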

### Translate text with SONAR
```python
from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
                                    decoder="text_sonar_basic_decoder",
                                    tokenizer="text_sonar_basic_encoder")  # tokenizer is attached to both encoder and decoder cards

sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]
```
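
The same pipeline covers any supported language pair; for instance, a hypothetical loop over several targets (language codes follow the FLORES-200 convention used above):
```python
# Decode the same batch into several target languages.
for tgt in ["fra_Latn", "deu_Latn", "spa_Latn"]:
    print(tgt, t2t_model.predict(sentences, source_lang="eng_Latn", target_lang=tgt))
```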

### Compute speech sentence embeddings with SONAR
```python
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

# passing paths to wav files
s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                     "./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])

# passing loaded audio tensors (expected at a 16kHz sample rate)
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
```
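
Because speech and text are embedded in the same space, a segment and its transcription should also land close together. A minimal sketch, reusing `t2vec_model` from the text snippet above and the transcription shown in the next section:
```python
import torch.nn.functional as F

speech_emb = s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav"])
text_emb = t2vec_model.predict(["Television reports show white smoke coming from the plant."],
                               source_lang="eng_Latn")

# Cross-modal similarity between the segment and its transcription.
print(F.cosine_similarity(speech_emb, text_emb).item())
```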

### Speech-to-text translation with SONAR
```python
from sonar.inference_pipelines.speech import SpeechToTextModelPipeline

s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
                                      decoder="text_sonar_basic_decoder",
                                      tokenizer="text_sonar_basic_decoder")

import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']

# passing multiple wav files 
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                   "./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']
```
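
Since the English speech encoder and the text decoder only meet in the shared embedding space, the same pipeline can decode into other languages, giving the zero-shot modality and language combinations mentioned above. A sketch, assuming `fra_Latn` is as valid a target here as in the text-to-text case:
```python
# Zero-shot speech-to-text translation: English audio in, French text out.
s2t_model.predict([inp], target_lang="fra_Latn")
```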


### Predicting [cross-lingual semantic similarity](https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/human_XSTS_eval) with BLASER 2 models
```python
import torch
from sonar.models.blaser.loader import load_blaser_model

blaser_ref = load_blaser_model("blaser_st2st_ref_v2_0").eval()
blaser_qe = load_blaser_model("blaser_st2st_qe_v2_0").eval()
# BLASER-2 operates on SONAR speech and text embeddings; to keep this
# snippet simple, we use dummy embeddings here instead of extracting real ones.
emb = torch.ones([1, 1024])
print(blaser_ref(src=emb, ref=emb, mt=emb).item())  # 5.2552
print(blaser_qe(src=emb, mt=emb).item())  # 4.9819
```
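
To score a real translation, plug in actual SONAR embeddings instead of the dummy ones. A sketch reusing `t2vec_model` from the text snippet above (the sentence pair is just an illustration):
```python
src_emb = t2vec_model.predict(["My name is SONAR."], source_lang="eng_Latn")
mt_emb = t2vec_model.predict(["Mon nom est SONAR."], source_lang="fra_Latn")

# Reference-free (QE) score; higher means a better translation.
print(blaser_qe(src=src_emb, mt=mt_emb).item())
```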

See the more complete demo notebooks:

* [sonar text2text similarity and translation](examples/sonar_text_demo.ipynb)
* [sonar speech2text and other data pipeline examples](examples/inference_pipelines.ipynb)


## Model details

- **Developed by:** Paul-Ambroise Duquenne et al.
- **License:** CC-BY-NC 4.0
- **Cite as:**

  ```bibtex
  @article{Duquenne:2023:sonar_arxiv,
    author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
    title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
    publisher = {arXiv},
    year = {2023},
    url = {https://arxiv.org/abs/unk},
  }
  ```