---
language:
- multilingual
- en
- ru
- de
- fr
- es
- zh
- ja
- ko
- ar
license: cc-by-nc-4.0
library_name: transformers
tags:
- sonar
- sentence-embeddings
- multilingual
- translation
- text-generation
- text2text-generation
base_model: facebook/nllb-200-distilled-1.3B
pipeline_tag: text2text-generation
---

# SONAR 200 Text Decoder (HuggingFace Port)

This is a port of [Meta's SONAR](https://github.com/facebookresearch/SONAR) text decoder from fairseq2 to the HuggingFace Transformers format.

## Model Description

The SONAR decoder converts 1024-dimensional sentence embeddings back into text. It supports 202 languages (the same set as NLLB-200).

- **Original model:** [facebook/SONAR](https://huggingface.co/facebook/SONAR)
- **Encoder port:** [cointegrated/SONAR_200_text_encoder](https://huggingface.co/cointegrated/SONAR_200_text_encoder)
- **Code & Documentation:** [GitHub: sonar-transformers](https://github.com/raxtemur/sonar-transformers)

## Usage

### With sonar_transformers library (recommended)

```bash
pip install torch transformers sentencepiece
```

```python
from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Translation
result = pipeline.translate(
    ["Hello, how are you?"],
    source_lang="eng_Latn",
    target_lang="rus_Cyrl"
)
print(result)  # ['Здравствуйте, как дела?']

# Encode text to embeddings
embeddings = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
print(embeddings.shape)  # torch.Size([1, 1024])

# Decode embeddings back to text
texts = pipeline.decode(embeddings, target_lang="eng_Latn")
print(texts)  # ['Hello world!']
```

### Direct usage with transformers

```python
import torch
from transformers import M2M100ForConditionalGeneration, NllbTokenizer
from transformers.modeling_outputs import BaseModelOutput

# Load model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("raxtemur/SONAR_200_text_decoder")
tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")

# Your embeddings from the SONAR encoder (1024-dim vectors)
embeddings = torch.randn(1, 1024)  # Replace with actual embeddings

# Wrap the embeddings as encoder outputs of shape (batch, seq_len=1, hidden=1024)
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))

# Generate text in the target language
target_lang = "eng_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang)

generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
    num_beams=5
)

text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
```

## Compatibility

Tested against the original fairseq2 SONAR:

| Test | Result |
|------|--------|
| Encoder cosine similarity | **1.000000** |
| Decoder output match | **Identical** |
| Round-trip (encode→decode) | **Works** |
| Translation | **Works** |

Example outputs:
- "Hello world!" → "Hello world!" ✓
- "This is a test sentence." → "This is a test sentence." ✓
- eng→rus: "Hello, how are you?" → "Здравствуйте, как дела?" ✓
- eng→deu: "Machine learning is powerful." → "Maschinelles Lernen ist mächtig." ✓

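The cosine-similarity figure above is just the standard vector comparison between an embedding from the fairseq2 encoder and the same sentence's embedding from this port. A minimal pure-Python sketch of that check (the four-element vectors are hypothetical stand-ins for real 1024-dim embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for a fairseq2 embedding and its HF-port counterpart;
# identical weights should produce identical embeddings, i.e. similarity 1.0.
fairseq2_vec = [0.1, -0.2, 0.3, 0.4]
hf_vec = [0.1, -0.2, 0.3, 0.4]

print(round(cosine_similarity(fairseq2_vec, hf_vec), 6))  # 1.0
```

In practice the same check would run over a batch of real sentences encoded by both implementations.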
## Conversion Details

This model was converted from the original fairseq2 checkpoint using the following key mappings:

| fairseq2 | HuggingFace |
|----------|-------------|
| `decoder.decoder.layers.N.encoder_decoder_attn.*` | `model.decoder.layers.N.encoder_attn.*` |
| `decoder.decoder.layers.N.ffn.inner_proj.*` | `model.decoder.layers.N.fc1.*` |
| `decoder.decoder.layers.N.ffn.output_proj.*` | `model.decoder.layers.N.fc2.*` |
| `decoder.decoder.layers.N.ffn_layer_norm.*` | `model.decoder.layers.N.final_layer_norm.*` |
| `decoder.decoder_frontend.embed.weight` | `model.decoder.embed_tokens.weight` |
| `decoder.final_proj.weight` | `lm_head.weight` |

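A renaming pass along these lines can be sketched as a few regex substitutions over the state-dict keys. This is an illustrative reconstruction of the table above, not the actual conversion script; the helper name `convert_key` is made up:

```python
import re

# (pattern, replacement) pairs mirroring the mapping table above.
KEY_RULES = [
    (r"^decoder\.decoder\.layers\.(\d+)\.encoder_decoder_attn\.", r"model.decoder.layers.\1.encoder_attn."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.inner_proj\.", r"model.decoder.layers.\1.fc1."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.output_proj\.", r"model.decoder.layers.\1.fc2."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn_layer_norm\.", r"model.decoder.layers.\1.final_layer_norm."),
    (r"^decoder\.decoder_frontend\.embed\.weight$", r"model.decoder.embed_tokens.weight"),
    (r"^decoder\.final_proj\.weight$", r"lm_head.weight"),
]

def convert_key(fairseq2_key: str) -> str:
    """Map a fairseq2 state-dict key to its HuggingFace name."""
    for pattern, replacement in KEY_RULES:
        new_key, n = re.subn(pattern, replacement, fairseq2_key)
        if n:
            return new_key
    return fairseq2_key  # keys not in the table pass through unchanged

print(convert_key("decoder.decoder.layers.3.ffn.inner_proj.weight"))
# model.decoder.layers.3.fc1.weight
```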
Special tokens were reordered:
- fairseq2: `[pad=0, unk=1, bos=2, eos=3]`
- HuggingFace: `[bos=0, pad=1, eos=2, unk=3]`

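Reordering special tokens means permuting the first few rows of the embedding (and `lm_head`) matrices so each token's vector lands at its new id. A minimal sketch of that reindexing, assuming the first four rows are exactly the special tokens listed above (`reorder_rows` is an illustrative helper, not part of the conversion code):

```python
# fairseq2 ids: pad=0, unk=1, bos=2, eos=3
# HF ids:       bos=0, pad=1, eos=2, unk=3
FAIRSEQ2 = {"pad": 0, "unk": 1, "bos": 2, "eos": 3}
HF = {"bos": 0, "pad": 1, "eos": 2, "unk": 3}

def special_token_permutation():
    """For each HF id 0..3, the fairseq2 row index that should move there."""
    order = sorted(HF, key=HF.get)           # ["bos", "pad", "eos", "unk"]
    return [FAIRSEQ2[tok] for tok in order]  # [2, 0, 3, 1]

def reorder_rows(rows):
    """Apply the permutation to the first four embedding rows; keep the rest."""
    perm = special_token_permutation()
    return [rows[i] for i in perm] + rows[4:]

print(special_token_permutation())  # [2, 0, 3, 1]
```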
## Language Codes (FLORES-200)

Common codes:
- `eng_Latn` - English
- `rus_Cyrl` - Russian
- `deu_Latn` - German
- `fra_Latn` - French
- `spa_Latn` - Spanish
- `zho_Hans` - Chinese (Simplified)
- `jpn_Jpan` - Japanese
- `kor_Hang` - Korean
- `arb_Arab` - Arabic

Full list: 202 languages from FLORES-200.

## Citation

```bibtex
@article{Duquenne:2023:sonar_arxiv,
  author  = {Duquenne, Paul-Ambroise and Schwenk, Holger and Balikas, Georgios and others},
  title   = {SONAR: Sentence-Level Multimodal and Language-Agnostic Representations},
  journal = {arXiv preprint arXiv:2308.11466},
  year    = {2023},
}
```

## License

**CC-BY-NC-4.0** (inherited from the original SONAR)

The model weights are derived from [Meta's SONAR](https://github.com/facebookresearch/SONAR) and are licensed under CC-BY-NC-4.0. Commercial use is not permitted.

## Acknowledgments

- [Meta AI](https://github.com/facebookresearch/SONAR) - original SONAR model and code
- [cointegrated](https://huggingface.co/cointegrated/SONAR_200_text_encoder) - inspiration for the encoder conversion