
sentence-transformers/static-similarity-mrl-multilingual-v1

License: apache-2.0

Multilingual similarity embeddings trained with Matryoshka loss, which allows the embedding vectors to be truncated more effectively. The model was trained on multilingual datasets from a variety of domains.

It's a general-purpose model that can be used for semantic textual similarity, paraphrase mining, text classification, clustering, and more.
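
As a rough illustration of what truncation means for Matryoshka-style embeddings, the sketch below keeps the first few dimensions of a vector and re-normalizes before comparing. The toy 8-dimensional vectors and the helper names (`truncate_embedding`, `cosine_similarity`) are assumptions made for this example and are not taken from the model or its library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize."""
    return l2_normalize(vec[:dims])

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional vectors standing in for 1,024-dimensional embeddings.
a = [0.9, 0.1, -0.4, 0.3, 0.05, -0.02, 0.01, 0.03]
b = [0.8, 0.2, -0.5, 0.2, -0.04, 0.03, 0.02, -0.01]

full = cosine_similarity(a, b)
short = cosine_similarity(truncate_embedding(a, 4), truncate_embedding(b, 4))
print(f"full: {full:.4f}  truncated to 4 dims: {short:.4f}")
```

With real embeddings the same slicing would apply to the leading dimensions of each 1,024-dimensional vector; how much quality is retained at a given size depends on the model and the task.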

Model Stats

Statistics describing the shape of the embeddings tensor and the distribution of its values.

| item | metric | value |
|---|---|---|
| vocab | size | 105,879 |
| embedding | dimensions | 1,024 |
| vector length | mean | 413.61 |
| vector length | median | 437.74 |
| vector length | stddev | 195.51 |
| values | mean | -0.02 |
| values | median | -0.01 |
| values | stddev | 14.30 |
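
Statistics like these could be computed along the following lines. This is a minimal sketch, assuming that "vector length" means the L2 norm of each row and that the standard deviation is the population form; the original does not state either, and the function name `embedding_stats` and the toy table are hypothetical.

```python
import math
import statistics

def embedding_stats(table):
    """Summarize a list-of-rows embedding table.

    Assumptions: "vector length" is the L2 norm of a row, and stddev is the
    population standard deviation.
    """
    lengths = [math.sqrt(sum(v * v for v in row)) for row in table]
    values = [v for row in table for v in row]
    return {
        "vocab size": len(table),
        "embedding dimensions": len(table[0]),
        "vector length mean": statistics.fmean(lengths),
        "vector length median": statistics.median(lengths),
        "vector length stddev": statistics.pstdev(lengths),
        "values mean": statistics.fmean(values),
        "values median": statistics.median(values),
        "values stddev": statistics.pstdev(values),
    }

# Toy table: 3 "tokens" with 2 dimensions each.
toy_table = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
stats = embedding_stats(toy_table)
for name, value in stats.items():
    print(f"{name}: {value}")
```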

Mean Pooled Quantization Loss

This test round-trips the vectors through quantization but performs the mean pooling arithmetic in float32. The cosine similarity between the quantized and unquantized mean-pooled vectors shows how much the meaning of the vector has changed due to quantization.

| Precision | Cosine Similarity |
|---|---|
| fp16 | 1.00000 |
| fp8 e4m3 | 0.99980 |
| fp8 e5m2 | 0.99921 |
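
The procedure described above can be sketched for the fp16 case using only the Python standard library. This is an illustration under stated assumptions, not the script that produced the table: the token vectors are made up, the helper names are hypothetical, and pooling is done with Python floats (float64) rather than float32.

```python
import math
import struct

def fp16_roundtrip(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mean_pool(vectors):
    """Average the vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up token vectors standing in for rows of the embedding table.
token_vectors = [
    [12.3456, -7.891, 0.0123],
    [-3.21, 15.75, 8.001],
    [0.5, -0.25, 21.337],
]

# Quantize each value, then pool both versions in full precision.
quantized = [[fp16_roundtrip(v) for v in row] for row in token_vectors]
similarity = cosine_similarity(mean_pool(token_vectors), mean_pool(quantized))
print(f"fp16 cosine similarity: {similarity:.5f}")
```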

Quantization Loss Per Vector

While ultimately the embedding vectors will be mean pooled together, it's still useful to look at the loss per-vector in the embedding table to see which quantization strategies retain the most vector meaning.

  • Cosine Similarity — measures how well the direction of embedding vectors is preserved after quantization, independent of scale. This is especially relevant when embeddings are used for similarity search or retrieval.
  • MSE (Mean Squared Error) — emphasizes large errors by squaring the differences. Useful for detecting whether any values are badly distorted.
  • MAE (Mean Absolute Error) — the average absolute difference between original and quantized values. Easier to interpret, less sensitive to outliers.

| Precision | Metric | Value |
|---|---|---|
| fp16 | cosine similarity | 1.00000 |
| fp8 e4m3 | cosine similarity | 0.99965 |
| fp8 e5m2 | cosine similarity | 0.99861 |
| fp16 | MSE | 0.00001 |
| fp8 e4m3 | MSE | 0.14369 |
| fp8 e5m2 | MSE | 0.56917 |
| fp16 | MAE | 0.00183 |
| fp8 e4m3 | MAE | 0.23372 |
| fp8 e5m2 | MAE | 0.46585 |
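
To make the three metrics concrete, the sketch below simulates fp8 rounding in pure Python and scores a single made-up vector. The rounding helper is an assumption of this example (round-to-nearest-even, saturating at the largest finite value, with E4M3 taken to be the variant whose maximum is 448); it is not the quantization code used to produce the table.

```python
import math

def round_to_minifloat(x, mantissa_bits, min_exponent, max_value):
    """Round x to the nearest value on a small floating-point grid.

    Assumptions of this sketch: round-to-nearest-even, saturation at
    max_value, and no handling of NaN or infinity.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    if a >= max_value:
        return sign * max_value
    _, e = math.frexp(a)                   # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, min_exponent)    # subnormals share the smallest exponent
    step = 2.0 ** (exponent - mantissa_bits)
    return sign * round(a / step) * step   # round() is half-to-even

def fp8_e4m3(x):
    return round_to_minifloat(x, mantissa_bits=3, min_exponent=-6, max_value=448.0)

def fp8_e5m2(x):
    return round_to_minifloat(x, mantissa_bits=2, min_exponent=-14, max_value=57344.0)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mae(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A made-up vector with values on the scale reported in Model Stats.
vector = [14.3, -0.02, 7.77, -21.5, 3.14159, 0.6]
for name, quantize in [("fp8 e4m3", fp8_e4m3), ("fp8 e5m2", fp8_e5m2)]:
    q = [quantize(v) for v in vector]
    print(f"{name}: cos={cosine_similarity(vector, q):.5f} "
          f"MSE={mse(vector, q):.5f} MAE={mae(vector, q):.5f}")
```

E4M3 spends one more bit on the mantissa than E5M2, so within its range it rounds to a finer grid, which is consistent with the lower error reported for it in the table.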

Tokenizer Examples

Input: This is an example of encoding
Tokens: [CLS] this is an example of en ##co ##ding [SEP]

Input: The quick brown fox jumps over the lazy dog.
Tokens: [CLS] the quick brown fox jump ##s over the la ##zy dog . [SEP]

Input: Curaçao, naïve fiancé, jalapeño, déjà vu.
Tokens: [CLS] curacao , nai ##ve fia ##nce , ja ##lap ##eno , deja vu . [SEP]

Input: Привет, как дела?
Tokens: [CLS] при ##вет , как дела ? [SEP]

Input: Бързата кафява лисица прескача мързеливото куче.
Tokens: [CLS] б ##ър ##за ##та ка ##ф ##ява ли ##си ##ца пре ##ска ##ча м ##ър ##зе ##ливо ##то к ##уч ##е . [SEP]

Input: Γρήγορη καφέ αλεπού πηδάει πάνω από τον τεμπέλη σκύλο.
Tokens: [CLS] γ ##ρη ##γο ##ρη κ ##α ##φ ##ε α ##λε ##που π ##η ##δα ##ει πανω απο τον τ ##ε ##μ ##πε ##λη σ ##κ ##υλο . [SEP]

Input: اللغة العربية جميلة وغنية بالتاريخ.
Tokens: [CLS] اللغة العربية ج ##ميل ##ة و ##غنية با ##لت ##اري ##خ . [SEP]

Input: مرحبا بالعالم!
Tokens: [CLS] م ##رح ##با با ##ل ##عا ##لم ! [SEP]

Input: Simplified: 快速的棕色狐狸跳过懒狗。
Tokens: [CLS] simplified : [SEP]

Input: Traditional: 快速的棕色狐狸跳過懶狗。
Tokens: [CLS] traditional : [SEP]

Input: 素早い茶色の狐が怠け者の犬を飛び越える。
Tokens: [CLS] える [SEP]

Input: コンピュータープログラミング
Tokens: [CLS] ##ン ##ヒ ##ュー ##ター ##フロ ##ク ##ラ ##ミ ##ンク [SEP]

Input: 빠른 갈색 여우가 게으른 개를 뛰어넘습니다.
Tokens: [CLS] ##ᅡ른 가 ##ᆯ ##색 ##ᅧ ##우 ##가 ##ᅦ ##ᄋ ##ᅳ ##른 ##ᅢ를 ##ᅱ ##어 ##너 ##ᆷ ##스 ##ᆸ니다 . [SEP]

Input: तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।
Tokens: [CLS] ##ज भर ##ी ##ो ##म ##डी आल ##सी ##तत ऊपर ##द ##ती [SEP]

Input: দ্রুত বাদামী শিয়াল অলস কুকুরের উপর দিয়ে লাফ দেয়।
Tokens: [CLS] ##রত বা ##দা ##মী ##িযা ##ল ##ল ##স ##কর ##ের উপর দিযে ##া ##ফ দেয [SEP]

Input: வேகமான பழுப்பு நரி சோம்பேறி நாயின் மேல் குதிக்கிறது.
Tokens: [CLS] ##ே ##கம ##ான ##ழு ##பபு நர ##ி ##ோ ##ம ##ப ##ே ##றி ##ாய ##ின மேல ##ு ##தி ##ககிறது . [SEP]

Input: สุนัขจิ้งจอกสีน้ำตาลกระโดดข้ามสุนัขขี้เกียจ.
Tokens: [CLS] [UNK] . [SEP]

Input: ብሩክ ቡናማ ቀበሮ ሰነፍ ውሻን ተዘልሏል።
Tokens: [CLS] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [SEP]

Input: Hello 世界 مرحبا 🌍
Tokens: [CLS] hello م ##رح ##با [UNK] [SEP]

Input: 123, αβγ, абв, العربية, 中文, हिन्दी.
Tokens: [CLS] 123 , α ##β ##γ , аб ##в , العربية , , हिनदी . [SEP]
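
Several of the token sequences above (lowercased text, "curacao" for "Curaçao", "हिनदी" for "हिन्दी") are consistent with BERT-style normalization: lowercase, decompose to NFD, then drop nonspacing marks. The sketch below reproduces that behavior with the standard library. It is an assumption about what the tokenizer does, inferred from the examples, and `bert_style_normalize` is a hypothetical name rather than part of any library.

```python
import unicodedata

def bert_style_normalize(text):
    """Lowercase, decompose to NFD, and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

sample = "Curaçao, naïve fiancé, jalapeño, déjà vu."
print(bert_style_normalize(sample))
```

The same step would also decompose Hangul syllables into jamo and remove marks such as the Devanagari virama, which may explain the fragmented tokens in the Korean and Hindi examples.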