BLASER 2.0

[Blog] [Code]

BLASER 2.0 is the new version of BLASER (Chen et al., 2023), a family of models for automatic evaluation of machine translation quality.

BLASER 2.0 is based on SONAR sentence embeddings and works with both speech and text modalities.

This model predicts a similarity score for a translated sentence from the translation and the source sentence alone. It can therefore be applied in settings where reference translations are missing or of questionable quality. In contrast, its sibling model, BLASER 2.0-referenced, also requires a reference translation.

Supervised BLASER models are trained to predict cross-lingual semantic similarity scores, XSTS (Licht et al., 2022), on a scale where 1 corresponds to completely unrelated sentences and 5 corresponds to fully semantically equivalent sentences. The models' predictions, however, are unbounded and can occasionally fall outside this range.
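
Because raw predictions are unbounded, downstream code that expects values on the XSTS scale may want to clip them. The following is a minimal sketch of one way to do that, not something the BLASER code does itself: the helper name `clamp_to_xsts` is hypothetical, and the two out-of-range scores are made-up values for illustration.

```python
XSTS_MIN, XSTS_MAX = 1.0, 5.0  # bounds of the XSTS scale described above


def clamp_to_xsts(score: float) -> float:
    """Clip a raw prediction into the [1, 5] XSTS range (hypothetical helper)."""
    return max(XSTS_MIN, min(XSTS_MAX, score))


# 4.708 is the score from the usage example in this README;
# 5.12 and 0.87 are invented out-of-range values for illustration only.
raw_scores = [4.708, 5.12, 0.87]
print([clamp_to_xsts(s) for s in raw_scores])  # [4.708, 5.0, 1.0]
```

Clipping discards information about how far a prediction overshoots the scale, so it may be preferable to keep the raw scores for ranking systems and clip only for reporting.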

Installation

See the SONAR GitHub repository for installation instructions.

Usage

BLASER 2.0 models accept 1024-dimensional SONAR sentence embeddings as inputs, and produce a single score as an output. The code below illustrates their usage with text embeddings:

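from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

# Load the reference-free (QE) BLASER 2.0 model and switch it to evaluation mode.
blaser = load_blaser_model("blaser_2_0_qe").eval()
# The SONAR text encoder turns sentences into 1024-dimensional embeddings.
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")

# Embed the source sentence and its translation, each with its own language code.
src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")
# Score the translation against the source; no reference translation is needed.
print(blaser(src=src_embs, mt=mt_embs).item())  # 4.708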

With BLASER 2.0 models, SONAR text and speech embeddings can be used interchangeably.
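
Since the models expect 1024-dimensional embeddings regardless of modality, a small shape check can catch mismatched inputs before scoring. This is a minimal sketch under the assumption that embeddings arrive as 2-D batches of shape (number of sentences, 1024); the helper `check_pair` is hypothetical and not part of SONAR.

```python
from typing import Sequence

SONAR_DIM = 1024  # embedding size stated in this README


def check_pair(src_shape: Sequence[int], mt_shape: Sequence[int]) -> int:
    """Validate (batch, 1024) shapes for source and translation embeddings.

    Returns the batch size. Hypothetical helper, not part of SONAR.
    """
    for name, shape in (("src", src_shape), ("mt", mt_shape)):
        if len(shape) != 2 or shape[-1] != SONAR_DIM:
            raise ValueError(
                f"{name}: expected (batch, {SONAR_DIM}), got {tuple(shape)}"
            )
    if src_shape[0] != mt_shape[0]:
        raise ValueError("src and mt must contain the same number of sentences")
    return src_shape[0]


# Pass the `.shape` of each embedding batch; the check is the same whether
# the embeddings came from a text encoder or a speech encoder.
print(check_pair((1, 1024), (1, 1024)))  # 1
```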

Model details

Developed by: Seamless Communication et al.
License: CC-BY-NC 4.0
Citation: If you use BLASER 2.0 in your work, please cite the paper:

@article{seamlessm4t2023,
  title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a}, Onur \c{C}elebi, Maha Elbayad, Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
  journal={ArXiv},
  year={2023}
}