---
title: XTREME-S
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  XTREME-S is a benchmark to evaluate universal cross-lingual speech
  representations in many languages. XTREME-S covers four task families: speech
  recognition, classification, speech-to-text translation and retrieval.
---

# Metric Card for XTREME-S

## Metric Description

The XTREME-S metric aims to evaluate model performance on the Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark.

This benchmark was designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval.

## How to Use

There are two steps: (1) loading the XTREME-S metric relevant to the subset of the benchmark being used for evaluation; and (2) calculating the metric.

1. **Loading the relevant XTREME-S metric**: the subsets of XTREME-S are the following: `mls`, `voxpopuli`, `covost2`, `fleurs-asr`, `fleurs-lang_id`, `minds14` and `babel`. More information about the different subsets can be found on the XTREME-S benchmark page.

```python
>>> import evaluate
>>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
```

2. **Calculating the metric**: the metric takes two inputs:

- `predictions`: a list of predictions to score, with each prediction a `str`.
- `references`: a list of references, one per prediction, with each reference a `str`.

```python
>>> references = ["it is sunny here", "paper and pen are essentials"]
>>> predictions = ["it's sunny", "paper pen are essential"]
>>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
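```

The returned `results` is a dictionary whose keys depend on the subset being evaluated (see the Output values section below); for `mls`, for example, it contains `wer` and `cer`:

```python
>>> sorted(results.keys())
['cer', 'wer']
```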

It also has two optional arguments:

- `bleu_kwargs`: a dict of keyword arguments to be passed when computing the `bleu` metric for the `covost2` subset. Keywords can be one of `smooth_method`, `smooth_value`, `force`, `lowercase`, `tokenize` and `use_effective_order`.
- `wer_kwargs`: an optional dict of keyword arguments to be passed when computing `wer` and `cer`, which are computed for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets. The only keyword is `concatenate_texts`.
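As an illustration, here is a minimal sketch of passing one of these dicts through `compute`, reusing the `predictions` and `references` lists from above (the `lowercase` setting is purely illustrative, not a recommendation):

```python
>>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
>>> results = xtreme_s_metric.compute(
...     predictions=predictions,
...     references=references,
...     bleu_kwargs={"lowercase": True},  # any of the SacreBLEU keywords listed above
... )
```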

## Output values

The output of the metric depends on the XTREME-S subset chosen: it is a dictionary that contains one or several of the following metrics:

- `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see the accuracy metric for more information). This is returned for the `fleurs-lang_id` and `minds14` subsets.

- `f1`: the harmonic mean of precision and recall (see the F1 score metric for more information). Its range is 0 to 1: its lowest possible value is 0, if either precision or recall is 0, and its highest possible value is 1.0, which means perfect precision and recall. It is returned for the `minds14` subset.

- `wer`: word error rate (WER), a common metric for the performance of an automatic speech recognition (ASR) system. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score (see the WER metric for more information). It is returned for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets of the benchmark.

- `cer`: character error rate (CER), which is similar to WER but operates on characters instead of words. The lower the CER value, the better the performance of the ASR system, with a CER of 0 being a perfect score (see the CER metric for more information). It is returned for the `mls`, `fleurs-asr`, `voxpopuli` and `babel` subsets of the benchmark.

- `bleu`: the BLEU score, calculated according to the SacreBLEU metric approach. It can take any value between 0.0 and 100.0, inclusive, with higher values being better (see SacreBLEU for more details). This is returned for the `covost2` subset.
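As a sanity check on how these values arise, the `wer` of 0.56 in the `mls` example below can be reproduced by hand. WER counts word substitutions (S), deletions (D) and insertions (I) against the total number of reference words (N); aligning the two prediction/reference pairs gives 2 substitutions ("it" vs. "it's", "essentials" vs. "essential") and 3 deletions ("is", "here", "and") over N = 9 reference words:

```
WER = (S + D + I) / N = (2 + 3 + 0) / 9 ≈ 0.56
```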

## Values from popular papers

The original XTREME-S paper reported average WERs ranging from 9.2 to 14.6, a BLEU score of 20.6, an accuracy of 73.3, and an F1 score of 86.9, depending on the subset of the benchmark evaluated.

## Examples

For the `mls` subset (which outputs `wer` and `cer`):

```python
>>> xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
>>> references = ["it is sunny here", "paper and pen are essentials"]
>>> predictions = ["it's sunny", "paper pen are essential"]
>>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
>>> print({k: round(v, 2) for k, v in results.items()})
{'wer': 0.56, 'cer': 0.27}
```

For the `covost2` subset (which outputs `bleu`):

```python
>>> xtreme_s_metric = evaluate.load('xtreme_s', 'covost2')
>>> references = ["bonjour paris", "il est necessaire de faire du sport de temps en temp"]
>>> predictions = ["bonjour paris", "il est important de faire du sport souvent"]
>>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
>>> print({k: round(v, 2) for k, v in results.items()})
{'bleu': 31.65}
```

For the `fleurs-lang_id` subset (which outputs `accuracy`):

```python
>>> xtreme_s_metric = evaluate.load('xtreme_s', 'fleurs-lang_id')
>>> references = [0, 1, 0, 0, 1]
>>> predictions = [0, 1, 1, 0, 0]
>>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
>>> print({k: round(v, 2) for k, v in results.items()})
{'accuracy': 0.6}
```

For the `minds14` subset (which outputs `f1` and `accuracy`):

```python
>>> xtreme_s_metric = evaluate.load('xtreme_s', 'minds14')
>>> references = [0, 1, 0, 0, 1]
>>> predictions = [0, 1, 1, 0, 0]
>>> results = xtreme_s_metric.compute(predictions=predictions, references=references)
>>> print({k: round(v, 2) for k, v in results.items()})
{'f1': 0.58, 'accuracy': 0.6}
```
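As a rough check of this output, and assuming the `f1` reported for `minds14` is macro-averaged over the classes (which these numbers are consistent with): the per-class F1 scores are 2/3 for class 0 and 1/2 for class 1, and 3 of the 5 predictions match their references:

```
f1       = (2/3 + 1/2) / 2 ≈ 0.58
accuracy = 3 / 5           = 0.6
```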

## Limitations and bias

This metric works only with datasets that have the same format as the XTREME-S dataset.

While the XTREME-S dataset is meant to represent a variety of languages and tasks, it has inherent biases: it is missing many languages that are both important and under-represented in NLP datasets.

It also has a particular focus on read speech, because common evaluation benchmarks such as CoVoST-2 and LibriSpeech evaluate on this type of speech. This can result in a mismatch between the performance obtained in a read-speech setting and the performance in noisier settings (in production or live deployment, for instance).

## Citation

```bibtex
@article{conneau2022xtreme,
  title={XTREME-S: Evaluating Cross-lingual Speech Representations},
  author={Conneau, Alexis and Bapna, Ankur and Zhang, Yu and Ma, Min and von Platen, Patrick and Lozhkov, Anton and Cherry, Colin and Jia, Ye and Rivera, Clara and Kale, Mihir and others},
  journal={arXiv preprint arXiv:2203.10752},
  year={2022}
}
```

## Further References