
xm_transformer_600m-en_es-multi_domain

W2V2-Transformer speech-to-text translation model from fairseq S2T (paper/code):

  • English-Spanish
  • Trained on MuST-C, EuroParl-ST, VoxPopuli, Multilingual LibriSpeech, Common Voice v7 and CCMatrix
  • Speech synthesis with facebook/tts_transformer-es-css10

Usage

from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
import IPython.display as ipd
import torchaudio


models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_600m-en_es-multi_domain",
    arg_overrides={"config_yaml": "config.yaml"},
)
model = models[0]
# build_generator expects a list of models, as in the TTS call below
generator = task.build_generator([model], cfg)


# input must be 16000Hz mono (single-channel) audio
audio, _ = torchaudio.load("/path/to/an/audio/file")

sample = S2THubInterface.get_model_input(task, audio)
text = S2THubInterface.get_prediction(task, model, generator, sample)

# speech synthesis
tts_models, tts_cfg, tts_task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/tts_transformer-es-css10",
    arg_overrides={"vocoder": "griffin_lim", "fp16": False},
)
tts_model = tts_models[0]
TTSHubInterface.update_cfg_with_data_cfg(tts_cfg, tts_task.data_cfg)
tts_generator = tts_task.build_generator([tts_model], tts_cfg)

tts_sample = TTSHubInterface.get_model_input(tts_task, text)
wav, sr = TTSHubInterface.get_prediction(
    tts_task, tts_model, tts_generator, tts_sample
)

# play the synthesized Spanish speech at the sample rate returned by the TTS model
ipd.Audio(wav, rate=sr)
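
The usage code notes that the input must be 16000Hz mono audio. As a minimal sketch, assuming the input is an uncompressed PCM WAV file, the header can be checked with Python's standard `wave` module before the file is passed to the model. The helper names `describe_wav` and `is_16k_mono_wav` are hypothetical and not part of fairseq or torchaudio.

```python
import wave

# Input format stated in the usage comment: 16000Hz, mono.
EXPECTED_SAMPLE_RATE = 16000
EXPECTED_CHANNELS = 1


def describe_wav(path):
    """Return (sample_rate, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def is_16k_mono_wav(path):
    """Return True if the WAV file is 16000Hz mono (hypothetical helper)."""
    sample_rate, channels = describe_wav(path)
    return sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
```

If `is_16k_mono_wav("/path/to/an/audio/file")` returns `False`, the audio would need to be resampled or downmixed before calling `S2THubInterface.get_model_input`.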

Citation

@inproceedings{li-etal-2021-multilingual,
    title = "Multilingual Speech Translation from Efficient Finetuning of Pretrained Models",
    author = "Li, Xian  and
      Wang, Changhan  and
      Tang, Yun  and
      Tran, Chau  and
      Tang, Yuqing  and
      Pino, Juan  and
      Baevski, Alexei  and
      Conneau, Alexis  and
      Auli, Michael",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.68",
    doi = "10.18653/v1/2021.acl-long.68",
    pages = "827--838",
}

@inproceedings{wang-etal-2020-fairseq,
    title = "Fairseq {S}2{T}: Fast Speech-to-Text Modeling with Fairseq",
    author = "Wang, Changhan  and
      Tang, Yun  and
      Ma, Xutai  and
      Wu, Anne  and
      Okhonko, Dmytro  and
      Pino, Juan",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.aacl-demo.6",
    pages = "33--39",
}