metadata

license: apache-2.0
language:
  - es
datasets:
  - large_spanish_corpus
  - bertin-project/mc4-es-sampled
  - oscar-corpus/OSCAR-2109

BARTO (base-sized model)

BARTO is a BART model pre-trained on Spanish text. It was introduced in the paper Sequence-to-Sequence Spanish Pre-trained Language Models.

Model description

BARTO is a BART-based model (transformer encoder-decoder) with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function and (2) training the model to reconstruct the original text.
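To make the corrupt-then-reconstruct idea concrete, here is a minimal sketch of one possible noising function: span masking, where a run of tokens is replaced by a single mask symbol. The function names, the `<mask>` string, and the whitespace tokenization are illustrative assumptions; the exact noising used to train BARTO is not described in this card.

```python
import random

MASK = "<mask>"  # assumed placeholder; the real tokenizer's mask token may differ


def infill_noise(tokens, span_start, span_len, mask=MASK):
    """Replace tokens[span_start:span_start + span_len] with one mask token."""
    return tokens[:span_start] + [mask] + tokens[span_start + span_len:]


def random_infill(tokens, max_span=3, seed=None):
    """Mask one randomly chosen span of 1..max_span tokens (hypothetical helper)."""
    rng = random.Random(seed)
    span_len = rng.randint(1, min(max_span, len(tokens)))
    span_start = rng.randint(0, len(tokens) - span_len)
    return infill_noise(tokens, span_start, span_len)


# Toy whitespace tokenization, for illustration only.
tokens = "Hola amigo , bienvenido a casa .".split()

# Corrupt the input by hiding the span "amigo ,".
corrupted = infill_noise(tokens, span_start=1, span_len=2)
# corrupted == ['Hola', '<mask>', 'bienvenido', 'a', 'casa', '.']

# A pre-training example pairs the corrupted input with the original target.
training_pair = (corrupted, tokens)
```

During pre-training the encoder would read the corrupted sequence and the decoder would be trained to emit the original one.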

BARTO is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).

Intended uses

You can use the raw model for text infilling. However, the model is mainly meant to be fine-tuned on a supervised dataset.

How to use

Here is how to use this model in PyTorch:

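from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base encoder-decoder (no task-specific head)
tokenizer = AutoTokenizer.from_pretrained('vgaraujov/bart-base-spanish')
model = AutoModel.from_pretrained('vgaraujov/bart-base-spanish')

# Tokenize a Spanish sentence and run a forward pass
inputs = tokenizer("Hola amigo, bienvenido a casa.", return_tensors="pt")
outputs = model(**inputs)

# Final decoder hidden states, shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state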
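For the text infilling use mentioned under Intended uses, the sketch below masks a span and asks the model to regenerate the sentence. It assumes the checkpoint loads with `AutoModelForSeq2SeqLM` and that its tokenizer defines a mask token; `mask_span` and `fill_mask` are hypothetical helper names, not part of the released code.

```python
def mask_span(text, span, mask_token):
    """Replace the first occurrence of `span` in `text` with `mask_token`."""
    if span not in text:
        raise ValueError(f"{span!r} does not occur in the text")
    return text.replace(span, mask_token, 1)


def fill_mask(text, span, model_name="vgaraujov/bart-base-spanish",
              max_new_tokens=20):
    """Mask `span` in `text` and let the model regenerate the sentence."""
    # Imported lazily so mask_span stays usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Assumption: the checkpoint can be loaded with a language-modeling head.
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Assumption: the tokenizer defines a mask token.
    if tokenizer.mask_token is None:
        raise ValueError("this tokenizer does not define a mask token")

    masked = mask_span(text, span, tokenizer.mask_token)
    inputs = tokenizer(masked, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `fill_mask("Hola amigo, bienvenido a casa.", "amigo")` downloads the checkpoint and returns the model's reconstruction. The generated text depends on the checkpoint and decoding settings, so no expected output is shown.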

Citation (BibTeX)

@misc{araujo2023sequencetosequence,
      title={Sequence-to-Sequence Spanish Pre-trained Language Models}, 
      author={Vladimir Araujo and Maria Mihaela Trusca and Rodrigo Tufiño and Marie-Francine Moens},
      year={2023},
      eprint={2309.11259},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}