Model Card for Model ID
This model is a simple bilingual English-German machine translation model trained with MarianNMT. It was converted to the Hugging Face format using scripts derived from those of the Helsinki-NLP group. We collected most of the listed datasets via mtdata and then filtered them. The processed data is also available on Hugging Face.
We trained these models in order to develop a new ensembling algorithm. Agreement-Based Ensembling is an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models. Instead, the algorithm ensures that tokens generated by the ensembled models agree in their surface form. For more information, please check out our code on GitHub or read our paper on arXiv.
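To make "agree in their surface form" concrete, here is a minimal sketch. It is an illustration only, not the implementation from the paper or the GitHub code. It assumes that two partial hypotheses from models with different vocabularies count as agreeing when one detokenized string is a prefix of the other. The function names, the toy tokens, and the SentencePiece-style `▁` word marker are all assumptions made for this example.

```python
def surface(tokens):
    """Join subword tokens into a surface string.

    Assumes SentencePiece-style tokens where '▁' marks a word boundary.
    """
    return "".join(tokens).replace("▁", " ")


def agree(tokens_a, tokens_b):
    """Return True if one surface string is a prefix of the other."""
    a, b = surface(tokens_a), surface(tokens_b)
    return a.startswith(b) or b.startswith(a)


def compatible_next_tokens(prefix_a, candidates_a, prefix_b):
    """Keep the candidates for model A that still agree with model B's hypothesis."""
    return [tok for tok in candidates_a if agree(prefix_a + [tok], prefix_b)]


# Hypothetical example: model B has already produced "Haus" as one token,
# while model A has only produced "H" so far.
prefix_a = ["▁Das", "▁H"]
prefix_b = ["▁Das", "▁Haus"]
print(compatible_next_tokens(prefix_a, ["aus", "und", "a"], prefix_b))
```

In this toy case, "und" is rejected because "Das Hund" and "Das Haus" diverge, while "aus" and "a" both keep the two hypotheses consistent.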
Model Details
Model Description
- Shared, Developed by: Rachel Wicks
- Funded By: Johns Hopkins University
- Model type: Encoder-Decoder (Transformer, Transformer)
- Language(s) (NLP): English, German
- License: Apache 2.0
Model Sources
- Paper: Coming Soon!
How to Get Started with the Model
The code below translates lines read from standard input, one sentence per line; this is the baseline in our paper. The model ID is passed as the first command-line argument.
import sys

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model ID (Hub name or local path) is the first command-line argument.
model_id = sys.argv[1]
device = "cuda" if torch.cuda.is_available() else "cpu"

# torch_dtype is a model option, not a tokenizer option, so it is not passed here.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)
model = model.eval()

# Translate one input line at a time with beam search.
for line in sys.stdin:
    line = line.strip()
    inputs = tokenizer(line, return_tensors="pt").to(device)
    translated_tokens = model.generate(
        **inputs,
        max_length=256,
        num_beams=5,
    )
    print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
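The loop above calls `generate` once per sentence. As a minimal sketch, several lines can be translated per call instead. This is not the baseline from the paper; `batches` and `translate_batch` are hypothetical helper names, and the batch size in the usage comment is an arbitrary choice. Padded batches may produce slightly different outputs than single-sentence decoding.

```python
from itertools import islice


def batches(lines, size):
    """Yield lists of up to `size` stripped lines from any iterable of strings."""
    it = (line.strip() for line in lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def translate_batch(tokenizer, model, lines, num_beams=5, max_length=256):
    """Translate a list of sentences with a single generate() call."""
    # padding=True pads the shorter sentences so the batch forms one tensor.
    inputs = tokenizer(lines, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage, assuming `tokenizer` and `model` are loaded as in the script above:
#
#     for chunk in batches(sys.stdin, 16):
#         print("\n".join(translate_batch(tokenizer, model, chunk)))
```

Because `batches` only strips and groups lines, the output order matches the input order, so the result stays line-aligned with the source file.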
Training Details
Data is available here.
We use sotastream to stream data over stdin.
We use MarianNMT to train.
The config is available in the repo as config.yml.
Evaluation
BLEU on WMT24 is XX.
Hardware
RTX Titan (24GB)
Citation
BibTeX:
[More Information Needed]
APA: