
T5 Large for Text Aggregation

Model description

This is a T5 Large model fine-tuned for crowdsourced text aggregation. It takes multiple performers' responses and produces a single aggregated response. The approach was first introduced during the VLDB 2021 Crowd Science Challenge and was originally implemented in the second-place competitor's GitHub repository. The paper describing this model was presented at the 2nd Crowd Science Workshop.

How to use

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Noisy responses from several performers, separated by " | "
text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
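
In practice the responses usually arrive as a list rather than a ready-made string. The following is a minimal sketch of a helper that builds the model input from such a list. The function name `build_aggregation_input` is hypothetical, and the `" | "` separator is an assumption inferred from the example input above, not a documented constant.

```python
from typing import Iterable

# Assumption: the separator is inferred from the example input in this card.
SEPARATOR = " | "


def build_aggregation_input(responses: Iterable[str], separator: str = SEPARATOR) -> str:
    """Join several performers' responses into a single model input string."""
    # Drop empty or whitespace-only responses and trim the rest.
    cleaned = [r.strip() for r in responses if r and r.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty response is required")
    return separator.join(cleaned)


text = build_aggregation_input(["samplee text", "sampl text", "sample textt"])
print(text)  # samplee text | sampl text | sample textt
```

The resulting string can be passed to `tokenizer.encode` in place of the hard-coded example.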

Training data

Pretrained weights were taken from the original T5 Large model by Google. For more details on the T5 architecture and training procedure, see https://arxiv.org/abs/1910.10683

The model was fine-tuned on the train-clean, dev-clean, and dev-other parts of the CrowdSpeech dataset, which was introduced in our paper.

Training procedure

The model was fine-tuned for eight epochs directly following the HuggingFace summarization training example.
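
Framing aggregation as summarization means each training example pairs the concatenated responses (the source text) with a single target string. The sketch below shows one way such pairs might be serialized to JSON Lines. It rests on assumptions that this card does not confirm: that the target is a ground-truth transcription, that responses are joined with `" | "` as in the usage example, and that the columns are called `text` and `summary`. The function name `to_jsonl_records` is hypothetical.

```python
import json
from typing import Iterable, List, Sequence, Tuple


def to_jsonl_records(
    examples: Iterable[Tuple[Sequence[str], str]],
    text_column: str = "text",        # assumption: illustrative column name
    summary_column: str = "summary",  # assumption: illustrative column name
) -> List[str]:
    """Serialize (responses, target) pairs as JSON Lines strings."""
    lines = []
    for responses, target in examples:
        record = {
            # Assumption: responses are joined with " | ", as in the usage example.
            text_column: " | ".join(responses),
            summary_column: target,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


records = to_jsonl_records([(["samplee text", "sampl text", "sample textt"], "sample text")])
print(records[0])
```

Writing these lines to a file, one per line, gives a dataset in the shape that summarization-style training scripts typically expect, with the column names passed as arguments.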

Eval results

Dataset      Split       WER
CrowdSpeech  test-clean  4.99
CrowdSpeech  test-other  10.61
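
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the model output and the reference, divided by the number of reference words. The following is a minimal sketch of that definition for a single sentence pair. The function name `word_error_rate` is hypothetical, and the code returns a fraction; whether the figures above are percentages, and how scores were averaged over the test set, are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("sample text", "sample textt"))  # 0.5
```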

BibTeX entry and citation info

@inproceedings{Pletenev:21,
  author    = {Pletenev, Sergey},
  title     = {{Noisy Text Sequences Aggregation as a Summarization Subtask}},
  year      = {2021},
  booktitle = {Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale},
  pages     = {15--20},
  address   = {Copenhagen, Denmark},
  issn      = {1613-0073},
  url       = {http://ceur-ws.org/Vol-2932/short2.pdf},
  language  = {english},
}
@misc{pavlichenko2021vox,
  title         = {Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription},
  author        = {Nikita Pavlichenko and Ivan Stelmakh and Dmitry Ustalov},
  year          = {2021},
  eprint        = {2107.01091},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD}
}