Metric: bleurt
Update on GitHub


BLEURT a learnt evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018) and then employing another pre-training phrase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better). See the [] file at for more information.

How to load this metric directly with the datasets library:

from datasets import load_metric
metric = load_metric("bleurt")


  title={BLEURT: Learning Robust Metrics for Text Generation},
  author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},