Description
METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.
How to load this metric directly with the
datasets
library:
from datasets import load_metric metric = load_metric("meteor")
Citation
@inproceedings{banarjee2005, title = {{METEOR}: An Automatic Metric for {MT} Evaluation with Improved Correlation with Human Judgments}, author = {Banerjee, Satanjeev and Lavie, Alon}, booktitle = {Proceedings of the {ACL} Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization}, month = jun, year = {2005}, address = {Ann Arbor, Michigan}, publisher = {Association for Computational Linguistics}, url = {https://www.aclweb.org/anthology/W05-0909}, pages = {65--72}, }