---
title: distinct
datasets:
- None
tags:
- evaluate
- measurement
description: "TODO: add a description here"
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---

# Measurement Card for distinct

## Measurement Description

The Distinct metric measures the diversity of language. Two versions of the score are provided. Expectation-Adjusted-Distinct (EAD), the default, removes the bias that the original Distinct score shows on longer sentences (see the figure below). Distinct is the original version.

*[Figure: image not preserved; original alt text "drawing"]*
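
To make the two scores concrete, here is a minimal sketch rather than the module's actual implementation. It assumes whitespace tokenization, n-grams that do not cross sentence boundaries, Distinct-n defined as distinct n-grams divided by the total number of tokens (as in Li et al. 2016), and the EAD formula as we read it in the EAD paper: distinct tokens N divided by the expected number of distinct tokens V[1 - ((V-1)/V)^C] for C tokens drawn from a vocabulary of size V. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(sentences, n=1):
    """Distinct n-grams divided by the total token count (whitespace split)."""
    tokenized = [s.split() for s in sentences]
    total_tokens = sum(len(t) for t in tokenized)
    if total_tokens == 0:
        return 0.0
    unique = {g for t in tokenized for g in ngrams(t, n)}
    return len(unique) / total_tokens


def expectation_adjusted_distinct(sentences, vocab_size):
    """Distinct tokens divided by the number expected for a uniform vocabulary."""
    tokens = [tok for s in sentences for tok in s.split()]
    if not tokens:
        return 0.0
    # Expected distinct tokens after len(tokens) uniform draws from vocab_size types.
    expected = vocab_size * (1 - ((vocab_size - 1) / vocab_size) ** len(tokens))
    return len(set(tokens)) / expected


predictions = [
    "Hi.",
    "I am sorry to hear that",
    "I don't know",
    "Do you know who that person is?",
]
d1 = distinct_n(predictions, 1)
ead = expectation_adjusted_distinct(predictions, vocab_size=50257)
print(d1, ead)
```

Because the expected count grows more slowly than the token count, dividing by it compensates for the natural drop in Distinct as texts get longer. With a large `vocab_size` and short inputs, as here, the two scores are almost identical (roughly 0.8235 and 0.8237).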

Expectation-Adjusted-Distinct requires `vocab_size`. See the [EAD paper](https://aclanthology.org/2022.acl-short.86) (Liu and Sabour et al. 2022) for the motivation, and follow the rules of thumb in [the ipynb](https://github.com/lsy641/Expectation-Adjusted-Distinct/blob/main/EAD.ipynb) to determine `vocab_size`.

This metric measures the diversity of a group of sentences. It can evaluate the diversity of the responses generated for a test set (i.e., corpus-level diversity) or the diversity of a group of responses sampled for one context (i.e., utterance-level diversity). The [original paper](https://aclanthology.org/N16-1014) (Li et al. 2016) used it at the corpus level, while some use it at the utterance level. However, we don't recommend calculating Distinct on a small group, as it is sensitive to the length and number of sentences.

## How to Use

```python
>>> import evaluate
>>> my_new_module = evaluate.load("lsy641/distinct")
>>> # Option 1: pass the vocabulary size directly
>>> results = my_new_module.compute(predictions=["Hi.", "I am sorry to hear that", "I don't know", "Do you know who that person is?"], vocab_size=50257)
>>> print(results)
{'Expectation-Adjusted-Distinct': 0.8236605104867569, 'Distinct-1': 0.8235294117647058, 'Distinct-2': 0.9411764705882353, 'Distinct-3': 0.9411764705882353}

>>> # Option 2: let the module derive the vocabulary size from a dataset
>>> dataset = ["This is my friend jack", "I'm sorry to hear that", "But you know I am the one who always support you", "Welcome to our family", "Hi.", "I am sorry to hear that", "I don't know", "Do you know who that person is?"]
>>> results = my_new_module.compute(predictions=["But you know I am the one who always support you", "Hi.", "I am sorry to hear that", "I don't know", "I'm sorry to hear that"], dataForVocabCal=dataset)
>>> print(results)
{'Expectation-Adjusted-Distinct': 0.9928137111900845, 'Distinct-1': 0.6538461538461539, 'Distinct-2': 0.8076923076923077, 'Distinct-3': 0.8846153846153846}
```

### Inputs

- **predictions** *(list of strings): list of sentences whose diversity is measured. Each prediction should be a string.*
- **mode** *(string): 'Expectation-Adjusted-Distinct' or 'Distinct'. If 'Expectation-Adjusted-Distinct', the scores for both modes are returned. The default value is 'Expectation-Adjusted-Distinct'.*
- **vocab_size** *(int): vocabulary size used by 'Expectation-Adjusted-Distinct'. For this mode, at least one of vocab_size and dataForVocabCal must be provided. Default value is None.*
- **dataForVocabCal** *(list of strings): data from which vocab_size is calculated for 'Expectation-Adjusted-Distinct'. Typically, it is the list of sentences that make up the task dataset. For this mode, at least one of vocab_size and dataForVocabCal must be provided. Default value is None.*
- **tokenizer** *(string or tokenizer class): tokenizer for splitting sentences into words. Default value is Tokenizer13a(). Note that Tokenizer13a doesn't exclude punctuation marks. An NLTK tokenizer is also available.*

### Output Values

- Expectation-Adjusted-Distinct: normally in the range 0-1, but it can exceed 1. See the properties of the formula in the [Expectation-Adjusted-Distinct paper](https://aclanthology.org/2022.acl-short.86) (Liu and Sabour et al. 2022)
- Distinct-1: Range 0-1
- Distinct-2: Range 0-1
- Distinct-3: Range 0-1

#### Values from Popular Papers

The [Expectation-Adjusted-Distinct paper](https://aclanthology.org/2022.acl-short.86) (Liu and Sabour et al. 2022) compares the Expectation-Adjusted-Distinct scores of ten different methods with their original Distinct scores. EAD correlates better with human judgments, raising the correlation from 0.56 to 0.65.
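
The role of `dataForVocabCal` and the possibility of a score above 1 can be pictured with a short sketch. This is an illustration under stated assumptions, not the module's code: the vocabulary size is taken to be the number of unique whitespace-separated tokens in the dataset, an explicit `vocab_size` is assumed to take precedence when both inputs are given, and the helper names are hypothetical.

```python
def estimate_vocab_size(dataset):
    """Count unique whitespace-separated tokens across all sentences."""
    return len({tok for sentence in dataset for tok in sentence.split()})


def resolve_vocab_size(vocab_size=None, data_for_vocab=None):
    """Pick the vocabulary size; an explicit value wins (an assumption)."""
    if vocab_size is not None:
        return vocab_size
    if data_for_vocab is not None:
        return estimate_vocab_size(data_for_vocab)
    raise ValueError("Provide either vocab_size or data to estimate it from.")


dataset = [
    "This is my friend jack",
    "I'm sorry to hear that",
    "But you know I am the one who always support you",
    "Welcome to our family",
    "Hi.",
    "I am sorry to hear that",
    "I don't know",
    "Do you know who that person is?",
]
vocab = resolve_vocab_size(data_for_vocab=dataset)
print(vocab)

# With a small vocabulary, a text whose tokens are all distinct scores above 1,
# because the expected number of distinct tokens is always below the token count.
tokens = "Welcome to our family".split()
expected = vocab * (1 - ((vocab - 1) / vocab) ** len(tokens))
score = len(set(tokens)) / expected
print(score)
```

A vocabulary estimated from a small dataset is far smaller than a tokenizer-sized value such as 50257, which lowers the expected number of distinct tokens and therefore raises the adjusted score.
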
### Examples

Example of calculating Expectation-Adjusted-Distinct, given either vocab_size or the data for calculating vocab_size. Besides Expectation-Adjusted-Distinct, this mode also returns Distinct-1, 2, and 3.

```python
>>> my_new_module = evaluate.load("lsy641/distinct")
>>> results = my_new_module.compute(predictions=["Hi.", "I'm sorry to hear that", "I don't know"], vocab_size=50257)
>>> print(results)

>>> dataset = ["This is my friend jack", "I'm sorry to hear that", "But you know I am the one who always support you", "Welcome to our family"]
>>> results = my_new_module.compute(predictions=["Hi.", "I'm sorry to hear that", "I don't know"], dataForVocabCal=dataset)
>>> print(results)
```

Example of calculating the original Distinct. This returns Distinct-1, 2, and 3.

```python
>>> my_new_module = evaluate.load("lsy641/distinct")
>>> results = my_new_module.compute(predictions=["Hi.", "I'm sorry to hear that", "I don't know"], mode="Distinct")
>>> print(results)
```

## Limitations and Bias

EAD (Expectation-Adjusted-Distinct) rests on an idealized assumption that does not take the distribution of language into account, so we discuss this problem further and suggest a practical way to apply EAD in real situations. Before applying EAD, it is necessary to explore the relationship between score and text length (Figure 1) and to check the performance of EAD on the training data. To our knowledge, if the training data comes from large-scale open-domain sources such as OpenSubtitles and Reddit, EAD keeps its value across different lengths. Hence, it can be used directly to evaluate models trained on these datasets. However, our experiments on datasets such as Twitter showed a decline in EAD on longer texts. This is probably because of the input length limits on these platforms (e.g. 280 characters on Twitter), which push users to convey as much information as possible in a short text.
In these situations, it is unfair to use EAD to evaluate methods that tend to generate lengthier texts.

## Citation

```bibtex
@inproceedings{liu-etal-2022-rethinking,
    title = "Rethinking and Refining the Distinct Metric",
    author = "Liu, Siyang and Sabour, Sahand and Zheng, Yinhe and Ke, Pei and Zhu, Xiaoyan and Huang, Minlie",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    year = "2022",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.86",
    doi = "10.18653/v1/2022.acl-short.86",
}
```

```bibtex
@inproceedings{li-etal-2016-diversity,
    title = "A Diversity-Promoting Objective Function for Neural Conversation Models",
    author = "Li, Jiwei and Galley, Michel and Brockett, Chris and Gao, Jianfeng and Dolan, Bill",
    booktitle = "Proceedings of the 2016 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies",
    year = "2016",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N16-1014",
    doi = "10.18653/v1/N16-1014",
}
```

## Further References

TODO