QA-Evaluation-Metrics πŸ“Š

PyPI version qa-metrics Colab

A fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models.

pip install qa-metrics is all you need!

πŸŽ‰ Latest Updates

  • Version 0.2.19 Released!
    • Paper accepted to EMNLP 2024 Findings! πŸŽ“
    • Enhanced PEDANTS with multi-pipeline support and improved edge case handling
    • Added support for OpenAI GPT-series and Claude Series models (OpenAI version > 1.0)
    • Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via deepinfra
    • Introduced trained tiny-bert for QA evaluation (18MB model size)
    • Added direct Huggingface model download support for TransformerMatcher

πŸš€ Quick Start

Table of Contents

Prerequisites

  • Python >= 3.6
  • openai >= 1.0

Installation

pip install qa-metrics

πŸ’‘ Features

Our package offers six QA evaluation methods with varying strengths:

Method Best For Cost Correlation with Human Judgment
Normalized Exact Match Short-form QA (NQ-OPEN, HotpotQA, etc.) Free Good
PEDANTS Both short & medium-form QA Free Very High
Neural Evaluation Both short & long-form QA Free High
Open Source LLM Evaluation All QA types Free High
Black-box LLM Evaluation All QA types Paid Highest

πŸ“– Documentation

1. Normalized Exact Match

Method: em_match

Parameters

  • reference_answer (list of str): A list of gold (correct) answers to the question
  • candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

  • boolean: True if there are any exact normalized matches between gold and candidate answers
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)

2. F1 Score

Method: f1_score_with_precision_recall

Parameters

  • reference_answer (str): A gold (correct) answer to the question
  • candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

  • dictionary: Contains the F1 score, precision, and recall between a gold and candidate answer

Method: f1_match

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • threshold (float): F1 score threshold for considering a match (default: 0.5)

Returns

  • boolean: True if F1 score exceeds threshold for any gold answer
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)

3. PEDANTS

Method: get_score

Parameters

  • reference_answer (str): A Gold answer
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • float: The similarity score between two strings (0 to 1)

Method: get_highest_score

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: get_scores

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: evaluate

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • boolean: True if candidate answer matches any gold answer

Method: get_question_type

Parameters

  • reference_answer (list of str): List of gold answers
  • question (str): The question being evaluated

Returns

  • list: The type of the question (what, who, when, how, why, which, where)

Method: get_judgement_type

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • list: A list revised rules applicable to judge answer correctness
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)

4. Transformer Neural Evaluation

Method: get_score

Parameters

  • reference_answer (str): A Gold answer
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • float: The similarity score between two strings (0 to 1)

Method: get_highest_score

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: get_scores

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: transformer_match

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • boolean: True if transformer model considers candidate answer equivalent to any gold answer
from qa_metrics.transformerMatcher import TransformerMatcher

### supports `zli12321/roberta-large-qa-evaluator`, `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)

5. LLM Integration

Method: prompt_gpt

Parameters

  • prompt (str): The input prompt text
  • model_engine (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
  • temperature (float): Controls randomness (0-1)
  • max_tokens (int): Maximum tokens in response
from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')

Method: prompt_claude

Parameters

  • prompt (str): The input prompt text
  • model_engine (str): Claude model to use
  • anthropic_version (str): API version
  • max_tokens_to_sample (int): Maximum tokens in response
  • temperature (float): Controls randomness (0-1)
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')

Method: prompt

Parameters

  • message (str): The input message text
  • model_engine (str): Model to use
  • temperature (float): Controls randomness (0-1)
  • max_tokens (int): Maximum tokens in response
from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')

πŸ€— Model Hub

Our fine-tuned models are available on Huggingface:

πŸ“š Resources

πŸ“„ Citation

@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence}, 
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161}, 
}

πŸ“ License

This project is licensed under the MIT License.

πŸ“¬ Contact

For questions or comments, please contact: zli12321@umd.edu

Downloads last month
31
Safetensors
Model size
355M params
Tensor type
F32
Β·
Inference Examples
Inference API (serverless) has been turned off for this model.

Dataset used to train zli12321/answer_equivalence_roberta-large