from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    # task0 = Task("trickme", "acc", "Accuracy")
    task1 = Task("trickme", "avg_confidence", "Buzz Confidence")


NUM_FEWSHOT = 0  # Change with your few shot
# ---------------------------------------------------


# Your leaderboard name
TITLE = """

Grounded QA leaderboard

""" # What does your leaderboard evaluate? INTRODUCTION_TEXT = """ Build an open-domain QA system that can answer any question posed by humans! For more: https://sites.google.com/view/qanta/home """ # Which evaluations are you running? how can people reproduce what you have? LLM_BENCHMARKS_TEXT = """ ## QA variants ### Generative QA This type of QA system aims to generate an answer to a given question directly. #### Input (1) `question` string ``` E.g. qa_pipe(question) ``` #### Output Return in a JSON format: (1) `guess` string, (2) `confidence` score which should be a float number representing the probability (0-1) of your guess. ``` E.g. {'guess': 'Apple', 'confidence': 0.02} ``` Reminder: Feel free to check the tutorial provided to see how you could calculate the probability of the generated tokens! ### Extractive QA This type of QA system aims to extract an answer span from a context passage for a given question. #### Input (1) `question` string, and (2) `context` string ``` E.g. qa_pipe(question=question, context=context) ``` #### Output Return in a JSON format: (1) `guess` string, (2) `confidence` score which should be a float number representing the probability (0-1) of your guess. ``` E.g. {'guess': 'Apple', 'confidence': 0.02} ``` Reminder: If you are playing around with an extractive QA model already, HF QA models output the `score` already, so you only need to wrap the `score` to `confidence`. #### Customized retriever If you didn't submit anything for retriever, we will feed the `context` string with our pre-loaded context. However, we do provide the option for you to customize your retriever model with the dataset you wish to do retrieval. Please check the tutorial example for more details. ## Evaluation Metric In our Grounded QA task, we evaluate the QA model's reliability of their performance by measuring their calibration estimates where we consider the confidence of guess confidence values. To understand this concept better, we adopt the concept of "buzz" in Trivia Quiz, where buzz happens whenever the player is confident enough to predict the correct guess in the middle of a question. This also applies to our measurement of model calibration as we focus whether the model prediction probability matches its prediction accuracy. Our evaluation metric, `Average Expected Buzz`, quantifies the expected buzz confidence estimation. ## FAQ What if my system type is not specified here or not supported yet? - Please have a private post to instructors so we could check how we could adapt the leaderboard for your purpose. Thanks! I don't understand where I could start to build a QA system for submission. - Please check our submission tutorials. From there, you could fine-tune or do anything above the base models. I want to use API-based QA systems for submission, like GPT4. What should I do? - We don't support API-based models now but you could train your model with the GPT cache we provided: https://github.com/Pinafore/nlp-hw/tree/master/models. """ EVALUATION_QUEUE_TEXT = """ **Step 1: Make sure it could work locally** After you have a QA system uploaded to HuggingFace (with license specified), please check with the following example code to see if your pipe could return the guess and confidence score in a **JSON** format. 
"""

EVALUATION_QUEUE_TEXT = """
**Step 1: Make sure it works locally**

After you have a QA system uploaded to HuggingFace (with the license specified), run the following example code to check that your pipe returns the guess and confidence score in a **JSON** format.
```
from transformers import pipeline

qa_pipe = pipeline(model="...", trust_remote_code=True)

# If it is a Generative QA pipeline
qa_pipe("Where is UMD?")

# If it is an Extractive QA pipeline
qa_pipe(question="Where is UMD?", context="UMD is in Maryland.")
```

**Step 2: Fill in the submission form**

(1) Fill in the `QA model name`.

(2) Fill in the `Revision commit`: if you leave it empty, it defaults to `main`.

(3) Fill in the `Model type`.

(4) `Precision` defaults to `float16`. Update it as needed.

(5) If you have a trained retriever and want to submit an Extractive QA system, please also fill in the `Retrieved dataset name` and `Retriever model`.

Here is a tutorial on how you can build pipe wrappers for submissions: [Colab](https://colab.research.google.com/drive/1bCt2870SdY6tI4uE3JPG8_3nLmNJXX6_?usp=sharing)
"""

CITATION_BUTTON_LABEL = "Copy the following link to check more details"
CITATION_BUTTON_TEXT = r"""
https://sites.google.com/view/qanta/home
"""