
Using on your data

The source code is available as a pip-installable Python package.

Installation

Use of a virtual environment is recommended.

$ conda create -n selfrank python=3.10 

To install the package:

$ conda activate selfrank
$ pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git

Usage

Start by gathering model inferences for the same set of questions/prompts across all models you want to rank. The ranking method expects a pandas dataframe with a row for each prompt and a column for each model, i.e.

     M1   M2   M3   ...
Q1   a    a    b    ...
Q2   a    b    b    ...
...  ...  ...  ...  ...
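One way to produce such a table is to collect each model's answers in prompt order and write them out as CSV with one column per model. The sketch below uses only the standard library; the helper name `write_inferences`, the model names and the answers are made up for illustration.

```python
import csv
import io

# Hypothetical answers: one list per model, aligned by prompt order.
answers = {
    "M1": ["a", "a", "c"],
    "M2": ["a", "b", "c"],
    "M3": ["b", "b", "c"],
}

def write_inferences(answers, fh):
    """Write one row per prompt and one column per model to an open file."""
    models = list(answers)
    n_prompts = len(answers[models[0]])
    # Every model must have answered every prompt.
    if any(len(answers[m]) != n_prompts for m in models):
        raise ValueError("all models need an answer for every prompt")
    writer = csv.writer(fh)
    writer.writerow(models)  # header: model names
    for i in range(n_prompts):
        writer.writerow([answers[m][i] for m in models])

# An in-memory buffer is used here so the result can be inspected directly.
buffer = io.StringIO()
write_inferences(answers, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead, pass `open("inferences.csv", "w", newline="")` in place of the buffer. No prompt-id column is written, so every column in the resulting file is a model.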

With this data, the self ranking procedure can be invoked as follows:

import pandas as pd
from algos.iterative import SelfRank # The full ranking algorithm
from algos.greedy import SelfRankGreedy # The greedy version
from algos.triplet import rouge, equality

f = "inferences.csv"
df = pd.read_csv(f)

models_to_rank = df.columns.tolist()
evaluator = rouge 
true_ranking = None

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)
r.fit(df)
print(r.ranking)

This should output the estimated ranking (best to worst), e.g. ['M5', 'M2', 'M1', ...]. If the true ranking is known, evaluation measures can be computed with r.measure(metric='rbo') for rank-biased overlap or r.measure(metric='mapk') for mean average precision at k.
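For intuition about what rank-biased overlap measures, here is a minimal sketch of its truncated (non-extrapolated) form. It is an illustration only and not necessarily what `r.measure(metric='rbo')` computes internally; the function name `rbo_truncated` and the default persistence `p=0.9` are assumptions.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings (best item first).

    At each depth d, the fraction of items shared by the two top-d
    prefixes is weighted by p**(d-1), so smaller p emphasises the top.
    Assumes neither ranking contains duplicates.
    """
    if not 0 < p < 1:
        raise ValueError("p must be in (0, 1)")
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the top-d prefixes
        score += p ** (d - 1) * agreement
    return (1 - p) * score

estimated = ["M5", "M2", "M1"]
truth = ["M5", "M1", "M2"]
print(rbo_truncated(estimated, truth))
```

Because the sum is cut off at the list length, two identical rankings of length k score 1 - p**k rather than 1 in this truncated form, so scores are only comparable between rankings of the same length.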

We provide implementations of a few evaluation functions, i.e. the functions the judge model uses to evaluate the contestant models. While rouge is recommended for generative tasks like summarization, equality is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.

You can also pass any arbitrary function to the ranker, as long as it has the following signature:

def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    use model c to evaluate a vs. b
    df: is a dataframe with inferences of all models
    returns 1 if a is preferred or 0 if b is preferred
    """

    # In this example, we count the number of times a/b gives the same answer as c
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~(ties))
    b_wins = sum((df[b] == df[c]) & ~(ties))

    if a_wins >= b_wins:
        return 1
    else:
        return 0