Using on your data
The source code is available as a pip-installable Python package.
Installation
Use of a virtual environment is recommended:
$ conda create -n selfrank python=3.10
To install:
$ conda activate selfrank
$ pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
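As a quick sanity check that the installation worked, the imports used later in this README should succeed (a minimal sketch, assuming the import paths shown in the Usage section below):
import sys

# Minimal check: these are the import paths used in the Usage section
from algos.iterative import SelfRank
from algos.triplet import rouge, equality

print("selfrank imports OK", file=sys.stderr)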
Usage
Start by gathering model inferences for the same questions/prompts across all models you want to rank. The ranking method expects a pandas dataframe with a row for each prompt and a column for each model, e.g.
|     | M1  | M2  | M3  | ... |
|-----|-----|-----|-----|-----|
| Q1  | a   | a   | b   | ... |
| Q2  | a   | b   | b   | ... |
| ... | ... | ... | ... | ... |
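For instance, a dataframe in this shape could be assembled directly from per-model answer lists; the model names and answers below are invented purely for illustration:
import pandas as pd

# Hypothetical answers: one entry per prompt, one column per model
inferences = {
    "M1": ["a", "a", "c"],
    "M2": ["a", "b", "c"],
    "M3": ["b", "b", "c"],
}
df = pd.DataFrame(inferences, index=["Q1", "Q2", "Q3"])
df.to_csv("inferences.csv", index=False)  # same format the snippet below reads back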
With this data, the self-ranking procedure can be invoked as follows:
import pandas as pd
from algos.iterative import SelfRank # The full ranking algorithm
from algos.greedy import SelfRankGreedy # The greedy version
from algos.triplet import rouge, equality
f = "inferences.csv"
df = pd.read_csv(f)
models_to_rank = df.columns.tolist()
evaluator = rouge
true_ranking = None
r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)
r.fit(df)
print(r.ranking)
This should output the estimated ranking (best to worst), e.g. ['M5', 'M2', 'M1', ...]. If the true ranking is known, evaluation measures can be computed with r.measure(metric='rbo') (for rank-biased overlap) or r.measure(metric='mapk') (for mean average precision).
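For example, if a ground-truth ordering happens to be available (say, from a human evaluation), it can be passed to the ranker and the estimate scored against it; the ranking below is made up for illustration:
# Hypothetical ground-truth ordering, best to worst
true_ranking = ["M5", "M2", "M1", "M4", "M3"]

r = SelfRank(models_to_rank, evaluator, true_ranking)
r.fit(df)

print(r.measure(metric="rbo"))   # rank-biased overlap against true_ranking
print(r.measure(metric="mapk"))  # mean average precision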
We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. While rouge is recommended for generative tasks like summarization, equality is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.
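As a sketch, switching to the equality evaluator only changes the function handed to the ranker; the multiple-choice answers below are invented for illustration:
from algos.triplet import equality

# Hypothetical multiple-choice inferences with a discrete answer set
mc_df = pd.DataFrame({
    "M1": ["A", "C", "B"],
    "M2": ["A", "D", "B"],
    "M3": ["B", "C", "B"],
})
r = SelfRank(mc_df.columns.tolist(), equality, None)
r.fit(mc_df)
print(r.ranking)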
You can also pass any arbitrary function to the ranker, as long as it has the following signature:
def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c to evaluate a vs. b.
    df is a dataframe with the inferences of all models.
    Returns 1 if a is preferred, or 0 if b is preferred.
    """
    # In this example, we count how often a's (or b's) answer matches c's,
    # ignoring prompts where a and b agree
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~ties)
    b_wins = sum((df[b] == df[c]) & ~ties)
    if a_wins >= b_wins:
        return 1
    else:
        return 0
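A custom function with this signature can then be plugged in exactly like the built-in evaluators (a minimal sketch, reusing the dataframe df loaded above):
r = SelfRank(models_to_rank, user_function, None)
r.fit(df)
print(r.ranking)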