Using on your data
The source code is available as a pip-installable Python package.
## Installation
Use of a virtual environment is recommended.
```bash
$ conda create -n selfrank python=3.10
```
To install, activate the environment and run:
```bash
$ conda activate selfrank
$ pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
```
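A quick import check can confirm the installation succeeded. This is a minimal sketch; it assumes the package exposes the same modules used in the Usage section below.
```python
# Sanity check: these imports should succeed after installation
# (module paths as used in the Usage section below).
from algos.iterative import SelfRank
from algos.greedy import SelfRankGreedy
from algos.triplet import rouge, equality

print(SelfRank, SelfRankGreedy, rouge, equality)
```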
## Usage
Start by gathering model inferences for the same set of questions/prompts across all models you want to rank. The ranking method expects a pandas DataFrame with a row for each prompt and a column for each model, e.g.:
| | M1 | M2 | M3 | ... |
|:-----------|:-----|:-----|:-----|:------|
| Q1 | a | a | b | ... |
| Q2 | a | b | b | ... |
| ... | ... | ... | ... | ... |
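As an illustration of this shape, such a table could be assembled from per-model outputs as below. This is a hypothetical sketch; the model names, answers, and file name are placeholders.
```python
import pandas as pd

# Hypothetical per-model answers to the same three prompts.
inferences = {
    "M1": ["a", "a", "b"],
    "M2": ["a", "b", "b"],
    "M3": ["b", "b", "a"],
}
df = pd.DataFrame(inferences, index=["Q1", "Q2", "Q3"])
df.to_csv("inferences.csv", index=False)  # one column per model, one row per prompt
```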
With this data, the self-ranking procedure can be invoked as follows:
```python
import pandas as pd

from algos.iterative import SelfRank        # the full ranking algorithm
from algos.greedy import SelfRankGreedy     # the greedy version
from algos.triplet import rouge, equality   # built-in evaluation functions

# Load the inference table: one row per prompt, one column per model.
df = pd.read_csv("inferences.csv")

models_to_rank = df.columns.tolist()
evaluator = rouge        # see the note on evaluation functions below
true_ranking = None      # pass the known ranking (best to worst) to enable evaluation

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version:
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)

r.fit(df)
print(r.ranking)
```
This outputs the estimated ranking from best to worst, e.g. `['M5', 'M2', 'M1', ...]`. If the true ranking is known, evaluation measures can be computed with `r.measure(metric='rbo')` (rank-biased overlap) or `r.measure(metric='mapk')` (mean average precision).
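For instance, with a known ground-truth order (a sketch reusing the variables defined above; the true ranking shown is hypothetical):
```python
# Hypothetical ground-truth order, best to worst.
true_ranking = ["M5", "M2", "M1", "M4", "M3"]

r = SelfRank(models_to_rank, evaluator, true_ranking)
r.fit(df)
print(r.measure(metric="rbo"))   # rank-biased overlap vs. the true ranking
print(r.measure(metric="mapk"))  # mean average precision vs. the true ranking
```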
We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. `rouge` is recommended for generative tasks like summarization, while `equality` is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.
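Switching evaluators only changes the argument passed to the ranker; a short sketch, reusing the names defined above:
```python
# Generative task (e.g. summarization): soft textual overlap via ROUGE.
r_gen = SelfRank(models_to_rank, rouge, None)

# Multiple-choice / classification task: exact-match comparison.
r_mc = SelfRank(models_to_rank, equality, None)
```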
You can also pass your own evaluation function to the ranker, as long as it has the following signature:
```python
def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c as the judge to compare models a and b.
    df: dataframe containing the inferences of all models (one column per model).
    Returns 1 if a is preferred, 0 if b is preferred.
    """
    # In this example, we count how often a's (or b's) answer matches the
    # judge c, ignoring prompts where a and b produced the same answer.
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~ties)
    b_wins = sum((df[b] == df[c]) & ~ties)
    if a_wins >= b_wins:
        return 1
    else:
        return 0
```
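The custom judge can then be passed to the ranker in place of the built-in evaluators. A sketch, assuming the ranker calls the evaluator with the `(a, b, c, df)` arguments described above:
```python
r = SelfRank(models_to_rank, user_function, None)
r.fit(df)
print(r.ranking)
```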