# Using on your data

Source code is available as a pip-installable Python package.

## Installation

Use of a virtual environment is recommended.

```bash
$ conda create -n selfrank python=3.10
```

To install:

```bash
$ conda activate selfrank
$ pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
```

## Usage

Start by gathering model inferences for the same question/prompt across all models you want to rank. The ranking method expects a pandas dataframe with a row for each prompt and a column for each model, i.e.

|     | M1  | M2  | M3  | ... |
|:----|:----|:----|:----|:----|
| Q1  | a   | a   | b   | ... |
| Q2  | a   | b   | b   | ... |
| ... | ... | ... | ... | ... |

With this data, the self-ranking procedure can be invoked as follows:

```python
import pandas as pd

from algos.iterative import SelfRank        # the full ranking algorithm
from algos.greedy import SelfRankGreedy     # the greedy version
from algos.triplet import rouge, equality

f = "inferences.csv"
df = pd.read_csv(f)

models_to_rank = df.columns.tolist()
evaluator = rouge
true_ranking = None

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)

r.fit(df)
print(r.ranking)
```

This should output the estimated ranking (best to worst): `['M5', 'M2', 'M1', ...]`. If true rankings are known, evaluation measures can be computed with `r.measure(metric='rbo')` (rank-biased overlap) or `r.measure(metric='mapk')` (mean average precision).

We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. While `rouge` is recommended for generative tasks like summarization, `equality` is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes. You can also pass any arbitrary function to the ranker, as long as it follows this signature (a usage sketch follows the example):

```python
def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """Use model c to evaluate a vs. b.

    df: a dataframe with inferences of all models
    returns 1 if a is preferred, or 0 if b is preferred
    """
    # In this example, we count the number of times a/b agrees with c,
    # ignoring rows where a and b produce the same answer
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~ties)
    b_wins = sum((df[b] == df[c]) & ~ties)

    if a_wins >= b_wins:
        return 1
    else:
        return 0
```
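As a minimal sketch of tying these pieces together, the custom evaluator above can be passed to the ranker in place of `rouge`, and, if a ground-truth ordering is available, the agreement metrics mentioned earlier can be computed against it. The `true_ranking` list below is hypothetical, and we assume here that the ranker accepts an ordered list of model names (best to worst); adjust to match your data.

```python
import pandas as pd

from algos.iterative import SelfRank

df = pd.read_csv("inferences.csv")
models_to_rank = df.columns.tolist()

# Hypothetical ground-truth ordering (best to worst), used only for evaluation;
# leave as None if no reference ranking is available.
true_ranking = ["M5", "M2", "M1"]

# Use the custom evaluator defined above instead of the built-in `rouge`
r = SelfRank(models_to_rank, user_function, true_ranking)
r.fit(df)

print(r.ranking)                 # estimated ranking, best to worst
print(r.measure(metric="rbo"))   # rank-biased overlap vs. true_ranking
print(r.measure(metric="mapk"))  # mean average precision vs. true_ranking
```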