Update README.md

PairRM can be used to (re-)rank a list of candidate outputs, and thus can be used as an LLM evaluator to efficiently assess the quality of LLMs in a local environment.
PairRM can also be used to enhance decoding with `best-of-n sampling` (i.e., reranking N sampled outputs).
Apart from that, one can also use PairRM to further align instruction-tuned LLMs with RLHF methods.

PairRM is part of the LLM-Blender project (ACL 2023). Please see our paper linked above to learn more.

## Installation

- First install `llm-blender`:
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```

## Usage

### Use case 1: Comparing/Ranking output candidates given an instruction

- Ranking a list of candidate responses:
```python
inputs = ["hello!", "I love you!"]
candidates_texts = [["get out!", "hi! nice to meet you!", "bye"],
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of ranks where ranks[i][j] represents the rank of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2],  # "hi! nice to meet you!" ranks 1st, "bye" ranks 2nd, and "get out!" ranks 3rd.
       [1, 3, 2]], # "I love you too!" ranks 1st, and "I hate you!" ranks 3rd.
      dtype=int32)
"""
```
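
Since `rank` exposes a `return_scores` flag, setting it to `True` should return PairRM's per-candidate scores rather than integer ranks; treat the exact output layout as an assumption here, not documented behavior:
```python
# Assumption: return_scores=True yields per-candidate scores (higher = better)
# in the same [inputs x candidates] layout as the ranks above.
scores = blender.rank(inputs, candidates_texts, return_scores=True, batch_size=2)
```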

- Directly comparing two candidate responses:
```python
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# comparison_results[0] --> True
```

- Directly compare two multi-turn conversations, given that the user's query in each turn is fixed and the responses differ:
```python
conv1 = [
    {
        "content": "<user query 1>",
        "role": "USER"
    },
    {
        "content": "<assistant1's response 1>",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "<user query 1>",
        "role": "USER"
    },
    {
        "content": "<assistant2's response 1>",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether
# all the responses in conv1 together are better than those in conv2
```
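
As a concrete, runnable variant of the call above, here is a hypothetical one-turn example; the conversation contents are made up for illustration:
```python
conv1 = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "The capital of France is Paris.", "role": "ASSISTANT"},
]
conv2 = [
    {"content": "What is the capital of France?", "role": "USER"},
    {"content": "I'm not sure, maybe Lyon?", "role": "ASSISTANT"},
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# We'd expect [True] here, i.e., conv1's response is judged better.
```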

### Use case 2: Best-of-n Sampling (Decoding Enhancement)

**Best-of-n sampling**, a.k.a. rejection sampling, is a strategy to enhance response quality by selecting the response that the reward model ranks highest (learn more in [OpenAI WebGPT, section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and the [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)).

Best-of-n sampling is an easy way to improve your LLM with just a few lines of code. An example of applying it to zephyr is as follows.
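
A minimal sketch of such a best-of-n loop, built only from the `blender.rank` API shown above; the zephyr checkpoint name and the sampling settings are illustrative assumptions, not the README's exact code:
```python
# Sketch of best-of-n sampling with PairRM; checkpoint and settings are assumed.
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompt = "Can you tell me a joke about OpenAI?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample n=8 candidate responses from the base model.
outputs = model.generate(
    input_ids, do_sample=True, top_p=0.95, max_new_tokens=128, num_return_sequences=8
)
candidates = [
    tokenizer.decode(o[input_ids.shape[1]:], skip_special_tokens=True) for o in outputs
]

# Rerank the candidates with PairRM and keep the top-ranked one (rank 1 is best).
ranks = blender.rank([prompt], [candidates], return_scores=False, batch_size=1)
best_response = candidates[list(ranks[0]).index(1)]
print(best_response)
```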