MinghaoYang
committed on
Update README.md
README.md
CHANGED
@@ -1,154 +1,154 @@
---
library_name: transformers
base_model: meta-llama/Llama-3.1-70B-Instruct
datasets:
- infly/INF-ORM-Preference-Magnitude-80K
pipeline_tag: text-classification
---

# INF Outcome Reward Model
## Introduction

[**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built largely on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K) dataset.

**Note: Training details are coming soon!**

## RewardBench Leaderboard

We evaluated our model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) locally, using the [official test script](https://github.com/allenai/reward-bench). As of December 2024, INF-ORM-Llama3.1-70B ranks first on the RewardBench leaderboard.

| Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
| :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
| 1 | **infly/INF-ORM-Llama3.1-70B** | Custom Classifier | 95.2 | 96.9 | 91.0 | 93.8 | 99.1 |
| 2 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
| 3 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
| 4 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
| 5 | SF-Foundation/TextEval-Llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
| 6 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
| 7 | Skywork/Skywork-Critic-Llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
| 8 | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
| 9 | nicolinho/QRM-Llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |
| 10 | LxzGordon/URM-LLaMa-3.1-8B | Seq. Classifier | 92.9 | 95.5 | 88.2 | 91.1 | 97.0 |
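
The Score column in the table appears consistent with an unweighted mean of the four category columns (Chat, Chat Hard, Safety, Reasoning). This is an observation about the numbers above, not a statement of how RewardBench officially aggregates results. A minimal sketch that checks this assumption for a few rows, with a hypothetical helper `mean_score`:

```python
# Rows copied from the table above: reported score, then
# (Chat, Chat Hard, Safety, Reasoning).
rows = {
    "infly/INF-ORM-Llama3.1-70B": (95.2, (96.9, 91.0, 93.8, 99.1)),
    "Skywork/Skywork-Reward-Gemma-2-27B-v0.2": (94.3, (96.1, 89.9, 93.0, 98.1)),
    "nvidia/Llama-3.1-Nemotron-70B-Reward": (94.1, (97.5, 85.7, 95.1, 98.1)),
    "Skywork/Skywork-Reward-Gemma-2-27B": (93.8, (95.8, 91.4, 91.9, 96.1)),
}


def mean_score(categories):
    """Unweighted mean of the category scores (assumed aggregation)."""
    return sum(categories) / len(categories)


def is_consistent(reported, categories, tol=0.06):
    """True if the reported score matches the mean up to one-decimal rounding."""
    return abs(mean_score(categories) - reported) <= tol


for name, (reported, categories) in rows.items():
    print(f"{name}: reported={reported}, mean={mean_score(categories):.3f}")
```

The tolerance of 0.06 allows for the one-decimal rounding used in the table.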

## Demo Code

The example below shows how to load INF-ORM-Llama3.1-70B and obtain the reward scores of two conversations.

```python
from typing import List, Optional

import torch
import torch.nn as nn
from transformers import LlamaPreTrainedModel, LlamaModel, PreTrainedTokenizerFast
from transformers.modeling_outputs import SequenceClassifierOutputWithPast


class INFORMForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        # Two-layer MLP head that maps each hidden state to `num_labels` scores
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, self.num_labels),
        )
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
        )
        hidden_states = transformer_outputs[0]
        # Per-token scores with shape (batch_size, seq_len, num_labels)
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # Index of the token before the first pad token. If no pad token is
                # found, the modulo wraps -1 to the last position (used instead of
                # reverse indexing for ONNX compatibility).
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        # Keep only the score at the selected token of each sequence
        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        # No loss is computed in this inference-only example
        loss = None
        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )


# Load model
model_name = "infly/INF-ORM-Llama3.1-70B"
orm = INFORMForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # Requires the flash-attn package; remove this argument to use the default attention
    attn_implementation="flash_attention_2",
    num_labels=1,
)

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)

# Two conversations with the same prompt; the first response is correct, the second is not
conv1 = [
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?",
        "role": "user",
    },
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among himself and his 4 friends (a total of 5 people). 18 ÷ 5 = 3.6 oranges. Each person gets 3.6 oranges.",
        "role": "assistant",
    },
]
conv2 = [
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?",
        "role": "user",
    },
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among his 4 friends (a total of 4 people). 18 ÷ 4 = 4.5 oranges. Each person gets 4.5 oranges.",
        "role": "assistant",
    },
]
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to("cuda")
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to("cuda")

# Inference
with torch.no_grad():
    score1 = orm(conv1_tokenized).logits[0][0].item()
    score2 = orm(conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")

# Output:

# Score for response 1: 4.96875
# Score for response 2: 2.890625

```
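
The pooling step in `forward` selects the score at the token just before the first pad token, and falls back to the last position when no pad token is present. The sketch below reproduces that index arithmetic on plain Python lists so it can be checked without loading the model. The helper name `last_token_index` and the toy token ids are illustrative assumptions, not part of the model code.

```python
def last_token_index(input_ids, pad_token_id):
    """Mirror of the `sequence_lengths` computation for a single sequence.

    Equivalent to: (argmax(input_ids == pad_token_id) - 1) % len(input_ids)
    """
    is_pad = [int(token == pad_token_id) for token in input_ids]
    # argmax returns the first position holding the maximum value, so this is
    # the first pad position, or 0 if the sequence contains no pad token.
    first_pad = is_pad.index(max(is_pad))
    # With no pad token (or a pad at position 0), -1 wraps to the last position.
    return (first_pad - 1) % len(input_ids)


PAD = 0  # hypothetical pad token id, for illustration only

print(last_token_index([5, 6, 7, PAD, PAD], PAD))  # right-padded sequence
print(last_token_index([5, 6, 7], PAD))            # no padding
print(last_token_index([PAD, PAD, 5, 6], PAD))     # left-padded sequence
```

In all three cases the selected index points at the final real token. A sequence with pad tokens on both sides would select the last position, which is then a pad token, so inputs are assumed to be padded on one side only.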

## Declaration and License Agreement

### Declaration

### License Agreement

## Contact
If you have any questions, please feel free to contact us at <23210720070@m.fudan.edu.cn>.
## Citation
