Add question-answering pipeline tag and GitHub repo link

#1
by nielsr (HF staff) - opened
Files changed (1)
  1. README.md +121 -13
README.md CHANGED
@@ -1,16 +1,18 @@
  ---
- license: apache-2.0
  language:
  - en
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2.5-Math-7B-Instruct
- library_name: transformers
  ---
  ## SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
  
- > [SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights](link)
  > [Ling Yang\*](https://yangling0818.github.io/), [Zhaochen Yu\*](https://github.com/BitCodingWalkin), [Tianjun Zhang](https://tianjunz.github.io/), [Minkai Xu](https://minkaixu.com/), [Joseph E. Gonzalez](https://people.eecs.berkeley.edu/~jegonzal/), [Bin Cui](https://cuibinpku.github.io/), [Shuicheng Yan](https://yanshuicheng.info/)
  >
  > Peking University, Skywork AI, UC Berkeley, Stanford University
@@ -103,25 +105,131 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)
  ```
  
- ## Performance
  
- We evaluate our SuperCorrect-7B on two widely used English math benchmarks, GSM8K and MATH. All evaluations are tested with our evaluation method, which is zero-shot hierarchical-thought-based prompting.
  
- ![image](table.png)
  
  ## Citation
  
  ```bibtex
- @article{yang2024supercorrect,
- title={SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights}
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Xu, Minkai and Gonzalez, Joseph E and Cui, Bin and Yan, Shuicheng},
- journal={arXiv preprint arXiv:2410.09008},
- year={2024}
  }
  @article{yang2024buffer,
  title={Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Cao, Shiyi and Xu, Minkai and Zhang, Wentao and Gonzalez, Joseph E and Cui, Bin},
- journal={arXiv preprint arXiv:2406.04271},
  year={2024}
  }
  ```
 
  ---
+ base_model:
+ - Qwen/Qwen2.5-Math-7B-Instruct
  language:
  - en
+ library_name: transformers
+ license: apache-2.0
  metrics:
  - accuracy
+ pipeline_tag: question-answering
  ---
+
  ## SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
  
+ > [SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights](https://arxiv.org/abs/2410.09008)
  > [Ling Yang\*](https://yangling0818.github.io/), [Zhaochen Yu\*](https://github.com/BitCodingWalkin), [Tianjun Zhang](https://tianjunz.github.io/), [Minkai Xu](https://minkaixu.com/), [Joseph E. Gonzalez](https://people.eecs.berkeley.edu/~jegonzal/), [Bin Cui](https://cuibinpku.github.io/), [Shuicheng Yan](https://yanshuicheng.info/)
  >
  > Peking University, Skywork AI, UC Berkeley, Stanford University
 
  print(response)
  ```
  
+ #### 🔥 vLLM
+
+ We also provide inference code with [vLLM](https://github.com/vllm-project/vllm), a fast and easy-to-use library for LLM inference and serving.
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ model_name = 'BitStarWalkin/SuperCorrect-7B'
+ hierarchical_prompt = "Solve the following math problem in a step-by-step XML format, each step should be enclosed within tags like <Step1></Step1>. For each step enclosed within the tags, determine if this step is challenging and tricky, if so, add detailed explanation and analysis enclosed within <Key> </Key> in this step, as helpful annotations to help you thinking and remind yourself how to conduct reasoning correctly. After all the reasoning steps, summarize the common solution and reasoning steps to help you and your classmates who are not good at math generalize to similar problems within <Generalized></Generalized>. Finally present the final answer within <Answer> </Answer>."
+ prompts = [
+     "For what positive value of $t$ is $|{-4+ti}| = 6$?",
+     "Find the distance between the foci of the ellipse \\[9x^2 + \\frac{y^2}{9} = 99.\\]",
+     "The fourth term of a geometric series is $24$ and the eleventh term is $3072$. What is the common ratio?"
+ ]
+ # Prepend the hierarchical thought prompt to every problem.
+ combined_prompts = [hierarchical_prompt + '\n' + prompt for prompt in prompts]
+ sampling_params = SamplingParams(temperature=0, top_p=1, max_tokens=1024)
+ llm = LLM(model=model_name, trust_remote_code=True)
+ outputs = llm.generate(combined_prompts, sampling_params)
+
+ # Print the outputs.
+ for output in outputs:
+     prompt = output.prompt
+     generated_text = output.outputs[0].text
+     print(f"Prompt: {prompt}")
+     print(f"Generated text: {generated_text}")
+ ```
+
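+ Because the hierarchical prompt constrains the model to tagged output, the final answer can be recovered mechanically. The snippet below is a minimal post-processing sketch (our own illustration; `extract_answer` is a hypothetical helper, not part of the official repo):
+
+ ```python
+ import re
+
+ def extract_answer(generated_text):
+     """Return the contents of the last <Answer>...</Answer> block, or None."""
+     matches = re.findall(r"<Answer>(.*?)</Answer>", generated_text, flags=re.DOTALL)
+     return matches[-1].strip() if matches else None
+
+ # Example on a tagged completion:
+ print(extract_answer("<Step1>Compute |-4+ti| = sqrt(16+t^2).</Step1><Answer>2\\sqrt{5}</Answer>"))
+ ```
+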
+ ### 1. Our evaluation
+
+ We provide two evaluation methods: an **online version**, which uses GPT-4o as a judge for fairer and more robust grading, and an **offline version**, which verifies the final answers programmatically. Both aim at a more accurate and stricter evaluation, since the final answers in the MATH dataset are not always numeric or a single expression. The online version is available now; the offline version will be released soon.
+
+ ```bash
+ API_KEY="Input your key here"
+ MODEL_NAME_OR_PATH="BitStarWalkin/SuperCorrect-7B"
+ export CUDA_VISIBLE_DEVICES="0"
+ bash evaluation.sh "$API_KEY" "$MODEL_NAME_OR_PATH"
+ ```
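+
+ As a rough illustration of the offline, programmatic approach (our own sketch, not the authors' script; it assumes `sympy` plus the `antlr4-python3-runtime` package needed by its LaTeX parser), two MATH-style answers can be compared symbolically rather than as strings:
+
+ ```python
+ from sympy import simplify
+ from sympy.parsing.latex import parse_latex  # requires antlr4-python3-runtime
+
+ def answers_match(predicted, reference):
+     """True if two LaTeX answers are symbolically equal."""
+     try:
+         return simplify(parse_latex(predicted) - parse_latex(reference)) == 0
+     except Exception:
+         # Fall back to exact string comparison if parsing fails.
+         return predicted.strip() == reference.strip()
+
+ print(answers_match(r"2\sqrt{5}", r"\sqrt{20}"))  # True: the two forms are equal
+ ```
+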
+ ### 2. Evaluation with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
+
+ ```bash
+ lm_eval --model hf \
+     --model_args pretrained="Qwen2.5-Math-7B-Instruct" \
+     --tasks minerva_math \
+     --log_samples \
+     --output_path Qwen2.5-Math-7B-Instruct-lm-evaluation \
+     --batch_size 12
+
+ lm_eval --model hf \
+     --model_args pretrained="SuperCorrect-7B" \
+     --tasks minerva_math \
+     --log_samples \
+     --output_path SuperCorrect-7B-lm-evaluation \
+     --batch_size 12
+ ```
+
+ Evaluation results produced by lm-evaluation-harness:
+
+ | Qwen2.5-Math-7B-Instruct | Version | Filter | n-shot | Metric | | Value | | Stderr |
+ | ----------------------------------- | ------: | ------ | -----: | ----------- | ---- | -----: | ---- | -----: |
+ | minerva_math | 1 | none | 4 | exact_match | ↑ | 0.5034 | ± | 0.0064 |
+ | - minerva_math_algebra | 1 | none | 4 | exact_match | ↑ | 0.7009 | ± | 0.0133 |
+ | - minerva_math_counting_and_prob | 1 | none | 4 | exact_match | ↑ | 0.5232 | ± | 0.0230 |
+ | - minerva_math_geometry | 1 | none | 4 | exact_match | ↑ | 0.4635 | ± | 0.0228 |
+ | - minerva_math_intermediate_algebra | 1 | none | 4 | exact_match | ↑ | 0.2237 | ± | 0.0139 |
+ | - minerva_math_num_theory | 1 | none | 4 | exact_match | ↑ | 0.4667 | ± | 0.0215 |
+ | - minerva_math_prealgebra | 1 | none | 4 | exact_match | ↑ | 0.7394 | ± | 0.0149 |
+ | - minerva_math_precalc | 1 | none | 4 | exact_match | ↑ | 0.2143 | ± | 0.0176 |
+
+ | SuperCorrect-7B | Version | Filter | n-shot | Metric | | Value | | Stderr |
+ | ------------------------------------ | ------: | ------ | -----: | ----------- | ---- | -----: | ---- | -----: |
+ | minerva_math | 1 | none | 4 | exact_match | ↑ | 0.6188 (**+0.1154**) | ± | 0.0065 |
+ | - minerva_math_algebra | 1 | none | 4 | exact_match | ↑ | 0.7936 (**+0.0927**) | ± | 0.0118 |
+ | - minerva_math_counting_and_prob | 1 | none | 4 | exact_match | ↑ | 0.5802 (**+0.0570**) | ± | 0.0227 |
+ | - minerva_math_geometry | 1 | none | 4 | exact_match | ↑ | 0.5261 (**+0.0626**) | ± | 0.0228 |
+ | - minerva_math_intermediate_algebra | 1 | none | 4 | exact_match | ↑ | 0.4385 (**+0.2148**) | ± | 0.0165 |
+ | - minerva_math_num_theory | 1 | none | 4 | exact_match | ↑ | 0.6167 (**+0.1500**) | ± | 0.0209 |
+ | - minerva_math_prealgebra | 1 | none | 4 | exact_match | ↑ | 0.7715 (**+0.0321**) | ± | 0.0142 |
+ | - minerva_math_precalc | 1 | none | 4 | exact_match | ↑ | 0.4103 (**+0.1960**) | ± | 0.0211 |
+
+ | Summary | Version | Filter | n-shot | Metric | | Value | | Stderr |
+ | ------------------------ | ------: | ------ | -----: | ----------- | ---- | -----: | ---- | -----: |
+ | Qwen2.5-Math-7B-Instruct | 1 | none | 4 | exact_match | ↑ | 0.5034 | ± | 0.0064 |
+ | SuperCorrect-7B | 1 | none | 4 | exact_match | ↑ | 0.6188 (**+0.1154**) | ± | 0.0065 |
+
+ ### 3. Evaluation with [Qwen2.5-Math-Evaluation](https://github.com/QwenLM/Qwen2.5-Math)
+
+ ```bash
+ # Set PROMPT_TYPE to a prompt type supported by the Qwen2.5-Math evaluation scripts.
+ export CUDA_VISIBLE_DEVICES="0"
+ MODEL_NAME_OR_PATH="Qwen/Qwen2.5-Math-7B-Instruct"
+ bash sh/eval.sh "$PROMPT_TYPE" "$MODEL_NAME_OR_PATH"
+
+ export CUDA_VISIBLE_DEVICES="0"
+ MODEL_NAME_OR_PATH="BitStarWalkin/SuperCorrect-7B"
+ bash sh/eval.sh "$PROMPT_TYPE" "$MODEL_NAME_OR_PATH"
+ ```
+
+ Evaluation results produced by Qwen2.5-Math-Eval:
+
+ | Model | MATH Accuracy (%) |
+ | ------------------------ | ----------------- |
+ | Qwen2.5-Math-7B-Instruct | 80.6 |
+ | **SuperCorrect-7B** | **82.1** |
+ | **Our Improvement** | **+1.5** |
+
+ ## Code
+
+ The model and the evaluation results above are based on the code at https://github.com/YangLing0818/SuperCorrect-llm.
  ## Citation
  
  ```bibtex
+ @inproceedings{yang2025supercorrect,
+ title={SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Xu, Minkai and Gonzalez, Joseph E and Cui, Bin and Yan, Shuicheng},
+ booktitle={International Conference on Learning Representations},
+ year={2025}
  }
+
  @article{yang2024buffer,
  title={Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models},
  author={Yang, Ling and Yu, Zhaochen and Zhang, Tianjun and Cao, Shiyi and Xu, Minkai and Zhang, Wentao and Gonzalez, Joseph E and Cui, Bin},
+ journal={Advances in Neural Information Processing Systems},
  year={2024}
  }
  ```