Issues faced in reproducing the paper's experiments

#8
by Chensmile - opened

Very interesting work! I am currently trying to reproduce the experimental results from your paper. However, I have encountered two issues:

  1. The generated text tends to have severe repetition.
  2. The model's accuracy on MATH problems (GSM8K dataset) is significantly lower than the reported results in the paper.

I would like to ask whether this discrepancy might be due to the checkpoint used or specific hyperparameter settings (e.g., temperature). Would it be possible to share the exact hyperparameter configurations used in the paper? Thanks!

Tom Goldstein's Lab at University of Maryland, College Park org

Hi, are your issues with MATH or with GSM8k? Some more details on GSM8k can be found here: https://huggingface.co/tomg-group-umd/huginn-0125/discussions/7#67b59e08b24bf87803b701b6

Regarding repetition, this has not been a big problem for me, are you using the model as a text completion model, or with the chat template?

Thank you for your response and reminder! I realized that I was using text completion instead of chat templating, which resulted in a lot of repetition. I will try using the lm-eval harness for evaluation to see if I can reproduce the results successfully. Thanks again!

Tom Goldstein's Lab at University of Maryland, College Park org

Sure! let me know how it goes, or if there are followup questions.

Sign up or log in to comment