About DROP results within the `lm-eval-harness`

#13
by alvarobartt - opened

Hi there! I'm curious about the huge gap w.r.t. Mistral on the DROP benchmark of the lm-eval-harness. Did you use the same revision of EleutherAI/lm-eval-harness? Also, did you run any other evaluations to understand why it excels so much at DROP compared to other SFT + DPO fine-tunes, e.g. Zephyr? Is there any data contamination coming from the dataset used for training?

A bunch of questions 😅 Feel free to answer if you've already looked into the DROP results; the gap compared to other models seems huge and would be nice to investigate. Maybe the data just has better quality!
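For anyone wanting to compare numbers, here's a minimal sketch of how a DROP run could be reproduced with the harness. It assumes the 0.4.x `lm_eval` Python API and the built-in `drop` task, and that the evaluation was 3-shot; the exact harness commit used for the original scores would still need to be pinned before comparing.

```python
# Minimal sketch: a 3-shot DROP run with lm-eval-harness (0.4.x API assumed).
# Pin the harness to the same commit as the original evaluation before
# treating score differences as meaningful.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Intel/neural-chat-7b-v3-1,dtype=bfloat16",
    tasks=["drop"],
    num_fewshot=3,   # assumed shot count, matching the leaderboard-style setup
    batch_size=8,
)

# DROP reports exact-match and F1; both are worth comparing across revisions.
print(results["results"]["drop"])
```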

I'm only chiming in so I'll get notified if someone answers this question, because I'm also interested in why this LLM has a much higher DROP score than other SFT + DPO LLMs like Zephyr.

Hi, the datasets used are listed in the model card. We found that the DROP metric decreases during training, so early stopping is needed.
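To illustrate the early-stopping idea, here is a hypothetical sketch that evaluates DROP on each saved checkpoint and keeps the best one. The checkpoint paths, the subsampling `limit`, and the metric-key lookup are all placeholders, not the actual training setup.

```python
# Hypothetical early stopping on DROP: score each checkpoint with
# lm-eval-harness and keep the best one. Paths are placeholders.
import lm_eval

checkpoints = ["./ckpt-500", "./ckpt-1000", "./ckpt-1500"]
best_score, best_ckpt = float("-inf"), None

for ckpt in checkpoints:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={ckpt}",
        tasks=["drop"],
        num_fewshot=3,
        limit=500,  # subsample for a cheaper in-training signal
    )
    drop_metrics = out["results"]["drop"]
    # metric key naming varies across harness versions (e.g. "f1" vs "f1,none")
    score = next(
        v for k, v in drop_metrics.items()
        if k.startswith("f1") and isinstance(v, (int, float))
    )
    if score > best_score:
        best_score, best_ckpt = score, ckpt

print(f"Best checkpoint by DROP F1: {best_ckpt} ({best_score:.3f})")
```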

True, I've submitted a PR at https://huggingface.co/Intel/neural-chat-7b-v3-1/discussions/15 to enrich the metadata in the model card's README.md 🤗

Hmm, did you evaluate the model during training using the lm-eval-harness? Was DROP part of your evaluation set? Could you please elaborate a bit more on that? I think it's a really interesting topic!
