Disclosure of instruction-tuning / DPO data needed

#2
by markding - opened

Model card mentions "curriculum learning with 400,000 high-quality samples that presented a greater level of difficulty. Subsequently, we incorporated human feedback through the Dynamic Policy Optimization (DPO), culminating in the development of Nanbeige2-8B-Chat."

What data was used here? This is key to understanding performance and safety guardrails.

Thank you for your work!

Nanbeige LLM Lab org

A portion of our SFT data originates from open-source datasets such as ShareGPT, Airoboros-3.2, and Platypus, while the remainder is constructed using methods like self-instruct and evol-instruct.
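For readers unfamiliar with these methods, the sketch below shows the general shape of a self-instruct-style loop: a few seed instructions are sampled, shown to a model, and the model is asked to write a new one, which is filtered and added back to the pool. This is a simplified illustration, not our actual pipeline; `generate` is a placeholder for whatever model endpoint is used.

```python
import random

def generate(prompt: str) -> str:
    # Placeholder for a call to an instruction-following model; swap in a
    # real chat-completion client here.
    return "Summarize the following paragraph in one sentence."

def self_instruct(seed_instructions: list[str], rounds: int = 3) -> list[str]:
    pool = list(seed_instructions)
    for _ in range(rounds):
        # Show a few existing instructions as in-context examples.
        examples = "\n".join(random.sample(pool, k=min(4, len(pool))))
        prompt = (
            "Here are some example instructions:\n"
            f"{examples}\n"
            "Write one new, different instruction:"
        )
        candidate = generate(prompt).strip()
        # Crude de-duplication stands in for the real filtering step.
        if candidate and candidate not in pool:
            pool.append(candidate)
    return pool
```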

To build the curriculum learning data, we trained a judge model to select high-quality and challenging questions from the SFT data. We ensured the quality of the answers through multiple rounds of verification with GPT-4.
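Roughly, the selection step looks like the following sketch. `judge_score` stands in for the trained judge model (here replaced by a toy heuristic), and the thresholds are illustrative values, not the ones we actually used.

```python
from dataclasses import dataclass

@dataclass
class SFTSample:
    question: str
    answer: str

def judge_score(sample: SFTSample) -> tuple[float, float]:
    # Placeholder heuristic; in the real pipeline a trained judge model
    # produces (quality, difficulty) scores in [0, 1].
    quality = min(1.0, len(sample.answer) / 500)
    difficulty = min(1.0, len(sample.question) / 200)
    return quality, difficulty

def select_curriculum_data(samples, quality_min=0.8, difficulty_min=0.7):
    # Keep only samples that are both high quality and challenging.
    scored = [(s, *judge_score(s)) for s in samples]
    kept = [(s, d) for s, q, d in scored
            if q >= quality_min and d >= difficulty_min]
    # Order by difficulty so training can ramp up from easier to harder items.
    kept.sort(key=lambda item: item[1])
    return [s for s, _ in kept]
```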

In the DPO phase, we collected questions in a similar way to the curriculum learning stage, with additional manual verification. Building on this, we generated answers using our own models, open-source models, and models like GPT-4, and then collected human preferences to create our DPO dataset.
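As an illustration of the resulting data format (not a disclosure of our internal schema), each preference record can be thought of as a (prompt, chosen, rejected) triple, the convention also used by libraries such as TRL's DPOTrainer:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower

def build_pair(prompt: str, answers: dict[str, str],
               preferred: str, dispreferred: str) -> PreferencePair:
    """answers maps a model name (e.g. 'ours', 'gpt-4') to its response;
    preferred/dispreferred are the model names picked by the annotator."""
    return PreferencePair(
        prompt=prompt,
        chosen=answers[preferred],
        rejected=answers[dispreferred],
    )
```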

Hope this answers your question.

It does. It would be wonderful to have this as part of the documentation. For the evidence-based judgements of the LLM openness leaderboard (https://opening-up-chatgpt.github.io/) we try to point only to official documentation, to ensure high-quality information. This may help Nanbeige go up a notch or two on the openness ladder.

ZekeWang changed discussion status to closed
