How is the 50k high quality dataset created?

#2
by Yhyu13 - opened

Could you elaborate on how the 50k high-quality documentation-answering dataset was created?

Were the Q&A pairs bootstrapped from handcrafted questions of the kind commonly asked in DocsGPT, with the answers then generated by a model such as GPT-4 or Claude 2? Or were the answers written by humans?

I am a bit astonished by the 50k figure. It is usually hard to find that much domain-specific data for LoRA fine-tuning.
