Data Repositories and Size of Dataset

#3
by AIdinner - opened

Hi, could you please specify which repositories the Orca and Dolphin data you used come from, and the total size of your unpublished dataset? Thanks a lot!

OK, the sources are:

- garage-bAInd/Open-Platypus
- ehartford/dolphin: flan1m-alpaca-uncensored-deduped.jsonl
- Open-Orca/OpenOrca: 1M-GPT4-Augmented.parquet

The total amount is around 50K.

Thank you for your detailed information!
However, I have another question: you said in your model card that you selected 5% of the Dolphin data and 7% of OpenOrca, which would be about 120K in total. How could the final amount be around 50K? I am not sure whether I misunderstood your comment or your model card.
Looking forward to your reply!

Oh, I see. That's a mistake; I forgot to revise the README. I first used 5% Dolphin and 7% OpenOrca mixed with Platypus for training, but found inferior performance (we suspect duplicate or near-duplicate data still existed), so I filtered the remaining data again, and in the end only ~1% of the Dolphin and ~1% of the OpenOrca data remained. I will update the README file later; sorry for that.
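For anyone curious, the kind of overlap filtering described above could be sketched roughly like this. This is a minimal illustration, not the author's actual pipeline: the function names are hypothetical, and exact-match hashing on the instruction text is an assumption (a real pipeline might use fuzzy or embedding-based similarity instead).

```python
import hashlib
import random


def sample_fraction(records, fraction, seed=0):
    """Hypothetical helper: randomly keep roughly `fraction` of records."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def dedup_against(records, reference, key="instruction"):
    """Drop records whose `key` text already appears (by exact hash) in `reference`."""
    seen = {hashlib.sha256(r[key].encode()).hexdigest() for r in reference}
    return [
        r for r in records
        if hashlib.sha256(r[key].encode()).hexdigest() not in seen
    ]


# Toy rows standing in for Platypus and Dolphin samples.
platypus = [{"instruction": "a"}, {"instruction": "b"}]
dolphin = [{"instruction": "a"}, {"instruction": "c"}, {"instruction": "d"}]

# Remove Dolphin rows whose instruction duplicates one already in Platypus.
filtered = dedup_against(dolphin, platypus)
print(len(filtered))  # 2: the row duplicating "a" is dropped
```

Applying a first `sample_fraction` pass and then deduplicating against the base set would explain how an initial ~5-7% sample shrinks further after filtering.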

fangloveskari changed discussion status to closed
