
Will you consider releasing a public dataset?

#1
by Jackdiy - opened

Here's the thing: I've noticed that from Mega to Manticore, and now to Hippogriff, you've been using the Pygmalion dataset. The open-source community has probably also realized that achieving better, more open-ended role-playing doesn't require aligning with GPT-like datasets such as Alpaca and Vicuna; instead, we should lean towards Pygmalion.

If you were to release datasets like Pygmalion and hellaswag (updated with 30K+ rows), it would encourage the open-source community to train better Pygmalion-style models on bases like Falcon, Guanaco, RedPajama, and BLOOM.

Open Access AI Collective org

Unfortunately, I'm bound by oath not to release the Pygmalion dataset. The hellaswag dataset I'm using is here: https://huggingface.co/datasets/winglian/evals/blob/main/hellaswag/hellaswag.jsonl

> releasing datasets like Pygmalion

From what I've heard from one of the people involved with the project, the reason they don't release it is that it contains a lot of data that might be upsetting to some people. If you actually intend to use it for training and have trained models in the past, you can probably reach out to one of the members for a copy.

Thank you both for your patient explanation and sharing. I will try to contact the Pygmalion team.

winglian changed discussion status to closed
