davanstrien posted an update Feb 27

The open-source AI community can build impactful datasets collectively!

Announcing DIBT/10k_prompts_ranked, the first dataset release from Data Is Better Together.

Created in <2 weeks by the community. Includes:

✨ 10,000+ prompt quality ratings
🧑‍💻 Human and synthetic data prompts
🌐 Generated by 300+ contributors

How and why collaborative datasets?

It's no secret that high-quality open data is essential for creating better open models. The open-source community shares hundreds of models, datasets, and demos every week, but building open datasets collectively has been less explored.

Datasets play a massive role in shaping what models can be created. If we want more high-quality models for all languages, domains and tasks, we need more and better open datasets for all languages, domains and tasks!

To explore how the community could build impactful datasets collectively, Argilla added support for HF authentication for Argilla instances hosted on a Hugging Face Space. Anyone with an HF login could begin contributing to a dataset in <1 minute.

To test this new workflow, we launched a task to rank the quality of prompts (human and synthetically generated).
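
As a rough illustration of what can be done with prompt quality ratings once they are collected, here is a minimal sketch that averages ratings per prompt and sorts prompts from best to worst. Everything in it is an assumption made for illustration: the `average_ratings` helper, the 1-5 scale, and the toy prompts are hypothetical and are not taken from the dataset or from Argilla.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(ratings):
    """Group (prompt, rating) pairs by prompt.

    Returns a list of (prompt, mean_rating, num_ratings) tuples,
    sorted from highest to lowest mean rating.
    """
    by_prompt = defaultdict(list)
    for prompt, rating in ratings:
        by_prompt[prompt].append(rating)
    ranked = [
        (prompt, mean(values), len(values))
        for prompt, values in by_prompt.items()
    ]
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical toy ratings on an assumed 1-5 scale (not real dataset rows).
toy_ratings = [
    ("Explain recursion with an example.", 5),
    ("Explain recursion with an example.", 4),
    ("write something", 1),
    ("write something", 2),
]

ranked = average_ratings(toy_ratings)
for prompt, avg, count in ranked:
    print(f"{avg:.1f} ({count} ratings): {prompt}")
```

A plain mean is only one way to combine ratings; with several raters per prompt, a median or an agreement measure may be a better fit.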

In less than two weeks, we built a community of over 300 contributors for this dataset 🤗

This dataset became a reality thanks to the dedication of all the individuals who lent their support ❤️ To see the amazing people behind this dataset, visit DIBT/prompt-collective-dashboard

This is just the start for collectively building powerful open datasets!

Also worth mentioning: a v2 of the dataset is on the way, and anyone can contribute at https://huggingface.co/spaces/DIBT/prompt-collective by signing in with their Hugging Face Hub account! 🤗

P.S. Thanks, everyone, for the great community effort so far!