Daniel van Strien PRO

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Organizations

Posts 13

view post
Post
2077
Introducing CosmoChat, a multiturn chat dataset based on Cosmopedia that I'm working on in the open on the Hub.

🎯 Goals:
💬 Create multi-turn chats seeded from Cosmopedia
🎓 Customize questions for different audience levels
🔍 Evaluate the model's ability to elaborate and clarify
🤓 (I want to learn more about creating valuable synthetic datasets, and I learn best by doing stuff rather than reading stuff).

Cosmochat is created using the excellent distilabel library.

🔗 Explore the current version of the dataset: davanstrien/cosmochat
📝 Read more: https://huggingface.co/blog/davanstrien/cosmochat
view post
Post
2178
Only 14 languages have DPO preference style datasets on the Hugging Face Hub ( DIBT/preference_data_by_language) Let's improve that! How?

The Cohere For AI Aya dataset CohereForAI/aya_dataset has human-annotated prompt-completion pairs in 71 languages. We can use this to create DPO datasets for more languages!

Using Aya's prompt/response pairs as a starting point we can use an LLM to generate an additional response to each prompt. We then use an LLM Judge to rank each response.

✅ In some/many languages, human responses may be better than LLM ones but we may want to check that assumption for some languages.
🚀 We use Argilla's distilabel library to push data to Argilla for validation. This also allows us to determine if an LLM judge is effective for different languages.

As an example of what this pipeline produces:
- DIBT/aya_dutch_dpo a DPO style dataset for Dutch using Llama 3 as a generator/judge LM.
- An annotation Space that anyone with a HF account can contribute to: https://dibt-demo-argilla-space.hf.space/dataset/924ef8a8-a447-4563-8806-0e2a668a5314/annotation-mode?page=1&status=pending

As part of Data is Better Together we want to build more DPO datasets. Join us here: https://github.com/huggingface/data-is-better-together#4-dpoorpo-datasets-for-more-languages 🤗