Daniel van Strien (davanstrien)

AI & ML interests: Machine Learning Librarian

davanstrien's activity

posted an update about 21 hours ago
Can you create domain-specific synthetic datasets in under 20 minutes?

@burtenshaw recently launched the Domain Specific Dataset Project as part of Data is Better Together. Ben created a Space that you can use to define some key perspectives and concepts from a domain. The resulting seed dataset can then be used to generate a synthetic dataset for that domain.

In less than 30 minutes this afternoon, I created a domain-specific dataset focused on data-centric machine learning using these tools: davanstrien/data-centric-ml-sft.

You can create your own domain-specific datasets using this approach. Find the steps to follow here: https://github.com/huggingface/data-is-better-together/blob/main/domain-specific-datasets/README.md
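
For a rough sense of what a seed looks like, here's an illustrative sketch. The field names below are hypothetical, not the project's actual schema, which is defined in the README linked above:

```python
# Illustrative only: the exact seed-dataset schema lives in the
# domain-specific-datasets README linked above. The field names below
# are hypothetical, shown just to give a sense of the kind of seed a
# domain expert provides before synthetic generation.
seed = {
    "domain": "data-centric machine learning",
    "perspectives": [
        "a machine learning librarian curating training data",
        "an ML engineer debugging data quality issues",
    ],
    "topics": [
        "dataset documentation",
        "data filtering and deduplication",
        "annotation and human feedback",
    ],
    "examples": [
        {
            "question": "Why does deduplication matter for pretraining corpora?",
            "answer": "Duplicate documents can cause memorisation and skew evaluation...",
        }
    ],
}
```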
posted an update 2 days ago
As part of the Data is Better Together MPEP project, we are now at the point where some translation efforts have successfully translated 500 highly ranked prompts into a new target language (amazing work from @Rijgersberg et al.!).

Our next step is to use these translated prompts to evaluate the performance of LLMs for non-English languages.

Does LLM-as-a-judge work outside of English?

Ideally, we would be able to leverage LLMs to judge models for non-English languages, since this significantly lowers the barrier to evaluating models (although it doesn't remove this barrier altogether).

What we want to know is:
- does auto/LLM evaluation work in general for a particular language?
- which model(s) work best as a judge?
- do LLMs' judgments of non-English models match human preferences?

We're starting to think about how to approach this. If you have any ideas of possible approaches feel free to comment or join the discussion here: https://github.com/huggingface/data-is-better-together/issues/61

Other ideas...

Could an approach like Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (2404.18796) with the SOTA models for a particular language work? i.e., choose 4 of the best open LLMs for Arabic and use those as the pool of raters rather than relying on one powerful judge LLM?
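
A minimal sketch of what that could look like, assuming the judge models are served behind the Hugging Face Inference API (via a recent huggingface_hub) and that a simple majority vote is an acceptable aggregation. The model IDs are placeholders, not recommendations:

```python
# Minimal sketch of a "panel of juries" for a non-English language.
# Assumes the judge models are reachable via the Hugging Face Inference API
# (model IDs below are placeholders) and aggregates verdicts with a simple
# majority vote.
from collections import Counter
from huggingface_hub import InferenceClient

JUDGES = [
    "judge-model-1",  # placeholder: e.g. a strong open Arabic LLM
    "judge-model-2",
    "judge-model-3",
    "judge-model-4",
]

def panel_verdict(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask each judge which answer is better and return the majority vote."""
    question = (
        "Which answer to the prompt below is better? Reply with exactly 'A' or 'B'.\n\n"
        f"Prompt: {prompt}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    votes = []
    for model_id in JUDGES:
        client = InferenceClient(model=model_id)
        response = client.chat_completion(
            messages=[{"role": "user", "content": question}],
            max_tokens=5,
        )
        reply = response.choices[0].message.content.strip().upper()
        votes.append("A" if reply.startswith("A") else "B")
    return Counter(votes).most_common(1)[0][0]
```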
posted an update 17 days ago
Could more DPO-style preference data be crucial for enhancing open LLMs across different languages?

Leveraging a 7k preference dataset, Argilla (@alvarobartt), Hugging Face (@lewtun) and Kaist AI (@JW17 & @nlee-208) used Kaist AI's recently introduced ORPO technique, ORPO: Monolithic Preference Optimization without Reference Model (2403.07691), with the latest Mistral AI MoE model to create a very high-performing open LLM: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1

Since ORPO doesn't require a separate SFT stage, all that is needed is a strong base model + high-quality DPO-style datasets.
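
Roughly, the TRL recipe looks like this. A minimal sketch only: model and dataset IDs are placeholders, and argument names may differ slightly between TRL versions:

```python
# Minimal ORPO sketch with TRL: no separate SFT stage, just a base model
# plus a DPO-style (prompt/chosen/rejected) preference dataset.
# Model and dataset IDs are placeholders; check the TRL docs for the
# exact arguments in your version.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "your-base-model"  # placeholder: a strong pretrained base model
dataset = load_dataset("your-preference-dataset", split="train")  # placeholder

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

config = ORPOConfig(output_dir="orpo-model", beta=0.1)  # beta weights the odds-ratio loss
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,  # expects "prompt", "chosen", "rejected" columns
    tokenizer=tokenizer,
)
trainer.train()
```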

Currently, there is a significant lack of non-English DPO datasets. Filling this gap could significantly improve open LLMs in various languages.

You can get an overview of the current state of DPO datasets across different languages here: DIBT/preference_data_by_language
posted an update 27 days ago
TIL: since Text Generation Inference supports the Messages API, which is compatible with the OpenAI Chat Completion API, you can trace calls made to Inference Endpoints using Langfuse's OpenAI integration.

A Hugging Face Pro subscription includes access to many models you want to test when developing an app (https://huggingface.co/blog/inference-pro). Using the endpoint and tracing your generations during this development process is an excellent way for GPU-poor people to bootstrap an initial dataset quickly while prototyping.
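
A minimal sketch of what this looks like, assuming your Langfuse credentials are already set via the usual LANGFUSE_* environment variables; the endpoint URL is a placeholder for your own Inference Endpoint:

```python
# Sketch: tracing calls to a TGI-backed endpoint with Langfuse's OpenAI
# drop-in client. Requires LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY to be
# set in the environment; the endpoint URL below is a placeholder.
import os

from langfuse.openai import OpenAI  # drop-in replacement that records traces

client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1/",
    api_key=os.environ["HF_TOKEN"],  # your Hugging Face token
)

completion = client.chat.completions.create(
    model="tgi",  # TGI's Messages API accepts a generic model name
    messages=[{"role": "user", "content": "Write a haiku about open datasets."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```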
replied to their post about 1 month ago

That's a good point! It might be nice to combine the textual tl;dr description with some critical bits of metadata (where it exists).

posted an update about 1 month ago
Would 1-2 sentence tl;dr summaries of datasets on the Hub be useful for you?

For example, for the togethercomputer/RedPajama-Data-1T dataset, would the following summary help give you a quick sense of its content?

> tl;dr: RedPajama is a fully open-source implementation of the LLaMa dataset, consisting of 1.2 trillion tokens from sources like Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange, primarily in English, and is structured with metadata for each text sample.

I've created a dataset with example summaries of the 500 most liked datasets on the Hub: davanstrien/dataset-tldr
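
The basic idea is to generate each tl;dr from the dataset card. Here's a rough sketch of that approach; it is not the exact pipeline behind dataset-tldr, and the summariser model ID is a placeholder:

```python
# Rough sketch of generating a tl;dr from a dataset card. Not the exact
# pipeline behind davanstrien/dataset-tldr; the summariser model ID is a
# placeholder.
from huggingface_hub import DatasetCard, InferenceClient

card = DatasetCard.load("togethercomputer/RedPajama-Data-1T")
client = InferenceClient(model="your-summariser-model")  # placeholder

response = client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": "Write a 1-2 sentence tl;dr of this dataset card:\n\n" + card.text,
        }
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```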

Would these kinds of summaries be helpful?
replied to their post about 1 month ago

Hopefully I'll have something to share for this soon! I still need to do some more annotating!

replied to ZennyKenny's post about 1 month ago
posted an update about 2 months ago
KTO offers an easier way to preference-train LLMs (only 👍👎 ratings are required). As part of #DataIsBetterTogether, I've written a tutorial on creating a preference dataset using Argilla and Spaces.

Using this approach, you can create a dataset that anyone with a Hugging Face account can contribute to 🤯

See an example of the kind of Space you can create following this tutorial here: davanstrien/haiku-preferences

🆕 New tutorial covers:
💬 Generating responses with open models
👥 Collecting human feedback (do you like this model response? Yes/No)
🤖 Preparing a TRL-compatible dataset for training aligned models

Check it out here: https://github.com/huggingface/data-is-better-together/tree/main/kto-preference
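
For reference, the TRL-compatible KTO format boils down to three columns: a prompt, a completion, and a binary label, which is exactly what a Yes/No rating task maps onto. A minimal sketch with illustrative content:

```python
# Minimal sketch of a TRL-compatible KTO dataset: each row is a prompt,
# a model completion, and a binary 👍/👎 label. The rows below are
# illustrative examples only.
from datasets import Dataset

kto_dataset = Dataset.from_dict(
    {
        "prompt": [
            "Write a haiku about autumn leaves.",
            "Write a haiku about autumn leaves.",
        ],
        "completion": [
            "Crimson leaves drifting / settling on the quiet path / autumn exhales slow",
            "Autumn leaves are falling down, falling down, falling down...",
        ],
        "label": [True, False],  # True = 👍, False = 👎
    }
)
```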
replied to dvilasuero's post about 2 months ago

Great work! It's nice to see some open reproduction efforts of SPIN, and it's cool to see that some high-quality data can reduce the amount of data required. cc @teknium , who I know was also excited about SPIN.

I am excited to see what other amazing things the community can collectively build together! 💪

posted an update about 2 months ago
Can we improve the quality of open LLMs for more languages?

Step 1: Evaluate current SOTA.

The Data Is Better Together community has rated more than 10K prompts for quality. We now want to translate a subset of these to help address the language gap in model evals.

The plan is roughly this:

- We started with DIBT/10k_prompts_ranked and took a subset of 500 high-quality prompts
- We're asking the community to translate these prompts into different languages
- We'll evaluate the extent to which we can use AlpacaEval and similar approaches to rate the outputs of models across these different languages
- If it works well, we can more easily evaluate open LLMs across different languages by using a judge LLM to rate the quality of outputs from different models.

You can find more details in our new GitHub repo: https://github.com/huggingface/data-is-better-together (don't forget to give it a ⭐!)
posted an update about 2 months ago
Introducing davanstrien/cosmopedia_chat (v0.0.1), my first experiment using the new NousResearch Genstruct model NousResearch/Genstruct-7B

This dataset uses a subset of HuggingFaceTB/cosmopedia, a synthetic textbook-quality dataset, and Genstruct to generate user/assistant response pairs.
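
Roughly, each pair is generated like this. A sketch only: the prompt template below is paraphrased, so check the NousResearch/Genstruct-7B model card for the exact format:

```python
# Rough sketch of generating an instruction/response pair from a cosmopedia
# passage with Genstruct-7B. The prompt template below is paraphrased from
# the model card and may not match it exactly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Genstruct-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

passage = "..."  # a textbook-style passage sampled from HuggingFaceTB/cosmopedia

prompt = (
    "[[[Title]]] Data-centric machine learning\n"
    f"[[[Content]]] {passage}\n\n"
    "The following is an interaction between a user and an AI assistant "
    "that is related to the above text.\n\n"
    "[[[User]]] "
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens (the user/assistant pair)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```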

My current results are mixed, but I'm excited to see how much work is happening around synthetic data generation in the community. The most crucial next step is working more on data filtering from cosmopedia.

Massive thanks to @euclaise @teknium and the other NousResearch folks for sharing this model ❤️
posted an update about 2 months ago
Today, we're launching an effort to empower the community to build impactful datasets collectively.

Good data is essential for the open-source AI community. Recently, Argilla and Hugging Face launched Data is Better Together. In less than two weeks, over 350 people ranked over 10k prompts.

Today, we're shifting our focus to help support other community efforts to create datasets using Argilla and Hugging Face Spaces. This workflow means anyone with a Hugging Face account can contribute to a dataset in less than a minute. We want to hear from anyone with ideas for creating important datasets as a community. This could include things like:

- Creating preference data for a language that lacks high-quality preference datasets.
- Building evaluation datasets for a new domain.
- Developing a dataset for a new task.

If you would like to get involved, join us in the #data-is-better-together Discord channel: https://discord.com/channels/879548962464493619/1205128865735770142.

You can read more in this blog post from @dvilasuero and me: https://huggingface.co/blog/community-datasets
posted an update 2 months ago
The open-source AI community can build impactful datasets collectively!

Announcing DIBT/10k_prompts_ranked, the first dataset release from Data Is Better Together.

Created in <2 weeks by the community. Includes:

✨ 10,000+ prompt quality ratings
🧑‍💻 Human and synthetic data prompts
🌐 Generated by 300+ contributors

How and why collaborative datasets?

It's no secret that high-quality open data is essential for creating better open models. The open-source community shares hundreds of models, datasets and demos openly every week, but collectively building open datasets has been less explored.

Datasets have a massive role in shaping what models can be created. If we want more high-quality models for all languages, domains and tasks, we need more and better open datasets for all languages, domains and tasks!

To explore how the community could build impactful datasets collectively, Argilla added support for HF authentication for Argilla instances hosted on a Hugging Face Space. Anyone with an HF login could begin contributing to a dataset in <1 minute.

To test this new workflow, we launched a task to rank the quality of prompts (human and synthetically generated).

In less than two weeks, we built a community of over 300 contributors for this dataset 🤗

This dataset became a reality thanks to the dedication of all the individuals who lent their support ❤️ To see the amazing people behind this dataset, visit DIBT/prompt-collective-dashboard

This is just the start for collectively building powerful open datasets!
replied to dvilasuero's post 2 months ago

This is a very popular opinion with me!

replied to dvilasuero's post 2 months ago

Really great work! I'm very pleased to see people explore beyond using GPT-4 for all preference ranking!

posted an update 3 months ago
I think one of the most important ways you can contribute to open-source machine learning in 2024 is through datasets.

On Monday, Argilla and Hugging Face launched #data-is-better-together, an experiment focused on collectively building datasets on the Hub.

For our V1 experiment we're aiming to collectively rank 50k prompts!

In the few days since launch we've had:

❤️ 158 people contribute
🚀 2,796 prompts ranked

🤔 How Can You Contribute?

1. Sign up if you don’t have a Hugging Face account (why not!?)
2. Go to this Argilla Space and sign in: DIBT/prompt-collective
3. Read the guidelines and start rating prompts!

You can also join the #data-is-better-together channel in the Hugging Face Discord 🔗 https://discord.com/channels/879548962464493619/1205128865735770142
replied to dvilasuero's post 3 months ago

We're aiming to judge the text as a full prompt. Some of them are synthetically generated, so I would rank this as a bad prompt since the additional context doesn't seem to make sense as a prompt!

replied to dvilasuero's post 3 months ago

The progress tracking Space is very motivating!

replied to zpn's post 3 months ago

Really nice work, and I really appreciate the depth of the technical report and that the data is available 🤗

replied to BramVanroy's post 3 months ago

Excellent work! Would be great if someone did a big sweep across a bunch of datasets and parameters to produce guidelines based on dataset properties/model size, etc.

replied to Pclanglais's post 3 months ago

I'm really excited to see collections shared like this. Making collections easily accessible unlocks so many interesting use cases often well beyond what was originally imagined by the collection holder.

replied to dctanner's post 4 months ago

Do you know how much this format is currently being used? i.e. what % of datasets adopted this format? Could be a nice community effort to convert some existing datasets with permissive licences into a standard format?

replied to clem's post 4 months ago

I would recommend following @dvilasuero and other folks at Argilla, who are doing some very cool work in this area, particularly via their distilabel library.

I'm also working on a modest intro to synthetic data generation here 🤗

replied to clem's post 4 months ago

I think it will be a big focus of 2024. I also believe there is a lot of scope for creative approaches to generating synthetic data that don't always rely on a big GPU budget (though this will also be important!). As an example of this, I created haiku_dpo, a synthetic dataset for making LLMs better at writing haiku. The development happened locally on a laptop, with <1 hour of Colab GPU time used at the end to generate a larger dataset. I think this topic will be an area where many community members can contribute more through their creativity rather than their GPU budget.