446 138 467

Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

about 1 month ago

• 11

Extracting Insights from Model Cards Using Open Large Language Models

Nov 27, 2023

Creating open machine learning datasets? Share them on the Hugging Face Hub!

Oct 30, 2023

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 3

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

Organizations

Posts 9

Post

Could more DPO-style preference data be crucial for enhancing open LLMs across different languages?

Leveraging a 7k preference dataset Argilla ( @alvarobartt ), Hugging Face ( @lewtun ) and Kaist AI ( @JW17 & @nlee-208 )
utilized Kaist AI's recently introduced ORPO technique ORPO: Monolithic Preference Optimization without Reference Model (2403.07691) with the latest MistralAI MOE model to create a very high-performing open LLM: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1

Since ORPO doesn't require a separate SFT stage, all that is needed is a strong base model + high-quality DPO-style datasets.

Currently, there is a significant lack of non-English DPO datasets. Filling this gap could significantly improve open LLMs in various languages.

You can get an overview of the current state of DPO datasets across different languages here: DIBT/preference_data_by_language

Post

TIL: since Text Generation Inference supports Messages API, which is compatible with the OpenAI Chat Completion API, you can trace calls made to inference endpoints using Langfuse's OpenAI API integration.

A Hugging Face Pro subscription includes access to many models you want to test when developing an app (https://huggingface.co/blog/inference-pro). Using the endpoint and tracing your generations during this development process is an excellent way for GPU-poor people to bootstrap an initial dataset quickly while prototyping.

View all posts