Bram Vanroy PRO

BramVanroy

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Posts 9

Post
🥳 New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow: nothing should prevent you from using the data in non-LM commercial settings), and did the same for models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models under Apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adoption of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through the Azure docs, this allows me to change all recent GPT-4-generated datasets to Apache 2.0! 🥳

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what to do with the older GPT-3.5 datasets. What do you think I should do?
Post
🎈 LLM Benchmarks Update!

**tl;dr: do not depend on benchmark leaderboards to choose your "chatbot" model! (Especially for non-English languages.)**

First of all, I'm discontinuing the Open #Dutch #LLM Leaderboard (https://lnkd.in/eFnsaFR6). It will stay online for now, but I urge the use of the ScandEval leaderboard instead (https://scandeval.com/dutch-nlg/) by @saattrupdan. It contains more tasks, has better reproducibility and statistics (confidence intervals, CI), and comes with a flexible back-end library (scandeval) to run your own benchmarks with. As part of project "Leesplank" (with Michiel Buisman and Maarten Lens-FitzGerald) we recently added GPT-4-1106-preview scores to give the leaderboard a good "target".
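
To illustrate why confidence intervals matter when comparing leaderboard scores, here is a minimal sketch of a percentile bootstrap over per-example results. This is not necessarily how ScandEval computes its intervals; the function name `bootstrap_ci`, the two "models", and their scores are all made up for illustration.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, confidence=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap.

    `scores` holds one value per benchmark example (e.g. 1 = correct, 0 = wrong).
    Hypothetical helper for illustration only, not part of any library.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(scores)
    # Resample the examples with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(scores), lower, upper


# Made-up results for two hypothetical models on a 100-question benchmark.
model_a = [1] * 70 + [0] * 30
model_b = [1] * 65 + [0] * 35

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    mean, lo, hi = bootstrap_ci(scores)
    print(f"{name}: accuracy={mean:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

If the two intervals overlap substantially, a small gap in the headline score says little about which model is actually better, which is one more reason not to read a leaderboard ranking too literally.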

An important note here is that benchmark leaderboards are not the ground truth. Evaluating generative models is especially hard. You run into issues like prompt engineering (and the sensitivity of models to one prompt or another), structured output generation, and - quite simply - "how to automatically evaluate open-ended generation".

💡 Another important but under-discussed facet is the discrepancy between a model's ability to understand vs. generate *in different languages* (so the NLU part of NLG benchmarking). In other words: some of the listed models score really well on, e.g., multiple-choice (MCQ) benchmarks but are not suitable for use as DUTCH chatbots. Interestingly, some of these models seem to understand questions in Dutch and are able to pick the right answer (because they have good knowledge or reasoning skills), but generating fluent and grammatical Dutch is something else entirely! This is perhaps also true for humans: it's easier to sort of grasp the meaning of a sentence in a new language and answer with "Yes" or "No", but answering fluently in that language is much harder! Yet your fluency in producing a language does not necessarily say anything about your knowledge and reasoning skills.

Hopefully we can get a chat arena for Dutch some day - user feedback is the most powerful metric!