Bram Vanroy

BramVanroy

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

BramVanroy's activity

replied to their post 18 days ago

Nice! In my experience, preference tuning with the UltraFeedback datasets does not really change benchmark scores (and sometimes even makes them worse), but it does seem to improve the real-world user experience when chatting with the model.

I'm also not sure whether ORPO on UltraFeedback alone is better than SFT on UltraChat + DPO on UltraFeedback, especially if you're also trying to do language adaptation. That, or first continuing to pretrain the model and then doing ORPO.

replied to their post 19 days ago

While the "rules" of OpenAI do get frustrating from time to time, I do not blame others who do not follow the same path as I do. If I am asked why my licenses are different from someone else's I will answer according to what I've written in the post above (the rules suck and our vague, I understand why people do what they do and I do what I do because of other reasons). But I definitely do not want to go around and point fingers pre-emptively in hopes that people just use my models. Our community for Dutch is already quite small so I rather just lift each other up and build on each others work through friendly "competition" than to compete in bad faith.

So I think that for my future models, I'll just make use of UltraChat + UltraFeedback, which should be cleared for training Apache 2.0 models because they were created with Azure. This may negatively impact the models' performance (especially for code, because it does not include the Stack Overflow set), but I hope the impact is limited.

replied to their post 20 days ago

What do you mean by compliance in this context? I'm not sure how I can market being non-commercial as a good thing 😅

replied to their post 20 days ago

Cool! Looking forward to what you'll build with this!

posted an update 20 days ago
🥳 New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), just like models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models on apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adaptation of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through Azure docs, this allows me to change all recent GPT-4-generated datasets to Apache 2.0! 🥳

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what I'll do with the older GPT-3.5 datasets. What do you think I should do?
posted an update about 1 month ago
🎈 LLM Benchmarks Update!

**tl;dr: do not depend on benchmark leaderboards to choose your "chatbot" model! (Especially for non-English languages.)**

First of all, I'm discontinuing the Open #Dutch #LLM Leaderboard (https://lnkd.in/eFnsaFR6). It will stay online for now, but I urge you to use the ScandEval leaderboard (https://scandeval.com/dutch-nlg/) by @saattrupdan instead. It contains more tasks, has better reproducibility and statistics (confidence intervals), and comes with a flexible back-end library (scandeval) to run your own benchmarks with. As part of the "Leesplank" project (with Michiel Buisman and Maarten Lens-FitzGerald) we recently added GPT-4-1106-preview scores to provide a good "target" for the leaderboard.

An important note here is that benchmark leaderboards are not a golden truth. Especially evaluating generative models is hard. You run into issues like prompt engineering (and sensitivity of models to one or other prompt), structured output generation, and - quite simply - "how to automatically evaluate open-ended generation".

💡 Another important but under-discussed facet is the discrepancy between models' capability of understanding vs. generating *in different languages* (so the NLU part of NLG benchmarking). In other words: some of the listed models score really well on, e.g., MCQ benchmarks but are not suitable to use as DUTCH chatbots. Interestingly, some of these models seem to understand questions in Dutch and are able to pick the right answer (because they have good knowledge or reasoning skills), but generating fluent and grammatical Dutch is something else entirely! This is perhaps also true for humans: it's easier to sort-of grasp the meaning of a new language and answer with "Yes" or "No", but answering fluently in the language is much harder! Yet, your language production fluency does not necessarily say anything about your knowledge and reasoning skills.

Hopefully we can get a chat arena for Dutch some day - user feedback is the most powerful metric!
replied to their post about 2 months ago

Understandable. I'm especially attracted to the broad vocabulary, which can be of use for language adaptation.

replied to their post about 2 months ago

What kind of weird results? In terms of loss, or really qualitative output?

posted an update about 2 months ago
Does anyone have experience with finetuning Gemma? Even the 2B variant feels more memory-heavy than Mistral 7B. I know that its vocabulary is much larger (250k), but I'm a bit surprised that the max batch size I can fit on an A100 80GB is only 2, whereas I could fit 4 with Mistral 7B, even though Gemma is much smaller except for the embedding layer. Both runs were using FlashAttention, the same sequence length, and the same DeepSpeed ZeRO-3 settings. Oh, and yes, I'm using the most recent hotfix of transformers that solves a memory issue with Gemma and others.

Any prior experience that you can share, or suggestions to improve throughput?
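
For context on where the extra memory goes, here is a rough back-of-the-envelope sketch (my own illustration, not from the original post). The vocabulary and hidden sizes are assumptions based on the public model configs, and the batch size and sequence length are hypothetical; the point is that both the embedding matrix and, more importantly, the logits tensor scale with the vocabulary size.

# Back-of-the-envelope sketch (not a measurement) of why a 256k vocabulary is memory-heavy.
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a (tied) token-embedding matrix."""
    return vocab_size * hidden_size

def logits_bytes(batch_size: int, seq_len: int, vocab_size: int, bytes_per_el: int = 2) -> int:
    """Size of one (batch, seq, vocab) logits tensor in bf16; the loss often upcasts to fp32."""
    return batch_size * seq_len * vocab_size * bytes_per_el

gemma = {"vocab": 256_000, "hidden": 2048}   # Gemma 2B (assumed config values)
mistral = {"vocab": 32_000, "hidden": 4096}  # Mistral 7B (assumed config values)

for name, cfg in [("gemma-2b", gemma), ("mistral-7b", mistral)]:
    emb = embedding_params(cfg["vocab"], cfg["hidden"]) / 1e6
    logit = logits_bytes(batch_size=2, seq_len=4096, vocab_size=cfg["vocab"]) / 1e9
    print(f"{name}: ~{emb:.0f}M embedding params, ~{logit:.1f} GB per bf16 logits tensor")

With a 256k vocabulary, a single bf16 logits tensor for two sequences of 4096 tokens is already around 4 GB, roughly eight times larger than for Mistral's 32k vocabulary, before the loss even upcasts to fp32.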
replied to their post about 2 months ago

Indeed, there is not a lot of metadata. There's also a discrepancy between the number of scores/languages and the number of paragraphs in the text. I've notified the authors about that. CulturaX is an attractive dataset, too!

posted an update about 2 months ago
🖴 The HPLT monolingual dataset has a new home!

After being in touch with the HPLT folks, I've transferred the data to their org. That only makes sense. You can find it below.

HPLT/hplt_monolingual_v1_2
posted an update 2 months ago
๐Ÿ—„๏ธ Massive data release on the HF Hub for 75 languages!

https://huggingface.co/datasets/BramVanroy/hplt_monolingual_v1_2

In December of last year, HPLT (https://hplt-project.org/) released version 1.2 of their dataset. It covers web-crawled data for 75 languages, in raw format as well as in deduplicated and cleaned sections. In total, we're talking about over 40TB of data! This data was already accessible via their website, but I figured the accessibility could be improved by an integration with Hugging Face tooling. 🤗 So I added the dataset to the Hugging Face Hub, enabling direct use in your conventional training pipelines for LLMs or other language technologies. The data will automatically be downloaded and optimised with a single load_dataset call:

load_dataset("BramVanroy/hplt_mono_v1_2", "nl_cleaned")

Let's use this big blob of data to build something awesome in our languages! 🥳
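
Since the full release is tens of terabytes, here is a small hedged sketch (my addition) of how one might stream a single configuration instead of downloading it in full. streaming=True is standard datasets functionality, but the "train" split and the "text" column name are assumptions to check against the dataset card:

from datasets import load_dataset

# Stream the Dutch cleaned configuration instead of materialising the whole download
ds = load_dataset("BramVanroy/hplt_mono_v1_2", "nl_cleaned", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample["text"][:100])  # assuming a "text" column; check the dataset card
    if i == 2:
        break
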
replied to dvilasuero's post 3 months ago

Really cool, crowd-sourcing like this can be very powerful!

Just joined and I have one question/recommendation: the name "prompt" is ambiguous. I got the following conversation:

Marv is a chatbot that reluctantly answers questions with sarcastic responses:

You: How many pounds are in a kilogram?
Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this.
You: What does HTML stand for?
Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future.
You: When did the first airplane fly?
Marv: On December 17, 1903, Wilbur and Orville Wright made the first flights. I wish they'd come and take me away.
You: What is the meaning of life?
Marv: I'm not sure. I'll ask my friend Google.
You: Why is the sky blue?

Intuitively, when I am asked to "rate the prompt", I would expect to have to rate the user prompt that is used to trigger a response. So does that mean I should rate "How many pounds are in a kilogram?"? Or do I have to rate all the responses of "Marv"? Or do I have to rate the whole conversation? The guidelines are also not entirely clear to me, because the example that I get is a conversation, so there is a lot of potential material to be rated (one user prompt, all user prompts, the whole conversation, etc.).

Hope you see my confusion. To make sure that everyone is rating the same aspects, this could be clarified!

posted an update 3 months ago
📣 DPO Dutch model release + datasets

After teasing for a while, I am finally releasing **GEITje 7B Ultra**, building upon the great GEITje 7B by @Rijgersberg . New contributions include: large new datasets for SFT (instruction/chat), two datasets for DPO training (i.e. RLAIF), and an SFT and DPO version of GEITje. The READMEs describe everything well (I hope), and I'll also share more info on social media tomorrow.

For me this is a huge release, the datasets more so than the models. I'm especially pleased with UltraChat, which I created with the intent of having a diverse dataset - the model must be able to communicate with different types of users. So the user questions are created as if they were written by different personas, e.g. language learners, young children, experts, critics, etc. The focus with this is "building a good communication bot that is accessible and can handle different kinds of user input".

I wish I could find the time to also write a paper to get some "academic recognition" but that'll have to wait for now. I just want to bring it to the public so that others can play with it and use it to build new, cool stuff!

I hope that you can all appreciate the work. Let's build some cool stuff with it!

Models:
- Demo: BramVanroy/GEITje-7B-ultra
- DPO Model: BramVanroy/GEITje-7B-ultra
- SFT model (not recommended): BramVanroy/GEITje-7B-ultra-sft

Datasets with GPT-4 turbo completions:
- No robots (~10k instructions): BramVanroy/no_robots_dutch
- UltraChat (~200k instructions): BramVanroy/ultrachat_200k_dutch
- UltraFeedback (DPO with GPT4+GEITje chat, ~50k): BramVanroy/ultra_feedback_dutch
- Orca DPO Pairs (DPO with GPT4+GEITje chat, ~10k): BramVanroy/orca_dpo_pairs_dutch
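
As a minimal usage sketch (my addition, not part of the original announcement), the DPO model can be loaded with standard transformers tooling and its chat template; the example prompt is hypothetical and the model card has the recommended generation settings:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BramVanroy/GEITje-7B-ultra"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires accelerate; torch_dtype="auto" picks the checkpoint's dtype
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# A hypothetical Dutch user prompt; the chat template takes care of the special tokens
messages = [{"role": "user", "content": "Schrijf een korte e-mail aan een collega."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))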
replied to their post 3 months ago

From my limited experience, looking at potentially unwanted text properties of chosen vs. rejected responses is another thing to investigate. The model may learn, for instance, that longer sequences are better and will therefore learn to generate longer sequences regardless of the quality of the content. You can catch this in the metrics, I believe, but only in the log probs (which would then also be very low for the chosen text, perhaps even lower than for the rejected text). You'll likely not notice this in the rewards metrics.
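
A quick, hedged sketch (my addition) of that kind of check: comparing the average length of chosen vs. rejected responses in a preference dataset. The dataset name and the "chosen"/"rejected" column layout follow the common UltraFeedback-style format and may differ for other datasets:

from statistics import mean
from datasets import load_dataset

ds = load_dataset("BramVanroy/ultra_feedback_dutch", split="train")

def response_length(example, key):
    # In chat-formatted datasets the column is a list of messages; fall back to plain strings
    value = example[key]
    if isinstance(value, list):
        value = " ".join(message["content"] for message in value)
    return len(value)

chosen_lengths = [response_length(ex, "chosen") for ex in ds]
rejected_lengths = [response_length(ex, "rejected") for ex in ds]
print(f"avg chosen: {mean(chosen_lengths):.0f} chars, avg rejected: {mean(rejected_lengths):.0f} chars")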

replied to their post 3 months ago

Update (since my post can't be too long): it seems that beta=0.5 was too high! Looking at the log probabilities, you'll notice that despite high rewards margins, the log probabilities of all beta=0.5 runs are inconsistent (chosen log probs are lower than the rejected ones!). I am not sure what causes this. Perhaps the model has become good at discerning the two texts based on other characteristics (like text length) but isn't really certain about the probabilities? In any case, I think that beta=0.1 is better. Perhaps there is some middle ground that I missed during the grid search, but looking at the results qualitatively, 0.1 seems to give a good model.

posted an update 3 months ago
🔎 DPO hyperparameter search update!

In my previous post (https://huggingface.co/posts/BramVanroy/633544255876795), I described how, despite high reward accuracies and low losses, my model would sometimes just output repeating random tokens (/*****/). There were some useful brainstorms in that thread. I think the dataset is relatively easy for the model, leading it to quickly overfit when the beta is very small, which allows the model to step further away from its initial outputs.

So, I ran a hyperparameter search over learning rate (1e-7 vs. 5e-7), batch size (32, 64, 96, 128) and, most importantly, beta (0.01, 0.1, 0.2, 0.5). You can have a look at the results for yourself here: https://wandb.ai/bramvanroy/dpo-geitje-ultra-hyperparams

Interpreting the results, I think that beta=0.5 is the better choice for this dataset. Reasons:

- markedly higher rewards margins compared to all other betas
- better balance between positive chosen and negative rejected rewards
- log probabilities are not as extremely low as for beta=0.01, which seems too low for this dataset
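
For reference, here is a rough sketch (my addition, following the standard DPO formulation rather than the actual training code) of what these reward metrics mean and how beta scales them; the log-prob values below are hypothetical:

import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta):
    """Per-example DPO loss and the reward terms reported by trainers such as TRL."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards          # "rewards/margins"
    loss = -F.logsigmoid(margins)                        # standard (sigmoid) DPO loss
    accuracy = (margins > 0).float().mean()              # "rewards/accuracies"
    return loss.mean(), margins.mean(), accuracy

# Toy illustration: a larger beta scales the margins, even though the policy's chosen
# log probs are lower than the rejected ones (the situation described in the replies).
pc, pr = torch.tensor([-320.0]), torch.tensor([-300.0])   # policy log probs (hypothetical)
rc, rr = torch.tensor([-350.0]), torch.tensor([-310.0])   # reference log probs (hypothetical)
for beta in (0.01, 0.1, 0.5):
    loss, margin, acc = dpo_metrics(pc, pr, rc, rr, beta)
    print(f"beta={beta}: loss={loss:.4f}, margin={margin:.2f}, accuracy={acc:.0f}")

The toy numbers also show how high rewards margins and perfect reward accuracy can coexist with chosen log probs that are lower than the rejected ones.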

Of course, that is just looking at numbers without running any benchmarks. However, I am hesitant to evaluate all the models on benchmarks and thereby literally optimise my hyperparameters on a test set (which is very bad!). So I will just play with some of the most promising models and see which one feels "best" qualitatively.

If you have other insights, thoughts, or opinions, let me know!
replied to their post 3 months ago

That's also a good idea. I'm now going to have a look at the results of the different hyperparameter runs and then look into some example generations. I'll make a new post later where I share all the WANDB logs so everyone can have a look at the impact of lr, batch size and beta on all the losses and rewards. Currently it seems that a beta of 0.01 is just too low for how easy my dataset is.

replied to their post 3 months ago

I work on an already finetuned version of Mistral (GEITje) and on a Dutch dataset. Currently I am thinking that the dataset is "too easy" and that the differences between the chosen and rejected answers are simply too obvious. I'm currently doing a hyperparameter search similar to HF's blog post. I do see a lot of difference in logp between runs, going from -500 to -1100.

replied to their post 3 months ago

So I tweaked the learning rate to 1e-7 (was 5e-7), removed a portion of the dataset (I used two datasets, one based on Orca Pairs and the other based on UltraFeedback; I kept the latter), and I decreased the lengths to a max_length of 2048 and a prompt length of 1152. EDIT: I thought this had worked, but apparently I am wrong. For some prompts, the same result occurs, with repetitions of /******/ everywhere. I am very confused about this and should find time to dig deeper, but it is a tedious trial-and-error process of training and testing that eats up a lot of my time. If anyone wants to have a look, I can provide gated access to the model.

Mostly cc @lewtun but also a shout out to everyone in this thread for brainstorming along!

replied to their post 4 months ago

my first name at argilla.io

Mail sent!

replied to their post 4 months ago

Good point. I have looked at the data and set the lengths to more reasonable values for this dataset (I ended up with 1536 for the prompt and 2048 max length for the responses). Having heterogeneous batches might indeed also be something to look into.

I have also looked at the dataset and in terms of quality I would expect it to be usable enough when I compare it to the UltraFeedback dataset.

So perhaps DPO is just very sensitive to hyperparameters. I'll try a number of different hyperparameters and see whether I can find some suitable ones. (With limited disk space this is not easy though, with every checkpoint taking up almost 100GB!)

I'll report back with my findings.
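
For concreteness, a rough sketch (my addition, not the actual training script) of how those length settings map onto trl's DPOTrainer as used in the alignment-handbook era; newer trl versions move these arguments into DPOConfig, and the model, dataset, and other hyperparameters here are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "BramVanroy/GEITje-7B-ultra-sft"  # the SFT model from the release above (assumed starting point)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
dataset = load_dataset("BramVanroy/ultra_feedback_dutch", split="train")  # prompt/chosen/rejected columns assumed

training_args = TrainingArguments(output_dir="geitje-dpo", per_device_train_batch_size=2,
                                  learning_rate=5e-7, num_train_epochs=1, bf16=True)

trainer = DPOTrainer(
    model=model,
    ref_model=None,            # trl creates a frozen reference copy when None
    args=training_args,
    beta=0.1,                  # one of the values from the grid search
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=1536,    # values mentioned in the reply above
    max_length=2048,
)
trainer.train()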

replied to their post 4 months ago

Sadly it is not possible to rely on something like PairRM because I am not working on English. I am having a closer look at the dataset but it does not look that different from the (English) UltraFeedback in terms of quality (in my language) so it's hard to pinpoint the difference. I think I'll try out some different hyperparameters and see if I can find something stable.

replied to their post 4 months ago

I used the alignment-handbook hyperparameters for Zephyr (beta 0.01 for 1 epoch), and the architecture is also based on Mistral. The only change I made was setting max_length=8192, because I didn't understand why it was set to 1024. But I don't know if the length would impact the result so much?

posted an update 4 months ago
๐Ÿ•ต๏ธ Looking for DPO experts!

I have a dataset with gpt-4-turbo outputs as chosen and a lower-performing model's outputs as rejected. The objective should therefore be fairly easy, because the two are easy to discern. As a consequence, the model achieves very low losses (0.021 train; 0.013 validation) and high reward accuracies (0.995). **However**, when using the model in practice, it often deteriorates after the first one or two tokens and continuously outputs sequences of /*****/. So despite the good performance on the DPO objective and strong scores on the validation set (no overfitting), something seems to go wrong. Perhaps the outputs are too different and the task is too easy, in which case DPO is not useful. But why then would the model start hallucinating and repeating the same token over and over again?

Any thoughts? Any suggestions to get around this? All discussions are welcome!
posted an update 4 months ago
💡 We recently launched a Discord server for #Dutch #NLP and #LLMs. We have more than 50 users with _very_ varying backgrounds! 🧙👩‍🔬🧑‍🎨🧑‍🏫🧑‍💼 We've already had discussions on eval, tokenizers, RAG, data... A bit of everything. Everyone is welcome to work together on Dutch NLP and LLMs! https://discord.gg/YUUXVZkZJ9