Scale up base model for German?

#2
by WANGYIWEI - opened

Dear Malte,

Thank you very much for your excellent work on adapting GPT2-XL to German.

I have been looking for a pre-trained (or fine-tuned) model that is powerful enough, yet not too heavy, to use in my German chatbot's NLU component. Your GPT2-XL-Wechsel-German has brought a significant improvement to my project compared to other auto-encoding (encoder-only) models adapted to German: on my private NLU intent-classification dataset, I achieved an F1 score of 99.5% with your model ;)
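For context, here is a minimal sketch of how a decoder-only model like this can be wired up for intent classification (the Hugging Face model id is assumed and the label count is a placeholder, since my dataset is private):

```python
# Minimal sketch: a decoder-only German GPT-2 as an intent classifier.
# The model id below is an assumption; num_labels is a hypothetical placeholder.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "malteos/gpt2-xl-wechsel-german"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=12,  # hypothetical number of intent classes
)
# Note: the classification head is freshly initialized and still needs
# fine-tuning on the (private) intent dataset before the scores mean anything.

# GPT-2 defines no padding token; reuse EOS so padded batches work.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Wie wird das Wetter morgen?", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, num_labels): one score per intent
```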

Now I am wondering whether you would be interested in replicating the fine-tuning process for a slightly larger model on the German dataset. The model should not be as heavy as Llama-2-7B, though, which makes full-scale fine-tuning on my private consumer-grade GPU (an RTX 4090) very challenging.

I found a promising base model called BTLM-3B-8k-base. It might not be as good as Llama-2-7B, but compared to GPT2-XL from 2019 it would definitely be an upgrade.

I would be glad to hear your feedback, and I would be interested in further discussion about this :)

There are already bigger models, such as the ones below (we will also release additional models in the near future):

Hey Malte,

Sorry for the late reply. I have actually tested the BLOOM-based German LLMs and compared their performance with that of the GPT-2-based German models.

At the same scale, say around 1.x B parameters, I found that the older model, GPT-2-XL, has much better generation ability in terms of topic consistency and text coherence.

I have also discussed this with my supervisor, since he and his PhD students have previously done some work using BLOOM-175B as the base model for entity processing. According to their feedback, the BLOOM family really does not deliver a satisfactory level of performance.

I also tested some other LLaMA-based German models (7B). I ran a small causal-generation evaluation against the 6.4B BLOOM-based German model you provided, and the difference is still quite noticeable. However, due to compute limitations I am not able to fine-tune these models on my private data, so they are unfortunately not part of my current plan.
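Roughly this kind of side-by-side check, as a minimal sketch (the model ids are assumptions; substitute whichever checkpoints you want to compare):

```python
# Minimal sketch of a side-by-side causal-generation check.
# Both model ids are assumed; swap in the actual checkpoints being compared.
from transformers import pipeline

prompt = "Die deutsche Sprache ist"
for model_id in [
    "malteos/gpt2-xl-wechsel-german",  # assumed id of the GPT-2-XL model
    "malteos/bloom-6b4-clp-german",    # assumed id of the 6.4B BLOOM model
]:
    # device_map="auto" (requires accelerate) spreads the 6.4B model
    # across available devices; a single RTX 4090 can hold it in fp16.
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    out = generator(prompt, max_new_tokens=80, do_sample=True, top_p=0.9)
    print(model_id, "->", out[0]["generated_text"])
```

Judging topic consistency and coherence on such samples is of course a manual, qualitative comparison rather than a benchmark score.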

I am excited to hear that you will be releasing some new German models based on CLP-Transfer/Wechsel in the near future, and I am looking forward to them :)) Also, if it is not too much trouble, some 3B models (e.g., BTLM-3B-8k-base) would be very nice options, since most developers can run or even train them on consumer-grade GPUs.

Best regards,
Yiwei Wang
