Pieter Delobelle, François Remy, Miryam de Lhoneux, Thomas Demeester
Model Card for tweety-7b-dutch
tweety-7b-dutch is a foundation model focused on the Dutch language. It uses a Dutch tokenizer for better understanding and generation of Dutch text. It is built on the Mistral architecture and uses flash attention for efficient processing within a context window of 8192 tokens. tweety-7b-dutch is trained on the cleaned Dutch mC4 dataset, without instruction finetuning.
Model Details
Model Description
Our tweety-7b-dutch model is released under the Apache 2.0 license to encourage applications in research, content creation, and language analysis.
- Tokenizer: Dutch, 50k tokens (yhavinga/gpt-neo-1.3B-dutch)
- Pre-training data: Scraped Dutch (yhavinga/mc4_nl_cleaned)
- Context window: 8192 tokens
- Training data: 8.5B tokens
- Developed by: KU Leuven and UGent
- Funded by: KU Leuven BOF, VSC (Flemish Supercomputer Center), Vlaams AI-onderzoeksprogramma
- Model type: Foundation model
- License: Apache 2.0
Uses
As a base model, tweety-7b-dutch can be used directly for Dutch text generation and understanding. Because it has no instruction finetuning, it works by continuing a given text rather than by following instructions.
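A minimal sketch of plain text continuation with the Hugging Face `transformers` library is given here as an illustration. The repository id `your-namespace/tweety-7b-dutch` is a hypothetical placeholder, since this card does not state the Hub path, and the helper names `max_new_tokens_for` and `complete` are introduced only for this example. The sketch also assumes `torch`, `transformers`, and `accelerate` (for `device_map="auto"`) are installed.

```python
# Hypothetical repository id: this card does not state the Hub path, replace it.
MODEL_ID = "your-namespace/tweety-7b-dutch"
CONTEXT_WINDOW = 8192  # context window stated in this card


def max_new_tokens_for(prompt_token_count: int, requested: int = 128) -> int:
    """Clamp the generation length so prompt + output fits the context window."""
    remaining = CONTEXT_WINDOW - prompt_token_count
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested, remaining)


def complete(prompt: str, model_id: str = MODEL_ID, requested: int = 128) -> str:
    """Continue a Dutch prompt with the base model (no chat template)."""
    # Imported lazily so the helper above works without torch/transformers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the card states training was in bfloat16
        device_map="auto",           # requires the accelerate package
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    budget = max_new_tokens_for(inputs["input_ids"].shape[1], requested)
    output_ids = model.generate(**inputs, max_new_tokens=budget)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("Het weer in België is vandaag")` would return the prompt followed by the model's continuation. Since the model is not instruction-tuned, prompts phrased as the start of a text tend to be a better fit than questions or commands.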
Technical Specifications
Compute Infrastructure
Training used NVIDIA H100 and A100 GPUs. Inference runs on lower-end GPUs: any GPU capable of running Mistral models should work.
Model Weights
- This model was trained in bfloat16.
- GGUF weights are released by Bram Vanroy.
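As a rough guide to what bfloat16 weights imply for inference hardware, the following sketch estimates the memory needed to hold the weights alone. The parameter count of 7 billion is an assumption taken from the "7b" in the model name, not an exact figure from this card, and the function name `weight_memory_gib` is hypothetical. Activations and the key-value cache add to this figure.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB. bfloat16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 2**30


# Assumed nominal size of 7e9 parameters (from the "7b" in the model name).
approx_gib = weight_memory_gib(7e9)
print(f"bfloat16 weights: about {approx_gib:.1f} GiB")
```

Under that assumption the weights take roughly 13 GiB in bfloat16. Quantized formats such as the GGUF release store fewer bytes per parameter and therefore lower this requirement.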
Citation
If you use this model, please cite our work as:
@article{tweeties2024,
title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
url = {https://arxiv.org/abs/2408.04303},
year = {2024},
note = {Accepted at COLM 2024}
}