Presentation

#1 opened by scampion (European Parliament org)

A bit of context for this model:

The tokenizer's vocabulary size was increased to match the EuroVoc classification task. I agree that downstream tasks are possible in theory, but other models may be a better fit for them; it really depends on the use case.
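For illustration, here is a minimal sketch of how one could inspect a tokenizer's vocabulary size with the transformers library; the repository ID below is a placeholder, not the actual model name:

```python
from transformers import AutoTokenizer

# Placeholder repository ID; substitute the actual model name.
tokenizer = AutoTokenizer.from_pretrained("EuropeanParliament/model-name")
print(len(tokenizer))  # vocabulary size, including any added tokens
```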

The training dataset was extracted from the Publications Office (https://huggingface.co/datasets/EuropeanParliament/Eurovoc) without much data preparation; I think this is also a point worth studying.
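For reference, the dataset can be loaded straight from the Hub; a minimal sketch, where the split name and fields are assumptions to be checked against the dataset card:

```python
from datasets import load_dataset

# Repository taken from the link above; the split name is an assumption.
ds = load_dataset("EuropeanParliament/Eurovoc", split="train")
print(ds.column_names)  # inspect the raw fields before any preparation
```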

My initial idea was to focus the model on the 24 official EU languages only, in order to get a better vocabulary-to-training-data ratio and to increase performance, with the goal of deployment on x86 hardware rather than on a GPU, which is much harder to deploy.
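As one illustration of what CPU-oriented deployment could look like, here is a minimal sketch using PyTorch dynamic quantization; the repository ID is a placeholder, and whether quantization preserves classification quality would need to be measured:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder repository ID; substitute the actual model name.
model = AutoModelForSequenceClassification.from_pretrained("EuropeanParliament/model-name")
model.eval()

# Quantize the linear layers to int8: smaller and typically faster on x86 CPUs.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```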

European Parliament org (edited Oct 10, 2023)

A new version is currently being trained, with many more epochs and a new tokenizer.
The checkpoint will be updated progressively; the training loss is reported in real time at
https://api.wandb.ai/links/sebastien-campion/70wicm0r
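For context, a real-time loss report like the one linked above can be produced by pointing the transformers Trainer at Weights & Biases; a minimal sketch with illustrative values only, since the actual training configuration is not shown here:

```python
from transformers import TrainingArguments

# Illustrative values; the real epoch count and logging cadence are not public here.
args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=20,   # "many more epochs" per the post above; exact number unknown
    logging_steps=50,      # how often the train loss is streamed
    report_to="wandb",     # send metrics to Weights & Biases in real time
)
```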
