---
library_name: transformers
license: mit
language:
  - gl
base_model:
  - microsoft/mdeberta-v3-base
pipeline_tag: fill-mask
---

# mDeBERTa-gl

mDeBERTa-gl is a continued pretraining checkpoint based on microsoft/mdeberta-v3-base, adapted to Galician through large-scale masked-language modeling. It is intended as a strong general-purpose encoder for downstream NLP tasks in Galician.

## Training

- Base model: microsoft/mdeberta-v3-base
- Epochs: 3
- Learning rate: 6e-4
- MLM probability: 0.15
- Max sequence length: 512
- Total batch size: 1024
- Training examples: 10,335,227
- Mask token: `[MASK]`
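The MLM probability above means roughly 15% of token positions are selected for prediction during pretraining. A minimal pure-Python sketch of the standard BERT-style corruption rule (the 80/10/10 replacement split is the usual Transformers default, illustrated here for intuition; it is not this project's actual training script, and `mask_id`/`vocab_size` below are placeholder values):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """BERT-style MLM corruption: each selected position becomes [MASK] 80%
    of the time, a random token 10%, and stays unchanged 10%. Returns
    (corrupted_ids, labels); labels is -100 at unselected positions so the
    loss ignores them."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tid in token_ids:
        if rng.random() < mlm_probability:
            labels.append(tid)  # model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_id)                    # 80%: [MASK]
            elif r < 0.9:
                corrupted.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                corrupted.append(tid)                        # 10%: unchanged
        else:
            labels.append(-100)   # position not selected; ignored by the loss
            corrupted.append(tid)
    return corrupted, labels

ids = list(range(100, 150))  # dummy token ids
corrupted, labels = mask_tokens(ids, mask_id=4, vocab_size=250000)
print(sum(l != -100 for l in labels), "of", len(ids), "positions selected")
```

In practice this logic is handled by the data collator, with masking applied per batch at the tensor level.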

## Intended uses

- Masked language modeling (fill-mask)
- Encoder for classification, NER, QA, and general Galician NLP tasks
- Further domain adaptation via fine-tuning

## How to use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "proxectonos/mdeberta-gl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Returns the top candidate tokens for [MASK], ranked by score
fill_mask("O Parlamento de Galicia aprobou a [MASK] hoxe.")
```

## Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

## Citation

Please reference this model as: mdeberta-gl (Proxecto Nós Team, 2025).