metadata

license: apache-2.0
datasets:
  - wikipedia
language:
  - it
widget:
  - text: milano è una [MASK] dell'italia
    example_title: Example 1
  - text: il sole è una [MASK] della via lattea
    example_title: Example 2
  - text: l'italia è una [MASK] dell'unione europea
    example_title: Example 3

Model: BLAZE 🔥

Language: IT
Version: 𝖬𝖪-𝖨

Introduction

This model is a lightweight, uncased version of BERT [1] for the Italian language. With 55M parameters and a size of 220MB, it is 50% lighter than a typical monolingual BERT model, which makes it well suited to settings where memory consumption and execution speed are critical, while still delivering high-quality results.

Model description

The model was obtained by starting from the multilingual DistilBERT [2] model (distilbert-base-multilingual-cased, from the HuggingFace team) and specializing it for the Italian language, while also turning it into an uncased model. Both changes were made by modifying the embedding layer, as in [3], but computing document-level frequencies over the Wikipedia dataset and setting a frequency threshold of 0.1%. This considerably reduces the number of parameters.
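
The frequency-based vocabulary reduction described above can be illustrated with a minimal sketch. It assumes that a token is kept when it appears in at least 0.1% of the documents after lowercasing. The function name `select_vocabulary`, the whitespace tokenizer, and the toy corpus are hypothetical and only serve the illustration; the actual procedure operates on the WordPiece vocabulary of the starting model and may differ in its details.

```python
from collections import Counter


def select_vocabulary(documents, tokenize, threshold=0.001):
    """Return the tokens whose document frequency is at least `threshold`.

    Assumption: "document-level frequency" means the fraction of documents
    in which a token occurs at least once.
    """
    doc_freq = Counter()
    for doc in documents:
        # Lowercase first (uncased vocabulary) and count each token once per document
        doc_freq.update(set(tokenize(doc.lower())))
    min_docs = threshold * len(documents)
    return {token for token, count in doc_freq.items() if count >= min_docs}


# Toy corpus with a deliberately high threshold so that the effect is visible
docs = ["Milano è una città", "Roma è una capitale", "il sole è una stella"]
kept = select_vocabulary(docs, str.split, threshold=0.5)
print(sorted(kept))
```

Only the rows of the embedding matrix that correspond to the retained tokens need to be kept, which is where the reduction in parameters comes from.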

Deleting the cased tokens forces the model to rely on lowercase representations of words that were previously capitalized. To compensate, the model was further pre-trained on the Italian split of the Wikipedia dataset, using the whole word masking [4] technique to make it more robust to the new uncased representations.
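
As a rough illustration of whole word masking, the sketch below groups WordPiece tokens into words (a piece starting with `##` continues the previous word) and masks all pieces of a selected word together. The function name `whole_word_mask` and the 15% masking probability are assumptions, and the sketch always substitutes `[MASK]`, leaving out the random-token and keep-original replacements used in standard BERT pre-training.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words: every WordPiece of a selected word is replaced."""
    rng = random.Random(seed)

    # Group token indices into words; a "##" piece continues the previous word
    words = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for word in words:
        # The masking decision is taken once per word, not once per piece
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hypothetical tokenization: "lat" and "##tea" belong to the same word
tokens = ["la", "via", "lat", "##tea"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

With this scheme the model never sees one piece of a word while predicting another piece of the same word, so it has to rely on the surrounding context.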

The resulting model has 55M parameters, a vocabulary of 13,832 tokens, and a size of 220MB, which makes it 50% lighter than a typical monolingual BERT model and 20% lighter than a typical monolingual DistilBERT model.
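
The reported size is consistent with a simple back-of-the-envelope calculation, sketched below. The 4 bytes per parameter (32-bit floats) and the hidden size of 768 (the value used by the base DistilBERT architecture) are assumptions that the text does not state explicitly.

```python
NUM_PARAMETERS = 55_000_000  # parameter count reported in the text
VOCAB_SIZE = 13_832          # vocabulary size reported in the text
BYTES_PER_PARAMETER = 4      # assumption: weights stored as 32-bit floats
HIDDEN_SIZE = 768            # assumption: hidden size of base DistilBERT

# Storage needed for the weights alone, in megabytes
size_mb = NUM_PARAMETERS * BYTES_PER_PARAMETER / 1_000_000

# Parameters in the token embedding matrix (one row per vocabulary entry)
embedding_parameters = VOCAB_SIZE * HIDDEN_SIZE

print(f"Estimated size: {size_mb:.0f} MB")
print(f"Token embedding parameters: {embedding_parameters:,}")
```

Under these assumptions the token embeddings account for roughly 10.6M of the 55M parameters, which shows why shrinking the vocabulary has such a direct effect on model size.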

Training procedure

The model was trained for masked language modeling on the Italian Wikipedia dataset (~3GB) for 10K steps, using the AdamW optimizer, a batch size of 512 (obtained through 128 gradient accumulation steps), a sequence length of 512, and a linearly decaying learning rate starting from 5e-5. Training used dynamic masking between epochs together with the whole word masking technique.
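
The hyperparameters above imply a few derived quantities, worked out in the sketch below. It assumes a single device (so that the per-step micro-batch is the effective batch size divided by the accumulation steps), no warmup phase, and a learning rate that decays to zero at the final step; none of these details are stated in the text, and `linear_decay_lr` is a hypothetical helper.

```python
TOTAL_STEPS = 10_000        # optimizer steps reported in the text
EFFECTIVE_BATCH_SIZE = 512  # sequences per optimizer step
ACCUMULATION_STEPS = 128    # gradient accumulation steps
SEQUENCE_LENGTH = 512       # maximum tokens per sequence
INITIAL_LR = 5e-5           # starting learning rate

# Sequences processed in each forward/backward pass (assumes one device)
micro_batch_size = EFFECTIVE_BATCH_SIZE // ACCUMULATION_STEPS

# Upper bound on the tokens seen per optimizer step (ignores padding)
max_tokens_per_update = EFFECTIVE_BATCH_SIZE * SEQUENCE_LENGTH


def linear_decay_lr(step):
    """Learning rate at `step`, assuming no warmup and a decay to zero."""
    remaining = max(0.0, 1.0 - step / TOTAL_STEPS)
    return INITIAL_LR * remaining


print(micro_batch_size, max_tokens_per_update)
for step in (0, 5_000, 10_000):
    print(step, linear_decay_lr(step))
```

Gradient accumulation keeps the memory footprint of a small micro-batch while producing the gradient of the full effective batch at each optimizer step.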

Performance

The following metrics were computed on the Part of Speech Tagging and Named Entity Recognition tasks, using the UD Italian ISDT and WikiNER datasets, respectively. The PoS tagging model was trained for 5 epochs and the NER model for 3 epochs, both with a constant learning rate of 1e-5. For Part of Speech Tagging, the metrics were computed on the default test set provided with the dataset, while for Named Entity Recognition they were computed with 5-fold cross-validation.

Task	Recall	Precision	F1
Part of Speech Tagging	97.48	97.29	97.37
Named Entity Recognition	89.29	89.84	89.53

The metrics have been computed at token level and macro-averaged over the classes.
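
For readers unfamiliar with this kind of averaging, the sketch below shows one common way to compute token-level, macro-averaged metrics: recall, precision, and F1 are computed separately for each class and then averaged with equal weight. The function name `macro_metrics` and the toy labels are hypothetical, and this is not necessarily the exact evaluation code behind the table.

```python
def macro_metrics(y_true, y_pred):
    """Return macro-averaged (recall, precision, F1) over token labels."""
    labels = sorted(set(y_true) | set(y_pred))
    recalls, precisions, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    n = len(labels)
    # Every class contributes equally, regardless of how many tokens it has
    return sum(recalls) / n, sum(precisions) / n, sum(f1s) / n


# Toy example with one tag per token
y_true = ["NOUN", "NOUN", "VERB", "DET"]
y_pred = ["NOUN", "VERB", "VERB", "DET"]
recall, precision, f1 = macro_metrics(y_true, y_pred)
print(f"{recall:.4f} {precision:.4f} {f1:.4f}")
```

Because F1 is averaged per class, the macro F1 is generally not the harmonic mean of the macro precision and macro recall.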

Demo

You can try the model online (fine-tuned on named entity recognition) using this webapp: https://huggingface.co/spaces/osiria/next-it-demo

Quick usage

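from transformers import AutoTokenizer, DistilBertForMaskedLM, pipeline

# Load the tokenizer and the masked language model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("osiria/blaze-it")
model = DistilBertForMaskedLM.from_pretrained("osiria/blaze-it")

# Build a fill-mask pipeline and query it with a sentence containing [MASK]
pipeline_mlm = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
print(pipeline_mlm("milano è una [MASK] dell'italia"))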

Limitations

This lightweight model was trained mainly on Wikipedia, so it is particularly suitable as an agile analyzer for large volumes of natively digital text from the world wide web that is written correctly and fluently (wikis, web pages, news, etc.). It may show limitations on chaotic text containing errors and slang expressions (such as social media posts) and on domain-specific text (such as medical, financial, or legal content).

References

[1] https://arxiv.org/abs/1810.04805

[2] https://arxiv.org/abs/1910.01108

[3] https://arxiv.org/abs/2010.05609

[4] https://arxiv.org/abs/1906.08101

License

The model is released under the Apache-2.0 license.