---
pipeline_tag: fill-mask
datasets:
  - arch-be/brabant-xvii
language:
  - nl
  - fr
widget:
  - text: >-
      by den ontfanger van de exploiten gecollacionneert tegens [MASK] brieven
      by my
    output:
      - label: onsen
        score: 0.326
      - label: onse
        score: 0.247
      - label: doriginale
        score: 0.17
      - label: synen
        score: 0.046
      - label: zynen
        score: 0.011
  - text: |-
      [MASK] par la grace de dieu etc savoir faisons a tous presens et avenir 
      nous avoir receu lumble supplication de jehannet austin
    output:
      - label: maximilian
        score: 0.958
      - label: philippe
        score: 0.017
      - label: phelippe
        score: 0.016
  - text: >-
      [MASK] byder gracien gods roomsch keyser altyt vermeerder tsrycx coninck
      van 

      germanien van castillien van leon van arragon van navarre van napels van
      secillien 

      van maiorque van sardine vanden eylanden van indien vander vaster eerden
      ende zee 

      occeane eertshertoge van oistenrycke hertoge van bourgoingnen van lothric
      van brabant 

      van limborch van luxemborch etc.
    output:
      - label: philips
        score: 0.968
      - label: kaerle
        score: 0.027
      - label: maximiliaen
        score: 0.002
  - text: |-
      Cornelia de Ghijs
      Joos de Medraige.
      Gedaen ende alzoo gepasseert inder stadt van Bruessele
      den tweesten dach der maendt van [MASK] int jaer
      duijsent vijffhondert tachtentich
    output:
      - label: julio
        score: 0.428
      - label: augusto
        score: 0.111
      - label: aprille
        score: 0.107
      - label: februario
        score: 0.053
      - label: januario
        score: 0.042
license: mit
---

# Brabant-XVII-From-Scratch

As the name suggests, this model was trained on the eponymous dataset, which comprises transcriptions of archival texts from the Council of Brabant (raad van brabant) over a period ranging from the XVth to the XVIIth century. These transcriptions were made by hand, mostly by volunteers. The transcribed documents cover (at the time of writing) letters of pardon and letters of sentence.

## The model architecture

The 'brabant-xvii-from-scratch' model itself was trained as a plain old BERT model.
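
For concreteness, here is a minimal sketch of how such a plain BERT model can be instantiated with the Hugging Face `transformers` library. The hidden size, depth, and number of attention heads below are the BERT-base defaults and are assumptions; the card does not state the actual hyperparameters.

```python
from transformers import BertConfig, BertForPreTraining

# Assumed hyperparameters: BERT-base defaults, except for the vocabulary size,
# which matches the 30,000-token limit of the tokenizer described below.
config = BertConfig(
    vocab_size=30_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)

# BertForPreTraining carries both the MLM and NSP heads used during pretraining.
model = BertForPreTraining(config)
```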

## Tokenizer

The tokenizer was trained specifically for this model, using the standard BERT tokenization scheme (WordPiece) with a vocabulary size limit of 30,000. All of the available text was provided during tokenizer training, so that the resulting tokenizer sticks as closely as possible to the actual vocabulary used in the complete corpus.
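
The following is a sketch, not the actual training script, of how such a tokenizer can be trained with the Hugging Face `tokenizers` library; the corpus file name and the output path are hypothetical.

```python
from tokenizers import BertWordPieceTokenizer

# Standard BERT WordPiece tokenizer; lowercasing and accent stripping match
# the preprocessing steps listed below.
tokenizer = BertWordPieceTokenizer(lowercase=True, strip_accents=True)

# Train on all of the available text so that the resulting vocabulary sticks
# as closely as possible to the complete corpus ("corpus.txt" is hypothetical).
tokenizer.train(files=["corpus.txt"], vocab_size=30_000)

tokenizer.save("brabant-xvii-wordpiece.json")  # hypothetical output path
```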

### Preprocessing steps

The tokenizer of this model applies the following preprocessing steps (a code sketch follows the list):

- text has been normalized to the NFKC form
- all '(', ')', '[', and ']' characters have been removed from the text. These were typically used by annotators to indicate the likely completion of a word that had been abbreviated in the handwritten text. (That decision was made by the historians in charge of the corpus, since the objective is to eventually make the corpus searchable.)
- sequences of line breaks have been collapsed to keep at most one
- sequences of blanks have been normalized to keep at most one
- all text has been lowercased
- all accents have been removed
- all control characters have been normalized to plain spaces
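
Expressed as plain Python, the cleaning pipeline could look roughly like the sketch below. The order in which the steps are applied is an assumption; only the steps themselves are taken from the list above.

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Apply the cleaning steps listed above (order of application is assumed)."""
    text = unicodedata.normalize("NFKC", text)             # NFKC normalization
    text = re.sub(r"[()\[\]]", "", text)                   # drop editorial brackets
    text = text.lower()                                    # lowercase everything
    # strip accents: decompose, then drop the combining marks
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))
    # normalize control characters (other than line breaks) to plain spaces
    text = "".join(" " if unicodedata.category(c) == "Cc" and c != "\n" else c
                   for c in text)
    text = re.sub(r"\n{2,}", "\n", text)                   # at most one line break
    text = re.sub(r" {2,}", " ", text)                     # at most one blank
    return text
```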

## Pretraining of the model

The model was then pretrained for both BERT objectives (MLM and NSP). To that end, it used the 'next_sentence' configuration of the 'arch-be/brabant-xvii' dataset, which was crafted specifically for this purpose. As per the state of the art in BERT training, 15% of the tokens were masked at training time.
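
A hedged sketch of such a pretraining setup with `datasets` and `transformers` is shown below. The dataset name and its 'next_sentence' configuration come from this card; the split and column names, the tokenizer path, and the training arguments are assumptions.

```python
from datasets import load_dataset
from transformers import (
    BertConfig, BertForPreTraining, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

dataset = load_dataset("arch-be/brabant-xvii", "next_sentence")
tokenizer = BertTokenizerFast(tokenizer_file="brabant-xvii-wordpiece.json")  # hypothetical path

def encode(batch):
    # "sentence_a", "sentence_b", and "next_sentence_label" are assumed column names.
    enc = tokenizer(batch["sentence_a"], batch["sentence_b"],
                    truncation=True, max_length=512)
    enc["next_sentence_label"] = batch["next_sentence_label"]
    return enc

encoded = dataset.map(encode, batched=True,
                      remove_columns=dataset["train"].column_names)  # assumed split name

# Masks 15% of the tokens on the fly, as in the original BERT recipe.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=BertForPreTraining(BertConfig(vocab_size=tokenizer.vocab_size)),
    args=TrainingArguments(output_dir="brabant-xvii-from-scratch"),
    train_dataset=encoded["train"],
    data_collator=collator,
)
trainer.train()
```

The collator provides the masked-LM labels, while the `next_sentence_label` column carries the NSP targets consumed by `BertForPreTraining`.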

## Related Work

This model was trained as an experiment to see what would work best on the target corpus (pardons and sentences). Three options were considered for that purpose:

1. Pretraining a BERT model from scratch that is able to leverage a tokenizer whose vocabulary emerges from the target corpus
2. Fine-tuning all layers of a pretrained historical model (emanjavacas/GysBERT-v2) that mostly fits the target languages (not fully though, as brabant-xvii comprises both texts in ancient Dutch and texts in ancient French)
3. Fine-tuning only the head of the same pretrained historical model (emanjavacas/GysBERT-v2); see the sketch after this list.
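
The sketch below contrasts the two fine-tuning options, assuming the MLM objective is trained with `transformers`; it is an illustration, not the actual experiment code.

```python
from transformers import AutoModelForMaskedLM

# Options 2 and 3 both fine-tune GysBERT-v2 on the MLM objective;
# they differ only in which parameters remain trainable.
model = AutoModelForMaskedLM.from_pretrained("emanjavacas/GysBERT-v2")

# Option 2: fine-tune all layers -- every parameter stays trainable (the default).

# Option 3: fine-tune only the head -- freeze the encoder so that only the
# MLM head receives gradient updates.
for param in model.base_model.parameters():
    param.requires_grad = False
```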

### Important Note

The fine-tuning of the pretrained historical model (emanjavacas/GysBERT-v2) is fundamentally different from the pretraining of this foundation model. Indeed, this model was pretrained for both the MLM (masked language modeling) and NSP (next sentence prediction) objectives, whereas the fine-tuning only retrains for the MLM objective (even when updating the weights of all layers is allowed). When pretraining for both MLM and NSP, the intuition is to let the model learn which pairs of lines could plausibly follow one another in the corpus (hence enriching its internal representation of the document structure), in addition to teaching it to fill gaps in a text with the words it knows from its vocabulary.
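
In `transformers` terms, and assuming both setups are expressed with that library, the difference boils down to which heads the model class carries:

```python
from transformers import BertConfig, BertForMaskedLM, BertForPreTraining

# Pretraining (this model): MLM + NSP heads; both losses are summed during training.
pretraining_model = BertForPreTraining(BertConfig(vocab_size=30_000))

# Fine-tuning (options 2 and 3): MLM head only; no next-sentence loss is computed.
finetuning_model = BertForMaskedLM.from_pretrained("emanjavacas/GysBERT-v2")
```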

## First Observations

Note: The experiment is not complete yet, hence no conclusive results can be provided so far.

Option 1 (training a model from scratch) is what gave rise to this very model. Option 2 (fine-tuning all layers of a pretrained historical model) is what gave rise to