---
pipeline_tag: fill-mask
datasets:
- arch-be/brabant-xvii
language:
- nl
- fr
widget:
- text: |-
      by den ontfanger van de exploiten gecollacionneert tegens [MASK] brieven by my
  output:
    - label: onsen
      score: 0.326
    - label: onse
      score: 0.247
    - label: doriginale
      score: 0.170
    - label: synen
      score: 0.046
    - label: zynen
      score: 0.011
- text: |-
      [MASK] par la grace de dieu etc savoir faisons a tous presens et avenir 
      nous avoir receu lumble supplication de jehannet austin
  output:
    - label: maximilian
      score: 0.958
    - label: philippe
      score: 0.017
    - label: phelippe
      score: 0.016
- text: |-
      [MASK] byder gracien gods roomsch keyser altyt vermeerder tsrycx coninck van 
      germanien van castillien van leon van arragon van navarre van napels van secillien 
      van maiorque van sardine vanden eylanden van indien vander vaster eerden ende zee 
      occeane eertshertoge van oistenrycke hertoge van bourgoingnen van lothric van brabant 
      van limborch van luxemborch etc.
  output:
    - label: philips
      score: 0.968
    - label: kaerle
      score: 0.027
    - label: maximiliaen
      score: 0.002
- text: |-
      Cornelia de Ghijs
      Joos de Medraige.
      Gedaen ende alzoo gepasseert inder stadt van Bruessele
      den tweesten dach der maendt van [MASK] int jaer
      duijsent vijffhondert tachtentich
  output:
    - label: julio
      score: 0.428
    - label: augusto
      score: 0.111
    - label: aprille
      score: 0.107
    - label: februario
      score: 0.053
    - label: januario
      score: 0.042
license: mit
---

# Brabant-XVII-From-Scratch

As the name suggests, this model was trained on the eponymous dataset, which comprises transcriptions of archival texts from the
'Council of Brabant' (Raad van Brabant) over a period ranging from the XVth to the XVIIth century. These transcriptions were made by hand,
mostly by volunteers. At the time of training, the transcribed documents covered letters of pardon and letters of sentence.

## The model architecture
The 'brabant-xvii-from-scratch' model itself was trained as a plain BERT model.

## Tokenizer

The tokenizer was trained specifically for this model, using the standard BERT tokenization algorithm (WordPiece) with a vocabulary size limit of 30,000.
All of the available text was provided during tokenizer training, so that the resulting vocabulary sticks as closely as possible to the vocabulary actually
used in the complete corpus.
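
To give an idea of how a WordPiece vocabulary is applied once it has been learned, here is a minimal sketch of the greedy longest-match-first segmentation that BERT-style tokenizers use. It is an illustration only: the toy vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are hypothetical, and the real 30,000-entry vocabulary of this model will segment words differently.

```python
# Hypothetical toy vocabulary; the model's real vocabulary is much larger.
TOY_VOCAB = {"ge", "##pass", "##eert", "brie", "##ven"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece units with greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


print(wordpiece_tokenize("gepasseert", TOY_VOCAB))
print(wordpiece_tokenize("brieven", TOY_VOCAB))
```

A vocabulary learned on the target corpus tends to keep frequent historical spellings as whole pieces, which is the motivation for training the tokenizer on all of the available text.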

## Preprocessing steps

The tokenizer of this model applies the following preprocessing steps:
- text is normalized to Unicode NFKC form
- all '(', ')', '[' and ']' are removed from the text. These were typically used by annotators to indicate the likely completion of a word that had been abbreviated in the handwritten text. (That decision was made by the historian in charge of the corpus, since the objective is to eventually make the corpus searchable.)
- sequences of line breaks are collapsed to keep at most one
- sequences of blanks are collapsed to keep at most one
- all text is lower-cased
- all accents are removed
- all control characters are normalized to a plain space
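
The steps above can be approximated with the Python standard library alone. This is a sketch, not the actual implementation used for the model: the function name `preprocess`, the order of the steps, and the choice to keep line breaks while replacing other control characters are all assumptions.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Approximate the listed preprocessing steps (step order is an assumption)."""
    # Unicode NFKC normalization.
    text = unicodedata.normalize("NFKC", text)
    # Drop the brackets annotators used to mark expanded abbreviations.
    text = re.sub(r"[()\[\]]", "", text)
    # Lower-case everything.
    text = text.lower()
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace control characters by a plain space (line breaks are kept here).
    text = "".join(
        " " if unicodedata.category(ch) == "Cc" and ch != "\n" else ch
        for ch in text
    )
    # Collapse runs of blanks, then runs of line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()


sample = "Gedaen (ende)  alzoo\n\n\ngepass[eer]t par la Grâce"
print(preprocess(sample))
# prints:
# gedaen ende alzoo
# gepasseert par la grace
```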

## Pretraining of the model

The model was then pretrained on both BERT objectives: masked language modeling (MLM) and next sentence prediction (NSP). To that end, it used the dataset
'arch-be/brabant-xvii' ("next_sentence"), which was crafted specifically for this purpose. As is standard practice in BERT training, 15% of the tokens were masked at training time.
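
As an illustration of what masking 15% of the tokens means, here is a small standard-library sketch. It assumes the 80/10/10 replacement rule of the original BERT recipe (80% `[MASK]`, 10% random token, 10% unchanged), which this card does not confirm, and it selects exactly 15% of the positions rather than sampling each token independently. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted.

    `vocab` must be a non-empty list of replacement candidates.
    """
    rng = random.Random(seed)
    # Select about 15% of the positions (at least one for non-empty input).
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK  # assumed: 80% replaced by [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # assumed: 10% replaced by a random token
        # assumed: remaining 10% keep the original token
    return masked, labels


tokens = "den tweesten dach der maendt van julio int jaer duijsent".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
print(masked)
print(labels)
```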


## Related Work

This model was trained as an experiment to see what would work best on the target corpus (pardons and sentences). Three options were considered for that purpose:
1. Pretraining a BERT model from scratch, which can leverage a tokenizer based on a vocabulary *emerging from the target corpus*
2. Fine-tuning all layers of a pretrained historical model (emanjavacas/GysBERT-v2) that mostly fits the target languages (not fully, though, as brabant-xvii comprises text in both ancient Dutch *and* ancient French)
3. Fine-tuning only the head of a pretrained historical model (emanjavacas/GysBERT-v2) that mostly fits the target languages.

### Important Note

The fine-tuning of the pretrained historical model (emanjavacas/GysBERT-v2) is fundamentally different from the pretraining of this foundation model. This model
was pretrained on both the MLM (masked language modeling) and the NSP (next sentence prediction) objectives, whereas the fine-tuning only retrains the
MLM objective (even when updating the weights of all layers is allowed). When pretraining on both MLM and NSP, the intuition is to let the model learn which pairs
of lines can plausibly follow one another in the corpus (hence enriching its internal representation of structure), in addition to teaching it to fill gaps in a text with
words from its vocabulary.
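
To make the NSP intuition concrete, the following sketch builds labeled line pairs from a single document: roughly half are true consecutive pairs, the rest pair a line with another, non-following line. How the "next_sentence" data was actually constructed is not described here, so this is an assumption-laden illustration; in particular, the original BERT recipe draws negatives from other documents, while this sketch draws them from the same one for simplicity. The function name `make_nsp_pairs` is hypothetical.

```python
import random


def make_nsp_pairs(lines, seed=0):
    """Build (line_a, line_b, is_next) examples from the lines of one document."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(lines) - 1):
        # Candidate negatives: any line that is neither line i nor its successor.
        candidates = [j for j in range(len(lines)) if j not in (i, i + 1)]
        if rng.random() < 0.5 or not candidates:
            # Positive example: line i + 1 really follows line i.
            pairs.append((lines[i], lines[i + 1], True))
        else:
            # Negative example: a line that does not directly follow line i.
            j = rng.choice(candidates)
            pairs.append((lines[i], lines[j], False))
    return pairs


doc = [
    "cornelia de ghijs",
    "joos de medraige.",
    "gedaen ende alzoo gepasseert inder stadt van bruessele",
    "den tweesten dach der maendt van julio int jaer",
    "duijsent vijffhondert tachtentich",
]
for line_a, line_b, is_next in make_nsp_pairs(doc):
    print(is_next, "|", line_a, "->", line_b)
```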

## First Observations

**Note:** The experiment is not complete yet, hence no conclusive results can be provided so far.

Option 1 (training a model from scratch) is what gave rise to _this_ very model. 
Option 2 (fine tuning all layers of a pretrained historical model) is what gave rise to