Why is the model struggling with pluralization of Swedish words?

#2
by tomsoderlund - opened

Great model, thank you for your hard work, KBLab! I'm struggling a bit though:

The string "Groens malmgård är en av Stockholms malmgårdar, belägen vid Malmgårdsvägen 53 på Södermalm i Stockholm."

generates the following output:

{
  "entities": [
    {
      "entity": "LOC",
      "score": 0.4733002483844757,
      "index": 1,
      "word": "Gro",
      "start": 0,
      "end": 3
    },
    {
      "entity": "LOC",
      "score": 0.7171409130096436,
      "index": 2,
      "word": "##ens",
      "start": 3,
      "end": 6
    },
    {
      "entity": "LOC",
      "score": 0.5204557180404663,
      "index": 3,
      "word": "malm",
      "start": 7,
      "end": 11
    },
    {
      "entity": "LOC",
      "score": 0.9973670840263367,
      "index": 8,
      "word": "Stockholms",
      "start": 25,
      "end": 35
    },
    {
      "entity": "LOC",
      "score": 0.9990962743759155,
      "index": 14,
      "word": "Malm",
      "start": 60,
      "end": 64
    },
    {
      "entity": "LOC",
      "score": 0.9991210103034973,
      "index": 15,
      "word": "##gårds",
      "start": 64,
      "end": 69
    },
    {
      "entity": "LOC",
      "score": 0.9988952875137329,
      "index": 16,
      "word": "##vägen",
      "start": 69,
      "end": 74
    },
    {
      "entity": "LOC",
      "score": 0.9987678527832031,
      "index": 17,
      "word": "53",
      "start": 75,
      "end": 77
    },
    {
      "entity": "LOC",
      "score": 0.9986610412597656,
      "index": 19,
      "word": "Södermalm",
      "start": 81,
      "end": 90
    },
    {
      "entity": "LOC",
      "score": 0.9985471367835999,
      "index": 21,
      "word": "Stockholm",
      "start": 93,
      "end": 102
    }
  ]
}

I'm surprised to find "Stockholms" as a word – is pluralization a problem?

tomsoderlund changed discussion title from Why is the model struggling with Swedish words, and what does `##` mean? to Why is the model struggling with pluralization of Swedish words?

(Update - replied to the original comment)

The documentation README says:

The BERT tokenizer often splits words into multiple tokens, with the subparts starting with ##, for example the string Engelbert kör Volvo till Herrängens fotbollsklubb gets tokenized as Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb.

There is also a short code example on how to glue them together.

Thank you @peterkz, I was too quick/lazy and missed it in the docs.

National Library of Sweden / KBLab org
edited Dec 17, 2022

Yes, it is a partial word (a subword token). The ## prefix means the token is a suffix, i.e. it continues the token before it.

If a token does not start with ##, it is either a complete word on its own, or the starting piece of a longer word (the latter is the case when the tokens at the following indices start with ##).
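
As a quick illustration, you can inspect the tokenization behind the output above directly (a minimal sketch; the checkpoint id is an assumption here, swap in the one you are actually using):

from transformers import AutoTokenizer

# Assumed checkpoint id for this model
tokenizer = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased-ner')

# One starting piece plus two '##' continuation pieces,
# matching the LOC entries in the JSON output above
print(tokenizer.tokenize('Malmgårdsvägen'))
# ['Malm', '##gårds', '##vägen']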

This model's card has an example of how to concatenate tokens back into words when they are split by the tokenizer:

from transformers import pipeline

# Model id assumed from the model card; adjust if your checkpoint differs
nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner',
               tokenizer='KB/bert-base-swedish-cased-ner')

text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
       'som spelar fotboll i VM klockan två på kvällen.'

# Merge each '##' continuation token back onto the token before it
l = []
for token in nlp(text):
    if token['word'].startswith('##'):
        l[-1]['word'] += token['word'][2:]
    else:
        l += [token]

print(l)

Transformer language models have fixed-size vocabularies. The wordpiece chunks that constitute the vocabulary are called "tokens". Generally, before pre-training a transformer language model, we first train a tokenizer to efficiently encode text in a given language into a fixed number of discrete vocabulary units (tokens).

The reason we need these is that we cannot feed words or raw text into the model directly; instead we feed in numeric vector representations (embeddings) of words or pieces of words (tokens). If every individual word and every inflection in a language were its own token, we would need millions of tokens, each with its own numeric representation, to cover the language's unique set of possible word units. Instead, we train a tokenizer with a fixed vocabulary size (this model, for example, has a vocabulary of about 52K tokens). Such a tokenizer serves three purposes (a short round-trip sketch follows the list):

  1. Efficiently encode as much text as possible into as few discrete token units as possible.
  2. Map each token to an integer, which is used to select the correct numeric vector representation of the token supplied as input to the model.
  3. Perform the reverse mapping from integer to text tokens once the model has processed the text.
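
Here is that round-trip sketch of points 2 and 3, using the same assumed checkpoint id as above (the example string and its tokenization are taken from the README quote earlier in this thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased-ner')  # assumed id

print(len(tokenizer))  # vocabulary size, about 52K tokens

ids = tokenizer.encode('Herrängens fotbollsklubb', add_special_tokens=False)
print(ids)                                   # tokens mapped to integers (point 2)
print(tokenizer.convert_ids_to_tokens(ids))  # ['Herr', '##ängens', 'fotbolls', '##klubb']
print(tokenizer.decode(ids))                 # integers mapped back to text (point 3)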

Video explanation: https://www.youtube.com/watch?v=qpv6ms_t_1A
