2.1.3.2 Language Neural Networks

The first step in representing language input in a format compatible with NNs is to convert units of language, i.e., words, characters, or "tokens" depending on the tokenizer, into numerical vectors. This is done by means of embeddings, which are typically learned as part of the training process and represent the meaning of words in a continuous vector space.

There have been multiple generations of word embeddings. The earliest are one-hot vectors, which represent each word by a vector of zeros with a single one at its vocabulary index; this representation depends strongly on the tokenizer used and does not capture semantic relationships between words. An alternative are frequency-based embeddings, such as TF-IDF vectors, which represent each word by its frequency in a document, weighted by its inverse document frequency across the corpus; these capture some lexical semantics, but not the context in which a word appears. The next generation are Word2Vec embeddings, which are trained to predict the context of a word, i.e., the words that appear before and after it in a sentence. FastText embeddings improve on this by representing a word through its character n-grams, i.e., sequences of n characters, so that vectors for rare or unseen words can be composed from subword units. The current generation are contextual word embeddings, which compute a representation for each word occurrence based on its surrounding context and thereby learn the sense of a word from that context, e.g., 'bank' as a river bank vs. a financial institution in 'Feliz sits at the bank of the river Nete'.

Another important innovation is subword tokenization, which addresses the out-of-vocabulary (OOV) problem and is particularly relevant for morphologically rich languages, such as Dutch, where the meaning of a word can often be inferred from its subwords. A clever instance is byte pair encoding (BPE) [412], adapted from a data compression algorithm that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte; in its NLP adaptation, the most frequent pair of symbols is iteratively merged into a new subword token until a predefined vocabulary size is reached. This is particularly useful for multilingual models, where a word-level vocabulary would otherwise be too large to fit in memory.

The first embedding layer is typically a lookup table that maps each word to a unique index in a vocabulary and each index to a vector of real numbers. The embedding layer is typically followed by a recurrent, convolutional, or attention layer, which captures the sequential nature of language. Recurrent Neural Networks (RNNs) and recurrent architectures extended to model long-range dependencies, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were long the dominant architectures for sequence modeling in NLP, but they have been superseded by Transformers in recent years.
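To illustrate the static-embedding generations described above, the following sketch trains Word2Vec and FastText on a toy corpus using gensim (assuming a gensim 4.x installation); the corpus, dimensions, and window size are arbitrary illustrative choices, and the FastText query shows how an unseen word still receives a vector composed from its character n-grams.

```python
from gensim.models import Word2Vec, FastText

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["the", "bank", "of", "the", "river", "nete"],
    ["the", "bank", "approved", "the", "loan"],
    ["the", "river", "flooded", "the", "lower", "fields"],
] * 50

# Word2Vec: learns one vector per word by predicting its surrounding words.
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=20)
print(w2v.wv.most_similar("bank", topn=3))

# FastText: builds word vectors from character n-grams, so even a word that
# never occurred in the corpus (OOV) still receives a vector.
ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(ft.wv["riverbank"][:5])  # OOV word, composed from its n-grams
```

Note that both models assign a single static vector per word type, so the two senses of 'bank' above collapse into one representation; this is precisely the limitation that contextual embeddings address.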
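The BPE merge procedure itself can be sketched in a few lines. The implementation below is a minimal, illustrative version of the iterative merging described above: it omits the end-of-word marker and the encoding step that applies learned merges to new text, and all names are my own.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a list of words (minimal sketch)."""
    # Represent every word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair by a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent substrings such as 'est' quickly become single subword tokens.
corpus = ["lower", "lowest", "newer", "newest"] * 10
print(learn_bpe(corpus, num_merges=5))
```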