Language modeling
Language modeling predicts words in a sentence. There are two forms of language modeling.
Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left.
Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally.
This guide will show you how to fine-tune DistilGPT2 for causal language modeling and DistilRoBERTa for masked language modeling on the r/askscience subset of the ELI5 dataset.
You can fine-tune other architectures for language modeling such as GPT-Neo, GPT-J, and BERT, following the same steps presented in this guide!
See the text generation task page and fill mask task page for more information about their associated models, datasets, and metrics.
Load ELI5 dataset
Load only the first 5000 rows of the ELI5 dataset from the 🤗 Datasets library since it is pretty large:
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")
Split this dataset into a train and test set:
eli5 = eli5.train_test_split(test_size=0.2)
Then take a look at an example:
>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
'score': [6, 3],
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
'answers_urls': {'url': []},
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls': {'url': []}}
Notice text
is a subfield nested inside the answers
dictionary. When you preprocess the dataset, you will need to extract the text
subfield into a separate column.
Preprocess
For causal language modeling, load the DistilGPT2 tokenizer to process the text
subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
For masked language modeling, load the DistilRoBERTa tokenizer instead:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
Extract the text
subfield from its nested structure with the flatten
method:
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
'answers.score': [6, 3],
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
'answers_urls.url': [],
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls.url': []}
Each subfield is now a separate column as indicated by the answers
prefix. Notice that answers.text
is a list. Instead of tokenizing each sentence separately, convert the list to a string to jointly tokenize them.
Here is how you can create a preprocessing function to convert the list to a string and truncate sequences to be no longer than DistilGPT2’s maximum input length:
>>> def preprocess_function(examples):
... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map
function by setting batched=True
to process multiple elements of the dataset at once and increasing the number of processes with num_proc
. Remove the columns you don’t need:
>>> tokenized_eli5 = eli5.map(
... preprocess_function,
... batched=True,
... num_proc=4,
... remove_columns=eli5["train"].column_names,
... )
Now you need a second preprocessing function to capture text truncated from any lengthy examples to prevent loss of information. This preprocessing function should:
- Concatenate all the text.
- Split the concatenated text into smaller chunks defined by
block_size
.
>>> block_size = 128
>>> def group_texts(examples):
... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
... total_length = len(concatenated_examples[list(examples.keys())[0]])
... total_length = (total_length // block_size) * block_size
... result = {
... k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
... for k, t in concatenated_examples.items()
... }
... result["labels"] = result["input_ids"].copy()
... return result
Apply the group_texts
function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
For causal language modeling, use DataCollatorForLanguageModeling to create a batch of examples. It will also dynamically pad your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer
function by setting padding=True
, dynamic padding is more efficient.
You can use the end of sequence token as the padding token, and set mlm=False
. This will use the inputs as labels shifted to the right by one element:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
For masked language modeling, use the same DataCollatorForLanguageModeling except you should specify mlm_probability
to randomly mask tokens each time you iterate over the data.
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
You can use the end of sequence token as the padding token, and set mlm=False
. This will use the inputs as labels shifted to the right by one element:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
For masked language modeling, use the same DataCollatorForLanguageModeling except you should specify mlm_probability
to randomly mask tokens each time you iterate over the data.
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
Causal language modeling
Causal language modeling is frequently used for text generation. This section shows you how to fine-tune DistilGPT2 to generate new text.
Train
Load DistilGPT2 with AutoModelForCausalLM:
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
If you aren’t familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!
At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments.
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to fine-tune your model.
>>> training_args = TrainingArguments(
... output_dir="./results",
... evaluation_strategy="epoch",
... learning_rate=2e-5,
... weight_decay=0.01,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=lm_dataset["train"],
... eval_dataset=lm_dataset["test"],
... data_collator=data_collator,
... )
>>> trainer.train()
To fine-tune a model in TensorFlow, start by converting your datasets to the tf.data.Dataset
format with to_tf_dataset. Specify inputs and labels in columns
, whether to shuffle the dataset order, batch size, and the data collator:
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
... columns=["attention_mask", "input_ids", "labels"],
... dummy_labels=True,
... shuffle=True,
... batch_size=16,
... collate_fn=data_collator,
... )
>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
... columns=["attention_mask", "input_ids", "labels"],
... dummy_labels=True,
... shuffle=False,
... batch_size=16,
... collate_fn=data_collator,
... )
If you aren’t familiar with fine-tuning a model with Keras, take a look at the basic tutorial here!
Set up an optimizer function, learning rate, and some training hyperparameters:
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Load DistilGPT2 with TFAutoModelForCausalLM:
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
Configure the model for training with compile
:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)
Call fit
to fine-tune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
Masked language modeling
Masked language modeling is also known as a fill-mask task because it predicts a masked token in a sequence. Models for masked language modeling require a good contextual understanding of an entire sequence instead of only the left context. This section shows you how to fine-tune DistilRoBERTa to predict a masked word.
Train
Load DistilRoBERTa with AutoModelForMaskedlM
:
>>> from transformers import AutoModelForMaskedLM
>>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
If you aren’t familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!
At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments.
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to fine-tune your model.
>>> training_args = TrainingArguments(
... output_dir="./results",
... evaluation_strategy="epoch",
... learning_rate=2e-5,
... num_train_epochs=3,
... weight_decay=0.01,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=lm_dataset["train"],
... eval_dataset=lm_dataset["test"],
... data_collator=data_collator,
... )
>>> trainer.train()
To fine-tune a model in TensorFlow, start by converting your datasets to the tf.data.Dataset
format with to_tf_dataset. Specify inputs and labels in columns
, whether to shuffle the dataset order, batch size, and the data collator:
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
... columns=["attention_mask", "input_ids", "labels"],
... dummy_labels=True,
... shuffle=True,
... batch_size=16,
... collate_fn=data_collator,
... )
>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
... columns=["attention_mask", "input_ids", "labels"],
... dummy_labels=True,
... shuffle=False,
... batch_size=16,
... collate_fn=data_collator,
... )
If you aren’t familiar with fine-tuning a model with Keras, take a look at the basic tutorial here!
Set up an optimizer function, learning rate, and some training hyperparameters:
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Load DistilRoBERTa with TFAutoModelForMaskedLM:
>>> from transformers import TFAutoModelForMaskedLM
>>> model = TFAutoModelForCausalLM.from_pretrained("distilroberta-base")
Configure the model for training with compile
:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)
Call fit
to fine-tune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.