
The Office Scene Generation Model

This repository contains a fine-tuned GPT-2 model for generating text in the style of the TV show "The Office". The goal of the project was to imitate the speaking styles of different characters from the show.

Model Details

The model is based on the GPT-2 language model and was fine-tuned using the Transformers library from Hugging Face.

The model was trained on a dataset of scripts from the TV show "The Office". The dataset contains over 200 episodes of the show, and the model was trained to predict the next word in each line of dialogue. This training process allows the model to learn the patterns and style of language used by the different characters in the show.
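A minimal sketch of how such a fine-tuning run could look with the Transformers Trainer API is shown below. The file name, block size, and hyperparameters are illustrative assumptions, not the exact settings used for this model.

import torch
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, Trainer,
                          TrainingArguments, TextDataset,
                          DataCollatorForLanguageModeling)

# Start from the pretrained GPT-2 checkpoint
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Assumed text file with one line of dialogue per line (see the Data section)
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="the_office_lines.txt",
                            block_size=128)

# Causal language modeling, so masked LM is disabled
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Illustrative hyperparameters only
training_args = TrainingArguments(output_dir="office-gpt2",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8)

trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset)
trainer.train()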

Data

The data was retrieved from data.world. The dataset includes the scripts for all 9 seasons of the show, as well as episode summaries and character information.
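As an illustration, the dialogue lines from such a scripts dataset could be flattened into one "Character: line" string per row before fine-tuning. The CSV file name and column names below are assumptions about the data.world export, not the actual schema.

import pandas as pd

# Assumed export of the scripts dataset with "speaker" and "line_text" columns
df = pd.read_csv("the-office-lines.csv")

# Write one "Character: line" string per row of dialogue
with open("the_office_lines.txt", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        f.write(f"{row['speaker']}: {row['line_text']}\n")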

Usage

To use the model, you can load it using the from_pretrained method of the GPT2LMHeadModel class from the transformers library. Here's an example code snippet:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained("path/to/your/model")
tokenizer = GPT2Tokenizer.from_pretrained("path/to/your/tokenizer")

# Encode a prompt and add a batch dimension
prompt = "Dwight walks into the office and says"
input_ids = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

# Sample several continuations from the model
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    top_k=50,
    top_p=0.999,
    temperature=0.7,
    max_length=100,
    num_return_sequences=5,
)

# Decode and print each generated sequence
for sample_output in sample_outputs:
    print(tokenizer.decode(sample_output, skip_special_tokens=True))
    print("\n")

In this example, we first load the fine-tuned GPT-2 model and the corresponding tokenizer. We then generate five samples using the generate method of the model, passing in a prompt and setting the maximum length of the generated text to 100 tokens. Finally, we decode each output with the tokenizer and print the generated text.

Project Source

The whole project can be found on GitHub.

Credits

The Office Scene Generation Model was fine-tuned by Moritz Schlager and Timo Heiss using the Transformers library from Hugging Face. The original GPT-2 model was developed by OpenAI. The training data was sourced from the TV show "The Office".
