NLP-CSI_5180 / app.py

MattyTheBoi

Updated app

74f615f about 2 months ago


	import gradio as gr
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from huggingface_hub import login

	# Markdown content for the first tab
	About_content = """
	# PromoModel

	PromoModel is a Python application that generates WWE Superstar promo text using fine-tuned models hosted on the Hugging Face Hub and loaded with the Transformers library. The application is designed to work with three model families: GPT-2, Gemma, and Llama.

	## Features

	- Model Selection: You can choose among several models: the base GPT-2, Gemma, and Llama models, as well as fine-tuned GPT-2 models trained specifically on WWE Superstar promo data.

	- Promo Generation: You can generate promos given a prompt. The prompt is encoded and passed to the model, which generates the text. The generated text is then decoded and returned.

	## Usage

	To use the application, select the desired model and input a prompt. The application will generate a promo based on your prompt.

	For example, if you select the GPT-2 model and input the prompt "Here's what's going to happen brother. ", the application will generate a promo based on this prompt.

	## Note

	This application was designed as the final project for the course CSI 5180, W'24 at Oakland University. The application was created by Matthew Horvath. For more information, please feel free to contact him at mhorvath@oakland.edu. The code for this application can be found on [GitHub](https://github.com/MattyTheBoi/NLP/tree/main/NLP_app).
	"""
	Dataset_content = """
	# Dataset

	The fine-tuned GPT-2 model used in this application was trained on promo transcripts from several professional wrestlers, listed in the table below. These transcripts were obtained from [Cagematch](https://www.cagematch.net/). Each transcript contains the parts the wrestler speaks as well as parts spoken by others, and the data was processed with this in mind. Additionally, differently sized embedded chunks of the promo text were used for training. For further details, please refer to the training tab.

	| Wrestler | Number of Transcripts | Link |
	| --- | --- | --- |
	| John Cena | 100 | [Link](https://www.cagematch.net/?id=2&nr=691&page=6) |
	| The Rock | 70 | [Link](https://www.cagematch.net/?id=2&nr=960&page=6) |
	| Stone Cold Steve Austin | 36 | [Link](https://www.cagematch.net/?id=2&nr=635&page=6) |
	| CM Punk | 93 | [Link](https://www.cagematch.net/?id=2&nr=80&page=6) |
	| Randy Orton | 68 | [Link](https://www.cagematch.net/?id=2&nr=998&page=6) |
	| Shawn Michaels | 62 | [Link](https://www.cagematch.net/?id=2&nr=796&page=6) |
	| Triple H | 100 | [Link](https://www.cagematch.net/?id=2&nr=496&page=6) |
	"""

	How_i_did_it_content = """
	# How it was Built

	The PromoModel application was built using the Hugging Face Transformers library. It uses pre-trained models and fine-tunes them on promotional text from professional wrestlers.

	## Model Loading

	The application can load one of three pre-trained models: GPT-2, Gemma, or Llama. The model is loaded from a specified directory and moved to the GPU for faster processing. If the model's tokenizer does not have a pad token, the end-of-sequence (EOS) token is used as the pad token.
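
The pad-token fallback can be illustrated with a minimal sketch. The helper name `ensure_pad_token` and the stand-in tokenizer object are hypothetical and are not taken from the project's code; a real Hugging Face tokenizer exposes the same `pad_token` and `eos_token` attributes.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    # Hypothetical helper: reuse the EOS token for padding when none is set.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in object with the two attributes the helper touches (an assumption
# for illustration; a real tokenizer would come from AutoTokenizer).
toy_tokenizer = SimpleNamespace(pad_token=None, eos_token="<|endoftext|>")
ensure_pad_token(toy_tokenizer)
print(toy_tokenizer.pad_token)
```

A tokenizer that already defines a pad token is left unchanged by this helper.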

	## Data Preparation

	The application prepares the data for training by tokenizing the promotional text and encoding it into input IDs. The labels for training are the same as the inputs, as this is a language modeling task. For specific details, please refer to the training tab.
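
As a rough sketch of why the labels equal the inputs, the example below builds one training example from a list of token IDs. The function name `build_lm_example` and the token IDs are made up for illustration; Hugging Face causal language models shift the labels internally so that each position predicts the next token.

```python
def build_lm_example(input_ids):
    # For causal language modeling the labels are a copy of the inputs.
    return {"input_ids": list(input_ids), "labels": list(input_ids)}


example = build_lm_example([464, 3776, 318, 319])  # made-up token IDs
print(example)
```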

	## Promo Loading

	The application loads promotional text from .txt files in a specified directory. The text is cleaned and tokenized into sentences and words. The words are then grouped into chunks of 200 words each, which are used as the promotional text for training.
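
The chunking step can be sketched as follows. This is a simplified illustration that splits on whitespace instead of using the cleaning and tokenization described above, and the function name `chunk_words` is hypothetical.

```python
def chunk_words(text, chunk_size=200):
    # Split on whitespace, then group the words into fixed-size chunks.
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# 450 placeholder words yield chunks of 200, 200, and 50 words.
chunks = chunk_words(" ".join(["word"] * 450))
print([len(chunk.split()) for chunk in chunks])  # [200, 200, 50]
```

The final chunk is shorter whenever the word count is not a multiple of the chunk size.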

	## Training

	The application trains the model on the promotional text using the Hugging Face Trainer class. The training arguments, such as the number of epochs, batch size, and learning rate, can be customized. The model uses gradient checkpointing if it supports it, and mixed precision training is enabled for efficiency.

	## Promo Generation

	The application can generate promotional text given a prompt. The prompt is encoded and passed to the model, which generates the text. The generated text is then decoded and returned.

	## Model Saving and Pushing

	The application can save the trained model to a specified directory. It can also push the model to the Hugging Face Model Hub.

	## Usage

	To use the application, create an instance of the PromoModel class with the desired model name, version, and directory for loading promos. Then, call the appropriate methods to load promos, prepare data, train the model, and generate promotional text.
	"""
	Training_metrics = """
	# Training Runs

	The PromoModel application uses Hugging Face's `Trainer` class to train the models. The training process involves several steps:

	## Data Preparation

	The application prepares the data for training by tokenizing the promotional text and encoding it into input IDs. The labels for training are the same as the inputs, as this is a language modeling task.

	## Training

	The application trains the model on the promotional text using the Hugging Face Trainer class. The training arguments, such as the number of epochs, batch size, and learning rate, can be customized.

	### Training Arguments

	The training arguments used are:

	- `num_train_epochs`: Number of training epochs.
	- `per_device_train_batch_size`: Training batch size per device.
	- `per_device_eval_batch_size`: Evaluation batch size per device.
	- `warmup_steps`: Number of warmup steps.
	- `weight_decay`: Weight decay.
	- `logging_dir`: Directory for storing logs.
	- `gradient_accumulation_steps`: Number of steps to accumulate gradients before updating the model parameters.
	- `fp16`: Enables mixed precision training if set to True.
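
To make the interaction between batch size and gradient accumulation concrete, here is a small worked example. The epoch count, batch size, learning rate, and accumulation steps follow the first row of the run table; the paths, warmup steps, weight decay, and evaluation batch size are assumed placeholder values, not the settings actually used.

```python
# Keyword arguments in the shape accepted by transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "./results",            # assumed path
    "num_train_epochs": 3,                # from the first table row
    "per_device_train_batch_size": 1,     # from the first table row
    "per_device_eval_batch_size": 1,      # assumed
    "learning_rate": 5e-5,                # from the first table row
    "warmup_steps": 500,                  # assumed
    "weight_decay": 0.01,                 # assumed
    "logging_dir": "./logs",              # assumed path
    "gradient_accumulation_steps": 12,    # from the first table row
    "fp16": True,
}

# Gradients are accumulated over several small batches before each update,
# so the effective batch size per device is the product of the two settings.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 12
```

With the Transformers library installed, a dictionary like this could be passed as `TrainingArguments(**training_kwargs)`.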

	## Promo Generation

	The application can generate promotional text given a prompt. The prompt is encoded and passed to the model, which generates the text. The generated text is then decoded and returned. The generation parameters can be adjusted to control the randomness and diversity of the generated text:

	- `max_length`: The maximum length of the generated text.
	- `do_sample`: Whether to sample the next token randomly according to its probability distribution. If False, the token with the highest probability is always chosen.
	- `temperature`: Controls the randomness of the token sampling process. Higher values make the output more random, while lower values make it more deterministic.
	- `num_return_sequences`: The number of sequences to return.
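
The effect of `temperature` can be seen in a small standalone sketch. It uses plain Python with made-up logits and does not call the model; the function name `softmax_with_temperature` is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then normalize with a softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.0)
print(max(sharp) > max(flat))  # True: lower temperature favors the top token
```

Sampling from the sharper distribution picks the most likely token more often, which is why lower temperatures produce more deterministic text.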

	## Training Run Data
	Below is a table showing a selection of the training runs that were considered noteworthy. The output of each run was evaluated by inspecting the generated text, and the same input prompt was used for every run: "This weekend at Wrestlemania,".
	- All data from training is available on Weights & Biases [here](https://wandb.ai/mattytheboiwork/Promo_Generator_testing?nw=nwusermattytheboi).
	- The data for the runs in the table can be seen [here](https://wandb.ai/mattytheboiwork/Promo%20Generator?nw=nwusermattytheboi).
	<style>
	table {
	width: 100%;
	}
	th, td {
	text-align: right;
	padding: 8px;
	}
	</style>
	<table>
	<tr>
	<th>Run</th>
	<th>Model</th>
	<th>Epochs</th>
	<th>Batch Size</th>
	<th>Learning Rate</th>
	<th>Gradient Accumulation Steps</th>
	<th>Max Length</th>
	<th>Temperature</th>
	<th>Repetition Penalty</th>
	<th>Output</th>
	</tr>
	<tr>
	<td>1</td>
	<td>GPT2</td>
	<td>3</td>
	<td>1</td>
	<td>5e-5</td>
	<td>12</td>
	<td>25</td>
	<td>0.7</td>
	<td>1.0</td>
	<td>Bad</td>
	</tr>
	<tr>
	<td>2</td>
	<td>GPT2</td>
	<td>3</td>
	<td>1</td>
	<td>1e-4</td>
	<td>12</td>
	<td>50</td>
	<td>0.7</td>
	<td>1.0</td>
	<td>Bad</td>
	</tr>
	<tr>
	<td>3</td>
	<td>GPT2</td>
	<td>4</td>
	<td>1</td>
	<td>1e-4</td>
	<td>16</td>
	<td>150</td>
	<td>0.8</td>
	<td>1.0</td>
	<td>Decent</td>
	</tr>
	<tr>
	<td>4</td>
	<td>GPT2</td>
	<td>1</td>
	<td>1</td>
	<td>5e-4</td>
	<td>16</td>
	<td>200</td>
	<td>0.9</td>
	<td>1.1</td>
	<td>Understandable</td>
	</tr>
	<tr>
	<td>5</td>
	<td>GPT2</td>
	<td>1</td>
	<td>1</td>
	<td>5e-4</td>
	<td>16</td>
	<td>250</td>
	<td>0.9</td>
	<td>1.1</td>
	<td>Respectable</td>
	</tr>
	<tr>
	<td>1</td>
	<td>GPT2</td>
	<td>3</td>
	<td>1</td>
	<td>5e-5</td>
	<td>12</td>
	<td>25</td>
	<td>0.7</td>
	<td>1.1</td>
	<td>Pretty Good</td>
	</tr>
	</table>
	"""
	# Function for generating output in the "Generate Output" tab
	def generate_promo(model_name, prompt, max_length=50, temp=0.9):
	    # Load the tokenizer and model that match the dropdown selection
	    if model_name == 'PromoGeneratorGPT2 Model Run 1':
	        tokenizer = AutoTokenizer.from_pretrained("MattyTheBoi/promo_generator_GPT2", revision="ce9eaf86d992b06afdad00ab2614b000c5069bf5")
	        model = AutoModelForCausalLM.from_pretrained("MattyTheBoi/promo_generator_GPT2", revision="7093ec3bdc4fc86f17f3ebaece1b9afa4c0a7343")
	    elif model_name == 'PromoGeneratorGPT2 Model Run 5':
	        tokenizer = AutoTokenizer.from_pretrained("MattyTheBoi/promo_generator_GPT2", revision="4f7f3fd27c3b19bcbc9014f5416ccfd2d4ff9f9a")
	        model = AutoModelForCausalLM.from_pretrained("MattyTheBoi/promo_generator_GPT2", revision="a0a8296cb1be979a75d35fe6aebc91594dc922b9")
	    elif model_name == 'PromoGeneratorGPT2 Model Run 7':
	        tokenizer = AutoTokenizer.from_pretrained("MattyTheBoi/promo_generator_GPT2", revision="8f9a9901cd5a456c7f239ee8e838cea6468a3b6c")
	        model = AutoModelForCausalLM.from_pretrained("MattyTheBoi/promo_generator_GPT2", revision="4732424b865440a8fec086e1e25936ab0a578973")
	    elif model_name == 'GPT2 Base Model':
	        tokenizer = AutoTokenizer.from_pretrained("gpt2")
	        model = AutoModelForCausalLM.from_pretrained("gpt2")
	    elif model_name == 'Gemma Base Model':
	        tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
	        model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
	    elif model_name == 'Llama Base Model':
	        tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
	        model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
	    else:
	        # Fail early instead of hitting an unbound tokenizer/model below
	        raise ValueError(f"Unknown model: {model_name}")

	    # Encode the prompt
	    print("Encoding the prompt")
	    encoded_prompt = tokenizer.encode(prompt, return_tensors='pt')

	    # Generate the promo (the slider may deliver a float, so cast the length)
	    print("Generating the promo")
	    generated = model.generate(encoded_prompt, max_length=int(max_length), do_sample=True, temperature=temp, repetition_penalty=1.1)

	    # Decode the generated promo
	    print("Decoding the generated promo")
	    generated_promo = tokenizer.decode(generated[0], skip_special_tokens=True)

	    return generated_promo

	# Create the Gradio interface
	demo = gr.Blocks()

	with demo:
	    with gr.Tabs(elem_classes="tab-buttons") as tabs:
	        # First tab with markdown content
	        with gr.TabItem("About", elem_id="about-tab", id=0):
	            gr.Markdown(About_content, elem_classes="markdown-text")

	        with gr.TabItem("Dataset", elem_id="dataset-tab", id=1):
	            gr.Markdown(Dataset_content, elem_classes="markdown-text")
	        with gr.TabItem("How it was Built", elem_id="training-tab", id=3):
	            gr.Markdown(How_i_did_it_content, elem_classes="markdown-text")
	        with gr.TabItem("Training Runs", elem_id="training-runs-tab", id=4):
	            gr.Markdown(Training_metrics, elem_classes="markdown-text")
	        with gr.TabItem("Generate Output", elem_id="generate-output-tab", id=5):
	            with gr.Column():
	                max_lenth = gr.Slider(minimum=50, maximum=500, step=1, label="Response Length")
	                temp = gr.Slider(minimum=0.1, maximum=1.0, step=0.1, label="Temperature")
	                model = gr.Dropdown(choices=['PromoGeneratorGPT2 Model Run 1', 'PromoGeneratorGPT2 Model Run 5', 'PromoGeneratorGPT2 Model Run 7', 'GPT2 Base Model', 'Gemma Base Model', 'Llama Base Model'], label="Model")
	                prompt = gr.Textbox(lines=3, label="Prompt")
	                iface = gr.Interface(fn=generate_promo, inputs=[model, prompt, max_lenth, temp], outputs="text")

	demo.launch()