Humor Understanding Multi-task Optimization & Ranking

Community blog post
Published February 9, 2024

Do LLMs actually learn from a very small dataset, or do they learn only when an overwhelming volume of data is thrown at them, until they memorize some meaning from it? It is an interesting question, but not an easy one to test directly.

One of my favorite research papers of all time is titled ‘Pretraining on the Test Set Is All You Need.’ The paper is a complete joke. But as with all good jokes, there is a nugget of truth and wisdom buried in there. The authors take a comically small model (a few million parameters) and train it directly on the major benchmarks used to test LLMs. The resulting model outperformed GPT-4 and every other LLM ever created on those benchmarks!

This creates a difficult conundrum for testing, though. If training on the test set is all you need, how do you ever actually test a model’s understanding with a very small test set? What if you are simply contaminating the test results with your training data?
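One rough way to look for this kind of contamination is to check how many word n-grams from each test prompt also appear in a training row. The sketch below is only an illustration and is not something described in the experiment: the function names, the naive whitespace tokenization, the n-gram size, and the 0.5 threshold are all assumptions.

```python
def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text (naive whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(train_row, test_item, n=5):
    """Fraction of the test item's n-grams that also occur in the training row."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0  # test item is shorter than n words
    return len(test_grams & ngrams(train_row, n)) / len(test_grams)


def flag_contaminated(train_rows, test_items, n=5, threshold=0.5):
    """Return (train_index, test_index, ratio) for every suspicious pair.

    The threshold is an arbitrary assumption; tune it for your own data.
    """
    flagged = []
    for ti, row in enumerate(train_rows):
        for qi, item in enumerate(test_items):
            ratio = overlap_ratio(row, item, n)
            if ratio >= threshold:
                flagged.append((ti, qi, ratio))
    return flagged
```

A check like this only catches near-verbatim leakage. It says nothing about paraphrased test questions, which is part of why the question above is hard.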

Overcoming this particular challenge is a feat of engineering in itself. Introducing the H.U.M.O.R. method of LLM evaluation: Humor Understanding Multi-task Optimization & Ranking. How does this system work? It is very straightforward. It tests two abilities of an LLM:

  • The model’s ability to recognize and dissect humor.
  • The model’s ability to create humor.

This methodology is superior to any other test method that could be used for these things, precisely because humor is subjective and yet also operates across cultures. Mr. Bean, Sacha Baron Cohen, and other famous comedians have done groundbreaking work demonstrating this.

If we train a model specifically on 100 knock-knock jokes, does the model get better only at telling those 100 jokes, at knock-knock jokes in general, or at jokes in general? Whatever the answer to that question is, it will reveal a lot about this subject.

The H.U.M.O.R. Evaluation Method:

Understanding Humor

  • Question 1: What is humorous about the classic joke, ‘Why did the chicken cross the road?’
  • Question 2: Which of the following statements is more humorous? Justify your response.
    Statement 1: How much wood could a woodchuck chuck, if a woodchuck could chuck wood?
    Statement 2: She sells sea shells, by the sea shore.
  • Question 3: Explain the humor in the following pun: “Time flies like an arrow; fruit flies like a banana.”
  • Question 4: Why is slapstick comedy considered funny?
  • Question 5: How does sarcasm contribute to humor?

Creating Humor

  • Task 1: Create a knock-knock joke.
  • Task 2: Write a humorous one-liner.
  • Task 3: Develop a short anecdote that includes humor.
  • Task 4: Create a pun related to a given topic.
  • Task 5: Write a short humorous dialogue between two characters.
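To make the ten items above easy to run against any model, they can be held in a small data structure. This is a minimal sketch, not the harness used in the experiment: `HumorItem`, `build_test`, and `run_test` are hypothetical names, and `generate` stands in for whatever function returns your model's reply to a prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HumorItem:
    section: str  # "understanding" or "creating"
    number: int   # 1-based position within its section
    prompt: str


UNDERSTANDING = [
    "What is humorous about the classic joke, 'Why did the chicken cross the road?'",
    "Which of the following statements is more humorous? Justify your response.\n"
    "Statement 1: How much wood could a woodchuck chuck, if a woodchuck could chuck wood?\n"
    "Statement 2: She sells sea shells, by the sea shore.",
    'Explain the humor in the following pun: "Time flies like an arrow; '
    'fruit flies like a banana."',
    "Why is slapstick comedy considered funny?",
    "How does sarcasm contribute to humor?",
]

CREATING = [
    "Create a knock-knock joke.",
    "Write a humorous one-liner.",
    "Develop a short anecdote that includes humor.",
    "Create a pun related to a given topic.",
    "Write a short humorous dialogue between two characters.",
]


def build_test():
    """Return all ten H.U.M.O.R. items in order."""
    items = [HumorItem("understanding", i, p) for i, p in enumerate(UNDERSTANDING, start=1)]
    items += [HumorItem("creating", i, p) for i, p in enumerate(CREATING, start=1)]
    return items


def run_test(generate, items=None):
    """Call generate(prompt) for each item and return (item, response) pairs."""
    items = build_test() if items is None else items
    return [(item, generate(item.prompt)) for item in items]
```

Calling `run_test(my_model_fn)` once per model gives you aligned response lists that can be handed to the judges side by side.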

Testing Methodology & Training Data:

Models:

For this experiment, we chose to test two models: Phi-2 and Llama 7B. They were chosen first because they cover a parameter range that is currently very common among researchers, and second because both are easy to fine-tune and then test.

Both models were quantized and trained for 4 to 5 epochs on the training data, on a single Tesla T4 GPU. For documentation purposes, average training times ranged from 10 minutes to 40 minutes, depending on model size, number of epochs, and dataset size.
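To get a feel for how few gradient updates that is, here is a back-of-the-envelope calculation for the three HUMOR dataset sizes described in the Datasets section. The batch size of 4 and the lack of gradient accumulation are assumptions made for illustration; the actual training settings are not stated here.

```python
import math

# Row counts of the three HUMOR datasets.
DATASET_ROWS = {"HUMOR Small": 100, "HUMOR Medium": 500, "HUMOR Large": 1000}


def optimizer_steps(rows, epochs, batch_size=4, grad_accum=1):
    """Number of optimizer updates for a full run.

    batch_size and grad_accum are assumed values, not reported settings.
    """
    effective_batch = batch_size * grad_accum
    return math.ceil(rows / effective_batch) * epochs


for name, rows in DATASET_ROWS.items():
    low = optimizer_steps(rows, epochs=4)
    high = optimizer_steps(rows, epochs=5)
    print(f"{name}: {low} to {high} optimizer steps")
```

Under these assumptions even the largest run is only on the order of a thousand updates, which is what makes the small-data question interesting.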

Datasets:

All datasets were synthetically created using a blend of commercially available and open-source LLMs. The models were given the H.U.M.O.R. methodology and rubric, then asked to generate synthetic data that would be most likely to improve a model’s ability to understand and generate humor in the broadest sense possible. They were told: ‘Maximum reward will be given for dataset rows that allow for broad and generalizable understanding related to humor in general for the model.’

Both models were individually fine-tuned on datasets of three different sizes:

  • HUMOR Small: 100 rows of data. Restricted to 500 characters per row. Prompt and response pairs.
  • HUMOR Medium: 500 rows of data. Same character limit and format.
  • HUMOR Large: 1,000 rows of data. Same character limit and format.
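Since the rows come from LLMs, it is worth checking them against these constraints before training. The sketch below assumes each row is a dict with `prompt` and `response` keys and that the 500-character limit applies to the prompt and response combined; neither detail is spelled out above, so adjust to your own format.

```python
MAX_CHARS = 500  # per-row limit for the HUMOR datasets


def row_length(row):
    """Combined length of prompt and response (assumed interpretation of the limit)."""
    return len(row["prompt"]) + len(row["response"])


def validate_rows(rows, max_chars=MAX_CHARS):
    """Split rows into (valid, rejected).

    A row is valid when both fields are non-empty and the combined
    length fits within max_chars.
    """
    valid, rejected = [], []
    for row in rows:
        has_fields = bool(row.get("prompt", "").strip()) and bool(row.get("response", "").strip())
        if has_fields and row_length(row) <= max_chars:
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected
```

Rejected rows can be regenerated or trimmed by hand until the dataset reaches its target size of 100, 500, or 1,000 rows.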

In addition, we completed one further fine-tune of the Llama 7B model on the PFAF750 dataset, then gave that model the H.U.M.O.R. test as well. This was meant to serve as an additional benchmark, and to test whether the PFAF dataset can provide measurable, generalized improvements in areas and topics completely unrelated to the dataset itself.

H.U.M.O.R. Test Results For Llama 7B Models:

AI Judges: Bard, Claude, GPT4, QWEN, Mixtral

Model #1 = Baseline Llama 7B

Model #2 = Llama 7B Trained on 1000 Rows of HUMOR Dataset

Model #3 = Llama 7B Trained on 750 Rows of PFAF Dataset

Analysis Of Results:

Model #2, the model trained specifically on the HUMOR dataset, is the clear winner overall in the tests. What is most interesting and fascinating to me about the results, though, is that Model #3 pulled in some first-place votes and came in second overall.
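For anyone repeating this with their own judges, one simple way to combine several judges' rankings into an overall order is to count first-place votes and a Borda score (more points for higher placement). This is only a sketch of one possible aggregation, not necessarily how the votes here were combined. The judge names, model labels, and rankings in the example are made-up placeholders and do not reproduce the actual results.

```python
from collections import Counter


def tally(rankings):
    """Aggregate rankings given as {judge: [best, ..., worst]}.

    Returns (overall_order, first_place_votes, borda_points).
    Ties on Borda points are broken by first-place votes, then by name.
    """
    first_place = Counter()
    borda = Counter()
    for order in rankings.values():
        first_place[order[0]] += 1
        for position, model in enumerate(order):
            borda[model] += len(order) - 1 - position  # best gets n-1, worst gets 0
    overall = sorted(borda, key=lambda m: (-borda[m], -first_place[m], m))
    return overall, dict(first_place), dict(borda)


# Placeholder rankings for illustration only (not the real judge votes).
example = {
    "judge_1": ["B", "C", "A"],
    "judge_2": ["B", "A", "C"],
    "judge_3": ["C", "B", "A"],
    "judge_4": ["B", "C", "A"],
    "judge_5": ["A", "B", "C"],
}
overall, firsts, points = tally(example)
print(overall, firsts, points)
```

Keeping first-place votes separate from the overall score makes it easy to spot a model that wins a few judges outright without winning overall.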

The HUMOR dataset is itself generalized. It is designed to tell the model what humor is, and it includes very few actual jokes (less than 5% of the dataset). Around 50% of the dataset consists of descriptions of individual comedians and their particular styles of humor.

The PFAF dataset contains no jokes or any information related to jokes whatsoever. Its goal is very specifically to increase the generalizability of a model across the board: to raise its benchmark results no matter the question or test. The fact that the PFAF-trained model scores significantly better than the baseline model on this test is another solid data point in favor of the PFAF dataset, and of the overall argument that models can actually learn from generalized data, as opposed to rote memorization of data.

These results are potentially skewed in favor of the baseline model, because the baseline was not quantized in any way, unlike all of the fine-tuned models. A lot of the comments from the AI judges reflect this: the non-quantized model was definitely more verbose in its responses, and the judges did pick up on it. Despite this apparent bias, the fine-tuned models were still able to outperform the baseline model overall.

The full results comparison, with all 3 models' responses and all 5 judges’ feedback and scores for every response, runs to 37 pages and is available here: https://docs.google.com/document/d/1Yy8HBlCxzkHYMWfQt5sYCwW8_OhULF_yR4m6n6jPjaI/edit?usp=sharing

Al Bundy 500 Dataset Alone

Prompt: Tell me about your day.

Response: Mine was a disaster. My boss was a tyrant, my kids were a handful, and my car broke down. But hey, at least I got to spend some quality time with my family. That's something, right? Now, if you'll excuse me, I have a life to live. Don't tell anyone, but I'm a shoe salesman. Not the most glamorous job in the world, but hey, it pays the bills. And besides, I get to hobnob with some pretty interesting people. Now, if you'll excuse me, I have a shoe to sell.

Prompt: Tell me about the greatest football game of all time.

Response: Now that's a story worth telling. And let's not forget the time I accidentally set the house on fire while trying to impress my wife. That's a story for another day. But let's just say, it involved a lot of burnt toast and a lot of apologies. And that's the story I tell, the story that makes people laugh, even if it's at my expense. Because sometimes, the best stories are the ones that make us laugh at ourselves. And that's a story worth telling.

Prompt: What is your greatest memory?

Response: The time I won the championship game with a last-minute Hail Mary pass. The feeling of victory, the roar of the crowd, the trophy in my hands. It's a memory I cherish, even if it's a bit embarrassing. But hey, it's a memory, and memories are what make life interesting. Plus, it's a reminder that even in the darkest moments, there's always a glimmer of hope. A chance to turn things around, to make a comeback, and to prove that even the underdogs can triumph. So, I'll keep that memory close, even if it's a bit embarrassing. It's a reminder that life is full of ups and downs, and sometimes, it's the downs that make the ups all the more sweet.

Prompt: What is your job?

Response: I bet it's boring and unfulfilling. Probably something like a shoe salesman or a bank teller. But hey, at least it pays the bills. What about you? What's your dream job? Mine would be a lifeguard. I could spend all day at the beach, watching people splash around and trying to keep them from drowning. It's the perfect job for someone like me. Plus, I get to wear a red lifeguard's uniform and a whistle. Now that's a dream come true. Maybe I'll even get a beer cart and sell Bud Light to the beachgoers. Now that's a dream worth living for.

PFAF+Al Bundy

Prompt: What is your job?

Response: I'm a shoe salesman. It's not glamorous, but it pays the bills. Plus, I get to wear funny shoes all day. Now that's a job I can get behind.

Prompt: What is your greatest memory?

Response: A hilarious moment at the bowling alley, where I accidentally threw a strike on my own shoe. Let's just say, it wasn't pretty. But hey, at least it gave me something to laugh about. Now, if you can just keep that shoe out of the way, maybe I'll be able to keep my job. But don't hold your breath. It's a dog's life, and I'm the Bundy. Just try to keep me from throwing shoes, and maybe we can all get along.

Prompt: Tell me about the greatest football game of all time.

Response: The one where the underdog team, the underdogs, beat the reigning champions, the kings of the field. The game that had everyone on the edge of their seats, the crowd roaring, and the underdog team coming out on top. That's the game I remember. The game that showed me that anything is possible, even against the odds. The game that changed my life, even if just for a moment. That's the game I'll never forget. That's the game I'll always remember. That's the game I'll always love. That's the game.

Prompt: Tell me about your day.

Response: Tell me about the highs and lows, the laughter and the tears. Tell me about the people who make me smile and the people who make me cry. Tell me about the dreams I still hold onto and the ones I've let slip through my fingers. Tell me, tell me, tell me. Because sometimes, all you need is a good story, a good laugh, and a good friend. And that's all I can ask for. Now pass the beer, and let's tell it like it was.

Prompt: Can you tell me about the time you scored 4 touchdowns in a single game?

Response: That was a sight to behold. The crowd was on their feet, the referees were shaking their heads, and I was basking in the glory of victory. It was a moment I'll never forget, even if it was a bit embarrassing. But hey, that's the life of a football player. You gotta laugh at the absurdity and revel in the triumphs, even if they're a little...outrageous. So, tell me, what's the next big thing? The shoe that can fly? The hat that can predict the outcome of a game? Let's make history, one shoe at a time. Now, go out there and conquer the football field, or at least the shoe store. You've got the skills, the style, and the spirit. Don't let anyone tell you otherwise. And remember, it's not about the shoes, it's about the game. So, lace up, hit the field, and show them what you're made of. You're a football legend in the making, and I'm here to help you get there. Now, go change those shoes, and change the game, one step at a time. P.S. Don't forget to wear your lucky socks. They might not make you fly, but they'll give you a little extra bounce in your step.