SemScore: Evaluating LLMs with Semantic Similarity

Published March 9, 2024

Accurately assessing the performance of Large Language Models (LLMs) is crucial but hard. Current evaluation methods come with significant limitations:

  • Human evaluation, like in the LMSYS Arena, is the gold standard, but slow
  • Benchmarks like MMLU can be cheated
  • Using another LLM like GPT-4 as a judge is expensive and might be biased

This post explores SemScore, a recently introduced method that evaluates LLMs by looking at the semantics of their answers.

This blog post introduces the idea, explains why it might be useful, and shows how it can be applied to your own models and training runs.

Part 1: What is SemScore and why would I care?

SemScore was proposed in a recent publication and focuses on the semantic content of a model's output using embeddings.

Embeddings are numerical representations of text which carry semantic meaning. The transformation from text to embedding vectors is done using embedding models.

To illustrate, consider the words orange, lemon, car, and money. Embedding the word orange with sentence-transformers/all-mpnet-base-v2 (the model used in the SemScore paper) yields a 768-dimensional vector:

tensor([[ 3.2832e-02,  2.2214e-02,  9.9305e-02, -1.0286e-01,  5.2077e-03,
 -5.9724e-02, -1.8181e-01,  6.0466e-02, -8.1715e-03, -5.3353e-02,
          2.1441e-02, -7.4530e-02,  7.7298e-02, -7.2748e-02, -1.6974e-01,
         -2.5297e-01, -1.7442e-02,  3.8736e-02, -4.5297e-02, -1.0881e-01,
         ...
          -1.3494e-02,  2.7610e-02,  2.9820e-01,  1.8822e-02,  1.4104e-01,
          3.1662e-03,  2.3393e-34, -3.2049e-02, -1.1889e-01, -9.7884e-02,
          2.5336e-02, -2.4282e-02, -1.2387e-01,  3.2787e-01,  1.1333e-02,
          1.0318e-01, -8.3175e-02,  4.2550e-02]], device='cuda:0')

If we break these 768 dimensions down to the 2 most important ones using Principal Component Analysis (PCA), the words can be visualized on a 2D plot:

[Figure: 2D PCA projection of the word embeddings]

While this plot does not accurately reflect how these words differ in all the 768 dimensions, we can still appreciate that distances on the plot reflect the difference in meaning of these words.
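For reference, a plot like this can be produced with just a few lines of code. This is a minimal sketch, not the exact code used for the figure above; it assumes the sentence-transformers, scikit-learn, and matplotlib packages are installed and uses the sentence-transformers package for brevity (the Hello-world example in Part 2 does the same with plain transformers):

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

words = ["orange", "lemon", "car", "money"]

# embed the words (768-dimensional vectors) ...
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = model.encode(words)                         # shape: (4, 768)

# ... and project them down to 2 dimensions with PCA
coords = PCA(n_components=2).fit_transform(embeddings)   # shape: (4, 2)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()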

What does this have to do with the evaluation of LLMs? Embeddings not only allow us to turn simple words into interesting plots but also to quantify the similarity of entire sentences or paragraphs using cosine similarity.

Cosine similarity is a metric used to measure how similar two vectors are, regardless of their size. Think of it as looking at the angle between two arrows; the closer this angle is to zero, the more similar the arrows (or vectors) are. A cosine similarity of 1 means the vectors are pointing in the exact same direction (very similar), 0 means they are perpendicular (no similarity), and -1 means they are pointing in opposite directions (very dissimilar).
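In code, cosine similarity is just the dot product of the two vectors divided by the product of their lengths. A minimal PyTorch sketch (the vectors here are made up for illustration):

import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    # dot product of the two vectors divided by the product of their lengths
    return (a @ b / (a.norm() * b.norm())).item()

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])     # same direction as a  -> similarity 1.0
c = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction   -> similarity -1.0

print(cosine_similarity(a, b), cosine_similarity(a, c))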

Coming back to our simple example, the cosine similarities between these four words reflect their semantic similarity. For example, the cosine similarity of lemon and orange is higher than that of lemon and car:

  • lemon vs. orange: 0.534
  • lemon vs. car: 0.291
  • lemon vs. money: 0.228
  • car vs. money: 0.341

Applying this concept to entire LLM responses (instead of simple words) is what SemScore is all about.

Embedding conversational data

Let's move a bit closer to a real use case by applying embeddings to actual conversational data.

The following is a visualization of all the questions of the Open Assistant 2 dataset, again embedded with sentence-transformers/all-mpnet-base-v2, and broken down from high-dimensional space to two dimensions using PCA.

[Figure: PCA projection of the embedded Open Assistant 2 questions]

To validate the approach of embedding text passages (questions in this case), let's look for the pairs of questions that are most similar to each other (highest cosine similarity) and most dissimilar (lowest cosine similarity).

[Figure: the most similar and the most dissimilar question pairs]
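A quick sketch of how such pairs can be found. The names questions (a list of all question strings) and question_embeddings (their L2-normalized embeddings, a tensor of shape (N, 768)) are hypothetical; exact duplicates would additionally need to be filtered out first:

import torch

# questions:           list of all question strings (hypothetical name)
# question_embeddings: their L2-normalized embeddings, tensor of shape (N, 768)
sim = question_embeddings @ question_embeddings.T   # cosine-similarity matrix, shape (N, N)
N = sim.size(0)

# the most dissimilar pair is the global minimum of the matrix
i, j = divmod(sim.argmin().item(), N)
print("most dissimilar pair:\n", questions[i], "\n", questions[j])

# for the most similar pair, mask the diagonal first (every question is identical to itself)
sim.fill_diagonal_(-1.0)
i, j = divmod(sim.argmax().item(), N)
print("most similar pair:\n", questions[i], "\n", questions[j])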

The two most similar questions in the dataset (after removing duplicates):

Question 1:

I want you to act as a Linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside a unique code block, and nothing else. Do no write explanations. Do not type commands unless I instruct you to do so. When I need to tell you something in English I will do so by putting text inside curly brackets {like this}. My first command is pwd.

Question 2:

I want you to act as a Linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. Do no write explanations. Do not type commands unless I instruct you to do so. When I need to tell you something in English I will do so by putting text inside curly brackets {like this}. My first command is pwd.

That's basically the same question; the only difference is that the word "a" in the third sentence has been replaced with "one" in the second question.

The two most dissimilar questions:

Question 1:

These are the books i like, suggest me more like these:

The Handmaid's Tale by Margaret Atwood

A Fine Balance by Rohinton Mistry

Educated by Tara Westover

Eleanor Oliphant Is Completely Fine by Gail Honeyman

Essays In Love by Alain de Botton

A Man Called Ove by Fredrik Backman

In Five Years by Rebecca Serle

Little Fires Everywhere by Celeste Ng

Normal People by Sally Rooney

Room by Emma Donoghue

Question 2:

How would win in a battle between a Giant Tiranosaurius Rex and Godzilla

I think we can all agree that these two requests are indeed very different.

Embedding LLM answers to recreate the arena leaderboard

Let's step up the game and see how this can be applied to benchmarking LLMs.

In the SemScore paper, the authors recreated an LLM ranking by calculating the similarity between each LLM's answer (prediction) and the answer given by a human (reference).

[Figure: LLM ranking from the SemScore paper]

We will now do something similar and apply SemScore to a corpus of conversations with human judgement: the LMSYS arena conversations. The arena assesses LLM-generated answers through direct comparison: users ask a question and receive two responses from distinct LLMs. Without knowing which LLM provided which answer, the user indicates which response they prefer. Based on these human ratings, an ELO score is calculated for each model, and the resulting ranking gives a global leaderboard reflecting human preference.

Our approach with the LMSYS arena dataset is different from the one published in the SemScore paper. We don't have a human reference answer to the questions; instead, we have two side-by-side answers to the same question. To apply SemScore, we need to compare each model's generated answer to a reference answer. Under the assumption that GPT-4 provides the best answers, we compare the answer of each model to the answer of GPT-4. The more similar a model's answers are to GPT-4's answers, the higher its ranking will be.

Let's start by loading the LMSYS arena conversations. The dataset is structured like this:

Dataset({
    features: ['question_id', 'model_a', 'model_b', 'winner', 'judge', 'conversation_a', 'conversation_b', 'turn', 'anony', 'language', 'tstamp', 'openai_moderation', 'toxic_chat_tag'],
    num_rows: 33000
})

The first entry:

{
  "question_id": "58210e39b3fd4441a2bd4a518bb44c2d",
  "model_a": "chatglm-6b",
  "model_b": "koala-13b",
  "winner": "model_b",
  "judge": "arena_user_973",
  "conversation_a": [
    {
      "content": "What is the difference between OpenCL and CUDA?",
      "role": "user"
    },
    {
      "content": "OpenCL and CUDA are two different programming models that are used for parallel computing.OpenCL is a general-purpose\u5e76\u884c\u7f16\u7a0b\u63a5\u53e3 that allows developers to write parallel code that can run on any platform that supportsCL, which includes most modern operating systems and computer systems, including Windows, Linux, and macOS. It provides a lower-level, more flexible API that is more suitable for building large-scale distributed computing systems.CUDA is a specific implementation ofOpenCL that is designed for performance and scalability in devices with multiple GPU(s). It was developed by Nvidia and is widely used for scientific computing, machine learning, and\u56fe\u5f62\u56fe\u50cf\u5904\u7406. The CUDA API provides a high-level, more efficient way to write parallel code that can run on top of a GPU, while still maintaining the ability to write code that can be easily adapted for other parallel computing platforms.",
      "role": "assistant"
    }
  ],
  "conversation_b": [
    {
      "content": "What is the difference between OpenCL and CUDA?",
      "role": "user"
    },
    {
      "content": "OpenCL and CUDA are both programming languages for parallel computing on GPUs, but they differ in several key ways:\n\n1.   Architecture: OpenCL is a general-purpose parallel computing language, while CUDA is designed specifically for GPU computing. OpenCL can run on various types of processors, including CPUs, GPUs, and FPGAs, whereas CUDA is only compatible with NVIDIA GPUs.\n2.   Language: OpenCL is a C++ language, while CUDA is a C++ language with additional features for GPU computing. CUDA provides additional libraries and tools for handling graphics and other tasks unique to GPUs.\n3.   Driver support: OpenCL has a wider range of device support, with support for not just NVIDIA GPUs but also other manufacturers like AMD, Intel, and ARM. CUDA is only available on NVIDIA GPUs, and requires the CUDA Toolkit to be installed on the host machine.\n4.   Memory management: OpenCL provides a unified memory model, where all the data is stored in global memory, and the kernel can access any part of it. CUDA, on the other hand, uses a hierarchical memory architecture, where data is divided into host, device, and shared memory.\n5.   Performance: OpenCL and CUDA have different strengths and weaknesses in terms of performance. OpenCL is generally faster for numerical computations, while CUDA is faster for graphics and other tasks that require fast memory access. CUDA also has better support for 64-bit floating-point operations, which are essential for some scientific applications.\n\nIn summary, OpenCL and CUDA are both programming languages for parallel computing on GPUs, but they differ in terms of architecture, language, driver support, memory management, and performance. OpenCL is a general-purpose language that can run on various types of processors, while CUDA is specific to NVIDIA GPUs and provides additional libraries and tools for handling graphics and other tasks unique to GPUs.",
      "role": "assistant"
    }
  ],
  "turn": 1,
  "anony": true,
  "language": "English",
  "tstamp": 1682351591.1322,
  "openai_moderation": {
    "categories": {
      "harassment": false,
      "harassment/threatening": false,
      "hate": false,
      "hate/threatening": false,
      "self-harm": false,
      "self-harm/instructions": false,
      "self-harm/intent": false,
      "sexual": false,
      "sexual/minors": false,
      "violence": false,
      "violence/graphic": false
    },
    "category_scores": {
      "harassment": 2.8765102e-05,
      "harassment/threatening": 5.663866e-07,
      "hate": 5.574919e-06,
      "hate/threatening": 2.3965333e-08,
      "self-harm": 5.1901172e-09,
      "self-harm/instructions": 1.1517327e-08,
      "self-harm/intent": 3.9070875e-09,
      "sexual": 4.381485e-06,
      "sexual/minors": 5.541973e-08,
      "violence": 2.2985896e-06,
      "violence/graphic": 4.8212314e-07
    },
    "flagged": false
  },
  "toxic_chat_tag": {
    "roberta-large": {
      "flagged": false,
      "probability": 0.008313022553920746
    },
    "t5-large": {
      "flagged": false,
      "score": 7.943665219245499e-05
    }
  }
}

Next, we extract all answers where any model is compared to GPT-4, our reference model. This leaves us with the following number of answers per model.

[Figure: number of answers per model compared against GPT-4]
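A rough sketch of this extraction and the per-model averaging that follows (the full code is in the notebook linked below). Here, arena is the dataset loaded above, similarity() stands for a hypothetical helper that returns the cosine similarity of two texts based on all-mpnet-base-v2 embeddings (e.g. the EmbeddingModelWrapper introduced in Part 2), and it is assumed that GPT-4 appears as "gpt-4" in the model columns:

from collections import defaultdict
from statistics import mean

# arena:            the LMSYS arena dataset loaded above
# similarity(a, b): hypothetical helper returning the cosine similarity of two texts
per_model = defaultdict(list)

for row in arena:
    models = (row["model_a"], row["model_b"])
    # keep only single-turn battles that involve GPT-4 and two distinct models
    if "gpt-4" not in models or row["model_a"] == row["model_b"] or row["turn"] != 1:
        continue
    answers = {
        row["model_a"]: row["conversation_a"][-1]["content"],
        row["model_b"]: row["conversation_b"][-1]["content"],
    }
    gpt4_answer = answers.pop("gpt-4")
    other_model, other_answer = next(iter(answers.items()))
    per_model[other_model].append(similarity(other_answer, gpt4_answer))

# SemScore per model = average similarity of its answers to the GPT-4 answers
semscores = {model: mean(sims) for model, sims in per_model.items()}
for model, score in sorted(semscores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.3f}")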

Let's look at one example entry in the dataset:

Question "My dog has eaten some chocolate. What should I do?"

Answer Vicuna-13B:

If your dog has eaten some chocolate, the first thing you should do is to try to determine how much chocolate they have eaten and what type of chocolate it was. The darker and more bitter the chocolate, the more dangerous it is for your dog.

If your dog has only eaten a small amount of chocolate, they may not show any symptoms at all. However, if they have eaten a larger amount, they may experience symptoms such as vomiting, diarrhea, increased heart rate, and agitation. In severe cases, chocolate poisoning can lead to seizures, coma, and even death.

If you suspect that your dog has eaten chocolate, it's important to contact a veterinarian right away. Your veterinarian will be able to determine the severity of the situation and provide appropriate treatment. In some cases, your veterinarian may recommend inducing vomiting to remove the chocolate from your dog's stomach. In other cases, they may need to administer activated charcoal or other medications to help remove the chocolate from your dog's system.

In general, it's best to keep chocolate out of reach of your dog to prevent accidental ingestion. If you have any questions about whether a particular type of chocolate is safe for your dog, it's always best to consult with your veterinarian.

Answer GPT-4:

I'm not a veterinarian, but if your dog has eaten chocolate, it's important to act quickly as chocolate can be toxic to dogs. Here are some steps you might consider:

  1. Determine the amount and type of chocolate your dog consumed. Dark chocolate and baking chocolate are more toxic than milk chocolate.
  2. Check for symptoms of chocolate poisoning, which may include vomiting, diarrhea, rapid breathing, increased heart rate, and seizures.
  3. Call your veterinarian or a pet poison helpline immediately. They can provide guidance based on the specific situation.
  4. Follow the advice of the veterinarian or the helpline, which may include inducing vomiting, administering activated charcoal, or taking your dog to the vet for treatment.

Remember to consult a professional for advice tailored to your dog's specific situation.

The cosine similarity of these two answers is 0.958, indicating a high degree of similarity. If we average the similarity of all answers by vicuna-13B and the corresponding answers by GPT-4, we obtain a SemScore of 0.770 for vicuna-13B.

By calculating the average scores for all the models in the dataset, we can (almost) reproduce the official arena leaderboard based on human evaluation. The plot below shows how SemScore correlates with the ELO rating, a score based on the human evaluations collected in the arena.

[Figure: correlation of SemScore with arena ELO rating]

The correlation of ELO rating and SemScore is good but not perfect. Let's look at how the ranking correlates and where it differs.

[Figure: model ranking by ELO rating vs. ranking by SemScore]
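To put a number on this agreement, one can compute a rank correlation between the two rankings, for example with scipy. The values below are illustrative placeholders (only vicuna-13B's SemScore of 0.770 is taken from above), not the actual results:

from scipy.stats import spearmanr

# illustrative placeholder values, except vicuna-13b's SemScore from above
elo      = {"vicuna-13b": 1050, "wizardlm-13b": 1020, "koala-13b": 980, "palm-2": 1000}
semscore = {"vicuna-13b": 0.770, "wizardlm-13b": 0.760, "koala-13b": 0.740, "palm-2": 0.730}

models = list(elo)
rho, p = spearmanr([elo[m] for m in models], [semscore[m] for m in models])
print(f"Spearman rank correlation: {rho:.2f} (p={p:.2g})")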

I did not go into detail on why the models wizardlm-13b and fastchat-t5-3b are ranked higher (and palm-2 lower) by SemScore than by human judgement, but overall there is a good correlation, showing that SemScore is another useful tool to benchmark LLM answers. If you want to reproduce the results above, the code is provided in a notebook.

In the next part, we will walk through the code to evaluate any Hugging Face model on a dataset of conversations.

Part 2: Implementation - Bringing SemScore to Life

Let's look at how to evaluate a Hugging Face model on a dataset using SemScore, both after and during training.

Prerequisites

Make sure to use recent versions of the Hugging Face suite; these are the specific package versions I used for the code that follows.

accelerate                        0.28.0            
bitsandbytes                      0.42.0
datasets                          2.18.0
flash-attn                        2.5.6
peft                              0.9.0
sentencepiece                     0.2.0
transformers                      4.38.2
trl                               0.7.11
torch                             2.2.1

Hello world example

The code snippet below demonstrates how to load the embedding model and calculate the semantic similarity between simple sentences.

The code is mostly taken from the model card of sentence-transformers/all-mpnet-base-v2, the embedding model used in the SemScore paper.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

Now that the model is loaded, let's obtain the embedding vectors.

# Sentences we want sentence embeddings for
sentences = ["apple", "orange", "car"]

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p = 2, dim = 1)

To obtain the cosine similarity, we calculate the dot product of the normalized vectors.

for i in range(0, len(sentences)):
    print(
        sentences[0],
        sentences[i],
        (sentence_embeddings[0] @ sentence_embeddings[i]).item()
    )    

apple apple 0.9999999403953552

apple orange 0.40115007758140564

apple car 0.3137194514274597

Evaluate a finetuned model on any dataset

To make the evaluation easier to follow, let's use a wrapper class for the embedding model. EmbeddingModelWrapper loads all-mpnet-base-v2 by default and exposes two functions: get_embeddings to calculate the embeddings for a list of strings, and get_similarities to calculate the cosine similarities of two lists of embedding vectors. The rest of the code is mostly taken from the model card of sentence-transformers/all-mpnet-base-v2.

# imports needed by the wrapper (full version in semscore.py)
from transformers import AutoModel, AutoTokenizer
from torch import nn

class EmbeddingModelWrapper():
    DEFAULT_MODEL = "sentence-transformers/all-mpnet-base-v2"

    def __init__(self, model_path=DEFAULT_MODEL, bs=8):
        self.model, self.tokenizer = self.load_model(model_path)
        self.bs = bs
        self.cos = nn.CosineSimilarity(dim=1, eps=1e-6)

    def load_model(self, model_path):
        model = AutoModel.from_pretrained(model_path).cuda()
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        return model.eval(), tokenizer

    def get_embeddings(self, sentences):
        ...
        
    def get_similarities(self, x, y = None):
        ...

Please find the full code of EmbeddingModelWrapper in semscore.py.
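Sketched roughly, and not necessarily identical to the implementation in semscore.py, the two methods boil down to a batched version of the model-card code above. The return types are inferred from how the wrapper is used below (a matrix when called with one argument, a list of per-pair similarities when called with two); mean_pooling, torch and F are reused from the Hello-world example:

    # Rough sketch of the two method bodies; the actual code in semscore.py may differ.
    def get_embeddings(self, sentences):
        embeddings = []
        for i in range(0, len(sentences), self.bs):
            batch = sentences[i : i + self.bs]
            encoded = self.tokenizer(
                batch, padding=True, truncation=True, return_tensors="pt"
            ).to(self.model.device)
            with torch.no_grad():
                output = self.model(**encoded)
            pooled = mean_pooling(output, encoded["attention_mask"])
            embeddings.append(F.normalize(pooled, p=2, dim=1))
        return torch.cat(embeddings)

    def get_similarities(self, x, y=None):
        if y is None:
            # one argument: full matrix of pairwise cosine similarities
            return x @ x.T
        # two arguments: element-wise similarity of corresponding rows
        return self.cos(x, y).tolist()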

This will make model evaluation easier. For example, EmbeddingModelWrapper can be used to recreate the simple example from before like this:

from semscore import EmbeddingModelWrapper
em = EmbeddingModelWrapper()

words = ["lemon", "orange", "car", "money"]
embds = em.get_embeddings(words)

similarities = em.get_similarities(embds)

The returned similarities tensor contains the cosine similarities of all combinations:

tensor([[1.0000, 0.5340, 0.2909, 0.2281],
        [0.5340, 1.0000, 0.3027, 0.1968],
        [0.2909, 0.3027, 1.0000, 0.3411],
        [0.2281, 0.1968, 0.3411, 1.0000]])

We are now ready to test a Hugging Face model.

Let's go for TinyLlama and test on a random subset of the Open Assistant dataset. We load the dataset, challenge TinyLlama with the questions and compare the answers to the reference answers.

from datasets import load_dataset

dataset_name = "g-ronimo/oasst2_top1_en"
dataset = load_dataset(dataset_name)["train"]
dataset = dataset.shuffle(seed = 42).select(range(10))

This is what an entry in this dataset looks like:

{
  "conversation": [
    {
      "content": "Can photoshop be used as a tool to forge documents illegally?",
      "role": "user"
    },
    {
      "content": "Yes, Photoshop can be used to forge documents due to the application's ability to modify digital images. Users with access to a Photoshop tool would have the potential to alter documents and create highly convincing fakes. It is important to note that forging documents is not an intended goal of the photoshop tool, and forging documents is likely to be illegal or immoral based on a specific situation.",
      "role": "assistant"
    }
  ]
}

The conversation above contains a single turn, one question and one answer, but conversations in this dataset might also consist of multiple turns.

Next, load TinyLlama:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(
    model_path,    
    device_map = "auto",
    torch_dtype = torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

We can now challenge TinyLlama with questions in the dataset and calculate the semantic similarity of the generated answers to the reference answers.

answers_ref, answers_pred = [], []

for conversation in dataset["conversation"]:
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant": continue
        partial_conv = conversation[0:i+1]

        prompt_tok = tokenizer.apply_chat_template(
            partial_conv, 
            add_generation_prompt = True, 
            return_tensors = "pt"
        ).to("cuda")
        answer_tok = model.generate(
            prompt_tok, 
            eos_token_id = tokenizer.eos_token_id,
            max_new_tokens = 2000, 
        )
        answer = tokenizer.decode(
            answer_tok[0][len(prompt_tok[0]):], 
            skip_special_tokens = True
        )

        answers_pred.append(answer) 
        answers_ref.append(conversation[i+1]["content"])  

Coming back to the photoshop question, this is how TinyLlama answers it:

Question: Can photoshop be used as a tool to forge documents illegally?

Answer TinyLlama:

Yes, photoshop can be used as a tool to forge documents illegally. Photoshop is a powerful image editing software that can be used to manipulate images and create fake documents. It can be used to create fake signatures, addresses, and other documents that are not genuine. However, it is essential to note that using photoshop for forging documents can lead to legal consequences, and it is best to avoid using this software for this purpose.

In some cases, photoshop can be used to create fake documents that are indistinguishable from genuine documents. However, it is essential to note that the software can be easily detected and can lead to legal consequences. It is best to avoid using photoshop for forging documents and instead use other legal methods to create fake documents.

In summary, photoshop can be used as a tool to forge documents illegally, but it is essential to avoid using this software for this purpose and use other legal methods to create fake documents.

Reference answer:

Yes, Photoshop can be used to forge documents due to the application's ability to modify digital images. Users with access to a Photoshop tool would have the potential to alter documents and create highly convincing fakes. It is important to note that forging documents is not an intended goal of the photoshop tool, and forging documents is likely to be illegal or immoral based on a specific situation.

We can now calculate the cosine similarities of predictions versus references using EmbeddingModelWrapper:

from semscore import EmbeddingModelWrapper
from statistics import mean

em = EmbeddingModelWrapper()
similarities = em.get_similarities(
    em.get_embeddings( answers_pred ),
    em.get_embeddings( answers_ref ),
)

The cosine similarity of prediction and reference answer for the photoshop question above is 0.92. The average similarity of all 10 questions we picked from the OA dataset is 0.71.
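That average is simply the mean of the returned similarities (assuming get_similarities returns a plain list of per-pair similarities when called with two sets of embeddings, as in the sketch above):

from statistics import mean

print(mean(similarities))   # ~0.71 for the 10 questions above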

If we scale this up to 100 questions and test TinyLlama and Mistral, this is what we get:

| Model | SemScore |
|---|---|
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 0.6903450180861083 |
| models/Mistral-7B-Instruct-v0.2 | 0.7476783179036983 |

Let's look at an example where Mistral beats TinyLlama in terms of SemScore.

Question: Write a Twitter thread on investing

Reference answer:

1/5 Investing has become much bigger in the public eye in the last decade thanks in large part to apps like Robinhood or eToro that target everyday people, aka retail investors.

2/5 While there are many online guru's and "experts" that voice all sorts of opinions on day trading and stock options, there's only one real option towards getting reliable gains without investing becoming your full-time job...

3/5 That is building an investment portfolio of low risk, low reward, diverse assets across a variety of different industries for a long period of time. It's not going to create quick wealth, but it will secure your financial future.

4/5 Another way to lower risk while investing is to invest your money often and in small amounts. This allows you to avoid the risk of depositing a large amount during a momentary dip.

5/5 This is all the advice the average person needs for investing in their entire life. Remember, investing shouldn't be about getting rich, it should be about putting your extra resources to work to build a better future for yourself and others.

Answer TinyLlama/TinyLlama-1.1B-Chat-v1.0 (Similarity 0.603):

  1. Introduction: Start by introducing yourself and your interest in investing. Explain why you want to learn more about the topic, what motivates you, and how this information can benefit you financially. Use hashtags like #investmenttips or #financialadvice to attract attention from other users interested in similar topics.

  2. Understanding Investments: Discuss the basics of investing, including the difference between stocks, bonds, mutual funds, and exchange-traded funds (ETFs). Provide examples of successful investors and explain how they have achieved their success through sound financial planning and discipline. Use relevant hashtags such as #stockmarketprimer or #investmentbasics.

  3. Diversification: Discuss the importance of diversifying one's portfolio across different asset classes, such as stocks, bonds, and real estate. Explain how this strategy helps reduce risk and maximize returns over time. Use relevant hashtags like #diversifyyourportfolio or #assetclassdiversification.

  4. Risk Management: Address the concept of risk management in investing, which involves balancing potential losses with expected gains. Explain how to identify risks associated with specific assets and strategies for mitigating them. Use relevant hashtags like #riskmanagementor #assetallocation.

  5. Market Cycles: Discuss market cycles, which are periods of high volatility followed by periods of low volatility. Explain how these cycles affect individual investments and provide tips for managing risk during periods of uncertainty. Use relevant hashtags like #marketcycle or #volatilitiestoinvestors.

  6. Portfolio Construction: Develop a comprehensive plan for building an effective investment portfolio based on your understanding of the previous points.

Answer Mistral-7B-Instruct-v0.2 (Similarity 0.787):

🧵1/ Investing is an essential part of building wealth and securing your financial future. It's never too early or too late to start! #Investing #MoneyTips

🧵2/ Before you dive in, it's crucial to understand the basics: stocks, bonds, mutual funds, ETFs, etc. Research each investment type and determine which aligns best with your goals, risk tolerance, and time horizon. #Stocks #Bonds #ETFs

🧵3/ Diversification is key to minimizing risk. Don't put all your eggs in one basket! Spread investments across various asset classes, sectors, and geographies. #Diversification #RiskManagement

🧵4/ Set clear objectives for your investments. Are you looking for capital appreciation, income generation, or both? Understanding your goals will help guide your decisions. #FinancialGoals #CapitalAppreciation #IncomeGeneration

🧵5/ Patience pays off when it comes to investing. Long-term strategies often yield better returns than short-term speculation. Remember, Rome wasn't built in a day! #LongTermInvesting #Patience PaysOff

🧵6/ Stay informed about market trends, economic indicators, and company performance. Knowledge is power when making investment decisions. #MarketNews #EconomicIndicators #CompanyPerformance

🧵7/ Lastly, consider seeking advice from a trusted financial advisor or professional. They can provide valuable insights and guidance based on their expertise and experience. #FinancialAdvisor #ProfessionalAdvice

End of Thread 🧵 #Investing101 #WealthBuilding #PersonalFinance #MoneyMatters #Fin

If you want to see more examples, take a look at the complete results here. The code is gathered in a notebook.

We could expand our analysis to include more samples from the dataset and test additional models, but the results would be hard to interpret. This is a popular dataset, and it might be that a model has been trained on at least parts of it.

To address this issue, one approach is to use SemScore to evaluate your own model after training. By calculating SemScore on the validation split of your dataset, you can gain insight into whether your model has generalized well from the training set and is generating answers that are semantically similar to your desired outputs.

Furthermore, SemScore can be a valuable tool for monitoring progress during training, as I will demonstrate in the following section.

Evaluate while training

Training LLMs, especially on conversational data, often results in increasing evaluation loss (eval/loss in the figure below) over time — a phenomenon typically associated with overfitting. However, this increase doesn't always correlate with decreased performance in natural language tasks.

[Figure: training and evaluation loss curves, with eval/loss increasing over time]

SemScore offers a way to monitor semantic similarity of prediction versus reference throughout training, providing insights into the true progress of the model, beyond what the traditional loss metrics can offer.

Let's put SemScore into a typical Hugging Face based training run. The following is a minimal example of training TinyLlama on a (deduplicated) version of the Open Assistant dataset.

If you are familiar with finetuning models with the Hugging Face suite, this will look familiar, and you probably have your own code and routines that look a bit different. The code below is a full fine-tune of TinyLlama that fits on a 24 GB VRAM GPU such as an NVIDIA GeForce RTX 3090 or 4090 and trains a decent chatbot in around 1.5 hours.

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig, set_seed
from peft import LoraConfig
from trl import SFTTrainer, setup_chat_format, DataCollatorForCompletionOnlyLM
from datasets import load_dataset
import torch

set_seed(42)

modelpath = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map = "auto",
    torch_dtype = torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast = False)

model, tokenizer = setup_chat_format(model, tokenizer)
if tokenizer.pad_token in [None, tokenizer.eos_token]: 
    tokenizer.pad_token = tokenizer.unk_token

dataset = load_dataset("g-ronimo/oasst2_top4k_en")

training_arguments = TrainingArguments(
    output_dir = "out_OA_TL",
    evaluation_strategy = "steps",
    label_names = ["labels"],
    per_device_train_batch_size = 16,
    gradient_accumulation_steps = 1,
    save_steps = 250,
    eval_steps = 250,
    logging_steps = 1, 
    learning_rate = 1e-5,
    num_train_epochs=10,
    lr_scheduler_type = "constant",
    optim = 'paged_adamw_32bit',
    bf16 = True,
    gradient_checkpointing = True,
    group_by_length = True,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset['test'],
    data_collator = DataCollatorForCompletionOnlyLM(
        instruction_template = "<|im_start|>user", 
        response_template = "<|im_start|>assistant", 
        tokenizer = tokenizer, 
        mlm = False),
    max_seq_length = 512,
    args = training_arguments,
)

At this point you would start training with trainer.train(), but before that we will plug in a TrainerCallback that performs the SemScore evaluation after each epoch.

SemscoreEvalCallback loads ModelPredictionGenerator, a helper class for batch inference (source), to generate answers and then passes the generated answers and references to EmbeddingModelWrapper, which calculates and logs the cosine similarities after each epoch. To run this, please obtain semscore.py (https://github.com/geronimi73/semscore/blob/main/semscore.py), which contains the definitions of ModelPredictionGenerator and EmbeddingModelWrapper.

from transformers import TrainerCallback
from statistics import mean

from semscore import ModelPredictionGenerator, EmbeddingModelWrapper

class SemscoreEvalCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, model, tokenizer, eval_dataloader, **kwargs):

        generator = ModelPredictionGenerator(model = model, tokenizer = tokenizer)
        eval_ds = dataset["test"].select(range(100))
        results = generator.run(dataset = eval_ds)

        em = EmbeddingModelWrapper()
        similarities = em.get_similarities(
            em.get_embeddings( [a["answer_ref"] for a in results] ),
            em.get_embeddings( [a["answer_pred"] for a in results] ),
        )
        cosine_sim = mean(similarities)
        trainer.log({"cosine_sim": cosine_sim})

trainer.add_callback(SemscoreEvalCallback())

What we observe is that even though the loss curves diverge, the semantic similarity of generated versus reference answers actually increases until epoch 7.

[Figure: TinyLlama full fine-tune, loss curves and SemScore (cosine similarity) per epoch]

In the figure above, the improvement in absolute numbers is rather modest for this TinyLlama training run. Let's put the numbers into perspective and train Mistral, a much better base model, on the same data and see what happens.

[Figure: Mistral QLoRA training, loss curves and SemScore (cosine similarity) per epoch]

Note that inference is very slow while training with LoRA or QLoRA due to the additional overhead of the adapters. An eval of 100 samples generating at most 500 tokens during a QLoRA training of Mistral takes around 30 minutes (with batched inference). Depending on your level of nervousness, it might be better to calculate the SemScore after training and merging, which is a lot faster.
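If you go that route, merging a trained (Q)LoRA adapter into its base model can be done with peft roughly like this; the base model name and the adapter checkpoint path below are placeholders, not the actual paths used in this post:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# placeholders: the base model the adapter was trained on and a checkpoint directory
base_model_path = "mistralai/Mistral-7B-v0.1"
adapter_path    = "out_OA_Mistral/checkpoint-1000"

base = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
)
model = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# generate answers with the merged model and score them with EmbeddingModelWrapper
# exactly as shown in the previous section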

What these curves look like exactly depends on the model, the dataset, and the training hyperparameters, which might not be ideal in this specific example. One could argue that the learning rate is too low, the dataset is too small, or training for 10 epochs is excessive. Also, the comparison is somewhat apples-to-oranges because we are comparing a full finetune (all parameters trained) to a QLoRA training (~4% of parameters trained). But this is also kind of the point: SemScore offers another way of quickly assessing whether your model is effectively learning what you want it to learn, and it allows you to explore the effect of different training methods and hyperparameters.

If you are curious about the generated and reference answers at each epoch of training TinyLlama and Mistral, you can find them here. The code to reproduce all of the above is provided in notebooks: Calculate SemScore while training TinyLlama and Mistral.

Summary

  • Evaluating LLMs is a challenging task; current benchmarks have limitations
  • Human evaluation is the gold standard, but it does not scale well
  • Other evaluation methods, such as using pre-defined metrics or relying on other LLMs, may be unreliable, biased, or expensive
  • SemScore is a novel approach that evaluates the semantic meaning of LLM output, providing a different perspective on model performance
  • It's not a silver bullet and will not work for all kinds of datasets (code, math?), but rather a complementary approach

The purpose of this blog post is to demonstrate the usefulness of SemScore and provide practical guidance on how to apply it to your own datasets and models.

I hope you found this post helpful, and encourage you to try SemScore on your own models and training runs. If you have any questions or feedback, please let me know! @Geronimo_AI.