Knowledge Distillation for Fine-Tuning a GPT-3.5 Judge: Enhancing Accuracy and Performance

Community Article Published May 13, 2024

Introduction

In the realm of natural language processing (NLP), the refinement of language models has become a pivotal focus. Fine-tuning existing models to enhance their capabilities is a strategy gaining traction, particularly in scenarios where accuracy and performance are paramount. One such instance is the fine-tuning of a GPT-3.5 Judge, a language model tasked with evaluating responses to user queries. In this article, we delve into the concept of knowledge distillation for fine-tuning a GPT-3.5 Judge, exploring its definitions, methodologies, and the benefits it offers.

Definitions

Knowledge Distillation: Knowledge distillation refers to the process of transferring knowledge from a complex model (the teacher) to a simpler model (the student). This transfer typically involves distilling the insights and patterns learned by the teacher model into a more compact form, allowing the student model to benefit from the teacher's expertise.
Fine-Tuning: Fine-tuning, in the context of machine learning, involves taking a pre-trained model and further training it on a specific task or dataset to improve its performance or adapt it to new circumstances. It allows the model to adjust its parameters based on the new data, thereby refining its predictions or outputs for the given task.

Benefits of Knowledge Distillation for Fine-Tuning a GPT-3.5 Judge

Enhanced Accuracy: By distilling knowledge from a more advanced model like GPT-4, the GPT-3.5 Judge can improve its accuracy in evaluating responses to user queries. The insights gleaned from the GPT-4 model help fine-tune the judging criteria of the GPT-3.5 Judge, leading to more precise assessments of the responses generated.
Improved Performance: Knowledge distillation enables the GPT-3.5 Judge to inherit the robustness and effectiveness of the GPT-4 model, thereby enhancing its overall performance. Fine-tuning based on distilled knowledge allows the GPT-3.5 Judge to adapt and evolve, ensuring it remains at the forefront of language understanding and evaluation.
Alignment with Human Judgements: Through knowledge distillation, the GPT-3.5 Judge can align more closely with human judgements, mirroring the assessment criteria employed by human evaluators. This alignment fosters greater consistency and reliability in evaluating responses, ultimately enhancing user satisfaction and trust in the judging process.

Code Implementation

To implement knowledge distillation for fine-tuning a GPT-3.5 Judge, follow these steps using Python and relevant libraries such as llama-index:

Step I: Install Libraries


%pip install llama-index-readers-wikipedia
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-finetuning-callbacks
%pip install llama-index-llms-huggingface
%pip install wikipedia -q

Step II: Initiate OpenAI and Huggingface

import os
import nest_asyncio
import tqdm

nest_asyncio.apply()

# we will be using models on HuggingFace as our LLM answer generators
HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")

# we will use GPT-4 and GPT-3.5 + OpenAI Fine-Tuning
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

Step III: Generate datasets


QUESTION_GEN_PROMPT = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)

# wikipedia pages
from llama_index.readers.wikipedia import WikipediaReader

# generate questions against chunks
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface import HuggingFaceInferenceAPI


cities = [
    "San Francisco",
    "Toronto",
    "New York",
    "Vancouver",
    "Montreal",
    "Tokyo",
    "Singapore",
    "Paris",
]

documents = WikipediaReader().load_data(
    pages=[f"History of {x}" for x in cities]
)

qrd = dataset_generator.generate_dataset_from_nodes(num=350)

Step IV: Vector Index the Content

# Create vector index
the_index = VectorStoreIndex.from_documents(documents=documents)

# Create the retriver on this index
the_retriever = VectorIndexRetriever(
    index=the_index,
    similarity_top_k=2,
)

llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    context_window=2048,  # to use refine
    token=HUGGING_FACE_TOKEN,
)

query_engine = RetrieverQueryEngine.from_args(retriever=the_retriever, llm=llm)

# we will use 65% of the generated questions for training
train_dataset = []
num_train_questions = int(0.65 * len(qrd.qr_pairs))

for q, a in tqdm.tqdm(qrd.qr_pairs[:num_train_questions]):
    # data for this q
    data_entry = {"question": q, "reference": a}
    response = query_engine.query(q)
    response_struct = {}
    response_struct["model"] = "llama-2"
    response_struct["text"] = str(response)
    response_struct["context"] = (
        response.source_nodes[0].node.text[:1000] + "..."
    )

    data_entry["response_data"] = response_struct
    train_dataset.append(data_entry)

Step V: GPT-4 Evaluations

# instantiate the gpt-4 judge
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import CorrectnessEvaluator

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])
gpt_4_llm = OpenAI(
    temperature=0, model="gpt-4", callback_manager=callback_manager
)

gpt4_judge = CorrectnessEvaluator(llm=gpt_4_llm)


# for `training`
for data_entry in tqdm.tqdm(train_dataset):
    eval_result = await gpt4_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # save final result
    judgement = {}
    judgement["llm"] = "gpt_4"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] = [judgement]

finetuning_handler.save_finetuning_events("correction_finetuning_events.jsonl")

Step VI: Perform knowledge distillation

from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "correction_finetuning_events.jsonl",
)

# We can check the status of our current job as follows
# This may take some time ...
finetune_engine.finetune()

Output:

Num examples: 79
First example:
{'role': 'system', 'content': '\nYou are an expert evaluation system for a question answering chatbot.\n\nYou are given the following information:\n- a user query,\n- a reference answer, and\n- a generated answer.\n\nYour job is to judge the relevance and correctness of the generated answer.\nOutput a single score that represents a holistic evaluation.\nYou must return your response in a line with only the score.\nDo not return answers in any other format.\nOn a separate line provide your reasoning for the score as well.\n\nFollow these guidelines for scoring:\n- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.\n- If the generated answer is not relevant to the user query, you should give a score of 1.\n- If the generated answer is relevant but contains mistakes, you should give a score between 2 and 3.\n- If the generated answer is relevant and fully correct, you should give a score between 4 and 5.\n\nExample Response:\n4.0\nThe generated answer has the exact same metrics as the reference answer,     but it is not as concise.\n\n'}
{'role': 'user', 'content': '\n## User Query\nWhat event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?\n\n## Reference Answer\nThe great earthquake and fire in 1906 caused significant damage to San Francisco but was followed by a quick rebuild.\n\n## Generated Answer\n1906 earthquake and fire.\n'}
{'role': 'assistant', 'content': '4.0\nThe generated answer is relevant and correct, but it lacks the detail and context provided in the reference answer.'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 315, 782
mean / median: 479.49367088607596, 465.0
p5 / p95: 355.6, 634.6

#### Distribution of num_assistant_tokens_per_example:
min / max: 19, 110
mean / median: 57.63291139240506, 56.0
p5 / p95: 29.6, 83.2

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~37880 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~113640 tokens
As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.
This means your total cost for training will be $0.30304000000000003 per epoch.

Step VII: Evaluate The Fine-Tuned GPT-3.5 Judge On The Test Dataset

# Use Llama-2 to generate answers to the test questions
test_dataset = []
for q, a in tqdm.tqdm(qrd.qr_pairs[num_train_questions:]):
    # data for this q
    data_entry = {"question": q, "reference": a}
    response = query_engine.query(q)
    response_struct = {}
    response_struct["model"] = "llama-2"
    response_struct["text"] = str(response)
    response_struct["context"] = (
        response.source_nodes[0].node.text[:1000] + "..."
    )

    data_entry["response_data"] = response_struct
    test_dataset.append(data_entry)


# get the gpt-4 judgements on the Llama-2 answers
for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await gpt4_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # save final result
    judgement = {}
    judgement["llm"] = "gpt_4"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] = [judgement]

from llama_index.core.evaluation import EvaluationResult

# use our fine-tuned GPT-3.5 to evaluate the answers
ft_llm = finetune_engine.get_finetuned_model()

ft_gpt_3p5_judge = CorrectnessEvaluator(llm=ft_llm)

for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await ft_gpt_3p5_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # save final result
    judgement = {}
    judgement["llm"] = "ft_gpt_3p5"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] += [judgement]

# Similarly, use a non-fine-tuned judge to evaluate the answers
gpt_3p5_llm = OpenAI(model="gpt-3.5-turbo")

gpt_3p5_judge = CorrectnessEvaluator(llm=gpt_3p5_llm)

for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await gpt_3p5_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # save final result
    judgement = {}
    judgement["llm"] = "gpt_3p5"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] += [judgement]

Step VIII: Metrics

import numpy as np

REPORT_FMT_STR = (
    "{model}\n"
    "-----------------\n"
    "Number of obs.: {total_obs}\n"
    "Correlation with GPT-4: {corr}\n"
)

scores = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
for ix, d in enumerate(test_dataset):
    for e in d["evaluations"]:
        scores[e["llm"]].append(e["score"])

# numpy conversion
np_scores_gpt_4 = np.array(scores["gpt_4"])
np_scores_gpt_3p5 = np.array(scores["gpt_3p5"])
np_scores_ft_gpt_3p5 = np.array(scores["ft_gpt_3p5"])

# correlations
corr_ft = np.corrcoef(np_scores_gpt_4, np_scores_ft_gpt_3p5)[0, 1]
corr_no_ft = np.corrcoef(np_scores_gpt_4, np_scores_gpt_3p5)[0, 1]

print(
    REPORT_FMT_STR.format(
        model="GPT-3.5 w/ fine-tuning",
        total_obs=np_scores_gpt_4.shape[0],
        corr=corr_ft,
    )
)
print("\n")
print(
    REPORT_FMT_STR.format(
        model="GPT-3.5 w/out fine-tuning",
        total_obs=np_scores_gpt_4.shape[0],
        corr=corr_no_ft,
    )
)

Conclusion

Knowledge distillation offers a compelling strategy for fine-tuning language models like the GPT-3.5 Judge, providing a pathway to improved accuracy, performance, and alignment with human judgements. By leveraging the insights and expertise of more advanced models, such as GPT-4, the GPT-3.5 Judge can undergo iterative refinement, ensuring its continued relevance and effectiveness in evaluating responses to user queries. As NLP continues to advance, knowledge distillation remains a valuable tool for optimizing language models and advancing the capabilities of AI-powered systems.

“Stay connected and support my work through various platforms:

Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal

Paypal: Enjoyed my article? Buy me a coffee! https://paypal.me/alphasingal?country.x=US&locale.x=en_US"

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

Resources:

Knowledge-distillation

Upvote