Fine Tuning an LLM Using Kubernetes with Intel® Xeon® Scalable Processors

Community blog post
Published January 23, 2024

Introduction

Large language models (LLMs) used for text generation have exploded in popularity, but it's no secret that training them requires a massive amount of compute power due to their large model and dataset sizes. Previously, we have seen how efficient multiple CPUs can be when used with software optimizations such as Intel® Extension for PyTorch and Intel® oneCCL Bindings for PyTorch. Kubernetes (or K8s for short) orchestrates running containerized workloads across a cluster of nodes. In this blog, we take a deep dive into the process of fine tuning meta-llama/Llama-2-7b-hf using the medalpaca/medical_meadow_medical_flashcards dataset with multiple Intel® Xeon® Scalable Processor nodes from a K8s cluster.

Why Kubernetes?

K8s orchestration reminds me of Chef Ramsay yelling out orders on Hell's Kitchen -- the teams work together to fulfill the orders as they are given. Some orders go to the red team, and others go to the blue team. Something simple like a salad might be a quick one person job, but a complicated entree might involve multiple people. Similarly, K8s has a scheduler that assigns jobs to nodes on the cluster based on which nodes are available and the resources required for the application. At any given time, a single node might be running several different jobs, or it could be running a single job that consumes all of its resources. If all of the nodes in a cluster are busy, a job is pending while it waits for node resources to become available.

The handling of unexpected issues can also be compared to K8s. If someone is sent to the medic after they burn their hand, the rest of the team continues to work on orders. Similarly, the rest of the K8s cluster continues to function if a node goes down. Also, if someone in the kitchen overcooks a steak they'll need to redo it. If there's a failure in a K8s pod, it gets restarted.

K8s becomes especially useful when there are a lot of jobs being deployed to the cluster (you don't need Chef Ramsay when you're at home cooking mac and cheese). This could be a team of engineers sharing a cluster of nodes, a bunch of multi-node experiments to run, or a production cluster handling numerous requests. In all of these cases, the Kubernetes control plane manages the cluster resources and coordinates which node will be used to run each pod.

Components

In the tutorial, we will be fine tuning Llama 2 with a Hugging Face dataset using multiple CPU nodes. Several different components are involved to run this job on the cluster. The diagram below visualizes the interactions between the components.

[Figure: diagram of the interactions between the components used to run the job]

Helm Chart

The first component that we're going to talk about is the Helm chart. This is kind of like the recipe for our job. It brings together all the different components that are used for our job and allows us to deploy everything using one helm install command. The K8s resources used in our example are:

  • PyTorchJob with multiple workers for fine tuning the model
  • Persistent Volume Claim (PVC) used as a shared storage location among the workers for dataset and model files
  • Secret for gated models (Optional)
  • Data access pod (Optional)

The Helm chart has a values.yaml file with parameters that are used in the spec files for the K8s resources. Our values file includes parameters such as the name for our K8s resources, the image/tag for the container to use for the worker pods, the number of workers, CPU and memory resources, the arguments for the python script, etc. The values get filled into the K8s spec files when the Helm chart is deployed.

Container

K8s runs jobs in a containerized environment, so the next thing that we're going to need is a docker container. You can think of this like the mixing bowl. In our mixing bowl, we need to include all the dependencies needed for our training job. For optimal performance using Intel Xeon processors, we recommend including Intel® Extension for PyTorch and Intel® oneCCL Bindings for PyTorch.

For this example, we will be using the intel/ai-workflows:torch-2.0.1-huggingface-multinode-py3.9 image from DockerHub, which includes the following packages:

Package Name                                      Version    Purpose
PyTorch                                           2.0.1      Base framework to train models
🤗 Transformers                                   4.35.2     Library used to download and fine tune the Hugging Face model
Intel® Extension for PyTorch                      2.0.100    Extends PyTorch to provide an extra performance boost on Intel® hardware
Intel® oneAPI Collective Communications Library   2.0.0      Deploy PyTorch jobs on multiple nodes

Other key things included in the Dockerfile are MKL, google-perftools, 🤗 PEFT, 🤗 Datasets, and OpenSSH to allow the Intel oneAPI CCL to communicate between containers.

Fine Tuning Script

The python script that we are using fine tunes a causal language model for text generation. It has arguments similar to what you'd see in the Hugging Face example scripts. The parameters in values.yaml are the same as the script's arguments, just converted to camelCase to match the Helm naming convention. This script can be used to fine tune a chatbot or for instruction tuning.

Chat-based models will use the following prompt format:

<s>[INST] <<SYS>>
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.<</SYS>>

Calculate the median of the following list of integers. 6, 5, 8, 1, 2, 1, 7 [/INST] 5 </s>

Other models will be instruction tuned using the following prompt format:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Calculate the median of the following list of integers.

### Input:
6, 5, 8, 1, 2, 1, 7

### Response:
5
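
As a rough illustration of how the two formats above could be assembled, here is a minimal Python sketch. The function names build_chat_prompt and build_instruction_prompt are hypothetical and are not taken from the tutorial's script, which may build its prompts differently (for example, by letting the tokenizer add the <s> and </s> tokens).

```python
# Hypothetical helpers that reproduce the two prompt layouts shown above.
SYSTEM_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request."
)


def build_chat_prompt(instruction, context, response, system_prompt=SYSTEM_PROMPT):
    """Format one training example in the chat layout ([INST] / <<SYS>> tags)."""
    user_message = f"{instruction} {context}".strip()
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}<</SYS>>\n\n"
        f"{user_message} [/INST] {response} </s>"
    )


def build_instruction_prompt(instruction, context, response, system_prompt=SYSTEM_PROMPT):
    """Format one training example in the instruction / input / response layout."""
    return (
        f"{system_prompt}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{context}\n\n"
        f"### Response:\n{response}"
    )


example = (
    "Calculate the median of the following list of integers.",
    "6, 5, 8, 1, 2, 1, 7",
    "5",
)
print(build_chat_prompt(*example))
print()
print(build_instruction_prompt(*example))
```

Passing a different system_prompt is one way to picture the customization described next: the layout stays the same and only the leading text changes.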

The prompt strings can also be customized with script parameters in order to provide a prompt that is more appropriate for your job. For example, you might want to provide a more targeted prompt, such as "You are a helpful, respectful and honest finance assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

The python script needs to be included in the docker container, so let's add that to our mixing bowl. Our Dockerfile does a COPY to add the scripts to the image.

Storage

We need a storage location that can be shared among the workers to access the dataset and save model files. We are using a vanilla K8s cluster with an NFS backed storage class. If you are using a cloud service provider, you could use a cloud storage bucket instead.

Thinking back to our cooking analogy, storage can be compared to the pantry or fridge. You wouldn't add the whole pantry into the mixing bowl, and similarly, our NFS storage location doesn't get added to the container. Instead, the storage location gets mounted into the container so that we have access to read and write from that location without it being built into the image. To achieve this, we are using a persistent volume claim (PVC).

Secret

The last ingredient that we're adding in is the secret sauce. Gated or private models require you to be logged in to download the model. For authentication from the K8s job, we define a secret with a Hugging Face User Read Only Access Token. The secret gets used to set the HUGGING_FACE_HUB_TOKEN environment variable. If the model being trained is not gated or private, this isn't required.

Cluster Requirements

This tutorial requires Kubeflow to be installed on your cluster. Kubeflow provides features and custom resources that simplify running and scaling machine learning workloads on K8s clusters. In this example, we are going to be using the PyTorch training operator from Kubeflow. The PyTorch training operator allows us to run distributed PyTorch training jobs on the cluster without needing to manually set environment variables such as MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE. After Kubeflow has been installed, verify that the PyTorch custom resource has been successfully deployed to your cluster, using kubectl get crd pytorchjobs.kubeflow.org and ensuring that the output is similar to:

NAME                       CREATED AT
pytorchjobs.kubeflow.org   2023-03-24T15:42:17Z
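
To make the role of those environment variables more concrete, the sketch below shows how a training script could read them once the operator has set them in each worker pod. The helper name read_distributed_env and the single-process fallback values are assumptions for illustration; they are not part of Kubeflow or of the tutorial's script.

```python
import os


def read_distributed_env(env=os.environ):
    """Collect the rendezvous settings that the training operator sets per worker.

    The fallback values below are assumptions that describe a single local
    process; they are only used when a variable is missing.
    """
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


settings = read_distributed_env()
print(f"worker {settings['rank']} of {settings['world_size']}, "
      f"rendezvous at {settings['master_addr']}:{settings['master_port']}")
```

Without the operator, each of these values would have to be set by hand for every worker, which is the manual step it removes.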

Our cluster uses 4th Gen Intel® Xeon® Scalable processors in order to take advantage of Intel® Advanced Matrix Extensions (Intel® AMX) and bfloat16. If role-based access control (RBAC) is enabled on your cluster, listing nodes and many other cluster-wide commands require specific roles to be granted to the user. The kubectl auth can-i get nodes command will return "yes" if you are able to list the nodes with kubectl get nodes, for example:

NAME                 STATUS     ROLES                      AGE   VERSION
k8s-spr-01           Ready      worker                     69d   v1.22.17
k8s-spr-02           Ready      worker                     68d   v1.22.17
k8s-spr-03           Ready      worker                     65d   v1.22.17
k8s-spr-04           Ready      worker                     65d   v1.22.17

Otherwise, consult your cluster admin to get a list of the nodes available to your user group.

Once you know the names of the nodes, use kubectl describe node <node name> to get its CPU and memory capacity. We will be using this information later when setting up the specification for the worker pods.

Tutorial: Fine Tuning Llama 2 using a Kubernetes Cluster

Client Requirements:

  • kubectl installed and configured to connect to your cluster
  • Helm
  • Download and extract the Helm chart used in this tutorial:
    wget https://storage.googleapis.com/public-artifacts/helm_charts/tlt_hf_helm_chart.tar.gz
    tar -xvzf tlt_hf_helm_chart.tar.gz
    
    This tar file includes 3 different examples for the Helm values file:
    • hf_k8s/chart/values.yaml: A template for running your own workload
    • hf_k8s/chart/medical_meadow_values.yaml: Fine tune Llama 2 using the medalpaca/medical_meadow_medical_flashcards dataset
    • hf_k8s/chart/financial_chatbot_values.yaml: Fine tune a Llama 2 chatbot using a subset of a finance dataset

    Select one of these to use as your values file for the instructions below.

Step 1: Set up the secret with your Hugging Face token

Get a Hugging Face token with read access, then use your terminal to get the base64 encoding of the token with echo <your token> | base64.

For example:

$ echo hf_ABCDEFG | base64
aGZfQUJDREVGRwo=

Copy and paste the encoded token value into the encodedToken field in the secret section of your values yaml file. For example, to run the Medical Meadow fine tuning job, open the hf_k8s/chart/medical_meadow_values.yaml file and paste in your encoded token on line 23:

secret:
  encodedToken: aGZfQUJDREVGRwo=
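
The encoding can also be reproduced in Python, which makes one detail visible: echo appends a newline, so the encoded value in the example above contains the token followed by a newline character. The sketch below uses a hypothetical helper named encode_token. Whether the trailing newline matters depends on how the chart and libraries consume the secret, which we have not verified here; echo -n is a common way to leave it out.

```python
import base64


def encode_token(token: str, trailing_newline: bool = False) -> str:
    """Base64-encode a token, optionally with the newline that plain echo adds."""
    raw = token + ("\n" if trailing_newline else "")
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


# Same bytes as `echo hf_ABCDEFG | base64` (token plus newline)
print(encode_token("hf_ABCDEFG", trailing_newline=True))

# Same bytes as `echo -n hf_ABCDEFG | base64` (token only)
print(encode_token("hf_ABCDEFG"))

# Decoding shows exactly what the secret would contain
print(repr(base64.b64decode("aGZfQUJDREVGRwo=").decode("utf-8")))
```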

Step 2: Customize your values.yaml parameters

The hf_k8s/chart/medical_meadow_values.yaml file is set up to fine tune meta-llama/Llama-2-7b-hf using the medalpaca/medical_meadow_medical_flashcards dataset. If you are using the hf_k8s/chart/values.yaml template, fill in either a datasetName to use a Hugging Face dataset, or provide a dataFile path.

The distributed.train section of the file can be changed to adjust the training job's dataset, epochs, max steps, learning rate, LoRA config, bfloat16 usage, etc.

There are other parameters in the values.yaml file that need to be configured based on your cluster:

elasticPolicy:
  rdzvBackend: c10d
  minReplicas: 1
  maxReplicas: 4  # Must be greater than or equal to the number of distributed.workers
  maxRestarts: 30

distributed:
  workers: 4

...

# Resources allocated to each worker
resources:
  cpuRequest: 200                 # Update based on your hardware config
  cpuLimit: 200
  memoryRequest: 226Gi
  memoryLimit: 226Gi
  nodeSelectorLabel: node-type    # Update with your node label/value
  nodeSelectorValue: spr

# Persistent volume claim storage resources
storage:
  storageClassName: nfs-client     # Update with your cluster's storage class name
  resources: 50Gi
  pvcMountPath: /tmp/pvc-mount

The CPU resource limits/requests in the yaml are defined in cpu units where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in order to leave some resources for the kubelet and OS. In order to get "guaranteed" quality of service for the worker pods, set the same CPU and memory amounts for both the requests and limits.
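
The rules above can be turned into a quick sanity check before deploying. The following is a minimal sketch, assuming the values have been loaded into a dictionary with the same keys as the snippet above; the function names and the node capacity numbers (224 CPUs, 256 Gi) are made up for illustration and should be replaced with what kubectl describe node reports for your hardware.

```python
def parse_gi(quantity: str) -> float:
    """Parse a memory quantity such as '226Gi' (only the Gi suffix is handled)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported memory quantity: {quantity}")
    return float(quantity[:-2])


def check_values(values: dict, node_cpus: int, node_memory_gi: float) -> list:
    """Return a list of problems found in the worker settings (empty if none)."""
    problems = []
    res = values["resources"]
    if values["elasticPolicy"]["maxReplicas"] < values["distributed"]["workers"]:
        problems.append("elasticPolicy.maxReplicas is less than distributed.workers")
    if res["cpuLimit"] >= node_cpus:
        problems.append("cpuLimit leaves no CPU headroom on the node")
    if parse_gi(res["memoryLimit"]) >= node_memory_gi:
        problems.append("memoryLimit leaves no memory headroom on the node")
    if (res["cpuRequest"] != res["cpuLimit"]
            or res["memoryRequest"] != res["memoryLimit"]):
        problems.append("requests differ from limits, so QoS is not Guaranteed")
    return problems


values = {
    "elasticPolicy": {"minReplicas": 1, "maxReplicas": 4},
    "distributed": {"workers": 4},
    "resources": {
        "cpuRequest": 200, "cpuLimit": 200,
        "memoryRequest": "226Gi", "memoryLimit": "226Gi",
    },
}

# Hypothetical node capacity; substitute the numbers from your own cluster.
print(check_values(values, node_cpus=224, node_memory_gi=256))
```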

Step 3: Deploy the Helm chart to the cluster

Deploy the Helm chart to the cluster using the kubeflow namespace:

# Navigate to the hf_k8s directory from the extracted tar file
cd hf_k8s

# Deploy the job using the Helm chart, specifying your values file name with the -f parameter
helm install --namespace kubeflow -f chart/medical_meadow_values.yaml llama2-distributed ./chart

Step 4: Monitor the job

After the Helm chart is deployed to the cluster, the K8s resources like the secret, PVC, and worker pods are created. The job can be monitored by looking at the pod status using kubectl get pods. At first, the pods will show as "Pending" as the containers get pulled and created, then the status should change to "Running".

$ kubectl get pods
NAME                                                READY   STATUS             RESTARTS         AGE
medical-meadow-dataaccess                           1/1     Running            0                1h30m
medical-meadow-pytorchjob-worker-0                  1/1     Running            0                1h30m
medical-meadow-pytorchjob-worker-1                  1/1     Running            0                1h30m
medical-meadow-pytorchjob-worker-2                  1/1     Running            0                1h30m
medical-meadow-pytorchjob-worker-3                  1/1     Running            0                1h30m

Watch the training logs using kubectl logs <pod name>. You can also add -f to stream the log.

$ kubectl logs medical-meadow-pytorchjob-worker-0
...
72%|███████▏  | 2595/3597 [4:08:05<1{'loss': 2.2737, 'learning_rate': 5.543508479288296e-05, 'epoch': 2.17}
...
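
If you want to track the loss over time, the metrics dictionary can be pulled out of log lines like the one above. This is a minimal sketch with a hypothetical helper named parse_metrics; it assumes each metrics entry is printed as a Python-style dictionary on a single line, as in the sample output.

```python
import ast
import re


def parse_metrics(log_line: str):
    """Return the metrics dict embedded in a log line, or None if there is none."""
    match = re.search(r"\{[^{}]*\}", log_line)
    if match is None:
        return None
    try:
        metrics = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    return metrics if isinstance(metrics, dict) else None


line = ("72%|███████▏  | 2595/3597 [4:08:05<1{'loss': 2.2737, "
        "'learning_rate': 5.543508479288296e-05, 'epoch': 2.17}")
metrics = parse_metrics(line)
print(metrics["loss"], metrics["epoch"])
```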

Step 5: Download the trained model

After the job completes, the trained model can be copied from /tmp/pvc-mount/output/saved_model (the path defined in your values file for the train.outputDir parameter) to the local system using the following command:

kubectl cp --namespace kubeflow medical-meadow-dataaccess:/tmp/pvc-mount/output/saved_model .

Step 6: Clean up

Finally, the resources can be deleted from the cluster using the helm uninstall command with the name of the Helm release to delete. A list of all the deployed Helm releases can be seen using helm list.

helm uninstall --namespace kubeflow llama2-distributed

After uninstalling the Helm chart, the resources on the cluster should show a status of "Terminating", and then they will eventually disappear.

Results

We fine tuned Llama2-7b using the Medical Meadow Flashcards dataset for 3 epochs on our vanilla K8s cluster using 4th Gen Intel® Xeon® Scalable processors. We used the default parameters from the values.yaml file and then experimented with changing the number of workers and turning mixed precision training on and off.

Measuring the accuracy of text generation models can be tricky, since there isn't a clear right or wrong answer. Instead, we are using the perplexity metric to roughly gauge how confident the model is in its predictions. We reserved 20% of the data for evaluation.

We saw a notable reduction in training time when increasing the number of workers. You won't see an exact 2x and 4x improvement when going from a single node to 2 nodes and 4 nodes, since there is some overhead for communication between the nodes and some resources are reserved for the OS. We also saw a significant performance improvement when using BFloat16 training instead of FP32. In all cases, the perplexity value stayed relatively consistent. This is a good thing, because sometimes models can see an accuracy drop when scaling the job or training with fewer bits (like BFloat16).
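
For readers unfamiliar with perplexity, it is commonly computed as the exponential of the mean cross-entropy loss on the evaluation set, so lower values mean the model assigns higher probability to the held-out text. The sketch below assumes that convention; we have not confirmed that the tutorial's script computes it exactly this way.

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """Perplexity under the common convention exp(mean cross-entropy loss)."""
    return math.exp(mean_cross_entropy_loss)


# A loss of ln(8) means the model is, on average, as uncertain as if it were
# choosing uniformly among 8 equally likely next tokens.
print(perplexity(math.log(8)))
```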

Next Steps

Now that we've fine tuned Llama 2 using the Medical Meadow Flashcards dataset, you're probably wondering how to use this information to run your own workload on K8s. If you're using Llama 2 or a similar generative text model, you are in luck, because the same docker container and script can be reused. You'd just have to edit the parameters in the distributed.train section of the values.yaml file to use your dataset and tweak other parameters (learning rate, epochs, etc.) as needed. If you want to use your own fine tuning script, you will need to build a docker container that includes the libraries needed to run the training job, along with your script. The image needs to be copied to the cluster nodes or pushed to a container registry. The container image and tag need to be updated in the values.yaml file along with the script name and all of the python parameters for your script.

All of the scripts, Dockerfile, and spec files for the tutorial can be found in our GitHub repo.

For other resources on distributed training, check out the Hugging Face documentation for efficient training on multiple CPUs.

In our next blog, we will cover prompting the Llama 2 model to generate responses for user given prompts (inference).

Acknowledgments

Thank you to my colleagues who made contributions and helped to review this blog: Harsha Ramayanam, Omar Khleif, Abolfazl Shahbazi, Rajesh Poornachandran, Melanie Buehler, Daniel De Leon, Tyler Wilbers, and Matthew Fleetwood.

Citations

@misc{touvron2023llama,
      title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
      author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
      year={2023},
      eprint={2307.09288},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{han2023medalpaca,
  title={MedAlpaca--An Open-Source Collection of Medical Conversational AI Models and Training Data},
  author={Han, Tianyu and Adams, Lisa C and Papaioannou, Jens-Michalis and Grundmann, Paul and Oberhauser, Tom and L{\"o}ser, Alexander and Truhn, Daniel and Bressem, Keno K},
  journal={arXiv preprint arXiv:2304.08247},
  year={2023}
}