Implementing Fractional GPUs in Kubernetes with Aliyun Scheduler

Community Article · Published January 14, 2024

This blog provides a detailed approach to sharing a single GPU across multiple containers in a Kubernetes environment using open-source frameworks, and contrasts it with Nvidia MIG, which can partition a GPU into as many as seven smaller instances, each equipped with its own memory, cache, and streaming multiprocessors.

This guide is particularly beneficial for Machine Learning Engineers, Data Scientists, and AI researchers who aim to optimize their GPU resources for specific workload requirements.

Key Takeaways

  • Deploying multiple containers with shared GPU resources.
  • Understanding various methods' pros and cons.
  • A step-by-step guide on using the Aliyun Gpushare Scheduler Extender.

Table of Contents

  1. Understanding Nvidia MIG and Its Limitations
  2. Recommendation: Aliyun Gpushare Scheduler Extender
  3. Step-by-Step Tutorial
  4. Advantages of Aliyun Gpushare Scheduler Extender
  5. Conclusion

Understanding Nvidia MIG and Its Limitations

The Nvidia Multi-Instance GPU (MIG) feature, available in NVIDIA's A100 and subsequent GPUs, allows a single A100 GPU to be divided into up to seven smaller GPUs, each equipped with its own memory, cache, and streaming multiprocessors. This feature is designed to enhance the utilization of GPU resources according to specific workload needs, such as machine learning, data analysis, or graphics processing. Particularly valuable in cloud and data center environments, MIG technology facilitates efficient GPU resource usage, offering flexibility and improved performance across a range of computing tasks. However, we have identified several limitations with the Nvidia MIG driver:

  1. Resource Partitioning: Dividing memory and compute cores among instances might limit resources for each instance, affecting performance for high-demand tasks.
  2. Potential Underutilization: There's a risk of mismatch between workload and resource partitioning, leading to resource underutilization.
  3. Compatibility and Support: MIG technology is limited to certain NVIDIA GPUs like the A100, excluding older models.
  4. Complexity in Management: Managing multiple GPU instances adds complexity, especially in large-scale deployments.
  5. Inter-Instance Communication: Communication challenges may arise due to the logical isolation of GPU instances.
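
For context, this is roughly what MIG partitioning looks like when driven directly with nvidia-smi on a MIG-capable GPU such as the A100 (outside Kubernetes). Treat it as a sketch: the available profile IDs depend on the GPU model, so list them before creating instances.

# Enable MIG mode on GPU 0 (may require draining workloads and resetting the GPU)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
sudo nvidia-smi mig -lgip

# Create a GPU instance (and its compute instance) from a chosen profile ID
sudo nvidia-smi mig -cgi <profile_id> -C

# Confirm the resulting MIG devices are visible
nvidia-smi -L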

Our Recommendation: Aliyun Gpushare Scheduler Extender

We highly recommend the Aliyun Gpushare Scheduler Extender, available at https://github.com/AliyunContainerService/gpushare-scheduler-extender. Though it requires more advanced configuration in Kubernetes, it proves to be a superior choice. The steps outlined below for AKS (Azure Kubernetes Service) are easily adaptable to other cloud providers.

Step-by-Step Tutorial

Here's a step-by-step guide to setting it up, using AKS (Azure Kubernetes Service) as an example, though it's applicable to other cloud providers as well:

Step 1: Configure Docker Runtime (Skip for Azure/GCP)

For non-Azure/GCP environments, ensure that your /etc/docker/daemon.json is configured to use the NVIDIA container runtime as the default:

sudo vi /etc/docker/daemon.json

Verify that the file contains the following configuration:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
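
After editing daemon.json, restart Docker so the default-runtime change takes effect. On a systemd-managed node this is typically:

sudo systemctl restart docker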

Step 2: Set Up the Scheduler

SSH into a GPU node and prepare the scheduler configuration (this only needs to be done on one node):

cd /etc/kubernetes
sudo curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json

Update kube-scheduler.yaml to use the new config:

cd /tmp/
sudo wget -O kube-scheduler.yaml https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/kube-scheduler-v1.23+.yaml
sudo cp /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
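
kube-scheduler runs as a static pod, so the kubelet should restart it automatically once the manifest is replaced. Assuming a standard static-pod setup, you can confirm it came back up with:

kubectl get pods -n kube-system | grep kube-scheduler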

Step 3: Deploy Device Plugin and Scheduler Controller

Exit the GPU node. From a machine where kubectl is configured for the cluster, deploy the necessary components:

curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
kubectl create -f gpushare-schd-extender.yaml
kubectl create -f device-plugin-rbac.yaml
kubectl create -f device-plugin-ds.yaml

Label the node so it is included in the GPU-sharing node pool:

kubectl label node aks-<your_node_name>-xxxxxxxx-vmss000000 gpushare=true
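
At this point the scheduler extender deployment and the device plugin DaemonSet should be running in the kube-system namespace. A quick sanity check (exact pod names may differ between manifest versions):

kubectl get pods -n kube-system | grep gpushare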

Step 4: Verify GPU Status

Install and run the kubectl plugin to check GPU status:

sudo wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
sudo chmod 755 ./kubectl-inspect-gpushare
./kubectl-inspect-gpushare
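
The binary follows the standard kubectl plugin naming convention (kubectl-<name>), so if you move it onto your PATH you can also invoke it as a kubectl subcommand:

sudo mv kubectl-inspect-gpushare /usr/local/bin/
kubectl inspect gpushare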

Step 5: Configure Scheduler for Node

Modify gpushare-schd-extender.yaml to run the scheduler on a specific node:

vi gpushare-schd-extender.yaml

Update nodeSelector:

nodeSelector:
  kubernetes.io/hostname: aks-<your_node_pool_name>-xxxxxxxx-vmss00000<node_number>
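
If you are unsure of the exact node name to use here, list the nodes; on AKS the node name matches the kubernetes.io/hostname label:

kubectl get nodes -o wide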

Redeploy the scheduler:

kubectl delete -f gpushare-schd-extender.yaml
kubectl apply -f gpushare-schd-extender.yaml

For additional GPU nodes to show up, label each of the other GPU nodes with the same command:

kubectl label node aks-<your_node_name>-xxxxxxxx-vmss00000<X> gpushare=true
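
You can verify which nodes are currently labeled for GPU sharing with:

kubectl get nodes -l gpushare=true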

Step 6: Deploy a Test Pod

Create and deploy a test pod to monitor utilization:

apiVersion: v1
kind: Pod
metadata:
  name: gpushare-test-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpushare-test-pod
      image: "cheyang/gpu-player:v2"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      resources:
        limits:
          aliyun.com/gpu-mem: 5

Save the manifest as test-pods.yaml and deploy it with:

kubectl apply -f test-pods.yaml
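
The aliyun.com/gpu-mem limit is expressed in GiB by default (the device plugin can also be configured to report MiB), so this pod reserves roughly 5 GiB of GPU memory rather than a whole device. Once the pod is running, the allocation should appear in the inspection plugin from Step 4 (use ./kubectl-inspect-gpushare if you did not move it onto your PATH):

kubectl inspect gpushare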

Advantages of the Aliyun Gpushare Scheduler Extender

  1. Flexibility and Scalability: This scheduler excels at dynamically allocating GPU resources to pods based on their immediate requirements. It is particularly beneficial in environments with frequently changing workloads and demands.
  2. Performance: Performance varies with the number of workloads sharing the GPU and each pod's specific demands. However, the advantage lies in all models being able to utilize the GPU's cores, resulting in higher overall utilization.
  3. Resource Allocation: The scheduler facilitates dynamic allocation and balancing of GPU resources. This adaptability is crucial as it allows adjustments in line with changing workload demands. Its approach to bin packing is not constrained by partition sizes, offering greater flexibility.
  4. Compatibility and Support: It supports a broader range of GPUs and is commonly integrated with various virtualization software, enhancing its applicability and versatility.

Conclusion

The Aliyun plugin stands out as a highly effective solution for sharing GPU resources in Kubernetes environments. This guide provides detailed, step-by-step setup instructions and sheds light on the enhanced flexibility, efficiency, and compatibility that Aliyun offers. These qualities establish it as an indispensable tool in the management of complex cloud and data center infrastructures.

For those eager to expand their understanding, I am happy to share further insights. My professional focus is on helping organizations scale up GPU workloads.

During my career, I have developed substantial expertise in setting up a variety of Large Language Models (LLMs) and Diffusion models on Kubernetes. This includes a specialized focus on enabling fractional GPU usage, a strategy that significantly aids in the cost-efficient deployment of these workloads.