dstack: Your LLM Launchpad - From Fine-Tuning to Serving, Simplified

Community Article Published August 22, 2024


dstack is an open-source project that lets developers provision and manage virtual machine instances seamlessly, making multi-node fine-tuning of large language models (LLMs) more accessible. By combining dstack with cloud platforms such as Google Cloud Platform (GCP), Amazon Web Services (AWS), Microsoft Azure, and Oracle Cloud Infrastructure (OCI), you unlock a streamlined process for setting up and managing distributed training environments. This guide walks you through fine-tuning Gemma 7B using dstack on GCP, incorporating best practices from the Hugging Face alignment-handbook, and then deploying the model for serving with Hugging Face’s Text Generation Inference (TGI).

NOTE: The experiment in this blog post was run on 3 nodes, each equipped with 2 x A10 GPUs, with Gemma 7B chosen as the base model to fine-tune.

Setting up dstack for GCP

With four simple steps, we can interact with GCP via dstack. First, we need to install the dstack Python package. Since dstack supports multiple cloud providers, we can narrow down the scope to GCP:

$ pip install dstack[gcp]

Next, we need to configure GCP-specific credentials inside ~/.dstack/server/config.yml. The example below assumes you are using application default credentials, so make sure you have authenticated via gcloud auth application-default login before going forward. For service account credentials, please follow dstack’s official documentation.

projects:
- name: main
  backends:
  - type: gcp
    project_id: <your-gcp-project-id>
    creds:
      type: default

Next, we can boot up the dstack server as shown below.

$ dstack server

INFO     Applying ~/.dstack/server/config.yml
INFO     Configured the main project in ~/.dstack/config.yml
INFO     The admin token is ab6e8759-9cd9-4e84-8d47-5b94ac877ebf
INFO     The dstack server 0.18.4 is running at http://127.0.0.1:3000

Take note of the admin token, the server’s IP address, and the port. Then initialize your dstack project inside a folder of your choice, and you are all set! From this point on, every step is the same regardless of which infrastructure backend you use.

# inside a project folder

$ dstack init

$ dstack config --url http://127.0.0.1:3000 \
  --project main \
  --token ab6e8759-9cd9-4e84-8d47-5b94ac877ebf

Fine-Tuning an LLM with dstack

Let's dive into the practical steps for fine-tuning your LLM. Here's the command to initiate the fine-tuning job with dstack:

$ ACCEL_CONFIG_PATH=fsdp_qlora_full_shard.yaml \
  FT_MODEL_CONFIG_PATH=qlora_finetune_config.yaml \
  HUGGING_FACE_HUB_TOKEN=xxxx \
  WANDB_API_KEY=xxxx \
  dstack apply . -f ft.task.dstack.yml

FT_MODEL_CONFIG_PATH, ACCEL_CONFIG_PATH, HUGGING_FACE_HUB_TOKEN, and WANDB_API_KEY are environment variables declared in dstack’s run configuration; they will eventually be set as environment variables inside the virtual machines provisioned on GCP. dstack apply runs the job defined in ft.task.dstack.yml on GCP, copies the current directory’s contents to the instances, and uses it as the working directory.

NOTE: If these environment variables are defined in the current terminal session of your local machine, you don’t need to set them up explicitly.

Let’s go through each YAML file. This blog post only highlights the important parts; the full contents are available in this repository. First, let’s look at qlora_finetune_config.yaml and fsdp_qlora_full_shard.yaml, which define how we want to fine-tune the LLM and how we want to leverage the underlying GPU infrastructure for fine-tuning, respectively.

qlora_finetune_config.yaml is the configuration, in a format alignment-handbook understands, that describes how you want to fine-tune the LLM.

# Model arguments
model_name_or_path: google/gemma-7b
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml
torch_dtype: bfloat16
bnb_4bit_quant_storage: bfloat16

# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  ...

# Data training arguments
dataset_mixer:
  chansung/mental_health_counseling_conversations: 1.0
dataset_splits:
  - train
  - test
...
  • Model arguments

    • model_name_or_path: Google’s Gemma 7B is chosen as the base model.
    • tokenizer_name_or_path: alignment-handbook uses the apply_chat_template() method of the chosen tokenizer. This tutorial uses a ChatML tokenizer instead of Gemma 7B’s standard conversation template.
    • torch_dtype and bnb_4bit_quant_storage: these two values should be set to the same dtype when using the FSDP+QLoRA fine-tuning method. Since Gemma 7B does not easily fit on a single A10 GPU, this blog post uses FSDP+QLoRA to shard the model across 2 x A10 GPUs while leveraging the QLoRA technique.
  • LoRA arguments: LoRA-specific configurations. Since this blog post leverages the FSDP+QLoRA technique, load_in_4bit is set to true. The other values can vary from experiment to experiment.

  • Data training arguments: we prepared a dataset based on Amod’s mental health counseling conversations dataset. Since alignment-handbook only understands data in the form [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...], which the tokenizer’s apply_chat_template() method can interpret, the prepared dataset is essentially the original dataset converted into an apply_chat_template()-compatible format (see the sketch below).
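For illustration, below is a minimal sketch of what that conversion could look like. It assumes the original dataset exposes Context and Response columns and pushes the result to a placeholder repository; the actual preprocessing behind chansung/mental_health_counseling_conversations may differ in its details.

# Hypothetical conversion sketch: reshape each (Context, Response) pair into the
# "messages" format that apply_chat_template() and alignment-handbook expect.
from datasets import load_dataset

def to_messages(example):
    return {
        "messages": [
            {"role": "user", "content": example["Context"]},
            {"role": "assistant", "content": example["Response"]},
        ]
    }

raw = load_dataset("Amod/mental_health_counseling_conversations", split="train")
chat = raw.map(to_messages, remove_columns=raw.column_names)

# Create the train/test splits referenced by dataset_splits, then push the
# converted dataset to the Hub under a placeholder repository name.
chat = chat.train_test_split(test_size=0.1)
chat.push_to_hub("<your-username>/mental_health_counseling_conversations")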

Now that the fine-tuning configuration is in place, it’s time to define how the underlying infrastructure should behave for the fine-tuning job, and that’s where fsdp_qlora_full_shard.yaml comes in.

compute_environment: LOCAL_MACHINE
distributed_type: FSDP  # Use Fully Sharded Data Parallelism
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  # ... (other FSDP configurations)
# ... (other configurations)
  • distributed_type: FSDP indicates the use of Fully Sharded Data Parallel (FSDP), a technique that enables training large models that would otherwise not fit on a single GPU.
  • fsdp_config: These settings control how FSDP operates, such as how the model is sharded (fsdp_sharding_strategy) and whether parameters are offloaded to the CPU (fsdp_offload_params).


With distributed_type set to FSDP and fsdp_config’s fsdp_sharding_strategy set to FULL_SHARD, the model is sharded across all GPUs on all nodes. If you instead want to shard the model within a single node and replicate that sharded instance on every other node, set fsdp_sharding_strategy to HYBRID_SHARD. In that case, each node holds its own copy of the model, sharded across the GPUs within that node, and each copy processes different parts/batches of the given dataset.
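For reference, switching strategies would be a small change to fsdp_qlora_full_shard.yaml; the snippet below is a hypothetical variant of the config above, not part of this blog post’s actual setup.

fsdp_config:
  # Shard within each node and replicate the sharded model across nodes
  fsdp_sharding_strategy: HYBRID_SHARD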

There are other important variables such as machine_rank, num_machines, and num_processes, but their values are better injected at runtime based on the target environment, since that makes it easy to switch between infrastructures with different specs. You will see how dstack does this in the accelerate launch command below.

The Power of dstack: Configuration Made Easy

Along with fsdp_qlora_full_shard.yaml, at the heart of our multi-node setup is ft.task.dstack.yml. dstack greatly simplifies defining the complex configuration of a distributed training environment.

type: task
python: "3.11"
nodes: 3
env:
  - ACCEL_CONFIG_PATH
  - FT_MODEL_CONFIG_PATH
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
commands:
  # ... (setup steps, cloning repo, installing requirements)
  - ACCELERATE_LOG_LEVEL=info accelerate launch \
      --config_file recipes/custom/accel_config.yaml \
      --main_process_ip=$DSTACK_MASTER_NODE_IP \
      --main_process_port=8008 \
      --machine_rank=$DSTACK_NODE_RANK \
      --num_processes=$DSTACK_GPUS_NUM \
      --num_machines=$DSTACK_NODES_NUM \
      scripts/run_sft.py recipes/custom/config.yaml
ports:
  - 6006
resources:
  gpu: 1..2
  shm_size: 24GB

Key points to highlight:

  • Seamless Integration: dstack integrates smoothly with Hugging Face’s open-source ecosystem. In particular, you can use the accelerate library with the configuration we defined in fsdp_qlora_full_shard.yaml as usual.
  • Automatic Configuration: the $DSTACK_MASTER_NODE_IP, $DSTACK_NODE_RANK, $DSTACK_GPUS_NUM, and $DSTACK_NODES_NUM variables are set automatically by dstack, reducing manual setup.
  • Resource Allocation: dstack makes it easy to specify the number of nodes (nodes: 3) and GPUs per node (gpu: 1..2) for your fine-tuning job. For this blog post, that means three nodes, each equipped with 2 x A10 (24GB) GPUs.

Serving Your Fine-Tuned Model

Once your model is fine-tuned, dstack makes it a breeze to serve it on GCP using Hugging Face's Text Generation Inference (TGI) framework. Here's an example of how you can define a service in dstack for serving your fine-tuned model securely:

type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=chansung/mental_health_counseling_merged_v0.1
commands:
  - text-generation-launcher \
    --max-input-tokens 512 --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 --port 8000
port: 8000
resources:
  gpu:
    memory: 48GB

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: tgi
  type: chat
  name: chansung/mental_health_counseling_merged_v0.1

Key advantages of this approach:

  • Secure HTTPS Gateway: dstack simplifies the process of setting up a secure HTTPS connection through a gateway, a crucial aspect of production-level model serving.
  • Optimized for Inference: The TGI framework is designed for efficient text generation inference, ensuring your model delivers responsive and reliable results.

At this point, you can interact with the serving instance via the standard curl command, Python’s requests library, the OpenAI SDK, or Hugging Face’s InferenceClient. For instance, the snippet below shows a curl example.

curl -X POST https://black-octopus-1.deep-diver-main.sky.dstack.ai/generate \
  -H "Authorization: Bearer <dstack-token>" \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"I feel bad .....","parameters":{"max_new_tokens":128}}'
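The same request can also be issued from Python. Below is a minimal sketch using Hugging Face’s InferenceClient; the endpoint URL and token are the same placeholders as in the curl example above.

# Query the deployed TGI service with Hugging Face's InferenceClient.
# The endpoint URL and token below are placeholders from the curl example.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://black-octopus-1.deep-diver-main.sky.dstack.ai",
    token="<dstack-token>",
)

output = client.text_generation(
    "I feel bad .....",
    max_new_tokens=128,
)
print(output)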

Also, if you are using dstack Sky, you can interact with the deployed model directly from the dstack Chat UI. dstack Sky is a fully managed cloud platform that lets you manage your own cloud resources for free; alternatively, dstack.ai can provide resource quotas from various cloud service providers at competitive market prices.


Figure 1. Chat UI on dstack Sky

Conclusion

By following the steps outlined in this guide, you've unlocked a powerful approach to fine-tuning and deploying LLMs using the combined capabilities of dstack, GCP, and Hugging Face's ecosystem. You can now leverage dstack's user-friendly interface to manage your GCP resources effectively, streamlining the process of setting up distributed training environments for your LLM projects. Furthermore, the integration with Hugging Face's alignment-handbook and TGI framework empowers you to fine-tune and serve your models seamlessly, ensuring they're optimized for performance and ready for real-world applications. We encourage you to explore the possibilities further and experiment with different models and configurations to achieve your desired outcomes in the world of natural language processing.

Bonus

  • dstack fleet enables you to provision resources in both cloud and on-premise environments, keeping the resources you need available even before any task is executed. This is particularly useful when you need the benefits of dstack without directly accessing cloud resources, or in any situation where efficient resource management and provisioning across cloud and on-premise environments is desired. A rough sketch of a fleet configuration is shown after this list.

  • dstack volume allows you to create persistent volumes and attach them to your development environments, tasks, and services, so that data persists and can be shared across runs. Volumes are currently experimental and work with the aws, gcp, and runpod backends. You can define a configuration file to create a new volume or to register an existing one; once a volume is created, it can be attached to dev environments, tasks, and services. A sketch of a volume configuration also appears below.
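To make these two features more concrete, here are rough configuration sketches. The field names follow dstack’s documentation at the time of writing and may differ in your dstack version; the names, regions, and sizes are placeholders rather than values used in this blog post. A cloud fleet that keeps three GPU nodes provisioned ahead of time could look roughly like this:

type: fleet
# Hypothetical fleet name; resources mirror the task's resources block above
name: a10-fleet
nodes: 3
resources:
  gpu: 1..2

As with tasks and services, such a configuration would be applied with the same dstack apply command shown earlier. Similarly, a persistent volume and its attachment to a run might be sketched as follows:

# Create (or register) a volume; backend, region, and size are placeholders.
type: volume
name: ft-data-volume
backend: gcp
region: us-central1
size: 100GB

# Inside a task, service, or dev environment configuration, the volume could
# then be mounted at a path of your choice, e.g.:
volumes:
  - name: ft-data-volume
    path: /checkpoints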