Fine-tuning a Code LLM on Custom Code on a Single GPU

Authored by: Maria Khalusova

Publicly available code LLMs such as Codex, StarCoder, and Code Llama are great at generating code that follows general programming principles and syntax, but they may not conform to an organization's internal conventions or be aware of certain proprietary libraries.

In this notebook, we'll show how you can fine-tune a code LLM so that it better understands your company's or organization's coding style and conventions. Since code LLMs are quite large, fine-tuning them the traditional way can be resource-draining. Don't worry, though: we'll show you a few tricks that let you complete the fine-tuning on a single GPU.

Dataset

For this example, we picked the top 10 public Hugging Face repositories on GitHub. We excluded non-code files such as images, audio files, presentations, and so on. For Jupyter notebooks, we kept only the cells containing code. The resulting code is stored as a dataset that you can find on the Hugging Face Hub under smangrul/hf-stack-v1. It contains the repo id, file path, and file content.
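If you want a quick look at what these records contain before training, here is a minimal sketch that streams the first couple of examples and prints their fields (it assumes only that the datasets library is installed; the code itself lives in the content column as described above):

from datasets import load_dataset

preview = load_dataset("smangrul/hf-stack-v1", data_dir="data", split="train", streaming=True)
for example in preview.take(2):
    # each record is a plain dict; print its keys and the start of the file content
    print(list(example.keys()))
    print(example["content"][:200])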

Model

We'll fine-tune bigcode/starcoderbase-1b, a 1B-parameter model trained on 80+ programming languages. It is a gated model, so if you plan to run this notebook with this exact model, you'll need to request access on its model page. Log in to your Hugging Face account to do so:

from huggingface_hub import notebook_login

notebook_login()

To get started, let’s install all the necessary libraries. As you can see, in addition to transformers and datasets, we’ll be using peft, bitsandbytes, and flash-attn to optimize the training.

By employing parameter-efficient training techniques, we can run this notebook on a single A100 High-RAM GPU.

!pip install -q transformers datasets peft bitsandbytes flash-attn

Now let's define some variables. Feel free to play around with these.

MODEL = "bigcode/starcoderbase-1b"  # Model checkpoint on the Hugging Face Hub
DATASET = "smangrul/hf-stack-v1"  # Dataset on the Hugging Face Hub
DATA_COLUMN = "content"  # Column name containing the code content

SEQ_LENGTH = 2048  # Sequence length

# Training arguments
MAX_STEPS = 2000  # max_steps
BATCH_SIZE = 16  # batch_size
GR_ACC_STEPS = 1  # gradient_accumulation_steps
LR = 5e-4  # learning_rate
LR_SCHEDULER_TYPE = "cosine"  # lr_scheduler_type
WEIGHT_DECAY = 0.01  # weight_decay
NUM_WARMUP_STEPS = 30  # num_warmup_steps
EVAL_FREQ = 100  # eval_freq
SAVE_FREQ = 100  # save_freq
LOG_FREQ = 25  # log_freq
OUTPUT_DIR = "peft-starcoder-lora-a100"  # output_dir
BF16 = True  # bf16
FP16 = False  # no_fp16

# FIM transformation arguments
FIM_RATE = 0.5  # fim_rate
FIM_SPM_RATE = 0.5  # fim_spm_rate

# LORA
LORA_R = 8  # lora_r
LORA_ALPHA = 32  # lora_alpha
LORA_DROPOUT = 0.0  # lora_dropout
LORA_TARGET_MODULES = "c_proj,c_attn,q_attn,c_fc,c_proj"  # lora_target_modules

# bitsandbytes config
USE_NESTED_QUANT = True  # use_nested_quant
BNB_4BIT_COMPUTE_DTYPE = "bfloat16"  # bnb_4bit_compute_dtype

SEED = 0
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    logging,
    set_seed,
    BitsAndBytesConfig,
)

set_seed(SEED)

Prepare the data

Begin by loading the data. Since the dataset is likely to be quite large, make sure to enable streaming mode. Streaming allows us to load the data progressively as we iterate over the dataset, instead of downloading the whole dataset at once.

We'll reserve the first 4000 examples as the validation set, and everything else will be the training data.

from datasets import load_dataset
import torch
from tqdm import tqdm


dataset = load_dataset(
    DATASET,
    data_dir="data",
    split="train",
    streaming=True,
)

valid_data = dataset.take(4000)
train_data = dataset.skip(4000)
train_data = train_data.shuffle(buffer_size=5000, seed=SEED)

At this step, the dataset still contains raw data of arbitrary length. For training, we need inputs of fixed length. Let's create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.

First, let's estimate the average number of characters per token in the dataset, which will help us later estimate the number of tokens in the text buffer. By default, we only take 400 examples (nb_examples) from the dataset. Using a subset of the entire dataset reduces the computational cost while still providing a reasonable estimate of the overall character-to-token ratio.

>>> tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)


>>> def chars_token_ratio(dataset, tokenizer, data_column, nb_examples=400):
...     """
...     Estimate the average number of characters per token in the dataset.
...     """

...     total_characters, total_tokens = 0, 0
...     for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
...         total_characters += len(example[data_column])
...         total_tokens += len(tokenizer(example[data_column]).tokens())

...     return total_characters / total_tokens


>>> chars_per_token = chars_token_ratio(train_data, tokenizer, DATA_COLUMN)
>>> print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")
The character to token ratio of the dataset is: 2.43

The character-to-token ratio can also serve as an indicator of tokenization quality. For instance, a character-to-token ratio of 1.0 would mean that each character is represented by its own token, which is not very meaningful and indicates poor tokenization. In standard English text, one token is typically equivalent to roughly four characters, i.e. a character-to-token ratio of about 4.0. We can expect a lower ratio in a code dataset, but generally speaking, a number between 2.0 and 3.5 can be considered good enough.
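As an optional sanity check (not part of the original pipeline), you can compute the same ratio for a plain-English sentence and a short code snippet with the tokenizer loaded above; the example strings below are purely illustrative:

english_text = "The quick brown fox jumps over the lazy dog."
code_text = "def add(a, b):\n    return a + b\n"

for name, text in [("english", english_text), ("code", code_text)]:
    n_tokens = len(tokenizer(text).tokens())
    print(f"{name}: {len(text) / n_tokens:.2f} characters per token")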

Optional FIM transformations

Autoregressive language models typically generate sequences from left to right. By applying FIM transformations, the model can also learn to infill text. Check out the "Efficient Training of Language Models to Fill in the Middle" paper to learn more about this technique.

We'll define the FIM transformations below and use them when creating the Iterable dataset. However, if you want to omit the transformation step, set fim_rate to 0.
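To build intuition before reading the implementation, here is a toy, string-level illustration of the two orderings. The real code below operates on token ids and uses the tokenizer's special FIM tokens; the literal <fim_...> strings and the fixed split points here are only placeholders:

sample = "def greet(name):\n    print(f'Hello, {name}!')\n"
# pick two split points (chosen arbitrarily here; the real transformation samples them at random)
prefix, middle, suffix = sample[:10], sample[10:30], sample[30:]

# PSM ordering: prefix, suffix, middle
psm = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
# SPM ordering: suffix first, then prefix and middle
spm = f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"

print(psm)
print(spm)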

import functools
import numpy as np


# Helper function to get token ids of the special tokens for prefix, suffix and middle for FIM transformations.
@functools.lru_cache(maxsize=None)
def get_fim_token_ids(tokenizer):
    try:
        FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD = tokenizer.special_tokens_map["additional_special_tokens"][1:5]
        suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = (
            tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD]
        )
    except KeyError:
        suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = None, None, None, None
    return suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id


## Adapted from https://github.com/bigcode-project/Megatron-LM/blob/6c4bf908df8fd86b4977f54bf5b8bd4b521003d1/megatron/data/gpt_dataset.py
def permute(
    sample,
    np_rng,
    suffix_tok_id,
    prefix_tok_id,
    middle_tok_id,
    pad_tok_id,
    fim_rate=0.5,
    fim_spm_rate=0.5,
    truncate_or_pad=False,
):
    """
    Take in a sample (list of tokens) and perform a FIM transformation on it with a probability of fim_rate, using two FIM modes:
    PSM and SPM (with a probability of fim_spm_rate).
    """

    # The if condition will trigger with the probability of fim_rate
    # This means FIM transformations will apply to samples with a probability of fim_rate
    if np_rng.binomial(1, fim_rate):

        # Split the sample into prefix, middle, and suffix, based on randomly generated indices stored in the boundaries list.
        boundaries = list(np_rng.randint(low=0, high=len(sample) + 1, size=2))
        boundaries.sort()

        prefix = np.array(sample[: boundaries[0]], dtype=np.int64)
        middle = np.array(sample[boundaries[0] : boundaries[1]], dtype=np.int64)
        suffix = np.array(sample[boundaries[1] :], dtype=np.int64)

        if truncate_or_pad:
            # calculate the new total length of the sample, taking into account tokens indicating prefix, middle, and suffix
            new_length = suffix.shape[0] + prefix.shape[0] + middle.shape[0] + 3
            diff = new_length - len(sample)

            # truncate or pad if there's a difference in length between the new length and the original
            if diff > 0:
                if suffix.shape[0] <= diff:
                    return sample, np_rng
                suffix = suffix[: suffix.shape[0] - diff]
            elif diff < 0:
                suffix = np.concatenate([suffix, np.full((-1 * diff), pad_tok_id)])

        # With the probability of fim_spm_rate, apply the SPM variant of FIM transformations
        # SPM: suffix, prefix, middle
        if np_rng.binomial(1, fim_spm_rate):
            new_sample = np.concatenate(
                [
                    [prefix_tok_id, suffix_tok_id],
                    suffix,
                    [middle_tok_id],
                    prefix,
                    middle,
                ]
            )
        # Otherwise, apply the PSM variant of FIM transformations
        # PSM: prefix, suffix, middle
        else:

            new_sample = np.concatenate(
                [
                    [prefix_tok_id],
                    prefix,
                    [suffix_tok_id],
                    suffix,
                    [middle_tok_id],
                    middle,
                ]
            )
    else:
        # don't apply FIM transformations
        new_sample = sample

    return list(new_sample), np_rng

Let's define the ConstantLengthDataset, an Iterable dataset that returns constant-length chunks of tokens. To do so, we'll read a buffer of text from the original dataset until we hit the size limit, and then apply the tokenizer to convert the raw text into tokenized inputs. Optionally, we'll perform FIM transformations on some sequences (the proportion of sequences affected is controlled by fim_rate).

Once defined, we can create instances of the ConstantLengthDataset from both the training and validation data.

from torch.utils.data import IterableDataset
from torch.utils.data.dataloader import DataLoader
import random

# Create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.


class ConstantLengthDataset(IterableDataset):
    """
    Iterable dataset that returns constant length chunks of tokens from stream of text files.
        Args:
            tokenizer (Tokenizer): The processor used for processing the data.
            dataset (dataset.Dataset): Dataset with text files.
            infinite (bool): If True the iterator is reset after dataset reaches end else stops.
            seq_length (int): Length of token sequences to return.
            num_of_sequences (int): Number of token sequences to keep in buffer.
            chars_per_token (int): Number of characters per token used to estimate number of tokens in text buffer.
            fim_rate (float): Rate (0.0 to 1.0) that sample will be permuted with FIM.
            fim_spm_rate (float): Rate (0.0 to 1.0) of FIM permutations that will use SPM.
            seed (int): Seed for random number generator.
    """

    def __init__(
        self,
        tokenizer,
        dataset,
        infinite=False,
        seq_length=1024,
        num_of_sequences=1024,
        chars_per_token=3.6,
        content_field="content",
        fim_rate=0.5,
        fim_spm_rate=0.5,
        seed=0,
    ):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.infinite = infinite
        self.current_size = 0
        self.max_buffer_size = seq_length * chars_per_token * num_of_sequences
        self.content_field = content_field
        self.fim_rate = fim_rate
        self.fim_spm_rate = fim_spm_rate
        self.seed = seed

        (
            self.suffix_tok_id,
            self.prefix_tok_id,
            self.middle_tok_id,
            self.pad_tok_id,
        ) = get_fim_token_ids(self.tokenizer)
        if not self.suffix_tok_id and self.fim_rate > 0:
            print("FIM is not supported by tokenizer, disabling FIM")
            self.fim_rate = 0

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        np_rng = np.random.RandomState(seed=self.seed)
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.max_buffer_size:
                    break
                try:
                    buffer.append(next(iterator)[self.content_field])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                    else:
                        more_examples = False
                        break
            tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []

            for tokenized_input in tokenized_inputs:
                # optionally do FIM permutations
                if self.fim_rate > 0:
                    tokenized_input, np_rng = permute(
                        tokenized_input,
                        np_rng,
                        self.suffix_tok_id,
                        self.prefix_tok_id,
                        self.middle_tok_id,
                        self.pad_tok_id,
                        fim_rate=self.fim_rate,
                        fim_spm_rate=self.fim_spm_rate,
                        truncate_or_pad=False,
                    )

                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            examples = []
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    examples.append(input_ids)
            random.shuffle(examples)
            for example in examples:
                self.current_size += 1
                yield {
                    "input_ids": torch.LongTensor(example),
                    "labels": torch.LongTensor(example),
                }


train_dataset = ConstantLengthDataset(
    tokenizer,
    train_data,
    infinite=True,
    seq_length=SEQ_LENGTH,
    chars_per_token=chars_per_token,
    content_field=DATA_COLUMN,
    fim_rate=FIM_RATE,
    fim_spm_rate=FIM_SPM_RATE,
    seed=SEED,
)
eval_dataset = ConstantLengthDataset(
    tokenizer,
    valid_data,
    infinite=False,
    seq_length=SEQ_LENGTH,
    chars_per_token=chars_per_token,
    content_field=DATA_COLUMN,
    fim_rate=FIM_RATE,
    fim_spm_rate=FIM_SPM_RATE,
    seed=SEED,
)

Prepare the model

Now that the data is prepared, it's time to load the model! We're going to load a quantized version of the model.

Quantization reduces memory usage because it represents data with fewer bits. We'll use the bitsandbytes library to quantize the model, as it integrates nicely with transformers. All we need to do is define a bitsandbytes config, and then use it when loading the model.

There are different variants of 4-bit quantization, but generally we recommend NF4 quantization for better performance (bnb_4bit_quant_type="nf4").

The bnb_4bit_use_double_quant option adds a second quantization after the first one to save an additional 0.4 bits per parameter.
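To put that number in perspective, here's a rough back-of-the-envelope estimate (a sketch only; the real footprint also includes activations, the LoRA adapters, and optimizer state):

n_params = 1.1e9  # roughly the size of starcoderbase-1b
weights_gb = n_params * 4 / 8 / 1e9  # 4 bits per weight
double_quant_saving_mb = n_params * 0.4 / 8 / 1e6  # ~0.4 bits per parameter saved
print(f"4-bit weights: ~{weights_gb:.2f} GB; double quantization saves another ~{double_quant_saving_mb:.0f} MB")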

To learn more about quantization, check out the "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA" blog post.

Once the config is defined, pass it to the from_pretrained method to load the quantized version of the model.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft.tuners.lora import LoraLayer

load_in_8bit = False

# 4-bit quantization
compute_dtype = getattr(torch, BNB_4BIT_COMPUTE_DTYPE)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=USE_NESTED_QUANT,
)

device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=load_in_8bit,
    quantization_config=bnb_config,
    device_map=device_map,
    use_cache=False,  # We will be using gradient checkpointing
    trust_remote_code=True,
    use_flash_attention_2=True,
)

When training a quantized model, you need to call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

model = prepare_model_for_kbit_training(model)

Now that the quantized model is ready, we can set up a LoRA configuration. LoRA makes fine-tuning much more efficient by drastically reducing the number of trainable parameters.

To train a model with the LoRA technique, we need to wrap the base model as a PeftModel. This involves defining the LoRA configuration with LoraConfig, and wrapping the original model with get_peft_model() using that LoraConfig.

To learn more about LoRA and its parameters, refer to the PEFT documentation.

>>> # Set up lora
>>> peft_config = LoraConfig(
...     lora_alpha=LORA_ALPHA,
...     lora_dropout=LORA_DROPOUT,
...     r=LORA_R,
...     bias="none",
...     task_type="CAUSAL_LM",
...     target_modules=LORA_TARGET_MODULES.split(","),
... )

>>> model = get_peft_model(model, peft_config)
>>> model.print_trainable_parameters()
trainable params: 5,554,176 || all params: 1,142,761,472 || trainable%: 0.4860310866343243

As you can see, by applying the LoRA technique we now need to train less than 1% of the parameters.
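If you want to see where that percentage comes from, the trainable count can be recomputed by hand from the model's parameters. This is just an illustrative equivalent of print_trainable_parameters(), not a required step; note that the naive total below can undercount the "all params" figure above, because bitsandbytes packs two 4-bit weights per stored element:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} || counted params: {total:,} || ratio: {100 * trainable / total:.4f}%")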

Train the model

Now that we have prepared the data and optimized the model, we are ready to bring everything together to start the training.

To instantiate a Trainer, you need to define the training configuration. The most important part is the TrainingArguments, a class that contains all the attributes used to configure the training.

These are similar to any other kind of model training you may run, so we won't go into detail here.

train_data.start_iteration = 0


training_args = TrainingArguments(
    output_dir=f"Your_HF_username/{OUTPUT_DIR}",
    dataloader_drop_last=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    max_steps=MAX_STEPS,
    eval_steps=EVAL_FREQ,
    save_steps=SAVE_FREQ,
    logging_steps=LOG_FREQ,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    warmup_steps=NUM_WARMUP_STEPS,
    gradient_accumulation_steps=GR_ACC_STEPS,
    gradient_checkpointing=True,
    fp16=FP16,
    bf16=BF16,
    weight_decay=WEIGHT_DECAY,
    push_to_hub=True,
    include_tokens_per_second=True,
)

As a final step, instantiate the Trainer and call the train method.

>>> trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)

>>> print("Training...")
>>> trainer.train()
Training...

Finally, you can push the fine-tuned model to your Hub repository to share with your team.

trainer.push_to_hub()

Inference

Once the model is uploaded to the Hub, we can use it for inference. To do so, we first initialize the original base model and its tokenizer. Next, we merge the fine-tuned weights with the base model.

from peft import PeftModel
import torch

# load the original model first
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=None,
    device_map=None,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# merge fine-tuned weights with the base model
peft_model_id = f"Your_HF_username/{OUTPUT_DIR}"
model = PeftModel.from_pretrained(base_model, peft_model_id)
model.merge_and_unload()

Now we can use the merged model for inference. For convenience, we'll define a get_code_completion function - feel free to experiment with the text generation parameters!

def get_code_completion(prefix, suffix):
    text = prompt = f"""{prefix}{suffix}"""
    model.eval()
    outputs = model.generate(
        input_ids=tokenizer(text, return_tensors="pt").input_ids.cuda(),
        max_new_tokens=128,
        temperature=0.2,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.0,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

Now, to get code completion, all we need to do is call the get_code_completion function and pass the first lines that we want to be completed as a prefix, and an empty string as a suffix.

>>> prefix = """from peft import LoraConfig, TaskType, get_peft_model
... from transformers import AutoModelForCausalLM
... peft_config = LoraConfig(
... """
>>> suffix = """"""

... print(get_code_completion(prefix, suffix))
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["q_proj", "v_proj"],
    inference_mode=False,
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

As someone who has just used the PEFT library earlier in this notebook, you can see that the generated result for creating a LoraConfig is rather good!

If you go back to the cell where we instantiate the model for inference and comment out the lines where we merge the fine-tuned weights, you can see what the original model would generate for the exact same prefix:

>>> prefix = """from peft import LoraConfig, TaskType, get_peft_model
... from transformers import AutoModelForCausalLM
... peft_config = LoraConfig(
... """
>>> suffix = """"""

... print(get_code_completion(prefix, suffix))
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM
peft_config = LoraConfig(
    model_name_or_path="facebook/wav2vec2-base-960h",
    num_labels=1,
    num_features=1,
    num_hidden_layers=1,
    num_attention_heads=1,
    num_hidden_layers_per_attention_head=1,
    num_attention_heads_per_hidden_layer=1,
    hidden_size=1024,
    hidden_dropout_prob=0.1,
    hidden_act="gelu",
    hidden_act_dropout_prob=0.1,
    hidden

While it is valid Python syntax, you can see that the original model has no understanding of what a LoraConfig is supposed to do.

To learn how this kind of parameter-efficient fine-tuning compares to full fine-tuning, and how to use a model like this as your copilot in VS Code via Inference Endpoints or locally, check out the "Personal Copilot: Train Your Own Coding Assistant" blog post. This notebook complements the original blog post.
