Finetuning upstage/SOLAR-10.7B-Instruct-v1.0

#24
by bertdirt - opened

I have 2 A10 GPUs (48GB total memory). I loaded the quantised model (roughly 9GB) and tried fine-tuning it, but got an "out of memory" error. I loaded the model in the following way:

import os
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype changed to bfloat16
)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

model_name = "./SOLAR-10.7B-Instruct-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config, trust_remote_code=True)

# model.gradient_checkpointing_enable() ## Added checkpointing

model = prepare_model_for_kbit_training(model,use_gradient_checkpointing=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = get_peft_model(model, config)
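
At this point, a quick sanity check like the following (print_trainable_parameters and get_memory_footprint are standard helpers in peft/transformers) shows that only the LoRA adapters are trainable and how much memory the 4-bit base weights actually take:

# Confirm only the LoRA adapters are trainable and check the 4-bit footprint
model.print_trainable_parameters()
print(f"Base model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")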

To overcome this, I tried adding gradient_checkpointing=True in TrainingArguments:

def train_model(dsl_train,dsl_test,model,tokenizer,output_dir):
    os.environ["WANDB_DISABLED"] = "true"
    model.config.use_cache = False
    trainer = transformers.Trainer(
        model=model,
        train_dataset=dsl_train,
        eval_dataset=dsl_test,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=1,
            per_device_eval_batch_size=1,
            gradient_accumulation_steps=4,
            gradient_checkpointing=True,
            evaluation_strategy='epoch',
            save_strategy='epoch',
            load_best_model_at_end=True,
            log_level='info',
            overwrite_output_dir=True,
            report_to=None,
            warmup_steps=1,
            num_train_epochs=3,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            save_steps=1,
            output_dir=output_dir,
#             optim='paged_lion_8bit', #"paged_adamw_8bit"
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    result = trainer.train()
    return result,model,tokenizer

I got the following error:

ERROR - Exception
Traceback (most recent call last):
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_11584/2258950767.py", line 1, in <cell line: 1>
    result,model,tokenizer = train_model(dsl_train,dsl_test,model,tokenizer,output_dir)
  File "/tmp/ipykernel_11584/1227025339.py", line 30, in train_model
    result = trainer.train()
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/transformers/trainer.py", line 2734, in training_step
    self.accelerator.backward(loss)
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/accelerate/accelerator.py", line 1851, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I am not sure what is causing this. I tried the changes suggested in https://github.com/huggingface/transformers/issues/25006
but they do not work, because SOLAR requires updated versions of transformers, torch and accelerate. Please help me find the cause so I can debug this issue.

Hello,

I recently fine-tuned this model successfully for another task. I do not think the problem you are encountering is due to your GPUs, because I managed with a single 24GB GPU. It is more likely a library configuration issue. Here are my packages; make sure to use a virtualenv and install these:
%pip install -Uqqq pip --progress-bar off
%pip install -qqq torch==2.0.1 --progress-bar off
#!pip install -qqq transformers==4.32.1 --progress-bar off
%pip install git+https://github.com/huggingface/transformers
%pip install -qqq datasets==2.14.4 --progress-bar off
%pip install -qqq peft==0.5.0 --progress-bar off
%pip install -qqq bitsandbytes==0.41.1 --progress-bar off
%pip install -qqq trl==0.7.1 --progress-bar off
%pip install scipy
%pip install accelerate==0.27.2
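
If it helps, a quick way to confirm the kernel actually picked up these versions after installing (restart the kernel first) is to print them:

# Print the versions actually loaded in the current kernel
import torch, transformers, peft, bitsandbytes, accelerate
for pkg in (torch, transformers, peft, bitsandbytes, accelerate):
    print(pkg.__name__, pkg.__version__)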

Hope this helps!

@halilergul1 , I have figured out the issue. I was passing use_gradient_checkpointing=False to prepare_model_for_kbit_training(model, use_gradient_checkpointing=False), while gradient_checkpointing=True was set in TrainingArguments. Once I removed use_gradient_checkpointing=False (so it falls back to its default of True and the two settings no longer conflict), training worked.
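
For anyone hitting the same RuntimeError, here is a minimal sketch of the consistent setup (reusing model, config and output_dir from the snippets above), with gradient checkpointing enabled in both places, which is what the fix amounts to:

# Keep gradient checkpointing consistent between the k-bit preparation step and
# TrainingArguments; prepare_model_for_kbit_training(..., use_gradient_checkpointing=True)
# also makes the embedding outputs require grad, which checkpointing needs so the
# backward pass can reach the LoRA adapters.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, config)

training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,  # matches use_gradient_checkpointing above
    fp16=True,
    num_train_epochs=3,
    learning_rate=2e-4,
)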

hunkim changed discussion status to closed
