Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance

Community Article Published November 22, 2023


Introduction

The Intel Extension for Transformers has made significant strides in optimizing large language models (LLMs) for the Intel Gaudi2 accelerator. The recent advances showcased in the NeuralChat 7b model, fine-tuned and optimized on Gaudi2, set a new benchmark in the LLM domain, raising the bar for performance and versatility.


Leveraging Gaudi2: A Game-Changer in LLM Optimization

The Gaudi2 AI accelerator, built by Habana Labs under Intel's umbrella, has become a cornerstone for large-scale deep learning training and inference. With 96 GB of integrated HBM2E memory per card and availability in servers equipped with eight Gaudi2 mezzanine cards, it offers ample capacity for both training and serving large language models.


The Fine-Tuning Journey

Fine-tuning an LLM involves a sequence of careful steps, and Intel's approach combines supervised fine-tuning with direct preference optimization (DPO). Starting from the mistralai/Mistral-7B-v0.1 base model, Intel's team used the Intel Extension for Transformers together with the Open-Orca/SlimOrca dataset and DeepSpeed ZeRO-2 to update the model's parameters. The process not only optimized performance but also relied on a base model and datasets released under commercially friendly licenses.
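
To make the supervised fine-tuning stage concrete, the sketch below fine-tunes Mistral-7B-v0.1 on SlimOrca with a DeepSpeed ZeRO-2 configuration. It uses the plain Hugging Face Trainer rather than Intel's own fine-tuning scripts, and the prompt template, hyperparameters, and the ds_zero2.json path are illustrative assumptions, not Intel's exact recipe.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# SlimOrca stores each example as a list of {"from": ..., "value": ...} turns.
def to_text(example):
    role_map = {"system": "### System:", "human": "### User:", "gpt": "### Assistant:"}
    parts = [f"{role_map.get(turn['from'], '###')}\n{turn['value']}"
             for turn in example["conversations"]]
    return {"text": "\n".join(parts)}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

dataset = load_dataset("Open-Orca/SlimOrca", split="train")
dataset = dataset.map(to_text, remove_columns=dataset.column_names)
dataset = dataset.map(tokenize, remove_columns=["text"])

args = TrainingArguments(
    output_dir="neural-chat-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    deepspeed="ds_zero2.json",   # ZeRO stage-2 config file (assumed to exist)
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()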


Benefits of NeuralChat 7b

  1. Direct Preference Optimization: Aligning with Human Preferences. A distinctive aspect of the NeuralChat 7b model's development was the application of the DPO algorithm, which is both stable and computationally lightweight, to align model responses with human preferences. Working from roughly 12k examples drawn from an Orca-style dataset, the team used the llama-2-13b-chat model to generate candidate responses, building clear pairs of accepted and rejected answers (a hedged training sketch follows this list).

  2. Inference Excellence. Compatibility with the Transformers library makes inference with the NeuralChat model seamless: the same launcher code can be used for FP32 inference, and BF16 inference can be enabled through Optimum-Habana for faster, yet still accurate, responses (a minimal BF16 inference sketch also follows this list).

  3. Supervised Fine-Tuning with Intel Extension for Transformers. Using mistralai/Mistral-7B-v0.1 as the base model, the Intel Extension for Transformers facilitates supervised fine-tuning on the Open-Orca/SlimOrca dataset with DeepSpeed ZeRO-2, tailoring the model to specific requirements while keeping to commercially friendly licenses.
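
The DPO step described in item 1 can be illustrated with a short sketch built on the TRL library's DPOTrainer. This is a generic recipe rather than Intel's exact training script: the checkpoint path, the orca_dpo_pairs.jsonl file, and the hyperparameters are placeholders, and the only requirement on the data is that it exposes "prompt", "chosen", and "rejected" text columns.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_model = "path/to/sft-checkpoint"   # the supervised fine-tuned model from the previous stage
tokenizer = AutoTokenizer.from_pretrained(sft_model)
tokenizer.pad_token = tokenizer.eos_token

# Policy model to optimize and a frozen reference copy used by the DPO loss.
model = AutoModelForCausalLM.from_pretrained(sft_model, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLM.from_pretrained(sft_model, torch_dtype=torch.bfloat16)

# Preference pairs with "prompt", "chosen", and "rejected" columns are assumed;
# in Intel's case the pairs came from an Orca-style dataset with llama-2-13b-chat responses.
dataset = load_dataset("json", data_files="orca_dpo_pairs.jsonl", split="train")

args = TrainingArguments(
    output_dir="neural-chat-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    beta=0.1,                 # strength of the penalty that keeps the policy close to the reference
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()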
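
For the inference path described in item 2, the simplest starting point is to load the released checkpoint directly in bfloat16. The sketch below uses plain Transformers and an illustrative prompt; on Gaudi2, the Optimum-Habana integration would take the place of this vanilla setup and is not shown here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights directly in bfloat16 for faster, lower-memory inference.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "### System:\nYou are a helpful assistant.\n### User:\nWhat is Gaudi2?\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))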

Code Implementation: Optimizing NeuralChat 7b

The fine-tuning and optimization of NeuralChat 7b on Intel Gaudi2 rely on the tooling provided by the Intel Extension for Transformers. The walkthrough below takes the released Intel/neural-chat-7b-v3-1 checkpoint and runs inference with it using the Hugging Face Transformers stack, here with 4-bit quantization via bitsandbytes.


Step 1: Install Libraries

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q modelz-llm huggingface_hub
!pip install -q datasets loralib sentencepiece
!pip install -q xformers einops
!apt-get update && apt-get install -y git-lfs
!git lfs install

Step 2: Import Libraries and Load Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute keeps the 7B model within a single accelerator's memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Intel/neural-chat-7b-v3-1"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

Step 3: Function for Response Generation

def generate_response(system_input, user_input):

    # Format the input using the NeuralChat prompt template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize the prompt and move it to the same device as the model
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Return only the assistant's part of the decoded text
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain your reasoning and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

Output

 To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.

Conclusion

The journey of fine-tuning the NeuralChat 7b model on Intel Gaudi2 stands as a testament to innovation, collaboration, and meticulous optimization in the realm of large language models (LLMs). Leveraging the robust tools provided by the Intel Extension for Transformers, this endeavor has unlocked new thresholds of performance and versatility, reshaping the landscape of LLMs.

By harnessing supervised fine-tuning methodologies and pioneering the application of Direct Preference Optimization (DPO) algorithms, Intel's team has not only optimized the model's performance but also steered it toward alignment with human preferences. This meticulous approach, characterized by the integration of high-quality datasets and cutting-edge training techniques, has propelled NeuralChat 7b to the forefront of LLM excellence.

The integration and optimization of these methodologies on Intel Gaudi2, a powerhouse AI accelerator, underscore a convergence of powerful hardware and sophisticated software frameworks. This convergence has been instrumental in achieving the model's performance milestones, culminating in its top ranking among 7B models on the Hugging Face Open LLM Leaderboard at the time of release.

Moreover, Intel's commitment to ethical considerations in AI development shines through in the meticulous approach taken to address potential risks, ensuring that NeuralChat 7b contributes positively to the AI landscape.

The release of NeuralChat 7b to the LLM community not only sets a new benchmark but also invites collaboration and engagement. By extending an invitation to fine-tune and contribute, Intel seeks to foster a community-driven evolution of LLMs, aiming to make AI beneficial and ethical for all.

NeuralChat 7b's journey signifies a leap forward in LLM technology, promising continued advancements and inspiring further research and development efforts. Its impact transcends benchmarks, signaling a new era of empowered, community-driven innovation in the realm of large language models.

Stay connected and support my work through various platforms:

Hugging Face: For natural language processing and AI-related projects, you can explore my Hugging Face profile at https://huggingface.co/Andyrasika.

LinkedIn: To stay updated on my latest projects and posts, you can follow me on LinkedIn. Here is the link to my profile: https://www.linkedin.com/in/ankushsingal/.

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

Resources:

  1. Neural-Chat