Expanding Model Context and Creating Chat Models with a Single Click

Community Article Published April 28, 2024

There are countless methods for expanding a model's context. However, most of them require an enormous amount of VRAM and long training times. Similarly, creating a chat model requires a large number of QA sets and significant cost.

In this post, we will introduce a method that achieves context expansion, chat model creation, and knowledge transfer with just a single click.

Limitations of Simple Diff Addition

Simply adding diffs has its limitations. In fact, I wrote a post about expanding context by adding diffs yesterday, but I deleted it for this reason. While it works to some extent for the same base model, applying it to a heavily fine-tuned model often results in undesirable outputs and model degradation.

Therefore, I reconsidered the approach to context expansion, and as a result, I would like to introduce a method and code that not only expand context but also create a chat model and transfer knowledge with a single click. Since this post covers the method at a high level, please refer to the link below for the detailed code.

Considering Two Types of Diffs

First, we need to consider two types of diffs:

  1. The diff between the model with the desired information (long context or chat) and the base model.
  2. The diff between the base model and the target model (the model to be applied).

In the example below, we will focus on context expansion.

model_diffs = long_context_model - base_model # diff between long context and base
sigmoid_ratios = calculate_sigmoid_ratios(base_model, target_model) # per-element blending ratios
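The first line above is shorthand for a per-parameter subtraction. A minimal sketch of how the diff might be materialized, assuming two PyTorch models with identical architectures (`compute_model_diffs` is a hypothetical helper name, not from the original code):

```python
import torch

def compute_model_diffs(long_context_model, base_model):
    """Per-parameter diff: long_context_model minus base_model."""
    base_params = dict(base_model.named_parameters())
    return {
        name: param.data - base_params[name].data
        for name, param in long_context_model.named_parameters()
    }
```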

In the calculate_sigmoid_ratios function, we assign ratios using the sigmoid function based on the minimum and maximum values of all 1D tensors in the base model and target model. Of course, using other activation functions might be more appropriate.
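The post does not show the function body, but one way to realize this description is the sketch below: normalize each parameter's absolute base-to-target diff to [0, 1] using its minimum and maximum, then map it through a sigmoid so that large diffs get ratios near 1. The `scale` parameter controlling sigmoid sharpness is my own assumption:

```python
import torch

def calculate_sigmoid_ratios(base_model, target_model, scale=8.0):
    """Sketch: map each parameter's absolute base->target diff to a
    (0, 1) ratio via a sigmoid. Larger diffs produce larger ratios."""
    base_params = dict(base_model.named_parameters())
    ratios = {}
    for name, target_param in target_model.named_parameters():
        diff = (target_param.data - base_params[name].data).abs()
        # normalize to [0, 1] with the tensor's min/max, center at 0.5
        d_min, d_max = diff.min(), diff.max()
        normalized = (diff - d_min) / (d_max - d_min + 1e-8)
        ratios[name] = torch.sigmoid(scale * (normalized - 0.5))
    return ratios
```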

The intuition behind the ratios is as follows: where the base and target models differ strongly (i.e., a large diff), naively applying new information to that element would preserve neither the context-expansion information nor the target model's own knowledge. Therefore, when applying the diff to the model, we scale it by the ratio obtained above, adding diff * (1 - ratio), which minimizes model degradation and guides the model to absorb only the desired information.

apply_model_diffs(target_model, model_diffs, sigmoid_ratios)

After that, we can create the model by updating the weights of the target model using the diff and ratio. This entire process is completed within 1 minute without any training.
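Under the same assumptions as the sketches above (per-parameter dicts of diffs and ratios), applying the update is a single in-place pass over the target model's parameters:

```python
import torch

def apply_model_diffs(target_model, model_diffs, sigmoid_ratios):
    """Sketch: add each diff scaled by (1 - ratio), so elements that
    already differ strongly from the base receive a smaller update."""
    with torch.no_grad():
        for name, param in target_model.named_parameters():
            param.add_(model_diffs[name] * (1.0 - sigmoid_ratios[name]))
```

Because this is a single pass of element-wise tensor arithmetic, it runs in well under a minute even for billion-parameter models, with no gradient computation involved.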

We have confirmed 100% retrieval accuracy up to a 64k context for models created using this method.

Application to Chat Model Creation (Especially for Non-English or Domain-Specific Models)

The same method can be applied to create chat models as well. Below is an example of creating a Korean chat model.

Below is a comparison between the approach using only diffs (Chat Vector) and the introduced approach.

Due to length limitations, please refer to the "Detailed Code Introduction" section below for the output comparison.

Here are the results of the kobest benchmark (5-shot, acc basis):

beomi/Llama-3-Open-Ko-8B-Instruct-preview (simple diff application)

  • Average: 0.6220
  • boolq: 0.6254
  • copa: 0.7110
  • hellaswag: 0.3840
  • sentineg: 0.8388
  • wic: 0.5739

maywell/Llama-3-Ko-8B-Instruct (method application)

  • Average: 0.6852
  • boolq: 0.7208
  • copa: 0.7650
  • hellaswag: 0.4440
  • sentineg: 0.9194
  • wic: 0.6040

We were able to achieve a performance improvement of about 10%.

Of course, kobest does not evaluate chat performance directly; rather, these results show that the method imparts the Instruct format while minimizing degradation of the model's underlying performance.

The code has been uploaded to GitHub. I would be grateful if you could experiment with it and share your results.

Links