
[Figure: overall framework of iDUS]

interlocked-DUS(iDUS)

We attempted to improve model performance by further reducing the layer distance while staying close to the DUS (Depth Up-Scaling) framework.

💻 GitHub Repository: https://github.com/gauss5930/iDUS

Architectural Details

We propose interlocked-DUS (iDUS), a variant of DUS. As the name suggests, iDUS does not connect the two layer stacks as whole blocks the way DUS does. Instead, it divides them into groups of layers and merges the groups alternately so that they interlock. With this mechanism, iDUS reduces the layer distance that was important in DUS more effectively, while keeping runs of consecutive layers together so that the model can still process information properly. The figure above illustrates the overall framework of iDUS.

Experiments

We created two variants of interlocked-DUS (iDUS) and ran experiments to verify their effectiveness.

  • iDUS-1layer: The layers are taken from the base model as in DUS, but they are merged alternately, one layer from each copy at a time. This variant aims to solve the layer distance problem more effectively.
  • iDUS-8layer (iDUS): The concept is the same as iDUS-1layer, but the layers are merged alternately in groups of 8. This variant aims to reduce the layer distance while keeping consecutive layers together for effective processing.
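The difference between the variants can be sketched as a reordering of layer indices. The sketch below is illustrative only and rests on several assumptions that are not stated in this card: a 32-layer base model from which one copy keeps the first 24 layers and the other keeps the last 24 (the setup commonly described for SOLAR-10.7B), the hypothetical helper names `dus_slices`, `plain_dus`, `interlock`, and `max_layer_jump`, and "layer distance" measured as the largest index jump between adjacent layers in the merged stack. The exact ordering used by iDUS may differ; see the GitHub repository for the actual merge code.

```python
def dus_slices(n_layers=32, m=8):
    """Return the base-model layer indices kept by each copy (assumed setup)."""
    copy_a = list(range(0, n_layers - m))  # first n - m layers
    copy_b = list(range(m, n_layers))      # last n - m layers
    return copy_a, copy_b


def plain_dus(copy_a, copy_b):
    """DUS-style merge: stack copy A, then copy B, as whole blocks."""
    return copy_a + copy_b


def interlock(copy_a, copy_b, group):
    """Interlocked merge: alternate chunks of `group` layers from A and B."""
    assert len(copy_a) == len(copy_b)
    order = []
    for start in range(0, len(copy_a), group):
        order += copy_a[start:start + group]
        order += copy_b[start:start + group]
    return order


def max_layer_jump(order):
    """Largest index gap between adjacent layers (assumed distance proxy)."""
    return max(abs(nxt - cur) for cur, nxt in zip(order, order[1:]))


copy_a, copy_b = dus_slices()
dus = plain_dus(copy_a, copy_b)
idus_1 = interlock(copy_a, copy_b, group=1)
idus_8 = interlock(copy_a, copy_b, group=8)

for name, order in [("DUS", dus), ("iDUS-1layer", idus_1), ("iDUS-8layer", idus_8)]:
    print(f"{name}: {len(order)} layers, max jump = {max_layer_jump(order)}")
```

Under these assumptions, both interlocked orderings have a smaller maximum jump than the plain DUS stack, but only the 8-layer grouping keeps runs of consecutive layers intact, which matches the motivation given for iDUS-8layer.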

To measure the effectiveness of these variants, we submitted them to the HuggingFace Open LLM Leaderboard, where they were evaluated as follows.

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | Average |
|---|---|---|---|---|---|---|---|
| Llama2_init_Mistral | 60.07 | 83.3 | 64.09 | 42.15 | 78.37 | 37.91 | 60.98 |
| SOLAR-10.7B-DUS-Implementation | 59.56 | 81.18 | 63.68 | 40.72 | 76.48 | 26.99 | 58.1 |
| iDUS-1layer | 27.73 | 26.65 | 24.91 | 48.58 | 49.17 | 0 | 29.51 |
| iDUS(iDUS-8layer) | 59.3 | 81.34 | 63.22 | 40.62 | 76.24 | 29.57 | 58.38 |

As shown in the table above, iDUS-1layer performs far worse than the other models (29.51 average), while iDUS-8layer is slightly better than our implementation of the original DUS used in SOLAR-10.7B (58.38 vs. 58.1 average).
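As a quick check of the Average column, the snippet below recomputes it as the unweighted mean of the six benchmark scores, using the numbers from the table. The variable names are illustrative, and the equal-weight mean is an assumption about how the leaderboard average was formed.

```python
# Scores in table order: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
scores = {
    "Llama2_init_Mistral": [60.07, 83.3, 64.09, 42.15, 78.37, 37.91],
    "SOLAR-10.7B-DUS-Implementation": [59.56, 81.18, 63.68, 40.72, 76.48, 26.99],
    "iDUS-1layer": [27.73, 26.65, 24.91, 48.58, 49.17, 0],
    "iDUS(iDUS-8layer)": [59.3, 81.34, 63.22, 40.62, 76.24, 29.57],
}

# Unweighted mean over the six benchmarks, rounded to two decimals
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")

# Gap between iDUS-8layer and the DUS implementation
gap = round(averages["iDUS(iDUS-8layer)"] - averages["SOLAR-10.7B-DUS-Implementation"], 2)
print(f"iDUS-8layer minus DUS: {gap}")
```

The recomputed means match the reported averages, and the gap between iDUS-8layer and the DUS implementation is 0.28 points.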

Discussion

The experiments with the iDUS variants led to the following observations.

  • iDUS-1layer merges one layer at a time to solve the layer distance problem, but this instead broke the model: its scores collapsed (29.51 average, 0 on GSM8K).
  • iDUS-8layer, on the other hand, performed well. This seems to be because it solves the layer distance problem to some extent while keeping consecutive layers together, which lets the model process information properly.

These results suggest that solving the layer distance problem is important, but that it is also important to keep consecutive layers together so the model can process information effectively. Taking both points into consideration, we propose iDUS, which shows slightly improved performance over the original DUS in our experiments.

Due to limited compute resources, we could not perform further pre-training for either the SOLAR-10.7B implementation or the iDUS experiments, which made a more detailed analysis impossible. We leave this limitation for future work.

Model size: 10.7B params (Safetensors, FP16)
