
# 🗼 I Proposed My New Merge Method

For more information on the "Reborn" method, check out this Medium article:
*"Reborn": Elevating Model Adaptation with Merging for Superior NLP Performance*

“It’s okay if my assumptions are not entirely correct. As long as they work to some degree, there is potential. By refining and correcting mistakes, perfection can be achieved!

Natural language processing doesn’t always work exactly as expected.

Last night, while watching YouTube, an idea occurred to me, and I sketched it out in code. The idea turned into action, and I ended up writing it up as ‘Reborn’.”

This test model (this repo) was built without interpolation, using the script below.

## 🏎️ w/o interpolation (This Repo)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set precision type and target device
dtype = torch.bfloat16
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Load Models
reference_model_name = "abacusai/Llama-3-Smaug-8B"
base_model_name = "NousResearch/Meta-Llama-3-8B-Instruct"
target_model_name = "beomi/Llama-3-KoEn-8B-Instruct-preview"  # target model.

reference_model = AutoModelForCausalLM.from_pretrained(reference_model_name).to(device=device, dtype=dtype)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device=device, dtype=dtype)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name).to(device=device, dtype=dtype)

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(target_model_name)

# Calculate model differences
def calculate_model_diffs(model_a, model_b):
    model_a_dict = model_a.state_dict()
    model_b_dict = model_b.state_dict()
    model_diffs = {key: model_a_dict[key] - model_b_dict[key] for key in model_a_dict.keys() if key in model_b_dict}
    return model_diffs

# Calculate adaptive scaling factors with dynamic attention
def calculate_dynamic_scaling_factors(model_diffs):
    scaling_factors = {}
    for key, diff_tensor in model_diffs.items():
        attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1)
        scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights)
        scaling_factors[key] = scaling_factor.item()
    return scaling_factors

# Apply adaptive scaling to model differences
def apply_scaling_factors(target_model, model_diffs, scaling_factors):
    target_state_dict = target_model.state_dict()
    for key, diff_tensor in model_diffs.items():
        scaled_diff = diff_tensor * scaling_factors[key]
        target_state_dict[key] += scaled_diff
    target_model.load_state_dict(target_state_dict)

# Main adaptation function with dynamic attention
def adapt_model_dynamic_attention(reference_model, base_model, target_model):
    reference_base_diffs = calculate_model_diffs(reference_model, base_model)
    base_target_diffs = calculate_model_diffs(base_model, target_model)

    reference_scaling_factors = calculate_dynamic_scaling_factors(reference_base_diffs)
    base_scaling_factors = calculate_dynamic_scaling_factors(base_target_diffs)

    scaling_factors = {key: (reference_scaling_factors[key] + base_scaling_factors[key]) / 2
                       for key in reference_scaling_factors}

    apply_scaling_factors(target_model, base_target_diffs, scaling_factors)

    return target_model

# Adapt the target model with dynamic attention
adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model)

# Save the adapted model and tokenizer
output_dir = './adapted_model_dynamic_attention'
adapted_model_dynamic_attention.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
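The core of the script above is `calculate_dynamic_scaling_factors`: for each parameter tensor, it softmaxes the absolute differences along the last dimension and uses the result as attention weights, collapsing the whole diff tensor into a single scalar. A minimal sketch on a toy tensor (the values are illustrative, not taken from any real model):

```python
import torch

# Toy stand-in for one entry of model_diffs (2 rows, 3 columns)
diff_tensor = torch.tensor([[0.10, -0.02, 0.30],
                            [0.05,  0.40, -0.10]])

# Larger absolute differences receive more attention (softmax per row)
attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1)

# Weighted mean of the signed differences; each softmax row sums to 1,
# so the denominator is just the number of rows
scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights)
print(scaling_factor.item())  # one scalar per tensor, later used to rescale its diff
```

`apply_scaling_factors` then multiplies each base-to-target diff by its (averaged) scalar before adding it to the target weights.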
## Interpolation in Model Merging

Interpolation is a practical way to reconcile mismatched tensors during model merging or adaptation. However, it comes with certain trade-offs and considerations that should be understood.

## Benefits of Interpolation

- **Alignment of Different Models:**
  - When two models have slightly different architectures or vocabularies, interpolation helps align them by finding common ground in their weight distributions.

- **Combining Features:**
  - Interpolation leverages the strengths of both models by combining their learned features.

- **Avoids Skipping Entire Layers:**
  - Rather than completely ignoring layers with different shapes, interpolation provides a way to incorporate them into the target model.

## Drawbacks of Interpolation

- **Loss of Specificity:**
  - Averaging weights might dilute specialized or highly fine-tuned features of individual models.

- **Overlapping Portions Only:**
  - For larger mismatches (e.g., output layers with very different vocabulary sizes), only the overlapping portion is interpolated, potentially reducing expressiveness (see the sketch after this list).

- **Limited Contextualization:**
  - Interpolation lacks context about the specific task or domain knowledge, which direct training might address more effectively.
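
To make the "Overlapping Portions Only" point concrete, here is a toy sketch of the averaging performed by the `interpolate_weights` helper in the script below; the sizes (8×4 vs. 6×4) are made up and stand in for mismatched embedding matrices:

```python
import torch

# Hypothetical mismatched "embeddings": 8 vs. 6 rows, same width
tensor_a = torch.randn(8, 4)   # e.g., a vocabulary with two extra tokens
tensor_b = torch.randn(6, 4)
target_shape = tensor_b.shape  # adopt the target model's shape

min_shape = [min(tensor_a.shape[i], tensor_b.shape[i], target_shape[i])
             for i in range(len(target_shape))]
slices = tuple(slice(0, s) for s in min_shape)

# Average only the overlapping 6x4 region; tensor_a's last two rows are
# dropped, and any region outside the overlap would stay zero
interpolated = torch.zeros(target_shape, dtype=tensor_a.dtype)
interpolated[slices] = (tensor_a[slices] + tensor_b[slices]) / 2
print(interpolated.shape)  # torch.Size([6, 4])
```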

Without interpolation, a reference model whose vocabulary is larger than the base model's (here 128258 vs. 128256 tokens) triggers a shape mismatch:

```
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.30s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:14<00:00,  3.57s/it]
Loading checkpoint shards: 100%|██████████| 6/6 [00:17<00:00,  2.90s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/reborn.py", line 66, in <module>
    adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/reborn.py", line 49, in adapt_model_dynamic_attention
    reference_base_diffs = calculate_model_diffs(reference_model, base_model)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/reborn.py", line 24, in calculate_model_diffs
    model_diffs = {key: model_a_dict[key] - model_b_dict[key] for key in model_a_dict.keys() if key in model_b_dict}
                        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (128258) must match the size of tensor b (128256) at non-singleton dimension 0
```
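
The two extra entries (128258 vs. 128256) come from special tokens added to the reference model's vocabulary, as the warning in the log hints. A lightweight way to spot such mismatches before loading any weights is to compare configs; a sketch using the standard `AutoConfig` API:

```python
from transformers import AutoConfig

for name in ["abacusai/Llama-3-Smaug-8B",
             "NousResearch/Meta-Llama-3-8B-Instruct",
             "beomi/Llama-3-KoEn-8B-Instruct-preview",
             "cognitivecomputations/dolphin-2.9-llama3-8b"]:
    config = AutoConfig.from_pretrained(name)
    # Differing vocab sizes imply mismatched embedding and lm_head shapes
    print(name, config.vocab_size)
```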

## 🏍️ w/ interpolation

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set precision type and target device
dtype = torch.bfloat16
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Load Models
reference_model_name = "cognitivecomputations/dolphin-2.9-llama3-8b"
base_model_name = "NousResearch/Meta-Llama-3-8B-Instruct"
target_model_name = "beomi/Llama-3-KoEn-8B-Instruct-preview"  # target model.

reference_model = AutoModelForCausalLM.from_pretrained(reference_model_name).to(device=device, dtype=dtype)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device=device, dtype=dtype)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name).to(device=device, dtype=dtype)

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(target_model_name)

def interpolate_weights(tensor_a, tensor_b, target_shape):
    """Interpolate or adjust weights between two tensors to match the target shape."""
    min_shape = [min(tensor_a.shape[i], tensor_b.shape[i], target_shape[i]) for i in range(len(target_shape))]

    # Create a new tensor matching the target shape, filled with zeros initially
    interpolated = torch.zeros(target_shape, dtype=tensor_a.dtype, device=tensor_a.device)

    # Copy the overlapping portion
    slices = tuple(slice(0, s) for s in min_shape)
    interpolated[slices] = (tensor_a[slices] + tensor_b[slices]) / 2

    return interpolated

def calculate_model_diffs_and_interpolate(model_a, model_b):
    """Calculate differences between two models and interpolate where mismatches occur."""
    model_a_dict = model_a.state_dict()
    model_b_dict = model_b.state_dict()
    
    model_diffs = {}
    for key in model_a_dict.keys():
        if key in model_b_dict:
            if model_a_dict[key].shape == model_b_dict[key].shape:
                model_diffs[key] = model_a_dict[key] - model_b_dict[key]
            else:
                # Shapes differ: build an averaged tensor in model_b's shape.
                # Note: for these keys the stored entry is an averaged weight,
                # not a difference.
                target_shape = model_b_dict[key].shape
                model_diffs[key] = interpolate_weights(model_a_dict[key], model_b_dict[key], target_shape)
                print(f"Interpolating '{key}': {tuple(model_a_dict[key].shape)} vs {tuple(model_b_dict[key].shape)}")
        else:
            print(f"'{key}' not found in model_b. Skipping.")

    return model_diffs

# Calculate adaptive scaling factors with dynamic attention
def calculate_dynamic_scaling_factors(model_diffs):
    scaling_factors = {}
    for key, diff_tensor in model_diffs.items():
        # Compute attention weights based on the absolute magnitude of parameter differences
        attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1)
        # Compute the weighted sum of differences to calculate the scaling factor
        scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights)
        scaling_factors[key] = scaling_factor.item()
    return scaling_factors

# Apply adaptive scaling to model differences
def apply_scaling_factors(target_model, model_diffs, scaling_factors):
    target_state_dict = target_model.state_dict()
    for key, diff_tensor in model_diffs.items():
        scaled_diff = diff_tensor * scaling_factors[key]
        target_state_dict[key] += scaled_diff
    target_model.load_state_dict(target_state_dict)

# Main adaptation function with dynamic attention
def adapt_model_dynamic_attention(reference_model, base_model, target_model):
    # Calculate interpolated differences between reference and base models
    reference_base_diffs = calculate_model_diffs_and_interpolate(reference_model, base_model)
    base_target_diffs = calculate_model_diffs_and_interpolate(base_model, target_model)

    # Calculate dynamic scaling factors with attention
    reference_scaling_factors = calculate_dynamic_scaling_factors(reference_base_diffs)
    base_scaling_factors = calculate_dynamic_scaling_factors(base_target_diffs)

    # Merge scaling factors
    scaling_factors = {key: (reference_scaling_factors[key] + base_scaling_factors[key]) / 2
                       for key in reference_scaling_factors if key in base_scaling_factors}

    # Apply adaptive scaling
    apply_scaling_factors(target_model, base_target_diffs, scaling_factors)

    return target_model

# Adapt the target model with dynamic attention
adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model)

# Save the adapted model and tokenizer
output_dir = './adapted_model_dynamic_attention_dolphin'
adapted_model_dynamic_attention.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
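
Once saved, the merged checkpoint loads like any other 🤗 Transformers causal LM. A minimal usage sketch (the Korean prompt and the generation settings are illustrative, not tuned):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

output_dir = "./adapted_model_dynamic_attention_dolphin"
model = AutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Korean prompt, since the target model is a KoEn instruct model
inputs = tokenizer("안녕하세요! 간단히 자기소개를 해 주세요.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```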

## 🐠 Test Source Models

  • "abacusai/Llama-3-Smaug-8B"
  • "NousResearch/Meta-Llama-3-8B-Instruct"
  • "beomi/Llama-3-KoEn-8B-Instruct-preview"
  • "cognitivecomputations/dolphin-2.9-llama3-8b"

## Citation

```bibtex
@article{llama3modelcard,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```
