---
license: other
license_name: other
license_link: LICENSE
---

# πŸ—Ό I Proposed My New Merge Method

For more information on the "Reborn" method, check out this Medium article: ["Reborn": Elevating Model Adaptation with Merging for Superior NLP Performance](https://medium.com/@puffanddmx82/reborn-elevating-model-adaptation-with-merging-for-superior-nlp-performance-f604e8e307b2)

β€œIt’s okay if my assumptions are not entirely correct. As long as they work to some degree, there is potential. By refining and correcting mistakes, perfection can be achieved! Natural language processing doesn’t always work exactly as expected. Last night, while watching YouTube, an idea occurred to me, and I wrote it out like a coding novel. Turning that idea into action is how "Reborn" came to be.”

This test model was made **without** interpolation (this repo).

## 🏎️ w/o interpolation (This Repo)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set precision type and target device
dtype = torch.bfloat16
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Load Models
reference_model_name = "abacusai/Llama-3-Smaug-8B"
base_model_name = "NousResearch/Meta-Llama-3-8B-Instruct"
target_model_name = "beomi/Llama-3-KoEn-8B-Instruct-preview"  # target model

reference_model = AutoModelForCausalLM.from_pretrained(reference_model_name).to(device=device, dtype=dtype)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device=device, dtype=dtype)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name).to(device=device, dtype=dtype)

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(target_model_name)

# Calculate model differences
def calculate_model_diffs(model_a, model_b):
    model_a_dict = model_a.state_dict()
    model_b_dict = model_b.state_dict()
    model_diffs = {key: model_a_dict[key] - model_b_dict[key] for key in model_a_dict.keys() if key in model_b_dict}
    return model_diffs

# Calculate adaptive scaling factors with dynamic attention
def calculate_dynamic_scaling_factors(model_diffs):
    scaling_factors = {}
    for key, diff_tensor in model_diffs.items():
        # Attention weights from the absolute magnitude of the parameter differences
        attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1)
        # Attention-weighted mean of the differences becomes the per-tensor scaling factor
        scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights)
        scaling_factors[key] = scaling_factor.item()
    return scaling_factors

# Apply adaptive scaling to model differences
def apply_scaling_factors(target_model, model_diffs, scaling_factors):
    target_state_dict = target_model.state_dict()
    for key, diff_tensor in model_diffs.items():
        scaled_diff = diff_tensor * scaling_factors[key]
        target_state_dict[key] += scaled_diff
    target_model.load_state_dict(target_state_dict)

# Main adaptation function with dynamic attention
def adapt_model_dynamic_attention(reference_model, base_model, target_model):
    reference_base_diffs = calculate_model_diffs(reference_model, base_model)
    base_target_diffs = calculate_model_diffs(base_model, target_model)
    reference_scaling_factors = calculate_dynamic_scaling_factors(reference_base_diffs)
    base_scaling_factors = calculate_dynamic_scaling_factors(base_target_diffs)
    scaling_factors = {key: (reference_scaling_factors[key] + base_scaling_factors[key]) / 2
                       for key in reference_scaling_factors}
    apply_scaling_factors(target_model, base_target_diffs, scaling_factors)
    return target_model

# Adapt the target model with dynamic attention
adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model)

# Save the adapted model and tokenizer
output_dir = './adapted_model_dynamic_attention'
adapted_model_dynamic_attention.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
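To make the "dynamic attention" step easier to follow in isolation, here is a minimal sketch of what `calculate_dynamic_scaling_factors` computes for a single weight delta. The tensor values are made up for illustration and are not part of the merge script:

```python
import torch

# Toy difference tensor standing in for one (reference - base) weight delta.
diff_tensor = torch.tensor([[0.10, -0.02, 0.30],
                            [0.05,  0.00, -0.20]])

# Softmax over the absolute differences: larger deltas get more attention.
attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1)

# Attention-weighted mean of the deltas collapses to a single scalar,
# which is the per-tensor scaling factor applied to the diff in the script above.
scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights)
print(scaling_factor.item())  # roughly 0.04 for these toy values
```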
# Interpolation in Model Merging (used in the second script below)

Interpolation is a practical way to reconcile mismatched tensors during model merging or adaptation. However, it comes with certain trade-offs and considerations that should be understood.

## Benefits of Interpolation

- **Alignment of Different Models:**
  - When two models have slightly different architectures or vocabularies, interpolation helps align them by finding common ground in their weight distributions.
- **Combining Features:**
  - Interpolation leverages the strengths of both models by combining their learned features.
- **Avoids Skipping Entire Layers:**
  - Rather than completely ignoring layers with different shapes, interpolation provides a way to incorporate them into the target model.

## Drawbacks of Interpolation

- **Loss of Specificity:**
  - Averaging weights might dilute specialized or highly fine-tuned features of individual models.
- **Overlapping Portions Only:**
  - For larger mismatches (e.g., output layers with very different vocabulary sizes), only the overlapping portion is interpolated (see the sketch after the error log below), potentially leading to reduced expressiveness.
- **Limited Contextualization:**
  - Interpolation lacks context about the specific task or domain knowledge, which direct training might address more effectively.

Without interpolation, the merge fails with a tensor size mismatch:

```
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:05<00:00,  1.30s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:14<00:00,  3.57s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6/6 [00:17<00:00,  2.90s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/reborn.py", line 66, in <module>
    adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/reborn.py", line 49, in adapt_model_dynamic_attention
    reference_base_diffs = calculate_model_diffs(reference_model, base_model)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/reborn.py", line 24, in calculate_model_diffs
    model_diffs = {key: model_a_dict[key] - model_b_dict[key] for key in model_a_dict.keys() if key in model_b_dict}
                        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (128258) must match the size of tensor b (128256) at non-singleton dimension 0
```
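The mismatch above comes from the embedding/output rows: 128258 vs 128256 vocabulary entries. The overlap-averaging idea that `interpolate_weights` uses in the script below can be sketched on toy shapes (the sizes here are made up stand-ins, not the real 8B tensors):

```python
import torch

# Toy stand-ins for two mismatched embedding matrices (vocab 6 vs 4, hidden 3).
tensor_a = torch.randn(6, 3)   # think: 128258 x 4096
tensor_b = torch.randn(4, 3)   # think: 128256 x 4096
target_shape = tensor_b.shape  # reconcile to the target shape

# Zero-filled tensor in the target shape.
interpolated = torch.zeros(target_shape, dtype=tensor_a.dtype)

# Average only the overlapping region; anything outside the overlap stays zero.
overlap = tuple(slice(0, min(a, b)) for a, b in zip(tensor_a.shape, target_shape))
interpolated[overlap] = (tensor_a[overlap] + tensor_b[overlap]) / 2

print(interpolated.shape)  # torch.Size([4, 3]) -- the two extra rows of tensor_a are dropped
```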
## 🏍️ w/ interpolation

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Set precision type and target device
dtype = torch.bfloat16
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Load Models
reference_model_name = "cognitivecomputations/dolphin-2.9-llama3-8b"
base_model_name = "NousResearch/Meta-Llama-3-8B-Instruct"
target_model_name = "beomi/Llama-3-KoEn-8B-Instruct-preview"  # target model

reference_model = AutoModelForCausalLM.from_pretrained(reference_model_name).to(device=device, dtype=dtype)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device=device, dtype=dtype)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name).to(device=device, dtype=dtype)

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(target_model_name)

def interpolate_weights(tensor_a, tensor_b, target_shape):
    """Interpolate or adjust weights between two tensors to match the target shape."""
    min_shape = [min(tensor_a.shape[i], tensor_b.shape[i], target_shape[i]) for i in range(len(target_shape))]

    # Create a new tensor matching the target shape, filled with zeros initially
    interpolated = torch.zeros(target_shape, dtype=tensor_a.dtype, device=tensor_a.device)

    # Average the two tensors over the overlapping portion; the rest stays zero
    slices = tuple(slice(0, s) for s in min_shape)
    interpolated[slices] = (tensor_a[slices] + tensor_b[slices]) / 2
    return interpolated

def calculate_model_diffs_and_interpolate(model_a, model_b):
    """Calculate differences between two models and interpolate where mismatches occur."""
    model_a_dict = model_a.state_dict()
    model_b_dict = model_b.state_dict()
    model_diffs = {}
    for key in model_a_dict.keys():
        if key in model_b_dict:
            if model_a_dict[key].shape == model_b_dict[key].shape:
                model_diffs[key] = model_a_dict[key] - model_b_dict[key]
            else:
                # Interpolate mismatched weights to model_b's shape
                target_shape = model_b_dict[key].shape
                model_diffs[key] = interpolate_weights(model_a_dict[key], model_b_dict[key], target_shape)
                print(f"Interpolating tensor '{key}' to reconcile shapes: {model_a_dict[key].shape} vs {model_b_dict[key].shape}")
        else:
            print(f"'{key}' not found in model_b. Skipping.")
    return model_diffs

# Calculate adaptive scaling factors with dynamic attention
def calculate_dynamic_scaling_factors(model_diffs):
    scaling_factors = {}
    for key, diff_tensor in model_diffs.items():
        # Compute attention weights based on the absolute magnitude of parameter differences
        attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1)
        # Compute the weighted sum of differences to calculate the scaling factor
        scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights)
        scaling_factors[key] = scaling_factor.item()
    return scaling_factors

# Apply adaptive scaling to model differences
def apply_scaling_factors(target_model, model_diffs, scaling_factors):
    target_state_dict = target_model.state_dict()
    for key, diff_tensor in model_diffs.items():
        scaled_diff = diff_tensor * scaling_factors[key]
        target_state_dict[key] += scaled_diff
    target_model.load_state_dict(target_state_dict)

# Main adaptation function with dynamic attention
def adapt_model_dynamic_attention(reference_model, base_model, target_model):
    # Calculate interpolated differences between reference and base models
    reference_base_diffs = calculate_model_diffs_and_interpolate(reference_model, base_model)
    base_target_diffs = calculate_model_diffs_and_interpolate(base_model, target_model)

    # Calculate dynamic scaling factors with attention
    reference_scaling_factors = calculate_dynamic_scaling_factors(reference_base_diffs)
    base_scaling_factors = calculate_dynamic_scaling_factors(base_target_diffs)

    # Merge scaling factors
    scaling_factors = {key: (reference_scaling_factors[key] + base_scaling_factors[key]) / 2
                       for key in reference_scaling_factors if key in base_scaling_factors}

    # Apply adaptive scaling
    apply_scaling_factors(target_model, base_target_diffs, scaling_factors)
    return target_model

# Adapt the target model with dynamic attention
adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model)

# Save the adapted model and tokenizer
output_dir = './adapted_model_dynamic_attention_dolphin'
adapted_model_dynamic_attention.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```
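Once the merge finishes, the saved checkpoint can be loaded back like any other Hugging Face model. A minimal smoke-test sketch, assuming the output directory from the script above and an arbitrary example prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint saved by the script above (path is an example).
output_dir = "./adapted_model_dynamic_attention_dolphin"
model = AutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Simple generation check; the Korean prompt is just an illustrative example.
prompt = "ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```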
Skipping.") return model_diffs # Calculate adaptive scaling factors with dynamic attention def calculate_dynamic_scaling_factors(model_diffs): scaling_factors = {} for key, diff_tensor in model_diffs.items(): # Compute attention weights based on the absolute magnitude of parameter differences attention_weights = torch.softmax(torch.abs(diff_tensor), dim=-1) # Compute the weighted sum of differences to calculate the scaling factor scaling_factor = torch.sum(diff_tensor * attention_weights) / torch.sum(attention_weights) scaling_factors[key] = scaling_factor.item() return scaling_factors # Apply adaptive scaling to model differences def apply_scaling_factors(target_model, model_diffs, scaling_factors): target_state_dict = target_model.state_dict() for key, diff_tensor in model_diffs.items(): scaled_diff = diff_tensor * scaling_factors[key] target_state_dict[key] += scaled_diff target_model.load_state_dict(target_state_dict) # Main adaptation function with dynamic attention def adapt_model_dynamic_attention(reference_model, base_model, target_model): # Calculate interpolated differences between reference and base models reference_base_diffs = calculate_model_diffs_and_interpolate(reference_model, base_model) base_target_diffs = calculate_model_diffs_and_interpolate(base_model, target_model) # Calculate dynamic scaling factors with attention reference_scaling_factors = calculate_dynamic_scaling_factors(reference_base_diffs) base_scaling_factors = calculate_dynamic_scaling_factors(base_target_diffs) # Merge scaling factors scaling_factors = {key: (reference_scaling_factors[key] + base_scaling_factors[key]) / 2 for key in reference_scaling_factors if key in base_scaling_factors} # Apply adaptive scaling apply_scaling_factors(target_model, base_target_diffs, scaling_factors) return target_model # Adapt the target model with dynamic attention adapted_model_dynamic_attention = adapt_model_dynamic_attention(reference_model, base_model, target_model) # Save the adapted model and tokenizer output_dir = './adapted_model_dynamic_attention_dolphin' adapted_model_dynamic_attention.save_pretrained(output_dir) tokenizer.save_pretrained(output_dir) ``` 🐠 Test Source Model - "abacusai/Llama-3-Smaug-8B" - "NousResearch/Meta-Llama-3-8B-Instruct" - "beomi/Llama-3-KoEn-8B-Instruct-preview" - "cognitivecomputations/dolphin-2.9-llama3-8b" ## Citation @article{llama3modelcard, title={Llama 3 Model Card}, author={AI@Meta}, year={2024}, url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md} }