---
library_name: transformers
tags: []
---

# Model Card for Model ID

## Model Details

This is my attempt (probably too naive) to reproduce the upcycling process used to initialize [Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) from the dense [Qwen1.5-1.8B](https://huggingface.co/Qwen/Qwen1.5-1.8B): the dense model's MLPs are partitioned into fine-grained experts (plus one shared expert), while the attention, embedding, and layer-norm weights are carried over unchanged.

## Upcycling script
Script:

```python
from copy import deepcopy
from dataclasses import dataclass

from torch import nn
from transformers import AutoModelForCausalLM
from typing_extensions import Self


@dataclass
class UpcyclingConfig:
    finegrained_experts: int
    partitions_from_mlp: int

    @property
    def upcycling_factor(self) -> int:
        return self.finegrained_experts // self.partitions_from_mlp


def iterate_in_chunks(list1, list2, chunk_size1, chunk_size2):
    """Yield parallel chunks of two sequences, chunk_size1/chunk_size2 items at a time."""
    iterations = max(len(list1) // chunk_size1, len(list2) // chunk_size2)
    for i in range(iterations):
        start_idx1 = i * chunk_size1
        start_idx2 = i * chunk_size2
        yield (
            list1[start_idx1 : start_idx1 + chunk_size1],
            list2[start_idx2 : start_idx2 + chunk_size2],
        )


def chunk_linear(linear: nn.Linear, chunks: int, down_proj: bool = False) -> tuple[nn.Linear, ...]:
    """Split a linear layer into `chunks` smaller ones.

    gate_proj/up_proj are split along the output dimension (dim 0 of the weight);
    down_proj is split along the input dimension (dim 1).
    """
    if not down_proj:
        in_features = linear.in_features
        out_features = linear.out_features // chunks
        dim = 0
    else:
        in_features = linear.in_features // chunks
        out_features = linear.out_features
        dim = 1
    weights = linear.weight.chunk(chunks, dim=dim)
    # Qwen1.5 MLP projections carry no bias, so this path is effectively unused.
    biases = linear.bias.chunk(chunks) if linear.bias is not None else [None] * chunks
    linear_layers = []
    for weight, bias in zip(weights, biases):
        new_linear = nn.Linear(
            in_features=in_features,
            out_features=out_features,
            bias=bias is not None,
        )
        new_linear.weight = nn.Parameter(weight.clone())  # clone so chunks don't share storage
        if bias is not None:
            new_linear.bias = nn.Parameter(bias.clone())
        linear_layers.append(new_linear)
    return tuple(linear_layers)


class UpcycledModelMixin:
    sparse_moe_block_cls: type

    @classmethod
    def upcycled_from(cls, source_model, config: UpcyclingConfig) -> Self:
        # Derive the MoE config from the dense config: each expert gets a
        # 1/partitions_from_mlp slice of the dense intermediate size, while the
        # shared expert keeps the full dense intermediate size.
        upcycled_model_config = cls.config_class(**source_model.config.to_dict())
        upcycled_model_config.moe_intermediate_size = (
            upcycled_model_config.intermediate_size // config.partitions_from_mlp
        )
        if hasattr(upcycled_model_config, "shared_expert_intermediate_size"):
            upcycled_model_config.shared_expert_intermediate_size = source_model.config.intermediate_size

        upcycled_model = cls(upcycled_model_config)
        upcycled_model.model.embed_tokens = source_model.model.embed_tokens

        for upcycled_layer, layer in zip(upcycled_model.model.layers, source_model.model.layers):
            # Attention and layer norms are reused as-is; only the MLP is upcycled.
            upcycled_layer.self_attn = layer.self_attn
            upcycled_mlp_layers = [deepcopy(layer.mlp) for _ in range(config.upcycling_factor)]
            if hasattr(upcycled_layer.mlp, "shared_expert"):
                upcycled_layer.mlp.shared_expert = upcycled_mlp_layers.pop(-1)
            # Each remaining MLP copy is partitioned into `partitions_from_mlp`
            # fine-grained experts. The router (upcycled_layer.mlp.gate) keeps
            # its random initialization.
            for experts, (mlp,) in iterate_in_chunks(
                upcycled_layer.mlp.experts, upcycled_mlp_layers, config.partitions_from_mlp, 1
            ):
                gate_projs = chunk_linear(mlp.gate_proj, config.partitions_from_mlp, down_proj=False)
                up_projs = chunk_linear(mlp.up_proj, config.partitions_from_mlp, down_proj=False)
                down_projs = chunk_linear(mlp.down_proj, config.partitions_from_mlp, down_proj=True)
                for i, expert in enumerate(experts):
                    expert.gate_proj = gate_projs[i]
                    expert.up_proj = up_projs[i]
                    expert.down_proj = down_projs[i]
                    expert.act_fn = deepcopy(mlp.act_fn)
            upcycled_layer.input_layernorm = layer.input_layernorm
            upcycled_layer.post_attention_layernorm = layer.post_attention_layernorm

        upcycled_model.lm_head = source_model.lm_head
        return upcycled_model


from transformers import Qwen2MoeForCausalLM as _Qwen2MoeForCausalLM
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock


class Qwen2MoeForCausalLM(UpcycledModelMixin, _Qwen2MoeForCausalLM):
    sparse_moe_block_cls = Qwen2MoeSparseMoeBlock


source_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B", device_map="auto")
model = Qwen2MoeForCausalLM.upcycled_from(
    source_model,
    UpcyclingConfig(
        finegrained_experts=64,
        partitions_from_mlp=4,
    ),
)
model = model.bfloat16()
```
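Why this partitioning is lossless at initialization: the SwiGLU MLP computes `down_proj(act_fn(gate_proj(x)) * up_proj(x))`, and both the activation and the elementwise product operate independently per intermediate channel, so slicing `gate_proj`/`up_proj` along the output dimension and `down_proj` along the input dimension yields partitions whose outputs sum exactly to the dense output. Here is a minimal self-check on a toy MLP (my addition, not part of the original script) that reuses `chunk_linear` from above:

```python
import torch
from torch import nn

# Toy SwiGLU-style MLP: partition it with chunk_linear and confirm that
# summing the partitions' outputs reproduces the dense output.
hidden, intermediate, k = 16, 32, 4

gate = nn.Linear(hidden, intermediate, bias=False)
up = nn.Linear(hidden, intermediate, bias=False)
down = nn.Linear(intermediate, hidden, bias=False)
act = nn.SiLU()

x = torch.randn(2, hidden)
dense_out = down(act(gate(x)) * up(x))

gate_chunks = chunk_linear(gate, k, down_proj=False)
up_chunks = chunk_linear(up, k, down_proj=False)
down_chunks = chunk_linear(down, k, down_proj=True)

# Each (gate_i, up_i, down_i) triple behaves like one fine-grained expert;
# their outputs sum back to the dense MLP output.
expert_out = sum(
    d(act(g(x)) * u(x))
    for g, u, d in zip(gate_chunks, up_chunks, down_chunks)
)

torch.testing.assert_close(expert_out, dense_out)
```

In the upcycled model the router weights experts instead of summing them, so the MoE does not reproduce the dense model's outputs out of the box; the partitions are only an initialization for further training.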
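After upcycling, the model can be smoke-tested and saved with standard transformers APIs. A hedged sketch continuing from the script above; the output directory and prompt are placeholders, and since the routers are freshly initialized, generations will be degraded until the model is further trained:

```python
import torch
from transformers import AutoTokenizer

# Reuse the source model's tokenizer; the vocabulary is unchanged.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Persist the upcycled checkpoint (placeholder path; push_to_hub is optional).
model.save_pretrained("qwen1.5-moe-upcycled")
tokenizer.save_pretrained("qwen1.5-moe-upcycled")
```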
### Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

### Direct Use

[More Information Needed]

### Downstream Use [optional]

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]