This model fuses Qwen3.5-35B-A3B and Qwen3.6-35B-A3B using the combined WUDI and FDAs merging methods.
Model Highlights:
- Merge Method: WUDI+FDAs
- Precision: dtype: bfloat16
- Context Length: 262,144
Parameter Settings:
Non-Thinking Mode: ({%- set enable_thinking = false %})
General Tasks:
temperature=0.7,top_p=0.8,top_k=20,min_p=0.0,presence_penalty=1.5,repetition_penalty=1.0
Reasoning Tasks:
temperature=1.0,top_p=1.0,top_k=40,min_p=0.0,presence_penalty=2.0,repetition_penalty=1.0
Thinking Mode: ({%- set enable_thinking = true %})
General Tasks:
temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,presence_penalty=1.5,repetition_penalty=1.1
Coding Tasks:
temperature=0.6,top_p=0.95,top_k=20,min_p=0.0,presence_penalty=0.0,repetition_penalty=1.1
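The settings above can be collected into a small lookup table so that client code picks the right preset per mode and task. This is a minimal sketch: the names `SAMPLING_PRESETS` and `sampling_params` are hypothetical helpers, not part of any library, and only the numeric values come from the list above. Which of these keys a given inference server accepts depends on that server.

```python
# Hypothetical lookup table; keys are (enable_thinking, task).
# Values are copied from the recommended settings listed above.
SAMPLING_PRESETS = {
    (False, "general"): dict(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
                             presence_penalty=1.5, repetition_penalty=1.0),
    (False, "reasoning"): dict(temperature=1.0, top_p=1.0, top_k=40, min_p=0.0,
                               presence_penalty=2.0, repetition_penalty=1.0),
    (True, "general"): dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                            presence_penalty=1.5, repetition_penalty=1.1),
    (True, "coding"): dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                           presence_penalty=0.0, repetition_penalty=1.1),
}


def sampling_params(enable_thinking: bool, task: str) -> dict:
    """Return a copy of the preset for the given mode and task."""
    key = (enable_thinking, task)
    if key not in SAMPLING_PRESETS:
        known = sorted(t for mode, t in SAMPLING_PRESETS if mode == enable_thinking)
        raise KeyError(f"no preset for task {task!r}; known tasks: {known}")
    return dict(SAMPLING_PRESETS[key])
```

For example, `sampling_params(True, "coding")` returns the thinking-mode coding preset, which can then be passed to whatever sampling interface the serving stack exposes.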
Papers:
WUDI
FDAs
Specific Merging Procedure:
This model was created using a two-stage hybrid merging pipeline. We first utilize WUDI to eliminate parameter-space interference without requiring additional data, and subsequently apply FDAs to refine the Feed-Forward Networks layer-by-layer via input-space knowledge distillation.
Step 1: WUDI Merging (Parameter-Space Alignment)
WUDI theoretically demonstrates that the task vectors of linear layers constitute an approximate linear subspace of their corresponding inputs. By leveraging this property, we minimize interference under the guidance of task vectors directly in the parameter space, eliminating the need for rescaling coefficients or extra datasets.
Hyperparameters:
- Optimizer: Adam
- Iteration Steps: 300
- Learning Rate: 1e-5
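To make Step 1 concrete, here is a minimal NumPy sketch for a single linear layer. It assumes the interference objective takes the form of a sum over tasks of ||(tau_m - tau_i) tau_i^T||_F^2 / ||tau_i||_F^2, and that the merged task vector is initialized as the sum of the task vectors; both are assumptions for illustration and should be checked against the WUDI paper. The function names are hypothetical, and the Adam update is written out by hand with common default betas, using the step count and learning rate listed above.

```python
import numpy as np


def wudi_loss_and_grad(tau_m, task_vectors):
    """Assumed interference loss for one linear layer, with its gradient w.r.t. tau_m."""
    loss = 0.0
    grad = np.zeros_like(tau_m)
    for tau_i in task_vectors:
        scale = 1.0 / np.sum(tau_i ** 2)        # 1 / ||tau_i||_F^2
        resid = (tau_m - tau_i) @ tau_i.T       # shape (out, out)
        loss += scale * np.sum(resid ** 2)
        grad += 2.0 * scale * (resid @ tau_i)   # shape (out, in)
    return loss, grad


def wudi_merge(task_vectors, steps=300, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Optimize a merged task vector with a hand-written Adam loop (data-free)."""
    tau_m = np.sum(task_vectors, axis=0).astype(float)  # assumed initialization
    m = np.zeros_like(tau_m)
    v = np.zeros_like(tau_m)
    for t in range(1, steps + 1):
        _, g = wudi_loss_and_grad(tau_m, task_vectors)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        tau_m -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return tau_m
```

In this sketch each task vector would be a fine-tuned weight matrix minus the shared base weight, and the merged layer weight would be the base weight plus the returned `tau_m`. No input data is involved, which mirrors the data-free property described above.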
Step 2: FDAs Refinement (Input-Space Knowledge Distillation)
Functional Dual Anchors (FDAs) shift the merging process into the input-representation space. Instead of merely manipulating parameter offsets, FDAs generate synthetic inputs whose induced gradients align with the task vectors. This refinement is applied layer-wise and strictly scoped to the Feed-Forward Networks to capture richer task-specific functional shifts.
Phase 2.1: FDAs Construction (Data Synthesis in Input Space)
We construct synthetic inputs (FDAs) that effectively simulate the role of task vectors. By optimizing these inputs, we ensure that the gradients they induce on the pretrained model align with the actual parameter shifts of the downstream models.
Hyperparameters:
- Initialization: Scaled Gaussian Sampling (sigma = 0.01)
- Distance Metric: Cosine Distance
- Optimizer: AdamW
- Iteration Steps: 400
- Learning Rate: 1e-2
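The alignment objective of Phase 2.1 can be illustrated on a toy linear layer y = W x. The sketch below assumes a squared-error distance between the pretrained layer and the downstream layer, so that the negative gradient at the pretrained weights is tau X^T X, where tau is the task vector and the rows of X are the anchors. That choice of distance, the sign convention, and all function names are assumptions for illustration. Only the objective is shown; the AdamW loop over the anchors is omitted because it needs automatic differentiation.

```python
import numpy as np


def init_anchors(n_anchors, in_dim, sigma=0.01, seed=0):
    """Scaled Gaussian initialization of the synthetic inputs (one anchor per row)."""
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal((n_anchors, in_dim))


def induced_descent_direction(tau, anchors):
    """Negative gradient at the pretrained weights for a toy linear layer.

    For loss 0.5 * ||W x - W_i x||^2 summed over anchors, the gradient at W_0 is
    (W_0 - W_i) X^T X = -tau X^T X, so the descent direction is tau X^T X.
    """
    return tau @ (anchors.T @ anchors)


def cosine_distance(a, b):
    """1 - cosine similarity between two matrices, treated as flat vectors."""
    return 1.0 - np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))


def fda_objective(tau, anchors):
    """Cosine distance between the anchor-induced descent direction and the task vector."""
    return cosine_distance(induced_descent_direction(tau, anchors), tau)
```

In this toy setting the objective reaches zero when X^T X is a multiple of the identity, and it never exceeds 1 because X^T X is positive semidefinite. An optimizer would adjust the anchors to push the distance toward zero.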
Phase 2.2: Parameter Optimization (Knowledge Distillation)
We optimize the layer parameters by minimizing the Mean Squared Error (MSE) between the merged model's representations and the teachers' representations on the FDAs.
Hyperparameters:
- Distance Metric: MSE
- Optimizer: Adam
- Batch Size: 16384
- Iteration Steps: 100
- Learning Rate: 1e-2
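The distillation step of Phase 2.2 can be sketched for a single linear layer as follows. This is a simplified illustration under stated assumptions: each teacher is one weight matrix with its own set of anchors, the loss is the MSE between merged and teacher outputs on those anchors, and plain full-batch gradient descent stands in for Adam with mini-batches. The function names are hypothetical.

```python
import numpy as np


def distill_loss_and_grad(w, teachers, anchor_sets):
    """MSE between merged-layer and teacher outputs on each teacher's anchors."""
    loss = 0.0
    grad = np.zeros_like(w)
    for w_i, x_i in zip(teachers, anchor_sets):
        diff = x_i @ (w - w_i).T                  # shape (n_anchors, out)
        loss += np.mean(diff ** 2)
        grad += 2.0 * (diff.T @ x_i) / diff.size  # shape (out, in)
    return loss, grad


def distill(w_init, teachers, anchor_sets, steps=100, lr=1e-2):
    """Refine the merged weights with plain gradient descent (Adam swapped out for brevity)."""
    w = np.array(w_init, dtype=float)  # copy, so the caller's array is not modified
    for _ in range(steps):
        _, g = distill_loss_and_grad(w, teachers, anchor_sets)
        w -= lr * g
    return w
```

In the pipeline described here, `w_init` would be the WUDI-merged weight from Step 1 and `anchor_sets` would hold the FDAs built in Phase 2.1, one set per source model.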