This model fuses Qwen3.5-35B-A3B and Qwen3.6-35B-A3B using a combination of the WUDI and FDAs merging methods.

Model Highlights:

Merge Method: WUDI + FDAs
Precision: bfloat16
Context Length: 262,144 tokens

Parameter Settings:

Non-Thinking Mode: (`{%- set enable_thinking = false %}`)

General Tasks:

temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Reasoning Tasks:

temperature=1.0, top_p=1.0, top_k=40, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0

Thinking Mode: (`{%- set enable_thinking = true %}`)

General Tasks:

temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.1

Coding Tasks:

temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.1
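
The four presets above can be kept in one lookup table so that a client picks the right values per mode and task. The following is a minimal sketch: the values are copied from this card, while the names `SAMPLING_PRESETS` and `get_sampling_params` are hypothetical and not part of any library. Which of these keys a given inference runtime accepts is not stated in this card, so check before passing them through.

```python
# Sampling presets copied from this card.
# The dict layout and the helper name are illustrative assumptions.
SAMPLING_PRESETS = {
    ("non-thinking", "general"): dict(
        temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
        presence_penalty=1.5, repetition_penalty=1.0),
    ("non-thinking", "reasoning"): dict(
        temperature=1.0, top_p=1.0, top_k=40, min_p=0.0,
        presence_penalty=2.0, repetition_penalty=1.0),
    ("thinking", "general"): dict(
        temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
        presence_penalty=1.5, repetition_penalty=1.1),
    ("thinking", "coding"): dict(
        temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
        presence_penalty=0.0, repetition_penalty=1.1),
}


def get_sampling_params(mode, task):
    """Return a copy of the preset for a (mode, task) pair."""
    try:
        return dict(SAMPLING_PRESETS[(mode, task)])
    except KeyError:
        raise KeyError(f"no preset for mode={mode!r}, task={task!r}") from None


coding_params = get_sampling_params("thinking", "coding")
```

Returning a copy keeps callers from mutating the shared table when they override a single value.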

Papers:

WUDI

https://arxiv.org/abs/2503.08099

FDAs

https://arxiv.org/abs/2510.21223

Specific Merging Procedure:

This model was created with a two-stage hybrid merging pipeline. WUDI is applied first to minimize parameter-space interference without additional data. FDAs are then applied to refine the feed-forward networks (FFNs) layer by layer through input-space knowledge distillation.

Step 1: WUDI Merging (`Parameter-Space Alignment`)

The WUDI paper shows that the task vectors of a linear layer form an approximate linear subspace of that layer's inputs. Using this property, interference is minimized directly in parameter space under the guidance of the task vectors, with no rescaling coefficients or extra datasets. For layer $l$, with $\tau_{i,l}$ the task vector of model $i$ and $\tau_{m,l}$ the merged task vector, the objective is:

$\mathcal{L}_l = \sum_{i} \frac{1}{\|\tau_{i,l}\|_F^2} \left\| (\tau_{m,l} - \tau_{i,l}) \tau_{i,l}^\top \right\|_F^2$

Hyperparameters:

Optimizer: Adam
Iteration Steps: 300
Learning Rate: 1e-5
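
To make the layer-wise objective concrete, here is a minimal NumPy sketch that evaluates the loss and reduces it for one toy layer. It is not the pipeline used for this model. It assumes small random matrices in place of real task vectors, initializes the merged task vector as the sum of the task vectors, and swaps in plain gradient descent with a hand-picked step size instead of Adam at 1e-5. The function names are hypothetical.

```python
import numpy as np


def wudi_loss(tau_m, task_vectors):
    """sum_i ||(tau_m - tau_i) tau_i^T||_F^2 / ||tau_i||_F^2 for one layer."""
    total = 0.0
    for tau_i in task_vectors:
        residual = (tau_m - tau_i) @ tau_i.T
        total += np.sum(residual ** 2) / np.sum(tau_i ** 2)
    return total


def wudi_grad(tau_m, task_vectors):
    """Analytic gradient of wudi_loss with respect to tau_m."""
    grad = np.zeros_like(tau_m)
    for tau_i in task_vectors:
        grad += 2.0 * (tau_m - tau_i) @ (tau_i.T @ tau_i) / np.sum(tau_i ** 2)
    return grad


def merge_layer(task_vectors, steps=300, lr=0.1):
    """Plain gradient descent on the loss (a stand-in for Adam)."""
    tau_m = np.sum(task_vectors, axis=0)  # assumed initialization
    for _ in range(steps):
        tau_m = tau_m - lr * wudi_grad(tau_m, task_vectors)
    return tau_m


rng = np.random.default_rng(0)
# Toy task vectors with shape (out_features, in_features).
task_vectors = [rng.normal(size=(4, 6)) for _ in range(2)]

loss_before = wudi_loss(np.sum(task_vectors, axis=0), task_vectors)
merged_tau = merge_layer(task_vectors)
loss_after = wudi_loss(merged_tau, task_vectors)
```

The merged layer weight would then be the pretrained weight plus `merged_tau`. The step size 0.1 is safe here because the loss is quadratic and, with two task vectors, its curvature is bounded by 4.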

Step 2: FDAs Refinement (`Input-Space Knowledge Distillation`)

Functional Dual Anchors (FDAs) shift the merging process into the input-representation space. Instead of only manipulating parameter offsets, the method generates synthetic inputs whose induced gradients align with the task vectors. This refinement is applied layer by layer and is restricted to the feed-forward networks, with the aim of capturing richer task-specific functional shifts.

Phase 2.1: FDAs Construction (`Data Synthesis in Input Space`)

We construct synthetic inputs (FDAs) that effectively simulate the role of task vectors. By optimizing these inputs, we ensure that the gradients they induce on the pretrained model align with the actual parameter shifts of the downstream models.

$\min_{x_{i1}, \dots, x_{in}} \text{cos\_dist} \left( \nabla_\theta \sum_{j=1}^n \text{Dist}(\phi(\theta, x_{ij}), \phi(\theta_i, x_{ij})) \Big|_{\theta=\theta_0}, \tau_i \right)$

Hyperparameters:

Initialization: Scaled Gaussian Sampling (sigma = 0.01)
Distance Metric: Cosine Distance
Optimizer: AdamW
Iteration Steps: 400
Learning Rate: 1e-2
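
The construction objective can be illustrated by scoring a set of candidate anchors on a toy layer. The sketch below only evaluates the objective; it does not run the AdamW optimization. It rests on several assumptions that are not taken from this card: the layer is a single linear map $\phi(W, x) = Wx$, the inner `Dist` is squared error rather than cosine distance, and the descent direction (the negative gradient) is what gets compared with the task vector. All function names are hypothetical.

```python
import numpy as np


def induced_gradient(w0, w_i, anchors):
    """Gradient of sum_j ||w x_j - w_i x_j||^2 w.r.t. w, evaluated at w = w0."""
    diff = anchors @ (w0 - w_i).T   # (n, out): output gap on each anchor
    return 2.0 * diff.T @ anchors   # (out, in)


def cosine_distance(a, b):
    """1 - cosine similarity between two matrices, flattened."""
    return 1.0 - np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))


def anchor_objective(w0, w_i, anchors):
    """Cosine distance between the descent direction and the task vector."""
    tau_i = w_i - w0
    # Assumption: compare the negative gradient with tau_i.
    return cosine_distance(-induced_gradient(w0, w_i, anchors), tau_i)


rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 6))              # toy pretrained weight
w_i = w0 + 0.1 * rng.normal(size=(4, 6))  # toy fine-tuned weight

# Scaled Gaussian initialization with sigma = 0.01, as listed above.
random_anchors = 0.01 * rng.normal(size=(8, 6))
# Anchors whose Gram matrix is the identity.
orthonormal_anchors = np.eye(6)

score_random = anchor_objective(w0, w_i, random_anchors)
score_orthonormal = anchor_objective(w0, w_i, orthonormal_anchors)
```

In this toy setting, anchors with an identity Gram matrix reproduce the task-vector direction exactly, so their score is zero up to rounding, while randomly initialized anchors score somewhere between 0 and 1. An optimizer would move the anchors to push that score down.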

Phase 2.2: Parameter Optimization (`Knowledge Distillation`)

We optimize the layer parameters by minimizing the Mean Squared Error between the merged model's representations and the teachers' representations on the FDAs:

$\min_{\theta^{(l)}} \sum_{i=1}^m \sum_{j=1}^n \text{MSE} \left( \phi^{(l)}(\theta^{(l)}, x_{ij}), \phi^{(l)}(\theta_i^{(l)}, x_{ij}) \right)$

Hyperparameters:

Distance Metric: MSE
Optimizer: Adam
Batch Size: 16384
Iteration Steps: 100
Learning Rate: 1e-2
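
The distillation step can be sketched for one toy linear layer as follows. This is an illustration under stated assumptions, not the procedure used for this model: the anchors are random rather than constructed, the starting point is the average of the teacher weights (in the pipeline it would be the WUDI-merged layer), and plain gradient descent with a curvature-based step size stands in for Adam at 1e-2. The function names are hypothetical.

```python
import numpy as np


def distill_loss(w, teachers, anchor_sets):
    """MSE between w's outputs and each teacher's outputs on that teacher's anchors."""
    total, count = 0.0, 0
    for w_i, x_i in zip(teachers, anchor_sets):
        err = x_i @ (w - w_i).T
        total += np.sum(err ** 2)
        count += err.size
    return total / count


def distill_grad(w, teachers, anchor_sets):
    """Analytic gradient of distill_loss with respect to w."""
    grad = np.zeros_like(w)
    count = 0
    for w_i, x_i in zip(teachers, anchor_sets):
        grad += 2.0 * (w - w_i) @ (x_i.T @ x_i)
        count += x_i.shape[0] * w.shape[0]
    return grad / count


rng = np.random.default_rng(0)
# Toy teacher weights (out, in) and one anchor set (n, in) per teacher.
teachers = [rng.normal(size=(4, 6)) for _ in range(2)]
anchor_sets = [rng.normal(size=(8, 6)) for _ in range(2)]

w = np.mean(teachers, axis=0)  # assumed starting point
loss_before = distill_loss(w, teachers, anchor_sets)

# Step size 1/L, where L is the largest curvature of the quadratic loss.
count = sum(x.shape[0] * w.shape[0] for x in anchor_sets)
curvature = 2.0 * sum(x.T @ x for x in anchor_sets) / count
lr = 1.0 / np.linalg.eigvalsh(curvature)[-1]

for _ in range(100):  # iteration count as listed above
    w = w - lr * distill_grad(w, teachers, anchor_sets)

loss_after = distill_loss(w, teachers, anchor_sets)
```

Because the loss is a convex quadratic in the layer weight, a step size of `1 / L` cannot increase it, which keeps the sketch stable without tuning.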

Safetensors:

Model size: 36B params
Tensor type: BF16

Model tree for YOYO-AI/Qwen3.6-35B-A3B-YOYO

Qwen/Qwen3.5-35B-A3B

Qwen/Qwen3.5-35B-A3B-Base

Qwen/Qwen3.6-35B-A3B

Merge model

this model

Papers for YOYO-AI/Qwen3.6-35B-A3B-YOYO

Model Merging with Functional Dual Anchors (arXiv 2510.21223, published Oct 24, 2025)

Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors (arXiv 2503.08099, published Jun 11, 2025)
