---
license: apache-2.0
---

# Qwen2-7B-ReLU

Qwen2-7B-ReLU is a variant of Qwen2-7B that replaces the SiLU/Swish activation function with dReLU, achieving higher activation sparsity while maintaining the performance of the original model.

## Key Features

- Replaces the SiLU/Swish activation function with dReLU (see the sketch below)
- Maintains performance comparable to, or even better than, the original Qwen2-7B
- Significantly increases activation sparsity, enabling further optimization and compression

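For intuition, the sketch below contrasts the standard SwiGLU-style branch combination used in the original Qwen2 MLP (SiLU applied to the gate branch only) with the dReLU variant on plain tensors. The layer sizes and random weights are purely illustrative and do not correspond to the released checkpoint.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden, intermediate = 16, 64       # illustrative sizes, not the real model dimensions
x = torch.randn(4, hidden)          # a small batch of hidden states

gate = nn.Linear(hidden, intermediate, bias=False)
up = nn.Linear(hidden, intermediate, bias=False)

# Original Qwen2 MLP: SiLU on the gate branch only.
swiglu = nn.functional.silu(gate(x)) * up(x)

# dReLU variant: ReLU applied to both the gate and the up branch.
drelu = torch.relu(gate(x)) * torch.relu(up(x))

# ReLU zeroes out negative pre-activations in both branches, so the product is
# exactly zero wherever either branch is negative -- hence the higher sparsity.
print("SwiGLU zero fraction:", (swiglu == 0).float().mean().item())
print("dReLU zero fraction: ", (drelu == 0).float().mean().item())
```
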
## Benchmarks

The model has been evaluated on standard benchmarks to verify its performance:

- **MMLU**: 69.19% (5-shot)
- **IFEval**: 73.2% (Prompt Strict-Accuracy)
- **LiveBench**:
  - Average: 32.1%
  - Coding: 39.8%
  - Data Analysis: 45.3%
  - Instruction Following: 58.1%
  - Language: 9.0%
  - Math: 22.0%
  - Reasoning: 18.7%

These results demonstrate that the dReLU modification maintains competitive performance while achieving higher sparsity than the original model.

## Technical Details

The key modification in this version is the application of the ReLU activation to both branches of the MLP block. The implementation modifies the original `Qwen2MLP` class as follows:

```python
import torch.nn as nn
from transformers.activations import ACT2FN


class Qwen2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        # With hidden_act set to "relu" in the model config, act_fn is ReLU,
        # which gives the dReLU formulation used in the forward pass below.
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # Apply the activation to BOTH the gate and the up projections before
        # multiplying, instead of activating only the gate branch.
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))
        return down_proj
```

The key change is in the forward pass, where the activation function is now applied to both the gate projection and the up projection outputs before multiplication. This modification, combined with the use of ReLU, contributes to the increased sparsity of the model.

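To observe this sparsity empirically, the sketch below registers a forward hook on every `down_proj` module and records the fraction of exact zeros in its input, i.e. the post-activation intermediate tensor. It assumes `model` and `tokenizer` have already been loaded as in the Quick Start section below, with the dReLU patch applied; the prompt and the `sparsity` bookkeeping are illustrative only.

```python
import torch

sparsity = {}

def make_hook(name):
    def hook(module, inputs, output):
        # inputs[0] is the post-activation tensor fed into down_proj;
        # with dReLU, many of its entries are exactly zero.
        sparsity[name] = (inputs[0] == 0).float().mean().item()
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith("mlp.down_proj")
]

with torch.no_grad():
    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    model(**inputs)

for handle in handles:
    handle.remove()

print("mean activation sparsity:", sum(sparsity.values()) / len(sparsity))
```
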
## Intended Usage

This release primarily targets the research community for:

- Studying sparsity in large language models
- Model compression and optimization research
- Understanding the impact of activation functions on model behavior

## Model Limitations

- The model may exhibit biases present in the training data
- May generate incorrect, inappropriate, or harmful content
- Performance may vary across different domains and tasks
- Not suitable for production deployment without proper evaluation

## Quick Start

Before loading the model, replace the FFN (`Qwen2MLP`) implementation in the original `modeling_qwen2.py` with the dReLU version shown above; one way to do this is sketched below.

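For example, a minimal monkey-patch that overrides `Qwen2MLP.forward` with the dReLU forward pass before the checkpoint is loaded could look like the sketch below. This is illustrative rather than an official patch, and it assumes `hidden_act` is set to `"relu"` in the model config so that `self.act_fn` is ReLU; editing `modeling_qwen2.py` directly works just as well.

```python
from transformers.models.qwen2 import modeling_qwen2


def drelu_forward(self, x):
    # dReLU: apply the activation to both the gate and the up projections.
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.act_fn(self.up_proj(x)))


# Patch the stock class before calling from_pretrained so every MLP block
# uses the dReLU formulation.
modeling_qwen2.Qwen2MLP.forward = drelu_forward
```
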
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the sparsified checkpoint (after applying the dReLU Qwen2MLP patch).
model = AutoModelForCausalLM.from_pretrained("PowerInfer/SparseQwen2-7B")
tokenizer = AutoTokenizer.from_pretrained("PowerInfer/SparseQwen2-7B")

# Run a short generation to check that everything works.
prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
response = tokenizer.decode(outputs[0])
```

## Citation

If you use this model in your research, please cite:

```bibtex
@article{song2024turbo,
  title={Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters},
  author={Song, Yixin and Xie, Haotong and Zhang, Zhengyan and Wen, Bo and Ma, Li and Mi, Zeyu and Chen, Haibo},
  journal={arXiv preprint arXiv:2406.05955},
  year={2024}
}
```