Raincleared committed
Commit 48902ce
1 Parent(s): 5964ef6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -97,7 +97,7 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
 
  ### Inference Acceleration Effects
 
- First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework leveraging activation sparsity. As its inference speed and accuracy heavily rely on the performance of activation predictors, we report the activation recall and predicted sparsity (i.e., two key metrics for evaluating the activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs). The GGUF files and activation predictors for ProSparse-7B are available at [ProSparse-LLaMA-2-7B-GGUF](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf)([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-gguf)) and [ProSparse-LLaMA-2-7B-Predictor](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-predictor)([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-predictor)) respectively.
+ First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework leveraging activation sparsity. As its inference speed and accuracy heavily rely on the performance of activation predictors, we report the activation recall and predicted sparsity (i.e., two key metrics for evaluating the activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs). The GGUF files and activation predictors for ProSparse-7B are available at [ProSparse-LLaMA-2-7B-GGUF](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf) ([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-gguf)) and [ProSparse-LLaMA-2-7B-Predictor](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-predictor) ([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-predictor)) respectively.
 
  Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
 
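For context on the two predictor metrics mentioned in the changed line, here is a minimal sketch (not taken from the commit or from the ProSparse/PowerInfer code) of how activation recall and predicted sparsity could be computed for one layer; the tensor shapes, the keep/activation thresholds, and the notion of a "true" activation below are illustrative assumptions:

```python
import torch

def predictor_metrics(pred_mask: torch.Tensor, true_act: torch.Tensor):
    """Activation recall and predicted sparsity for one FFN layer.

    pred_mask: bool tensor, True where the predictor expects a neuron to
               activate (and therefore computes it).
    true_act:  bool tensor of the same shape, True where the neuron actually
               activates (non-zero ReLU output).
    """
    # Recall: fraction of truly activated neurons that the predictor kept.
    recall = (pred_mask & true_act).sum().item() / max(true_act.sum().item(), 1)
    # Predicted sparsity: fraction of neurons the predictor marks as inactive
    # and can therefore skip during inference.
    predicted_sparsity = 1.0 - pred_mask.float().mean().item()
    return recall, predicted_sparsity

# Toy usage with random masks; 11008 is the LLaMA-2-7B FFN width.
pred = torch.rand(4, 11008) < 0.15   # predictor keeps ~15% of neurons
true = torch.rand(4, 11008) < 0.11   # ~11% of neurons truly activate
print(predictor_metrics(pred, true))
```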
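The diff hunk ends right after the sentence introducing the gated FFN, so as a generic illustration only (an assumption, not the actual CUDA kernels in sparse_gpu_operator), the following sketch shows how activation sparsity can be exploited in a ReLU-gated FFN; the gate/up/down projection names and shapes are chosen for the example:

```python
import torch

def sparse_gated_ffn(x, W_gate, W_up, W_down):
    """Single-token ReLU-gated FFN that skips inactive neurons.

    x:      (hidden,) input vector
    W_gate: (intermediate, hidden) gate projection
    W_up:   (intermediate, hidden) up projection
    W_down: (hidden, intermediate) down projection
    """
    # Gate projection + ReLU: zeros here define which neurons are inactive.
    s = torch.relu(W_gate @ x)
    idx = torch.nonzero(s, as_tuple=True)[0]   # indices of active neurons
    # Up projection computed only for the active neurons.
    a = s[idx] * (W_up[idx] @ x)
    # Down projection restricted to the active columns.
    return W_down[:, idx] @ a

# Toy usage with random weights; sparsity is whatever ReLU happens to produce.
hidden, inter = 8, 32
x = torch.randn(hidden)
out = sparse_gated_ffn(x, torch.randn(inter, hidden),
                       torch.randn(inter, hidden), torch.randn(hidden, inter))
print(out.shape)  # torch.Size([8])
```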