arxiv:2207.08006

S4: a High-sparsity, High-performance AI Accelerator

Published on Jul 16, 2022

Authors:

Ian En-Hsu Yen ,

Abstract

Exploiting sparsity underlying neural networks has become one of the most potential methodologies to reduce the memory footprint, I/O cost, and computation workloads during inference. And the degree of sparsity one can exploit has become higher as larger model sizes have been considered along with the trend of pre-training giant models. On the other hand, compared with quantization that has been a widely supported option, acceleration through high-degree <PRE_TAG>sparsity</POST_TAG> is not supported in most computing platforms. In this work, we introduce the first commercial hardware platform supporting high-degree <PRE_TAG>sparsity</POST_TAG> acceleration up to 32 times -- S4. Combined with state-of-the-art sparse pruning techniques, we demonstrate several-times practical <PRE_TAG>inference speedup</POST_TAG> on S4 over mainstream inference platforms such as Nvidia T4. We also show that in practice a sparse model of larger size can achieve both higher accuracy and higher throughput on S4 than a dense model of smaller size.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2207.08006 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2207.08006 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2207.08006 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.