Sparse GPT-J 6B

Model Description

The sparse version of GPT-J 6B is a pruned variant derived from the original GPT-J 6B model and the vast majority of linear layers maintain a 40% unstructured sparsity (except for the 'lm_head').

Hyperparameter Value
nparametersn_{parameters} 6053381344
nlayersn_{layers} 28*
dmodeld_{model} 4096
dffd_{ff} 16384
nheadsn_{heads} 16
dheadd_{head} 256
nctxn_{ctx} 2048
nvocabn_{vocab} 50257/50400† (same tokenizer as GPT-2/3)
Positional Encoding Rotary Position Embedding RoPE
RoPE Dimensions 64

* Each layer consists of one feedforward block and one self attention block.

† Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.

The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.

Evaluation results

Evaluating the accuracy of the sparse model of gpt-j-6b using the lambada_openai dataset in lm_eval, providing the accuracy fluctuation under two precisions: FP32 and BF16.

Sparsity Dataset Precision Dense Acc ↑ Sparse Acc ↑ Acc fluctuations
40% Lambada_openai FP32 0.6831 0.6922 +1.33%
40% Lambada_openai BF16 0.6771 0.6874 +0.63%
Downloads last month
23
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Collection including Intel/gpt-j-6b-sparse