Hilbertmeng committed
Commit: 6becb1d
Parent(s): b555738
add paper link
README.md CHANGED
@@ -10,7 +10,7 @@ license: mit
 ---
 DCPythia-6.9B is a language model pretrained on the Pile with 300B tokens. In comparison with Pythia-6.9B, we validate the scaling performance of Dynamically
 Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model
-by dynamically composing attention heads. Please see downstream evaluations and more details in the paper [(Improving Transformers with Dynamically Composable Multi-Head Attention)](). In addition, we open-source JAX training code on [(GitHub)](https://github.com/Caiyun-AI/DCFormer/).
+by dynamically composing attention heads. Please see downstream evaluations and more details in the paper [(Improving Transformers with Dynamically Composable Multi-Head Attention)](https://arxiv.org/abs/2405.08553). In addition, we open-source JAX training code on [(GitHub)](https://github.com/Caiyun-AI/DCFormer/).

 We recommend the <strong>compiled version</strong> of DCPythia with *torch.compile* for inference acceleration. Please refer to the Generation section for the compile implementation.

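To make the "dynamically composing attention heads" idea more concrete, here is a toy sketch in which per-head attention scores are mixed across the head dimension with query-dependent weights, so each head can borrow attention patterns from the others. The shapes and the `mix_proj` projection are my own assumptions for illustration; this is not the paper's actual DCMHA formulation, which should be taken from the paper itself.

```python
import torch

# Toy illustration (assumption): cross-head composition of attention maps.
# In plain MHA each head attends independently; here each head's scores
# become a query-dependent mixture of all heads' scores.
B, H, T, D = 2, 4, 8, 16                                   # batch, heads, seq len, head dim
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)

scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / D**0.5    # per-head attention scores

# Query-dependent mixing weights over heads: for each (batch, query position),
# an H x H matrix says how much each output head borrows from each input head.
mix_proj = torch.nn.Linear(H * D, H * H)
q_flat = q.permute(0, 2, 1, 3).reshape(B, T, H * D)        # concatenate heads per query
mix = mix_proj(q_flat).reshape(B, T, H, H).softmax(dim=-1)

# Composed scores: mix the H score maps for every query position.
composed = torch.einsum("bqij,bjqk->biqk", mix, scores)
attn = composed.softmax(dim=-1)                            # (B, H, T, T)
```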
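The README also recommends running the compiled model with *torch.compile* for faster inference. A minimal sketch of what that might look like for generation is below; the model id, `trust_remote_code=True`, and `generate` support are assumptions on my part, not the repo's actual Generation section.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; replace with the actual model path if it differs.
model_id = "Caiyun-AI/DCPythia-6.9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # assumption: DCMHA ships as custom modeling code
)
model = model.to("cuda").eval()

# torch.compile traces the forward pass; the first call is slow (compilation),
# later calls reuse the compiled graph for faster decoding.
model = torch.compile(model)

inputs = tokenizer("DCPythia-6.9B is", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```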