# Simple GLA

Simple GLA is the gating mechanism used in [Gated RFA](https://arxiv.org/abs/2103.02143), [Mamba2](https://arxiv.org/abs/2405.21060) and [YOCO](https://arxiv.org/abs/2405.05254) (a.k.a. Gated RetNet).

Compared to GLA, the gating is head-wise (one scalar per head) instead of element-wise.
As a result, we can adapt the RetNet kernel for training with matmuls, without numerical instability.
It is faster than GLA but less expressive.
I will use it as a baseline for GLA.

$S_{t+1} = g_{t+1} \odot S_{t} + K_{t+1} V_{t+1}^{\top}$, where $g$ is a scalar, so $\odot$ reduces to ordinary scalar multiplication of the state.
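
The recurrence above can be sketched in plain Python for a single head. This is a minimal, dependency-free illustration, not the kernel in this directory. It rests on several assumptions that the text above does not state: the function names are hypothetical, the initial state is zero, the gates are given directly as scalars (not in log space), and the output is read out as $o_t = S_t^{\top} q_t$. The second function unrolls the recurrence into an attention-style sum with a decay factor, which is roughly the form a matmul-based kernel would batch.

```python
def simple_gla_recurrent(q, k, v, g):
    """Naive single-head reference for S_t = g_t * S_{t-1} + k_t v_t^T.

    q, k: lists of T vectors of length d_k; v: list of T vectors of length d_v;
    g: list of T scalar gates. Returns (outputs, final_state).
    Assumptions (not from the README): zero initial state, readout o_t = S_t^T q_t.
    """
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # assumed zero initial state
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        # A single scalar gate decays the whole state (head-wise gating).
        S = [[g_t * S[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
             for i in range(d_k)]
        # Assumed readout: o_t = S_t^T q_t.
        outputs.append([sum(q_t[i] * S[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs, S


def simple_gla_parallel(q, k, v, g):
    """Same outputs, written as a decayed attention sum:

    o_t = sum_{s <= t} (g_{s+1} * ... * g_t) * (q_t . k_s) * v_s
    """
    d_k, d_v = len(k[0]), len(v[0])
    outputs = []
    for t in range(len(q)):
        o_t = [0.0] * d_v
        decay = 1.0  # product of gates g_{s+1} ... g_t, built while walking s backwards
        for s in range(t, -1, -1):
            score = decay * sum(q[t][i] * k[s][i] for i in range(d_k))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
            decay *= g[s]
        outputs.append(o_t)
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    g = [1.0, 0.5]
    outputs, state = simple_gla_recurrent(q, k, v, g)
    print(outputs)
    print(state)
    print(simple_gla_parallel(q, k, v, g))
```

Because the gate is a scalar, the decay between positions $s$ and $t$ is a single number per pair, so it can be folded into the score matrix before multiplying by $V$. With GLA's element-wise gates the decay differs per key dimension, and that shortcut no longer applies directly.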