zaydzuhri committed on
Commit
a0806ea
1 parent: 979c3a8

Training in progress, step 5000

This view is limited to 50 files because the commit contains too many changes. See the raw diff for the complete change set.
Files changed (50)
  1. README.md +165 -0
  2. config.json +42 -0
  3. configs/deepspeed.yaml +10 -0
  4. configs/ds_config.json +19 -0
  5. configs/gla_16M.json +26 -0
  6. configs/gla_1B.json +26 -0
  7. configs/gla_340M.json +26 -0
  8. configs/gla_7B.json +29 -0
  9. configs/gsa_16M.json +27 -0
  10. configs/scan_16M.json +29 -0
  11. configs/scan_16M_8192.json +29 -0
  12. configs/scan_20M.json +29 -0
  13. configs/scan_340M.json +29 -0
  14. configs/transformer_16M.json +26 -0
  15. configs/transformer_16M_8192.json +26 -0
  16. fla/__init__.py +58 -0
  17. fla/layers/__init__.py +31 -0
  18. fla/layers/abc.py +207 -0
  19. fla/layers/attn.py +182 -0
  20. fla/layers/based.py +105 -0
  21. fla/layers/bitattn.py +183 -0
  22. fla/layers/delta_net.py +267 -0
  23. fla/layers/gla.py +280 -0
  24. fla/layers/gsa.py +233 -0
  25. fla/layers/hgrn.py +153 -0
  26. fla/layers/hgrn2.py +207 -0
  27. fla/layers/linear_attn.py +171 -0
  28. fla/layers/multiscale_retention.py +282 -0
  29. fla/layers/rebased.py +136 -0
  30. fla/layers/rwkv6.py +291 -0
  31. fla/layers/scan.py +237 -0
  32. fla/layers/simple_gla.py +252 -0
  33. fla/models/__init__.py +39 -0
  34. fla/models/abc/__init__.py +13 -0
  35. fla/models/abc/configuration_abc.py +84 -0
  36. fla/models/abc/modeling_abc.py +403 -0
  37. fla/models/bitnet/__init__.py +13 -0
  38. fla/models/bitnet/configuration_bitnet.py +68 -0
  39. fla/models/bitnet/modeling_bitnet.py +428 -0
  40. fla/models/delta_net/__init__.py +13 -0
  41. fla/models/delta_net/configuration_delta_net.py +87 -0
  42. fla/models/delta_net/modeling_delta_net.py +439 -0
  43. fla/models/gla/__init__.py +13 -0
  44. fla/models/gla/configuration_gla.py +90 -0
  45. fla/models/gla/modeling_gla.py +418 -0
  46. fla/models/gsa/__init__.py +13 -0
  47. fla/models/gsa/configuration_gsa.py +94 -0
  48. fla/models/gsa/modeling_gsa.py +442 -0
  49. fla/models/hgrn/__init__.py +13 -0
  50. fla/models/hgrn/configuration_hgrn.py +74 -0
README.md ADDED
@@ -0,0 +1,165 @@
1
+ <div align="center">
2
+
3
+ # 🔥 Flame
4
+
5
+ </div>
6
+
7
+ A minimal framework for training FLA models, whether from scratch or through finetuning.
8
+
9
+ Built on the robust infrastructure of 🤗, `flame` enables you to train large language models with just a few lines of code:
10
+ we use `datasets` for data processing, `transformers` for model definitions, and `accelerate`[^1] for seamless distributed training.
11
+
12
+ In this README, we will guide you through the process of using `flame` to train GLA models.
13
+
14
+ ## Setup
15
+
16
+ To get started, you'll need to install the required packages.
17
+ Both `fla` and `flame` have minimal dependencies.
18
+ Clone the `fla` repository and install the necessary packages as follows:
19
+
20
+ ```bash
21
+ git clone https://github.com/sustcsonglin/flash-linear-attention.git
22
+ cd flash-linear-attention
+ pip install .
23
+ pip install accelerate wandb
24
+ pip install deepspeed
25
+ ```
26
+
27
+ > [!CAUTION]
28
+ > The 🤗 `tokenizers` library has known [memory leak issues](https://github.com/huggingface/tokenizers/issues/1539) when processing very long documents.
29
+ > To address this, please ensure you install `tokenizers>=0.20.4`.
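+
+ As a quick sanity check of the setup above, something like the following should run cleanly; it is only a sketch and assumes the packages were installed into the active environment (`packaging` ships with `transformers`):
+
+ ```python
+ # Optional: verify the environment set up above.
+ from importlib.metadata import version
+ from packaging.version import Version
+
+ import fla  # fails here if flash-linear-attention was not installed correctly
+
+ # tokenizers >= 0.20.4 avoids the memory-leak issue mentioned in the caution note
+ assert Version(version("tokenizers")) >= Version("0.20.4"), "please upgrade tokenizers"
+ print(f"fla {fla.__version__}, tokenizers {version('tokenizers')}")
+ ```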
30
+
31
+ ## Preprocessing
32
+
33
+ Before training, you need to download and pre-tokenize your dataset.
34
+ We provide a straightforward script for this.
35
+ For instance, to tokenize the 10B-token sample (`sample-10BT`) of the `fineweb-edu` dataset, run:
36
+
37
+ ```bash
38
+ python preprocess.py \
39
+ --dataset HuggingFaceFW/fineweb-edu \
40
+ --name sample-10BT \
41
+ --split train \
42
+ --context_length 2048
43
+ ```
44
+ or, for an even smaller example suitable for quick testing:
45
+ ```bash
46
+ python preprocess.py \
47
+ --dataset alturing/gutenberg-texts \
48
+ --split train \
49
+ --context_length 2048
50
+ ```
51
+
52
+ This will cache the processed dataset at `data/HuggingFaceFW/fineweb-edu/sample-10BT/train`.
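+
+ The cached output is a pre-tokenized 🤗 `datasets` dataset. Assuming it is stored in the standard `save_to_disk` layout (and that `preprocess.py` produces an `input_ids` column, which is an assumption here), you can sanity-check it before training:
+
+ ```python
+ from datasets import load_from_disk
+
+ # Path produced by the fineweb-edu example above; adjust to your --dataset/--name.
+ dataset = load_from_disk("data/HuggingFaceFW/fineweb-edu/sample-10BT/train")
+ print(dataset)  # row count and column names
+ print(len(dataset[0]["input_ids"]))  # should match --context_length, i.e. 2048
+ ```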
53
+
54
+ In [the paper](https://proceedings.mlr.press/v235/yang24ab.html), GLA is pretrained on a subset of SlimPajama.
55
+ Given the size of the dataset, the fastest way to download it is using `git lfs` (refer to [this issue](https://huggingface.co/datasets/cerebras/SlimPajama-627B/discussions/2)).
56
+ ```bash
57
+ git lfs install
58
+ git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
59
+ python preprocess.py \
60
+ --dataset SlimPajama-627B \
61
+ --split train \
62
+ --context_length 2048
63
+ ```
64
+
65
+ ## Training from scratch
66
+
67
+ To train a 340M GLA model from scratch, execute the following command:
68
+
69
+ ```bash
70
+ bash train.sh \
71
+ type=gla \
72
+ lr=3e-4 \
73
+ steps=20480 \
74
+ batch=8 \
75
+ update=1 \
76
+ warmup=1024 \
77
+ context=2048 \
78
+ path=exp/gla-340M-10B \
79
+ project=fla \
80
+ model=configs/gla_340M.json \
81
+ data=HuggingFaceFW/fineweb-edu \
82
+ name=sample-10BT \
83
+ cache=data/HuggingFaceFW/fineweb-edu/sample-10BT/train
84
+ ```
85
+ or, to run a quick SCAN test:
86
+ ```bash
87
+ bash train.sh \
88
+ type=scan \
89
+ lr=3e-4 \
90
+ steps=1000 \
91
+ batch=8 \
92
+ update=1 \
93
+ warmup=100 \
94
+ context=2048 \
95
+ path=exp/scan-340M-test \
96
+ project=fla \
97
+ model=configs/scan_340M.json \
98
+ data=alturing/gutenberg-texts \
99
+ name=sample-10BT \
100
+ cache=data/alturing/gutenberg-texts/train
101
+ ```
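+
+ The `model=` flag points at one of the JSON configs under `configs/`. If you want to inspect the architecture a config produces before launching a full run, a sketch along these lines should work; it assumes that importing `fla` registers its model classes with the 🤗 auto classes, as the bundled `fla/models` package does:
+
+ ```python
+ import fla  # registers GLA/GSA/... configs and models with the Auto* APIs
+ from transformers import AutoConfig, AutoModelForCausalLM
+
+ config = AutoConfig.from_pretrained("configs/gla_340M.json")
+ model = AutoModelForCausalLM.from_config(config)
+ n_params = sum(p.numel() for p in model.parameters())
+ print(config.model_type, f"{n_params / 1e6:.1f}M parameters")
+ ```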
102
+
103
+ `flame` also supports resuming interrupted training by specifying the checkpoint path.
104
+ Simply use the following command to resume training:
105
+
106
+ ```bash
107
+ bash train.sh \
108
+ type=gla \
109
+ lr=3e-4 \
110
+ steps=20480 \
111
+ batch=8 \
112
+ update=1 \
113
+ warmup=1024 \
114
+ context=2048 \
115
+ path=exp/gla-340M-10B \
116
+ project=fla \
117
+ model=configs/gla_340M.json \
118
+ data=HuggingFaceFW/fineweb-edu \
119
+ name=sample-10BT \
120
+ cache=data/HuggingFaceFW/fineweb-edu/sample-10BT/train \
121
+ checkpoint=exp/gla-340M-10B/checkpoint-8192
122
+ ```
123
+
124
+ You can also use `wandb` to monitor your training process effectively.
125
+
126
+ ![wandb](https://github.com/user-attachments/assets/05ca031c-1cae-41c9-bfcb-5b6b6d0df729)
127
+
128
+ ## Continual Pretraining
129
+
130
+ `flame` supports continual training from a pretrained checkpoint.
131
+ Below, we provide an example of how to finetune Mistral-7B to GLA.
132
+ You can follow similar steps to reproduce the results in the [GSA paper](https://arxiv.org/abs/2409.07146):
133
+
134
+ 1. Initialize a brand-new GLA-7B model from the config and copy the matched pretrained weights from Mistral-7B (a sketch of this weight transfer is shown after this list):
135
+ ```bash
136
+ cd ../utils
137
+ python convert_from_llama.py \
138
+ --model mistralai/Mistral-7B-v0.1 \
139
+ --config ../training/configs/gla_7B.json \
140
+ --output ../training/converted/gla-7B
141
+ cd -
142
+ ```
143
+
144
+ 2. Directly launch training from the converted checkpoint:
145
+ ```bash
146
+ bash train.sh \
147
+ type=gla \
148
+ lr=3e-5 \
149
+ steps=10240 \
150
+ batch=4 \
151
+ update=8 \
152
+ warmup=512 \
153
+ context=2048 \
154
+ path=exp/gla-7B-20B \
155
+ project=fla \
156
+ model=converted/gla-7B \
157
+ data=SlimPajama-627B \
158
+ cache=data/SlimPajama-627B/train
159
+ ```
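+
+ For reference, the weight transfer in step 1 boils down to copying every tensor whose name and shape match between the two models, while GLA-specific parameters keep their fresh initialization. The snippet below is only an illustrative sketch of that idea, not the actual `convert_from_llama.py` script, and it assumes both models fit in CPU memory:
+
+ ```python
+ import torch
+ from transformers import AutoConfig, AutoModelForCausalLM
+
+ import fla  # makes the GLA model classes visible to the Auto* APIs
+
+ donor = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16)
+ gla = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("configs/gla_7B.json"))
+
+ donor_state = donor.state_dict()
+ copied, kept = 0, 0
+ with torch.no_grad():
+     for name, param in gla.state_dict().items():
+         if name in donor_state and donor_state[name].shape == param.shape:
+             param.copy_(donor_state[name])  # embeddings, attention/MLP projections, norms, ...
+             copied += 1
+         else:
+             kept += 1  # GLA-specific tensors (e.g. gates) stay freshly initialized
+ print(f"copied {copied} tensors, kept {kept} newly initialized")
+ gla.save_pretrained("converted/gla-7B")
+ ```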
160
+
161
+ Please be aware that finetuning on a single node may not be the most efficient approach.
162
+ If available, consider training across multiple GPU nodes for better throughput.
163
+ You can find guidance on how to launch a multi-node job in the [accelerate tutorial](https://github.com/huggingface/accelerate/blob/main/examples/slurm/submit_multinode.sh).
164
+
165
+ [^1]: The `accelerate` library supports various distributed backends, such as `deepspeed` and `megatron`, for large-scale training; we use `deepspeed` here.
config.json ADDED
@@ -0,0 +1,42 @@
1
+ {
2
+ "_name_or_path": "configs/scan_16M_8192.json",
3
+ "architectures": [
4
+ "SCANForCausalLM"
5
+ ],
6
+ "attn": null,
7
+ "attn_mode": "parallel",
8
+ "bos_token_id": 1,
9
+ "clamp_max": null,
10
+ "clamp_min": null,
11
+ "elementwise_affine": true,
12
+ "eos_token_id": 2,
13
+ "expand_k": 1,
14
+ "expand_v": 1,
15
+ "fuse_cross_entropy": true,
16
+ "fuse_norm": true,
17
+ "gate_act": "softmax",
18
+ "gate_logit_normalizer": 8,
19
+ "hidden_act": "swish",
20
+ "hidden_ratio": 4,
21
+ "hidden_size": 256,
22
+ "initializer_range": 0.02,
23
+ "intermediate_size": null,
24
+ "max_position_embeddings": 8192,
25
+ "model_type": "scan",
26
+ "norm_eps": 1e-06,
27
+ "norm_first": true,
28
+ "num_heads": 4,
29
+ "num_hidden_layers": 10,
30
+ "num_kv_heads": null,
31
+ "state_size": 16,
32
+ "tie_word_embeddings": true,
33
+ "torch_dtype": "bfloat16",
34
+ "transformers_version": "4.47.0",
35
+ "use_cache": true,
36
+ "use_gk": true,
37
+ "use_gv": false,
38
+ "use_norm": true,
39
+ "use_output_gate": false,
40
+ "vocab_size": 32000,
41
+ "window_size": 128
42
+ }
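
For context, `config.json` above is the exported model configuration of the checkpoint uploaded in this commit: a `SCANForCausalLM` built from `configs/scan_16M_8192.json` (hidden size 256, 10 layers, 8192-token context). A hedged sketch of loading it, assuming the bundled `fla` package (which defines the `scan` model type) is importable and with the repository id left as a placeholder:

```python
import fla  # assumed to register the `scan` model type with the transformers Auto* classes
from transformers import AutoConfig, AutoModelForCausalLM

repo_id = "<user>/<this-repo>"  # placeholder for the Hub id of this checkpoint
config = AutoConfig.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
print(config.model_type, config.hidden_size, config.num_hidden_layers)  # scan 256 10
```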
configs/deepspeed.yaml ADDED
@@ -0,0 +1,10 @@
1
+ compute_environment: LOCAL_MACHINE
2
+ distributed_type: DEEPSPEED
3
+ deepspeed_config:
4
+ deepspeed_config_file: configs/ds_config.json
5
+ zero3_init_flag: true
6
+ machine_rank: 0
7
+ main_training_function: main
8
+ num_machines: 1
9
+ num_processes: 1
10
+ use_cpu: false
configs/ds_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "train_batch_size": "auto",
3
+ "train_micro_batch_size_per_gpu": "auto",
4
+ "gradient_accumulation_steps": "auto",
5
+ "gradient_clipping": "auto",
6
+ "zero_allow_untested_optimizer": true,
7
+ "bf16": {
8
+ "enabled": true
9
+ },
10
+ "zero_optimization": {
11
+ "stage": 2,
12
+ "allgather_partitions": true,
13
+ "allgather_bucket_size": 5e8,
14
+ "reduce_scatter": true,
15
+ "reduce_bucket_size": 5e8,
16
+ "overlap_comm": false,
17
+ "contiguous_gradients": true
18
+ }
19
+ }
configs/gla_16M.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "attn_mode": "chunk",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 0.5,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "hidden_ratio": 4,
12
+ "hidden_size": 256,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": null,
15
+ "max_position_embeddings": 2048,
16
+ "model_type": "gla",
17
+ "num_heads": 4,
18
+ "num_hidden_layers": 10,
19
+ "norm_eps": 1e-06,
20
+ "tie_word_embeddings": true,
21
+ "transformers_version": "4.38.2",
22
+ "use_cache": true,
23
+ "use_gk": true,
24
+ "use_gv": false,
25
+ "vocab_size": 32000
26
+ }
configs/gla_1B.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "attn_mode": "chunk",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 0.5,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "hidden_ratio": 4,
12
+ "hidden_size": 2048,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": null,
15
+ "max_position_embeddings": 2048,
16
+ "model_type": "gla",
17
+ "num_heads": 4,
18
+ "num_hidden_layers": 24,
19
+ "norm_eps": 1e-06,
20
+ "tie_word_embeddings": false,
21
+ "transformers_version": "4.38.2",
22
+ "use_cache": true,
23
+ "use_gk": true,
24
+ "use_gv": false,
25
+ "vocab_size": 32000
26
+ }
configs/gla_340M.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "attn_mode": "chunk",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 0.5,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "hidden_ratio": 4,
12
+ "hidden_size": 1024,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": null,
15
+ "max_position_embeddings": 2048,
16
+ "model_type": "gla",
17
+ "num_heads": 4,
18
+ "num_hidden_layers": 24,
19
+ "norm_eps": 1e-06,
20
+ "tie_word_embeddings": true,
21
+ "transformers_version": "4.38.2",
22
+ "use_cache": true,
23
+ "use_gk": true,
24
+ "use_gv": false,
25
+ "vocab_size": 32000
26
+ }
configs/gla_7B.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "attn_mode": "chunk",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 1,
7
+ "expand_v": 1,
8
+ "feature_map": "relu",
9
+ "fuse_cross_entropy": true,
10
+ "fuse_norm": true,
11
+ "hidden_act": "swish",
12
+ "hidden_ratio": 4,
13
+ "hidden_size": 4096,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 14336,
16
+ "max_position_embeddings": 32768,
17
+ "model_type": "gla",
18
+ "num_heads": 32,
19
+ "num_kv_heads": 8,
20
+ "num_hidden_layers": 32,
21
+ "norm_eps": 1e-05,
22
+ "tie_word_embeddings": false,
23
+ "transformers_version": "4.40.0",
24
+ "use_cache": true,
25
+ "use_output_gate": false,
26
+ "use_gk": true,
27
+ "use_gv": false,
28
+ "vocab_size": 32000
29
+ }
configs/gsa_16M.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "attn_mode": "chunk",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 1,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "hidden_ratio": 4,
12
+ "hidden_size": 256,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": null,
15
+ "max_position_embeddings": 2048,
16
+ "model_type": "gsa",
17
+ "num_slots": 16,
18
+ "num_heads": 4,
19
+ "num_hidden_layers": 10,
20
+ "norm_eps": 1e-06,
21
+ "tie_word_embeddings": true,
22
+ "transformers_version": "4.38.2",
23
+ "use_cache": true,
24
+ "use_gk": true,
25
+ "use_gv": false,
26
+ "vocab_size": 32000
27
+ }
configs/scan_16M.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "attn_mode": "parallel",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 1,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "gate_act": "softmax",
12
+ "hidden_ratio": 4,
13
+ "hidden_size": 256,
14
+ "window_size": 128,
15
+ "state_size": 16,
16
+ "initializer_range": 0.02,
17
+ "intermediate_size": null,
18
+ "max_position_embeddings": 2048,
19
+ "model_type": "scan",
20
+ "num_heads": 4,
21
+ "num_hidden_layers": 10,
22
+ "norm_eps": 1e-06,
23
+ "tie_word_embeddings": true,
24
+ "transformers_version": "4.38.2",
25
+ "use_cache": true,
26
+ "use_gk": true,
27
+ "use_gv": false,
28
+ "vocab_size": 32000
29
+ }
configs/scan_16M_8192.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "attn_mode": "parallel",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 1,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "gate_act": "softmax",
12
+ "hidden_ratio": 4,
13
+ "hidden_size": 256,
14
+ "window_size": 128,
15
+ "state_size": 16,
16
+ "initializer_range": 0.02,
17
+ "intermediate_size": null,
18
+ "max_position_embeddings": 8192,
19
+ "model_type": "scan",
20
+ "num_heads": 4,
21
+ "num_hidden_layers": 10,
22
+ "norm_eps": 1e-06,
23
+ "tie_word_embeddings": true,
24
+ "transformers_version": "4.38.2",
25
+ "use_cache": true,
26
+ "use_gk": true,
27
+ "use_gv": false,
28
+ "vocab_size": 32000
29
+ }
configs/scan_20M.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "attn_mode": "parallel",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 1,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "gate_act": "softmax",
12
+ "hidden_ratio": 4,
13
+ "hidden_size": 384,
14
+ "window_size": 128,
15
+ "state_size": 16,
16
+ "initializer_range": 0.02,
17
+ "intermediate_size": null,
18
+ "max_position_embeddings": 2048,
19
+ "model_type": "scan",
20
+ "num_heads": 6,
21
+ "num_hidden_layers": 10,
22
+ "norm_eps": 1e-06,
23
+ "tie_word_embeddings": true,
24
+ "transformers_version": "4.38.2",
25
+ "use_cache": true,
26
+ "use_gk": true,
27
+ "use_gv": false,
28
+ "vocab_size": 32000
29
+ }
configs/scan_340M.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "attn_mode": "parallel",
3
+ "bos_token_id": 1,
4
+ "clamp_min": null,
5
+ "eos_token_id": 2,
6
+ "expand_k": 1,
7
+ "expand_v": 1,
8
+ "fuse_cross_entropy": true,
9
+ "fuse_norm": true,
10
+ "hidden_act": "swish",
11
+ "gate_act": "softmax",
12
+ "hidden_ratio": 4,
13
+ "hidden_size": 1024,
14
+ "window_size": 128,
15
+ "state_size": 32,
16
+ "initializer_range": 0.02,
17
+ "intermediate_size": null,
18
+ "max_position_embeddings": 2048,
19
+ "model_type": "scan",
20
+ "num_heads": 4,
21
+ "num_hidden_layers": 24,
22
+ "norm_eps": 1e-06,
23
+ "tie_word_embeddings": true,
24
+ "transformers_version": "4.38.2",
25
+ "use_cache": true,
26
+ "use_gk": true,
27
+ "use_gv": false,
28
+ "vocab_size": 32000
29
+ }
configs/transformer_16M.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "model_type": "transformer",
3
+ "attention_bias": false,
4
+ "bos_token_id": 1,
5
+ "clamp_min": null,
6
+ "eos_token_id": 2,
7
+ "fuse_cross_entropy": true,
8
+ "fuse_norm": true,
9
+ "hidden_act": "swish",
10
+ "hidden_ratio": 4,
11
+ "hidden_size": 256,
12
+ "state_size": 16,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": null,
15
+ "max_position_embeddings": 2048,
16
+ "num_heads": 4,
17
+ "num_kv_heads": 4,
18
+ "num_hidden_layers": 10,
19
+ "norm_eps": 1e-06,
20
+ "tie_word_embeddings": true,
21
+ "transformers_version": "4.38.2",
22
+ "use_cache": true,
23
+ "use_gk": true,
24
+ "use_gv": false,
25
+ "vocab_size": 32000
26
+ }
configs/transformer_16M_8192.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "model_type": "transformer",
3
+ "attention_bias": false,
4
+ "bos_token_id": 1,
5
+ "clamp_min": null,
6
+ "eos_token_id": 2,
7
+ "fuse_cross_entropy": true,
8
+ "fuse_norm": true,
9
+ "hidden_act": "swish",
10
+ "hidden_ratio": 4,
11
+ "hidden_size": 256,
12
+ "state_size": 16,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": null,
15
+ "max_position_embeddings": 8192,
16
+ "num_heads": 4,
17
+ "num_kv_heads": 4,
18
+ "num_hidden_layers": 10,
19
+ "norm_eps": 1e-06,
20
+ "tie_word_embeddings": true,
21
+ "transformers_version": "4.38.2",
22
+ "use_cache": true,
23
+ "use_gk": true,
24
+ "use_gv": false,
25
+ "vocab_size": 32000
26
+ }
fla/__init__.py ADDED
@@ -0,0 +1,58 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from fla.layers import (ABCAttention, Attention, BasedLinearAttention,
4
+ BitAttention, DeltaNet, GatedLinearAttention,
5
+ GatedSlotAttention, HGRN2Attention, HGRNAttention,
6
+ LinearAttention, MultiScaleRetention,
7
+ ReBasedLinearAttention)
8
+ from fla.models import (ABCForCausalLM, ABCModel, BitNetForCausalLM,
9
+ BitNetModel, DeltaNetForCausalLM, DeltaNetModel,
10
+ GLAForCausalLM, GLAModel, GSAForCausalLM, GSAModel,
11
+ HGRN2ForCausalLM, HGRN2Model, HGRNForCausalLM,
12
+ LinearAttentionForCausalLM, LinearAttentionModel,
13
+ RetNetForCausalLM, RetNetModel, RWKV6ForCausalLM,
14
+ RWKV6Model, TransformerForCausalLM, TransformerModel)
15
+
16
+ __all__ = [
17
+ 'ABCAttention',
18
+ 'Attention',
19
+ 'BasedLinearAttention',
20
+ 'BitAttention',
21
+ 'DeltaNet',
22
+ 'HGRNAttention',
23
+ 'HGRN2Attention',
24
+ 'GatedLinearAttention',
25
+ 'GatedSlotAttention',
26
+ 'LinearAttention',
27
+ 'MultiScaleRetention',
28
+ 'ReBasedLinearAttention',
29
+ 'ABCForCausalLM',
30
+ 'ABCModel',
31
+ 'BitNetForCausalLM',
32
+ 'BitNetModel',
33
+ 'DeltaNetForCausalLM',
34
+ 'DeltaNetModel',
35
+ 'HGRNForCausalLM',
36
+ 'HGRNModel',
37
+ 'HGRN2ForCausalLM',
38
+ 'HGRN2Model',
39
+ 'GLAForCausalLM',
40
+ 'GLAModel',
41
+ 'GSAForCausalLM',
42
+ 'GSAModel',
43
+ 'LinearAttentionForCausalLM',
44
+ 'LinearAttentionModel',
45
+ 'RetNetForCausalLM',
46
+ 'RetNetModel',
47
+ 'RWKV6ForCausalLM',
48
+ 'RWKV6Model',
49
+ 'TransformerForCausalLM',
50
+ 'TransformerModel',
51
+ 'chunk_gla',
52
+ 'chunk_retention',
53
+ 'fused_chunk_based',
54
+ 'fused_chunk_gla',
55
+ 'fused_chunk_retention'
56
+ ]
57
+
58
+ __version__ = '0.1'
fla/layers/__init__.py ADDED
@@ -0,0 +1,31 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from .abc import ABCAttention
4
+ from .attn import Attention
5
+ from .based import BasedLinearAttention
6
+ from .bitattn import BitAttention
7
+ from .delta_net import DeltaNet
8
+ from .gla import GatedLinearAttention
9
+ from .gsa import GatedSlotAttention
10
+ from .hgrn import HGRNAttention
11
+ from .hgrn2 import HGRN2Attention
12
+ from .linear_attn import LinearAttention
13
+ from .multiscale_retention import MultiScaleRetention
14
+ from .rebased import ReBasedLinearAttention
15
+ from .rwkv6 import RWKV6Attention
16
+
17
+ __all__ = [
18
+ 'ABCAttention',
19
+ 'Attention',
20
+ 'BasedLinearAttention',
21
+ 'BitAttention',
22
+ 'DeltaNet',
23
+ 'GatedLinearAttention',
24
+ 'GatedSlotAttention',
25
+ 'HGRNAttention',
26
+ 'HGRN2Attention',
27
+ 'LinearAttention',
28
+ 'MultiScaleRetention',
29
+ 'ReBasedLinearAttention',
30
+ 'RWKV6Attention',
31
+ ]
fla/layers/abc.py ADDED
@@ -0,0 +1,207 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ import warnings
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ from einops import rearrange
12
+
13
+ from fla.modules import (FusedRMSNormSwishGate, RMSNorm, RotaryEmbedding,
14
+ ShortConvolution)
15
+ from fla.modules.activations import swiglu, swish
16
+ from fla.ops.abc.chunk import chunk_abc
17
+
18
+ if TYPE_CHECKING:
19
+ from fla.models.utils import Cache
20
+
21
+
22
+ class ABCAttention(nn.Module):
23
+
24
+ def __init__(
25
+ self,
26
+ hidden_size: int = 1024,
27
+ expand_k: float = 0.5,
28
+ expand_v: float = 1.0,
29
+ num_heads: int = 4,
30
+ use_short_conv: bool = False,
31
+ conv_size: int = 4,
32
+ conv_bias: bool = False,
33
+ num_slots: Optional[int] = None,
34
+ elementwise_affine: Optional[bool] = True,
35
+ norm_eps: float = 1e-5,
36
+ gate_low_rank_dim: int = 16,
37
+ gate_logit_normalizer: int = 16,
38
+ use_input_gate: bool = False,
39
+ use_output_gate: bool = True,
40
+ use_norm: bool = True,
+ use_rope: bool = True,
41
+ clamp_min: Optional[float] = -32,
42
+ clamp_max: Optional[float] = 32,
43
+ layer_idx: Optional[int] = None,
44
+ **kwargs
45
+ ) -> ABCAttention:
46
+ super().__init__()
47
+
48
+ self.hidden_size = hidden_size
49
+ self.expand_k = expand_k
50
+ self.expand_v = expand_v
51
+ self.num_heads = num_heads
52
+ self.key_dim = int(self.hidden_size * self.expand_k)
53
+ self.value_dim = int(self.hidden_size * self.expand_v)
54
+ self.head_k_dim = self.key_dim // self.num_heads
55
+ self.head_v_dim = self.value_dim // self.num_heads
56
+
57
+ self.use_short_conv = use_short_conv
58
+ self.conv_size = conv_size
59
+ self.conv_bias = conv_bias
60
+
61
+ self.gate_low_rank_dim = gate_low_rank_dim
62
+ self.gate_logit_normalizer = gate_logit_normalizer
63
+
64
+ self.use_input_gate = use_input_gate
65
+ self.use_output_gate = use_output_gate
66
+ self.use_norm = use_norm
+ # `self.use_rope` is read below and in `forward` but was never stored; the default of True is an assumption
+ self.use_rope = use_rope
67
+
68
+ if num_slots is None:
69
+ num_slots = self.head_k_dim
70
+ self.num_slots = num_slots
71
+
72
+ self.norm_eps = norm_eps
73
+
74
+ self.clamp_min = clamp_min
75
+ self.clamp_max = clamp_max
76
+ self.layer_idx = layer_idx
77
+
78
+ if layer_idx is None:
79
+ warnings.warn(
80
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
81
+ "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
82
+ "when creating this class."
83
+ )
84
+
85
+ self.q_proj = nn.Linear(self.hidden_size, self.key_dim, bias=False)
86
+ self.k_proj = nn.Linear(self.hidden_size, self.key_dim, bias=False)
87
+ self.v_proj = nn.Linear(self.hidden_size, self.value_dim, bias=False)
88
+
89
+ if use_output_gate:
90
+ self.g_proj = nn.Linear(self.hidden_size, self.value_dim, bias=False)
91
+ self.s_proj = nn.Linear(self.hidden_size, self.num_heads * self.num_slots, bias=False)
92
+ self.o_proj = nn.Linear(self.value_dim, self.hidden_size, bias=False)
93
+
94
+ if use_short_conv:
95
+ self.conv_size = conv_size
96
+ self.q_conv1d = ShortConvolution(self.key_dim, conv_size, activation='silu')
97
+ self.k_conv1d = ShortConvolution(self.key_dim, conv_size, activation='silu')
98
+ self.v_conv1d = ShortConvolution(self.value_dim, conv_size, activation='silu')
99
+
100
+ if self.use_norm:
101
+ if self.use_output_gate:
102
+ self.g_norm = FusedRMSNormSwishGate(self.head_v_dim, elementwise_affine, norm_eps)
103
+ else:
104
+ self.g_norm = RMSNorm(hidden_size=self.head_v_dim, elementwise_affine=elementwise_affine, eps=norm_eps)
105
+
106
+ if self.use_rope:
107
+ self.rotary = RotaryEmbedding(self.head_k_dim)
108
+
109
+ self.apply(self._initialize_weights)
110
+
111
+ def _initialize_weights(self, module: nn.Module):
112
+ if getattr(module, "_is_hf_initialized", False):
113
+ return
114
+ if isinstance(module, nn.Linear):
115
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
116
+ if module.bias is not None:
117
+ nn.init.zeros_(module.bias)
118
+ module._is_hf_initialized = True
119
+
120
+ def forward(
121
+ self,
122
+ hidden_states: torch.Tensor,
123
+ attention_mask: Optional[torch.Tensor] = None,
124
+ past_key_values: Optional[Cache] = None,
125
+ use_cache: Optional[bool] = False,
126
+ output_attentions: Optional[bool] = False,
127
+ **kwargs
128
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
129
+ if attention_mask is not None:
130
+ assert len(attention_mask.shape) == 2, (
131
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
132
+ "for padding purposes (0 indicating padding). "
133
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
134
+ )
135
+
136
+ last_state = None
137
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
138
+ last_state = past_key_values[self.layer_idx]
139
+
140
+ if self.use_short_conv:
141
+ conv_state_q, conv_state_k, conv_state_v = None, None, None
142
+ if last_state is not None:
143
+ conv_state_q, conv_state_k, conv_state_v = last_state['conv_state']
144
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
145
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
146
+ mask=conv_mask,
147
+ cache=conv_state_q,
148
+ output_final_state=use_cache)
149
+ k, conv_state_k = self.k_conv1d(x=self.k_proj(hidden_states),
150
+ mask=conv_mask,
151
+ cache=conv_state_k,
152
+ output_final_state=use_cache)
153
+ v, conv_state_v = self.v_conv1d(x=self.v_proj(hidden_states),
154
+ mask=conv_mask,
155
+ cache=conv_state_v,
156
+ output_final_state=use_cache)
157
+ else:
158
+ q = self.q_proj(hidden_states)
159
+ k = self.k_proj(hidden_states)
160
+ v = self.v_proj(hidden_states)
161
+
162
+ if self.use_input_gate:
163
+ q, k, v = map(lambda x: swish(x), (q, k, v))
164
+ # dealing with left-padding
165
+ if attention_mask is not None:
166
+ v = v.mul_(attention_mask[:, -v.shape[-2]:, None])
167
+
168
+ q, k, v = map(lambda x: rearrange(x, '... (h d) -> ... h d', h=self.num_heads), (q, k, v))
169
+ if self.use_rope:
170
+ seqlen_offset = 0
171
+ if past_key_values is not None:
172
+ seqlen_offset = past_key_values.get_seq_length(self.layer_idx)
173
+ q, k = self.rotary(q, k, seqlen_offset)
174
+
175
+ s = rearrange(self.s_proj(hidden_states), '... (h m) -> ... h m', h=self.num_heads)
176
+ s = s.clamp_(self.clamp_min, self.clamp_max)
177
+
178
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
179
+ o, recurrent_state = chunk_abc(
180
+ q=q,
181
+ k=k,
182
+ v=v,
183
+ s=s,
184
+ initial_state=recurrent_state,
185
+ output_final_state=use_cache,
186
+ head_first=False
187
+ )
188
+ if past_key_values is not None:
189
+ past_key_values.update(
190
+ recurrent_state=recurrent_state,
191
+ conv_state=(conv_state_q, conv_state_k, conv_state_v) if self.use_short_conv else None,
192
+ layer_idx=self.layer_idx,
193
+ offset=q.shape[2]
194
+ )
195
+
196
+ if self.use_norm and not self.use_output_gate:
197
+ o = self.g_norm(o)
198
+ elif self.use_output_gate:
199
+ g = rearrange(self.g_proj(hidden_states), '... (h d) -> ... h d', h=self.num_heads)
200
+ o = self.g_norm(o, g) if self.use_norm else swiglu(g, o)
201
+ o = rearrange(o, '... h d -> ... (h d)')
202
+ o = self.o_proj(o)
203
+
204
+ return o, None, past_key_values
205
+
206
+ def state_size(self, seq_len: int = 2048):
207
+ return self.num_heads * self.key_dim * self.head_v_dim
fla/layers/attn.py ADDED
@@ -0,0 +1,182 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ import warnings
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+ import torch.utils.checkpoint
13
+ from einops import rearrange
14
+ from transformers.utils import logging
15
+
16
+ from fla.modules import RMSNorm, RotaryEmbedding
17
+
18
+ if TYPE_CHECKING:
19
+ from fla.models.utils import Cache
20
+
21
+ try:
22
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
23
+ from flash_attn.bert_padding import (index_first_axis, pad_input,
24
+ unpad_input)
25
+ except ImportError:
26
+ warnings.warn(
27
+ "Flash Attention is not installed. Please install it via `pip install flash-attn --no-build-isolation`",
28
+ category=ImportWarning
29
+ )
30
+ flash_attn_func = None
31
+
32
+ logger = logging.get_logger(__name__)
33
+
34
+
35
+ class Attention(nn.Module):
36
+
37
+ def __init__(
38
+ self,
39
+ hidden_size: int = 2048,
40
+ num_heads: int = 32,
41
+ num_kv_heads: Optional[int] = None,
42
+ window_size: Optional[int] = None,
43
+ rope_theta: Optional[float] = 10000.,
44
+ max_position_embeddings: Optional[int] = None,
45
+ norm_first: bool = False,
46
+ norm_eps: float = 1e-5,
47
+ layer_idx: int = None
48
+ ):
49
+ super().__init__()
50
+
51
+ self.num_heads = num_heads
52
+ if num_kv_heads is None:
53
+ self.num_kv_heads = self.num_heads
54
+ else:
55
+ self.num_kv_heads = num_kv_heads
56
+ self.num_kv_groups = num_heads // self.num_kv_heads
57
+ self.hidden_size = hidden_size
58
+ self.head_dim = self.hidden_size // self.num_heads
59
+ self.kv_dim = self.num_kv_heads * self.head_dim
60
+ self.kv_dim = self.num_kv_heads * self.head_dim
61
+ self.window_size = window_size
62
+ self.rope_theta = rope_theta
63
+ self.max_position_embeddings = max_position_embeddings
64
+ self.norm_first = norm_first
65
+ self.layer_idx = layer_idx
66
+
67
+ if norm_first:
68
+ self.norm = RMSNorm(self.hidden_size, eps=norm_eps)
69
+ self.q_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
70
+ self.k_proj = nn.Linear(self.hidden_size, self.kv_dim, bias=False)
71
+ self.v_proj = nn.Linear(self.hidden_size, self.kv_dim, bias=False)
72
+ self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
73
+
74
+ self.rotary = RotaryEmbedding(dim=self.head_dim, base=self.rope_theta)
75
+
76
+ def forward(
77
+ self,
78
+ hidden_states: torch.Tensor,
79
+ attention_mask: Optional[torch.LongTensor] = None,
80
+ past_key_values: Optional[Cache] = None,
81
+ output_attentions: bool = False,
82
+ use_cache: bool = False,
83
+ **kwargs,
84
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
85
+ if attention_mask is not None:
86
+ assert len(attention_mask.shape) == 2, (
87
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
88
+ "for padding purposes (0 indicating padding). "
89
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
90
+ )
91
+
92
+ batch_size, q_len, _ = hidden_states.size()
93
+
94
+ if self.norm_first:
95
+ hidden_states = self.norm(hidden_states)
96
+
97
+ q = rearrange(self.q_proj(hidden_states), '... (h d) -> ... h d', h=self.num_heads)
98
+ k = rearrange(self.k_proj(hidden_states), '... (h d) -> ... h d', h=self.num_kv_heads)
99
+ v = rearrange(self.v_proj(hidden_states), '... (h d) -> ... h d', h=self.num_kv_heads)
100
+
101
+ seqlen_offset, max_seqlen = 0, q_len
102
+ if past_key_values is not None:
103
+ seqlen_offset = past_key_values.get_seq_length(self.layer_idx)
104
+ max_seqlen = q.shape[1] + seqlen_offset
105
+
106
+ if attention_mask is not None:
107
+ # adjust the offsets to discount padding tokens (assumes left padding)
108
+ seqlen_offset = (seqlen_offset + attention_mask.sum(-1) - attention_mask.shape[-1]).clamp(min=0)
109
+ max_seqlen = q.shape[1] + max(seqlen_offset)
110
+
111
+ if self.max_position_embeddings is not None:
112
+ max_seqlen = max(max_seqlen, self.max_position_embeddings)
113
+ q, k = self.rotary(q, k, seqlen_offset, max_seqlen)
114
+
115
+ if past_key_values is not None:
116
+ k, v = past_key_values.update(
117
+ attn_state=(k.flatten(-2, -1), v.flatten(-2, -1)),
118
+ layer_idx=self.layer_idx,
119
+ offset=q_len,
120
+ cache_kwargs=dict(window_size=self.window_size)
121
+ )['attn_state']
122
+ k = rearrange(k, '... (h d) -> ... h d', h=self.num_kv_heads)
123
+ v = rearrange(v, '... (h d) -> ... h d', h=self.num_kv_heads)
124
+
125
+ if flash_attn_func is None:
126
+ raise ImportError("Please install Flash Attention via `pip install flash-attn --no-build-isolation` first")
127
+
128
+ # Contains at least one padding token in the sequence
129
+ if attention_mask is not None:
130
+ q, k, v, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(q, k, v, attention_mask, q_len)
131
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
132
+ max_seqlen_q, max_seqlen_k = max_seq_lens
133
+ o = flash_attn_varlen_func(
134
+ q, k, v,
135
+ cu_seqlens_q=cu_seqlens_q,
136
+ cu_seqlens_k=cu_seqlens_k,
137
+ max_seqlen_q=max_seqlen_q,
138
+ max_seqlen_k=max_seqlen_k,
139
+ causal=True,
140
+ window_size=(-1, -1) if self.window_size is None else (self.window_size-1, 0)
141
+ )
142
+ o = pad_input(o, indices_q, batch_size, q_len)
143
+ else:
144
+ o = flash_attn_func(
145
+ q, k, v,
146
+ causal=True,
147
+ window_size=(-1, -1) if self.window_size is None else (self.window_size-1, 0)
148
+ )
149
+ o = o.reshape(batch_size, q_len, self.hidden_size)
150
+ o = self.o_proj(o)
151
+
152
+ if not output_attentions:
153
+ attentions = None
154
+
155
+ return o, attentions, past_key_values
156
+
157
+ def _upad_input(self, q, k, v, attention_mask, q_len):
158
+ seqlens = attention_mask.sum(-1, dtype=torch.int32)
159
+ indices_k = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
160
+ max_seqlen_k = seqlens.max().item()
161
+ cu_seqlens_k = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
162
+ batch_size, seq_len, num_key_value_heads, head_dim = k.shape
163
+
164
+ k = index_first_axis(k.reshape(batch_size * seq_len, num_key_value_heads, head_dim), indices_k)
165
+ v = index_first_axis(v.reshape(batch_size * seq_len, num_key_value_heads, head_dim), indices_k)
166
+ if q_len == seq_len:
167
+ q = index_first_axis(q.reshape(batch_size * seq_len, self.num_heads, head_dim), indices_k)
168
+ cu_seqlens_q = cu_seqlens_k
169
+ max_seqlen_q = max_seqlen_k
170
+ indices_q = indices_k
171
+ elif q_len == 1:
172
+ max_seqlen_q = 1
173
+ # There is a memcpy here, that is very bad.
174
+ cu_seqlens_q = torch.arange(batch_size + 1, dtype=torch.int32, device=q.device)
175
+ indices_q = cu_seqlens_q[:-1]
176
+ q = q.squeeze(1)
177
+ else:
178
+ # The -q_len: slice assumes left padding.
179
+ attention_mask = attention_mask[:, -q_len:]
180
+ q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, attention_mask)
181
+
182
+ return q, k, v, indices_q, (cu_seqlens_q, cu_seqlens_k), (max_seqlen_q, max_seqlen_k)
fla/layers/based.py ADDED
@@ -0,0 +1,105 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ """
5
+ Linear attention in Based.
6
+ https://github.com/HazyResearch/zoology/blob/main/zoology/mixers/based.py
7
+ """
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ from einops import rearrange
12
+
13
+ from fla.modules.feature_map import TaylorFeatureMap
14
+ from fla.ops.based import parallel_based
15
+ from fla.ops.linear_attn import chunk_linear_attn, fused_chunk_linear_attn
16
+
17
+
18
+ class BasedLinearAttention(nn.Module):
19
+
20
+ def __init__(
21
+ self,
22
+ hidden_size: int,
23
+ feature_dim: int = 16,
24
+ num_key_value_heads: int = 12,
25
+ num_heads: int = 12,
26
+ feature_name: str = "taylor_exp",
27
+ eps: float = 1e-12,
28
+ causal: bool = True,
29
+ mode: str = "parallel",
30
+ ):
31
+ super().__init__()
32
+
33
+ self.hidden_size = hidden_size
34
+ self.mode = mode
35
+ self.feature_name = feature_name
36
+ self.feature_dim = feature_dim
37
+ self.num_key_value_heads = num_key_value_heads
38
+ self.num_heads = num_heads
39
+ self.head_dim = self.hidden_size // self.num_key_value_heads
40
+ self.causal = causal
41
+
42
+ self.q_proj = nn.Linear(self.hidden_size, self.feature_dim * self.num_heads, bias=False)
43
+ self.k_proj = nn.Linear(self.hidden_size, self.feature_dim * self.num_heads, bias=False)
44
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
45
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
46
+ self.dropout = nn.Identity()
47
+ self.feature_map = TaylorFeatureMap(feature_dim)
48
+ self.eps = eps
49
+
50
+ self.apply(self._initialize_weights)
51
+
52
+ def _initialize_weights(self, module: nn.Module):
53
+ if getattr(module, "_is_hf_initialized", False):
54
+ return
55
+ if isinstance(module, nn.Linear):
56
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
57
+ if module.bias is not None:
58
+ nn.init.zeros_(module.bias)
59
+ module._is_hf_initialized = True
60
+
61
+ def forward(self, hidden_states: torch.Tensor, **kwargs):
62
+ mode = self.mode
63
+ q, k, v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
64
+ q, k, v = map(lambda x: rearrange(x, "... (h d) -> ... h d", h=self.num_heads), [q, k, v])
65
+ if mode == "fused_chunk":
66
+ q, k = self.feature_map(q), self.feature_map(k)
67
+ o = fused_chunk_linear_attn(q, k, v, normalize=True, scale=1, head_first=False)
68
+ elif mode == 'chunk':
69
+ q, k = self.feature_map(q), self.feature_map(k)
70
+ o = chunk_linear_attn(q, k, v, normalize=True, scale=1, head_first=False)
71
+ elif mode == 'parallel':
72
+ assert q.shape[-1] <= 128
73
+ o = parallel_based(q, k, v, True, True, head_first=False)
74
+ o = self.o_proj(o)
75
+ o = self.dropout(o)
76
+ return o
77
+
78
+ # https://github.com/HazyResearch/zoology/blob/main/zoology/mixers/based.py#L119
79
+
80
+ def forward_reference(self, hidden_states: torch.Tensor, filters: torch.Tensor = None, *args, **kwargs):
81
+ """
82
+ x (torch.Tensor): tensor of shape (b, d, t)
83
+ y (torch.Tensor): tensor of shape (b, d, t)
84
+ """
85
+ # hidden_states = hidden_states.transpose(1, 2)
86
+ b, t, _ = hidden_states.size()
87
+ q, k, v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
88
+
89
+ q = q.view(b, t, self.num_heads, self.feature_dim).transpose(1, 2)
90
+ k = k.view(b, t, self.num_key_value_heads, self.feature_dim).transpose(1, 2)
91
+ v = v.view(b, t, self.num_key_value_heads, self.head_dim).transpose(1, 2)
92
+
93
+ # Linear attention
94
+ q, k = self.feature_map(q), self.feature_map(k)
95
+ q, k, v = q.unsqueeze(-2), k.unsqueeze(-2), v.unsqueeze(-1)
96
+
97
+ # Compute attention
98
+ if self.causal:
99
+ y = ((q * (k * v).cumsum(2)).sum(-1) / ((q * k.cumsum(2)).sum(-1) + self.eps))
100
+ else:
101
+ y = ((q * (k * v).sum(2, True)).sum(-1) / ((q * k.sum(2, True)).sum(-1) + self.eps))
102
+ y = rearrange(y, 'b h t d -> b t (h d)')
103
+ y = self.o_proj(y.to(hidden_states.dtype))
104
+ y = self.dropout(y)
105
+ return y.to(hidden_states.dtype)
fla/layers/bitattn.py ADDED
@@ -0,0 +1,183 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ import warnings
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+ import torch.utils.checkpoint
13
+ from einops import rearrange
14
+ from transformers.utils import logging
15
+
16
+ from fla.modules import RMSNorm, RotaryEmbedding
17
+ from fla.modules.fused_bitlinear import FusedBitLinear
18
+
19
+ if TYPE_CHECKING:
20
+ from fla.models.utils import Cache
21
+
22
+ try:
23
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
24
+ from flash_attn.bert_padding import (index_first_axis, pad_input,
25
+ unpad_input)
26
+ except ImportError:
27
+ warnings.warn(
28
+ "Flash Attention is not installed. Please install it via `pip install flash-attn --no-build-isolation`",
29
+ category=ImportWarning
30
+ )
31
+ flash_attn_func = None
32
+
33
+ logger = logging.get_logger(__name__)
34
+
35
+
36
+ class BitAttention(nn.Module):
37
+
38
+ def __init__(
39
+ self,
40
+ hidden_size: int = 2048,
41
+ num_heads: int = 32,
42
+ num_kv_heads: Optional[int] = None,
43
+ window_size: Optional[int] = None,
44
+ rope_theta: Optional[float] = 10000.,
45
+ max_position_embeddings: Optional[int] = None,
46
+ norm_first: bool = False,
47
+ norm_eps: float = 1e-5,
48
+ layer_idx: int = None
49
+ ):
50
+ super().__init__()
51
+
52
+ self.num_heads = num_heads
53
+ if num_kv_heads is None:
54
+ self.num_kv_heads = self.num_heads
55
+ else:
56
+ self.num_kv_heads = num_kv_heads
57
+ self.num_kv_groups = num_heads // self.num_kv_heads
58
+ self.hidden_size = hidden_size
59
+ self.head_dim = self.hidden_size // self.num_heads
60
+ self.kv_dim = self.num_kv_heads * self.head_dim
61
+ self.kv_dim = self.num_kv_heads * self.head_dim
62
+ self.window_size = window_size
63
+ self.rope_theta = rope_theta
64
+ self.max_position_embeddings = max_position_embeddings
65
+ self.norm_first = norm_first
66
+ self.layer_idx = layer_idx
67
+
68
+ if norm_first:
69
+ self.norm = RMSNorm(self.hidden_size, eps=norm_eps)
70
+ self.q_proj = FusedBitLinear(self.hidden_size, self.hidden_size, bias=False)
71
+ self.k_proj = FusedBitLinear(self.hidden_size, self.kv_dim, bias=False)
72
+ self.v_proj = FusedBitLinear(self.hidden_size, self.kv_dim, bias=False)
73
+ self.o_proj = FusedBitLinear(self.hidden_size, self.hidden_size, bias=False)
74
+
75
+ self.rotary = RotaryEmbedding(dim=self.head_dim, base=self.rope_theta)
76
+
77
+ def forward(
78
+ self,
79
+ hidden_states: torch.Tensor,
80
+ attention_mask: Optional[torch.LongTensor] = None,
81
+ past_key_values: Optional[Cache] = None,
82
+ output_attentions: bool = False,
83
+ use_cache: bool = False,
84
+ **kwargs,
85
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
86
+ if attention_mask is not None:
87
+ assert len(attention_mask.shape) == 2, (
88
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
89
+ "for padding purposes (0 indicating padding). "
90
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
91
+ )
92
+
93
+ batch_size, q_len, _ = hidden_states.size()
94
+
95
+ if self.norm_first:
96
+ hidden_states = self.norm(hidden_states)
97
+
98
+ q = rearrange(self.q_proj(hidden_states), '... (h d) -> ... h d', h=self.num_heads)
99
+ k = rearrange(self.k_proj(hidden_states), '... (h d) -> ... h d', h=self.num_kv_heads)
100
+ v = rearrange(self.v_proj(hidden_states), '... (h d) -> ... h d', h=self.num_kv_heads)
101
+
102
+ seqlen_offset, max_seqlen = 0, q_len
103
+ if past_key_values is not None:
104
+ seqlen_offset = past_key_values.get_seq_length(self.layer_idx)
105
+ max_seqlen = q.shape[1] + seqlen_offset
106
+
107
+ if attention_mask is not None:
108
+ # adjust the offsets to discount padding tokens (assumes left padding)
109
+ seqlen_offset = (seqlen_offset + attention_mask.sum(-1) - attention_mask.shape[-1]).clamp(min=0)
110
+ max_seqlen = q.shape[1] + max(seqlen_offset)
111
+
112
+ if self.max_position_embeddings is not None:
113
+ max_seqlen = max(max_seqlen, self.max_position_embeddings)
114
+ q, k = self.rotary(q, k, seqlen_offset, max_seqlen)
115
+
116
+ if past_key_values is not None:
117
+ k, v = past_key_values.update(
118
+ attn_state=(k.flatten(-2, -1), v.flatten(-2, -1)),
119
+ layer_idx=self.layer_idx,
120
+ offset=q_len,
121
+ cache_kwargs=dict(window_size=self.window_size)
122
+ )['attn_state']
123
+ k = rearrange(k, '... (h d) -> ... h d', h=self.num_kv_heads)
124
+ v = rearrange(v, '... (h d) -> ... h d', h=self.num_kv_heads)
125
+
126
+ if flash_attn_func is None:
127
+ raise ImportError("Please install Flash Attention via `pip install flash-attn --no-build-isolation` first")
128
+
129
+ # Contains at least one padding token in the sequence
130
+ if attention_mask is not None:
131
+ q, k, v, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(q, k, v, attention_mask, q_len)
132
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
133
+ max_seqlen_q, max_seqlen_k = max_seq_lens
134
+ o = flash_attn_varlen_func(
135
+ q, k, v,
136
+ cu_seqlens_q=cu_seqlens_q,
137
+ cu_seqlens_k=cu_seqlens_k,
138
+ max_seqlen_q=max_seqlen_q,
139
+ max_seqlen_k=max_seqlen_k,
140
+ causal=True,
141
+ window_size=(-1, -1) if self.window_size is None else (self.window_size-1, 0)
142
+ )
143
+ o = pad_input(o, indices_q, batch_size, q_len)
144
+ else:
145
+ o = flash_attn_func(
146
+ q, k, v,
147
+ causal=True,
148
+ window_size=(-1, -1) if self.window_size is None else (self.window_size-1, 0)
149
+ )
150
+ o = o.reshape(batch_size, q_len, self.hidden_size)
151
+ o = self.o_proj(o)
152
+
153
+ if not output_attentions:
154
+ attentions = None
155
+
156
+ return o, attentions, past_key_values
157
+
158
+ def _upad_input(self, q, k, v, attention_mask, q_len):
159
+ seqlens = attention_mask.sum(-1, dtype=torch.int32)
160
+ indices_k = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
161
+ max_seqlen_k = seqlens.max().item()
162
+ cu_seqlens_k = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
163
+ batch_size, seq_len, num_key_value_heads, head_dim = k.shape
164
+
165
+ k = index_first_axis(k.reshape(batch_size * seq_len, num_key_value_heads, head_dim), indices_k)
166
+ v = index_first_axis(v.reshape(batch_size * seq_len, num_key_value_heads, head_dim), indices_k)
167
+ if q_len == seq_len:
168
+ q = index_first_axis(q.reshape(batch_size * seq_len, self.num_heads, head_dim), indices_k)
169
+ cu_seqlens_q = cu_seqlens_k
170
+ max_seqlen_q = max_seqlen_k
171
+ indices_q = indices_k
172
+ elif q_len == 1:
173
+ max_seqlen_q = 1
174
+ # There is a memcpy here, that is very bad.
175
+ cu_seqlens_q = torch.arange(batch_size + 1, dtype=torch.int32, device=q.device)
176
+ indices_q = cu_seqlens_q[:-1]
177
+ q = q.squeeze(1)
178
+ else:
179
+ # The -q_len: slice assumes left padding.
180
+ attention_mask = attention_mask[:, -q_len:]
181
+ q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, attention_mask)
182
+
183
+ return q, k, v, indices_q, (cu_seqlens_q, cu_seqlens_k), (max_seqlen_q, max_seqlen_k)
fla/layers/delta_net.py ADDED
@@ -0,0 +1,267 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ # Sect4.2 of Linear Transformers Are Secretly Fast Weight Programmers https://arxiv.org/abs/2102.11174
5
+ from __future__ import annotations
6
+
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ from einops import rearrange
12
+ from torch.nn import functional as F
13
+
14
+ from fla.modules import FusedRMSNormSwishGate, RMSNorm, ShortConvolution
15
+ from fla.modules.l2norm import l2_norm
16
+ from fla.ops.delta_rule import (chunk_delta_rule, fused_chunk_delta_rule,
17
+ fused_recurrent_delta_rule)
18
+
19
+ if TYPE_CHECKING:
20
+ from fla.models.utils import Cache
21
+
22
+
23
+ def elu_p1(x):
24
+ return (F.elu(x, 1., False) + 1.).to(x)
25
+
26
+
27
+ def sum_norm(x):
28
+ return (x / x.sum(-1, keepdim=True)).to(x)
29
+
30
+ # https://github.com/IDSIA/recurrent-fwp/blob/master/algorithmic/layers.py#L86C1-L146C1
31
+
32
+
33
+ class DeltaNet(nn.Module):
34
+ def __init__(
35
+ self,
36
+ d_model: int = None,
37
+ hidden_size: int = 1024,
38
+ expand_k: float = 1.0,
39
+ expand_v: float = 1.0,
40
+ num_heads: int = 4,
41
+ mode: str = 'chunk',
42
+ use_beta: bool = True,
43
+ use_gate: bool = False,
44
+ use_output_norm: bool = True,
45
+ use_elu: bool = False,
46
+ use_short_conv: bool = True,
47
+ conv_size: int = 4,
48
+ conv_bias: bool = False,
49
+ layer_idx: int = None,
50
+ qk_activation: str = 'silu',
51
+ qk_norm: str = 'l2',
52
+ norm_first: bool = False,
53
+ norm_eps: float = 1e-5,
54
+ **kwargs
55
+ ) -> DeltaNet:
56
+ super().__init__()
57
+
58
+ self.mode = mode
59
+ self.qk_activation = qk_activation
60
+ self.qk_norm = qk_norm
61
+
62
+ assert self.qk_activation in ['silu', 'relu', 'elu', 'identity']
63
+ assert self.qk_norm in ['l2', 'sum']
64
+
65
+ if d_model is not None:
66
+ hidden_size = d_model
67
+ self.hidden_size = hidden_size
68
+ self.expand_k = expand_k
69
+ self.expand_v = expand_v
70
+ self.num_heads = num_heads
71
+ self.use_gate = use_gate
72
+ self.use_output_norm = use_output_norm
73
+ self.use_short_conv = use_short_conv
74
+ self.conv_size = conv_size
75
+ self.conv_bias = conv_bias
76
+
77
+ self.key_dim = int(hidden_size * expand_k)
78
+ self.value_dim = int(hidden_size * expand_v)
79
+ self.head_qk_dim = self.key_dim // num_heads
80
+ self.head_v_dim = self.value_dim // num_heads
81
+ self.norm_first = norm_first
82
+ self.layer_idx = layer_idx
83
+
84
+ self.silu = nn.SiLU()
85
+
86
+ assert mode in ['chunk', 'fused_chunk', 'fused_recurrent'], f"Not supported mode `{mode}`."
87
+ assert self.key_dim % num_heads == 0, f"key dim must be divisible by num_heads of {num_heads}"
88
+ assert self.value_dim % num_heads == 0, f"value dim must be divisible by num_heads of {num_heads}"
89
+
90
+ if norm_first:
91
+ self.norm = RMSNorm(self.hidden_size, eps=norm_eps)
92
+
93
+ self.q_proj = nn.Linear(hidden_size, self.key_dim, bias=False)
94
+ self.k_proj = nn.Linear(hidden_size, self.key_dim, bias=False)
95
+ self.v_proj = nn.Linear(hidden_size, self.value_dim, bias=False)
96
+
97
+ self.use_beta = use_beta
98
+ self.use_elu = use_elu
99
+ if self.use_beta:
100
+ self.b_proj = nn.Linear(hidden_size, self.num_heads, bias=False)
101
+ if use_short_conv:
102
+ self.conv_size = conv_size
103
+ self.q_conv1d = ShortConvolution(
104
+ hidden_size=self.key_dim,
105
+ kernel_size=conv_size,
106
+ activation='silu' if qk_activation == 'silu' else None
107
+ )
108
+ self.k_conv1d = ShortConvolution(
109
+ hidden_size=self.key_dim,
110
+ kernel_size=conv_size,
111
+ activation='silu' if qk_activation == 'silu' else None
112
+ )
113
+ self.v_conv1d = ShortConvolution(
114
+ hidden_size=self.value_dim,
115
+ kernel_size=conv_size,
116
+ activation='silu'
117
+ )
118
+ else:
119
+ raise UserWarning(
120
+ "ShortConvolution is crucial to the performance. "
121
+ "Do not turn it off, i.e., setting `use_short_conv=False` unless you know what you are doing."
122
+ )
123
+ if use_gate:
124
+ self.g_proj = nn.Linear(hidden_size, self.value_dim, bias=False)
125
+ self.o_norm = FusedRMSNormSwishGate(self.head_v_dim, eps=norm_eps)
126
+ else:
127
+ self.o_norm = RMSNorm(self.head_v_dim, eps=norm_eps)
128
+
129
+ self.o_proj = nn.Linear(self.value_dim, hidden_size, bias=False)
130
+
131
+ self.apply(self._initialize_weights)
132
+
133
+ def _initialize_weights(self, module: nn.Module):
134
+ if getattr(module, "_is_hf_initialized", False):
135
+ return
136
+ if isinstance(module, nn.Linear):
137
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
138
+ if module.bias is not None:
139
+ nn.init.zeros_(module.bias)
140
+ module._is_hf_initialized = True
141
+
142
+ def forward(
143
+ self,
144
+ hidden_states: torch.Tensor,
145
+ attention_mask: Optional[torch.Tensor] = None,
146
+ past_key_values: Optional[Cache] = None,
147
+ use_cache: Optional[bool] = False,
148
+ output_attentions: Optional[bool] = False,
149
+ **kwargs
150
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
151
+ if attention_mask is not None:
152
+ assert len(attention_mask.shape) == 2, (
153
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
154
+ "for padding purposes (0 indicating padding). "
155
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
156
+ )
157
+
158
+ # change to inference mode.
159
+ mode = 'fused_recurrent' if hidden_states.shape[1] < 64 else self.mode
160
+
161
+ if self.norm_first:
162
+ hidden_states = self.norm(hidden_states)
163
+
164
+ last_state = None
165
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
166
+ last_state = past_key_values[self.layer_idx]
167
+
168
+ if self.use_short_conv:
169
+ conv_state_q, conv_state_k, conv_state_v = None, None, None
170
+ if last_state is not None:
171
+ conv_state_q, conv_state_k, conv_state_v = last_state['conv_state']
172
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
173
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
174
+ mask=conv_mask,
175
+ cache=conv_state_q,
176
+ output_final_state=use_cache)
177
+ k, conv_state_k = self.k_conv1d(x=self.k_proj(hidden_states),
178
+ mask=conv_mask,
179
+ cache=conv_state_k,
180
+ output_final_state=use_cache)
181
+ v, conv_state_v = self.v_conv1d(x=self.v_proj(hidden_states),
182
+ mask=conv_mask,
183
+ cache=conv_state_v,
184
+ output_final_state=use_cache)
185
+ else:
186
+ q = self.q_proj(hidden_states)
187
+ k = self.k_proj(hidden_states)
188
+ v = self.silu(self.v_proj(hidden_states))
189
+
190
+ q, k, v = map(lambda x: rearrange(x, 'b t (h d) -> b t h d', h=self.num_heads), (q, k, v))
191
+ if self.qk_activation != 'silu':
192
+ if self.qk_activation == 'relu':
193
+ q, k = q.relu(), k.relu()
194
+ elif self.qk_activation == 'elu':
195
+ q, k = elu_p1(q), elu_p1(k)
196
+ elif self.qk_activation == 'identity':
197
+ pass
198
+ else:
199
+ raise NotImplementedError
200
+
201
+ if self.qk_norm is not None:
202
+ if self.qk_norm == 'l2':
203
+ q = l2_norm(q)
204
+ k = l2_norm(k)
205
+ elif self.qk_norm == 'sum':
206
+ q = sum_norm(q).to(q)
207
+ k = sum_norm(k).to(k)
208
+
209
+ if self.use_beta:
210
+ beta = self.b_proj(hidden_states).sigmoid()
211
+ else:
212
+ beta = q.new_ones(q.shape[0], q.shape[1], q.shape[2])
213
+
214
+ # dealing with padding
215
+ if attention_mask is not None:
216
+ beta = beta.mul(attention_mask[:, -beta.shape[-2]:, None])
217
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
218
+ if mode == 'fused_recurrent':
219
+ o, recurrent_state = fused_recurrent_delta_rule(
220
+ q=q,
221
+ k=k,
222
+ v=v,
223
+ beta=beta,
224
+ initial_state=recurrent_state,
225
+ output_final_state=use_cache,
226
+ head_first=False
227
+ )
228
+ elif mode == 'fused_chunk':
229
+ o, recurrent_state = fused_chunk_delta_rule(
230
+ q=q,
231
+ k=k,
232
+ v=v,
233
+ beta=beta,
234
+ initial_state=recurrent_state,
235
+ output_final_state=use_cache,
236
+ head_first=False
237
+ )
238
+ elif mode == 'chunk':
239
+ o, recurrent_state = chunk_delta_rule(
240
+ q=q,
241
+ k=k,
242
+ v=v,
243
+ beta=beta,
244
+ initial_state=recurrent_state,
245
+ output_final_state=use_cache,
246
+ head_first=False
247
+ )
248
+ else:
249
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
250
+
251
+ if past_key_values is not None:
252
+ past_key_values.update(
253
+ recurrent_state=recurrent_state,
254
+ conv_state=(conv_state_q, conv_state_k, conv_state_v) if self.use_short_conv else None,
255
+ layer_idx=self.layer_idx,
256
+ offset=q.shape[2]
257
+ )
258
+
259
+ if self.use_gate:
260
+ g = rearrange(self.g_proj(hidden_states), 'b t (h d) -> b t h d', h=self.num_heads)
261
+ o = self.o_norm(o, g)
262
+ else:
263
+ o = self.o_norm(o)
264
+ o = rearrange(o, 'b t h d -> b t (h d)')
265
+ o = self.o_proj(o)
266
+
267
+ return o, None, past_key_values
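
A minimal usage sketch for the delta-rule layer above. The class name and constructor arguments (`DeltaNet` in `fla.layers.delta_net` with `hidden_size`/`num_heads`/`layer_idx`) are assumptions, since the top of this file lies outside the hunk shown here; the Triton kernels are assumed to need a CUDA device.

```python
# Hedged sketch: `DeltaNet` and its constructor signature are inferred, not shown in this hunk.
import torch
from fla.layers.delta_net import DeltaNet  # assumed export

layer = DeltaNet(hidden_size=1024, num_heads=4, layer_idx=0).cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)  # [batch, seq_len, hidden_size]
o, _, _ = layer(x)  # forward returns (output, attentions, past_key_values)
print(o.shape)      # torch.Size([2, 128, 1024])
```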
fla/layers/gla.py ADDED
@@ -0,0 +1,280 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+
5
+ from __future__ import annotations
6
+
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+ from einops import rearrange, repeat
13
+
14
+ from fla.modules import FusedRMSNormSwishGate, RMSNorm, ShortConvolution
15
+ from fla.modules.activations import ACT2FN
16
+ from fla.ops.gla import chunk_gla, fused_chunk_gla, fused_recurrent_gla
17
+
18
+ if TYPE_CHECKING:
19
+ from fla.models.utils import Cache
20
+
21
+
22
+ class GatedLinearAttention(nn.Module):
23
+ r"""
24
+ The layer implementation for [Gated Linear Attention Transformers with Hardware-Efficient Training](https://arxiv.org/abs/2312.06635). # noqa
25
+
26
+ Args:
27
+ mode (str, Optional):
28
+ Which GLA kernel to use.
29
+ Currently available: `chunk`, `fused_recurrent`, and `fused_chunk`.
30
+ Default: `chunk`.
31
+ hidden_size (int, Optional):
32
+ The hidden size of the input. Default: 1024.
33
+ expand_k (float, Optional):
34
+ The expansion ratio for the key dim. Default: 0.5.
35
+ expand_v (float, Optional):
36
+ The expansion ratio for the value dim. Default: 1.0.
37
+ num_heads (int, Optional):
38
+ The number of heads. Default: 4.
39
+ num_kv_heads (int, Optional):
40
+ The number of key/value heads, used for MQA. Default: None.
41
+ feature_map (str, Optional):
42
+ Feature map function applied to queries/keys. Default: None.
43
+ use_short_conv (bool, Optional):
44
+ Whether to use short convolutions. Default: `False`.
45
+ conv_size (int, Optional):
46
+ The kernel size of the short convolution, only used when `use_short_conv` is `True`. Default: 4.
47
+ conv_bias (bool, Optional):
48
+ Whether to use bias in the short convolution, only used when `use_short_conv` is `True`. Default: `False`.
49
+ use_output_gate (bool, Optional):
50
+ Whether to use output gate. Default: `True`.
51
+ gate_fn (str, Optional):
52
+ The activation function for the output gate. Default: `swish`.
53
+ elementwise_affine (bool, Optional):
54
+ If `True`, applies elementwise affine to LayerNorm with learnable parameters. Default: `True`.
55
+ norm_eps (float, Optional):
56
+ The epsilon value for the layernorm/rmsnorm layer. Default: 1e-5.
57
+ gate_logit_normalizer (int, Optional):
58
+ The normalizer for the gate logits, applied after `logsigmoid`. Default: 16.
59
+ gate_low_rank_dim (int, Optional):
60
+ The low rank dim for the gate projection. Default: 16.
61
+ clamp_min (float, Optional):
62
+ The minimum value for the gate logits. Default: None.
63
+ fuse_norm (bool, Optional):
64
+ Whether to fuse the norm and the output gate for better memory footprint. Default: `True`.
65
+ layer_idx (int, Optional):
66
+ The index of the layer. Default: None.
67
+ """
68
+
69
+ def __init__(
70
+ self,
71
+ mode: str = 'chunk',
72
+ hidden_size: int = 1024,
73
+ expand_k: float = 0.5,
74
+ expand_v: float = 1.0,
75
+ num_heads: int = 4,
76
+ num_kv_heads: Optional[int] = None,
77
+ feature_map: Optional[str] = None,
78
+ use_short_conv: bool = False,
79
+ conv_size: int = 4,
80
+ conv_bias: bool = False,
81
+ use_output_gate: bool = True,
82
+ gate_fn: str = 'swish',
83
+ elementwise_affine: Optional[bool] = True,
84
+ norm_eps: float = 1e-5,
85
+ gate_logit_normalizer: int = 16,
86
+ gate_low_rank_dim: int = 16,
87
+ clamp_min: Optional[float] = None,
88
+ fuse_norm: bool = True,
89
+ layer_idx: int = None,
90
+ ) -> GatedLinearAttention:
91
+ super().__init__()
92
+
93
+ self.mode = mode
94
+ self.hidden_size = hidden_size
95
+ self.expand_k = expand_k
96
+ self.expand_v = expand_v
97
+ self.num_heads = num_heads
98
+ self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
99
+ self.num_kv_groups = self.num_heads // self.num_kv_heads
100
+ self.feature_map_fn = ACT2FN[feature_map] if feature_map is not None else None
101
+
102
+ self.use_short_conv = use_short_conv
103
+ self.conv_size = conv_size
104
+ self.conv_bias = conv_bias
105
+ self.use_output_gate = use_output_gate
106
+
107
+ self.key_dim = int(hidden_size * expand_k)
108
+ self.value_dim = int(hidden_size * expand_v)
109
+ self.key_dim_per_group = self.key_dim // self.num_kv_groups
110
+ self.value_dim_per_group = self.value_dim // self.num_kv_groups
111
+ self.clamp_min = clamp_min
112
+ self.layer_idx = layer_idx
113
+
114
+ assert mode in ['chunk', 'fused_recurrent', 'fused_chunk'], f"Not supported mode `{mode}`."
115
+ assert self.key_dim % num_heads == 0, f"key dim must be divisible by num_heads of {num_heads}"
116
+ assert self.value_dim % num_heads == 0, f"value dim must be divisible by num_heads of {num_heads}"
117
+
118
+ self.head_qk_dim = self.key_dim // num_heads
119
+ self.head_v_dim = self.value_dim // num_heads
120
+
121
+ self.q_proj = nn.Linear(hidden_size, self.key_dim, bias=False)
122
+ self.k_proj = nn.Linear(hidden_size, self.key_dim_per_group, bias=False)
123
+ self.v_proj = nn.Linear(hidden_size, self.value_dim_per_group, bias=False)
124
+ if self.use_output_gate:
125
+ self.g_proj = nn.Linear(hidden_size, self.value_dim, bias=False)
126
+
127
+ if use_short_conv:
128
+ self.conv_size = conv_size
129
+ self.q_conv1d = ShortConvolution(self.key_dim, conv_size, activation='silu')
130
+ self.k_conv1d = ShortConvolution(self.key_dim_per_group, conv_size, activation='silu')
131
+ self.v_conv1d = ShortConvolution(self.value_dim_per_group, conv_size, activation='silu')
132
+
133
+ self.gk_proj = nn.Sequential(nn.Linear(hidden_size, gate_low_rank_dim, bias=False),
134
+ nn.Linear(gate_low_rank_dim, self.key_dim_per_group, bias=True))
135
+ self.o_proj = nn.Linear(self.value_dim, hidden_size, bias=False)
136
+
137
+ if gate_fn == 'swish' and fuse_norm and use_output_gate:
138
+ self.g_norm_swish_gate = FusedRMSNormSwishGate(self.head_v_dim, elementwise_affine, norm_eps)
139
+ self.fuse_norm_and_gate = True
140
+ else:
141
+ self.fuse_norm_and_gate = False
142
+ self.g_norm = RMSNorm(hidden_size=self.head_v_dim, elementwise_affine=elementwise_affine, eps=norm_eps)
143
+ self.gate_fn = ACT2FN[gate_fn]
144
+
145
+ self.gate_logit_normalizer = gate_logit_normalizer
146
+
147
+ self.apply(self._initialize_weights)
148
+
149
+ def _initialize_weights(self, module: nn.Module):
150
+ if getattr(module, "_is_hf_initialized", False):
151
+ return
152
+ if isinstance(module, nn.Linear):
153
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
154
+ if module.bias is not None:
155
+ nn.init.zeros_(module.bias)
156
+ module._is_hf_initialized = True
157
+
158
+ def forward(
159
+ self,
160
+ hidden_states: torch.Tensor,
161
+ attention_mask: Optional[torch.Tensor] = None,
162
+ past_key_values: Optional[Cache] = None,
163
+ use_cache: Optional[bool] = False,
164
+ output_attentions: Optional[bool] = False,
165
+ **kwargs
166
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
167
+ if attention_mask is not None:
168
+ assert len(attention_mask.shape) == 2, (
169
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
170
+ "for padding purposes (0 indicating padding). "
171
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
172
+ )
173
+
174
+ # launching the triton kernel for just one token will actually be slower
175
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
176
+
177
+ last_state = None
178
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
179
+ last_state = past_key_values[self.layer_idx]
180
+ if self.use_short_conv:
181
+ conv_state_q, conv_state_k, conv_state_v = None, None, None
182
+ if last_state is not None:
183
+ conv_state_q, conv_state_k, conv_state_v = last_state['conv_state']
184
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
185
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
186
+ mask=conv_mask,
187
+ cache=conv_state_q,
188
+ output_final_state=use_cache)
189
+ k, conv_state_k = self.k_conv1d(x=self.k_proj(hidden_states),
190
+ mask=conv_mask,
191
+ cache=conv_state_k,
192
+ output_final_state=use_cache)
193
+ v, conv_state_v = self.v_conv1d(x=self.v_proj(hidden_states),
194
+ mask=conv_mask,
195
+ cache=conv_state_v,
196
+ output_final_state=use_cache)
197
+ else:
198
+ q = self.q_proj(hidden_states)
199
+ k = self.k_proj(hidden_states)
200
+ v = self.v_proj(hidden_states)
201
+ gk = self.gk_proj(hidden_states)
202
+
203
+ if self.feature_map_fn is not None:
204
+ q, k = map(self.feature_map_fn, (q, k))
205
+ # dealing with left-padding
206
+ if attention_mask is not None:
207
+ v = v.mul_(attention_mask[:, -v.shape[-2]:, None])
208
+ q = rearrange(q, 'b t (h d) -> b t h d', h=self.num_heads)
209
+ if self.num_kv_groups > 1:
210
+ k, v, gk = (repeat(x, 'b t (h d) -> b t (h g) d', h=self.num_kv_heads, g=self.num_kv_groups) for x in (k, v, gk))
211
+ else:
212
+ k, v, gk = (rearrange(x, 'b t (h d) -> b t h d', h=self.num_kv_heads) for x in (k, v, gk))
213
+ gk = F.logsigmoid(gk) / self.gate_logit_normalizer
214
+
215
+ if self.clamp_min is not None:
216
+ gk = torch.clamp_min(gk, self.clamp_min)
217
+
218
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
219
+ if mode == 'fused_recurrent':
220
+ o, recurrent_state = fused_recurrent_gla(
221
+ q=q,
222
+ k=k,
223
+ v=v,
224
+ gk=gk,
225
+ initial_state=recurrent_state,
226
+ output_final_state=use_cache,
227
+ head_first=False
228
+ )
229
+ elif mode == 'fused_chunk':
230
+ o, recurrent_state = fused_chunk_gla(
231
+ q=q,
232
+ k=k,
233
+ v=v,
234
+ g=gk,
235
+ initial_state=recurrent_state,
236
+ output_final_state=use_cache,
237
+ head_first=False
238
+ )
239
+ elif mode == 'chunk':
240
+ o, recurrent_state = chunk_gla(
241
+ q=q,
242
+ k=k,
243
+ v=v,
244
+ g=gk,
245
+ initial_state=recurrent_state,
246
+ output_final_state=use_cache,
247
+ head_first=False
248
+ )
249
+ else:
250
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
251
+
252
+ if past_key_values is not None:
253
+ past_key_values.update(
254
+ recurrent_state=recurrent_state,
255
+ conv_state=(conv_state_q, conv_state_k, conv_state_v) if self.use_short_conv else None,
256
+ layer_idx=self.layer_idx,
257
+ offset=q.shape[2]
258
+ )
259
+
260
+ if self.use_output_gate:
261
+ g = self.g_proj(hidden_states)
262
+ if self.fuse_norm_and_gate:
263
+ g = rearrange(g, 'b t (h d) -> b t h d', h=self.num_heads)
264
+ o = self.g_norm_swish_gate(o, g)
265
+ o = rearrange(o, 'b t h d -> b t (h d)')
266
+ else:
267
+ o = rearrange(self.g_norm(o), 'b t h d -> b t (h d)')
268
+ o = o * self.gate_fn(g)
269
+ else:
270
+ o = rearrange(self.g_norm(o), 'b t h d -> b t (h d)')
271
+ o = self.o_proj(o)
272
+
273
+ return o, None, past_key_values
274
+
275
+ def state_size(self, **kwargs) -> int:
276
+ state_size = self.key_dim * self.head_v_dim
277
+ for module in self.children():
278
+ if isinstance(module, ShortConvolution):
279
+ state_size += module.state_size
280
+ return state_size
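
A minimal usage sketch for `GatedLinearAttention` as defined above (assuming the repository's `fla` package is installed and a CUDA device backs the Triton kernels; `bfloat16` is a typical but not required choice).

```python
import torch
from fla.layers.gla import GatedLinearAttention

# defaults: mode='chunk', expand_k=0.5, expand_v=1.0, so key_dim=512 and value_dim=1024
layer = GatedLinearAttention(hidden_size=1024, num_heads=4, layer_idx=0).cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)  # [batch, seq_len, hidden_size]
o, _, past_key_values = layer(x)  # linear-attention layers do not return attention weights
print(o.shape)  # torch.Size([2, 128, 1024])
```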
fla/layers/gsa.py ADDED
@@ -0,0 +1,233 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ import warnings
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+ from einops import rearrange
13
+
14
+ from fla.modules import RMSNorm, ShortConvolution
15
+ from fla.modules.activations import swish
16
+ from fla.modules.feature_map import (ReLUFeatureMap, SwishFeatureMap,
17
+ T2RFeatureMap)
18
+ from fla.modules.layernorm import rms_norm_linear
19
+ from fla.ops.gsa import chunk_gsa, fused_recurrent_gsa
20
+
21
+ if TYPE_CHECKING:
22
+ from fla.models.utils import Cache
23
+
24
+
25
+ class GatedSlotAttention(nn.Module):
26
+
27
+ def __init__(
28
+ self,
29
+ mode: str = 'chunk',
30
+ hidden_size: int = 1024,
31
+ expand_k: float = 1.,
32
+ expand_v: float = 1.,
33
+ num_heads: int = 4,
34
+ num_kv_heads: Optional[int] = None,
35
+ use_short_conv: bool = False,
36
+ conv_size: int = 4,
37
+ conv_bias: bool = False,
38
+ num_slots: Optional[int] = None,
39
+ elementwise_affine: Optional[bool] = True,
40
+ norm_first: bool = True,
41
+ norm_eps: float = 1e-5,
42
+ gate_logit_normalizer: int = 8,
43
+ feature_map: str = 'swish',
44
+ use_output_gate: bool = False,
45
+ use_norm: bool = True,
46
+ layer_idx: Optional[int] = None,
47
+ scale: Optional[float] = 1.,
48
+ **kwargs
49
+ ) -> GatedSlotAttention:
50
+ super().__init__()
51
+
52
+ self.mode = mode
53
+ self.hidden_size = hidden_size
54
+ self.expand_k = expand_k
55
+ self.expand_v = expand_v
56
+ self.num_heads = num_heads
57
+ self.num_kv_heads = num_heads if num_kv_heads is None else num_kv_heads
58
+ self.num_kv_groups = self.num_heads // self.num_kv_heads
59
+ self.key_dim = int(hidden_size * expand_k)
60
+ self.value_dim = int(hidden_size * expand_v)
61
+ self.key_dim_per_group = self.key_dim // self.num_kv_groups
62
+ self.value_dim_per_group = self.value_dim // self.num_kv_groups
63
+ self.head_k_dim = self.key_dim // self.num_heads
64
+ self.head_v_dim = self.value_dim // self.num_heads
65
+
66
+ self.use_short_conv = use_short_conv
67
+ self.conv_size = conv_size
68
+ self.conv_bias = conv_bias
69
+
70
+ self.gate_logit_normalizer = gate_logit_normalizer
71
+
72
+ self.use_output_gate = use_output_gate
73
+ self.use_norm = use_norm
74
+ self.scale = scale
75
+
76
+ if num_slots is None:
77
+ num_slots = self.head_k_dim
78
+ self.num_slots = num_slots
79
+ self.norm_first = norm_first
80
+
81
+ self.layer_idx = layer_idx
82
+
83
+ if layer_idx is None:
84
+ warnings.warn(
85
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
86
+ "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
87
+ "when creating this class."
88
+ )
89
+
90
+ if norm_first:
91
+ self.norm = RMSNorm(self.hidden_size, eps=norm_eps)
92
+ self.register_module('feature_map', None)
93
+ if feature_map == 'swish':
94
+ self.feature_map = SwishFeatureMap()
95
+ elif feature_map == 'relu':
96
+ self.feature_map = ReLUFeatureMap()
97
+ elif feature_map == 't2r':
98
+ self.feature_map = T2RFeatureMap(self.head_k_dim, self.head_k_dim)
99
+ else:
100
+ raise NotImplementedError(f"Feature map `{feature_map}` is not supported now.")
101
+
102
+ self.q_proj = nn.Linear(self.hidden_size, self.key_dim, bias=False)
103
+ self.k_proj = nn.Linear(self.hidden_size, self.key_dim_per_group, bias=False)
104
+ self.v_proj = nn.Linear(self.hidden_size, self.value_dim_per_group, bias=False)
105
+ self.f_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.num_slots, bias=False)
106
+
107
+ if use_short_conv:
108
+ self.conv_size = conv_size
109
+ self.q_conv1d = ShortConvolution(self.key_dim, conv_size, activation='silu')
110
+ self.k_conv1d = ShortConvolution(self.key_dim_per_group, conv_size, activation='silu')
111
+ self.v_conv1d = ShortConvolution(self.value_dim_per_group, conv_size, activation='silu')
112
+
113
+ self.g_norm = RMSNorm(self.hidden_size, elementwise_affine, eps=norm_eps)
114
+ self.o_proj = nn.Linear(self.value_dim, self.hidden_size, bias=False)
115
+
116
+ self.apply(self._initialize_weights)
117
+
118
+ def _initialize_weights(self, module: nn.Module):
119
+ if getattr(module, "_is_hf_initialized", False):
120
+ return
121
+ if isinstance(module, nn.Linear):
122
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
123
+ if module.bias is not None:
124
+ nn.init.zeros_(module.bias)
125
+ module._is_hf_initialized = True
126
+
127
+ def forward(
128
+ self,
129
+ hidden_states: torch.Tensor,
130
+ attention_mask: Optional[torch.Tensor] = None,
131
+ past_key_values: Optional[Cache] = None,
132
+ use_cache: Optional[bool] = False,
133
+ output_attentions: Optional[bool] = False,
134
+ **kwargs
135
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
136
+ if attention_mask is not None:
137
+ assert len(attention_mask.shape) == 2, (
138
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
139
+ "for padding purposes (0 indicating padding). "
140
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
141
+ )
142
+
143
+ # launching the triton kernel for just one token will actually be slower
144
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
145
+
146
+ if self.norm_first:
147
+ hidden_states = self.norm(hidden_states)
148
+
149
+ last_state = None
150
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
151
+ last_state = past_key_values[self.layer_idx]
152
+
153
+ if self.use_short_conv:
154
+ conv_state_q, conv_state_k, conv_state_v = None, None, None
155
+ if last_state is not None:
156
+ conv_state_q, conv_state_k, conv_state_v = last_state['conv_state']
157
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
158
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
159
+ mask=conv_mask,
160
+ cache=conv_state_q,
161
+ output_final_state=use_cache)
162
+ k, conv_state_k = self.k_conv1d(x=self.k_proj(hidden_states),
163
+ mask=conv_mask,
164
+ cache=conv_state_k,
165
+ output_final_state=use_cache)
166
+ v, conv_state_v = self.v_conv1d(x=self.v_proj(hidden_states),
167
+ mask=conv_mask,
168
+ cache=conv_state_v,
169
+ output_final_state=use_cache)
170
+ else:
171
+ q = self.q_proj(hidden_states)
172
+ k = self.k_proj(hidden_states)
173
+ v = self.v_proj(hidden_states)
174
+ f = self.f_proj(hidden_states)
175
+
176
+ q = rearrange(q, 'b t (h d) -> b t h d', h=self.num_heads)
177
+ k = rearrange(k, 'b t (h d) -> b t h d', h=self.num_kv_heads)
178
+ v = rearrange(v, 'b t (h d) -> b t h d', h=self.num_kv_heads)
179
+ f = rearrange(f, 'b t (h m) -> b t h m', h=self.num_kv_heads)
180
+
181
+ if self.feature_map is not None:
182
+ q, k = map(lambda x: self.feature_map(x), (q, k))
183
+ v = swish(v)
184
+
185
+ f = F.logsigmoid(f) / self.gate_logit_normalizer
186
+ s = (1 - f.exp()).to(f.dtype)
187
+ # dealing with left-padding
188
+ if attention_mask is not None:
189
+ s = s.mul_(attention_mask[:, -s.shape[1]:, None, None])
190
+ v = v.mul_(attention_mask[:, -v.shape[1]:, None, None])
191
+
192
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
193
+ if mode == 'fused_recurrent':
194
+ o, recurrent_state = fused_recurrent_gsa(
195
+ q=q,
196
+ k=k,
197
+ v=v,
198
+ s=s,
199
+ g=f,
200
+ initial_state=recurrent_state,
201
+ output_final_state=use_cache,
202
+ scale=self.scale,
203
+ head_first=False
204
+ )
205
+ elif mode == 'chunk':
206
+ o, recurrent_state = chunk_gsa(
207
+ q=q,
208
+ k=k,
209
+ v=v,
210
+ s=s,
211
+ g=f,
212
+ initial_state=recurrent_state,
213
+ output_final_state=use_cache,
214
+ scale=self.scale,
215
+ head_first=False
216
+ )
217
+ else:
218
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
219
+
220
+ if past_key_values is not None:
221
+ past_key_values.update(
222
+ recurrent_state=recurrent_state,
223
+ conv_state=(conv_state_q, conv_state_k, conv_state_v) if self.use_short_conv else None,
224
+ layer_idx=self.layer_idx,
225
+ offset=q.shape[2]
226
+ )
227
+
228
+ o = rearrange(o, 'b t h d -> b t (h d)')
229
+ o = rms_norm_linear(swish(o), self.g_norm.weight, self.g_norm.bias, self.o_proj.weight, self.o_proj.bias)
230
+ return o, None, past_key_values
231
+
232
+ def state_size(self, *args, **kwargs) -> int:
233
+ return 2 * self.num_slots * self.hidden_size
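
A minimal usage sketch for `GatedSlotAttention` (same assumptions as above: the repository's `fla` package is installed and a CUDA device is available for the chunk/fused kernels).

```python
import torch
from fla.layers.gsa import GatedSlotAttention

layer = GatedSlotAttention(hidden_size=1024, num_heads=4, num_slots=64, layer_idx=0).cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)
o, _, _ = layer(x)  # output keeps the input shape [batch, seq_len, hidden_size]
```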
fla/layers/hgrn.py ADDED
@@ -0,0 +1,153 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ # "Hierarchically Gated Recurrent Neural Network for Sequence Modeling" [https://arxiv.org/abs/2311.04823]
5
+
6
+ from __future__ import annotations
7
+
8
+ from typing import TYPE_CHECKING, Optional, Tuple
9
+
10
+ import torch
11
+ import torch.nn as nn
12
+ import torch.nn.functional as F
13
+
14
+ from fla.modules import FusedRMSNormSwishGate, ShortConvolution
15
+ from fla.modules.activations import swiglu
16
+ from fla.ops.hgrn import chunk_hgrn, fused_recurrent_hgrn
17
+
18
+ if TYPE_CHECKING:
19
+ from fla.models.utils import Cache
20
+
21
+
22
+ class HGRNAttention(nn.Module):
23
+
24
+ def __init__(
25
+ self,
26
+ mode: str = 'chunk',
27
+ hidden_size: int = 1024,
28
+ expand_ratio: Optional[int] = 1,
29
+ use_short_conv: bool = False,
30
+ conv_size: int = 4,
31
+ conv_bias: bool = False,
32
+ elementwise_affine: Optional[bool] = True,
33
+ norm_eps: float = 1e-5,
34
+ layer_idx: int = None
35
+ ) -> HGRNAttention:
36
+ super().__init__()
37
+
38
+ self.mode = mode
39
+ self.hidden_size = hidden_size
40
+ self.expand_ratio = expand_ratio
41
+ self.input_dim = int(hidden_size * expand_ratio)
42
+
43
+ self.use_short_conv = use_short_conv
44
+ self.conv_size = conv_size
45
+ self.conv_bias = conv_bias
46
+
47
+ self.layer_idx = layer_idx
48
+
49
+ assert mode in ['chunk', 'fused_recurrent'], f"Not supported mode `{mode}`."
50
+
51
+ self.i_proj = nn.Linear(hidden_size, self.input_dim, bias=False)
52
+ self.f_proj = nn.Linear(hidden_size, self.input_dim, bias=False)
53
+ self.g_proj = nn.Linear(hidden_size, self.input_dim, bias=False)
54
+
55
+ if use_short_conv:
56
+ self.conv_size = conv_size
57
+ self.q_conv1d = ShortConvolution(self.input_dim, conv_size, activation=None)
58
+ self.f_conv1d = ShortConvolution(self.input_dim, conv_size, activation=None)
59
+ self.i_conv1d = ShortConvolution(self.input_dim, conv_size, activation=None)
60
+
61
+ self.g_norm = FusedRMSNormSwishGate(self.input_dim, elementwise_affine, norm_eps)
62
+ self.o_proj = nn.Linear(self.input_dim, hidden_size, bias=False)
63
+
64
+ self.apply(self._initialize_weights)
65
+
66
+ def _initialize_weights(self, module: nn.Module):
67
+ if getattr(module, "_is_hf_initialized", False):
68
+ return
69
+ if isinstance(module, nn.Linear):
70
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
71
+ if module.bias is not None:
72
+ nn.init.zeros_(module.bias)
73
+ module._is_hf_initialized = True
74
+
75
+ def forward(
76
+ self,
77
+ hidden_states: torch.Tensor,
78
+ attention_mask: Optional[torch.Tensor] = None,
79
+ past_key_values: Optional[Cache] = None,
80
+ use_cache: Optional[bool] = False,
81
+ output_attentions: Optional[bool] = False,
82
+ lower_bound: Optional[torch.Tensor] = None,
83
+ **kwargs
84
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
85
+ if attention_mask is not None:
86
+ assert len(attention_mask.shape) == 2, (
87
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
88
+ "for padding purposes (0 indicating padding). "
89
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
90
+ )
91
+
92
+ # launching the triton kernel for just one token will actually be slower
93
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
94
+
95
+ last_state = None
96
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
97
+ last_state = past_key_values[self.layer_idx]
98
+
99
+ if self.use_short_conv:
100
+ conv_state_i, conv_state_f = None, None
101
+ if last_state is not None:
102
+ conv_state_i, conv_state_f = last_state['conv_state']
103
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
104
+ i, conv_state_i = self.i_conv1d(x=self.i_proj(hidden_states),
105
+ mask=conv_mask,
106
+ cache=conv_state_i,
107
+ output_final_state=use_cache)
108
+ f, conv_state_f = self.f_conv1d(x=self.f_proj(hidden_states),
109
+ mask=conv_mask,
110
+ cache=conv_state_f,
111
+ output_final_state=use_cache)
112
+ else:
113
+ i = self.i_proj(hidden_states)
114
+ f = self.f_proj(hidden_states)
115
+
116
+ # the lower bound for the first layer is zero
117
+ if lower_bound is None or self.layer_idx == 0:
118
+ i, f = swiglu(i, 1 - f.sigmoid()), F.logsigmoid(f)
119
+ else:
120
+ g = lower_bound + (1 - lower_bound) * f.sigmoid()
121
+ i, f = swiglu(i, 1 - g), g.log()
122
+
123
+ # dealing with left-padding
124
+ if attention_mask is not None:
125
+ i = i.mul_(attention_mask[:, -i.shape[-2]:, None])
126
+
127
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
128
+ if mode == 'chunk':
129
+ o, recurrent_state = chunk_hgrn(i, f, recurrent_state, use_cache)
130
+ elif mode == 'fused_recurrent':
131
+ o, recurrent_state = fused_recurrent_hgrn(i, f, recurrent_state, use_cache)
132
+ else:
133
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
134
+
135
+ if past_key_values is not None:
136
+ past_key_values.update(
137
+ recurrent_state=recurrent_state,
138
+ conv_state=(conv_state_i, conv_state_f) if self.use_short_conv else None,
139
+ layer_idx=self.layer_idx,
140
+ offset=i.shape[2]
141
+ )
142
+
143
+ o = self.g_norm(o, self.g_proj(hidden_states))
144
+ o = self.o_proj(o)
145
+
146
+ return o, None, past_key_values
147
+
148
+ def state_size(self, **kwargs) -> int:
149
+ state_size = self.hidden_size
150
+ for module in self.children():
151
+ if isinstance(module, ShortConvolution):
152
+ state_size += module.state_size
153
+ return state_size
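
A minimal usage sketch for `HGRNAttention`; the optional `lower_bound` tensor is normally supplied per layer by the surrounding model, so it is omitted here.

```python
import torch
from fla.layers.hgrn import HGRNAttention

layer = HGRNAttention(hidden_size=1024, expand_ratio=1, layer_idx=0).cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)
o, _, _ = layer(x)  # without `lower_bound`, the layer falls back to the plain forget-gate parameterization
print(o.shape)      # torch.Size([2, 128, 1024])
```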
fla/layers/hgrn2.py ADDED
@@ -0,0 +1,207 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ # "HGRN2: Gated Linear RNNs with State Expansion"[https://arxiv.org/abs/2404.07904]
5
+
6
+ from __future__ import annotations
7
+
8
+ from typing import TYPE_CHECKING, Optional, Tuple
9
+
10
+ import torch
11
+ import torch.nn as nn
12
+ import torch.nn.functional as F
13
+ from einops import rearrange
14
+
15
+ from fla.modules import RMSNorm, ShortConvolution
16
+ from fla.modules.activations import swish
17
+ from fla.modules.layernorm import rms_norm_linear
18
+ from fla.ops.gla import chunk_gla, fused_chunk_gla, fused_recurrent_gla
19
+
20
+ if TYPE_CHECKING:
21
+ from fla.models.utils import Cache
22
+
23
+
24
+ class HGRN2Attention(nn.Module):
25
+
26
+ def __init__(
27
+ self,
28
+ mode: str = 'chunk',
29
+ hidden_size: int = 1024,
30
+ num_heads: Optional[int] = None,
31
+ expand_ratio: Optional[int] = 128,
32
+ use_short_conv: bool = False,
33
+ conv_size: int = 4,
34
+ conv_bias: bool = False,
35
+ elementwise_affine: Optional[bool] = True,
36
+ norm_eps: float = 1e-5,
37
+ layer_idx: int = None
38
+ ) -> HGRN2Attention:
39
+ super().__init__()
40
+
41
+ self.mode = mode
42
+ self.hidden_size = hidden_size
43
+
44
+ if expand_ratio is None and num_heads is not None:
45
+ expand_ratio = hidden_size // num_heads
46
+ elif expand_ratio is not None and num_heads is None:
47
+ num_heads = hidden_size // expand_ratio
48
+ elif expand_ratio is None and num_heads is None:
49
+ raise RuntimeError("One of `expand_ratio` or `num_heads` should be provided.")
50
+ self.num_heads = num_heads
51
+ self.expand_ratio = expand_ratio
52
+
53
+ self.use_short_conv = use_short_conv
54
+ self.conv_size = conv_size
55
+ self.conv_bias = conv_bias
56
+
57
+ self.forget_dim = int(self.num_heads * self.expand_ratio)
58
+ self.input_dim = hidden_size
59
+ self.layer_idx = layer_idx
60
+
61
+ assert mode in ['chunk', 'fused_recurrent', 'fused_chunk'], f"Not supported mode `{mode}`."
62
+ assert self.forget_dim % num_heads == 0, f"forget dim must be divisible by num_heads of {num_heads}"
63
+ assert self.input_dim % num_heads == 0, f"input dim must be divisible by num_heads of {num_heads}"
64
+
65
+ self.head_f_dim = self.expand_ratio
66
+ self.head_i_dim = self.hidden_size // num_heads
67
+
68
+ self.q_proj = nn.Linear(hidden_size, self.forget_dim, bias=False)
69
+ self.f_proj = nn.Linear(hidden_size, self.forget_dim, bias=False)
70
+ self.i_proj = nn.Linear(hidden_size, self.input_dim, bias=False)
71
+
72
+ if use_short_conv:
73
+ self.conv_size = conv_size
74
+ self.q_conv1d = ShortConvolution(self.forget_dim, conv_size, activation=None)
75
+ self.f_conv1d = ShortConvolution(self.forget_dim, conv_size, activation=None)
76
+ self.i_conv1d = ShortConvolution(self.input_dim, conv_size, activation=None)
77
+
78
+ self.g_norm = RMSNorm(hidden_size=self.hidden_size, elementwise_affine=elementwise_affine, eps=norm_eps)
79
+ self.o_proj = nn.Linear(self.input_dim, hidden_size, bias=False)
80
+
81
+ self.apply(self._initialize_weights)
82
+
83
+ def _initialize_weights(self, module: nn.Module):
84
+ if getattr(module, "_is_hf_initialized", False):
85
+ return
86
+ if isinstance(module, nn.Linear):
87
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
88
+ if module.bias is not None:
89
+ nn.init.zeros_(module.bias)
90
+ module._is_hf_initialized = True
91
+
92
+ def forward(
93
+ self,
94
+ hidden_states: torch.Tensor,
95
+ attention_mask: Optional[torch.Tensor] = None,
96
+ past_key_values: Optional[Cache] = None,
97
+ use_cache: Optional[bool] = False,
98
+ output_attentions: Optional[bool] = False,
99
+ lower_bound: Optional[torch.Tensor] = None,
100
+ **kwargs
101
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
102
+ if attention_mask is not None:
103
+ assert len(attention_mask.shape) == 2, (
104
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
105
+ "for padding purposes (0 indicating padding). "
106
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
107
+ )
108
+
109
+ # launching the triton kernel for just one token will actually be slower
110
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
111
+
112
+ last_state = None
113
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
114
+ last_state = past_key_values[self.layer_idx]
115
+
116
+ if self.use_short_conv:
117
+ conv_state_q, conv_state_f, conv_state_i = None, None, None
118
+ if last_state is not None:
119
+ conv_state_q, conv_state_f, conv_state_i = last_state['conv_state']
120
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
121
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
122
+ mask=conv_mask,
123
+ cache=conv_state_q,
124
+ output_final_state=use_cache)
125
+ f, conv_state_f = self.f_conv1d(x=self.f_proj(hidden_states),
126
+ mask=conv_mask,
127
+ cache=conv_state_f,
128
+ output_final_state=use_cache)
129
+ i, conv_state_i = self.i_conv1d(x=self.i_proj(hidden_states),
130
+ mask=conv_mask,
131
+ cache=conv_state_i,
132
+ output_final_state=use_cache)
133
+ else:
134
+ q = self.q_proj(hidden_states)
135
+ f = self.f_proj(hidden_states)
136
+ i = self.i_proj(hidden_states)
137
+
138
+ # dealing with left-padding
139
+ if attention_mask is not None:
140
+ i = i.mul_(attention_mask[:, -i.shape[-2]:, None])
141
+
142
+ q = swish(q)
143
+
144
+ # improve precision
145
+ f = f.float()
146
+
147
+ # the lower bound for the first layer is zero
148
+ if lower_bound is None or self.layer_idx == 0:
149
+ k, g = 1 - f.sigmoid(), F.logsigmoid(f)
150
+ else:
151
+ g = lower_bound + (1 - lower_bound) * f.sigmoid()
152
+ k, g = 1 - g, g.log()
153
+
154
+ q, k, i, g = map(lambda x: rearrange(x, '... (h d) -> ... h d', h=self.num_heads), (q, k.to(i), i, g))
155
+
156
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
157
+ if mode == 'fused_recurrent':
158
+ o, recurrent_state = fused_recurrent_gla(
159
+ q=q,
160
+ k=k,
161
+ v=i,
162
+ gk=g,
163
+ initial_state=recurrent_state,
164
+ output_final_state=use_cache,
165
+ head_first=False
166
+ )
167
+ elif mode == 'fused_chunk':
168
+ o, recurrent_state = fused_chunk_gla(
169
+ q=q,
170
+ k=k,
171
+ v=i,
172
+ g=g,
173
+ initial_state=recurrent_state,
174
+ output_final_state=use_cache,
175
+ head_first=False
176
+ )
177
+ elif mode == 'chunk':
178
+ o, recurrent_state = chunk_gla(
179
+ q=q,
180
+ k=k,
181
+ v=i,
182
+ g=g,
183
+ initial_state=recurrent_state,
184
+ output_final_state=use_cache,
185
+ head_first=False
186
+ )
187
+ else:
188
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
189
+
190
+ if past_key_values is not None:
191
+ past_key_values.update(
192
+ recurrent_state=recurrent_state,
193
+ conv_state=(conv_state_q, conv_state_f, conv_state_i) if self.use_short_conv else None,
194
+ layer_idx=self.layer_idx,
195
+ offset=q.shape[2]
196
+ )
197
+
198
+ o = rearrange(o, '... h d -> ... (h d)')
199
+ o = rms_norm_linear(o, self.g_norm.weight, self.g_norm.bias, self.o_proj.weight, self.o_proj.bias)
200
+ return o, None, past_key_values
201
+
202
+ def state_size(self, **kwargs) -> int:
203
+ state_size = self.forget_dim * self.head_i_dim
204
+ for module in self.children():
205
+ if isinstance(module, ShortConvolution):
206
+ state_size += module.state_size
207
+ return state_size
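
A minimal usage sketch for `HGRN2Attention`; with `hidden_size=1024` and the default `expand_ratio=128`, the layer derives `num_heads=8` on its own.

```python
import torch
from fla.layers.hgrn2 import HGRN2Attention

layer = HGRN2Attention(hidden_size=1024, expand_ratio=128, layer_idx=0).cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)
o, _, _ = layer(x)  # forget gates are built internally from `f_proj`; no extra inputs are needed
print(o.shape)      # torch.Size([2, 128, 1024])
```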
fla/layers/linear_attn.py ADDED
@@ -0,0 +1,171 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from typing import Optional
5
+
6
+ import torch.nn as nn
7
+ import torch.nn.functional as F
8
+ from einops import rearrange, repeat
9
+
10
+ from fla.modules import RMSNorm
11
+ from fla.modules.feature_map import (DPFPFeatureMap, HadamardFeatureMap,
12
+ HedgehogFeatureMap, T2RFeatureMap)
13
+ from fla.ops.linear_attn import (chunk_linear_attn, fused_chunk_linear_attn,
14
+ fused_recurrent_linear_attn)
15
+
16
+
17
+ class LinearAttention(nn.Module):
18
+ def __init__(
19
+ self,
20
+ mode: str = 'chunk',
21
+ hidden_size: int = 1024,
22
+ expand_k: float = 1.0,
23
+ expand_v: float = 1.0,
24
+ num_heads: int = 8,
25
+ num_kv_heads: Optional[int] = None,
26
+ feature_map: str = 'elementwise_product',
27
+ tie_feature_map_qk: bool = False,
28
+ output_norm: str = 'rmsnorm',
29
+ norm_q: bool = False,
30
+ norm_k: bool = False,
31
+ # standard linear attention normalization
32
+ do_feature_map_norm: bool = False,
33
+ elementwise_affine: bool = True,
34
+ norm_eps: float = 1e-5,
35
+ **kwargs
36
+ ):
37
+ super().__init__()
38
+
39
+ self.hidden_size = hidden_size
40
+ self.mode = mode
41
+ self.num_heads = num_heads
42
+ self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
43
+ self.num_kv_groups = self.num_heads // self.num_kv_heads
44
+ self.key_dim = int(hidden_size * expand_k)
45
+ self.value_dim = int(hidden_size * expand_v)
46
+ self.key_dim_per_group = self.key_dim // self.num_kv_groups
47
+ self.value_dim_per_group = self.value_dim // self.num_kv_groups
48
+
49
+ assert mode in ['chunk', 'fused_chunk', 'fused_recurrent'], f"Not supported mode `{mode}`."
50
+ assert self.key_dim % num_heads == 0, f"key dim must be divisible by num_heads of {num_heads}"
51
+ assert self.value_dim % num_heads == 0, f"value dim must be divisible by num_heads of {num_heads}"
52
+
53
+ self.head_qk_dim = self.key_dim // num_heads
54
+ self.head_v_dim = self.value_dim // num_heads
55
+ self.do_feature_map_norm = do_feature_map_norm
56
+
57
+ if feature_map == 'hedgehog':
58
+ if tie_feature_map_qk:
59
+ self.feature_map_q = self.feature_map_k = HedgehogFeatureMap(head_dim=self.head_qk_dim)
60
+ else:
61
+ self.feature_map_q = HedgehogFeatureMap(head_dim=self.head_qk_dim)
62
+ self.feature_map_k = HedgehogFeatureMap(head_dim=self.head_qk_dim)
63
+
64
+ elif feature_map == 't2r':
65
+ if tie_feature_map_qk:
66
+ self.feature_map_q = self.feature_map_k = T2RFeatureMap(head_dim=self.head_qk_dim)
67
+ else:
68
+ self.feature_map_q = T2RFeatureMap(head_dim=self.head_qk_dim)
69
+ self.feature_map_k = T2RFeatureMap(head_dim=self.head_qk_dim)
70
+
71
+ elif feature_map == 'elementwise_product':
72
+ if tie_feature_map_qk:
73
+ self.feature_map_q = self.feature_map_k = HadamardFeatureMap(head_dim=self.head_qk_dim)
74
+ else:
75
+ self.feature_map_q = HadamardFeatureMap(head_dim=self.head_qk_dim)
76
+ self.feature_map_k = HadamardFeatureMap(head_dim=self.head_qk_dim)
77
+
78
+ elif feature_map == 'dpfp':
79
+ self.feature_map_q = DPFPFeatureMap(head_dim=self.head_qk_dim)
80
+ self.feature_map_k = DPFPFeatureMap(head_dim=self.head_qk_dim)
81
+
82
+ elif feature_map == 'elu':
83
+ def elu(x):
84
+ return F.elu(x) + 1
85
+ self.feature_map_q = elu
86
+ self.feature_map_k = elu
87
+
88
+ elif feature_map == 'relu':
89
+ self.feature_map_q = nn.ReLU()
90
+ self.feature_map_k = nn.ReLU()
91
+
92
+ elif feature_map == 'identity':
93
+ self.feature_map_q = nn.Identity()
94
+ self.feature_map_k = nn.Identity()
95
+ else:
96
+ raise NotImplementedError(f"Not supported feature map `{feature_map}`.")
97
+
98
+ self.q_proj = nn.Linear(hidden_size, self.key_dim, bias=False)
99
+ self.k_proj = nn.Linear(hidden_size, self.key_dim_per_group, bias=False)
100
+ self.v_proj = nn.Linear(hidden_size, self.value_dim_per_group, bias=False)
101
+
102
+ if output_norm == 'rmsnorm':
103
+ self.norm = RMSNorm(hidden_size=self.head_v_dim, elementwise_affine=elementwise_affine, eps=norm_eps)
104
+ elif output_norm == 'identity':
105
+ self.norm = nn.Identity()
106
+ else:
107
+ raise NotImplementedError(f"Not supported output norm `{output_norm}`.")
108
+
109
+ self.o_proj = nn.Linear(self.value_dim, hidden_size, bias=False)
110
+
111
+ self.norm_q = norm_q
112
+ self.norm_k = norm_k
113
+
114
+ self.apply(self._initialize_weights)
115
+
116
+ def _initialize_weights(self, module: nn.Module):
117
+ if getattr(module, "_is_hf_initialized", False):
118
+ return
119
+ if isinstance(module, nn.Linear):
120
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
121
+ if module.bias is not None:
122
+ nn.init.zeros_(module.bias)
123
+ module._is_hf_initialized = True
124
+
125
+ def forward(self, x):
126
+ mode = self.mode
127
+ q = self.q_proj(x)
128
+ k = self.k_proj(x)
129
+ v = self.v_proj(x)
130
+
131
+ q = rearrange(q, '... (h d) -> ... h d', h=self.num_heads)
132
+ if self.num_kv_groups > 1:
133
+ k, v = (repeat(x, '... (h d) -> ... (h g) d', h=self.num_kv_heads, g=self.num_kv_groups) for x in (k, v))
134
+ else:
135
+ k, v = (rearrange(x, '... (h d) -> ... h d', h=self.num_kv_heads) for x in (k, v))
136
+
137
+ q = self.feature_map_q(q)
138
+ k = self.feature_map_k(k)
139
+
140
+ if self.norm_q:
141
+ q = q / (q.sum(-1, True) + 1e-4)
142
+ if self.norm_k:
143
+ k = k / (k.sum(-1, True) + 1e-4)
144
+
145
+ if mode == 'chunk':
146
+ o, final_state = chunk_linear_attn(
147
+ q=q,
148
+ k=k,
149
+ v=v,
150
+ normalize=self.do_feature_map_norm,
151
+ head_first=False
152
+ )
153
+ elif mode == 'fused_chunk':
154
+ o, final_state = fused_chunk_linear_attn(
155
+ q=q,
156
+ k=k,
157
+ v=v,
158
+ normalize=self.do_feature_map_norm,
159
+ )
160
+ elif mode == 'fused_recurrent':
161
+ o, final_state = fused_recurrent_linear_attn(
162
+ q=q,
163
+ k=k,
164
+ v=v,
165
+ normalize=self.do_feature_map_norm,
166
+ )
167
+ else:
168
+ raise NotImplementedError
169
+ o = rearrange(self.norm(o), 'b t h d -> b t (h d)')  # merge heads so the shape matches o_proj's input dim
170
+ o = self.o_proj(o)
171
+ return o
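
A minimal usage sketch for `LinearAttention`; unlike the recurrent layers above, its `forward` takes only the input tensor and returns only the output. It assumes the per-head output is merged back to `(h d)` before `o_proj`, as in the sibling layers above.

```python
import torch
from fla.layers.linear_attn import LinearAttention

layer = LinearAttention(hidden_size=1024, num_heads=8, feature_map='elementwise_product').cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)
o = layer(x)    # no cache or attention weights are returned
print(o.shape)  # torch.Size([2, 128, 1024])
```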
fla/layers/multiscale_retention.py ADDED
@@ -0,0 +1,282 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ from typing import TYPE_CHECKING, Optional, Tuple
7
+
8
+ import torch
9
+ import torch.nn as nn
10
+ from einops import rearrange, repeat
11
+ from transformers.activations import ACT2FN
12
+
13
+ from fla.modules import FusedRMSNormSwishGate, RMSNorm, ShortConvolution
14
+ from fla.modules.rotary import RotaryEmbedding
15
+ from fla.ops.retention import (chunk_retention, fused_chunk_retention,
16
+ fused_recurrent_retention, parallel_retention)
17
+
18
+ if TYPE_CHECKING:
19
+ from fla.models.utils import Cache
20
+
21
+
22
+ class MultiScaleRetention(nn.Module):
23
+ r"""
24
+ The layer implementation for [Retentive Network: A Successor to Transformer for Large Language Models](https://arxiv.org/pdf/2307.08621.pdf). # noqa
25
+
26
+ Args:
27
+ mode (str, Optional):
28
+ Which Retention kernel to use.
29
+ Currently available: `chunk`, `fused_recurrent`, `parallel`, and `fused_chunk`.
30
+ Default: `fused_chunk`.
31
+ hidden_size (int, Optional):
32
+ The hidden size of the input. Default: 1024.
33
+ expand_k (float, Optional):
34
+ The expansion ratio for the key dim. Default: 1.0.
35
+ expand_v (float, Optional):
36
+ The expansion ratio for the value dim. Default: 2.0.
37
+ num_heads (int, Optional):
38
+ The number of heads. Default: 8.
39
+ num_kv_heads (int, Optional):
40
+ The number of key/value heads, used for MQA. Default: None.
41
+ feature_map (str, Optional):
42
+ Feature map function applied to queries/keys. Default: None.
43
+ use_short_conv (bool, Optional):
44
+ Whether to use short convolutions. Default: `False`.
45
+ conv_size (int, Optional):
46
+ The kernel size of the short convolution, only used when `use_short_conv` is `True`. Default: 4.
47
+ conv_bias (bool, Optional):
48
+ Whether to use bias in the short convolution, only used when `use_short_conv` is `True`. Default: `False`.
49
+ use_output_gate (bool, Optional):
50
+ Whether to use output gate. Default: `True`.
51
+ gate_fn (str, Optional):
52
+ The activation function for the output gate. Default: `swish`.
53
+ elementwise_affine (bool, Optional):
54
+ If `True`, applies elementwise affine to LayerNorm with learnable parameters. Default: `True`.
55
+ norm_eps (float, Optional):
56
+ The epsilon value for the layernorm/rmsnorm layer. Default: 1e-5.
57
+ fuse_norm (bool, Optional):
58
+ Whether to fuse the norm and the output gate for better memory footprint. Default: `True`.
59
+ layer_idx (int, Optional):
60
+ The index of the layer. Default: None.
61
+ """
62
+
63
+ def __init__(
64
+ self,
65
+ mode: str = 'chunk',
66
+ hidden_size: int = 1024,
67
+ expand_k: float = 1.0,
68
+ expand_v: float = 2.0,
69
+ num_heads: int = 8,
70
+ num_kv_heads: Optional[int] = None,
71
+ feature_map: Optional[str] = None,
72
+ use_short_conv: bool = False,
73
+ conv_size: int = 4,
74
+ conv_bias: bool = False,
75
+ use_output_gate: bool = True,
76
+ gate_fn: str = 'swish',
77
+ elementwise_affine: Optional[bool] = True,
78
+ norm_eps: float = 1e-5,
79
+ fuse_norm: bool = True,
80
+ layer_idx: int = None,
81
+ **kwargs
82
+ ) -> MultiScaleRetention:
83
+ super().__init__()
84
+
85
+ self.mode = mode
86
+ self.hidden_size = hidden_size
87
+ self.expand_k = expand_k
88
+ self.expand_v = expand_v
89
+ self.num_heads = num_heads
90
+ self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
91
+ self.num_kv_groups = self.num_heads // self.num_kv_heads
92
+ self.feature_map_fn = ACT2FN[feature_map] if feature_map is not None else None
93
+
94
+ self.use_short_conv = use_short_conv
95
+ self.conv_size = conv_size
96
+ self.conv_bias = conv_bias
97
+ self.use_output_gate = use_output_gate
98
+
99
+ self.key_dim = int(hidden_size * expand_k)
100
+ self.value_dim = int(hidden_size * expand_v)
101
+ self.key_dim_per_group = self.key_dim // self.num_kv_groups
102
+ self.value_dim_per_group = self.value_dim // self.num_kv_groups
103
+ self.layer_idx = layer_idx
104
+
105
+ assert mode in ['chunk', 'fused_chunk', 'parallel', 'fused_recurrent'], f"Not supported mode `{mode}`."
106
+ assert self.key_dim % num_heads == 0, f"key dim must be divisible by num_heads of {num_heads}"
107
+ assert self.value_dim % num_heads == 0, f"value dim must be divisible by num_heads of {num_heads}"
108
+
109
+ self.head_qk_dim = self.key_dim // num_heads
110
+ self.head_v_dim = self.value_dim // num_heads
111
+
112
+ self.q_proj = nn.Linear(hidden_size, self.key_dim, bias=False)
113
+ self.k_proj = nn.Linear(hidden_size, self.key_dim_per_group, bias=False)
114
+ self.v_proj = nn.Linear(hidden_size, self.value_dim_per_group, bias=False)
115
+ if self.use_output_gate:
116
+ self.g_proj = nn.Linear(hidden_size, self.value_dim, bias=False)
117
+
118
+ if use_short_conv:
119
+ self.conv_size = conv_size
120
+ self.q_conv1d = ShortConvolution(self.key_dim, conv_size, activation='silu')
121
+ self.k_conv1d = ShortConvolution(self.key_dim_per_group, conv_size, activation='silu')
122
+ self.v_conv1d = ShortConvolution(self.value_dim_per_group, conv_size, activation='silu')
123
+
124
+ self.o_proj = nn.Linear(self.value_dim, hidden_size, bias=False)
125
+
126
+ if gate_fn == 'swish' and fuse_norm and use_output_gate:
127
+ self.g_norm_swish_gate = FusedRMSNormSwishGate(self.head_v_dim, elementwise_affine, norm_eps)
128
+ self.fuse_norm_and_gate = True
129
+ else:
130
+ self.fuse_norm_and_gate = False
131
+ self.g_norm = RMSNorm(hidden_size=self.head_v_dim, elementwise_affine=elementwise_affine, eps=norm_eps)
132
+ self.gate_fn = ACT2FN[gate_fn]
133
+
134
+ # TODO: fix this issue
135
+ # https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/rotary.py#L180
136
+ # Ideally, we would want to support arbitrary d_head_qk
137
+ assert self.head_qk_dim <= 256, "head_qk_dim must be less than or equal to 256"
138
+ self.rotary = RotaryEmbedding(dim=self.head_qk_dim)
139
+
140
+ self.apply(self._initialize_weights)
141
+
142
+ def _initialize_weights(self, module: nn.Module):
143
+ if getattr(module, "_is_hf_initialized", False):
144
+ return
145
+ if isinstance(module, nn.Linear):
146
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
147
+ if module.bias is not None:
148
+ nn.init.zeros_(module.bias)
149
+ module._is_hf_initialized = True
150
+
151
+ def forward(
152
+ self,
153
+ hidden_states: torch.Tensor,
154
+ attention_mask: Optional[torch.Tensor] = None,
155
+ past_key_values: Optional[Cache] = None,
156
+ use_cache: Optional[bool] = False,
157
+ output_attentions: Optional[bool] = False,
158
+ **kwargs
159
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
160
+ if attention_mask is not None:
161
+ assert len(attention_mask.shape) == 2, (
162
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
163
+ "for padding purposes (0 indicating padding). "
164
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
165
+ )
166
+
167
+ # launching the triton kernel for just one token will actually be slower
168
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
169
+
170
+ last_state = None
171
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
172
+ last_state = past_key_values[self.layer_idx]
173
+
174
+ if self.use_short_conv:
175
+ conv_state_q, conv_state_k, conv_state_v = None, None, None
176
+ if last_state is not None:
177
+ conv_state_q, conv_state_k, conv_state_v = last_state['conv_state']
178
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
179
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
180
+ mask=conv_mask,
181
+ cache=conv_state_q,
182
+ output_final_state=use_cache)
183
+ k, conv_state_k = self.k_conv1d(x=self.k_proj(hidden_states),
184
+ mask=conv_mask,
185
+ cache=conv_state_k,
186
+ output_final_state=use_cache)
187
+ v, conv_state_v = self.v_conv1d(x=self.v_proj(hidden_states),
188
+ mask=conv_mask,
189
+ cache=conv_state_v,
190
+ output_final_state=use_cache)
191
+ else:
192
+ q = self.q_proj(hidden_states)
193
+ k = self.k_proj(hidden_states)
194
+ v = self.v_proj(hidden_states)
195
+
196
+ # dealing with left-padding
197
+ if attention_mask is not None:
198
+ v = v.mul_(attention_mask[:, -v.shape[-2]:, None])
199
+ q = rearrange(q, '... (h d) -> ... h d', h=self.num_heads)
200
+ k = rearrange(k, '... (h d) -> ... h d', h=self.num_kv_heads)
201
+ if self.feature_map_fn is not None:
202
+ q, k = map(self.feature_map_fn, (q, k))
203
+
204
+ seqlen_offset, max_seqlen = 0, q.shape[1]
205
+ if past_key_values is not None:
206
+ seqlen_offset = past_key_values.get_seq_length(self.layer_idx)
207
+ max_seqlen = q.shape[1] + seqlen_offset
208
+
209
+ if attention_mask is not None:
210
+ # discount left-padding tokens when computing the per-sequence rotary offsets
211
+ seqlen_offset = (seqlen_offset + attention_mask.sum(-1) - attention_mask.shape[-1]).clamp(min=0)
212
+ max_seqlen = q.shape[1] + max(seqlen_offset)
213
+
214
+ q, k = self.rotary(q, k, seqlen_offset, max_seqlen)
215
+ if self.num_kv_groups > 1:
216
+ k = repeat(k, 'b t h d -> b t (h g) d', h=self.num_kv_heads, g=self.num_kv_groups)
217
+ v = repeat(v, 'b t (h d) -> b t (h g) d', h=self.num_kv_heads, g=self.num_kv_groups)
218
+ else:
219
+ k, v = rearrange(k, 'b t h d -> b t h d'), rearrange(v, 'b t (h d) -> b t h d', h=self.num_kv_heads)
220
+
221
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
222
+ if mode == 'chunk':
223
+ o, recurrent_state = chunk_retention(
224
+ q=q,
225
+ k=k,
226
+ v=v,
227
+ initial_state=recurrent_state,
228
+ output_final_state=use_cache,
229
+ head_first=False
230
+ )
231
+ elif mode == 'fused_chunk':
232
+ o, recurrent_state = fused_chunk_retention(
233
+ q=q,
234
+ k=k,
235
+ v=v,
236
+ initial_state=recurrent_state,
237
+ output_final_state=use_cache,
238
+ head_first=False
239
+ )
240
+ elif mode == 'parallel':
241
+ o, recurrent_state = parallel_retention(q, k, v, head_first=False)
242
+ elif mode == 'fused_recurrent':
243
+ o, recurrent_state = fused_recurrent_retention(
244
+ q=q,
245
+ k=k,
246
+ v=v,
247
+ initial_state=recurrent_state,
248
+ output_final_state=use_cache,
249
+ head_first=False
250
+ )
251
+ else:
252
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
253
+
254
+ if past_key_values is not None:
255
+ past_key_values.update(
256
+ recurrent_state=recurrent_state,
257
+ conv_state=(conv_state_q, conv_state_k, conv_state_v) if self.use_short_conv else None,
258
+ layer_idx=self.layer_idx,
259
+ offset=q.shape[2]
260
+ )
261
+
262
+ if self.use_output_gate:
263
+ g = self.g_proj(hidden_states)
264
+ if self.fuse_norm_and_gate:
265
+ g = rearrange(g, 'b t (h d) -> b t h d', h=self.num_heads)
266
+ o = self.g_norm_swish_gate(o, g)
267
+ o = rearrange(o, 'b t h d -> b t (h d)')
268
+ else:
269
+ o = rearrange(self.g_norm(o), 'b t h d -> b t (h d)')
270
+ o = o * self.gate_fn(g)
271
+ else:
272
+ o = rearrange(self.g_norm(o), 'b t h d -> b t (h d)')
273
+ o = self.o_proj(o)
274
+
275
+ return o, None, past_key_values
276
+
277
+ def state_size(self, **kwargs) -> int:
278
+ state_size = self.key_dim * self.head_v_dim
279
+ for module in self.children():
280
+ if isinstance(module, ShortConvolution):
281
+ state_size += module.state_size
282
+ return state_size
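
A minimal usage sketch for `MultiScaleRetention` (with the default `expand_v=2.0` the internal value dim is `2 * hidden_size`, but the output projection maps back to `hidden_size`).

```python
import torch
from fla.layers.multiscale_retention import MultiScaleRetention

layer = MultiScaleRetention(hidden_size=1024, num_heads=8, layer_idx=0).cuda().to(torch.bfloat16)
x = torch.randn(2, 128, 1024, device='cuda', dtype=torch.bfloat16)
o, _, _ = layer(x)  # rotary embeddings are applied to q/k internally
print(o.shape)      # torch.Size([2, 128, 1024])
```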
fla/layers/rebased.py ADDED
@@ -0,0 +1,136 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ """
5
+ https://github.com/corl-team/rebased/blob/main/flash_linear_attention/fla/layers/rebased_fast.py
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ from typing import Optional
11
+
12
+ import torch
13
+ import torch.nn as nn
14
+ from einops import rearrange
15
+
16
+ from fla.modules.feature_map import RebasedFeatureMap
17
+ from fla.ops.linear_attn import chunk_linear_attn, fused_chunk_linear_attn
18
+ from fla.ops.rebased import parallel_rebased
19
+
20
+
21
+ class ReBasedLinearAttention(nn.Module):
22
+ def __init__(
23
+ self,
24
+ hidden_size: int,
25
+ l_max: int = 2048,
26
+ feature_dim: int = 16,
27
+ num_key_value_heads: int = 16,
28
+ num_heads: int = 16,
29
+ use_gamma: Optional[bool] = True,
30
+ use_beta: Optional[bool] = True,
31
+ normalize: Optional[bool] = True,
32
+ causal: bool = True,
33
+ eps: float = 1e-5,
34
+ mode: str = "parallel",
35
+ layer_idx: Optional[int] = None,
36
+ **kwargs
37
+ ) -> ReBasedLinearAttention:
38
+ super().__init__()
39
+ self.hidden_size = hidden_size
40
+ self.l_max = l_max
41
+ self.mode = mode
42
+ assert self.mode in ["fused_chunk", "parallel", 'chunk']
43
+
44
+ # linear attention
45
+ self.feature_dim = feature_dim
46
+ self.num_key_value_heads = num_key_value_heads
47
+ self.num_heads = num_heads
48
+ self.head_dim = self.hidden_size // self.num_key_value_heads
49
+ self.use_gamma = use_gamma
50
+ self.use_beta = use_beta
51
+ self.normalize = normalize
52
+ self.causal = causal
53
+
54
+ self.feature_map = RebasedFeatureMap(self.feature_dim, use_gamma, use_beta, normalize)
55
+ self.q_proj = nn.Linear(self.hidden_size, self.feature_dim * self.num_heads, bias=False)
56
+ self.k_proj = nn.Linear(self.hidden_size, self.feature_dim * self.num_heads, bias=False)
57
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
58
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
59
+ self.dropout = nn.Identity()
60
+ self.eps = eps
61
+
62
+ self.apply(self._initialize_weights)
63
+
64
+ def _initialize_weights(self, module: nn.Module):
65
+ if getattr(module, "_is_hf_initialized", False):
66
+ return
67
+ if isinstance(module, nn.Linear):
68
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
69
+ if module.bias is not None:
70
+ nn.init.zeros_(module.bias)
71
+ module._is_hf_initialized = True
72
+
73
+ def forward(self, hidden_states: torch.Tensor, **kwargs):
74
+ mode = self.mode
75
+ q, k, v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
76
+ q, k, v = map(lambda x: rearrange(x, "... (h d) -> ... h d", h=self.num_heads), [q, k, v])
77
+ q, k = self.feature_map(q, flatten=(mode != 'parallel')), self.feature_map(k, flatten=(mode != 'parallel'))
78
+ if mode == "fused_chunk":
79
+ o = fused_chunk_linear_attn(
80
+ q=q,
81
+ k=k,
82
+ v=v,
83
+ normalize=True,
84
+ scale=1,
85
+ head_first=False
86
+ )
87
+ elif mode == 'chunk':
88
+ o = chunk_linear_attn(
89
+ q=q,
90
+ k=k,
91
+ v=v,
92
+ normalize=True,
93
+ scale=1,
94
+ head_first=False
95
+ )
96
+ elif mode == 'parallel':
97
+ assert q.shape[-1] <= 128
98
+ o = parallel_rebased(
99
+ q=q,
100
+ k=k,
101
+ v=v,
102
+ eps=self.eps,
103
+ use_scale=True,
104
+ use_normalize=True,
105
+ head_first=False
106
+ )
107
+ o = self.o_proj(o)
108
+ o = self.dropout(o)
109
+ return o
110
+
111
+ # https://github.com/HazyResearch/zoology/blob/main/zoology/mixers/based.py#L119
112
+ def forward_reference(self, hidden_states: torch.Tensor, filters: torch.Tensor = None, *args, **kwargs):
113
+ """
114
+ x (torch.Tensor): tensor of shape (b, d, t)
115
+ y (torch.Tensor): tensor of shape (b, d, t)
116
+ """
117
+ b, t, _ = hidden_states.size()
118
+ q, k, v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
119
+
120
+ q = q.view(b, t, self.num_heads, self.feature_dim).transpose(1, 2)
121
+ k = k.view(b, t, self.num_key_value_heads, self.feature_dim).transpose(1, 2)
122
+ v = v.view(b, t, self.num_key_value_heads, self.head_dim).transpose(1, 2)
123
+
124
+ # Linear attention
125
+ q, k = self.feature_map(q), self.feature_map(k)
126
+ q, k, v = q.unsqueeze(-2), k.unsqueeze(-2), v.unsqueeze(-1)
127
+
128
+ # Compute attention
129
+ if self.causal:
130
+ y = ((q * (k * v).cumsum(2)).sum(-1) / ((q * k.cumsum(2)).sum(-1) + self.eps))
131
+ else:
132
+ y = ((q * (k * v).sum(2, True)).sum(-1) / ((q * k.sum(2, True)).sum(-1) + self.eps))
133
+ y = rearrange(y, 'b h t d -> b t (h d)')
134
+ y = self.o_proj(y.to(hidden_states.dtype))
135
+ y = self.dropout(y)
136
+ return y.to(hidden_states.dtype)
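For intuition, the quadratic reference path (`forward_reference`) reduces to a running normalized sum over past key/value products. Below is a minimal, self-contained PyTorch sketch with illustrative tensor sizes; it mirrors the cumulative-sum formulation above and does not call the fused `parallel_rebased` kernel.

```python
import torch

# Sketch of the causal normalization in forward_reference:
# y_t = sum_{s<=t} (q_t·k_s) v_s / (sum_{s<=t} q_t·k_s + eps)
b, h, t, d, e, eps = 2, 4, 8, 16, 32, 1e-5   # illustrative sizes
q = torch.randn(b, h, t, d)
k = torch.randn(b, h, t, d)
v = torch.randn(b, h, t, e)

q_, k_, v_ = q.unsqueeze(-2), k.unsqueeze(-2), v.unsqueeze(-1)
num = (q_ * (k_ * v_).cumsum(2)).sum(-1)   # (b, h, t, e): running sum of (q·k_s) v_s
den = (q_ * k_.cumsum(2)).sum(-1) + eps    # (b, h, t, 1): running sum of q·k_s
y = num / den
print(y.shape)  # torch.Size([2, 4, 8, 32])
```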
fla/layers/rwkv6.py ADDED
@@ -0,0 +1,291 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ # "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence"[https://arxiv.org/abs/2404.05892]
5
+
6
+ from __future__ import annotations
7
+
8
+ from typing import TYPE_CHECKING, Optional, Tuple
9
+
10
+ import torch
11
+ import torch.nn as nn
12
+ from einops import rearrange
13
+
14
+ from fla.modules import GroupNorm
15
+ from fla.modules.activations import ACT2FN
16
+ from fla.ops.rwkv6 import chunk_rwkv6, fused_recurrent_rwkv6
17
+
18
+ if TYPE_CHECKING:
19
+ from fla.models.utils import Cache
20
+
21
+
22
+ class RWKV6Attention(nn.Module):
23
+
24
+ def __init__(
25
+ self,
26
+ mode: str = 'chunk',
27
+ hidden_size: int = 1024,
28
+ expand_k: float = 0.5,
29
+ expand_v: float = 1.0,
30
+ num_heads: int = 4,
31
+ gate_fn: str = 'swish',
32
+ proj_low_rank_dim: int = 32,
33
+ gate_low_rank_dim: int = 64,
34
+ fuse_norm: bool = True,
35
+ elementwise_affine: Optional[bool] = True,
36
+ norm_eps: float = 1e-5,
37
+ layer_idx: int = None,
38
+ **kwargs
39
+ ) -> RWKV6Attention:
40
+ super().__init__()
41
+
42
+ self.mode = mode
43
+ self.hidden_size = hidden_size
44
+ self.expand_k = expand_k
45
+ self.expand_v = expand_v
46
+ self.num_heads = num_heads
47
+ self.proj_low_rank_dim = proj_low_rank_dim
48
+ self.gate_low_rank_dim = gate_low_rank_dim
49
+
50
+ self.key_dim = int(hidden_size * expand_k)
51
+ self.value_dim = int(hidden_size * expand_v)
52
+ self.layer_idx = layer_idx
53
+
54
+        assert mode in ['chunk', 'fused_recurrent'], f"Not supported mode `{mode}`."
55
+ assert self.key_dim % num_heads == 0, f"key dim must be divisible by num_heads of {num_heads}"
56
+ assert self.value_dim % num_heads == 0, f"value dim must be divisible by num_heads of {num_heads}"
57
+
58
+ self.head_qk_dim = self.key_dim // num_heads
59
+ self.head_v_dim = self.value_dim // num_heads
60
+
61
+ self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
62
+ self.x_proj = nn.Sequential(
63
+ LerpLinear(hidden_size, proj_low_rank_dim * 5),
64
+ nn.Tanh(),
65
+ nn.Linear(proj_low_rank_dim * 5, hidden_size, bias=False)
66
+ )
67
+ self.x_bias = nn.Parameter(torch.zeros(5, hidden_size))
68
+
69
+ self.r_proj = DDLerpLinear(hidden_size, self.key_dim)
70
+ self.w_proj = DDLerpLinear(hidden_size, self.key_dim, low_rank_dim=gate_low_rank_dim)
71
+ self.k_proj = DDLerpLinear(hidden_size, self.key_dim)
72
+ self.v_proj = DDLerpLinear(hidden_size, self.value_dim)
73
+ self.g_proj = DDLerpLinear(hidden_size, self.value_dim)
74
+ self.bonus = nn.Parameter(torch.zeros(num_heads, self.head_qk_dim))
75
+
76
+ # TODO: fuse GroupNorm and output gate
77
+ self.g_norm = GroupNorm(self.num_heads, self.value_dim, elementwise_affine=elementwise_affine, bias=True, eps=norm_eps)
78
+ self.o_proj = nn.Linear(self.value_dim, hidden_size, bias=False)
79
+ self.gate_fn = ACT2FN[gate_fn]
80
+
81
+ self.apply(self._initialize_weights)
82
+
83
+ def _initialize_weights(self, module: nn.Module):
84
+ if getattr(module, "_is_hf_initialized", False):
85
+ return
86
+ if isinstance(module, nn.Linear):
87
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
88
+ if module.bias is not None:
89
+ nn.init.zeros_(module.bias)
90
+ if isinstance(module, nn.Parameter):
91
+ nn.init.xavier_uniform_(module, gain=2 ** -2.5)
92
+ module._is_hf_initialized = True
93
+
94
+ def forward(
95
+ self,
96
+ hidden_states: torch.Tensor,
97
+ attention_mask: Optional[torch.Tensor] = None,
98
+ past_key_values: Optional[Cache] = None,
99
+ use_cache: Optional[bool] = False,
100
+ output_attentions: Optional[bool] = False,
101
+ **kwargs
102
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
103
+ if attention_mask is not None:
104
+ assert len(attention_mask.shape) == 2, (
105
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
106
+ "for padding purposes (0 indicating padding). "
107
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
108
+ )
109
+
110
+ batch_size, seq_len, hidden_size = hidden_states.shape
111
+ # launching the triton kernel for just one token will actually be slower
112
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
113
+
114
+ last_state = None
115
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
116
+ last_state = past_key_values[self.layer_idx]
117
+
118
+ if attention_mask is not None:
119
+ hidden_states = hidden_states.mul_(attention_mask[:, -hidden_states.shape[-2]:, None])
120
+ if hidden_states.shape[1] == 1 and last_state is not None:
121
+ shifted = last_state['conv_state'].unsqueeze(1)
122
+ else:
123
+ shifted = self.time_shift(hidden_states)
124
+ if last_state is not None:
125
+ shifted[:, 0] = last_state['conv_state'][0]
126
+
127
+ delta = shifted - hidden_states
128
+ x = self.x_proj[0](hidden_states, delta).view(batch_size, seq_len, -1, self.proj_low_rank_dim)
129
+ x = torch.einsum('b t n r, h n r-> b t n h', self.x_proj[1](x), self.x_proj[2].weight.view(hidden_size, 5, -1))
130
+
131
+ r, w, k, v, g = x.add_(self.x_bias).unbind(-2)
132
+ r = self.r_proj(hidden_states, r, delta)
133
+ w = self.w_proj(hidden_states, w, delta)
134
+ k = self.k_proj(hidden_states, k, delta)
135
+ v = self.v_proj(hidden_states, v, delta)
136
+ g = self.g_proj(hidden_states, g, delta)
137
+
138
+ # dealing with left-padding
139
+ if attention_mask is not None:
140
+ v = v.mul_(attention_mask[:, -v.shape[-2]:, None])
141
+ r, w, k, v = map(lambda x: rearrange(x, 'b t (h d) -> b t h d', h=self.num_heads), (r, w, k, v))
142
+ w = -torch.exp(w)
143
+ u = self.bonus
144
+
145
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
146
+ if mode == 'fused_recurrent':
147
+ o, recurrent_state = fused_recurrent_rwkv6(
148
+ r=r,
149
+ k=k,
150
+ v=v,
151
+ w=w,
152
+ u=u,
153
+ scale=1.,
154
+ initial_state=recurrent_state,
155
+ output_final_state=use_cache,
156
+ head_first=False
157
+ )
158
+ elif mode == 'chunk':
159
+ o, recurrent_state = chunk_rwkv6(
160
+ q=r,
161
+ k=k,
162
+ v=v,
163
+ g=w,
164
+ u=u,
165
+ scale=1.,
166
+ initial_state=recurrent_state,
167
+ output_final_state=use_cache,
168
+ head_first=False
169
+ )
170
+ else:
171
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
172
+
173
+ if past_key_values is not None:
174
+ past_key_values.update(
175
+ recurrent_state=recurrent_state,
176
+ conv_state=hidden_states[:, -1],
177
+ layer_idx=self.layer_idx,
178
+ offset=r.shape[2]
179
+ )
180
+
181
+ o = self.g_norm(rearrange(o, '... h d -> ... (h d)')) * self.gate_fn(g)
182
+ o = self.o_proj(o)
183
+
184
+ return o, None, past_key_values
185
+
186
+
187
+ class LoRA(nn.Module):
188
+
189
+ def __init__(
190
+ self,
191
+ input_dim: int,
192
+ output_dim: int,
193
+ low_rank_dim: int,
194
+ bias: Optional[bool] = True
195
+ ):
196
+ super().__init__()
197
+
198
+ self.input_dim = input_dim
199
+ self.output_dim = output_dim
200
+ self.low_rank_dim = low_rank_dim
201
+ self.bias = bias
202
+
203
+ self.lora = nn.Sequential(
204
+ nn.Linear(input_dim, low_rank_dim, bias=False),
205
+ nn.Tanh(),
206
+ nn.Linear(low_rank_dim, output_dim, bias=bias)
207
+ )
208
+
209
+ def __repr__(self) -> str:
210
+ s = f"{self.__class__.__name__}("
211
+ s += f"input_dim={self.input_dim}, low_rank_dim={self.low_rank_dim}, output_dim={self.output_dim}"
212
+ if not self.bias:
213
+ s += f", bias={self.bias}"
214
+ s += ")"
215
+ return s
216
+
217
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
218
+ return self.lora(x)
219
+
220
+
221
+ class LerpLinear(nn.Module):
222
+
223
+ def __init__(
224
+ self,
225
+ input_dim: int,
226
+ output_dim: int,
227
+ low_rank_dim: Optional[int] = None
228
+ ):
229
+ super().__init__()
230
+
231
+ self.input_dim = input_dim
232
+ self.output_dim = output_dim
233
+ self.low_rank_dim = low_rank_dim
234
+
235
+ self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
236
+ if low_rank_dim is None:
237
+ self.linear = nn.Linear(input_dim, output_dim, bias=False)
238
+ else:
239
+ self.linear = LoRA(input_dim, output_dim, low_rank_dim)
240
+ self.mu = nn.Parameter(torch.zeros(input_dim))
241
+
242
+ def __repr__(self) -> str:
243
+ s = f"{self.__class__.__name__}({self.input_dim}, {self.output_dim}"
244
+ if self.low_rank_dim is not None:
245
+ s += f", low_rank_dim={self.low_rank_dim}"
246
+ s += ")"
247
+ return s
248
+
249
+ def forward(self, x: torch.Tensor, delta: Optional[torch.Tensor] = None) -> torch.Tensor:
250
+ if delta is None:
251
+ shifted = self.time_shift(x)
252
+ if len(shifted.shape) == 2:
253
+ shifted = shifted.unsqueeze(1)
254
+ delta = shifted - x
255
+ return self.linear(x + delta * self.mu)
256
+
257
+
258
+ class DDLerpLinear(nn.Module):
259
+
260
+ def __init__(
261
+ self,
262
+ input_dim: int,
263
+ output_dim: int,
264
+ low_rank_dim: Optional[int] = None
265
+ ):
266
+ super().__init__()
267
+
268
+ self.input_dim = input_dim
269
+ self.output_dim = output_dim
270
+ self.low_rank_dim = low_rank_dim
271
+
272
+ self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
273
+ if low_rank_dim is None:
274
+ self.linear = nn.Linear(input_dim, output_dim, bias=False)
275
+ else:
276
+ self.linear = LoRA(input_dim, output_dim, low_rank_dim)
277
+
278
+ def __repr__(self) -> str:
279
+ s = f"{self.__class__.__name__}({self.input_dim}, {self.output_dim}"
280
+ if self.low_rank_dim is not None:
281
+ s += f", low_rank_dim={self.low_rank_dim}"
282
+ s += ")"
283
+ return s
284
+
285
+ def forward(self, x: torch.Tensor, mu: torch.Tensor, delta: Optional[torch.Tensor] = None) -> torch.Tensor:
286
+ if delta is None:
287
+ shifted = self.time_shift(x)
288
+ if len(shifted.shape) == 2:
289
+ shifted = shifted.unsqueeze(1)
290
+ delta = shifted - x
291
+ return self.linear(x + delta * mu)
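The `LerpLinear`/`DDLerpLinear` modules above both rely on the same token-shift interpolation: each position is mixed with the previous token's features before the projection. A small standalone sketch (sizes are illustrative, `mu` is zero-initialized as in the layer):

```python
import torch
import torch.nn as nn

# ZeroPad2d((0, 0, 1, -1)) shifts the sequence right by one step along time,
# so each position mixes its own features with the previous token's features.
B, T, D = 2, 5, 8                      # illustrative sizes
x = torch.randn(B, T, D)
mu = torch.zeros(D)                    # learnable in the layer; zeros here

time_shift = nn.ZeroPad2d((0, 0, 1, -1))
shifted = time_shift(x)                # shifted[:, t] == x[:, t-1], first step is zeros
delta = shifted - x
mixed = x + delta * mu                 # equals x when mu == 0
print(mixed.shape)                     # torch.Size([2, 5, 8])
```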
fla/layers/scan.py ADDED
@@ -0,0 +1,237 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ import warnings
7
+ from typing import TYPE_CHECKING, Optional, Tuple
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+ from einops import rearrange
13
+
14
+ from fla.modules import RMSNorm
15
+ from fla.modules.activations import swish, sigmoid
16
+ from fla.modules.layernorm import rms_norm_linear
17
+ from fla.ops.scan import parallel_scan, naive_recurrent_scan
18
+
19
+ if TYPE_CHECKING:
20
+ from fla.models.utils import Cache
21
+
22
+ def build_alibi_tensor_scan(head_num, seq_len, window_len, state_size):
23
+ slopes = torch.tensor([2 ** (-8.0 * i / head_num) for i in range(head_num)])
24
+ alibi = torch.zeros((head_num, seq_len, window_len))
25
+ for i in range(seq_len):
26
+ for j in range(window_len):
27
+ if i < window_len:
28
+ alibi[:, i, j] = slopes * (j - window_len + 1) if i > (window_len - j - 2) else 0
29
+ else:
30
+ alibi[:, i, j] = alibi[:, window_len-1, j]
31
+ # Now concat a zeros tensor of size (head_num, seq_len, state_size) to the left of the above square tensor
32
+ alibi = torch.cat((torch.zeros(head_num, seq_len, state_size), alibi), dim=2)
33
+ return alibi # shape: (head_num, seq_len, state_size + window_size) or (H, T, S + W)
34
+
35
+ def scores_mask(T, W, S):
36
+ # create lower right triangle mask (W, W)
37
+ mask = torch.tril(torch.ones(W, W)).flip(1)
38
+ # concat ones with size (T-W, W) in 0th dim
39
+ mask = torch.cat((mask, torch.ones(T-W, W)), dim=0)
40
+ # concat ones with size (T, S) in 1st dim
41
+ mask = torch.cat((torch.ones(T, S), mask), dim=1)
42
+ return mask # shape: (T, S + W)
43
+
44
+ class SemiCompressedAttention(nn.Module):
45
+
46
+ def __init__(
47
+ self,
48
+ mode: str = 'parallel',
49
+ hidden_size: int = 1024,
50
+ window_size: int = 512,
51
+ state_size: int = 64,
52
+ gate_act: str = 'softmax',
53
+ max_position_embeddings: Optional[int] = 2048,
54
+ expand_k: float = 1.,
55
+ expand_v: float = 1.,
56
+ num_heads: int = 4,
57
+ num_kv_heads: Optional[int] = None,
58
+ elementwise_affine: Optional[bool] = True,
59
+ norm_first: bool = True,
60
+ norm_eps: float = 1e-5,
61
+ gate_logit_normalizer: int = 8,
62
+ use_output_gate: bool = False,
63
+ use_norm: bool = True,
64
+ layer_idx: Optional[int] = None,
65
+ scale: Optional[float] = 1.,
66
+ **kwargs
67
+ ) -> SemiCompressedAttention:
68
+ super().__init__()
69
+
70
+ self.mode = mode
71
+ self.hidden_size = hidden_size
72
+ self.window_size = window_size
73
+ self.state_size = state_size
74
+ self.gate_act = gate_act
75
+ self.max_position_embeddings = max_position_embeddings
76
+ self.expand_k = expand_k
77
+ self.expand_v = expand_v
78
+ self.num_heads = num_heads
79
+ self.num_kv_heads = num_heads if num_kv_heads is None else num_kv_heads
80
+ self.num_kv_groups = self.num_heads // self.num_kv_heads
81
+ self.key_dim = int(hidden_size * expand_k)
82
+ self.value_dim = int(hidden_size * expand_v)
83
+ self.key_dim_per_group = self.key_dim // self.num_kv_groups
84
+ self.value_dim_per_group = self.value_dim // self.num_kv_groups
85
+ self.head_k_dim = self.key_dim // self.num_heads
86
+ self.head_v_dim = self.value_dim // self.num_heads
87
+
88
+ self.gate_logit_normalizer = gate_logit_normalizer
89
+
90
+ self.use_output_gate = use_output_gate
91
+ self.use_norm = use_norm
92
+ self.scale = scale
93
+
94
+ self.norm_first = norm_first
95
+
96
+ self.layer_idx = layer_idx
97
+
98
+ if layer_idx is None:
99
+ warnings.warn(
100
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
101
+ "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
102
+ "when creating this class."
103
+ )
104
+
105
+ if norm_first:
106
+ self.norm = RMSNorm(self.hidden_size, eps=norm_eps)
107
+
108
+ self.q_proj = nn.Linear(self.hidden_size, self.key_dim, bias=False)
109
+ self.k_proj = nn.Linear(self.hidden_size, self.key_dim_per_group, bias=False)
110
+ self.v_proj = nn.Linear(self.hidden_size, self.value_dim_per_group, bias=False)
111
+ self.s_proj = nn.Linear(self.hidden_size, self.key_dim_per_group, bias=False)
112
+ self.g_proj = nn.Linear(self.hidden_size, self.num_heads * self.state_size, bias=False)
113
+
114
+ self.norm = RMSNorm(self.hidden_size, elementwise_affine, eps=norm_eps)
115
+ self.o_proj = nn.Linear(self.value_dim, self.hidden_size, bias=False)
116
+
117
+ self.apply(self._initialize_weights)
118
+
119
+ self.register_buffer('alibi', build_alibi_tensor_scan(self.num_heads, self.max_position_embeddings, self.window_size, self.state_size))
120
+ self.register_buffer('mask', scores_mask(self.max_position_embeddings, self.window_size, self.state_size))
121
+
122
+ def _initialize_weights(self, module: nn.Module):
123
+ if getattr(module, "_is_hf_initialized", False):
124
+ return
125
+ if isinstance(module, nn.Linear):
126
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
127
+ if module.bias is not None:
128
+ nn.init.zeros_(module.bias)
129
+ module._is_hf_initialized = True
130
+
131
+ def forward(
132
+ self,
133
+ hidden_states: torch.Tensor,
134
+ attention_mask: Optional[torch.Tensor] = None,
135
+ past_key_values: Optional[Cache] = None,
136
+ use_cache: Optional[bool] = False,
137
+ output_attentions: Optional[bool] = False,
138
+ **kwargs
139
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
140
+ if attention_mask is not None:
141
+ assert len(attention_mask.shape) == 2, (
142
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
143
+ "for padding purposes (0 indicating padding). "
144
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
145
+ )
146
+
147
+ # launching the triton kernel for just one token will actually be slower
148
+ mode = 'naive' if past_key_values is not None else self.mode
149
+
150
+ if self.norm_first:
151
+ hidden_states = self.norm(hidden_states)
152
+
153
+ last_state = None
154
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
155
+ last_state = past_key_values[self.layer_idx]
156
+
157
+ q = self.q_proj(hidden_states)
158
+ k = self.k_proj(hidden_states)
159
+ v = self.v_proj(hidden_states)
160
+ s = self.s_proj(hidden_states)
161
+ g = self.g_proj(hidden_states)
162
+
163
+ if self.gate_act == 'softmax':
164
+ g = F.softmax(g, dim=-1)
165
+ elif self.gate_act == 'sigmoid':
166
+ g = sigmoid(g)
167
+ else:
168
+ raise NotImplementedError(f"Gate activation `{self.gate_act}` is not supported.")
169
+
170
+ # KV cache is updated before going into SCAN
171
+ if past_key_values is not None:
172
+ k, v = past_key_values.update(
173
+ attn_state=(k, v),
174
+ layer_idx=self.layer_idx,
175
+ offset=q.shape[2],
176
+ # We actually don't want to crop to window for the initial prompt, only for subsequent autoregressive tokens
177
+ cache_kwargs=dict(window_size=self.window_size) if q.shape[-2] == 1 else dict()
178
+ )['attn_state']
179
+
180
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
181
+ if mode == 'parallel':
182
+ # Split heads (but merge with batch dimension because kernels receive (B T C) shape)
183
+ q = rearrange(q, 'b t (h c) -> (b h) t c', h=self.num_heads)
184
+ k = rearrange(k, 'b t (h c) -> (b h) t c', h=self.num_kv_heads)
185
+ v = rearrange(v, 'b t (h c) -> (b h) t c', h=self.num_kv_heads)
186
+ s = rearrange(s, 'b t (h c) -> (b h) t c', h=self.num_kv_heads)
187
+ g = rearrange(g, 'b t (h s) -> (b h) t s', h=self.num_kv_heads)
188
+ o, recurrent_state = parallel_scan(
189
+ q=q,
190
+ k=k,
191
+ v=v,
192
+ s=s,
193
+ g=g,
194
+ window_size=self.window_size,
195
+ num_heads=self.num_heads,
196
+ alibi=self.alibi.to(q.device),
197
+ mask=self.mask.to(q.device),
198
+ initial_state=recurrent_state,
199
+ output_final_state=use_cache,
200
+ scale=self.scale,
201
+ head_first=False
202
+ )
203
+ o = rearrange(o, '(b h) t c -> b t (h c)', h=self.num_heads)
204
+ elif mode == 'naive':
205
+ # TODO: Implement naive recurrent SCAN for inference
206
+ q = rearrange(q, 'b t (h c) -> b h t c', h=self.num_heads)
207
+ k = rearrange(k, 'b t (h c) -> b h t c', h=self.num_kv_heads)
208
+ v = rearrange(v, 'b t (h c) -> b h t c', h=self.num_kv_heads)
209
+ s = rearrange(s, 'b t (h c) -> b h t c', h=self.num_kv_heads)
210
+ g = rearrange(g, 'b t (h s) -> b h t s', h=self.num_kv_heads)
211
+ o, recurrent_state = naive_recurrent_scan(
212
+ q=q,
213
+ k=k,
214
+ v=v,
215
+ s=s,
216
+ g=g,
217
+ window_size=self.window_size,
218
+ alibi=self.alibi.to(q.device),
219
+ mask=self.mask.to(q.device),
220
+ initial_state=recurrent_state,
221
+ output_final_state=use_cache,
222
+ scale=self.scale,
223
+ head_first=False
224
+ )
225
+ o = rearrange(o, 'b h t c -> b t (h c)', h=self.num_heads)
226
+ else:
227
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
228
+
229
+ # Update the recurrent state after SCAN
230
+ if past_key_values is not None:
231
+ past_key_values.update(
232
+ recurrent_state=recurrent_state,
233
+ layer_idx=self.layer_idx
234
+ )
235
+
236
+ o = rms_norm_linear(swish(o), self.norm.weight, self.norm.bias, self.o_proj.weight, self.o_proj.bias)
237
+ return o, None, past_key_values
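The attention score layout used by SCAN above concatenates `state_size` always-visible state slots with a `window_size`-token sliding window. A sketch of `scores_mask` with tiny illustrative sizes makes the `(T, S + W)` layout concrete:

```python
import torch

# Shape check mirroring scores_mask above: S state slots are always visible,
# followed by a sliding window of W past tokens, for each of T positions.
T, W, S = 8, 4, 2                                  # illustrative sizes

mask = torch.tril(torch.ones(W, W)).flip(1)        # lower-right triangle (W, W)
mask = torch.cat((mask, torch.ones(T - W, W)), 0)  # full window once t >= W
mask = torch.cat((torch.ones(T, S), mask), 1)      # state slots always visible

print(mask.shape)  # torch.Size([8, 6]) == (T, S + W)
```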
fla/layers/simple_gla.py ADDED
@@ -0,0 +1,252 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
3
+
4
+ from __future__ import annotations
5
+
6
+ from typing import TYPE_CHECKING, Optional, Tuple
7
+
8
+ import torch
9
+ import torch.nn as nn
10
+ import torch.nn.functional as F
11
+ from einops import rearrange, repeat
12
+
13
+ from fla.modules import FusedRMSNormSwishGate, RMSNorm, ShortConvolution
14
+ from fla.modules.activations import ACT2FN
15
+ from fla.ops.simple_gla import chunk_simple_gla, fused_recurrent_simple_gla
16
+
17
+ if TYPE_CHECKING:
18
+ from fla.models.utils import Cache
19
+
20
+
21
+ class SimpleGatedLinearAttention(nn.Module):
22
+ r"""
23
+    The layer implementation for [Gated Linear Attention Transformers with Hardware-Efficient Training](https://arxiv.org/abs/2312.06635). # noqa
24
+ This layer calls the simplified GLA kernel in which the gating is head-wise instead of elementwise.
25
+
26
+ Args:
27
+ mode (str, Optional):
28
+ Which GLA kernel to use.
29
+            Currently available: `chunk` and `fused_recurrent`.
30
+ Default: `chunk`.
31
+ hidden_size (int, Optional):
32
+ The hidden size of the input. Default: 1024.
33
+ expand_k (float, Optional):
34
+ The expansion ratio for the key dim. Default: 1.0.
35
+ expand_v (float, Optional):
36
+ The expansion ratio for the value dim. Default: 1.0.
37
+ num_heads (int, Optional):
38
+ The number of heads. Default: 4.
39
+ num_kv_heads (int, Optional):
40
+ The number of key/value heads, used for MQA. Default: None.
41
+ feature_map (str, Optional):
42
+ Feature map function applied to queries/keys. Default: None.
43
+ use_short_conv (bool, Optional):
44
+ Whether to use short convolutions. Default: `False`.
45
+ conv_size (int, Optional):
46
+ The kernel size of the short convolution, only used when `use_short_conv` is `True`. Default: 4.
47
+ conv_bias (bool, Optional):
48
+ Whether to use bias in the short convolution, only used when `use_short_conv` is `True`. Default: `False`.
49
+ gate_fn (str, Optional):
50
+ The activation function for the output gate. Default: `swish`.
51
+ elementwise_affine (bool, Optional):
52
+ If `True`, applies elementwise affine to LayerNorm with learnable parameters. Default: `True`.
53
+ norm_eps (float, Optional):
54
+ The epsilon value for the layernorm/rmsnorm layer. Default: 1e-5.
55
+ gate_logit_normalizer (int, Optional):
56
+            The normalizer for the gate logits, applied after `logsigmoid`. Default: 16.
57
+ fuse_norm (bool, Optional):
58
+ Whether to fuse the norm and the output gate for better memory footprint. Default: `True`.
59
+ layer_idx (int, Optional):
60
+ The index of the layer. Default: None.
61
+ """
62
+
63
+ def __init__(
64
+ self,
65
+ mode: str = 'chunk',
66
+ hidden_size: int = 1024,
67
+ expand_k: float = 1.,
68
+ expand_v: float = 1.,
69
+ num_heads: int = 4,
70
+ num_kv_heads: Optional[int] = None,
71
+ feature_map: Optional[str] = None,
72
+ use_short_conv: bool = True,
73
+ conv_size: int = 4,
74
+ conv_bias: bool = False,
75
+ gate_fn: str = 'swish',
76
+ elementwise_affine: Optional[bool] = True,
77
+ norm_eps: float = 1e-5,
78
+ gate_logit_normalizer: int = 16,
79
+ fuse_norm: bool = True,
80
+ layer_idx: int = None,
81
+ ) -> SimpleGatedLinearAttention:
82
+ super().__init__()
83
+
84
+ self.mode = mode
85
+ self.hidden_size = hidden_size
86
+ self.expand_k = expand_k
87
+ self.expand_v = expand_v
88
+ self.num_heads = num_heads
89
+ self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
90
+ self.num_kv_groups = self.num_heads // self.num_kv_heads
91
+ self.feature_map_fn = ACT2FN[feature_map] if feature_map is not None else None
92
+
93
+ self.use_short_conv = use_short_conv
94
+ self.conv_size = conv_size
95
+ self.conv_bias = conv_bias
96
+
97
+ self.key_dim = int(hidden_size * expand_k)
98
+ self.value_dim = int(hidden_size * expand_v)
99
+ self.key_dim_per_group = self.key_dim // self.num_kv_groups
100
+ self.value_dim_per_group = self.value_dim // self.num_kv_groups
101
+ self.layer_idx = layer_idx
102
+
103
+        assert mode in ['chunk', 'fused_recurrent'], f"Not supported mode `{mode}`."
104
+ assert self.key_dim % num_heads == 0, f"key dim must be divisible by num_heads of {num_heads}"
105
+ assert self.value_dim % num_heads == 0, f"value dim must be divisible by num_heads of {num_heads}"
106
+
107
+ self.head_qk_dim = self.key_dim // num_heads
108
+ self.head_v_dim = self.value_dim // num_heads
109
+
110
+ self.q_proj = nn.Linear(hidden_size, self.key_dim, bias=False)
111
+ self.k_proj = nn.Linear(hidden_size, self.key_dim_per_group, bias=False)
112
+ self.v_proj = nn.Linear(hidden_size, self.value_dim_per_group, bias=False)
113
+ self.g_proj = nn.Linear(hidden_size, self.value_dim, bias=False)
114
+
115
+ if use_short_conv:
116
+ self.conv_size = conv_size
117
+ self.q_conv1d = ShortConvolution(self.key_dim, conv_size, activation='silu')
118
+ self.k_conv1d = ShortConvolution(self.key_dim_per_group, conv_size, activation='silu')
119
+ self.v_conv1d = ShortConvolution(self.value_dim_per_group, conv_size, activation='silu')
120
+
121
+ self.gk_proj = nn.Linear(hidden_size, self.num_heads)
122
+
123
+ if gate_fn == 'swish' and fuse_norm:
124
+ self.g_norm_swish_gate = FusedRMSNormSwishGate(self.head_v_dim, elementwise_affine, norm_eps)
125
+ self.fuse_norm_and_gate = True
126
+ else:
127
+ self.fuse_norm_and_gate = False
128
+ self.g_norm = RMSNorm(hidden_size=self.head_v_dim, elementwise_affine=elementwise_affine, eps=norm_eps)
129
+ self.gate_fn = ACT2FN[gate_fn]
130
+ self.o_proj = nn.Linear(self.value_dim, hidden_size, bias=False)
131
+
132
+ self.gate_logit_normalizer = gate_logit_normalizer
133
+
134
+ self.apply(self._initialize_weights)
135
+
136
+ def _initialize_weights(self, module: nn.Module):
137
+ if getattr(module, "_is_hf_initialized", False):
138
+ return
139
+ if isinstance(module, nn.Linear):
140
+ nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
141
+ if module.bias is not None:
142
+ nn.init.zeros_(module.bias)
143
+ module._is_hf_initialized = True
144
+
145
+ def forward(
146
+ self,
147
+ hidden_states: torch.Tensor,
148
+ attention_mask: Optional[torch.Tensor] = None,
149
+ past_key_values: Optional[Cache] = None,
150
+ use_cache: Optional[bool] = False,
151
+ output_attentions: Optional[bool] = False,
152
+ **kwargs
153
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Cache]]:
154
+ if attention_mask is not None:
155
+ assert len(attention_mask.shape) == 2, (
156
+ "Expected attention_mask as a 0-1 matrix with shape [batch_size, seq_len] "
157
+ "for padding purposes (0 indicating padding). "
158
+ "Arbitrary attention masks of shape [batch_size, seq_len, seq_len] are not allowed."
159
+ )
160
+
161
+ # launching the triton kernel for just one token will actually be slower
162
+ mode = 'fused_recurrent' if hidden_states.shape[1] == 1 else self.mode
163
+
164
+ last_state = None
165
+ if past_key_values is not None and len(past_key_values) > self.layer_idx:
166
+ last_state = past_key_values[self.layer_idx]
167
+
168
+ if self.use_short_conv:
169
+ conv_state_q, conv_state_k, conv_state_v = None, None, None
170
+ if last_state is not None:
171
+ conv_state_q, conv_state_k, conv_state_v = last_state['conv_state']
172
+ conv_mask = attention_mask[:, -hidden_states.shape[1]:] if attention_mask is not None else None
173
+ q, conv_state_q = self.q_conv1d(x=self.q_proj(hidden_states),
174
+ mask=conv_mask,
175
+ cache=conv_state_q,
176
+ output_final_state=use_cache)
177
+ k, conv_state_k = self.k_conv1d(x=self.k_proj(hidden_states),
178
+ mask=conv_mask,
179
+ cache=conv_state_k,
180
+ output_final_state=use_cache)
181
+ v, conv_state_v = self.v_conv1d(x=self.v_proj(hidden_states),
182
+ mask=conv_mask,
183
+ cache=conv_state_v,
184
+ output_final_state=use_cache)
185
+ else:
186
+ q = self.q_proj(hidden_states)
187
+ k = self.k_proj(hidden_states)
188
+ v = self.v_proj(hidden_states)
189
+ gk = self.gk_proj(hidden_states)
190
+
191
+ if self.feature_map_fn is not None:
192
+ q, k = map(self.feature_map_fn, (q, k))
193
+ # dealing with left-padding
194
+ if attention_mask is not None:
195
+ v = v.mul_(attention_mask[:, -v.shape[-2]:, None])
196
+ q = rearrange(q, '... (h d) -> ... h d', h=self.num_heads)
197
+ if self.num_kv_groups > 1:
198
+ k, v = (repeat(x, '... (h d) -> ... (h g) d', h=self.num_kv_heads, g=self.num_kv_groups) for x in (k, v))
199
+ else:
200
+ k, v = (rearrange(x, '... (h d) -> ... h d', h=self.num_kv_heads) for x in (k, v))
201
+ gk = F.logsigmoid(gk) / self.gate_logit_normalizer
202
+
203
+ recurrent_state = last_state['recurrent_state'] if last_state is not None else None
204
+ if mode == 'chunk':
205
+ o, recurrent_state = chunk_simple_gla(
206
+ q=q,
207
+ k=k,
208
+ v=v,
209
+ gk=gk,
210
+ initial_state=recurrent_state,
211
+ output_final_state=use_cache,
212
+ head_first=False
213
+ )
214
+ elif mode == 'fused_recurrent':
215
+ o, recurrent_state = fused_recurrent_simple_gla(
216
+ q=q,
217
+ k=k,
218
+ v=v,
219
+ gk=gk,
220
+ initial_state=recurrent_state,
221
+ output_final_state=use_cache,
222
+ head_first=False
223
+ )
224
+ else:
225
+ raise NotImplementedError(f"Not supported mode `{mode}`.")
226
+
227
+ if past_key_values is not None:
228
+ past_key_values.update(
229
+ recurrent_state=recurrent_state,
230
+ conv_state=(conv_state_q, conv_state_k, conv_state_v) if self.use_short_conv else None,
231
+ layer_idx=self.layer_idx,
232
+ offset=q.shape[2]
233
+ )
234
+
235
+ g = self.g_proj(hidden_states)
236
+ if self.fuse_norm_and_gate:
237
+ g = rearrange(g, 'b t (h d) -> b t h d', h=self.num_heads)
238
+ o = self.g_norm_swish_gate(o, g)
239
+ o = rearrange(o, 'b t h d -> b t (h d)')
240
+ else:
241
+ o = rearrange(self.g_norm(o), 'b t h d -> b t (h d)')
242
+ o = o * self.gate_fn(g)
243
+ o = self.o_proj(o)
244
+
245
+ return o, None, past_key_values
246
+
247
+ def state_size(self, **kwargs) -> int:
248
+ state_size = self.key_dim * self.head_v_dim
249
+ for module in self.children():
250
+ if isinstance(module, ShortConvolution):
251
+ state_size += module.state_size
252
+ return state_size
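As a reference for what the `chunk_simple_gla`/`fused_recurrent_simple_gla` kernels compute, here is a naive per-step recurrence with head-wise (one scalar per head) decay. This is a sketch for intuition only, with illustrative sizes and no scaling term; it is not the fused implementation:

```python
import torch
import torch.nn.functional as F

# Naive simple-GLA recurrence: S_t = exp(g_t) * S_{t-1} + k_t^T v_t, o_t = q_t S_t
B, H, T, K, V = 1, 2, 6, 4, 4                   # illustrative sizes
q = torch.randn(B, H, T, K)
k = torch.randn(B, H, T, K)
v = torch.randn(B, H, T, V)
gk = F.logsigmoid(torch.randn(B, H, T)) / 16    # gate_logit_normalizer = 16

S = torch.zeros(B, H, K, V)                     # recurrent state per head
outs = []
for t in range(T):
    S = S * gk[:, :, t, None, None].exp()               # head-wise decay
    S = S + k[:, :, t, :, None] * v[:, :, t, None, :]   # outer-product update
    outs.append(torch.einsum('bhk,bhkv->bhv', q[:, :, t], S))
o = torch.stack(outs, dim=2)                    # (B, H, T, V)
print(o.shape)
```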
fla/models/__init__.py ADDED
@@ -0,0 +1,39 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from fla.models.abc import ABCConfig, ABCForCausalLM, ABCModel
4
+ from fla.models.bitnet import BitNetConfig, BitNetForCausalLM, BitNetModel
5
+ from fla.models.delta_net import (DeltaNetConfig, DeltaNetForCausalLM,
6
+ DeltaNetModel)
7
+ from fla.models.gla import GLAConfig, GLAForCausalLM, GLAModel
8
+ from fla.models.gsa import GSAConfig, GSAForCausalLM, GSAModel
9
+ from fla.models.hgrn import HGRNConfig, HGRNForCausalLM, HGRNModel
10
+ from fla.models.hgrn2 import HGRN2Config, HGRN2ForCausalLM, HGRN2Model
11
+ from fla.models.linear_attn import (LinearAttentionConfig,
12
+ LinearAttentionForCausalLM,
13
+ LinearAttentionModel)
14
+ from fla.models.mamba import MambaConfig, MambaForCausalLM, MambaModel
15
+ from fla.models.mamba2 import Mamba2Config, Mamba2ForCausalLM, Mamba2Model
16
+ from fla.models.retnet import RetNetConfig, RetNetForCausalLM, RetNetModel
17
+ from fla.models.rwkv6 import RWKV6Config, RWKV6ForCausalLM, RWKV6Model
18
+ from fla.models.samba import SambaConfig, SambaForCausalLM, SambaModel
19
+ from fla.models.scan import SCANConfig, SCANForCausalLM, SCANModel
20
+ from fla.models.transformer import (TransformerConfig, TransformerForCausalLM,
21
+ TransformerModel)
22
+
23
+ __all__ = [
24
+ 'ABCConfig', 'ABCForCausalLM', 'ABCModel',
25
+ 'BitNetConfig', 'BitNetForCausalLM', 'BitNetModel',
26
+ 'DeltaNetConfig', 'DeltaNetForCausalLM', 'DeltaNetModel',
27
+ 'GLAConfig', 'GLAForCausalLM', 'GLAModel',
28
+ 'GSAConfig', 'GSAForCausalLM', 'GSAModel',
29
+ 'HGRNConfig', 'HGRNForCausalLM', 'HGRNModel',
30
+ 'HGRN2Config', 'HGRN2ForCausalLM', 'HGRN2Model',
31
+ 'LinearAttentionConfig', 'LinearAttentionForCausalLM', 'LinearAttentionModel',
32
+ 'MambaConfig', 'MambaForCausalLM', 'MambaModel',
33
+ 'Mamba2Config', 'Mamba2ForCausalLM', 'Mamba2Model',
34
+ 'RetNetConfig', 'RetNetForCausalLM', 'RetNetModel',
35
+ 'RWKV6Config', 'RWKV6ForCausalLM', 'RWKV6Model',
36
+ 'SambaConfig', 'SambaForCausalLM', 'SambaModel',
37
+ 'SCANConfig', 'SCANForCausalLM', 'SCANModel',
38
+ 'TransformerConfig', 'TransformerForCausalLM', 'TransformerModel'
39
+ ]
fla/models/abc/__init__.py ADDED
@@ -0,0 +1,13 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
4
+
5
+ from fla.models.abc.configuration_abc import ABCConfig
6
+ from fla.models.abc.modeling_abc import ABCForCausalLM, ABCModel
7
+
8
+ AutoConfig.register(ABCConfig.model_type, ABCConfig)
9
+ AutoModel.register(ABCConfig, ABCModel)
10
+ AutoModelForCausalLM.register(ABCConfig, ABCForCausalLM)
11
+
12
+
13
+ __all__ = ['ABCConfig', 'ABCForCausalLM', 'ABCModel']
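Because the `__init__` above registers `ABCConfig` with the 🤗 auto classes, a model can be built through `AutoModelForCausalLM` once the package has been imported. A hedged usage sketch (the sizes are illustrative, not a released checkpoint):

```python
from transformers import AutoModelForCausalLM

# Importing from fla.models runs the AutoConfig/AutoModel registration above.
from fla.models import ABCConfig

config = ABCConfig(hidden_size=256, num_hidden_layers=2, num_heads=4, num_slots=32)
model = AutoModelForCausalLM.from_config(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```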
fla/models/abc/configuration_abc.py ADDED
@@ -0,0 +1,84 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from typing import Dict, Optional
4
+
5
+ from transformers.configuration_utils import PretrainedConfig
6
+
7
+
8
+ class ABCConfig(PretrainedConfig):
9
+
10
+ model_type = 'abc'
11
+ keys_to_ignore_at_inference = ['past_key_values']
12
+
13
+ def __init__(
14
+ self,
15
+ hidden_size: int = 2048,
16
+ gate_low_rank_dim: int = 16,
17
+ clamp_min: float = -32,
18
+ clamp_max: float = 32,
19
+ hidden_ratio: Optional[int] = 4,
20
+ intermediate_size: Optional[int] = None,
21
+ num_hidden_layers: int = 24,
22
+ num_heads: int = 4,
23
+ num_slots: Optional[int] = 64,
24
+ use_short_conv: bool = False,
25
+ conv_size: int = 4,
26
+        expand_k: float = 0.5,
27
+        expand_v: float = 1,
28
+ hidden_act: str = "swish",
29
+ max_position_embeddings: int = 2048,
30
+ elementwise_affine: Optional[bool] = True,
31
+ norm_eps: float = 1e-6,
32
+ attn: Optional[Dict] = None,
33
+ use_cache: bool = True,
34
+ pad_token_id: int = None,
35
+ bos_token_id: int = 1,
36
+ eos_token_id: int = 2,
37
+ tie_word_embeddings: bool = False,
38
+ initializer_range: float = 0.02,
39
+ fuse_norm: bool = True,
40
+ fuse_cross_entropy: bool = True,
41
+ vocab_size: int = 32000,
42
+ **kwargs
43
+ ):
44
+ self.hidden_size = hidden_size
45
+ self.gate_low_rank_dim = gate_low_rank_dim
46
+ self.clamp_min = clamp_min
47
+ self.clamp_max = clamp_max
48
+ self.hidden_ratio = hidden_ratio
49
+ self.intermediate_size = intermediate_size
50
+ self.num_hidden_layers = num_hidden_layers
51
+ self.num_heads = num_heads
52
+ self.num_slots = num_slots
53
+ self.use_short_conv = use_short_conv
54
+ self.conv_size = conv_size
55
+        self.expand_k = expand_k
56
+        self.expand_v = expand_v
57
+ self.hidden_act = hidden_act
58
+ self.max_position_embeddings = max_position_embeddings
59
+ self.elementwise_affine = elementwise_affine
60
+ self.norm_eps = norm_eps
61
+ self.attn = attn
62
+ self.use_cache = use_cache
63
+ self.initializer_range = initializer_range
64
+ self.fuse_norm = fuse_norm
65
+ self.fuse_cross_entropy = fuse_cross_entropy
66
+ self.vocab_size = vocab_size
67
+
68
+ if attn is not None:
69
+ if not isinstance(attn, Dict):
70
+ raise ValueError("attn must be a dictionary")
71
+ if 'layers' not in attn:
72
+ raise ValueError("Layer indices must be provided to initialize hybrid attention layers")
73
+ if 'num_heads' not in attn:
74
+ raise ValueError("Number of heads must be provided to initialize hybrid attention layers")
75
+ attn['num_kv_heads'] = attn.get('num_kv_heads', attn['num_heads'])
76
+ attn['window_size'] = attn.get('window_size', None)
77
+
78
+ super().__init__(
79
+ pad_token_id=pad_token_id,
80
+ bos_token_id=bos_token_id,
81
+ eos_token_id=eos_token_id,
82
+ tie_word_embeddings=tie_word_embeddings,
83
+ **kwargs,
84
+ )
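The validation block above documents the contract for the hybrid `attn` option: the dict must at least carry the hybrid layer indices and a head count, and the remaining keys are defaulted. A minimal sketch (values are illustrative):

```python
from fla.models.abc import ABCConfig

# layers 0 and 4 are replaced by standard softmax attention; the rest stay ABC
config = ABCConfig(
    hidden_size=512,
    num_hidden_layers=8,
    num_heads=4,
    attn={'layers': [0, 4], 'num_heads': 8},
)
print(config.attn)  # {'layers': [0, 4], 'num_heads': 8, 'num_kv_heads': 8, 'window_size': None}
```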
fla/models/abc/modeling_abc.py ADDED
@@ -0,0 +1,403 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from __future__ import annotations
4
+
5
+ import math
6
+ import warnings
7
+ from typing import List, Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.utils.checkpoint
12
+ from transformers.activations import ACT2FN
13
+ from transformers.generation import GenerationMixin
14
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
15
+ CausalLMOutputWithPast)
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import logging
18
+
19
+ from fla.layers.abc import ABCAttention
20
+ from fla.layers.attn import Attention
21
+ from fla.models.abc.configuration_abc import ABCConfig
22
+ from fla.models.utils import Cache
23
+ from fla.modules import (FusedCrossEntropyLoss, FusedLinearCrossEntropyLoss,
24
+ RMSNorm)
25
+ from fla.modules.activations import swiglu_linear
26
+
27
+ logger = logging.get_logger(__name__)
28
+
29
+
30
+ class ABCMLP(nn.Module):
31
+
32
+ def __init__(
33
+ self,
34
+ hidden_size: int,
35
+ hidden_ratio: Optional[int] = None,
36
+ intermediate_size: Optional[int] = None,
37
+ hidden_act: str = 'swish'
38
+ ) -> ABCMLP:
39
+ super().__init__()
40
+
41
+ self.hidden_size = hidden_size
42
+ # the final number of params is `hidden_ratio * hidden_size^2`
43
+ # `intermediate_size` is chosen to be a multiple of 256 closest to `2/3 * hidden_size * hidden_ratio`
44
+ if hidden_ratio is None:
45
+ hidden_ratio = 4
46
+ if intermediate_size is None:
47
+ intermediate_size = int(hidden_size * hidden_ratio * 2 / 3)
48
+ intermediate_size = 256 * ((intermediate_size + 256 - 1) // 256)
49
+ self.hidden_ratio = hidden_ratio
50
+ self.intermediate_size = intermediate_size
51
+
52
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
53
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
54
+ self.act_fn = ACT2FN[hidden_act]
55
+
56
+ def forward(self, x):
57
+ y = self.gate_proj(x)
58
+ gate, y = y.chunk(2, -1)
59
+ return swiglu_linear(gate, y, self.down_proj.weight, self.down_proj.bias)
60
+
61
+
62
+ class ABCBlock(nn.Module):
63
+ def __init__(self, config: ABCConfig, layer_idx: int):
64
+ super().__init__()
65
+ self.hidden_size = config.hidden_size
66
+
67
+ self.attn_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
68
+ if config.attn is not None and layer_idx in config.attn['layers']:
69
+ self.attn = Attention(
70
+ hidden_size=config.hidden_size,
71
+ num_heads=config.attn['num_heads'],
72
+ num_kv_heads=config.attn['num_kv_heads'],
73
+ window_size=config.attn['window_size'],
74
+ max_position_embeddings=config.max_position_embeddings,
75
+ layer_idx=layer_idx
76
+ )
77
+ else:
78
+ self.attn = ABCAttention(
79
+ hidden_size=config.hidden_size,
80
+ expand_k=config.expand_k,
81
+ expand_v=config.expand_v,
82
+ num_heads=config.num_heads,
83
+ num_slots=config.num_slots,
84
+ use_short_conv=config.use_short_conv,
85
+ conv_size=config.conv_size,
86
+ gate_fn=config.hidden_act,
87
+ elementwise_affine=config.elementwise_affine,
88
+ norm_eps=config.norm_eps,
89
+ clamp_min=config.clamp_min,
90
+ clamp_max=config.clamp_max,
91
+ fuse_norm=config.fuse_norm,
92
+ layer_idx=layer_idx
93
+ )
94
+ self.mlp_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
95
+ self.mlp = ABCMLP(
96
+ hidden_size=config.hidden_size,
97
+ hidden_ratio=config.hidden_ratio,
98
+ intermediate_size=config.intermediate_size,
99
+ hidden_act=config.hidden_act
100
+ )
101
+
102
+ def forward(
103
+ self,
104
+ hidden_states: torch.Tensor,
105
+ attention_mask: Optional[torch.Tensor] = None,
106
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
107
+ use_cache: Optional[bool] = False,
108
+ output_attentions: Optional[bool] = False,
109
+ **kwargs,
110
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
111
+
112
+ residual = hidden_states
113
+
114
+ hidden_states = self.attn_norm(hidden_states)
115
+ hidden_states, attentions, past_key_values = self.attn(
116
+ hidden_states=hidden_states,
117
+ attention_mask=attention_mask,
118
+ past_key_values=past_key_values,
119
+ use_cache=use_cache,
120
+ output_attentions=output_attentions
121
+ )
122
+ hidden_states, residual = self.mlp_norm(hidden_states, residual, True)
123
+ hidden_states = self.mlp(hidden_states)
124
+ hidden_states = residual + hidden_states
125
+
126
+ outputs = (hidden_states, attentions, past_key_values)
127
+
128
+ return outputs
129
+
130
+
131
+ class ABCPreTrainedModel(PreTrainedModel):
132
+
133
+ config_class = ABCConfig
134
+ supports_gradient_checkpointing = True
135
+ _no_split_modules = ['ABCBlock']
136
+
137
+ def __init__(self, *inputs, **kwargs):
138
+ super().__init__(*inputs, **kwargs)
139
+
140
+ def _init_weights(
141
+ self,
142
+ module: nn.Module,
143
+ rescale_prenorm_residual: bool = True,
144
+ num_residuals_per_layer: int = 2,
145
+ ):
146
+ if isinstance(module, (nn.Linear, nn.Conv1d)):
147
+ # Slightly different from the TF version which uses truncated_normal for initialization
148
+ # cf https://github.com/pytorch/pytorch/pull/5617
149
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
150
+ if module.bias is not None:
151
+ nn.init.zeros_(module.bias)
152
+ elif isinstance(module, nn.Embedding):
153
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
154
+ if module.padding_idx is not None:
155
+ module.weight.data[module.padding_idx].zero_()
156
+
157
+ if rescale_prenorm_residual:
158
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
159
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
160
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
161
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
162
+ #
163
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
164
+ for name, p in module.named_parameters():
165
+ if name in ["o_proj.weight", "down_proj.weight"]:
166
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
167
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
168
+ # We need to reinit p since this code could be called multiple times
169
+ # Having just p *= scale would repeatedly scale it down
170
+ with torch.no_grad():
171
+ p /= math.sqrt(num_residuals_per_layer * self.config.num_hidden_layers)
172
+
173
+
174
+ class ABCModel(ABCPreTrainedModel):
175
+
176
+ def __init__(self, config: ABCConfig):
177
+ super().__init__(config)
178
+ self.padding_idx = config.pad_token_id
179
+ self.vocab_size = config.vocab_size
180
+
181
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
182
+ self.layers = nn.ModuleList([ABCBlock(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
183
+ self.norm = RMSNorm(config.hidden_size, eps=config.norm_eps)
184
+
185
+ self.gradient_checkpointing = False
186
+
187
+ self.post_init()
188
+
189
+ def get_input_embeddings(self):
190
+ return self.embeddings
191
+
192
+ def set_input_embeddings(self, value):
193
+ self.embeddings = value
194
+
195
+ def forward(
196
+ self,
197
+ input_ids: Optional[torch.LongTensor] = None,
198
+ attention_mask: Optional[torch.Tensor] = None, # noqa
199
+ inputs_embeds: Optional[torch.FloatTensor] = None,
200
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
201
+ use_cache: Optional[bool] = None,
202
+ output_attentions: Optional[bool] = None,
203
+ output_hidden_states: Optional[bool] = None,
204
+ return_dict: Optional[bool] = None
205
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
206
+ if output_attentions:
207
+            warnings.warn("`ABCModel` does not support `output_attentions` for now; setting it to `False`.")
208
+ output_attentions = False
209
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
210
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
211
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
212
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
213
+
214
+ # retrieve input_ids and inputs_embeds
215
+ if input_ids is not None and inputs_embeds is not None:
216
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
217
+ if input_ids is None and inputs_embeds is None:
218
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
219
+
220
+ if inputs_embeds is None:
221
+ inputs_embeds = self.embeddings(input_ids)
222
+ hidden_states = inputs_embeds
223
+
224
+ if use_cache and not isinstance(past_key_values, Cache):
225
+ past_key_values = Cache.from_legacy_cache(past_key_values)
226
+
227
+ if self.gradient_checkpointing and self.training and use_cache:
228
+ logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...")
229
+ use_cache = False
230
+
231
+ all_hidden_states = () if output_hidden_states else None
232
+ all_attns = () if output_attentions else None
233
+ for layer in self.layers:
234
+ if output_hidden_states:
235
+ all_hidden_states += (hidden_states,)
236
+
237
+ if self.gradient_checkpointing and self.training:
238
+ hidden_states, attentions, past_key_values = self._gradient_checkpointing_func(
239
+ layer.__call__,
240
+ hidden_states,
241
+ attention_mask,
242
+ past_key_values,
243
+ use_cache,
244
+ output_attentions
245
+ )
246
+ else:
247
+ hidden_states, attentions, past_key_values = layer(
248
+ hidden_states,
249
+ attention_mask,
250
+ past_key_values=past_key_values,
251
+ use_cache=use_cache,
252
+ output_attentions=output_attentions
253
+ )
254
+
255
+ if output_attentions:
256
+ all_attns += (attentions,)
257
+
258
+ hidden_states = self.norm(hidden_states)
259
+
260
+ # add hidden states from the last decoder layer
261
+ if output_hidden_states:
262
+ all_hidden_states += (hidden_states,)
263
+
264
+ if not return_dict:
265
+ return tuple(i for i in [hidden_states, past_key_values, all_hidden_states, all_attns] if i is not None)
266
+ return BaseModelOutputWithPast(
267
+ last_hidden_state=hidden_states,
268
+ past_key_values=past_key_values,
269
+ hidden_states=all_hidden_states,
270
+ attentions=all_attns
271
+ )
272
+
273
+
274
+ class ABCForCausalLM(ABCPreTrainedModel, GenerationMixin):
275
+
276
+ _tied_weights_keys = ["lm_head.weight"]
277
+
278
+ def __init__(self, config):
279
+ super().__init__(config)
280
+ self.model = ABCModel(config)
281
+ self.vocab_size = config.vocab_size
282
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
283
+
284
+ # Initialize weights and apply final processing
285
+ self.post_init()
286
+
287
+ def get_input_embeddings(self):
288
+ return self.model.embeddings
289
+
290
+ def set_input_embeddings(self, value):
291
+ self.model.embeddings = value
292
+
293
+ def get_output_embeddings(self):
294
+ return self.lm_head
295
+
296
+ def set_output_embeddings(self, new_embeddings):
297
+ self.lm_head = new_embeddings
298
+
299
+ def set_decoder(self, decoder):
300
+ self.model = decoder
301
+
302
+ def get_decoder(self):
303
+ return self.model
304
+
305
+ def generate(self, *args, **kwargs):
306
+ try:
307
+ return super().generate(*args, **kwargs)
308
+ except AttributeError as exception:
309
+ if 'past_key_values' in str(exception):
310
+ raise AttributeError(
311
+ f"You tried to call `generate` with a decoding strategy that manipulates `past_key_values`, "
312
+ f"which is not supported for {self.__class__.__name__}. "
313
+ f"Try another generation strategy instead. "
314
+ f"For the available generation strategies, check this doc: "
315
+ f"https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies"
316
+ )
317
+ else:
318
+ raise exception
319
+
320
+ def prepare_inputs_for_generation(
321
+ self,
322
+ input_ids: torch.LongTensor = None,
323
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
324
+ inputs_embeds: Optional[torch.FloatTensor] = None,
325
+ **kwargs
326
+ ):
327
+ # only last token for `inputs_ids` if the `past_key_values` is passed along.
328
+ if past_key_values is not None:
329
+ input_ids = input_ids[:, -1:]
330
+
331
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
332
+ if inputs_embeds is not None and past_key_values is None:
333
+ model_inputs = {'inputs_embeds': inputs_embeds}
334
+ else:
335
+ model_inputs = {'input_ids': input_ids}
336
+ model_inputs['past_key_values'] = past_key_values
337
+ return model_inputs
338
+
339
+ def forward(
340
+ self,
341
+ input_ids: torch.LongTensor = None,
342
+ attention_mask: Optional[torch.Tensor] = None,
343
+ inputs_embeds: Optional[torch.Tensor] = None,
344
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
345
+ labels: Optional[torch.LongTensor] = None,
346
+ use_cache: Optional[bool] = None,
347
+ output_attentions: Optional[bool] = None,
348
+ output_hidden_states: Optional[bool] = None,
349
+ return_dict: Optional[bool] = None,
350
+ num_logits_to_keep: Optional[int] = 0
351
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
352
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
353
+ output_hidden_states = (
354
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
355
+ )
356
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
357
+
358
+ outputs = self.model(
359
+ input_ids=input_ids,
360
+ attention_mask=attention_mask,
361
+ inputs_embeds=inputs_embeds,
362
+ past_key_values=past_key_values,
363
+ use_cache=use_cache,
364
+ output_attentions=output_attentions,
365
+ output_hidden_states=output_hidden_states,
366
+ return_dict=return_dict
367
+ )
368
+
369
+ hidden_states = outputs[0]
370
+ fuse_linear_and_cross_entropy = self.config.fuse_cross_entropy and self.training
371
+ logits = None if fuse_linear_and_cross_entropy else self.lm_head(hidden_states[:, -num_logits_to_keep:])
372
+
373
+ loss = None
374
+ if labels is not None:
375
+ if self.config.fuse_cross_entropy:
376
+ if fuse_linear_and_cross_entropy:
377
+ loss_fct = FusedLinearCrossEntropyLoss()
378
+ else:
379
+ loss_fct = FusedCrossEntropyLoss(inplace_backward=True)
380
+ else:
381
+ loss_fct = nn.CrossEntropyLoss()
382
+ # Enable model parallelism
383
+ labels = labels.to(hidden_states.device)
384
+ labels = torch.cat((labels[..., 1:], torch.full_like(labels[:, :1], loss_fct.ignore_index)), 1)
385
+ if fuse_linear_and_cross_entropy:
386
+ loss = loss_fct(hidden_states.view(-1, self.config.hidden_size),
387
+ labels.view(-1),
388
+ self.lm_head.weight,
389
+ self.lm_head.bias)
390
+ else:
391
+ loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
392
+
393
+ if not return_dict:
394
+ output = (logits,) + outputs[1:]
395
+ return (loss,) + output if loss is not None else output
396
+
397
+ return CausalLMOutputWithPast(
398
+ loss=loss,
399
+ logits=logits,
400
+ past_key_values=outputs.past_key_values,
401
+ hidden_states=outputs.hidden_states,
402
+ attentions=outputs.attentions,
403
+ )
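One detail in `ABCForCausalLM.forward` worth spelling out: rather than slicing `logits[:, :-1]` against `labels[:, 1:]`, the labels are shifted left and padded with the ignore index so they stay aligned with the unsliced logits. A tiny standalone check:

```python
import torch

ignore_index = -100  # default ignore index of the cross-entropy losses
labels = torch.tensor([[10, 11, 12, 13]])
shifted = torch.cat((labels[..., 1:], torch.full_like(labels[:, :1], ignore_index)), 1)
print(shifted)  # tensor([[  11,   12,   13, -100]])
```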
fla/models/bitnet/__init__.py ADDED
@@ -0,0 +1,13 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
4
+
5
+ from fla.models.bitnet.configuration_bitnet import BitNetConfig
6
+ from fla.models.bitnet.modeling_bitnet import BitNetForCausalLM, BitNetModel
7
+
8
+ AutoConfig.register(BitNetConfig.model_type, BitNetConfig)
9
+ AutoModel.register(BitNetConfig, BitNetModel)
10
+ AutoModelForCausalLM.register(BitNetConfig, BitNetForCausalLM)
11
+
12
+
13
+ __all__ = ['BitNetConfig', 'BitNetForCausalLM', 'BitNetModel']
fla/models/bitnet/configuration_bitnet.py ADDED
@@ -0,0 +1,68 @@
+ # -*- coding: utf-8 -*-
+
+ from typing import Optional
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class BitNetConfig(PretrainedConfig):
+
+     model_type = 'bitnet'
+     keys_to_ignore_at_inference = ['past_key_values']
+
+     def __init__(
+         self,
+         vocab_size: int = 32000,
+         hidden_size: int = 2048,
+         num_hidden_layers: int = 24,
+         num_heads: int = 32,
+         num_kv_heads: int = None,
+         window_size: Optional[int] = None,
+         rope_theta: Optional[float] = 10000.,
+         max_position_embeddings: int = 2048,
+         hidden_ratio: Optional[int] = 4,
+         intermediate_size: Optional[int] = None,
+         hidden_act: str = "swish",
+         initializer_range: float = 0.02,
+         elementwise_affine: Optional[bool] = True,
+         norm_first: bool = False,
+         norm_eps: float = 1e-6,
+         use_cache: bool = True,
+         pad_token_id: int = None,
+         bos_token_id: int = 1,
+         eos_token_id: int = 2,
+         tie_word_embeddings: bool = False,
+         attention_bias: bool = False,
+         fuse_norm: bool = True,
+         fuse_cross_entropy: bool = True,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_heads = num_heads
+         self.num_kv_heads = num_kv_heads
+         self.window_size = window_size
+         self.rope_theta = rope_theta
+         self.max_position_embeddings = max_position_embeddings
+
+         self.hidden_ratio = hidden_ratio
+         self.intermediate_size = intermediate_size
+         self.hidden_act = hidden_act
+
+         self.initializer_range = initializer_range
+         self.elementwise_affine = elementwise_affine
+         self.norm_first = norm_first
+         self.norm_eps = norm_eps
+         self.use_cache = use_cache
+         self.attention_bias = attention_bias
+         self.fuse_cross_entropy = fuse_cross_entropy
+         self.fuse_norm = fuse_norm
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
fla/models/bitnet/modeling_bitnet.py ADDED
@@ -0,0 +1,428 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from __future__ import annotations
4
+
5
+ import math
6
+ import warnings
7
+ from typing import List, Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.utils.checkpoint
12
+ from transformers.activations import ACT2FN
13
+ from transformers.generation import GenerationMixin
14
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
15
+ CausalLMOutputWithPast)
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import logging
18
+
19
+ from fla.layers.bitattn import BitAttention
20
+ from fla.models.bitnet.configuration_bitnet import BitNetConfig
21
+ from fla.models.utils import Cache
22
+ from fla.modules import (FusedCrossEntropyLoss, FusedLinearCrossEntropyLoss,
23
+ RMSNorm)
24
+ from fla.modules.activations import swiglu_bitlinear
25
+ from fla.modules.fused_bitlinear import BitLinear, rms_norm_linear_quant
26
+
27
+ logger = logging.get_logger(__name__)
28
+
29
+
30
+ class BitNetMLP(nn.Module):
31
+
32
+ def __init__(
33
+ self,
34
+ hidden_size: int,
35
+ hidden_ratio: Optional[int] = None,
36
+ intermediate_size: Optional[int] = None,
37
+ hidden_act: str = 'swish',
38
+ norm_first: bool = True,
39
+ norm_eps: float = 1e-5
40
+ ) -> BitNetMLP:
41
+ super().__init__()
42
+
43
+ self.hidden_size = hidden_size
44
+ # the final number of params is `hidden_ratio * hidden_size^2`
45
+ # `intermediate_size` is chosen to be a multiple of 256 closest to `2/3 * hidden_size * hidden_ratio`
46
+ if hidden_ratio is None:
47
+ hidden_ratio = 4
48
+ if intermediate_size is None:
49
+ intermediate_size = int(hidden_size * hidden_ratio * 2 / 3)
50
+ intermediate_size = 256 * ((intermediate_size + 256 - 1) // 256)
51
+ self.hidden_ratio = hidden_ratio
52
+ self.intermediate_size = intermediate_size
53
+ self.norm_first = norm_first
54
+
55
+ if norm_first:
56
+ self.norm = RMSNorm(hidden_size=hidden_size, eps=norm_eps)
57
+
58
+ self.gate_proj = BitLinear(self.hidden_size, self.intermediate_size * 2, bias=False)
59
+ self.down_proj = BitLinear(self.intermediate_size, self.hidden_size, bias=False)
60
+ self.act_fn = ACT2FN[hidden_act]
61
+
62
+ def forward(self, x):
63
+ if self.norm_first:
64
+ x = rms_norm_linear_quant(x, self.norm.weight, self.norm.bias, self.gate_proj.weight, self.gate_proj.bias)
65
+ else:
66
+ x = self.gate_proj(x)
67
+ gate, y = x.chunk(2, -1)
68
+ return swiglu_bitlinear(gate, y, self.down_proj.weight, self.down_proj.bias)
69
+
70
+
71
+ class BitNetBlock(nn.Module):
72
+
73
+ def __init__(self, config: BitNetConfig, layer_idx: int):
74
+ super().__init__()
75
+
76
+ self.hidden_size = config.hidden_size
77
+
78
+ if not config.norm_first:
79
+ self.attn_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
80
+ self.attn = BitAttention(
81
+ hidden_size=config.hidden_size,
82
+ num_heads=config.num_heads,
83
+ num_kv_heads=config.num_kv_heads,
84
+ window_size=config.window_size,
85
+ rope_theta=config.rope_theta,
86
+ max_position_embeddings=config.max_position_embeddings,
87
+ norm_first=config.norm_first,
88
+ norm_eps=config.norm_eps,
89
+ layer_idx=layer_idx
90
+ )
91
+ if not config.norm_first:
92
+ self.mlp_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
93
+ self.mlp = BitNetMLP(
94
+ hidden_size=config.hidden_size,
95
+ hidden_ratio=config.hidden_ratio,
96
+ intermediate_size=config.intermediate_size,
97
+ hidden_act=config.hidden_act,
98
+ norm_first=config.norm_first,
99
+ norm_eps=config.norm_eps
100
+ )
101
+
102
+ def forward(
103
+ self,
104
+ hidden_states: torch.Tensor,
105
+ attention_mask: Optional[torch.Tensor] = None,
106
+ past_key_values: Optional[Tuple[torch.Tensor]] = None,
107
+ output_attentions: Optional[bool] = False,
108
+ use_cache: Optional[bool] = False,
109
+ **kwargs,
110
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
111
+
112
+ residual = hidden_states
113
+ if hasattr(self, 'attn_norm'):
114
+ hidden_states = self.attn_norm(hidden_states)
115
+ hidden_states, attentions, past_key_values = self.attn(
116
+ hidden_states=hidden_states,
117
+ attention_mask=attention_mask,
118
+ past_key_values=past_key_values,
119
+ use_cache=use_cache,
120
+ output_attentions=output_attentions
121
+ )
122
+ if hasattr(self, 'mlp_norm'):
123
+ hidden_states, residual = self.mlp_norm(hidden_states, residual, True)
124
+ else:
125
+ hidden_states = residual + hidden_states
126
+ residual = hidden_states
127
+ hidden_states = self.mlp(hidden_states)
128
+ hidden_states = residual + hidden_states
129
+
130
+ outputs = (hidden_states,)
131
+
132
+ if output_attentions:
133
+ outputs += (attentions,)
134
+
135
+ if use_cache:
136
+ outputs += (past_key_values,)
137
+
138
+ return outputs
139
+
140
+
141
+ class BitNetPreTrainedModel(PreTrainedModel):
142
+
143
+ config_class = BitNetConfig
144
+ supports_gradient_checkpointing = True
145
+ _no_split_modules = ['BitNetBlock']
146
+
147
+ def __init__(self, *inputs, **kwargs):
148
+ super().__init__(*inputs, **kwargs)
149
+
150
+ def _init_weights(
151
+ self,
152
+ module: nn.Module,
153
+ rescale_prenorm_residual: bool = False,
154
+ num_residuals_per_layer: int = 2,
155
+ ):
156
+ if isinstance(module, (BitLinear, nn.Conv1d)):
157
+ # Slightly different from the TF version which uses truncated_normal for initialization
158
+ # cf https://github.com/pytorch/pytorch/pull/5617
159
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
160
+ if module.bias is not None:
161
+ nn.init.zeros_(module.bias)
162
+ elif isinstance(module, nn.Embedding):
163
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
164
+ if module.padding_idx is not None:
165
+ module.weight.data[module.padding_idx].zero_()
166
+
167
+ if rescale_prenorm_residual:
168
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
169
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
170
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
171
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
172
+ #
173
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
174
+ for name, p in module.named_parameters():
175
+ if name in ["o_proj.weight", "down_proj.weight"]:
176
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
177
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
178
+ # We need to reinit p since this code could be called multiple times
179
+ # Having just p *= scale would repeatedly scale it down
180
+ with torch.no_grad():
181
+ p /= math.sqrt(num_residuals_per_layer * self.config.num_hidden_layers)
182
+
183
+
184
+ class BitNetModel(BitNetPreTrainedModel):
185
+
186
+ def __init__(self, config: BitNetConfig):
187
+ super().__init__(config)
188
+ self.padding_idx = config.pad_token_id
189
+ self.vocab_size = config.vocab_size
190
+
191
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
192
+ self.layers = nn.ModuleList([BitNetBlock(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
193
+ self.norm = RMSNorm(config.hidden_size, eps=config.norm_eps)
194
+
195
+ self.gradient_checkpointing = False
196
+
197
+ self.post_init()
198
+
199
+ def get_input_embeddings(self):
200
+ return self.embeddings
201
+
202
+ def set_input_embeddings(self, value):
203
+ self.embeddings = value
204
+
205
+ def forward(
206
+ self,
207
+ input_ids: Optional[torch.LongTensor] = None,
208
+ attention_mask: Optional[torch.Tensor] = None,
209
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
210
+ inputs_embeds: Optional[torch.FloatTensor] = None,
211
+ use_cache: Optional[bool] = None,
212
+ output_attentions: Optional[bool] = None,
213
+ output_hidden_states: Optional[bool] = None,
214
+ return_dict: Optional[bool] = None
215
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
216
+ if output_attentions:
217
+ warnings.warn(
218
+ "`BitNetModel` does not support output attention weights now, so `output_attentions` is set to `False`."
219
+ )
220
+ output_attentions = False
221
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
222
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
223
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
224
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
225
+
226
+ # retrieve input_ids and inputs_embeds
227
+ if input_ids is not None and inputs_embeds is not None:
228
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
229
+ elif input_ids is None and inputs_embeds is None:
230
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
231
+
232
+ if use_cache and not isinstance(past_key_values, Cache):
233
+ past_key_values = Cache.from_legacy_cache(past_key_values)
234
+
235
+ if inputs_embeds is None:
236
+ inputs_embeds = self.embeddings(input_ids)
237
+
238
+ # embed positions
239
+ hidden_states = inputs_embeds
240
+
241
+ if self.gradient_checkpointing and self.training:
242
+ if use_cache:
243
+ logger.warning_once(
244
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
245
+ )
246
+ use_cache = False
247
+
248
+ all_hidden_states = () if output_hidden_states else None
249
+ all_attns = () if output_attentions else None
250
+ next_cache = None
251
+
252
+ for layer in self.layers:
253
+ if output_hidden_states:
254
+ all_hidden_states += (hidden_states,)
255
+
256
+ if self.gradient_checkpointing and self.training:
257
+ layer_outputs = self._gradient_checkpointing_func(
258
+ layer.__call__,
259
+ hidden_states,
260
+ attention_mask,
261
+ past_key_values,
262
+ output_attentions,
263
+ use_cache
264
+ )
265
+ else:
266
+ layer_outputs = layer(
267
+ hidden_states,
268
+ attention_mask=attention_mask,
269
+ past_key_values=past_key_values,
270
+ output_attentions=output_attentions,
271
+ use_cache=use_cache
272
+ )
273
+
274
+ hidden_states = layer_outputs[0]
275
+
276
+ if use_cache:
277
+ next_cache = layer_outputs[2 if output_attentions else 1]
278
+
279
+ if output_attentions:
280
+ all_attns += (layer_outputs[1],)
281
+
282
+ hidden_states = self.norm(hidden_states)
283
+
284
+ # add hidden states from the last decoder layer
285
+ if output_hidden_states:
286
+ all_hidden_states += (hidden_states,)
287
+
288
+ if not return_dict:
289
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_attns] if v is not None)
290
+
291
+ return BaseModelOutputWithPast(
292
+ last_hidden_state=hidden_states,
293
+ past_key_values=next_cache,
294
+ hidden_states=all_hidden_states,
295
+ attentions=all_attns
296
+ )
297
+
298
+
299
+ class BitNetForCausalLM(BitNetPreTrainedModel, GenerationMixin):
300
+
301
+ _tied_weights_keys = ["lm_head.weight"]
302
+
303
+ def __init__(self, config):
304
+ super().__init__(config)
305
+ self.model = BitNetModel(config)
306
+ self.vocab_size = config.vocab_size
307
+ self.lm_head = BitLinear(config.hidden_size, config.vocab_size, bias=False)
308
+
309
+ # Initialize weights and apply final processing
310
+ self.post_init()
311
+
312
+ def get_input_embeddings(self):
313
+ return self.model.embeddings
314
+
315
+ def set_input_embeddings(self, value):
316
+ self.model.embeddings = value
317
+
318
+ def get_output_embeddings(self):
319
+ return self.lm_head
320
+
321
+ def set_output_embeddings(self, new_embeddings):
322
+ self.lm_head = new_embeddings
323
+
324
+ def set_decoder(self, decoder):
325
+ self.model = decoder
326
+
327
+ def get_decoder(self):
328
+ return self.model
329
+
330
+ def prepare_inputs_for_generation(
331
+ self,
332
+ input_ids: torch.LongTensor = None,
333
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
334
+ attention_mask: Optional[torch.Tensor] = None,
335
+ inputs_embeds: Optional[torch.Tensor] = None,
336
+ use_cache: bool = True,
337
+ num_logits_to_keep: Optional[int] = None,
338
+ **kwargs
339
+ ):
340
+ # only last token for `inputs_ids` if the `past_key_values` is passed along.
341
+ if past_key_values is not None:
342
+ input_ids = input_ids[:, -1:]
343
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
344
+ if inputs_embeds is not None and past_key_values is None:
345
+ model_inputs = {'inputs_embeds': inputs_embeds}
346
+ else:
347
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
348
+ # recompiles graphs as the stride of the inputs is a guard.
349
+ # Ref: https://github.com/huggingface/transformers/pull/29114
350
+ # TODO: use `next_tokens` directly instead.
351
+ model_inputs = {'input_ids': input_ids.contiguous()}
352
+
353
+ if num_logits_to_keep is not None:
354
+ model_inputs['num_logits_to_keep'] = num_logits_to_keep
355
+
356
+ model_inputs.update({
357
+ 'past_key_values': past_key_values,
358
+ 'use_cache': use_cache,
359
+ 'attention_mask': attention_mask,
360
+ 'num_logits_to_keep': num_logits_to_keep,
361
+ })
362
+ return model_inputs
363
+
364
+ def forward(
365
+ self,
366
+ input_ids: torch.LongTensor = None,
367
+ attention_mask: Optional[torch.Tensor] = None,
368
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
369
+ inputs_embeds: Optional[torch.FloatTensor] = None,
370
+ labels: Optional[torch.LongTensor] = None,
371
+ use_cache: Optional[bool] = None,
372
+ output_attentions: Optional[bool] = None,
373
+ output_hidden_states: Optional[bool] = None,
374
+ return_dict: Optional[bool] = None,
375
+ num_logits_to_keep: Optional[int] = 0
376
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
377
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
378
+ output_hidden_states = (
379
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
380
+ )
381
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
382
+
383
+ outputs = self.model(
384
+ input_ids=input_ids,
385
+ attention_mask=attention_mask,
386
+ past_key_values=past_key_values,
387
+ inputs_embeds=inputs_embeds,
388
+ use_cache=use_cache,
389
+ output_attentions=output_attentions,
390
+ output_hidden_states=output_hidden_states,
391
+ return_dict=return_dict
392
+ )
393
+
394
+ hidden_states = outputs[0]
395
+ fuse_linear_and_cross_entropy = self.config.fuse_cross_entropy and self.training
396
+ logits = None if fuse_linear_and_cross_entropy else self.lm_head(hidden_states[:, -num_logits_to_keep:])
397
+
398
+ loss = None
399
+ if labels is not None:
400
+ if self.config.fuse_cross_entropy:
401
+ if fuse_linear_and_cross_entropy:
402
+ loss_fct = FusedLinearCrossEntropyLoss()
403
+ else:
404
+ loss_fct = FusedCrossEntropyLoss(inplace_backward=True)
405
+ else:
406
+ loss_fct = nn.CrossEntropyLoss()
407
+ # Enable model parallelism
408
+ labels = labels.to(hidden_states.device)
409
+ labels = torch.cat((labels[..., 1:], torch.full_like(labels[:, :1], loss_fct.ignore_index)), 1)
410
+ if fuse_linear_and_cross_entropy:
411
+ loss = loss_fct(hidden_states.view(-1, self.config.hidden_size),
412
+ labels.view(-1),
413
+ self.lm_head.weight,
414
+ self.lm_head.bias)
415
+ else:
416
+ loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
417
+
418
+ if not return_dict:
419
+ output = (logits,) + outputs[1:]
420
+ return (loss,) + output if loss is not None else output
421
+
422
+ return CausalLMOutputWithPast(
423
+ loss=loss,
424
+ logits=logits,
425
+ past_key_values=outputs.past_key_values,
426
+ hidden_states=outputs.hidden_states,
427
+ attentions=outputs.attentions,
428
+ )
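
`BitNetMLP` above derives `intermediate_size` when it is not given explicitly: take `2/3 * hidden_size * hidden_ratio` and round up to the next multiple of 256. A small worked example of that rule (the helper name and values are illustrative):

```python
def derive_intermediate_size(hidden_size: int, hidden_ratio: int = 4) -> int:
    # 2/3 of hidden_size * hidden_ratio, rounded up to the next multiple of 256
    intermediate_size = int(hidden_size * hidden_ratio * 2 / 3)
    return 256 * ((intermediate_size + 256 - 1) // 256)

print(derive_intermediate_size(2048))  # 5632: int(2048 * 4 * 2 / 3) = 5461 -> 22 * 256
print(derive_intermediate_size(1024))  # 2816
```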
fla/models/delta_net/__init__.py ADDED
@@ -0,0 +1,13 @@
+ # -*- coding: utf-8 -*-
+
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
+
+ from fla.models.delta_net.configuration_delta_net import DeltaNetConfig
+ from fla.models.delta_net.modeling_delta_net import (DeltaNetForCausalLM,
+                                                      DeltaNetModel)
+
+ AutoConfig.register(DeltaNetConfig.model_type, DeltaNetConfig)
+ AutoModel.register(DeltaNetConfig, DeltaNetModel)
+ AutoModelForCausalLM.register(DeltaNetConfig, DeltaNetForCausalLM)
+
+ __all__ = ['DeltaNetConfig', 'DeltaNetForCausalLM', 'DeltaNetModel']
fla/models/delta_net/configuration_delta_net.py ADDED
@@ -0,0 +1,87 @@
+ # -*- coding: utf-8 -*-
+
+ from typing import Dict, Optional
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class DeltaNetConfig(PretrainedConfig):
+
+     model_type = 'delta_net'
+     keys_to_ignore_at_inference = ['past_key_values']
+
+     def __init__(
+         self,
+         attn_mode: str = "chunk",
+         hidden_size: int = 2048,
+         expand_k: int = 1,
+         expand_v: int = 1,
+         use_gate: bool = False,
+         use_short_conv: bool = True,
+         conv_size: int = 4,
+         use_beta: bool = True,
+         use_output_norm: bool = True,
+         num_heads: int = 16,
+         qk_norm: str = 'l2',
+         qk_activation: str = 'silu',
+         max_position_embeddings: int = 2048,
+         hidden_ratio: Optional[int] = 4,
+         intermediate_size: Optional[int] = None,
+         hidden_act: str = "swish",
+         num_hidden_layers: int = 24,
+         norm_first: bool = False,
+         norm_eps: float = 1e-6,
+         attn: Optional[Dict] = None,
+         use_cache: bool = True,
+         pad_token_id: int = None,
+         bos_token_id: int = 1,
+         eos_token_id: int = 2,
+         tie_word_embeddings: bool = False,
+         initializer_range: float = 0.02,
+         fuse_cross_entropy: bool = True,
+         vocab_size: int = 32000,
+         **kwargs
+     ):
+         self.attn_mode = attn_mode
+         self.hidden_size = hidden_size
+         self.expand_k = expand_k
+         self.expand_v = expand_v
+         self.use_gate = use_gate
+         self.use_short_conv = use_short_conv
+         self.conv_size = conv_size
+         self.use_beta = use_beta
+         self.use_output_norm = use_output_norm
+         self.num_heads = num_heads
+         self.qk_norm = qk_norm
+         self.qk_activation = qk_activation
+         self.max_position_embeddings = max_position_embeddings
+
+         self.hidden_ratio = hidden_ratio
+         self.intermediate_size = intermediate_size
+         self.hidden_act = hidden_act
+         self.num_hidden_layers = num_hidden_layers
+         self.norm_first = norm_first
+         self.norm_eps = norm_eps
+         self.attn = attn
+         self.use_cache = use_cache
+         self.initializer_range = initializer_range
+         self.fuse_cross_entropy = fuse_cross_entropy
+         self.vocab_size = vocab_size
+
+         if attn is not None:
+             if not isinstance(attn, Dict):
+                 raise ValueError("attn must be a dictionary")
+             if 'layers' not in attn:
+                 raise ValueError("Layer indices must be provided to initialize hybrid attention layers")
+             if 'num_heads' not in attn:
+                 raise ValueError("Number of heads must be provided to initialize hybrid attention layers")
+             attn['num_kv_heads'] = attn.get('num_kv_heads', attn['num_heads'])
+             attn['window_size'] = attn.get('window_size', None)
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
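
The `attn` argument above switches selected layers from DeltaNet to standard softmax attention; `layers` and `num_heads` are required, while `num_kv_heads` and `window_size` fall back to defaults. A hedged example of building such a hybrid config (layer indices and sizes are illustrative):

```python
from fla.models.delta_net import DeltaNetConfig

config = DeltaNetConfig(
    hidden_size=1024,
    num_hidden_layers=12,
    num_heads=8,
    # these block indices use standard attention instead of DeltaNet
    attn={'layers': [3, 7, 11], 'num_heads': 8, 'window_size': 1024},
)
print(config.attn)  # 'num_kv_heads' is filled in from 'num_heads' when omitted
```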
fla/models/delta_net/modeling_delta_net.py ADDED
@@ -0,0 +1,439 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from __future__ import annotations
4
+
5
+ import math
6
+ import warnings
7
+ from typing import List, Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.utils.checkpoint
12
+ from transformers.activations import ACT2FN
13
+ from transformers.generation import GenerationMixin
14
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
15
+ CausalLMOutputWithPast)
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import logging
18
+
19
+ from fla.layers.attn import Attention
20
+ from fla.layers.delta_net import DeltaNet
21
+ from fla.models.delta_net.configuration_delta_net import DeltaNetConfig
22
+ from fla.models.utils import Cache
23
+ from fla.modules import (FusedCrossEntropyLoss, FusedLinearCrossEntropyLoss,
24
+ RMSNorm)
25
+ from fla.modules.activations import swiglu_linear
26
+ from fla.modules.layernorm import rms_norm_linear
27
+
28
+ logger = logging.get_logger(__name__)
29
+
30
+
31
+ class DeltaNetMLP(nn.Module):
32
+
33
+ def __init__(
34
+ self,
35
+ hidden_size: int,
36
+ hidden_ratio: Optional[int] = None,
37
+ intermediate_size: Optional[int] = None,
38
+ hidden_act: str = 'swish',
39
+ norm_first: bool = True,
40
+ norm_eps: float = 1e-5
41
+ ) -> DeltaNetMLP:
42
+ super().__init__()
43
+
44
+ self.hidden_size = hidden_size
45
+ # the final number of params is `hidden_ratio * hidden_size^2`
46
+ # `intermediate_size` is chosen to be a multiple of 256 closest to `2/3 * hidden_size * hidden_ratio`
47
+ if hidden_ratio is None:
48
+ hidden_ratio = 4
49
+ if intermediate_size is None:
50
+ intermediate_size = int(hidden_size * hidden_ratio * 2 / 3)
51
+ intermediate_size = 256 * ((intermediate_size + 256 - 1) // 256)
52
+ self.hidden_ratio = hidden_ratio
53
+ self.intermediate_size = intermediate_size
54
+ self.norm_first = norm_first
55
+
56
+ if norm_first:
57
+ self.norm = RMSNorm(hidden_size=hidden_size, eps=norm_eps)
58
+
59
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
60
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
61
+ self.act_fn = ACT2FN[hidden_act]
62
+
63
+ def forward(self, x):
64
+ if self.norm_first:
65
+ x = rms_norm_linear(x, self.norm.weight, self.norm.bias, self.gate_proj.weight, self.gate_proj.bias)
66
+ else:
67
+ x = self.gate_proj(x)
68
+ gate, y = x.chunk(2, -1)
69
+ return swiglu_linear(gate, y, self.down_proj.weight, self.down_proj.bias)
70
+
71
+
72
+ class DeltaNetBlock(nn.Module):
73
+ def __init__(self, config: DeltaNetConfig, layer_idx: int):
74
+ super().__init__()
75
+ self.hidden_size = config.hidden_size
76
+
77
+ if not config.norm_first:
78
+ self.attn_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
79
+ if config.attn is not None and layer_idx in config.attn['layers']:
80
+ self.attn = Attention(
81
+ hidden_size=config.hidden_size,
82
+ num_heads=config.attn['num_heads'],
83
+ num_kv_heads=config.attn['num_kv_heads'],
84
+ window_size=config.attn['window_size'],
85
+ max_position_embeddings=config.max_position_embeddings,
86
+ layer_idx=layer_idx
87
+ )
88
+ else:
89
+ self.attn = DeltaNet(
90
+ mode=config.attn_mode,
91
+ hidden_size=config.hidden_size,
92
+ expand_k=config.expand_k,
93
+ expand_v=config.expand_v,
94
+ num_heads=config.num_heads,
95
+ use_gate=config.use_gate,
96
+ use_beta=config.use_beta,
97
+ use_short_conv=config.use_short_conv,
98
+ use_output_norm=config.use_output_norm,
99
+ conv_size=config.conv_size,
100
+ qk_norm=config.qk_norm,
101
+ qk_activation=config.qk_activation,
102
+ norm_first=config.norm_first,
103
+ norm_eps=config.norm_eps,
104
+ layer_idx=layer_idx
105
+ )
106
+ if not config.norm_first:
107
+ self.mlp_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
108
+ self.mlp = DeltaNetMLP(
109
+ hidden_size=config.hidden_size,
110
+ hidden_ratio=config.hidden_ratio,
111
+ intermediate_size=config.intermediate_size,
112
+ hidden_act=config.hidden_act,
113
+ norm_first=config.norm_first,
114
+ norm_eps=config.norm_eps
115
+ )
116
+
117
+ def forward(
118
+ self,
119
+ hidden_states: torch.Tensor,
120
+ attention_mask: Optional[torch.Tensor] = None,
121
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
122
+ use_cache: Optional[bool] = False,
123
+ output_attentions: Optional[bool] = False,
124
+ **kwargs
125
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
126
+
127
+ residual = hidden_states
128
+ if hasattr(self, 'attn_norm'):
129
+ hidden_states = self.attn_norm(hidden_states)
130
+ hidden_states, attentions, past_key_values = self.attn(
131
+ hidden_states=hidden_states,
132
+ attention_mask=attention_mask,
133
+ past_key_values=past_key_values,
134
+ use_cache=use_cache,
135
+ output_attentions=output_attentions
136
+ )
137
+ if hasattr(self, 'mlp_norm'):
138
+ hidden_states, residual = self.mlp_norm(hidden_states, residual, True)
139
+ else:
140
+ hidden_states = residual + hidden_states
141
+ residual = hidden_states
142
+ hidden_states = self.mlp(hidden_states)
143
+ hidden_states = residual + hidden_states
144
+
145
+ outputs = (hidden_states, attentions, past_key_values)
146
+
147
+ return outputs
148
+
149
+
150
+ class DeltaNetPreTrainedModel(PreTrainedModel):
151
+
152
+ config_class = DeltaNetConfig
153
+ supports_gradient_checkpointing = True
154
+ _no_split_modules = ['DeltaNetBlock']
155
+
156
+ def __init__(self, *inputs, **kwargs):
157
+ super().__init__(*inputs, **kwargs)
158
+
159
+ def _init_weights(
160
+ self,
161
+ module: nn.Module,
162
+ rescale_prenorm_residual: bool = True,
163
+ num_residuals_per_layer: int = 2,
164
+ ):
165
+ if isinstance(module, (nn.Linear, nn.Conv1d)):
166
+ # Slightly different from the TF version which uses truncated_normal for initialization
167
+ # cf https://github.com/pytorch/pytorch/pull/5617
168
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
169
+ if module.bias is not None:
170
+ nn.init.zeros_(module.bias)
171
+ elif isinstance(module, nn.Embedding):
172
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
173
+ if module.padding_idx is not None:
174
+ module.weight.data[module.padding_idx].zero_()
175
+
176
+ if rescale_prenorm_residual:
177
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
178
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
179
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
180
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
181
+ #
182
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
183
+ for name, p in module.named_parameters():
184
+ if name in ["o_proj.weight", "down_proj.weight"]:
185
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
186
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
187
+ # We need to reinit p since this code could be called multiple times
188
+ # Having just p *= scale would repeatedly scale it down
189
+ with torch.no_grad():
190
+ p /= math.sqrt(num_residuals_per_layer * self.config.num_hidden_layers)
191
+
192
+
193
+ class DeltaNetModel(DeltaNetPreTrainedModel):
194
+
195
+ def __init__(self, config: DeltaNetConfig):
196
+ super().__init__(config)
197
+ self.padding_idx = config.pad_token_id
198
+ self.vocab_size = config.vocab_size
199
+
200
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
201
+ self.layers = nn.ModuleList([DeltaNetBlock(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
202
+ self.norm = RMSNorm(config.hidden_size, eps=config.norm_eps)
203
+
204
+ self.gradient_checkpointing = False
205
+
206
+ self.post_init()
207
+
208
+ def get_input_embeddings(self):
209
+ return self.embeddings
210
+
211
+ def set_input_embeddings(self, value):
212
+ self.embeddings = value
213
+
214
+ def forward(
215
+ self,
216
+ input_ids: Optional[torch.LongTensor] = None,
217
+ attention_mask: Optional[torch.Tensor] = None, # noqa
218
+ inputs_embeds: Optional[torch.FloatTensor] = None,
219
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
220
+ use_cache: Optional[bool] = None,
221
+ output_attentions: Optional[bool] = None,
222
+ output_hidden_states: Optional[bool] = None,
223
+ return_dict: Optional[bool] = None
224
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
225
+ if output_attentions:
226
+ warnings.warn("`DeltaNetModel` does not `output_attentions` now, setting it to `False`.")
227
+ output_attentions = False
228
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
229
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
230
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
231
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
232
+
233
+ # retrieve input_ids and inputs_embeds
234
+ if input_ids is not None and inputs_embeds is not None:
235
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
236
+ if input_ids is None and inputs_embeds is None:
237
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
238
+
239
+ if inputs_embeds is None:
240
+ inputs_embeds = self.embeddings(input_ids)
241
+ hidden_states = inputs_embeds
242
+
243
+ if use_cache and not isinstance(past_key_values, Cache):
244
+ past_key_values = Cache.from_legacy_cache(past_key_values)
245
+
246
+ if self.gradient_checkpointing and self.training and use_cache:
247
+ logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...")
248
+ use_cache = False
249
+
250
+ all_hidden_states = () if output_hidden_states else None
251
+ all_attns = () if output_attentions else None
252
+ for layer in self.layers:
253
+ if output_hidden_states:
254
+ all_hidden_states += (hidden_states,)
255
+
256
+ if self.gradient_checkpointing and self.training:
257
+ hidden_states, attentions, past_key_values = self._gradient_checkpointing_func(
258
+ layer.__call__,
259
+ hidden_states,
260
+ attention_mask,
261
+ past_key_values,
262
+ use_cache,
263
+ output_attentions
264
+ )
265
+ else:
266
+ hidden_states, attentions, past_key_values = layer(
267
+ hidden_states,
268
+ attention_mask=attention_mask,
269
+ past_key_values=past_key_values,
270
+ use_cache=use_cache,
271
+ output_attentions=output_attentions
272
+ )
273
+
274
+ if output_attentions:
275
+ all_attns += (attentions,)
276
+
277
+ hidden_states = self.norm(hidden_states)
278
+
279
+ # add hidden states from the last decoder layer
280
+ if output_hidden_states:
281
+ all_hidden_states += (hidden_states,)
282
+
283
+ next_cache = past_key_values
284
+ if not return_dict:
285
+ return tuple(x for x in [hidden_states, next_cache, all_hidden_states, all_attns] if x is not None)
286
+ return BaseModelOutputWithPast(
287
+ last_hidden_state=hidden_states,
288
+ past_key_values=next_cache,
289
+ hidden_states=all_hidden_states,
290
+ attentions=all_attns
291
+ )
292
+
293
+
294
+ class DeltaNetForCausalLM(DeltaNetPreTrainedModel, GenerationMixin):
295
+
296
+ _tied_weights_keys = ["lm_head.weight"]
297
+
298
+ def __init__(self, config):
299
+ super().__init__(config)
300
+ self.model = DeltaNetModel(config)
301
+ self.vocab_size = config.vocab_size
302
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
303
+
304
+ # Initialize weights and apply final processing
305
+ self.post_init()
306
+
307
+ def get_input_embeddings(self):
308
+ return self.model.embeddings
309
+
310
+ def set_input_embeddings(self, value):
311
+ self.model.embeddings = value
312
+
313
+ def get_output_embeddings(self):
314
+ return self.lm_head
315
+
316
+ def set_output_embeddings(self, new_embeddings):
317
+ self.lm_head = new_embeddings
318
+
319
+ def set_decoder(self, decoder):
320
+ self.model = decoder
321
+
322
+ def get_decoder(self):
323
+ return self.model
324
+
325
+ def generate(self, *args, **kwargs):
326
+ try:
327
+ return super().generate(*args, **kwargs)
328
+ except AttributeError as exception:
329
+ if 'past_key_values' in str(exception):
330
+ raise AttributeError(
331
+ f"You tried to call `generate` with a decoding strategy that manipulates `past_key_values`, "
332
+ f"which is not supported for {self.__class__.__name__}. "
333
+ f"Try another generation strategy instead. "
334
+ f"For the available generation strategies, check this doc: "
335
+ f"https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies"
336
+ )
337
+ else:
338
+ raise exception
339
+
340
+ def prepare_inputs_for_generation(
341
+ self,
342
+ input_ids: torch.LongTensor = None,
343
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
344
+ attention_mask: Optional[torch.Tensor] = None,
345
+ inputs_embeds: Optional[torch.Tensor] = None,
346
+ use_cache: bool = True,
347
+ num_logits_to_keep: Optional[int] = None,
348
+ **kwargs
349
+ ):
350
+ # only last token for `inputs_ids` if the `past_key_values` is passed along.
351
+ if past_key_values is not None:
352
+ input_ids = input_ids[:, -1:]
353
+
354
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
355
+ if inputs_embeds is not None and past_key_values is None:
356
+ model_inputs = {'inputs_embeds': inputs_embeds}
357
+ else:
358
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
359
+ # recompiles graphs as the stride of the inputs is a guard.
360
+ # Ref: https://github.com/huggingface/transformers/pull/29114
361
+ # TODO: use `next_tokens` directly instead.
362
+ model_inputs = {'input_ids': input_ids.contiguous()}
363
+
364
+ if num_logits_to_keep is not None:
365
+ model_inputs['num_logits_to_keep'] = num_logits_to_keep
366
+
367
+ model_inputs.update({
368
+ 'past_key_values': past_key_values,
369
+ 'use_cache': use_cache,
370
+ 'attention_mask': attention_mask,
371
+ 'num_logits_to_keep': num_logits_to_keep,
372
+ })
373
+ return model_inputs
374
+
375
+ def forward(
376
+ self,
377
+ input_ids: torch.LongTensor = None,
378
+ attention_mask: Optional[torch.Tensor] = None,
379
+ inputs_embeds: Optional[torch.Tensor] = None,
380
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
381
+ labels: Optional[torch.LongTensor] = None,
382
+ use_cache: Optional[bool] = None,
383
+ output_attentions: Optional[bool] = None,
384
+ output_hidden_states: Optional[bool] = None,
385
+ return_dict: Optional[bool] = None,
386
+ num_logits_to_keep: Optional[int] = 0
387
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
388
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
389
+ output_hidden_states = (
390
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
391
+ )
392
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
393
+
394
+ outputs = self.model(
395
+ input_ids=input_ids,
396
+ attention_mask=attention_mask,
397
+ inputs_embeds=inputs_embeds,
398
+ past_key_values=past_key_values,
399
+ use_cache=use_cache,
400
+ output_attentions=output_attentions,
401
+ output_hidden_states=output_hidden_states,
402
+ return_dict=return_dict
403
+ )
404
+
405
+ hidden_states = outputs[0]
406
+ fuse_linear_and_cross_entropy = self.config.fuse_cross_entropy and self.training
407
+ logits = None if fuse_linear_and_cross_entropy else self.lm_head(hidden_states[:, -num_logits_to_keep:])
408
+
409
+ loss = None
410
+ if labels is not None:
411
+ if self.config.fuse_cross_entropy:
412
+ if fuse_linear_and_cross_entropy:
413
+ loss_fct = FusedLinearCrossEntropyLoss()
414
+ else:
415
+ loss_fct = FusedCrossEntropyLoss(inplace_backward=True)
416
+ else:
417
+ loss_fct = nn.CrossEntropyLoss()
418
+ # Enable model parallelism
419
+ labels = labels.to(hidden_states.device)
420
+ labels = torch.cat((labels[..., 1:], torch.full_like(labels[:, :1], loss_fct.ignore_index)), 1)
421
+ if fuse_linear_and_cross_entropy:
422
+ loss = loss_fct(hidden_states.view(-1, self.config.hidden_size),
423
+ labels.view(-1),
424
+ self.lm_head.weight,
425
+ self.lm_head.bias)
426
+ else:
427
+ loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
428
+
429
+ if not return_dict:
430
+ output = (logits,) + outputs[1:]
431
+ return (loss,) + output if loss is not None else output
432
+
433
+ return CausalLMOutputWithPast(
434
+ loss=loss,
435
+ logits=logits,
436
+ past_key_values=outputs.past_key_values,
437
+ hidden_states=outputs.hidden_states,
438
+ attentions=outputs.attentions,
439
+ )
fla/models/gla/__init__.py ADDED
@@ -0,0 +1,13 @@
+ # -*- coding: utf-8 -*-
+
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
+
+ from fla.models.gla.configuration_gla import GLAConfig
+ from fla.models.gla.modeling_gla import GLAForCausalLM, GLAModel
+
+ AutoConfig.register(GLAConfig.model_type, GLAConfig)
+ AutoModel.register(GLAConfig, GLAModel)
+ AutoModelForCausalLM.register(GLAConfig, GLAForCausalLM)
+
+
+ __all__ = ['GLAConfig', 'GLAForCausalLM', 'GLAModel']
fla/models/gla/configuration_gla.py ADDED
@@ -0,0 +1,90 @@
+ # -*- coding: utf-8 -*-
+
+ from typing import Dict, Optional
+
+ from transformers.configuration_utils import PretrainedConfig
+
+
+ class GLAConfig(PretrainedConfig):
+
+     model_type = 'gla'
+     keys_to_ignore_at_inference = ['past_key_values']
+
+     def __init__(
+         self,
+         hidden_size: int = 2048,
+         expand_k: int = 0.5,
+         expand_v: int = 1,
+         hidden_ratio: Optional[int] = 4,
+         intermediate_size: Optional[int] = None,
+         num_hidden_layers: int = 24,
+         num_heads: int = 4,
+         num_kv_heads: Optional[int] = None,
+         feature_map: Optional[str] = None,
+         attn_mode: str = "chunk",
+         use_short_conv: bool = False,
+         conv_size: int = 4,
+         use_output_gate: bool = True,
+         clamp_min: Optional[float] = None,
+         hidden_act: str = "swish",
+         max_position_embeddings: int = 2048,
+         elementwise_affine: Optional[bool] = True,
+         norm_eps: float = 1e-6,
+         use_gk: bool = True,
+         use_gv: bool = False,
+         attn: Optional[Dict] = None,
+         use_cache: bool = True,
+         pad_token_id: int = None,
+         bos_token_id: int = 1,
+         eos_token_id: int = 2,
+         tie_word_embeddings: bool = False,
+         initializer_range: float = 0.02,
+         fuse_norm: bool = True,
+         fuse_cross_entropy: bool = True,
+         vocab_size: int = 32000,
+         **kwargs
+     ):
+         self.hidden_size = hidden_size
+         self.expand_k = expand_k
+         self.expand_v = expand_v
+         self.hidden_ratio = hidden_ratio
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_heads = num_heads
+         self.num_kv_heads = num_kv_heads
+         self.feature_map = feature_map
+         self.attn_mode = attn_mode
+         self.use_short_conv = use_short_conv
+         self.conv_size = conv_size
+         self.use_output_gate = use_output_gate
+         self.clamp_min = clamp_min
+         self.hidden_act = hidden_act
+         self.max_position_embeddings = max_position_embeddings
+         self.elementwise_affine = elementwise_affine
+         self.norm_eps = norm_eps
+         self.use_gk = use_gk
+         self.use_gv = use_gv
+         self.attn = attn
+         self.use_cache = use_cache
+         self.initializer_range = initializer_range
+         self.fuse_norm = fuse_norm
+         self.fuse_cross_entropy = fuse_cross_entropy
+         self.vocab_size = vocab_size
+
+         if attn is not None:
+             if not isinstance(attn, Dict):
+                 raise ValueError("attn must be a dictionary")
+             if 'layers' not in attn:
+                 raise ValueError("Layer indices must be provided to initialize hybrid attention layers")
+             if 'num_heads' not in attn:
+                 raise ValueError("Number of heads must be provided to initialize hybrid attention layers")
+             attn['num_kv_heads'] = attn.get('num_kv_heads', attn['num_heads'])
+             attn['window_size'] = attn.get('window_size', None)
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
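
In `GLAConfig`, `expand_k` and `expand_v` scale the key and value projection widths relative to `hidden_size`, so the defaults above give half-width keys and full-width values. A rough sketch of the resulting dimensions, assuming the layer computes them as `int(hidden_size * expand)` split evenly across heads:

```python
hidden_size, num_heads = 2048, 4
expand_k, expand_v = 0.5, 1

key_dim = int(hidden_size * expand_k)    # 1024
value_dim = int(hidden_size * expand_v)  # 2048
head_k_dim = key_dim // num_heads        # 256 per head
head_v_dim = value_dim // num_heads      # 512 per head
print(key_dim, value_dim, head_k_dim, head_v_dim)
```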
fla/models/gla/modeling_gla.py ADDED
@@ -0,0 +1,418 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from __future__ import annotations
4
+
5
+ import math
6
+ import warnings
7
+ from typing import List, Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.utils.checkpoint
12
+ from transformers.activations import ACT2FN
13
+ from transformers.generation import GenerationMixin
14
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
15
+ CausalLMOutputWithPast)
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import logging
18
+
19
+ from fla.layers.attn import Attention
20
+ from fla.layers.gla import GatedLinearAttention
21
+ from fla.models.gla.configuration_gla import GLAConfig
22
+ from fla.models.utils import Cache
23
+ from fla.modules import (FusedCrossEntropyLoss, FusedLinearCrossEntropyLoss,
24
+ RMSNorm)
25
+ from fla.modules.activations import swiglu_linear
26
+
27
+ logger = logging.get_logger(__name__)
28
+
29
+
30
+ class GLAMLP(nn.Module):
31
+
32
+ def __init__(
33
+ self,
34
+ hidden_size: int,
35
+ hidden_ratio: Optional[int] = None,
36
+ intermediate_size: Optional[int] = None,
37
+ hidden_act: str = 'swish'
38
+ ) -> GLAMLP:
39
+ super().__init__()
40
+
41
+ self.hidden_size = hidden_size
42
+ # the final number of params is `hidden_ratio * hidden_size^2`
43
+ # `intermediate_size` is chosen to be a multiple of 256 closest to `2/3 * hidden_size * hidden_ratio`
44
+ if hidden_ratio is None:
45
+ hidden_ratio = 4
46
+ if intermediate_size is None:
47
+ intermediate_size = int(hidden_size * hidden_ratio * 2 / 3)
48
+ intermediate_size = 256 * ((intermediate_size + 256 - 1) // 256)
49
+ self.hidden_ratio = hidden_ratio
50
+ self.intermediate_size = intermediate_size
51
+
52
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
53
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
54
+ self.act_fn = ACT2FN[hidden_act]
55
+
56
+ def forward(self, x):
57
+ y = self.gate_proj(x)
58
+ gate, y = y.chunk(2, -1)
59
+ return swiglu_linear(gate, y, self.down_proj.weight, self.down_proj.bias)
60
+
61
+
62
+ class GLABlock(nn.Module):
63
+ def __init__(self, config: GLAConfig, layer_idx: int):
64
+ super().__init__()
65
+ self.hidden_size = config.hidden_size
66
+
67
+ self.attn_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
68
+ if config.attn is not None and layer_idx in config.attn['layers']:
69
+ self.attn = Attention(
70
+ hidden_size=config.hidden_size,
71
+ num_heads=config.attn['num_heads'],
72
+ num_kv_heads=config.attn['num_kv_heads'],
73
+ window_size=config.attn['window_size'],
74
+ max_position_embeddings=config.max_position_embeddings,
75
+ layer_idx=layer_idx
76
+ )
77
+ else:
78
+ self.attn = GatedLinearAttention(
79
+ mode=config.attn_mode,
80
+ hidden_size=config.hidden_size,
81
+ expand_k=config.expand_k,
82
+ expand_v=config.expand_v,
83
+ num_heads=config.num_heads,
84
+ num_kv_heads=config.num_kv_heads,
85
+ feature_map=config.feature_map,
86
+ use_short_conv=config.use_short_conv,
87
+ conv_size=config.conv_size,
88
+ use_output_gate=config.use_output_gate,
89
+ gate_fn=config.hidden_act,
90
+ elementwise_affine=config.elementwise_affine,
91
+ norm_eps=config.norm_eps,
92
+ clamp_min=config.clamp_min,
93
+ fuse_norm=config.fuse_norm,
94
+ layer_idx=layer_idx
95
+ )
96
+ self.mlp_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
97
+ self.mlp = GLAMLP(
98
+ hidden_size=config.hidden_size,
99
+ hidden_ratio=config.hidden_ratio,
100
+ intermediate_size=config.intermediate_size,
101
+ hidden_act=config.hidden_act
102
+ )
103
+
104
+ def forward(
105
+ self,
106
+ hidden_states: torch.Tensor,
107
+ attention_mask: Optional[torch.Tensor] = None,
108
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
109
+ use_cache: Optional[bool] = False,
110
+ output_attentions: Optional[bool] = False,
111
+ **kwargs,
112
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
113
+ residual = hidden_states
114
+ hidden_states = self.attn_norm(hidden_states)
115
+ hidden_states, attentions, past_key_values = self.attn(
116
+ hidden_states=hidden_states,
117
+ attention_mask=attention_mask,
118
+ past_key_values=past_key_values,
119
+ use_cache=use_cache,
120
+ output_attentions=output_attentions
121
+ )
122
+ hidden_states, residual = self.mlp_norm(hidden_states, residual, True)
123
+ hidden_states = self.mlp(hidden_states)
124
+ hidden_states = residual + hidden_states
125
+
126
+ outputs = (hidden_states, attentions, past_key_values)
127
+
128
+ return outputs
129
+
130
+
131
+ class GLAPreTrainedModel(PreTrainedModel):
132
+
133
+ config_class = GLAConfig
134
+ supports_gradient_checkpointing = True
135
+ _no_split_modules = ['GLABlock']
136
+
137
+ def __init__(self, *inputs, **kwargs):
138
+ super().__init__(*inputs, **kwargs)
139
+
140
+ def _init_weights(
141
+ self,
142
+ module: nn.Module,
143
+ rescale_prenorm_residual: bool = True,
144
+ num_residuals_per_layer: int = 2,
145
+ ):
146
+ if isinstance(module, (nn.Linear, nn.Conv1d)):
147
+ # Slightly different from the TF version which uses truncated_normal for initialization
148
+ # cf https://github.com/pytorch/pytorch/pull/5617
149
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
150
+ if module.bias is not None:
151
+ nn.init.zeros_(module.bias)
152
+ elif isinstance(module, nn.Embedding):
153
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
154
+ if module.padding_idx is not None:
155
+ module.weight.data[module.padding_idx].zero_()
156
+
157
+ if rescale_prenorm_residual:
158
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
159
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
160
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
161
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
162
+ #
163
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
164
+ for name, p in module.named_parameters():
165
+ if name in ["o_proj.weight", "down_proj.weight"]:
166
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
167
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
168
+ # We need to reinit p since this code could be called multiple times
169
+ # Having just p *= scale would repeatedly scale it down
170
+ with torch.no_grad():
171
+ p /= math.sqrt(num_residuals_per_layer * self.config.num_hidden_layers)
172
+
173
+
174
+ class GLAModel(GLAPreTrainedModel):
175
+
176
+ def __init__(self, config: GLAConfig):
177
+ super().__init__(config)
178
+ self.padding_idx = config.pad_token_id
179
+ self.vocab_size = config.vocab_size
180
+
181
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
182
+ self.layers = nn.ModuleList([GLABlock(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
183
+ self.norm = RMSNorm(config.hidden_size, eps=config.norm_eps)
184
+
185
+ self.gradient_checkpointing = False
186
+
187
+ self.post_init()
188
+
189
+ def get_input_embeddings(self):
190
+ return self.embeddings
191
+
192
+ def set_input_embeddings(self, value):
193
+ self.embeddings = value
194
+
195
+ def forward(
196
+ self,
197
+ input_ids: Optional[torch.LongTensor] = None,
198
+ attention_mask: Optional[torch.Tensor] = None, # noqa
199
+ inputs_embeds: Optional[torch.FloatTensor] = None,
200
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
201
+ use_cache: Optional[bool] = None,
202
+ output_attentions: Optional[bool] = None,
203
+ output_hidden_states: Optional[bool] = None,
204
+ return_dict: Optional[bool] = None
205
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
206
+ if output_attentions:
207
+ warnings.warn("`GLAModel` does not `output_attentions` now, setting it to `False`.")
208
+ output_attentions = False
209
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
210
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
211
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
212
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
213
+
214
+ # retrieve input_ids and inputs_embeds
215
+ if input_ids is not None and inputs_embeds is not None:
216
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
217
+ if input_ids is None and inputs_embeds is None:
218
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
219
+
220
+ if inputs_embeds is None:
221
+ inputs_embeds = self.embeddings(input_ids)
222
+ hidden_states = inputs_embeds
223
+
224
+ if use_cache and not isinstance(past_key_values, Cache):
225
+ past_key_values = Cache.from_legacy_cache(past_key_values)
226
+
227
+ if self.gradient_checkpointing and self.training and use_cache:
228
+ logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...")
229
+ use_cache = False
230
+
231
+ all_hidden_states = () if output_hidden_states else None
232
+ all_attns = () if output_attentions else None
233
+ for layer in self.layers:
234
+ if output_hidden_states:
235
+ all_hidden_states += (hidden_states,)
236
+
237
+ if self.gradient_checkpointing and self.training:
238
+ hidden_states, attentions, past_key_values = self._gradient_checkpointing_func(
239
+ layer.__call__,
240
+ hidden_states,
241
+ attention_mask,
242
+ past_key_values,
243
+ use_cache,
244
+ output_attentions
245
+ )
246
+ else:
247
+ hidden_states, attentions, past_key_values = layer(
248
+ hidden_states,
249
+ attention_mask=attention_mask,
250
+ past_key_values=past_key_values,
251
+ use_cache=use_cache,
252
+ output_attentions=output_attentions
253
+ )
254
+
255
+ if output_attentions:
256
+ all_attns += (attentions,)
257
+
258
+ hidden_states = self.norm(hidden_states)
259
+
260
+ # add hidden states from the last decoder layer
261
+ if output_hidden_states:
262
+ all_hidden_states += (hidden_states,)
263
+
264
+ if not return_dict:
265
+ return tuple(i for i in [hidden_states, past_key_values, all_hidden_states, all_attns] if i is not None)
266
+ return BaseModelOutputWithPast(
267
+ last_hidden_state=hidden_states,
268
+ past_key_values=past_key_values,
269
+ hidden_states=all_hidden_states,
270
+ attentions=all_attns
271
+ )
272
+
273
+
274
+ class GLAForCausalLM(GLAPreTrainedModel, GenerationMixin):
275
+
276
+ _tied_weights_keys = ["lm_head.weight"]
277
+
278
+ def __init__(self, config):
279
+ super().__init__(config)
280
+ self.model = GLAModel(config)
281
+ self.vocab_size = config.vocab_size
282
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
283
+
284
+ # Initialize weights and apply final processing
285
+ self.post_init()
286
+
287
+ def get_input_embeddings(self):
288
+ return self.model.embeddings
289
+
290
+ def set_input_embeddings(self, value):
291
+ self.model.embeddings = value
292
+
293
+ def get_output_embeddings(self):
294
+ return self.lm_head
295
+
296
+ def set_output_embeddings(self, new_embeddings):
297
+ self.lm_head = new_embeddings
298
+
299
+ def set_decoder(self, decoder):
300
+ self.model = decoder
301
+
302
+ def get_decoder(self):
303
+ return self.model
304
+
305
+ def generate(self, *args, **kwargs):
306
+ try:
307
+ return super().generate(*args, **kwargs)
308
+ except AttributeError as exception:
309
+ if 'past_key_values' in str(exception):
310
+ raise AttributeError(
311
+ f"You tried to call `generate` with a decoding strategy that manipulates `past_key_values`, "
312
+ f"which is not supported for {self.__class__.__name__}. "
313
+ f"Try another generation strategy instead. "
314
+ f"For the available generation strategies, check this doc: "
315
+ f"https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies"
316
+ )
317
+ else:
318
+ raise exception
319
+
320
+ def prepare_inputs_for_generation(
321
+ self,
322
+ input_ids: torch.LongTensor = None,
323
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
324
+ attention_mask: Optional[torch.Tensor] = None,
325
+ inputs_embeds: Optional[torch.Tensor] = None,
326
+ use_cache: bool = True,
327
+ num_logits_to_keep: Optional[int] = None,
328
+ **kwargs
329
+ ):
330
+ # keep only the last token of `input_ids` if `past_key_values` is passed along.
331
+ if past_key_values is not None:
332
+ input_ids = input_ids[:, -1:]
333
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
334
+ if inputs_embeds is not None and past_key_values is None:
335
+ model_inputs = {'inputs_embeds': inputs_embeds}
336
+ else:
337
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
338
+ # recompiles graphs as the stride of the inputs is a guard.
339
+ # Ref: https://github.com/huggingface/transformers/pull/29114
340
+ # TODO: use `next_tokens` directly instead.
341
+ model_inputs = {'input_ids': input_ids.contiguous()}
342
+
343
+ if num_logits_to_keep is not None:
344
+ model_inputs['num_logits_to_keep'] = num_logits_to_keep
345
+
346
+ model_inputs.update({
347
+ 'past_key_values': past_key_values,
348
+ 'use_cache': use_cache,
349
+ 'attention_mask': attention_mask,
350
+ 'num_logits_to_keep': num_logits_to_keep,
351
+ })
352
+ return model_inputs
353
+
354
+ def forward(
355
+ self,
356
+ input_ids: torch.LongTensor = None,
357
+ attention_mask: Optional[torch.Tensor] = None,
358
+ inputs_embeds: Optional[torch.Tensor] = None,
359
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
360
+ labels: Optional[torch.LongTensor] = None,
361
+ use_cache: Optional[bool] = None,
362
+ output_attentions: Optional[bool] = None,
363
+ output_hidden_states: Optional[bool] = None,
364
+ return_dict: Optional[bool] = None,
365
+ num_logits_to_keep: Optional[int] = 0
366
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
367
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
368
+ output_hidden_states = (
369
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
370
+ )
371
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
372
+
373
+ outputs = self.model(
374
+ input_ids=input_ids,
375
+ attention_mask=attention_mask,
376
+ inputs_embeds=inputs_embeds,
377
+ past_key_values=past_key_values,
378
+ use_cache=use_cache,
379
+ output_attentions=output_attentions,
380
+ output_hidden_states=output_hidden_states,
381
+ return_dict=return_dict
382
+ )
383
+
384
+ hidden_states = outputs[0]
385
+ fuse_linear_and_cross_entropy = self.config.fuse_cross_entropy and self.training
386
+ logits = None if fuse_linear_and_cross_entropy else self.lm_head(hidden_states[:, -num_logits_to_keep:])
387
+
388
+ loss = None
389
+ if labels is not None:
390
+ if self.config.fuse_cross_entropy:
391
+ if fuse_linear_and_cross_entropy:
392
+ loss_fct = FusedLinearCrossEntropyLoss()
393
+ else:
394
+ loss_fct = FusedCrossEntropyLoss(inplace_backward=True)
395
+ else:
396
+ loss_fct = nn.CrossEntropyLoss()
397
+ # Enable model parallelism
398
+ labels = labels.to(hidden_states.device)
399
+ labels = torch.cat((labels[..., 1:], torch.full_like(labels[:, :1], loss_fct.ignore_index)), 1)
400
+ if fuse_linear_and_cross_entropy:
401
+ loss = loss_fct(hidden_states.view(-1, self.config.hidden_size),
402
+ labels.view(-1),
403
+ self.lm_head.weight,
404
+ self.lm_head.bias)
405
+ else:
406
+ loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
407
+
408
+ if not return_dict:
409
+ output = (logits,) + outputs[1:]
410
+ return (loss,) + output if loss is not None else output
411
+
412
+ return CausalLMOutputWithPast(
413
+ loss=loss,
414
+ logits=logits,
415
+ past_key_values=outputs.past_key_values,
416
+ hidden_states=outputs.hidden_states,
417
+ attentions=outputs.attentions,
418
+ )
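For orientation, here is a minimal sketch (not part of this commit) of how the `GLAForCausalLM` head above can be exercised. It assumes a CUDA machine with the `fla` Triton kernels available, and that `GLAConfig` exposes the same basic fields (`vocab_size`, `hidden_size`, `num_hidden_layers`, `fuse_cross_entropy`) as the GSA and HGRN configs later in this diff; the tiny hyperparameters are purely illustrative.

```python
# Minimal sketch, not part of the commit; assumes CUDA + Triton for the fla kernels.
import torch

from fla.models.gla import GLAConfig, GLAForCausalLM

config = GLAConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2, fuse_cross_entropy=False)
model = GLAForCausalLM(config).cuda().eval()

input_ids = torch.randint(0, config.vocab_size, (2, 16), device='cuda')
# Passing `labels` makes `forward` compute the shifted next-token cross-entropy loss.
out = model(input_ids=input_ids, labels=input_ids)
print(out.loss, out.logits.shape)  # logits: (2, 16, 1000)
```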
fla/models/gsa/__init__.py ADDED
@@ -0,0 +1,13 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
4
+
5
+ from fla.models.gsa.configuration_gsa import GSAConfig
6
+ from fla.models.gsa.modeling_gsa import GSAForCausalLM, GSAModel
7
+
8
+ AutoConfig.register(GSAConfig.model_type, GSAConfig)
9
+ AutoModel.register(GSAConfig, GSAModel)
10
+ AutoModelForCausalLM.register(GSAConfig, GSAForCausalLM)
11
+
12
+
13
+ __all__ = ['GSAConfig', 'GSAForCausalLM', 'GSAModel']
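These registrations are what make the model reachable through the 🤗 Auto classes. A minimal sketch of the effect (not part of this commit), assuming `fla` and its Triton dependency are installed:

```python
# Importing the module runs the AutoConfig/AutoModel registrations above.
from transformers import AutoModel, AutoModelForCausalLM

from fla.models.gsa import GSAConfig

config = GSAConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2, num_heads=2)
backbone = AutoModel.from_config(config)       # -> GSAModel
lm = AutoModelForCausalLM.from_config(config)  # -> GSAForCausalLM
print(type(backbone).__name__, type(lm).__name__)
```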
fla/models/gsa/configuration_gsa.py ADDED
@@ -0,0 +1,94 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from typing import Dict, Optional
4
+
5
+ from transformers.configuration_utils import PretrainedConfig
6
+
7
+
8
+ class GSAConfig(PretrainedConfig):
9
+
10
+ model_type = 'gsa'
11
+ keys_to_ignore_at_inference = ['past_key_values']
12
+
13
+ def __init__(
14
+ self,
15
+ hidden_size: int = 2048,
16
+ gate_logit_normalizer: Optional[int] = 8,
17
+ clamp_min: Optional[float] = None,
18
+ clamp_max: Optional[float] = None,
19
+ hidden_ratio: Optional[int] = 4,
20
+ intermediate_size: Optional[int] = None,
21
+ num_hidden_layers: int = 24,
22
+ num_heads: int = 4,
23
+ num_kv_heads: Optional[int] = None,
24
+ num_slots: Optional[int] = 64,
25
+ use_short_conv: bool = False,
26
+ conv_size: int = 4,
27
+ expand_k: float = 1,
28
+ expand_v: float = 1,
29
+ feature_map: str = 'swish',
30
+ use_output_gate: bool = False,
31
+ use_norm: bool = True,
32
+ max_position_embeddings: int = 2048,
33
+ hidden_act: str = "swish",
34
+ elementwise_affine: Optional[bool] = True,
35
+ norm_first: bool = True,
36
+ norm_eps: float = 1e-6,
37
+ attn: Optional[Dict] = None,
38
+ use_cache: bool = True,
39
+ pad_token_id: int = None,
40
+ bos_token_id: int = 1,
41
+ eos_token_id: int = 2,
42
+ initializer_range: float = 0.02,
43
+ tie_word_embeddings: bool = False,
44
+ fuse_norm: bool = True,
45
+ fuse_cross_entropy: bool = True,
46
+ vocab_size: int = 32000,
47
+ **kwargs
48
+ ):
49
+ self.hidden_size = hidden_size
50
+ self.gate_logit_normalizer = gate_logit_normalizer
51
+ self.clamp_min = clamp_min
52
+ self.clamp_max = clamp_max
53
+ self.hidden_ratio = hidden_ratio
54
+ self.intermediate_size = intermediate_size
55
+ self.num_hidden_layers = num_hidden_layers
56
+ self.num_heads = num_heads
57
+ self.num_kv_heads = num_kv_heads
58
+ self.num_slots = num_slots
59
+ self.use_short_conv = use_short_conv
60
+ self.conv_size = conv_size
61
+ self.expand_k = expand_k
62
+ self.expand_v = expand_v
63
+ self.feature_map = feature_map
64
+ self.use_output_gate = use_output_gate
65
+ self.use_norm = use_norm
66
+ self.max_position_embeddings = max_position_embeddings
67
+ self.hidden_act = hidden_act
68
+ self.elementwise_affine = elementwise_affine
69
+ self.norm_first = norm_first
70
+ self.norm_eps = norm_eps
71
+ self.attn = attn
72
+ self.use_cache = use_cache
73
+ self.initializer_range = initializer_range
74
+ self.fuse_cross_entropy = fuse_cross_entropy
75
+ self.fuse_norm = fuse_norm
76
+ self.vocab_size = vocab_size
77
+
78
+ if attn is not None:
79
+ if not isinstance(attn, Dict):
80
+ raise ValueError("attn must be a dictionary")
81
+ if 'layers' not in attn:
82
+ raise ValueError("Layer indices must be provided to initialize hybrid attention layers")
83
+ if 'num_heads' not in attn:
84
+ raise ValueError("Number of heads must be provided to initialize hybrid attention layers")
85
+ attn['num_kv_heads'] = attn.get('num_kv_heads', attn['num_heads'])
86
+ attn['window_size'] = attn.get('window_size', None)
87
+
88
+ super().__init__(
89
+ pad_token_id=pad_token_id,
90
+ bos_token_id=bos_token_id,
91
+ eos_token_id=eos_token_id,
92
+ tie_word_embeddings=tie_word_embeddings,
93
+ **kwargs,
94
+ )
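As a usage note (not part of this commit): the `attn` argument is how hybrid GSA/attention models are configured. `layers` and `num_heads` are mandatory, while `num_kv_heads` and `window_size` are filled in with defaults by `__init__` above. A minimal sketch:

```python
from fla.models.gsa import GSAConfig

# Swap layers 2 and 5 for standard softmax attention; the remaining layers use GSA.
config = GSAConfig(
    hidden_size=1024,
    num_hidden_layers=12,
    num_heads=4,
    attn={'layers': [2, 5], 'num_heads': 8},
)
print(config.attn)
# {'layers': [2, 5], 'num_heads': 8, 'num_kv_heads': 8, 'window_size': None}
```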
fla/models/gsa/modeling_gsa.py ADDED
@@ -0,0 +1,442 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from __future__ import annotations
4
+
5
+ import math
6
+ import warnings
7
+ from typing import List, Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.utils.checkpoint
12
+ from transformers.activations import ACT2FN
13
+ from transformers.generation import GenerationMixin
14
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
15
+ CausalLMOutputWithPast)
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import logging
18
+
19
+ from fla.layers.attn import Attention
20
+ from fla.layers.gsa import GatedSlotAttention
21
+ from fla.models.gsa.configuration_gsa import GSAConfig
22
+ from fla.models.utils import Cache
23
+ from fla.modules import (FusedCrossEntropyLoss, FusedLinearCrossEntropyLoss,
24
+ RMSNorm)
25
+ from fla.modules.activations import swiglu_linear
26
+ from fla.modules.layernorm import rms_norm_linear
27
+
28
+ logger = logging.get_logger(__name__)
29
+
30
+
31
+ class GSAMLP(nn.Module):
32
+
33
+ def __init__(
34
+ self,
35
+ hidden_size: int,
36
+ hidden_ratio: Optional[int] = None,
37
+ intermediate_size: Optional[int] = None,
38
+ hidden_act: str = 'swish',
39
+ norm_first: bool = True,
40
+ norm_eps: float = 1e-5
41
+ ) -> GSAMLP:
42
+ super().__init__()
43
+
44
+ self.hidden_size = hidden_size
45
+ # the final number of params is `hidden_ratio * hidden_size^2`
46
+ # `intermediate_size` is `2/3 * hidden_size * hidden_ratio`, rounded up to the nearest multiple of 256
47
+ if hidden_ratio is None:
48
+ hidden_ratio = 4
49
+ if intermediate_size is None:
50
+ intermediate_size = int(hidden_size * hidden_ratio * 2 / 3)
51
+ intermediate_size = 256 * ((intermediate_size + 256 - 1) // 256)
52
+ self.hidden_ratio = hidden_ratio
53
+ self.intermediate_size = intermediate_size
54
+ self.norm_first = norm_first
55
+
56
+ if norm_first:
57
+ self.norm = RMSNorm(hidden_size=hidden_size, eps=norm_eps)
58
+
59
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
60
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
61
+ self.act_fn = ACT2FN[hidden_act]
62
+
63
+ def forward(self, x):
64
+ if self.norm_first:
65
+ x = rms_norm_linear(x, self.norm.weight, self.norm.bias, self.gate_proj.weight, self.gate_proj.bias)
66
+ else:
67
+ x = self.gate_proj(x)
68
+ gate, y = x.chunk(2, -1)
69
+ return swiglu_linear(gate, y, self.down_proj.weight, self.down_proj.bias)
70
+
71
+
72
+ class GSABlock(nn.Module):
73
+ def __init__(self, config: GSAConfig, layer_idx: int):
74
+ super().__init__()
75
+ self.hidden_size = config.hidden_size
76
+
77
+ if not config.norm_first:
78
+ self.attn_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
79
+ if config.attn is not None and layer_idx in config.attn['layers']:
80
+ self.attn = Attention(
81
+ hidden_size=config.hidden_size,
82
+ num_heads=config.attn['num_heads'],
83
+ num_kv_heads=config.attn['num_kv_heads'],
84
+ window_size=config.attn['window_size'],
85
+ max_position_embeddings=config.max_position_embeddings,
86
+ layer_idx=layer_idx
87
+ )
88
+ else:
89
+ self.attn = GatedSlotAttention(
90
+ hidden_size=config.hidden_size,
91
+ expand_k=config.expand_k,
92
+ expand_v=config.expand_v,
93
+ num_heads=config.num_heads,
94
+ num_kv_heads=config.num_kv_heads,
95
+ num_slots=config.num_slots,
96
+ use_short_conv=config.use_short_conv,
97
+ conv_size=config.conv_size,
98
+ feature_map=config.feature_map,
99
+ use_output_gate=config.use_output_gate,
100
+ use_norm=config.use_norm,
101
+ gate_fn=config.hidden_act,
102
+ gate_logit_normalizer=config.gate_logit_normalizer,
103
+ elementwise_affine=config.elementwise_affine,
104
+ norm_first=config.norm_first,
105
+ norm_eps=config.norm_eps,
106
+ fuse_norm=config.fuse_norm,
107
+ layer_idx=layer_idx
108
+ )
109
+ if not config.norm_first:
110
+ self.mlp_norm = RMSNorm(hidden_size=config.hidden_size, eps=config.norm_eps)
111
+ self.mlp = GSAMLP(
112
+ hidden_size=config.hidden_size,
113
+ hidden_ratio=config.hidden_ratio,
114
+ intermediate_size=config.intermediate_size,
115
+ hidden_act=config.hidden_act,
116
+ norm_first=config.norm_first,
117
+ norm_eps=config.norm_eps
118
+ )
119
+
120
+ def forward(
121
+ self,
122
+ hidden_states: torch.Tensor,
123
+ attention_mask: Optional[torch.Tensor] = None,
124
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
125
+ use_cache: Optional[bool] = False,
126
+ output_attentions: Optional[bool] = False,
127
+ **kwargs
128
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
129
+
130
+ residual = hidden_states
131
+ if hasattr(self, 'attn_norm'):
132
+ hidden_states = self.attn_norm(hidden_states)
133
+ hidden_states, attentions, past_key_values = self.attn(
134
+ hidden_states=hidden_states,
135
+ attention_mask=attention_mask,
136
+ past_key_values=past_key_values,
137
+ use_cache=use_cache,
138
+ output_attentions=output_attentions
139
+ )
140
+ if hasattr(self, 'mlp_norm'):
141
+ hidden_states, residual = self.mlp_norm(hidden_states, residual, True)
142
+ else:
143
+ hidden_states = residual + hidden_states
144
+ residual = hidden_states
145
+ hidden_states = self.mlp(hidden_states)
146
+ hidden_states = residual + hidden_states
147
+
148
+ outputs = (hidden_states, attentions, past_key_values)
149
+
150
+ return outputs
151
+
152
+
153
+ class GSAPreTrainedModel(PreTrainedModel):
154
+
155
+ config_class = GSAConfig
156
+ supports_gradient_checkpointing = True
157
+ _no_split_modules = ['GSABlock']
158
+
159
+ def __init__(self, *inputs, **kwargs):
160
+ super().__init__(*inputs, **kwargs)
161
+
162
+ def _init_weights(
163
+ self,
164
+ module: nn.Module,
165
+ rescale_prenorm_residual: bool = True,
166
+ num_residuals_per_layer: int = 2,
167
+ ):
168
+ if isinstance(module, (nn.Linear, nn.Conv1d)):
169
+ # Slightly different from the TF version which uses truncated_normal for initialization
170
+ # cf https://github.com/pytorch/pytorch/pull/5617
171
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
172
+ if module.bias is not None:
173
+ nn.init.zeros_(module.bias)
174
+ elif isinstance(module, nn.Embedding):
175
+ nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
176
+ if module.padding_idx is not None:
177
+ module.weight.data[module.padding_idx].zero_()
178
+
179
+ if rescale_prenorm_residual:
180
+ # Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:
181
+ # > A modified initialization which accounts for the accumulation on the residual path with model depth. Scale
182
+ # > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.
183
+ # > -- GPT-2 :: https://openai.com/blog/better-language-models/
184
+ #
185
+ # Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.py
186
+ for name, p in module.named_parameters():
187
+ if name in ["o_proj.weight", "down_proj.weight"]:
188
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
189
+ # Following Pytorch init, except scale by 1/sqrt(2 * n_layer)
190
+ # We need to reinit p since this code could be called multiple times
191
+ # Having just p *= scale would repeatedly scale it down
192
+ with torch.no_grad():
193
+ p /= math.sqrt(num_residuals_per_layer * self.config.num_hidden_layers)
194
+
195
+
196
+ class GSAModel(GSAPreTrainedModel):
197
+
198
+ def __init__(self, config: GSAConfig):
199
+ super().__init__(config)
200
+ self.padding_idx = config.pad_token_id
201
+ self.vocab_size = config.vocab_size
202
+
203
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
204
+ self.layers = nn.ModuleList([GSABlock(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
205
+ self.norm = RMSNorm(config.hidden_size, eps=config.norm_eps)
206
+
207
+ self.gradient_checkpointing = False
208
+
209
+ self.post_init()
210
+
211
+ def get_input_embeddings(self):
212
+ return self.embeddings
213
+
214
+ def set_input_embeddings(self, value):
215
+ self.embeddings = value
216
+
217
+ def forward(
218
+ self,
219
+ input_ids: Optional[torch.LongTensor] = None,
220
+ attention_mask: Optional[torch.Tensor] = None, # noqa
221
+ inputs_embeds: Optional[torch.FloatTensor] = None,
222
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
223
+ use_cache: Optional[bool] = None,
224
+ output_attentions: Optional[bool] = None,
225
+ output_hidden_states: Optional[bool] = None,
226
+ return_dict: Optional[bool] = None
227
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
228
+ if output_attentions:
229
+ warnings.warn("`GSAModel` does not support `output_attentions` for now; setting it to `False`.")
230
+ output_attentions = False
231
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
232
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
233
+ use_cache = use_cache if use_cache is not None else (self.config.use_cache if not self.training else False)
234
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
235
+
236
+ # retrieve input_ids and inputs_embeds
237
+ if input_ids is not None and inputs_embeds is not None:
238
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
239
+ if input_ids is None and inputs_embeds is None:
240
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
241
+
242
+ if inputs_embeds is None:
243
+ inputs_embeds = self.embeddings(input_ids)
244
+ hidden_states = inputs_embeds
245
+
246
+ if use_cache and not isinstance(past_key_values, Cache):
247
+ past_key_values = Cache.from_legacy_cache(past_key_values)
248
+
249
+ if self.gradient_checkpointing and self.training and use_cache:
250
+ logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...")
251
+ use_cache = False
252
+
253
+ all_hidden_states = () if output_hidden_states else None
254
+ all_attns = () if output_attentions else None
255
+
256
+ for i, layer in enumerate(self.layers):
257
+ if output_hidden_states:
258
+ all_hidden_states += (hidden_states,)
259
+
260
+ if self.gradient_checkpointing and self.training:
261
+ hidden_states, attentions, past_key_values = self._gradient_checkpointing_func(
262
+ layer.__call__,
263
+ hidden_states,
264
+ attention_mask,
265
+ past_key_values,
266
+ use_cache,
267
+ output_attentions,
268
+ )
269
+ else:
270
+ hidden_states, attentions, past_key_values = layer(
271
+ hidden_states,
272
+ attention_mask=attention_mask,
273
+ past_key_values=past_key_values,
274
+ use_cache=use_cache,
275
+ output_attentions=output_attentions
276
+ )
277
+
278
+ if output_attentions:
279
+ all_attns += (attentions,)
280
+
281
+ hidden_states = self.norm(hidden_states)
282
+
283
+ # add hidden states from the last decoder layer
284
+ if output_hidden_states:
285
+ all_hidden_states += (hidden_states,)
286
+
287
+ if not return_dict:
288
+ return tuple(i for i in [hidden_states, past_key_values, all_hidden_states, all_attns] if i is not None)
289
+ return BaseModelOutputWithPast(
290
+ last_hidden_state=hidden_states,
291
+ past_key_values=past_key_values,
292
+ hidden_states=all_hidden_states,
293
+ attentions=all_attns
294
+ )
295
+
296
+
297
+ class GSAForCausalLM(GSAPreTrainedModel, GenerationMixin):
298
+
299
+ _tied_weights_keys = ["lm_head.weight"]
300
+
301
+ def __init__(self, config):
302
+
303
+ super().__init__(config)
304
+ self.model = GSAModel(config)
305
+ self.vocab_size = config.vocab_size
306
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
307
+
308
+ # Initialize weights and apply final processing
309
+ self.post_init()
310
+
311
+ def get_input_embeddings(self):
312
+ return self.model.embeddings
313
+
314
+ def set_input_embeddings(self, value):
315
+ self.model.embeddings = value
316
+
317
+ def get_output_embeddings(self):
318
+ return self.lm_head
319
+
320
+ def set_output_embeddings(self, new_embeddings):
321
+ self.lm_head = new_embeddings
322
+
323
+ def set_decoder(self, decoder):
324
+ self.model = decoder
325
+
326
+ def get_decoder(self):
327
+ return self.model
328
+
329
+ def generate(self, *args, **kwargs):
330
+ try:
331
+ return super().generate(*args, **kwargs)
332
+ except AttributeError as exception:
333
+ if 'past_key_values' in str(exception):
334
+ raise AttributeError(
335
+ f"You tried to call `generate` with a decoding strategy that manipulates `past_key_values`, "
336
+ f"which is not supported for {self.__class__.__name__}. "
337
+ f"Try another generation strategy instead. "
338
+ f"For the available generation strategies, check this doc: "
339
+ f"https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies"
340
+ )
341
+ else:
342
+ raise exception
343
+
344
+ def prepare_inputs_for_generation(
345
+ self,
346
+ input_ids: torch.LongTensor = None,
347
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
348
+ attention_mask: Optional[torch.Tensor] = None,
349
+ inputs_embeds: Optional[torch.Tensor] = None,
350
+ use_cache: bool = True,
351
+ num_logits_to_keep: Optional[int] = None,
352
+ **kwargs
353
+ ):
354
+ # keep only the last token of `input_ids` if `past_key_values` is passed along.
355
+ if past_key_values is not None:
356
+ input_ids = input_ids[:, -1:]
357
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
358
+ if inputs_embeds is not None and past_key_values is None:
359
+ model_inputs = {'inputs_embeds': inputs_embeds}
360
+ else:
361
+ # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
362
+ # recompiles graphs as the stride of the inputs is a guard.
363
+ # Ref: https://github.com/huggingface/transformers/pull/29114
364
+ # TODO: use `next_tokens` directly instead.
365
+ model_inputs = {'input_ids': input_ids.contiguous()}
366
+
367
+ if num_logits_to_keep is not None:
368
+ model_inputs['num_logits_to_keep'] = num_logits_to_keep
369
+
370
+ model_inputs.update({
371
+ 'past_key_values': past_key_values,
372
+ 'use_cache': use_cache,
373
+ 'attention_mask': attention_mask,
374
+ 'num_logits_to_keep': num_logits_to_keep,
375
+ })
376
+ return model_inputs
377
+
378
+ def forward(
379
+ self,
380
+ input_ids: torch.LongTensor = None,
381
+ attention_mask: Optional[torch.Tensor] = None,
382
+ inputs_embeds: Optional[torch.Tensor] = None,
383
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
384
+ labels: Optional[torch.LongTensor] = None,
385
+ use_cache: Optional[bool] = None,
386
+ output_attentions: Optional[bool] = None,
387
+ output_hidden_states: Optional[bool] = None,
388
+ return_dict: Optional[bool] = None,
389
+ num_logits_to_keep: Optional[int] = 0
390
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
391
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
392
+ output_hidden_states = (
393
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
394
+ )
395
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
396
+
397
+ outputs = self.model(
398
+ input_ids=input_ids,
399
+ attention_mask=attention_mask,
400
+ inputs_embeds=inputs_embeds,
401
+ past_key_values=past_key_values,
402
+ use_cache=use_cache,
403
+ output_attentions=output_attentions,
404
+ output_hidden_states=output_hidden_states,
405
+ return_dict=return_dict
406
+ )
407
+
408
+ hidden_states = outputs[0]
409
+ fuse_linear_and_cross_entropy = self.config.fuse_cross_entropy and self.training
410
+ logits = None if fuse_linear_and_cross_entropy else self.lm_head(hidden_states[:, -num_logits_to_keep:])
411
+
412
+ loss = None
413
+ if labels is not None:
414
+ if self.config.fuse_cross_entropy:
415
+ if fuse_linear_and_cross_entropy:
416
+ loss_fct = FusedLinearCrossEntropyLoss()
417
+ else:
418
+ loss_fct = FusedCrossEntropyLoss(inplace_backward=True)
419
+ else:
420
+ loss_fct = nn.CrossEntropyLoss()
421
+ # Enable model parallelism
422
+ labels = labels.to(hidden_states.device)
423
+ labels = torch.cat((labels[..., 1:], torch.full_like(labels[:, :1], loss_fct.ignore_index)), 1)
424
+ if fuse_linear_and_cross_entropy:
425
+ loss = loss_fct(hidden_states.view(-1, self.config.hidden_size),
426
+ labels.view(-1),
427
+ self.lm_head.weight,
428
+ self.lm_head.bias)
429
+ else:
430
+ loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
431
+
432
+ if not return_dict:
433
+ output = (logits,) + outputs[1:]
434
+ return (loss,) + output if loss is not None else output
435
+
436
+ return CausalLMOutputWithPast(
437
+ loss=loss,
438
+ logits=logits,
439
+ past_key_values=outputs.past_key_values,
440
+ hidden_states=outputs.hidden_states,
441
+ attentions=outputs.attentions,
442
+ )
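And a minimal sketch of incremental decoding with the cache (again not part of this commit; it assumes CUDA with the `fla` Triton kernels and a randomly initialized model, so the output is meaningless but exercises the code path):

```python
# `prepare_inputs_for_generation` above feeds only the last token once
# `past_key_values` holds the recurrent state.
import torch

from fla.models.gsa import GSAConfig, GSAForCausalLM

config = GSAConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2, num_heads=2)
model = GSAForCausalLM(config).cuda().eval()

prompt = torch.randint(0, config.vocab_size, (1, 8), device='cuda')
out = model.generate(prompt, max_new_tokens=16, do_sample=False)
print(out.shape)  # (1, <=24): the prompt plus up to 16 new tokens
```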
fla/models/hgrn/__init__.py ADDED
@@ -0,0 +1,13 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
4
+
5
+ from fla.models.hgrn.configuration_hgrn import HGRNConfig
6
+ from fla.models.hgrn.modeling_hgrn import HGRNForCausalLM, HGRNModel
7
+
8
+ AutoConfig.register(HGRNConfig.model_type, HGRNConfig)
9
+ AutoModel.register(HGRNConfig, HGRNModel)
10
+ AutoModelForCausalLM.register(HGRNConfig, HGRNForCausalLM)
11
+
12
+
13
+ __all__ = ['HGRNConfig', 'HGRNForCausalLM', 'HGRNModel']
fla/models/hgrn/configuration_hgrn.py ADDED
@@ -0,0 +1,74 @@
1
+ # -*- coding: utf-8 -*-
2
+
3
+ from typing import Dict, Optional
4
+
5
+ from transformers.configuration_utils import PretrainedConfig
6
+
7
+
8
+ class HGRNConfig(PretrainedConfig):
9
+
10
+ model_type = 'hgrn'
11
+ keys_to_ignore_at_inference = ['past_key_values']
12
+
13
+ def __init__(
14
+ self,
15
+ attn_mode: str = "chunk",
16
+ hidden_size: int = 2048,
17
+ num_hidden_layers: int = 24,
18
+ expand_ratio: Optional[int] = 1,
19
+ use_short_conv: bool = False,
20
+ conv_size: int = 4,
21
+ use_lower_bound: bool = True,
22
+ max_position_embeddings: int = 2048,
23
+ hidden_ratio: Optional[int] = 4,
24
+ intermediate_size: Optional[int] = None,
25
+ hidden_act: str = "swish",
26
+ elementwise_affine: Optional[bool] = True,
27
+ norm_eps: float = 1e-6,
28
+ attn: Optional[Dict] = None,
29
+ use_cache: bool = True,
30
+ pad_token_id: int = None,
31
+ bos_token_id: int = 1,
32
+ eos_token_id: int = 2,
33
+ tie_word_embeddings: bool = False,
34
+ initializer_range: float = 0.02,
35
+ fuse_cross_entropy: bool = True,
36
+ vocab_size: int = 32000,
37
+ **kwargs
38
+ ):
39
+ self.attn_mode = attn_mode
40
+ self.hidden_size = hidden_size
41
+ self.num_hidden_layers = num_hidden_layers
42
+ self.expand_ratio = expand_ratio
43
+ self.use_short_conv = use_short_conv
44
+ self.conv_size = conv_size
45
+ self.use_lower_bound = use_lower_bound
46
+ self.max_position_embeddings = max_position_embeddings
47
+ self.hidden_ratio = hidden_ratio
48
+ self.intermediate_size = intermediate_size
49
+ self.elementwise_affine = elementwise_affine
50
+ self.attn = attn
51
+ self.norm_eps = norm_eps
52
+ self.hidden_act = hidden_act
53
+ self.use_cache = use_cache
54
+ self.initializer_range = initializer_range
55
+ self.fuse_cross_entropy = fuse_cross_entropy
56
+ self.vocab_size = vocab_size
57
+
58
+ if attn is not None:
59
+ if not isinstance(attn, Dict):
60
+ raise ValueError("attn must be a dictionary")
61
+ if 'layers' not in attn:
62
+ raise ValueError("Layer indices must be provided to initialize hybrid attention layers")
63
+ if 'num_heads' not in attn:
64
+ raise ValueError("Number of heads must be provided to initialize hybrid attention layers")
65
+ attn['num_kv_heads'] = attn.get('num_kv_heads', attn['num_heads'])
66
+ attn['window_size'] = attn.get('window_size', None)
67
+
68
+ super().__init__(
69
+ pad_token_id=pad_token_id,
70
+ bos_token_id=bos_token_id,
71
+ eos_token_id=eos_token_id,
72
+ tie_word_embeddings=tie_word_embeddings,
73
+ **kwargs,
74
+ )
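Finally, a small sketch (not part of this commit) showing that `HGRNConfig` round-trips through the standard 🤗 serialization APIs:

```python
from fla.models.hgrn import HGRNConfig

config = HGRNConfig(hidden_size=512, num_hidden_layers=8, expand_ratio=1)
config.save_pretrained('hgrn-tiny')   # writes hgrn-tiny/config.json with model_type='hgrn'
reloaded = HGRNConfig.from_pretrained('hgrn-tiny')
assert reloaded.hidden_size == 512 and reloaded.attn_mode == 'chunk'
```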