Gated State Space

This repo contains pretrain model for the gated state space paper. The model has been trained on C4 dataset. I have used Lucidrains' implementation (commit) for the model. I think the main benefit of this model is the ability to scale beyond the training context length. As authors noted in the paper, they trained the model on 4k sequence length but it generalized beyond that length. I have written a blog post on how I started the training here.

Wandb Report is available at this link

How to use this.

Since it is not based on transformers library, it is a bit tricky to use the model out of the box. Here are the general steps:

pip install gated-state-spaces-pytorch
Download the model weights from here.
Download the config from here.
Following code to patch the original model:

    model = AutoregressiveWrapper(
        GatedStateSpacesLM(
            **config
        ),
    )
    model.net.to_logits = nn.Sequential(
        nn.LayerNorm(f_emb),
        model.net.to_logits,
    )

Load the state dict: model.load_state_dict(torch.load('model.pt'))
If you want to fine-tune the model, you can freeze the embeddings:

    model.net.token_emb.weight.requires_grad_(False)
    model.net.token_emb.weight.copy_(emb)

    model.net.to_logits[1].weight.requires_grad_(False)
    model.net.to_logits[1].weight.copy_(emb)

Training code is available in this repo. Link to the training script.

Training Information

Here are the details of the training:

Objective: Alternate between simple cross entropy and GPT-2 XL distillation
Gradient Accumulation: 4
Batch Size: 8
Sequence Length 128
Learning Rate: 2e-5
Optimizer: AdamW
Gradient Norm Clipping: 1.0
Hardware: RTX 3090 on vast.ai
Training Cost: ~20$
Training Time: ~3 days
Number of steps: 557,000
Tokens seen: 570 million
Final loss: ~3.9

Fine-Tuning Info:

model2.pt is available as fine-tuned version with longer context length.

Objective: Simple Cross Entropy
Gradient Accumulation: 4
Batch Size: 1
Sequence Length: 2048
Learning Rate: 5e-6
Embeddings: unfrozen for fine-tuning
Gradient Norm Clipping: 1.0
Hardware: 2x3090 on vast.ai
Extra Tricks: Used HuggingFace Accelerate with Full Sharding without CPU offload

naxalpha
/

gated-state-space

Gated State Space

How to use this.

Training Information

Fine-Tuning Info:

Dataset used to train naxalpha/gated-state-space