metadata

language: en
license: mit
pipeline_tag: text-generation
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
dataset: HuggingFaceFW/fineweb-edu

DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M

This is a Dual Attention Transformer language model trained on the fineweb-edu dataset. It has approximately 344M parameters.

Model Details

Size	Training Tokens	Layers	Model Dimension	Self-Attention Heads	Relational Attention Heads	Relation Dimension	Context Length
344M	10B	24	1024	8	8	32	1024
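The model name appears to encode several of these hyperparameters: the fields sa8, ra8, and nr32 line up with the self-attention heads, relational attention heads, and relation dimension in the table above. The meaning of the remaining fields (ns, sh, nkvh) is not stated in this card, so the sketch below only splits the name into raw key/value pairs without interpreting them. The helper parse_model_name is hypothetical and is not part of the dual-attention package.

```python
import re


def parse_model_name(name: str) -> dict:
    """Split a DAT model name into its <letters><integer> fields.

    Tokens that do not match that pattern (such as the 'DAT' prefix and
    the '343M' size suffix) are skipped. This is an illustrative helper,
    not an API of the dual-attention package.
    """
    fields = {}
    for token in name.split("-"):
        match = re.fullmatch(r"([a-z]+)(\d+)", token)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


config = parse_model_name("DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M")
print(config)
```

Comparing config["sa"], config["ra"], and config["nr"] against the table is a quick sanity check when choosing between checkpoints in the collection.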

Model Description

Developed by: Awni Altabaa, John Lafferty
Model type: Decoder-only Dual Attention Transformer
Tokenizer: GPT-2 BPE tokenizer
Language(s): English
Date: August, 2024

Model Sources

Repository: https://github.com/Awni00/abstract_transformer
Paper: Disentangling and Integrating Relational and Sensory Information in Transformer Architectures
Huggingface Collection: Dual Attention Transformer Collection

Model Usage

Use the code below to get started with the model. First, install the dual-attention Python package from PyPI with pip install dual-attention.

To load the model directly from the Hugging Face Hub, use the HFHub wrapper.

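from dual_attention.hf import DualAttnTransformerLM_HFHub

# Download the pretrained weights from the Hugging Face Hub and keep a reference to the model
model = DualAttnTransformerLM_HFHub.from_pretrained('awni00/DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M')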

Training Details

The model was trained using the following setup:

Architecture: Decoder-only Dual Attention Transformer
Framework: PyTorch
Optimizer: AdamW
Learning Rate: 6e-4 (peak)
Weight Decay: 0.1
Batch Size: 524,288 Tokens
Sequence Length: 1024 tokens
Total Training Tokens: 10B Tokens
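As a rough check on how these numbers fit together, the sketch below derives the number of sequences per batch and the approximate number of optimizer steps. It assumes "10B" means exactly 10^10 tokens and that every step consumes one full batch; the actual step count used in training is not stated in this card.

```python
# Values taken from the training setup listed above.
batch_size_tokens = 524_288      # tokens per batch (2**19)
sequence_length = 1024           # tokens per sequence
total_tokens = 10_000_000_000    # assumption: "10B" read as exactly 1e10

# Each batch is made up of fixed-length sequences.
sequences_per_batch = batch_size_tokens // sequence_length

# Approximate number of optimizer steps needed to see all training tokens.
approx_steps = total_tokens / batch_size_tokens

print(f"sequences per batch: {sequences_per_batch}")
print(f"approximate steps:   {approx_steps:,.0f}")
```

Under these assumptions each batch holds 512 sequences and training takes roughly 19,000 steps.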

For more detailed training information, please refer to the paper.

Evaluation

See the paper for evaluation results.

Model Interpretability Analysis

The DAT-LM-Visualization app is built to visualize the representations learned in a Dual Attention Transformer language model. It is hosted on Huggingface spaces using their free CPU resources. You can select a pre-trained DAT-LM model, enter a prompt, and visualize the internal representations in different parts of the model. You can also run the app locally (e.g., to use your own GPU) via the PyPI package.

See the paper for further interpretability analysis.

Citation

@misc{altabaa2024disentanglingintegratingrelationalsensory,
      title={Disentangling and Integrating Relational and Sensory Information in Transformer Architectures}, 
      author={Awni Altabaa and John Lafferty},
      year={2024},
      eprint={2405.16727},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.16727},
}