awni00/DAT-sa16-ra16-nr128-ns2048-sh16-nkvh8-1.27B

awni00 committed on Aug 20, 2024

Commit ddc2823 (verified) · 1 Parent(s): 33c1ff7

Upload README.md with huggingface_hub

Files changed (1): README.md +84 -3

@@ -1,9 +1,90 @@
 ---
 tags:
 - model_hub_mixin
 - pytorch_model_hub_mixin
 ---
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Library: [More Information Needed]
-- Docs: [More Information Needed]

 ---
+language: en
+license: mit
+pipeline_tag: text-generation
 tags:
 - model_hub_mixin
 - pytorch_model_hub_mixin
+dataset: HuggingFaceFW/fineweb-edu
 ---
+# DAT-sa16-ra16-nr128-ns2048-sh16-nkvh8-1.27B
+<!-- Provide a quick summary of what the model is/does. -->
+This is a Dual Attention Transformer (DAT) language model trained on the `fineweb-edu` dataset. The model has roughly 1B parameters (1.27B according to the model name).
+## Model Details
+| Size | Training Tokens | Layers | Model Dimension | Self-Attention Heads | Relational Attention Heads | Relation Dimension | Context Length |
+|--|--|--|--|--|--|--|--|
+| 1B | 10B | 24 | 2048 | 16 | 16 | 128 | 1024 |
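Several values in the table above also appear in the model name. The sketch below is illustrative only: `parse_model_name` is a hypothetical helper, not part of the `dual-attention` package. It splits the name into its key/number fields. Reading `sa`, `ra`, and `nr` as self-attention heads, relational attention heads, and relation dimension is an inference from the matching table values, not something the card states; `ns`, `sh`, and `nkvh` are left uninterpreted.

```python
import re

MODEL_NAME = "DAT-sa16-ra16-nr128-ns2048-sh16-nkvh8-1.27B"

def parse_model_name(name):
    """Split a hyphen-separated model name into prefix, key/number fields, and size tag."""
    parts = name.split("-")
    prefix, size_tag = parts[0], parts[-1]
    fields = {}
    for part in parts[1:-1]:
        # Each middle field is lowercase letters followed by an integer, e.g. "sa16".
        match = re.fullmatch(r"([a-z]+)(\d+)", part)
        if match is None:
            raise ValueError(f"unexpected field: {part!r}")
        fields[match.group(1)] = int(match.group(2))
    return prefix, fields, size_tag

prefix, fields, size_tag = parse_model_name(MODEL_NAME)

# Assumed correspondence with the table (inferred from matching values only).
print(fields["sa"], fields["ra"], fields["nr"])  # 16 16 128
print(size_tag)                                  # 1.27B
```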
+### Model Description
+- **Developed by:** Awni Altabaa, John Lafferty
+- **Model type:** Decoder-only Dual Attention Transformer
+- **Tokenizer:** GPT-2 BPE tokenizer
+- **Language(s):** English
+<!-- - **License:** MIT -->
+<!-- - **Contact:** awni.altabaa@yale.edu -->
+- **Date:** August 2024
+### Model Sources
+- **Repository:** https://github.com/Awni00/abstract_transformer
+- **Paper:** [Disentangling and Integrating Relational and Sensory Information in Transformer Architectures](https://arxiv.org/abs/2405.16727)
+- **Hugging Face Collection:** [Dual Attention Transformer Collection](https://huggingface.co/collections/awni00/dual-attention-transformer-66c23425a545b0cefe4b9489)
+## Model Usage
+Use the code below to get started with the model. First, install the `dual-attention` [Python package hosted on PyPI](https://pypi.org/project/dual-attention/) via `pip install dual-attention`.
+To load the model directly from the Hugging Face Hub, use the `DualAttnTransformerLM_HFHub` wrapper.
+```python
+from dual_attention.hf import DualAttnTransformerLM_HFHub
+
+# Download the pretrained weights from the Hub and instantiate the model
+model = DualAttnTransformerLM_HFHub.from_pretrained('awni00/DAT-sa16-ra16-nr128-ns2048-sh16-nkvh8-1.27B')
+```
+## Training Details
+The model was trained using the following setup:
+- **Architecture:** Decoder-only Dual Attention Transformer
+- **Framework:** PyTorch
+- **Optimizer:** AdamW
+- **Learning Rate:** 6e-4 (peak)
+- **Weight Decay:** 0.1
+- **Batch Size:** 524,288 tokens
+- **Sequence Length:** 1024 tokens
+- **Total Training Tokens:** 10B tokens
+For more detailed training information, please refer to the paper.
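The batch size, sequence length, and token budget listed above imply a sequence count per batch and an approximate number of optimizer steps. The short calculation below works these out. It assumes "10B" means exactly 10^10 tokens and that every step consumes one full batch; the card does not state the actual step count, so treat the result as an estimate.

```python
# Figures taken from the training setup listed above.
BATCH_TOKENS = 524_288           # tokens per optimizer step
SEQ_LEN = 1_024                  # tokens per sequence
TOTAL_TOKENS = 10_000_000_000    # "10B", read as exactly 10^10 (assumption)

# Number of 1024-token sequences packed into one batch.
sequences_per_batch = BATCH_TOKENS // SEQ_LEN

# Approximate number of optimizer steps needed to consume the token budget.
approx_steps = TOTAL_TOKENS / BATCH_TOKENS

print(sequences_per_batch)    # 512
print(round(approx_steps))    # 19073
```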
+## Evaluation
+See the paper linked under Model Sources for evaluation results.
+## Model Interpretability Analysis
+The [DAT-LM-Visualization app](https://huggingface.co/spaces/awni00/DAT-LM-Visualization/) visualizes the representations learned by a Dual Attention Transformer language model. It is hosted on Hugging Face Spaces using their free CPU resources. You can select a pre-trained DAT-LM model, enter a prompt, and visualize the internal representations in different parts of the model. You can also run the app locally (e.g., to use your own GPU) via the PyPI package.
+See the paper for further interpretability analysis.
+## Citation
+```
+@misc{altabaa2024disentanglingintegratingrelationalsensory,
+      title={Disentangling and Integrating Relational and Sensory Information in Transformer Architectures},
+      author={Awni Altabaa and John Lafferty},
+      year={2024},
+      eprint={2405.16727},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2405.16727},
+}
+```