---
license: apache-2.0
---
The *TokenFormer* is a **fully attention-based architecture**
that unifies the computations of token-token and token-parameter interactions
by employing the attention mechanism throughout, **maximizing the flexibility of the neural network** ([see paper](https://arxiv.org/pdf/2410.23168)).
The family contains four model sizes:
150M, 450M, 900M, and 1.5B parameters. Each size is trained with the [gpt-neox](https://github.com/EleutherAI/gpt-neox) code base on 300B tokens from [the Pile](https://huggingface.co/datasets/EleutherAI/pile).
All four model sizes are trained on the exact
same data, in the exact same order.
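
To make the architecture description above concrete, the sketch below shows a token-parameter attention layer in the spirit of the paper's Pattention: input tokens attend to learnable key/value "parameter tokens" instead of being multiplied by fixed weight matrices. This is a minimal PyTorch illustration with a plain softmax and made-up dimensions; the paper uses a modified normalization and integrates the layer throughout the Transformer block, so refer to the repository for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Token-parameter attention sketch: tokens attend to learnable key/value parameter tokens.

    Simplified for illustration; the official implementation differs in normalization and initialization.
    """
    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable "parameter tokens" play the role of keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        scores = x @ self.key_params.t()                    # (batch, seq_len, num_param_tokens)
        weights = F.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
        return weights @ self.value_params                  # (batch, seq_len, d_out)

# Example: stand-in for a dense projection (dimensions are illustrative only).
layer = Pattention(d_in=768, d_out=768, num_param_tokens=576)
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```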
# TokenFormer-150M
## Model Details
- Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
- Model type: TokenFormer-based Language Model
- Language: English
- Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer)
for the training procedure, config files, and details on how to use the models.
[See the paper](https://arxiv.org/pdf/2410.23168) for more evaluations and implementation
details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: To ask questions about this model, please email Haiyang Wang.
## Training
### Training data
[The Pile](https://pile.eleuther.ai/) is a 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).
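
If you want to peek at the training data, the snippet below streams a few documents with the Hugging Face `datasets` library. It assumes the `EleutherAI/pile` dataset linked above is reachable from your environment; otherwise use the official website or the community mirror.

```python
from datasets import load_dataset

# Stream the Pile instead of downloading all 825 GiB up front.
# Assumes the EleutherAI/pile dataset referenced above is still accessible.
pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

for i, example in enumerate(pile):
    print(example["text"][:200])  # each record is a raw text document
    if i == 2:
        break
```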
### Training procedure
We follow the default training strategy of [Pythia](https://arxiv.org/abs/2304.01373) in [gpt-neox](https://github.com/EleutherAI/gpt-neox),
including the dataset processing, hyper-parameters, and code base.
All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training.
All *TokenFormer* models were trained for 143,000 steps at a batch size
of 2M (2,097,152) tokens.
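
As a quick sanity check, the step count and batch size multiply out to the total token count reported above:

```python
# 143,000 optimizer steps at 2,097,152 tokens per step
steps = 143_000
tokens_per_step = 2_097_152
print(steps * tokens_per_step)  # 299892736000 tokens, i.e. ~300B
```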
See [GitHub](https://github.com/Haiyang-W/TokenFormer) for more details on the training
procedure.
TokenFormer uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
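
Because the tokenizer is shared with GPT-NeoX-20B, it can be loaded directly from that repository with `transformers` (a minimal sketch):

```python
from transformers import AutoTokenizer

# TokenFormer reuses GPT-NeoX-20B's tokenizer, so the published tokenizer works as-is.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
ids = tokenizer("TokenFormer unifies token-token and token-parameter interactions.")["input_ids"]
print(ids)
```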
## Evaluations
All *TokenFormer* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness).
You can run the evaluation by following our [instructions](https://github.com/Haiyang-W/TokenFormer?tab=readme-ov-file#evaluations).
Expand the sections below to see plots of evaluation results for all
TokenFormer models compared with open-source Transformer-based LLMs.
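
For reference, a generic harness call through its Python API looks like the sketch below; the checkpoint path is a placeholder, and the exact tasks and model-loading steps for TokenFormer are given in the linked instructions.

```python
import lm_eval

# Generic LM Evaluation Harness invocation (v0.4+ API). The pretrained path is a
# placeholder; follow the TokenFormer README for the actual checkpoints and tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/tokenformer-150m",
    tasks=["lambada_openai", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```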