|
--- |
|
license: apache-2.0 |
|
tags: |
|
- ipt |
|
- alibi |
|
inference: false |
|
datasets: |
|
- oscar-corpus/OSCAR-2301 |
|
language: |
|
- it |
|
--- |
|
|
|
# ipt-350m |
|
|
|
ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).
|
|
|
It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)). |
|
|
|
ipt-350m is: |
|
- **Licensed for commercial use** (Apache-2.0)
|
- **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409). |
|
- **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)) |
|
- **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry) |
|
|
|
If you find this project useful, consider supporting its development: |
|
[![Buy me a coffee](https://badgen.net/badge/icon/Buy%20Me%20A%20Coffee?icon=buymeacoffee&label)](https://bmc.link/edoardofederici) |
|
|
|
## How to Use |
|
|
|
```python |
|
import transformers |
|
model = transformers.AutoModelForCausalLM.from_pretrained( |
|
'efederici/ipt-350m', |
|
trust_remote_code=True |
|
) |
|
``` |
|
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. |
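
Once loaded, the model can be used for text generation. Below is a minimal sketch, assuming the repository also provides a tokenizer loadable via `AutoTokenizer`:

```python
import transformers

name = 'efederici/ipt-350m'

# Tokenizer assumed to ship with the repository
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
model = transformers.AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# Generate a short Italian continuation
inputs = tokenizer("L'Italia è un paese", return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```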
|
|
|
To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision: |
|
```python |
|
import torch |
|
import transformers |
|
|
|
name = 'efederici/ipt-350m' |
|
|
|
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True) |
|
config.attn_config['attn_impl'] = 'triton' |
|
config.init_device = 'cuda:0' |
|
|
|
model = transformers.AutoModelForCausalLM.from_pretrained( |
|
name, |
|
config=config, |
|
torch_dtype=torch.bfloat16, |
|
trust_remote_code=True |
|
) |
|
``` |
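
Continuing from the snippet above, generation on GPU can then be run, for example, through a `text-generation` pipeline (again assuming a tokenizer is available in the same repository):

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(name)

# `model` is the bfloat16 model loaded on cuda:0 above
pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    device='cuda:0',
)

print(pipe("L'intelligenza artificiale è", max_new_tokens=50)[0]['generated_text'])
```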
|
|
|
Although the model was trained with a sequence length of 2048, ALiBi allows increasing the maximum sequence length during finetuning and/or inference.
|
|
|
```python |
|
import transformers |
|
|
|
name = 'efederici/ipt-350m' |
|
|
|
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True) |
|
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096 |
|
|
|
model = transformers.AutoModelForCausalLM.from_pretrained( |
|
name, |
|
config=config, |
|
trust_remote_code=True |
|
) |
|
``` |
|
|
|
## Model Description |
|
|
|
The architecture is a modification of a standard decoder-only transformer. |
|
|
|
It differs from a standard transformer in the following ways:
|
- It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) |
|
- It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings (see the sketch after this list)
|
- It does not use biases |
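
For intuition, here is a minimal, illustrative sketch of the ALiBi bias referenced above (not the model's actual implementation, which lives in the repository's remote code): a head-specific linear penalty on key-query distance is added to the pre-softmax attention scores in place of positional embeddings.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) linear bias added to attention logits."""
    # Head-specific slopes: a geometric sequence 2^(-8/n_heads), 2^(-16/n_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # Relative distance (key index - query index); non-positive for causally visible keys
    distance = (pos[None, :] - pos[:, None]).float()
    # More distant keys receive a larger negative penalty, scaled per head
    return slopes[:, None, None] * distance[None, :, :]

# Usage (conceptually): scores = q @ k.transpose(-1, -2) / d_head**0.5 + alibi_bias(n_heads, seq_len)
```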
|
|
|
| Hyperparameter  | Value |
|-----------------|-------|
| n_parameters    | 350M  |
| n_layers        | 24    |
| n_heads         | 16    |
| d_model         | 1024  |
| vocab size      | 50432 |
| sequence length | 2048  |
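
These values can also be read back from the loaded config; the attribute names below assume the MPT-style config used by [llm-foundry](https://github.com/mosaicml/llm-foundry):

```python
import transformers

config = transformers.AutoConfig.from_pretrained('efederici/ipt-350m', trust_remote_code=True)
print(config.n_layers, config.n_heads, config.d_model, config.max_seq_len, config.vocab_size)
```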
|
|
|
### Dataset |
|
|
|
The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301). |
|
Each training example was constructed by concatenating as many sequences from the dataset as were needed to fill the 2048-token sequence length.
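
As an illustration only (not the actual preprocessing code), that packing scheme can be sketched as follows:

```python
def pack(tokenized_docs, seq_len=2048):
    """Concatenate tokenized documents and slice the resulting stream
    into fixed-length training examples of `seq_len` tokens."""
    buffer = []
    for tokens in tokenized_docs:  # each `tokens` is one tokenized document
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]       # one 2048-token training example
            buffer = buffer[seq_len:]
```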
|
|
|
The vocabulary size is 50432, a multiple of 128 as suggested in [MEGATRON-LM](https://arxiv.org/abs/1909.08053), which can increase model FLOP utilization (MFU) by up to four percentage points.