---
license: apache-2.0
tags:
- ipt
- alibi
- text-generation-inference
- text generation

inference: false
datasets:
- oscar-corpus/OSCAR-2301
language:
- it

pipeline_tag: text-generation
---

# ipt-350m

ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).

It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).

ipt-350m is:
- **Licensed for commercial use** (Apache-2.0)
- **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409)
- **Capable of fast training and inference** via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)

If you find this project useful, consider supporting its development:
[![Buy me a coffee](https://badgen.net/badge/icon/Buy%20Me%20A%20Coffee?icon=buymeacoffee&label)](https://bmc.link/edoardofederici)

## How to Use

```python
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'efederici/ipt-350m',
  trust_remote_code=True
)
```
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
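
Once loaded, the model can be used for causal text generation in the usual way. The sketch below is illustrative only: it assumes the repository ships a tokenizer loadable via `AutoTokenizer` (if not, substitute the tokenizer the model was actually trained with), and the prompt and sampling settings are arbitrary.
```python
import torch
import transformers

name = 'efederici/ipt-350m'

# Assumption: the repo provides a compatible tokenizer; otherwise swap in
# the tokenizer the model was trained with.
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

inputs = tokenizer("L'Italia è un paese", return_tensors='pt')
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```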

To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
```python
import torch
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True
)
```

Although the model was trained with a sequence length of 2048, ALiBi allows you to increase the maximum sequence length during finetuning and/or inference.

```python
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True
)
```

## Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:
- It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
- It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings
- It does not use biases

| Hyperparameter | Value |
|----------------|-------|
| n_parameters | 350M |
| n_layers | 24 |
| n_heads | 16 |
| d_model | 1024 |
| vocab size | 50432 |
| sequence length | 2048 |

### Dataset

The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301).
Each example was constructed by concatenating as many sequences from that dataset as were necessary to fill the 2048-token sequence length.
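
As a rough sketch of that packing step (this is not the actual llm-foundry preprocessing code, and the EOS-separator convention is an assumption), tokenized documents can be concatenated and sliced into fixed-length blocks:
```python
from typing import Iterable, Iterator, List

def pack_examples(token_streams: Iterable[List[int]],
                  max_seq_len: int = 2048,
                  eos_token_id: int = 0) -> Iterator[List[int]]:
    """Concatenate tokenized documents (separated by an EOS token) and emit
    fixed-length blocks of max_seq_len tokens; the final partial block is dropped."""
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_token_id)
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]
```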

The vocabulary size is 50432, a multiple of 128 as suggested in [MEGATRON-LM](https://arxiv.org/abs/1909.08053); padding the vocabulary to a multiple of 128 can increase model FLOP utilization (MFU) by up to four percentage points.
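
For illustration only (the base tokenizer size is not stated here), padding to a multiple of 128 is a simple round-up:
```python
def round_up_to_multiple(n: int, multiple: int = 128) -> int:
    """Round n up to the nearest multiple, e.g. to pad the vocabulary so the
    embedding and output-projection shapes are GPU-friendly."""
    return ((n + multiple - 1) // multiple) * multiple

print(50432 % 128)                  # 0 -> 50432 is already a multiple of 128 (394 * 128)
print(round_up_to_multiple(50257))  # 50304 -> how an arbitrary tokenizer size would be padded
```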