|
--- |
|
license: apache-2.0 |
|
tags: |
|
- ipt |
|
- alibi |
|
inference: false |
|
datasets: |
|
- oscar-corpus/OSCAR-2301 |
|
language: |
|
- it |
|
--- |
|
|
|
# ipt-350m |
|
|
|
ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).
|
|
|
It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)). |
|
|
|
ipt-350m is: |
|
- **Licensed for commercial use** (Apache-2.0)
|
- **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409). |
|
- **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)) |
|
- **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry) |
|
|
|
If you find this project useful, consider supporting its development: |
|
[![Buy me a coffee](https://badgen.net/badge/icon/Buy%20Me%20A%20Coffee?icon=buymeacoffee&label)](https://bmc.link/edoardofederici) |
|
|
|
## How to Use |
|
|
|
```python |
|
import transformers |
|
model = transformers.AutoModelForCausalLM.from_pretrained( |
|
'efederici/ipt-350m', |
|
trust_remote_code=True |
|
) |
|
``` |
|
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. |
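
Once loaded, the model can be used for text generation. Below is a minimal sketch, assuming the repository also provides a tokenizer loadable via `AutoTokenizer`:

```python
import transformers

name = 'efederici/ipt-350m'

# Tokenizer assumed to ship with the repository
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
model = transformers.AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# Generate a short Italian continuation
inputs = tokenizer("L'Italia è un paese", return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```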
|
|
|
To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision: |
|
```python |
|
import torch |
|
import transformers |
|
|
|
name = 'efederici/ipt-350m' |
|
|
|
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True) |
|
config.attn_config['attn_impl'] = 'triton' |
|
config.init_device = 'cuda:0' |
|
|
|
model = transformers.AutoModelForCausalLM.from_pretrained( |
|
name, |
|
config=config, |
|
torch_dtype=torch.bfloat16, |
|
trust_remote_code=True |
|
) |
|
``` |
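
Continuing from the snippet above, generation on GPU can then be run, for example, through a `text-generation` pipeline (again assuming a tokenizer is available in the same repository):

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(name)

# `model` is the bfloat16 model loaded on cuda:0 above
pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    device='cuda:0',
)

print(pipe("L'intelligenza artificiale è", max_new_tokens=50)[0]['generated_text'])
```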
|
|
|
Although the model was trained with a sequence length of 2048, ALiBi allows increasing the maximum sequence length during finetuning and/or inference.
|
|
|
```python |
|
import transformers |
|
|
|
name = 'efederici/ipt-350m' |
|
|
|
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True) |
|
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096 |
|
|
|
model = transformers.AutoModelForCausalLM.from_pretrained( |
|
name, |
|
config=config, |
|
trust_remote_code=True |
|
) |
|
``` |
|
|
|
## Model Description |
|
|
|
The architecture is a modification of a standard decoder-only transformer. |
|
|
|
It differs from a standard transformer in the following ways:
|
- It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) |
|
- It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings (see the sketch after this list)
|
- It does not use biases |
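
For intuition, here is a minimal, illustrative sketch of the ALiBi bias referenced above (not the model's actual implementation, which lives in the repository's remote code): a head-specific linear penalty on key-query distance is added to the pre-softmax attention scores in place of positional embeddings.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) linear bias added to attention logits."""
    # Head-specific slopes: a geometric sequence 2^(-8/n_heads), 2^(-16/n_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # Relative distance (key index - query index); non-positive for causally visible keys
    distance = (pos[None, :] - pos[:, None]).float()
    # More distant keys receive a larger negative penalty, scaled per head
    return slopes[:, None, None] * distance[None, :, :]

# Usage (conceptually): scores = q @ k.transpose(-1, -2) / d_head**0.5 + alibi_bias(n_heads, seq_len)
```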
|
|
|
| Hyperparameter  | Value |
|-----------------|-------|
| n_parameters    | 350M  |
| n_layers        | 24    |
| n_heads         | 16    |
| d_model         | 1024  |
| vocab size      | 50432 |
| sequence length | 2048  |
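
These values can also be read back from the loaded config; the attribute names below assume the MPT-style config used by [llm-foundry](https://github.com/mosaicml/llm-foundry):

```python
import transformers

config = transformers.AutoConfig.from_pretrained('efederici/ipt-350m', trust_remote_code=True)
print(config.n_layers, config.n_heads, config.d_model, config.max_seq_len, config.vocab_size)
```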
|
|
|
### Dataset |
|
|
|
The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301). |
|
Each training example was constructed by concatenating as many sequences from the dataset as were needed to fill the 2048-token sequence length.
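
As an illustration only (not the actual preprocessing code), that packing scheme can be sketched as follows:

```python
def pack(tokenized_docs, seq_len=2048):
    """Concatenate tokenized documents and slice the resulting stream
    into fixed-length training examples of `seq_len` tokens."""
    buffer = []
    for tokens in tokenized_docs:  # each `tokens` is one tokenized document
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]       # one 2048-token training example
            buffer = buffer[seq_len:]
```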
|
|
|
The vocabulary size is 50432, a multiple of 128 as suggested in [MEGATRON-LM](https://arxiv.org/abs/1909.08053), which can increase model FLOP utilization (MFU) by up to four percentage points.