ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).
It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases (ALiBi).
- Licensed for commercial use
- Able to handle extremely long inputs thanks to ALiBi
- Capable of fast training and inference (via FlashAttention and FasterTransformer)
- Equipped with highly efficient open-source training code via the llm-foundry repository
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    'efederici/ipt-350m',
    trust_remote_code=True
)
```
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because it uses a custom model architecture that is not yet part of the `transformers` package.

To use the optimized Triton implementation of FlashAttention, you can load the model on GPU with `attn_impl='triton'` and with `bfloat16` precision:
```python
import torch
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'  # initialize weights directly on GPU

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
```
Although the model was trained with a sequence length of 2048, ALiBi allows users to increase the maximum sequence length during finetuning and/or inference. For example:
```python
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096  # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True
)
```
The architecture is a modification of a standard decoder-only transformer, differing from a standard transformer in the following ways:
- It uses FlashAttention
- It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings
- It does not use biases
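Because positional embeddings are removed, position information comes entirely from ALiBi's per-head linear biases on the attention scores. As a minimal sketch of the standard ALiBi formulation (an illustration, not this model's actual implementation):

```python
def alibi_slopes(n_heads):
    # Standard ALiBi slopes: a geometric sequence starting at 2^(-8/n_heads).
    start = 2 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def alibi_bias(n_heads, seq_len):
    # Bias added to attention scores before softmax:
    # slope * -(distance from key to query), zero for future (masked) positions.
    slopes = alibi_slopes(n_heads)
    return [
        [[slope * -(q - k) if k <= q else 0.0 for k in range(seq_len)]
         for q in range(seq_len)]
        for slope in slopes
    ]

bias = alibi_bias(n_heads=8, seq_len=4)
# For head 0 (slope 0.5), the query at position 3 sees biases [-1.5, -1.0, -0.5, 0.0]
```

Because the bias depends only on the distance between positions, it extends naturally to sequence lengths longer than those seen in training, which is what makes the `max_seq_len = 4096` override above possible.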
The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on OSCAR-2301. Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
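The packing of dataset sequences into fixed-length examples can be sketched as follows (a simplified illustration, not the actual training pipeline; a real implementation would also handle document separators and the final partial buffer):

```python
def pack_sequences(token_sequences, max_len=2048):
    # Concatenate tokenized documents into fixed-length training examples,
    # so every example is exactly max_len tokens with no padding.
    buffer, examples = [], []
    for seq in token_sequences:
        buffer.extend(seq)
        while len(buffer) >= max_len:
            examples.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return examples  # leftover tokens in `buffer` are dropped in this sketch

packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=4)
# packed == [[1, 2, 3, 4], [5, 6, 7, 8]]
```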
The vocabulary size is 50432, a multiple of 128 as suggested in MEGATRON-LM; this padding increased model FLOP utilization (MFU) by up to four percentage points.
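Padding the vocabulary keeps the embedding and output-projection matrix dimensions aligned with GPU tile sizes. The rounding itself is simple arithmetic (the 50,400-token base vocabulary below is purely illustrative):

```python
def pad_vocab(vocab_size, multiple=128):
    # Round the vocabulary size up to the nearest multiple (Megatron-LM style).
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(50400))  # any base vocabulary in 50305..50432 pads to 50432
```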