ipt-350m / README.md
efederici's picture
Update README.md
339d641
metadata
license: apache-2.0
tags:
  - ipt
  - alibi
  - text-generation-inference
  - text generation
inference: false
datasets:
  - oscar-corpus/OSCAR-2301
language:
  - it
pipeline_tag: text-generation

ipt-350m

ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (wip: trained on unfiltered oscar).

It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases (ALiBi).

ipt-350m is:

If you find this project useful, consider supporting its development: Buy me a coffee

How to Use

import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'efederici/ipt-350m',
  trust_remote_code=True
)

Note: This model requires that trust_remote_code=True be passed to the from_pretrained method.

To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0) with attn_impl='triton' and with bfloat16 precision:

import torch
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True
)

Although the model was trained with a sequence length of 2048, ALiBi enables to increase the maximum sequence length during finetuning and/or inference.

import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True
)

Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:

Hyperparameter Value
n_parameters 350M
n_layers 24
n_heads 16
d_model 1024
vocab size 50432
sequence length 2048

Dataset

The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on OSCAR-2301. Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.

Vocabulary size is 50432, a multiple of 128 as suggested in MEGATRON-LM, model flop utilization (MFU) increased by up to four percentage points.