Quantizations of https://huggingface.co/mosaicml/mpt-7b-storywriter
Note: not sure why but Q2_K, Q3_K_S, Q4_0 and Q5_0 gave error during quantizations: "ggml_validate_row_data: found nan value at block xxx", so I skipped those quants.
From original readme
How to Use
Note: This model requires that trust_remote_code=True
be passed to the from_pretrained
method. This is because we use a custom model architecture that is not yet part of the transformers
package.
It includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
'mosaicml/mpt-7b-storywriter',
trust_remote_code=True
)
To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0
) with attn_impl='triton'
and with bfloat16
precision:
import torch
import transformers
name = 'mosaicml/mpt-7b-storywriter'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0' # For fast initialization directly on GPU!
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
torch_dtype=torch.bfloat16, # Load model weights in bfloat16
trust_remote_code=True
)
Although the model was trained with a sequence length of 2048 and finetuned with a sequence length of 65536, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
import transformers
name = 'mosaicml/mpt-7b'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 83968 # (input + output) tokens can now be up to 83968
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
trust_remote_code=True
)
This model was trained with the EleutherAI/gpt-neox-20b tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
The model can then be used, for example, within a text-generation pipeline.
Note: when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
from transformers import pipeline
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
with torch.autocast('cuda', dtype=torch.bfloat16):
print(
pipe('Here is a recipe for vegan banana bread:\n',
max_new_tokens=100,
do_sample=True,
use_cache=True))
- Downloads last month
- 122