mosaicml/mpt-30b-chat on sagemaker ml.p3.8xlarge

#16
by markdoucette - opened

I'm trying to get MPT-30b-Chat running on a ml.p3.8xlarge instance and I'm getting an error that says I'm out of disk space: "[Errno 28] No space left on device". I've tweaked the code from this post (https://hackernoon.com/how-to-run-mpt-7b-on-aws-sagemaker-mosaicmls-chatgpt-competitor), which can be found here (https://colab.research.google.com/drive/1kJr2LHHLKYkbnNutVYEkt2vrYsbO38aw?ref=hackernoon.com). Here is my current code:

!pip install -qU transformers accelerate einops langchain xformers

from torch import cuda, bfloat16
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

# Use the first GPU if one is available, otherwise fall back to CPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

name = 'mosaicml/mpt-30b-chat'

tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)

# MPT ships custom model code, so trust_remote_code=True is required
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:0'  # initialize the weights directly on GPU 0

model = AutoModelForCausalLM.from_pretrained(name,
                                             trust_remote_code=True,
                                             config=config,
                                             torch_dtype=bfloat16)
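
One thing I haven't tried yet is pointing the Hugging Face download cache at the notebook's EBS volume (which I believe is mounted at /home/ec2-user/SageMaker), in case the default cache under ~/.cache on the root volume is what's filling up. A rough sketch (the hf_cache path is just a placeholder):

import os

# Assumed cache location on the notebook's EBS volume; adjust if your
# volume is mounted somewhere else
cache_dir = '/home/ec2-user/SageMaker/hf_cache'
os.makedirs(cache_dir, exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True,
                                          cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(name,
                                             trust_remote_code=True,
                                             config=config,
                                             torch_dtype=bfloat16,
                                             cache_dir=cache_dir)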

Would really appreciate any help on this.

I am trying the deploy script instead, and I get the following error on AWS SageMaker as well:
code: 28, kind: StorageFull, message: "No space left on device"

It always fails for me while downloading "pytorch_model-00004-of-00007.bin".

I tried many things: creating a new notebook instance, increasing the EBS storage size to about 120 GB. But somehow the same error remains, and I'm not sure what the issue is.

I also get the following error: "An error occurred while downloading using hf_transfer. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling."

So, the next steps I will try are:

  1. Disable hf_transfer, as suggested in the error above (see the sketch after this list).
  2. Bring my instance size down to something cheaper and smaller until I manage to solve this issue.
  3. Try to download the model to S3 and load it from S3 instead of pulling it directly from the Hub, which fails anyway.
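
For step 1, my understanding is that hf_transfer is only used when the HF_HUB_ENABLE_HF_TRANSFER environment variable is enabled, so clearing it before anything triggers the download should be enough; a minimal sketch, run at the top of the notebook:

import os

# Disable the hf_transfer download backend so huggingface_hub falls back to
# its default downloader, which gives clearer error messages on failure
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '0'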

By the way, I am using the Deploy script that Hugging Face itself recommends, not the one you mentioned. Not sure if that should matter.
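
For reference, the deploy script looks roughly like the standard Hugging Face SageMaker snippet below; the container versions are just the ones I happen to have, so treat them as placeholders:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# The inference container reads the model id and task from these env vars
hub = {
    'HF_MODEL_ID': 'mosaicml/mpt-30b-chat',
    'HF_TASK': 'text-generation',
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version='4.28',  # placeholder versions; use whatever the
    pytorch_version='2.0',        # model page's deploy snippet specifies
    py_version='py310',
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.p3.8xlarge',
)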
