
Saving dbrx model and tokenizer in dbfs

#49
by twony - opened

Hi! I'm downloading the model from Hugging Face to Databricks. The goal is to end up with two folders in either DBFS or /local_disk0, one for the model and one for the tokenizer, so that I can use those paths as artifacts in an MLflow log_model() step.

I followed @srowen's helpful advice that the caching area can be switched from the default root volume to DBFS or /local_disk0 using the environment variable "HF_HUB_CACHE".

  • With that variable set to /dbfs, the download was taking 5 hours before I decided to kill it.
  • With that variable set to /local_disk0 the download took 20 minutes :-) but I don't get the expected folder structure:
    --> Expected: two folders, one called 'model' and one called 'tokenizer', which can then be the paths passed to AutoModelForCausalLM.from_pretrained() and AutoTokenizer.from_pretrained().
    --> Got: four folders ['.no_exist', 'refs', 'blobs', 'snapshots'], as revealed by os.listdir('/local_disk0/models--databricks--dbrx-instruct') after the download.
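
For reference, the redirect from that advice is just an environment variable, set before transformers or huggingface_hub first resolve the cache path (a minimal sketch):

    import os

    # Point the HF Hub cache at the elastic local disk (or "/dbfs").
    # Must run before the hub libraries compute the cache location.
    os.environ["HF_HUB_CACHE"] = "/local_disk0"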

With the model and tokenizer in the cache, I tried manually re-saving them in the desired folder structure using model.save_pretrained() and tokenizer.save_pretrained(), giving /dbfs/model and /dbfs/tokenizer as the target paths. This started out OK: the folders were created and the model folder partially populated, but the saving failed halfway through. I have lost the error message by now, but googling at the time hinted at something to do with sharing tensors between CPU and GPU.

Any thoughts here? E.g. is there some way to map the contents of the 'blobs' and 'snapshots' folders into 'model' and 'tokenizer'?

Thanks!

Databricks org

Saving to /dbfs is saving to cloud storage, so it's slower than local disk, though that sounds a lot slower than expected. Is the storage in question mounted from another region?

I think this is a question about Hugging Face's cache. Its directory structure is not the same as the model's file and folder structure; it has more to it. You do not use the cache directories directly anyway.
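
If the goal is a plain folder of files rather than the cache layout, one option is huggingface_hub's snapshot_download with a local_dir target; a hedged sketch, with an illustrative path:

    from huggingface_hub import snapshot_download

    # Materialize the repo files as a flat folder instead of the
    # blobs/snapshots cache layout. Model and tokenizer files land in the
    # same folder; both from_pretrained() calls can point at it.
    path = snapshot_download(
        repo_id="databricks/dbrx-instruct",
        local_dir="/local_disk0/dbrx",  # illustrative
    )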

Yes, you can save_pretrained to wherever you like, including a /dbfs path, and load that path back. This is not related to the cache.
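
i.e., roughly (illustrative paths, matching the ones tried above):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load directly from the saved DBFS folders; the HF cache is not involved.
    model = AutoModelForCausalLM.from_pretrained("/dbfs/model")
    tokenizer = AutoTokenizer.from_pretrained("/dbfs/tokenizer")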

I'm not sure what error you encountered. Saving shouldn't involve doing anything with GPUs. It's possible you are having problems loading the model in the first place, before saving it. We don't have context here about how you load it, or onto what type of instance (GPUs? enough RAM?).

Thanks again for the advice. I've reproduced the error and the message is below. It seems to be caused by an attempt to move tensors to the CPU. Note that I'm on a single-node GPU "cluster" (48 cores), so maybe there's no CPU available for this step.

[EDITED]: Nearly all of the model is saved before hitting this error: we get the two JSON files (config and generation_config) and 34 out of 36 safetensors files.

Steps were (sketched in code below):
(1) set the env variable so that /local_disk0 is used for caching
(2) download dbrx straight from Hugging Face with model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", ...)
(3) save a copy of the model in DBFS by passing a DBFS filepath to model.save_pretrained(model_path)
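
A sketch of those steps (the from_pretrained kwargs elided above are not reconstructed here, and model_path is illustrative):

    import os

    # (1) cache on the fast local disk; must be set before transformers
    # resolves the cache location
    os.environ["HF_HUB_CACHE"] = "/local_disk0"

    from transformers import AutoModelForCausalLM

    # (2) download straight from Hugging Face (other kwargs elided)
    model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct")

    # (3) write a flat copy of the model to DBFS
    model_path = "/dbfs/model"
    model.save_pretrained(model_path)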

----------- ERROR MESSAGE -----------
NotImplementedError: Cannot copy out of meta tensor; no data!
File , line 2
      1 # save the model
----> 2 model.save_pretrained(model_path)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXX/lib/python3.11/site-packages/safetensors/torch.py:411, in _tobytes(tensor, name)
    403 raise ValueError(
    404     f"You are trying to save a non contiguous tensor: {name} which is not allowed. It either means you"
    405     " are trying to save tensors which are reference of each other in which case it's recommended to save"
    406     " only the full tensors, and reslice at load time, or simply call .contiguous() on your tensor to"
    407     " pack it before saving."
    408 )
    409 if tensor.device.type != "cpu":
    410     # Moving tensor to cpu before saving
--> 411     tensor = tensor.to("cpu")
    413 import ctypes
    415 import numpy as np

Databricks org

I think this is a form of "out of memory". It's still not clear what resources you are using here.

The compute resources are:

  • 1 Driver 440 GB Memory, 48 Cores
  • Node type: NC48ads A100 v4 (2 GPUs)
  • Runtime: 15.0.x-gpu-ml-scala2.12

But how do I find out what storage capacity I have in DBFS or /local_disk0? (Both of these targets fail to store all model tensors.) And what size (GB) is needed for the model?

It's worth noting that the save fails after 34 of the 36 safetensors files, whether saving into DBFS or /local_disk0.
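
For the storage question, a quick standard-library check is sketched below (note: disk_usage may not report meaningfully on the DBFS FUSE mount, and the cache path is the one listed earlier):

    import shutil
    from pathlib import Path

    # Free space on each candidate target (DBFS figures may be nominal,
    # since it is a FUSE mount over cloud storage).
    for target in ("/dbfs", "/local_disk0"):
        free_gb = shutil.disk_usage(target).free / 1e9
        print(f"{target}: {free_gb:.0f} GB free")

    # Total size of the downloaded files in the HF cache (path from above).
    cache = Path("/local_disk0/models--databricks--dbrx-instruct")
    total_gb = sum(f.stat().st_size for f in cache.rglob("*") if f.is_file()) / 1e9
    print(f"cached model files: {total_gb:.0f} GB")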

Databricks org

That isn't enough memory to load the model on the GPUs (132B params x 16 bits = 264 GB, versus your 2 x 80 GB A100s), and I suspect this is related, though I would really expect a different error. It's not clear how you're loading the model, or onto what device.
DBFS is cloud storage, so virtually infinite. /local_disk0 is elastic local storage that will grow to tens of terabytes, so that isn't the issue either.
The error is not related to disk storage.
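
To make the arithmetic concrete, and to sketch one way oversized models are sometimes loaded on smaller GPUs (an assumption on my part, not a confirmed fix for this thread; device_map="auto" needs the accelerate package):

    import torch
    from transformers import AutoModelForCausalLM

    # Back-of-envelope: 132e9 params * 2 bytes (16-bit) = 264 GB,
    # vs 2 x 80 GB = 160 GB of A100 memory on this node.
    print(132e9 * 2 / 1e9)  # 264.0

    # Hedged option: let accelerate spill what doesn't fit on the GPUs
    # into CPU RAM and local disk (offload path is illustrative).
    model = AutoModelForCausalLM.from_pretrained(
        "databricks/dbrx-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        offload_folder="/local_disk0/offload",
    )

One caveat: weights offloaded this way can surface as meta tensors, which would be consistent with the "Cannot copy out of meta tensor" error above, so offloaded loading and save_pretrained may not mix cleanly.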

Out of curiosity, why do this rather than use the Foundation Model API, if you're on Databricks?
