Error whilst running the dolly repo on Databricks

#10
by opyate - opened

OSError: /local_disk0/dolly_training/dolly__2023-04-25T17:01:09 does not appear to have a file named config.json.

I'll investigate this tomorrow, but for now, here's the error I'm getting:

Screenshot from 2023-04-25 18-06-40.png

It happens at the following step:

from training.generate import generate_response, load_model_tokenizer_for_generate

model, tokenizer = load_model_tokenizer_for_generate(local_output_dir)
Databricks org

Can you show the contents of that directory? Did training complete successfully?
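For anyone hitting the same OSError, a quick way to answer the question above is to list the output directory and check whether config.json is there. This is only a minimal sketch: `describe_output_dir` is a hypothetical helper, not part of the dolly repo, and it assumes `local_output_dir` is the same path that was passed to `load_model_tokenizer_for_generate`.

```python
from pathlib import Path


def describe_output_dir(output_dir: str) -> dict:
    """Summarise a training output directory (hypothetical helper, not from dolly)."""
    path = Path(output_dir)
    if not path.is_dir():
        # The directory was never created, e.g. because training did not start.
        return {"exists": False, "files": [], "has_config": False}
    files = sorted(p.name for p in path.iterdir())
    # A saved Hugging Face model directory is expected to contain config.json.
    return {"exists": True, "files": files, "has_config": "config.json" in files}


if __name__ == "__main__":
    # Path taken from the error message in this thread; substitute your own.
    local_output_dir = "/local_disk0/dolly_training/dolly__2023-04-25T17:01:09"
    print(describe_output_dir(local_output_dir))
```

If `exists` is False or `has_config` is False, training most likely did not finish saving the model, so the training cell output is the place to look next.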

Ah, ok - the training error was below the fold in the previous block, so I didn't spot it.

image.png

Here's the entire log from the frame below the above error:
https://pastebin.com/uPwwqJbE

I'm using this instance: g5.12xlarge, so 4x A10G GPUs at 24GB each.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   28C    P0    57W / 300W |   7808MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   29C    P0    61W / 300W |   7926MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   29C    P0    59W / 300W |   5812MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   28C    P0    59W / 300W |   5546MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I'm trying Pythia 3B (2.8B, more specifically). Should I instead use a larger GPU with more contiguous memory, like an A100?

Databricks org

The error you show isn't a real failure; it's a spurious, ignorable message from the notebook (Databricks needs to fix that). Is there more below? Did the training show a problem in the actual cell output? My guess is that it didn't finish, but we don't see that output.

4 x A10 is fine for the smallest model, but did you see these instructions? https://github.com/databrickslabs/dolly#a10-gpus-1 You need to set the batch size to 3 or less.

Thanks for the guidance. I made the change in this PR, and it worked: https://github.com/databrickslabs/dolly/pull/135

My thinking is that the missing datetime import resulted in the timestamped output directory not being created, hence my error.

I successfully trained a 3b model on Databricks with the above GPU configuration in 5.6 hours.

EDIT: the PR is moot. One of my cells didn't run, so datetime wasn't imported in an earlier cell.
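To illustrate why a skipped cell matters here: the output directory name in the error ends in a timestamp, so it depends on datetime having been imported earlier in the notebook. The sketch below is an assumption inferred from the path in the error message, not the notebook's actual code; `timestamped_output_dir` and its parameters are hypothetical names.

```python
from datetime import datetime
from typing import Optional


def timestamped_output_dir(root: str, model_name: str, now: Optional[datetime] = None) -> str:
    """Build a path like <root>/<model_name>__<timestamp> (hypothetical helper)."""
    # If the datetime import above never ran (e.g. its cell was skipped),
    # this line raises NameError and no directory name is ever produced.
    now = now or datetime.now()
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{root.rstrip('/')}/{model_name}__{timestamp}"


if __name__ == "__main__":
    print(timestamped_output_dir("/local_disk0/dolly_training", "dolly"))
```

Running every cell in order (or "Run all") before the training step avoids this class of error.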

opyate changed discussion status to closed
