Reproducing the fine-tuning gets stuck with 100% CPU on one process

#20
by felipemv - opened

Hi, I'm trying to reproduce your results, but early in the run one process seems to get stuck.

echo '
{
  "fp16": {
    "enabled": true,
β‹― (identical to yours)
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
' > ./ds_config.json
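
Just as a sanity check on my side (since I'm pasting the config in via echo), I make sure the JSON parses before launching. This is only a quick check using the standard-library json.tool module:

# sanity check: json.tool exits non-zero on invalid JSON
python3 -m json.tool ./ds_config.json > /dev/null && echo "ds_config.json parses OK"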
deepspeed \
    ./trainer_sft.py \
    --configs defaults reference-data reference-pythia-12b \
    --cache_dir /root/.cache/huggingface \
    --output_dir .saved/oasst-sft-3-pythia-12b-reference_2kpre \
    --num_train_epochs 8 \
    --use_flash_attention false \
    --verbose true \
    --logging_steps 1 \
    --dtype fp16 \
    --residual_dropout 0.2 \
    --model_name andreaskoepf/pythia-12b-pre-2000

So I get the following logs (abbreviated):

Evaluation set sizes:
oasst_export: 2026 (16.55%)
alpaca: 10212 (83.45%)
Total eval: 12238
--------------------------------------------------------------------------------
β‹―
Number of trainable parameters: 11841M
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:17<00:00,  5.83s/it]
Resizing embeddings to 50288
β‹―
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

I get a burst of GPU activity about 3 minutes after starting the run. It lasts for roughly 10 seconds, then everything halts and I'm left with a single process pinned at 100% CPU:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆ      β–ˆβ–ˆ   β–ˆ   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆ  β–ˆβ–ˆβ–ˆβ–ˆ   β–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ /usr/local/bin/python3 -u ./trainer_sft.py --local_rank=6

Do you have any idea what that might be?

Let me know if more logs/info would help. I'm using 8 GPUs, which should fit this model comfortably in memory.
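
If it's useful, I can also rerun with NCCL debug logging enabled to check whether a collective is hanging. Sketch below, using standard NCCL environment variables and assuming a single-node run so the launched ranks inherit them:

# same arguments as the command above, with NCCL debug logging enabled
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
deepspeed \
    ./trainer_sft.py \
    --configs defaults reference-data reference-pythia-12b \
    ...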
