slurm submission log: 2024-05-10 17:55:25.295279
created following sbatch script: 

###############################

#!/bin/bash

#SBATCH --account=nlp
#SBATCH --cpus-per-task=16
#SBATCH --dependency=afterok:7594444
#SBATCH --gres=gpu:2
#SBATCH --job-name=tthrush-job-4654014
#SBATCH --mem=400G
#SBATCH --nodelist=sphinx2
#SBATCH --open-mode=append
#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_4/pythia-70m_sciq/train_job_output.txt
#SBATCH --partition=sphinx
#SBATCH --time=14-0

# activate your desired anaconda environment
. /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection

# cd to working directory
cd .

# launch commands
srun --unbuffered run_as_child_processes 'torchrun --master_port 29502 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_4/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_4/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 1 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2'

###############################

submission to slurm complete!


###############################
slurm submission output

Submitted batch job 7594445


###############################

/var/lib/slurm/slurmd/job7594445/slurm_script: line 16: /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/etc/profile.d/conda.sh: No such file or directory

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.


###############################
start time: 2024-05-10 18:38:19.114921
machine: sphinx2
conda env: pretraining-coreset-selection
###############################
running following processes

	torchrun --master_port 29502 --nproc_per_node=2 train_llm.py --dataset_id /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_4/sciq --output_dir /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_4/pythia-70m_sciq --output_hub_id pythia-70m_sciq --model_id EleutherAI/pythia-70m --num_train_epochs 1 --learning_rate 1e-3 --warmup_ratio=0.1 --gradient_accumulation_steps 2


###############################
command outputs: 


[2024-05-10 18:38:20,807] torch.distributed.run: [WARNING] 
[2024-05-10 18:38:20,807] torch.distributed.run: [WARNING] *****************************************
[2024-05-10 18:38:20,807] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-10 18:38:20,807] torch.distributed.run: [WARNING] *****************************************
05/10/2024 18:38:25 - INFO - __main__ - Script parameters ScriptArguments(dataset_id='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/train_data_4/sciq', output_dir='/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/llms_4/pythia-70m_sciq', output_hub_id='pythia-70m_sciq', hf_hub_token=True, model_id='EleutherAI/pythia-70m', per_device_train_batch_size=256, num_train_epochs=1, learning_rate=0.001, gradient_accumulation_steps=2, from_scratch=True, warmup_ratio=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, weight_decay=0.01, lr_scheduler_type='cosine', local_rank=0, resume_from_checkpoint=False, deepspeed=None, peft=False)
  0%|          | 0/143 [00:00<?, ?it/s]
[rank0]:[W reducer.cpp:1360] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W reducer.cpp:1360] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
 17%|█▋        | 25/143 [00:18<00:59,  1.99it/s]
{'loss': 8.8339, 'grad_norm': 0.4713192284107208, 'learning_rate': 0.000985015626597272, 'epoch': 0.17}
 35%|███▍      | 50/143 [00:30<00:46,  2.01it/s]
{'loss': 6.6192, 'grad_norm': 0.38000625371932983, 'learning_rate': 0.0008265864214768884, 'epoch': 0.35}
 52%|█████▏    | 75/143 [00:42<00:33,  2.02it/s]
{'loss': 5.6515, 'grad_norm': 0.3039681315422058, 'learning_rate': 0.0005490085701647804, 'epoch': 0.52}
 70%|██████▉   | 100/143 [00:55<00:21,  2.02it/s]
{'loss': 5.223, 'grad_norm': 0.1776563823223114, 'learning_rate': 0.00025355090388510805, 'epoch': 0.7}
 87%|████████▋ | 125/143 [01:07<00:08,  2.01it/s]
{'loss': 5.0334, 'grad_norm': 0.15379749238491058, 'learning_rate': 4.800535343827833e-05, 'epoch': 0.87}
100%|██████████| 143/143 [01:16<00:00,  2.07it/s]
{'train_runtime': 88.3061, 'train_samples_per_second': 1657.824, 'train_steps_per_second': 1.619, 'train_loss': 6.109253996735686, 'epoch': 1.0}
100%|██████████| 143/143 [01:28<00:00,  1.62it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.