Tristan committed
Commit
4683a9a
1 Parent(s): a332017

Training in progress, epoch 0

eval_job_output.txt CHANGED
@@ -1,4 +1,4 @@
-slurm submission log: 2024-05-23 14:58:53.803713
 created following sbatch script:
 
 ###############################
@@ -7,24 +7,24 @@ created following sbatch script:
 
 #SBATCH --account=nlp
 #SBATCH --cpus-per-task=16
-#SBATCH --dependency=afterok:7645740
 #SBATCH --gres=gpu:1
-#SBATCH --job-name=tthrush-job-1104501
 #SBATCH --mem=60G
 #SBATCH --nodelist=sphinx1
 #SBATCH --open-mode=append
-#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init/llms/pythia-70m_sciq_1/eval_job_output.txt
 #SBATCH --partition=sphinx
 #SBATCH --time=14-0
 
 # activate your desired anaconda environment
-. /nlp/scr/tthrush/miniconda3/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
 
 # cd to working directory
 cd .
 
 # launch commands
-srun --unbuffered run_as_child_processes 'lm_eval --model hf --model_args pretrained=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init/llms/pythia-70m_sciq_1,revision=main,dtype=float16,trust_remote_code=True --tasks xnli_en,xnli_fr,sciq,piqa,lambada,arc_easy --device cuda --output_path /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init/llms/pythia-70m_sciq_1/perf'
 
 ###############################
 
@@ -34,7 +34,93 @@ submission to slurm complete!
 ###############################
 slurm submission output
 
-Submitted batch job 7645741
+slurm submission log: 2024-05-24 11:42:10.904769
 created following sbatch script:
 
 ###############################
 
 #SBATCH --account=nlp
 #SBATCH --cpus-per-task=16
+#SBATCH --dependency=afterok:7648449
 #SBATCH --gres=gpu:1
+#SBATCH --job-name=tthrush-job-2437039
 #SBATCH --mem=60G
 #SBATCH --nodelist=sphinx1
 #SBATCH --open-mode=append
+#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1/eval_job_output.txt
 #SBATCH --partition=sphinx
 #SBATCH --time=14-0
 
 # activate your desired anaconda environment
+. /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
 
 # cd to working directory
 cd .
 
 # launch commands
+srun --unbuffered run_as_child_processes 'lm_eval --model hf --model_args pretrained=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1,revision=main,dtype=float16,trust_remote_code=True --tasks xnli_en,xnli_fr,sciq,piqa,lambada,arc_easy --device cuda --output_path /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1/perf'
 
 ###############################
 
 ###############################
 slurm submission output
 
+Submitted batch job 7648450
+
+
+
+###############################
+
+/var/lib/slurm/slurmd/job7648450/slurm_script: line 16: /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/etc/profile.d/conda.sh: No such file or directory
+
+CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
+To initialize your shell, run
+
+$ conda init <SHELL_NAME>
+
+Currently supported shells are:
+  - bash
+  - fish
+  - tcsh
+  - xonsh
+  - zsh
+  - powershell
+
+See 'conda init --help' for more information and options.
+
+IMPORTANT: You may need to close and restart your shell after running 'conda init'.
+
+
+###############################
+start time: 2024-05-24 11:44:53.492934
+machine: sphinx1
+conda env: pretraining-coreset-selection
+###############################
+running following processes
+
+lm_eval --model hf --model_args pretrained=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1,revision=main,dtype=float16,trust_remote_code=True --tasks xnli_en,xnli_fr,sciq,piqa,lambada,arc_easy --device cuda --output_path /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1/perf
+
+
+###############################
+command outputs:
+
+
+2024-05-24:11:44:56,229 INFO [utils.py:145] Note: detected 255 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
+2024-05-24:11:44:56,229 INFO [utils.py:148] Note: NumExpr detected 255 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
+2024-05-24:11:44:56,229 INFO [utils.py:160] NumExpr defaulting to 8 threads.
+2024-05-24:11:44:56,591 INFO [config.py:58] PyTorch version 2.2.2 available.
+2024-05-24:11:45:00,567 INFO [__main__.py:156] Verbosity set to INFO
+2024-05-24:11:45:07,210 WARNING [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
+srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
+slurmstepd: error: *** JOB 7648450 ON sphinx1 CANCELLED AT 2024-05-24T11:45:39 ***
+slurmstepd: error: *** STEP 7648450.0 ON sphinx1 CANCELLED AT 2024-05-24T11:45:39 ***
+Received SIGTERM, job terminating, terminating 1 processes...
+slurm submission log: 2024-05-24 11:46:16.500701
+created following sbatch script:
+
+###############################
+
+#!/bin/bash
+
+#SBATCH --account=nlp
+#SBATCH --cpus-per-task=16
+#SBATCH --dependency=afterok:7648481
+#SBATCH --gres=gpu:1
+#SBATCH --job-name=tthrush-job-372659
+#SBATCH --mem=60G
+#SBATCH --nodelist=sphinx1
+#SBATCH --open-mode=append
+#SBATCH --output=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1/eval_job_output.txt
+#SBATCH --partition=sphinx
+#SBATCH --time=14-0
+
+# activate your desired anaconda environment
+. /nlp/scr/tthrush/miniconda3/envs/pretraining-coreset-selection/etc/profile.d/conda.sh ; conda activate pretraining-coreset-selection
+
+# cd to working directory
+cd .
+
+# launch commands
+srun --unbuffered run_as_child_processes 'lm_eval --model hf --model_args pretrained=/juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1,revision=main,dtype=float16,trust_remote_code=True --tasks xnli_en,xnli_fr,sciq,piqa,lambada,arc_easy --device cuda --output_path /juice5/scr5/tthrush/pretraining-coreset-selection/llm_pretraining/test_ordinal_constrained_initial_init_min_threshold/llms/pythia-70m_sciq_1/perf'
+
+###############################
+
+submission to slurm complete!
+
+
+###############################
+slurm submission output
+
+Submitted batch job 7648482
 
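The cancelled job above failed because the updated script sourced `conda.sh` from inside the environment directory (`envs/pretraining-coreset-selection/etc/profile.d/`), which does not exist; `conda.sh` ships under the base Miniconda install, as in the original script. A minimal sketch of deriving the correct path from a base prefix (the prefixes here are hypothetical examples, not taken from this repo):

```shell
# conda.sh lives under the *base* Miniconda prefix, not under envs/<env>/.
# Sourcing envs/<env>/etc/profile.d/conda.sh fails with
# "No such file or directory", as seen for job 7648450 above.
conda_sh_path() {
  # $1 = base install prefix (hypothetical example: /opt/miniconda3)
  printf '%s/etc/profile.d/conda.sh' "$1"
}

# Typical sbatch usage (assumed base prefix):
#   . "$(conda_sh_path "$HOME/miniconda3")"
#   conda activate pretraining-coreset-selection
```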
logs/events.out.tfevents.1716583437.sphinx2 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:336fed18b51c59a42c4e0d7ddbe7e2af3bb2466de4f4bc2de3567ac400e4b03d
+size 11389
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:72d975c1a09895fb677a80c6976a87b7c0c808aff25c8bd8eea46a1f10e6607c
+oid sha256:61a4bebdc413eda50b1afca4b1f85d8e92dd18b3e3aac55c140d076328279ef9
 size 281715176
train_job_output.txt CHANGED
The diff for this file is too large to render. See raw diff
 
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6e5ec4dcc9a1a5c561e3555297a62670552352ebb9dca8bbc21575d63cf52a8c
+oid sha256:89a0a13e12ab5a5a74f69ba2b98abcee0b13224c17dc16f245d07203c86eba98
 size 5240