hbertrand committed on
Commit
413b78d
1 Parent(s): f97aa81

Better tables (#8)
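
For orientation, the added (+) table rows in this diff follow a pipe-delimited layout: a header row, a dashed separator, and one row per record, with each column padded to its widest cell. The sketch below reproduces that layout in plain Python. It is not the code used in this commit, and `to_pipe_table` is a hypothetical helper name introduced here for illustration; it assumes every row has the same number of cells as the header.

```python
def to_pipe_table(headers, rows):
    """Render headers and rows as a pipe-delimited text table.

    Hypothetical helper for illustration only; assumes a rectangular
    table (every row has len(headers) cells).
    """
    # Convert every cell to text so that widths can be measured.
    table = [[str(cell) for cell in headers]]
    table += [[str(cell) for cell in row] for row in rows]

    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]

    def fmt(cells):
        # Left-align each cell and pad it to the column width.
        padded = (cell.ljust(width) for cell, width in zip(cells, widths))
        return "| " + " | ".join(padded) + " |"

    # The separator spans the cell width plus one space of padding per side.
    separator = "|" + "|".join("-" * (width + 2) for width in widths) + "|"
    return "\n".join([fmt(table[0]), separator] + [fmt(row) for row in table[1:]])


# Cell values taken from the module-command table shown later in this diff.
rows = [
    ["module avail", "Displays all the available modules"],
    ["module load <module>", "Loads <module>"],
    ["module spider <module>", "Shows specific details about <module>"],
]
print(to_pipe_table(["0", "1"], rows))
```

Keeping each record on a single delimited line means a record stays readable as a unit, whereas the removed (-) side of the diff spreads one table cell per line.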

buster/data/document_embeddings.csv CHANGED
The diff for this file is too large to render. See raw diff
 
buster/data/documents.csv CHANGED
@@ -95,7 +95,8 @@ framework outside of the scope of the workload manager.
95
  If this all seems complicated, you should know that all these things
96
  do not need to always be used. It is perfectly acceptable to sumbit
97
  jobs with a single step, a single task and a single process.
98
- The available resources on the cluster are not infinite and it is the
 
99
  workload manager’s job to allocate them. Whenever a job request comes
100
  in and there are not enough resources available to start it
101
  immediately, it will go in the queue.
@@ -110,8 +111,7 @@ can see the status of your queued jobs and why they remain in the
110
  queue.
111
  The workload manager will divide the cluster into partitions according
112
  to the configuration set by the admins. A partition is a set of
113
- machi"
114
- The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"nes typically reserved for a particular purpose. An example might
115
  be CPU-only machines for preprocessing setup as a separate partition.
116
  It is possible for multiple partitions to share resources.
117
  There will always be at least one partition that is the default
@@ -125,7 +125,8 @@ clusters where different hardware is mixed in and not all software is
125
  compatible with all of it (for example x86 and POWER cpus).
126
  To ensure a fair share of the computing resources for all, the workload
127
  manager establishes limits on the amount of resources that a single
128
- user can use at once. These can be hard limits which prevent running
 
129
  jobs when you go over or soft limits which will let you run jobs, but
130
  only until some other job needs the resources.
131
  Admin policy will determine what those exact limits are for a
@@ -535,7 +536,8 @@ simultaneously, it is a weighting factor of the workload manager to balance
535
  jobs. For instance, even though we are allocated 400 GPU-years across all
536
  clusters, we can use more or less than 400 GPUs simultaneously depending on the
537
  history of usage from our group and other groups using the cluster at a given
538
- period of time. Please see the Alliance’s documentation for
 
539
  more information on how allocations and resource scheduling are configured for
540
  these installations.
541
  The table below provides information on the allocation for
@@ -543,62 +545,14 @@ rrg-bengioy-ad for the period which spans from April 2022 to
543
  April 2023. Note that there are no special allocations for GPUs on
544
  Graham and therefore jobs with GPUs should be submitted with the
545
  account def-bengioy.
546
-
547
-
548
-
549
-
550
-
551
-
552
-
553
-
554
-
555
-
556
-
557
- Cluster
558
- CPUs
559
- GPUs
560
-
561
- #
562
- account
563
- Model
564
- #
565
- SLURM type specifier
566
- account
567
-
568
- Beluga
569
- 238
570
- rrg-bengioy-ad
571
- V100-16G
572
- 77
573
- v100
574
- rrg-bengioy-ad
575
-
576
- Cedar
577
- 34
578
- rrg-bengioy-ad
579
- V100-32G
580
- 138
581
- v100l
582
- rrg-bengioy-ad
583
-
584
- Graham
585
- 34
586
- rrg-bengioy-ad
587
- various
588
-
589
-
590
- def-bengioy
591
-
592
- Narval
593
- 34
594
- rrg-bengioy-ad
595
- A100-40G
596
- 185
597
- a100
598
- rrg-bengioy-ad
599
-
600
-
601
-
602
  "
603
  Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation
604
  To access the Alliance clusters you have to first create an account at
@@ -685,52 +639,12 @@ more time to get scheduled.
685
 
686
  "
687
  Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage
688
-
689
-
690
-
691
-
692
-
693
-
694
-
695
- Storage
696
- Path
697
- Usage
698
-
699
-
700
-
701
- $HOME
702
- /home/<user>/
703
-
704
- Code
705
- Specific libraries
706
-
707
-
708
-
709
- $HOME/projects
710
- /project/rpp-bengioy
711
-
712
- Compressed raw datasets
713
-
714
-
715
-
716
- $SCRATCH
717
- /scratch/<user>
718
-
719
- Processed datasets
720
- Experimental results
721
- Logs of experiments
722
-
723
-
724
-
725
- $SLURM_TMPDIR
726
-
727
-
728
- Temporary job results
729
-
730
-
731
-
732
-
733
-
734
  They are roughly listed in order of increasing performance and optimized for
735
  different uses:
736
 
@@ -758,23 +672,11 @@ Modules,https://docs.mila.quebec/Extra_compute.html#modules,"Modules
758
  Many software, such as Python or MATLAB are already compiled and available on
759
  Beluga through the module command and its subcommands. Its full
760
  documentation can be found here.
761
-
762
-
763
-
764
-
765
-
766
-
767
- module avail
768
- Displays all the available modules
769
-
770
- module load <module>
771
- Loads <module>
772
-
773
- module spider <module>
774
- Shows specific details about <module>
775
-
776
-
777
-
778
  In particular, if you with to use Python 3.6 you can simply do:
779
  module load python/3.6
780
 
@@ -927,213 +829,27 @@ request them for a very short duration (for testing code before queueing long
927
  jobs). You do not get the same guarantee as on the Mila cluster, however.
928
  "
929
  Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description
930
-
931
-
932
-
933
-
934
-
935
-
936
-
937
-
938
-
939
-
940
-
941
-
942
-
943
-
944
-
945
-
946
- Name
947
- GPU
948
- CPUs
949
- Sockets
950
- Cores/Socket
951
- Threads/Core
952
- Memory (GB)
953
- TmpDisk (TB)
954
- Arch
955
- Slurm Features
956
-
957
- Model
958
- Mem
959
- #
960
- GPU Arch and Memory
961
-
962
-
963
-
964
- GPU Compute Nodes
965
-
966
- cn-a[001-011]
967
- RTX8000
968
- 48
969
- 8
970
- 40
971
- 2
972
- 20
973
- 1
974
- 384
975
- 3.6
976
- x86_64
977
- turing,48gb
978
-
979
- cn-b[001-005]
980
- V100
981
- 32
982
- 8
983
- 40
984
- 2
985
- 20
986
- 1
987
- 384
988
- 3.6
989
- x86_64
990
- volta,nvlink,32gb
991
-
992
- cn-c[001-040]
993
- RTX8000
994
- 48
995
- 8
996
- 64
997
- 2
998
- 32
999
- 1
1000
- 384
1001
- 3
1002
- x86_64
1003
- turing,48gb
1004
-
1005
- cn-g[001-026]
1006
- A100
1007
- 80
1008
- 4
1009
- 64
1010
- 2
1011
- 32
1012
- 1
1013
- 1024
1014
- 7
1015
- x86_64
1016
- ampere,nvlink,80gb
1017
-
1018
- DGX Systems
1019
-
1020
- cn-d[001-002]
1021
- A100
1022
- 40
1023
- 8
1024
- 128
1025
- 2
1026
- 64
1027
- 1
1028
- 1024
1029
- 14
1030
- x86_64
1031
- ampere,nvlink,40gb
1032
-
1033
- cn-d[003-004]
1034
- A100
1035
- 80
1036
- 8
1037
- 128
1038
- 2
1039
- 64
1040
- 1
1041
- 2048
1042
- 28
1043
- x86_64
1044
- ampere,nvlink,80gb
1045
-
1046
- cn-e[002-003]
1047
- V100
1048
- 32
1049
- 8
1050
- 40
1051
- 2
1052
- 20
1053
- 1
1054
- 512
1055
- 7
1056
- x86_64
1057
- volta,32gb
1058
-
1059
- CPU Compute Nodes
1060
-
1061
- cn-f[001-004]
1062
-
1063
-
1064
-
1065
-
1066
-
1067
-
1068
-
1069
-
1070
-
1071
-
1072
-
1073
-
1074
- 32
1075
- 1
1076
- 32
1077
- 1
1078
- 256
1079
- 10
1080
- x86_64
1081
- rome
1082
-
1083
- cn-h[001-004]
1084
-
1085
-
1086
-
1087
-
1088
-
1089
-
1090
-
1091
-
1092
-
1093
-
1094
-
1095
-
1096
- 64
1097
- 2
1098
- 32
1099
- 1
1100
- 768
1101
- 7
1102
- x86_64
1103
- milan
1104
-
1105
- Legacy GPU Compute Nodes
1106
-
1107
- kepler5
1108
- V100
1109
- 16
1110
- 2
1111
- 16
1112
- 2
1113
- 4
1114
- 2
1115
- 256
1116
- 3.6
1117
- x86_64
1118
- volta,16gb
1119
-
1120
- TITAN RTX
1121
-
1122
- rtx[1,3-5,7]
1123
- titanrtx
1124
- 24
1125
- 2
1126
- 20
1127
- 1
1128
- 10
1129
- 2
1130
- 128
1131
- 0.93
1132
- x86_64
1133
- turing,24gb
1134
-
1135
-
1136
-
1137
  "
1138
  Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers
1139
  "
@@ -1161,55 +877,12 @@ expected to be used.
1161
  The cn-g series of nodes include A100-80GB GPUs. One third have been
1162
  configured to offer regular (non-MIG mode) a100l GPUs. The other two-thirds
1163
  have been configured in MIG mode, and offer the following profiles:
1164
-
1165
-
1166
-
1167
-
1168
-
1169
-
1170
-
1171
-
1172
-
1173
- Name
1174
- GPU
1175
- Cluster-wide
1176
-
1177
- Model
1178
- Memory
1179
- Compute
1180
- #
1181
-
1182
-
1183
-
1184
- a100l.1g.10gb
1185
- a100l.1
1186
- A100
1187
- 10GB
1188
- (1/8th)
1189
- 1/7th
1190
- of full
1191
- 72
1192
-
1193
- a100l.2g.20gb
1194
- a100l.2
1195
- A100
1196
- 20GB
1197
- (2/8th)
1198
- 2/7th
1199
- of full
1200
- 108
1201
-
1202
- a100l.3g.40gb
1203
- a100l.3
1204
- A100
1205
- 40GB
1206
- (4/8th)
1207
- 3/7th
1208
- of full
1209
- 72
1210
-
1211
-
1212
-
1213
  And can be requested using a SLURM flag such as --gres=gpu:a100l.1
1214
  The partitioning may be revised as needs and SLURM capabilities evolve. Other
1215
  MIG profiles exist and could be introduced.
@@ -1222,7 +895,6 @@ limit every MIG job to exactly one MIG slice and no more. Thus,
1222
  --gres=gpu:a100l.3 will work (and request a size-3 slice of an
1223
  a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting
1224
  three size-1 slices) will not.
1225
-
1226
  "
1227
  AMD,https://docs.mila.quebec/Information.html#amd,"AMD
1228
 
@@ -1329,7 +1001,8 @@ when you actually require only 8GB.
1329
 
1330
  GPU
1331
  Monitors the GPU usage using an nvidia-smi plugin for Netdata.
1332
- Under the plugin interface, select the GPU number which was allocated to
 
1333
  you. You can figure this out by running echo $SLURM_JOB_GPUS on the
1334
  allocated node or, if you have the job ID,
1335
  scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX
@@ -1363,99 +1036,20 @@ inspect this to diagnose certain problems.
1363
 
1364
 
1365
 
1366
-
1367
  "
1368
  Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard
1369
 
1370
  "
1371
  Storage,https://docs.mila.quebec/Information.html#storage,"Storage
1372
-
1373
-
1374
-
1375
-
1376
-
1377
-
1378
-
1379
-
1380
-
1381
-
1382
- Path
1383
- Performance
1384
- Usage
1385
- Quota (Space/Files)
1386
- Backup
1387
- Auto-cleanup
1388
-
1389
-
1390
-
1391
- /network/datasets/
1392
- High
1393
-
1394
- Curated raw datasets (read only)
1395
-
1396
-
1397
-
1398
-
1399
-
1400
-
1401
- $HOME or /home/mila/<u>/<username>/
1402
- Low
1403
-
1404
- Personal user space
1405
- Specific libraries, code, binaries
1406
-
1407
-
1408
- 100GB/1000K
1409
- Daily
1410
- no
1411
-
1412
- $SCRATCH or /network/scratch/<u>/<username>/
1413
- High
1414
-
1415
- Temporary job results
1416
- Processed datasets
1417
- Optimized for small Files
1418
-
1419
-
1420
- no
1421
- no
1422
- 90 days
1423
-
1424
- $SLURM_TMPDIR
1425
- Highest
1426
-
1427
- High speed disk for temporary job
1428
- results
1429
-
1430
-
1431
- 4TB/-
1432
- no
1433
- at job end
1434
-
1435
- /network/projects/<groupname>/
1436
- Fair
1437
-
1438
- Shared space to facilitate
1439
- collaboration between researchers
1440
- Long-term project storage
1441
-
1442
-
1443
- 200GB/1000K
1444
- Daily
1445
- no
1446
-
1447
- $ARCHIVE or /network/archive/<u>/<username>/
1448
- Low
1449
-
1450
- Long-term personal storage
1451
-
1452
-
1453
- 500GB
1454
- no
1455
- no
1456
-
1457
-
1458
-
1459
 
1460
  Note
1461
  The $HOME file system is backed up once a day. For any file
@@ -1758,34 +1352,13 @@ an allocation on multiple nodes.
1758
  Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments
1759
  In order to accurately select the resources for your job, several arguments are
1760
  available. The most important ones are:
1761
-
1762
-
1763
-
1764
-
1765
-
1766
-
1767
- Argument
1768
- Description
1769
-
1770
-
1771
-
1772
- -n, –ntasks=<number>
1773
- The number of task in your script, usually =1
1774
-
1775
- -c, –cpus-per-task=<ncpus>
1776
- The number of cores for each task
1777
-
1778
- -t, –time=<time>
1779
- Time requested for your job
1780
-
1781
- –mem=<size[units]>
1782
- Memory requested for all your tasks
1783
-
1784
- –gres=<list>
1785
- Select generic resources such as GPUs for your job: --gres=gpu:GPU_MODEL
1786
-
1787
-
1788
-
1789
 
1790
  Tip
1791
  Always consider requesting the adequate amount of resources to improve the
@@ -1816,65 +1389,23 @@ with a lower priority: unkillable > main > long. Once preempted, your job is
1816
  killed without notice and is automatically re-queued on the same partition until
1817
  resources are available. (To leverage a different preemption mechanism, see the
1818
  Handling preemption)
1819
-
1820
-
1821
-
1822
-
1823
-
1824
-
1825
-
1826
-
1827
- Flag
1828
- Max Resource Usage
1829
- Max Time
1830
- Note
1831
-
1832
-
1833
-
1834
- --partition=unkillable
1835
- 6 CPUs, mem=32G, 1 GPU
1836
- 2 days
1837
-
1838
-
1839
- --partition=unkillable-cpu
1840
- 2 CPUs, mem=16G
1841
- 2 days
1842
- CPU-only jobs
1843
-
1844
- --partition=short-unkillable
1845
- 24 CPUs, mem=128G, 4 GPUs
1846
- 3 hours (!)
1847
- Large but short jobs
1848
-
1849
- --partition=main
1850
- 8 CPUs, mem=48G, 2 GPUs
1851
- 5 days
1852
-
1853
-
1854
- --partition=main-cpu
1855
- 8 CPUs, mem=64G
1856
- 5 days
1857
- CPU-only jobs
1858
-
1859
- --partition=long
1860
- no limit of resources
1861
- 7 days
1862
-
1863
-
1864
- --partition=long-cpu
1865
- no limit of resources
1866
- 7 days
1867
- CPU-only jobs
1868
-
1869
-
1870
-
1871
 
1872
  Warning
1873
  Historically, before the 2022 introduction of CPU-only nodes (e.g. the cn-f
1874
  series), CPU jobs ran side-by-side with the GPU jobs on GPU nodes. To prevent
1875
  them obstructing any GPU job, they were always lowest-priority and preemptible.
1876
  This was implemented by automatically assigning them to one of the now-obsolete
1877
- partitions cpu_jobs, cpu_jobs_low or cpu_jobs_low-grace.
 
1878
  Do not use these partition names anymore. Prefer the *-cpu partition
1879
  names defined above.
1880
  For backwards-compatibility purposes, the legacy partition names are translated
@@ -1901,28 +1432,11 @@ accessed Node profile description.
1901
  Example:
1902
  To request a machine with 2 GPUs using NVLink, you can use
1903
  sbatch -c 4 --gres=gpu:2 --constraint=nvlink
1904
-
1905
-
1906
-
1907
-
1908
-
1909
-
1910
- Feature
1911
- Particularities
1912
-
1913
-
1914
-
1915
- 12GB/16GB/24GB/32GB/48GB
1916
- Request a specific amount of GPU memory
1917
-
1918
- volta/turing/ampere
1919
- Request a specific GPU architecture
1920
-
1921
- nvlink
1922
- Machine with GPUs using the NVLink interconnect technology
1923
-
1924
-
1925
-
1926
  "
1927
  Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"Information on partitions/nodes
1928
  sinfo (ref.) provides most of the
@@ -1947,12 +1461,12 @@ node[10-15] 6 batch idle 2 246 16000 0 (null) (null)
1947
  And to get statistics on a job running or terminated, use sacct with some of
1948
  the fields you want to display
1949
  sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,workdir -u $USER
1950
- User JobID JobName Partition State Timelimit Start End Elapsed NNodes NCPUS NodeList WorkDir
 
1951
  --------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- --------------------
1952
  my_usern+ 2398 run_extra+ batch RUNNING 130-05:00+ 2019-03-27T18:33:43 Unknown 1-01:07:54 1 16 node9 /home/mila/my_usern+
1953
  my_usern+ 2399 run_extra+ batch RUNNING 130-05:00+ 2019-03-26T08:51:38 Unknown 2-10:49:59 1 16 node9 /home/mila/my_usern+
1954
- Or to get the list of all your previous jobs, use the --start=YYYY-MM-DD flag. You can check sacct(1) for further information about additional t"
1955
- Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"ime formats.
1956
  sacct -u $USER --start=2019-01-01
1957
  scontrol (ref.) can be used to
1958
  provide specific information on a job (currently running or recently terminated)
@@ -1966,7 +1480,8 @@ RunTime=2-10:41:57 TimeLimit=130-05:00:00 TimeMin=N/A
1966
  SubmitTime=2019-03-26T08:47:17 EligibleTime=2019-03-26T08:49:18
1967
  AccrueTime=2019-03-26T08:49:18
1968
  StartTime=2019-03-26T08:51:38 EndTime=2019-08-03T13:51:38 Deadline=N/A
1969
- PreemptTime=None SuspendTime=None SecsPreSuspend=0
 
1970
  LastSchedEval=2019-03-26T08:49:18
1971
  Partition=slurm_partition AllocNode:Sid=login-node-1:14586
1972
  ReqNodeList=(null) ExcNodeList=(null)
@@ -2000,8 +1515,7 @@ CfgTRES=cpu=16,mem=32000M,billing=3
2000
  AllocTRES=cpu=16,mem=32000M
2001
  CapWatts=n/a
2002
  CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
2003
- ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
2004
- "
2005
  Useful Commands,https://docs.mila.quebec/Userguide.html#useful-commands,"Useful Commands
2006
 
2007
  sallocGet an interactive job and give you a shell. (ssh like) CPU only
@@ -2180,18 +1694,19 @@ module avail
2180
  cuda/11.0 -> cudatoolkit/11.0 pytorch -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 tensorflow -> python/3.7/tensorflow/2.2
2181
  cuda/9.0 -> cudatoolkit/9.0 pytorch/1.4.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.0 tensorflow-cpu/1.15 -> python/3.7/tensorflow/1.15
2182
 
2183
- -------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Core ---------------------------------------------------------------------------------------------------
 
2184
  Mila (S,L) anaconda/3 (D) go/1.13.5 miniconda/2 mujoco/1.50 python/2.7 python/3.6 python/3.8 singularity/3.0.3 singularity/3.2.1 singularity/3.5.3 (D)
2185
  anaconda/2 go/1.12.4 go/1.14 (D) miniconda/3 (D) mujoco/2.0 (D) python/3.5 python/3.7 (D) singularity/2.6.1 singularity/3.1.1 singularity/3.4.2
2186
 
2187
- ------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Compiler ---------------------------------------------------------------------------------------"
2188
- The module command,https://docs.mila.quebec/Userguide.html#the-module-command,"----------
2189
  python/3.7/mujoco-py/2.0
2190
 
2191
  -------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Cuda ---------------------------------------------------------------------------------------------------
2192
  cuda/10.0/cudnn/7.3 cuda/10.0/nccl/2.4 cuda/10.1/nccl/2.4 cuda/11.0/nccl/2.7 cuda/9.0/nccl/2.4 cudatoolkit/9.0 cudatoolkit/10.1 cudnn/7.6/cuda/10.0/tensorrt/7.0
2193
  cuda/10.0/cudnn/7.5 cuda/10.1/cudnn/7.5 cuda/10.2/cudnn/7.6 cuda/9.0/cudnn/7.3 cuda/9.2/cudnn/7.6 cudatoolkit/9.2 cudatoolkit/10.2 cudnn/7.6/cuda/10.1/tensorrt/7.0
2194
- cuda/10.0/cudnn/7.6 (D) cuda/10.1/cudnn/7.6 (D) cuda/10.2/nccl/2.7 cuda/9.0/cudnn/7.5 (D) cuda/9.2/nccl/2.4 cudatoolkit/10.0 cudatoolkit/11.0 (D) cudnn/7.6/cuda/9.0/tensorrt/7.0
 
2195
 
2196
  ------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Pytorch --------------------------------------------------------------------------------------------------
2197
  python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.4.1 python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.1 (D) python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0
@@ -2209,32 +1724,12 @@ module load python3.7
2209
  "
2210
  Available Software,https://docs.mila.quebec/Userguide.html#available-software,"Available Software
2211
  Modules are divided in 5 main sections:
2212
-
2213
-
2214
-
2215
-
2216
-
2217
-
2218
- Section
2219
- Description
2220
-
2221
-
2222
-
2223
- Core
2224
- Base interpreter and software (Python, go, etc…)
2225
-
2226
- Compiler
2227
- Interpreter-dependent software (see the note below)
2228
-
2229
- Cuda
2230
- Toolkits, cudnn and related libraries
2231
-
2232
- Pytorch/Tensorflow
2233
- Pytorch/TF built with a specific Cuda/Cudnn
2234
- version for Mila’s GPUs (see the related paragraph)
2235
-
2236
-
2237
-
2238
 
2239
  Note
2240
  Modules which are nested (../../..) usually depend on other software/module
@@ -2495,7 +1990,8 @@ From: tensorflow/tensorflow:latest-gpu-py3
2495
  apt-get update
2496
  apt-get install -y cmake libcupti-dev libyaml-dev wget unzip
2497
  apt-get clean
2498
- echo ""Installing things with pip""
 
2499
  pip install tqdm
2500
  echo ""Creating mount points""
2501
  mkdir /dataset
@@ -2524,7 +2020,6 @@ Warning
2524
  You always need to use sudo when you build a container from a
2525
  recipe. As there is no access to sudo on the cluster, a personal computer or
2526
  the use singularity hub is needed to build a container
2527
-
2528
  "
2529
  Build recipe on singularity hub,https://docs.mila.quebec/Userguide.html#build-recipe-on-singularity-hub,"Build recipe on singularity hub
2530
  Singularity hub allows users to build containers from recipes directly on
@@ -2600,7 +2095,8 @@ From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
2600
  mkdir /Gym && cd /Gym
2601
  git clone https://github.com/openai/gym.git || true && \
2602
  mkdir /Gym/.mujoco && cd /Gym/.mujoco
2603
- wget https://www.roboti.us/download/mjpro150_linux.zip && \
 
2604
  unzip mjpro150_linux.zip && \
2605
  wget https://www.roboti.us/download/mujoco200_linux.zip && \
2606
  unzip mujoco200_linux.zip && \
@@ -2610,8 +2106,7 @@ From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
2610
  export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
2611
  export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
2612
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
2613
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym"
2614
- "Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"/.mujoco/mujoco200/bin
2615
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
2616
  cp /mjkey.txt /Gym/.mujoco/mjkey.txt
2617
  # Install Python dependencies
@@ -2632,7 +2127,8 @@ From: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
2632
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
2633
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
2634
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
2635
- export PATH=/Gym/gym/.tox/py3/bin:$PATH
 
2636
 
2637
  %runscript
2638
  exec /bin/sh ""$@""
@@ -2674,8 +2170,7 @@ From: tensorflow/tensorflow:latest-gpu-py3
2674
 
2675
  # Download Gym and MuJoCo
2676
  mkdir /Gym && cd /Gym
2677
- git clone"
2678
- "Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld," https://github.com/openai/gym.git || true && \
2679
  mkdir /Gym/.mujoco && cd /Gym/.mujoco
2680
  wget https://www.roboti.us/download/mjpro150_linux.zip && \
2681
  unzip mjpro150_linux.zip && \
@@ -2685,7 +2180,8 @@ From: tensorflow/tensorflow:latest-gpu-py3
2685
 
2686
  # Export global environment variables
2687
  export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
2688
- export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
 
2689
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
2690
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
2691
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
@@ -2722,8 +2218,7 @@ From: tensorflow/tensorflow:latest-gpu-py3
2722
 
2723
  Keep in mind that those environment variables are sourced at runtime and not at
2724
  build time. This is why, you should also define them in the %post section
2725
- since they are required to install MuJoCo.
2726
- "
2727
  Using containers on clusters,https://docs.mila.quebec/Userguide.html#using-containers-on-clusters,"Using containers on clusters
2728
  "
2729
  How to use containers on clusters,https://docs.mila.quebec/Userguide.html#how-to-use-containers-on-clusters,"How to use containers on clusters
@@ -3168,29 +2663,10 @@ It does not require any ssh tunnel or port redirection, the hub acts as a proxy
3168
  server that will redirect you to a session as soon as it is available.
3169
  It is currently available for Mila clusters and some Digital Research Alliance
3170
  of Canada (Alliance) clusters.
3171
-
3172
-
3173
-
3174
-
3175
-
3176
-
3177
-
3178
- Cluster
3179
- Address
3180
- Login type
3181
-
3182
-
3183
-
3184
- Mila Local
3185
- https://jupyterhub.server.mila.quebec
3186
- Google Oauth
3187
-
3188
- Alliance
3189
- https://docs.alliancecan.ca/wiki/JupyterHub
3190
- DRAC login
3191
-
3192
-
3193
-
3194
 
3195
  Warning
3196
  Do not forget to close the JupyterLab session! Closing the window leaves
@@ -3351,7 +2827,8 @@ Requesting 2 tasks per GPU
3351
 
3352
 
3353
  --exclusive is important to specify subsequent step/srun to bind to different cpus.
3354
- This will produce 8 output files, 2 for each step:
 
3355
 
3356
  JOBID-step-0-task-0.out
3357
  JOBID-step-0-task-1.out
@@ -3372,8 +2849,7 @@ cat JOBID-step-* | grep Tesla
3372
  0: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
3373
  1: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
3374
  0: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
3375
- 1: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
3376
- "
3377
  Multiple Nodes,https://docs.mila.quebec/Userguide.html#multiple-nodes,"Multiple Nodes
3378
  "
3379
  Data Parallel,https://docs.mila.quebec/Userguide.html#data-parallel,"Data Parallel
 
95
  If this all seems complicated, you should know that all these things
96
  do not need to always be used. It is perfectly acceptable to sumbit
97
  jobs with a single step, a single task and a single process.
98
+ The available resource"
99
+ The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"s on the cluster are not infinite and it is the
100
  workload manager’s job to allocate them. Whenever a job request comes
101
  in and there are not enough resources available to start it
102
  immediately, it will go in the queue.
 
111
  queue.
112
  The workload manager will divide the cluster into partitions according
113
  to the configuration set by the admins. A partition is a set of
114
+ machines typically reserved for a particular purpose. An example might
 
115
  be CPU-only machines for preprocessing setup as a separate partition.
116
  It is possible for multiple partitions to share resources.
117
  There will always be at least one partition that is the default
 
125
  compatible with all of it (for example x86 and POWER cpus).
126
  To ensure a fair share of the computing resources for all, the workload
127
  manager establishes limits on the amount of resources that a single
128
+ user can us"
129
+ The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"e at once. These can be hard limits which prevent running
130
  jobs when you go over or soft limits which will let you run jobs, but
131
  only until some other job needs the resources.
132
  Admin policy will determine what those exact limits are for a
 
536
  jobs. For instance, even though we are allocated 400 GPU-years across all
537
  clusters, we can use more or less than 400 GPUs simultaneously depending on the
538
  history of usage from our group and other groups using the cluster at a given
539
+ period of time. Please see the Alliance’s doc"
540
+ Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"umentation for
541
  more information on how allocations and resource scheduling are configured for
542
  these installations.
543
  The table below provides information on the allocation for
 
545
  April 2023. Note that there are no special allocations for GPUs on
546
  Graham and therefore jobs with GPUs should be submitted with the
547
  account def-bengioy.
548
+ | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
549
+ |---------|------|----------------|----------|------|----------------------|----------------|
550
+ | Cluster | CPUs | CPUs | GPUs | GPUs | GPUs | GPUs |
551
+ | Cluster | # | account | Model | # | SLURM type specifier | account |
552
+ | Beluga | 238 | rrg-bengioy-ad | V100-16G | 77 | v100 | rrg-bengioy-ad |
553
+ | Cedar | 34 | rrg-bengioy-ad | V100-32G | 138 | v100l | rrg-bengioy-ad |
554
+ | Graham | 34 | rrg-bengioy-ad | various | – | – | def-bengioy |
555
+ | Narval | 34 | rrg-bengioy-ad | A100-40G | 185 | a100 | rrg-bengioy-ad |
556
  "
557
  Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation
558
  To access the Alliance clusters you have to first create an account at
 
639
 
640
  "
641
  Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage
642
+ | Storage | Path | Usage |
643
+ |----------------|----------------------|---------------------------------------------------------------|
644
+ | $HOME | /home/<user>/ | Code Specific libraries |
645
+ | $HOME/projects | /project/rpp-bengioy | Compressed raw datasets |
646
+ | $SCRATCH | /scratch/<user> | Processed datasets Experimental results Logs of experiments |
647
+ | $SLURM_TMPDIR | nan | Temporary job results |
648
  They are roughly listed in order of increasing performance and optimized for
649
  different uses:
650
 
 
672
  Many software, such as Python or MATLAB are already compiled and available on
673
  Beluga through the module command and its subcommands. Its full
674
  documentation can be found here.
675
+ | 0 | 1 |
676
+ |------------------------|---------------------------------------|
677
+ | module avail | Displays all the available modules |
678
+ | module load <module> | Loads <module> |
679
+ | module spider <module> | Shows specific details about <module> |
680
  In particular, if you with to use Python 3.6 you can simply do:
681
  module load python/3.6
682
 
 
829
  jobs). You do not get the same guarantee as on the Mila cluster, however.
830
  "
831
  Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description
832
+ | ('Name', 'Name') | ('GPU', 'Model') | ('GPU', 'Mem') | ('GPU', '#') | ('CPUs', 'CPUs') | ('Sockets', 'Sockets') | ('Cores/Socket', 'Cores/Socket') | ('Threads/Core', 'Threads/Core') | ('Memory (GB)', 'Memory (GB)') | ('TmpDisk (TB)', 'TmpDisk (TB)') | ('Arch', 'Arch') | ('Slurm Features', 'GPU Arch and Memory') |
833
+ |--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|------------------------------------|------------------------------------|----------------------------------|------------------------------------|--------------------------|---------------------------------------------|
834
+ | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes |
835
+ | cn-a[001-011] | RTX8000 | 48 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | turing,48gb |
836
+ | cn-b[001-005] | V100 | 32 | 8 | 40 | 2 | 20 "
837
+ Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | 1 | 384 | 3.6 | x86_64 | volta,nvlink,32gb |
838
+ | cn-c[001-040] | RTX8000 | 48 | 8 | 64 | 2 | 32 | 1 | 384 | 3 | x86_64 | turing,48gb |
839
+ | cn-g[001-026] | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 7 | x86_64 | ampere,nvlink,80gb |
840
+ | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems |
841
+ | cn-d[001-002] | A100 | 40 | 8 | 128 | 2 | 64 | 1 | 1024 | 14 | x86_64 | ampere,nvlink,40gb "
842
+ Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," |
843
+ | cn-d[003-004] | A100 | 80 | 8 | 128 | 2 | 64 | 1 | 2048 | 28 | x86_64 | ampere,nvlink,80gb |
844
+ | cn-e[002-003] | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 512 | 7 | x86_64 | volta,32gb |
845
+ | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes |
846
+ | cn-f[001-004] | nan | nan | nan | 32 | 1 | 32 | 1 | 256 | 10 | x86_64 | rome |
847
+ | cn-h[001-004] | nan | nan | nan | 64 | 2 | 32 "
848
+ Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | 1 | 768 | 7 | x86_64 | milan |
849
+ | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes |
850
+ | kepler5 | V100 | 16 | 2 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | volta,16gb |
851
+ | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX |
852
+ | rtx[1,3-5,7] | titanrtx | 24 | 2 | 20 | 1 | 10 | 2 | 128 | 0.93 | x86_64 | turing,24gb |
853
  "
854
  Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers
855
  "
 
877
  The cn-g series of nodes include A100-80GB GPUs. One third have been
878
  configured to offer regular (non-MIG mode) a100l GPUs. The other two-thirds
879
  have been configured in MIG mode, and offer the following profiles:
880
+ | ('Name', 'Name') | ('GPU', 'Model') | ('GPU', 'Memory') | ('GPU', 'Compute') | ('Cluster-wide', '#') |
881
+ |------------------------|--------------------|---------"
882
+ MIG,https://docs.mila.quebec/Information.html#mig,"------------|----------------------|-------------------------|
883
+ | a100l.1g.10gb a100l.1 | A100 | 10GB (1/8th) | 1/7th of full | 72 |
884
+ | a100l.2g.20gb a100l.2 | A100 | 20GB (2/8th) | 2/7th of full | 108 |
885
+ | a100l.3g.40gb a100l.3 | A100 | 40GB (4/8th) | 3/7th of full | 72 |
886
  And can be requested using a SLURM flag such as --gres=gpu:a100l.1
887
  The partitioning may be revised as needs and SLURM capabilities evolve. Other
888
  MIG profiles exist and could be introduced.
 
895
  --gres=gpu:a100l.3 will work (and request a size-3 slice of an
896
  a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting
897
  three size-1 slices) will not.
 
898
  "
899
  AMD,https://docs.mila.quebec/Information.html#amd,"AMD
900
 
 
1001
 
1002
  GPU
1003
  Monitors the GPU usage using an nvidia-smi plugin for Netdata.
1004
+ Under the plugin interface, select the GPU"
1005
+ Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage," number which was allocated to
1006
  you. You can figure this out by running echo $SLURM_JOB_GPUS on the
1007
  allocated node or, if you have the job ID,
1008
  scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX
 
1036
 
1037
 
1038
 
 
1039
  "
1040
  Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard
1041
 
1042
  "
1043
  Storage,https://docs.mila.quebec/Information.html#storage,"Storage
1044
+ | Path | Performance | Usage | Quota (Space/Files) | Backup | Auto-cleanup |
1045
+ |------------------------------------------------|---------------|-----------------------------------------------------------------------------------------|-----------------------|----------|----------------|
1046
+ | /network/datasets/ | High | Curated raw datasets (read only) | nan | nan | nan |
1047
+ | $HOME or /home/mila/<u>/<username>/ | Low | Personal user space Specific libraries, code, binaries | 100GB/1000K | Daily | no |
1048
+ | $SCRATCH or /network/scratch/<u>/<username>/ | High | Temporary job results Processed datasets Optimized for small Files | no | no | 90 days "
1049
+ Storage,https://docs.mila.quebec/Information.html#storage," |
1050
+ | $SLURM_TMPDIR | Highest | High speed disk for temporary job results | 4TB/- | no | at job end |
1051
+ | /network/projects/<groupname>/ | Fair | Shared space to facilitate collaboration between researchers Long-term project storage | 200GB/1000K | Daily | no |
1052
+ | $ARCHIVE or /network/archive/<u>/<username>/ | Low | Long-term personal storage | 500GB | no | no |
1053
 
1054
  Note
1055
  The $HOME file system is backed up once a day. For any file
 
1352
  Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments
1353
  In order to accurately select the resources for your job, several arguments are
1354
  available. The most important ones are:
1355
+ | Argument | Description |
1356
+ |-----------------------------|---------------------------------------------------------------------------|
1357
+ | -n, --ntasks=<number> | The number of tasks in your script, usually =1 |
1358
+ | -c, --cpus-per-task=<ncpus> | The number of cores for each task |
1359
+ | -t, --time=<time> | Time requested for your job |
1360
+ | --mem=<size[units]> | Memory requested for all your tasks |
1361
+ | --gres=<list> | Select generic resources such as GPUs for your job: --gres=gpu:GPU_MODEL |
1362
 
1363
  Tip
1364
  Always consider requesting the adequate amount of resources to improve the
 
1389
  killed without notice and is automatically re-queued on the same partition until
1390
  resources are available. (To leverage a different preemption mechanism, see the
1391
  Handling preemption)
1392
+ | Flag | Max Resource Usage | Max Time | Note |
1393
+ |------------------------------|---------------------------|-------------|----------------------|
1394
+ | --partition=unkillable | 6 CPUs, mem=32G, 1 GPU | 2 days | nan |
1395
+ | --partition=unkillable-cpu | 2 CPUs, mem=16G | 2 days | CPU-only jobs |
1396
+ | --partition=short-unkillable | 24 CPUs, mem=128G, 4 GPUs | 3 hours (!) | Large but short jobs |
1397
+ | --partition=main | 8 CPUs, mem=48G, 2 GPUs | 5 days | nan |
1398
+ | --partition=main-cpu | 8 CPUs, mem=64G | 5 days | CPU-only jobs |
1399
+ | --partition=long | no limit of resources | 7 days | nan |
1400
+ | --partition=long-cpu | no limit of resources | 7 days | CPU-only jobs |
1401
 
1402
  Warning
1403
  Historically, before the 2022 introduction of CPU-only nodes (e.g. the cn-f
1404
  series), CPU jobs ran side-by-side with the GPU jobs on GPU nodes. To prevent
1405
  them obstructing any GPU job, they were always lowest-priority and preemptible.
1406
  This was implemented by automatically assigning them to one of the now-obsolete
1407
+ part"
1408
+ Partitioning,https://docs.mila.quebec/Userguide.html#partitioning,"itions cpu_jobs, cpu_jobs_low or cpu_jobs_low-grace.
1409
  Do not use these partition names anymore. Prefer the *-cpu partition
1410
  names defined above.
1411
  For backwards-compatibility purposes, the legacy partition names are translated
 
1432
  Example:
1433
  To request a machine with 2 GPUs using NVLink, you can use
1434
  sbatch -c 4 --gres=gpu:2 --constraint=nvlink
1435
+ | Feature | Particularities |
1436
+ |--------------------------|------------------------------------------------------------|
1437
+ | 12GB/16GB/24GB/32GB/48GB | Request a specific amount of GPU memory |
1438
+ | volta/turing/ampere | Request a specific GPU architecture |
1439
+ | nvlink | Machine with GPUs using the NVLink interconnect technology |
1440
  "
1441
  Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"Information on partitions/nodes
1442
  sinfo (ref.) provides most of the
 
1461
  And to get statistics on a job running or terminated, use sacct with some of
1462
  the fields you want to display
1463
  sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,workdir -u $USER
1464
+ User JobID JobName Partition State Timelimit Start End Elapsed NNodes NCPUS N"
1465
+ Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"odeList WorkDir
1466
  --------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- --------------------
1467
  my_usern+ 2398 run_extra+ batch RUNNING 130-05:00+ 2019-03-27T18:33:43 Unknown 1-01:07:54 1 16 node9 /home/mila/my_usern+
1468
  my_usern+ 2399 run_extra+ batch RUNNING 130-05:00+ 2019-03-26T08:51:38 Unknown 2-10:49:59 1 16 node9 /home/mila/my_usern+
1469
+ Or to get the list of all your previous jobs, use the --start=YYYY-MM-DD flag. You can check sacct(1) for further information about additional time formats.
 
1470
  sacct -u $USER --start=2019-01-01
1471
  scontrol (ref.) can be used to
1472
  provide specific information on a job (currently running or recently terminated)
 
1480
  SubmitTime=2019-03-26T08:47:17 EligibleTime=2019-03-26T08:49:18
1481
  AccrueTime=2019-03-26T08:49:18
1482
  StartTime=2019-03-26T08:51:38 EndTime=2019-08-03T13:51:38 Deadline=N/A
1483
+ PreemptTime=None SuspendTim"
1484
+ Information on partitions/nodes,https://docs.mila.quebec/Userguide.html#information-on-partitions-nodes,"e=None SecsPreSuspend=0
1485
  LastSchedEval=2019-03-26T08:49:18
1486
  Partition=slurm_partition AllocNode:Sid=login-node-1:14586
1487
  ReqNodeList=(null) ExcNodeList=(null)
 
1515
  AllocTRES=cpu=16,mem=32000M
1516
  CapWatts=n/a
1517
  CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
1518
+ ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/"
 
1519
  Useful Commands,https://docs.mila.quebec/Userguide.html#useful-commands,"Useful Commands
1520
 
1521
  salloc Get an interactive job and give you a shell. (ssh like) CPU only
 
1694
  cuda/11.0 -> cudatoolkit/11.0 pytorch -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.1 tensorflow -> python/3.7/tensorflow/2.2
1695
  cuda/9.0 -> cudatoolkit/9.0 pytorch/1.4.0 -> python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.4.0 tensorflow-cpu/1.15 -> python/3.7/tensorflow/1.15
1696
 
1697
+ -------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Core ---------------------------------"
1698
+ The module command,https://docs.mila.quebec/Userguide.html#the-module-command,"------------------------------------------------------------------
1699
  Mila (S,L) anaconda/3 (D) go/1.13.5 miniconda/2 mujoco/1.50 python/2.7 python/3.6 python/3.8 singularity/3.0.3 singularity/3.2.1 singularity/3.5.3 (D)
1700
  anaconda/2 go/1.12.4 go/1.14 (D) miniconda/3 (D) mujoco/2.0 (D) python/3.5 python/3.7 (D) singularity/2.6.1 singularity/3.1.1 singularity/3.4.2
1701
 
1702
+ ------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Compiler -------------------------------------------------------------------------------------------------
 
1703
  python/3.7/mujoco-py/2.0
1704
 
1705
  -------------------------------------------------------------------------------------------------- /cvmfs/config.mila.quebec/modules/Cuda ---------------------------------------------------------------------------------------------------
1706
  cuda/10.0/cudnn/7.3 cuda/10.0/nccl/2.4 cuda/10.1/nccl/2.4 cuda/11.0/nccl/2.7 cuda/9.0/nccl/2.4 cudatoolkit/9.0 cudatoolkit/10.1 cudnn/7.6/cuda/10.0/tensorrt/7.0
1707
  cuda/10.0/cudnn/7.5 cuda/10.1/cudnn/7.5 cuda/10.2/cudnn/7.6 cuda/9.0/cudnn/7.3 cuda/9.2/cudnn/7.6 cudatoolkit/9.2 cudatoolkit/10.2 cudnn/7.6/cuda/10.1/tensorrt/7.0
1708
+ cuda/10"
1709
+ The module command,https://docs.mila.quebec/Userguide.html#the-module-command,".0/cudnn/7.6 (D) cuda/10.1/cudnn/7.6 (D) cuda/10.2/nccl/2.7 cuda/9.0/cudnn/7.5 (D) cuda/9.2/nccl/2.4 cudatoolkit/10.0 cudatoolkit/11.0 (D) cudnn/7.6/cuda/9.0/tensorrt/7.0
1710
 
1711
  ------------------------------------------------------------------------------------------------ /cvmfs/config.mila.quebec/modules/Pytorch --------------------------------------------------------------------------------------------------
1712
  python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.4.1 python/3.7/cuda/10.1/cudnn/7.6/pytorch/1.5.1 (D) python/3.7/cuda/10.2/cudnn/7.6/pytorch/1.5.0
 
1724
  "
1725
  Available Software,https://docs.mila.quebec/Userguide.html#available-software,"Available Software
1726
  Modules are divided in 5 main sections:
1727
+ | Section | Description |
1728
+ |--------------------|-----------------------------------------------------------------------------------------------------|
1729
+ | Core | Base interpreter and software (Python, go, etc…) |
1730
+ | Compiler | Interpreter-dependent software ( see the note below ) |
1731
+ | Cuda | Toolkits, cudnn and related libraries |
1732
+ | Pytorch/Tensorflow | Pytorch/TF built with a specific Cuda/Cudnn version for Mila’s GPUs ( see the related paragraph ) |
1733
 
1734
  Note
1735
  Modules which are nested (../../..) usually depend on other software/module
 
1990
  apt-get update
1991
  apt-get install -y cmake libcupti-dev libyaml-dev wget unzip
1992
  apt-get clean
1993
+ echo ""Instal"
1994
+ Second way: Use recipes,https://docs.mila.quebec/Userguide.html#second-way-use-recipes,"ling things with pip""
1995
  pip install tqdm
1996
  echo ""Creating mount points""
1997
  mkdir /dataset
 
2020
  You always need to use sudo when you build a container from a
2021
  recipe. As there is no access to sudo on the cluster, a personal computer or
2022
  the use of singularity hub is needed to build a container
 
2023
  "
2024
  Build recipe on singularity hub,https://docs.mila.quebec/Userguide.html#build-recipe-on-singularity-hub,"Build recipe on singularity hub
2025
  Singularity hub allows users to build containers from recipes directly on
 
2095
  mkdir /Gym && cd /Gym
2096
  git clone https://github.com/openai/gym.git || true && \
2097
  mkdir /Gym/.mujoco && cd /Gym/.mujoco
2098
+ wget https://www.roboti.us/do"
2099
+ "Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"wnload/mjpro150_linux.zip && \
2100
  unzip mjpro150_linux.zip && \
2101
  wget https://www.roboti.us/download/mujoco200_linux.zip && \
2102
  unzip mujoco200_linux.zip && \
 
2106
  export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
2107
  export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujoco150/
2108
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
2109
+ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
 
2110
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
2111
  cp /mjkey.txt /Gym/.mujoco/mjkey.txt
2112
  # Install Python dependencies
 
2127
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
2128
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
2129
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
2130
+ export PATH=/Gym/gym/.tox/py3/bin:$PATH"
2131
+ "Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"
2132
 
2133
  %runscript
2134
  exec /bin/sh ""$@""
 
2170
 
2171
  # Download Gym and MuJoCo
2172
  mkdir /Gym && cd /Gym
2173
+ git clone https://github.com/openai/gym.git || true && \
 
2174
  mkdir /Gym/.mujoco && cd /Gym/.mujoco
2175
  wget https://www.roboti.us/download/mjpro150_linux.zip && \
2176
  unzip mjpro150_linux.zip && \
 
2180
 
2181
  # Export global environment variables
2182
  export MUJOCO_PY_MJKEY_PATH=/Gym/.mujoco/mjkey.txt
2183
+ export MUJOCO_PY_MUJOCO_PATH=/Gym/.mujoco/mujo"
2184
+ "Example: Recipe with OpenAI gym, MuJoCo and Miniworld",https://docs.mila.quebec/Userguide.html#example-recipe-with-openai-gym-mujoco-and-miniworld,"co150/
2185
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mjpro150/bin
2186
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/Gym/.mujoco/mujoco200/bin
2187
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin
 
2218
 
2219
  Keep in mind that those environment variables are sourced at runtime and not at
2220
  build time. This is why, you should also define them in the %post section
2221
+ since they are required to install MuJoCo"
 
2222
  Using containers on clusters,https://docs.mila.quebec/Userguide.html#using-containers-on-clusters,"Using containers on clusters
2223
  "
2224
  How to use containers on clusters,https://docs.mila.quebec/Userguide.html#how-to-use-containers-on-clusters,"How to use containers on clusters
 
2663
  server that will redirect you to a session as soon as it is available.
2664
  It is currently available for Mila clusters and some Digital Research Alliance
2665
  of Canada (Alliance) clusters.
2666
+ | Cluster | Address | Login type |
2667
+ |------------|---------------------------------------------|--------------|
2668
+ | Mila Local | https://jupyterhub.server.mila.quebec | Google Oauth |
2669
+ | Alliance | https://docs.alliancecan.ca/wiki/JupyterHub | DRAC login |
2670
 
2671
  Warning
2672
  Do not forget to close the JupyterLab session! Closing the window leaves
 
2827
 
2828
 
2829
  --exclusive is important to specify subsequent step/srun to bind to different cpus.
2830
+ This will produce 8 output files"
2831
+ Sharing a node with multiple GPU & multiple processes/GPU,https://docs.mila.quebec/Userguide.html#sharing-a-node-with-multiple-gpu-multiple-processes-gpu,", 2 for each step:
2832
 
2833
  JOBID-step-0-task-0.out
2834
  JOBID-step-0-task-1.out
 
2849
  0: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
2850
  1: | 0 Tesla P100-PCIE... On | 00000000:82:00.0 Off | 0 |
2851
  0: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |
2852
+ 1: | 0 Tesla P100-PCIE... On | 00000000:03:00.0 Off | 0 |"
 
2853
  Multiple Nodes,https://docs.mila.quebec/Userguide.html#multiple-nodes,"Multiple Nodes
2854
  "
2855
  Data Parallel,https://docs.mila.quebec/Userguide.html#data-parallel,"Data Parallel
buster/docparser.py CHANGED
@@ -2,6 +2,7 @@ import glob
 import math
 import os
 
+import bs4
 import pandas as pd
 import tiktoken
 from bs4 import BeautifulSoup
@@ -14,7 +15,20 @@ EMBEDDING_ENCODING = "cl100k_base" # this the encoding for text-embedding-ada-0
 BASE_URL = "https://docs.mila.quebec/"
 
 
-def get_all_documents(root_dir: str, max_section_length: int = 3000) -> pd.DataFrame:
+def parse_section(nodes: list[bs4.element.NavigableString]) -> str:
+    section = []
+    for node in nodes:
+        if node.name == "table":
+            node_text = pd.read_html(node.prettify())[0].to_markdown(index=False, tablefmt="github")
+        else:
+            node_text = node.text
+        section.append(node_text)
+    section = "".join(section)[1:]
+
+    return section
+
+
+def get_all_documents(root_dir: str, max_section_length: int = 2000) -> pd.DataFrame:
     """Parse all HTML files in `root_dir`, and extract all sections.
 
     Sections are broken into subsections if they are longer than `max_section_length`.
@@ -34,11 +48,10 @@ def get_all_documents(root_dir: str, max_section_length: int = 3000) -> pd.DataF
 
             # If sections has subsections, keep only the part before the first subsection
             if len(section_href) > 1:
-                section_siblings = section_soup.section.previous_siblings
-                section = [sibling.text for sibling in section_siblings]
-                section = "".join(section[::-1])[1:]
+                section_siblings = list(section_soup.section.previous_siblings)[::-1]
+                section = parse_section(section_siblings)
             else:
-                section = section_soup.text[1:]
+                section = parse_section(section_soup.children)
 
             url = section_found["href"]
             name = section_found.parent.text[:-1]
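
The change to buster/docparser.py renders each HTML table node as a GitHub-style Markdown table instead of flattening it to plain text. The commit does this with pandas (`pd.read_html(...)` followed by `to_markdown(...)`). The sketch below only illustrates the same idea using the Python standard library, so it is not the commit's implementation. The names `TableToMarkdown` and `html_table_to_markdown` are hypothetical, and it assumes a simple table whose first row is the header, with no `rowspan`/`colspan` and no nested tables.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cell text of one simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> or <th>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def html_table_to_markdown(html):
    """Render a simple HTML table (first row = header) as a pipe table."""
    parser = TableToMarkdown()
    parser.feed(html)
    header, *body = parser.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Two rows taken from the partition table shown in the documents.csv diff.
example = """
<table>
  <tr><th>Flag</th><th>Max Time</th></tr>
  <tr><td>--partition=main</td><td>5 days</td></tr>
  <tr><td>--partition=long</td><td>7 days</td></tr>
</table>
"""
print(html_table_to_markdown(example))
```

Keeping the column separators in the stored text lets each embedded chunk retain which value belongs to which column, which appears to be the motivation for the "Better tables" change.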