diff --git "a/buster/data/documents.csv" "b/buster/data/documents.csv" deleted file mode 100644--- "a/buster/data/documents.csv" +++ /dev/null @@ -1,3222 +0,0 @@ -name,url,text -AI tooling and methodology handbook,https://docs.mila.quebec/Handbook.html#ai-tooling-and-methodology-handbook,"AI tooling and methodology handbook -This section seeks to provide researchers with insightful articles pertaining to -aspects of methodology in their work. -" -What is a computer cluster?,https://docs.mila.quebec/Theory_cluster.html#what-is-a-computer-cluster,"What is a computer cluster? -A computer cluster is a set -of loosely or tightly connected computers that work together so that, in many -respects, they can be viewed as a single system. -" -Parts of a computing cluster,https://docs.mila.quebec/Theory_cluster.html#parts-of-a-computing-cluster,"Parts of a computing cluster -To provide high performance computation capabilities, clusters can -combine hundreds to thousands of computers, called nodes, which are all -inter-connected with a high-performance communication network. Most nodes are -designed for high-performance computations, but clusters can also use -specialized nodes to offer parallel file systems, databases, login nodes and -even the cluster scheduling functionality as pictured in the image below. - -We will overview the different types of nodes which you can encounter on a -typical cluster. -" -The login nodes,https://docs.mila.quebec/Theory_cluster.html#the-login-nodes,"The login nodes -To execute computing processes on a cluster, you must first connect to a -cluster and this is accomplished through a login node. These so-called -login nodes are the entry point to most clusters. -Another entry point to some clusters such as the Mila cluster is the JupyterHub -web interface, but we’ll read about that later. For now let’s return to the -subject of this section; Login nodes. To connect to these, you would typically -use a remote shell connection. The most usual tool to do so is SSH. You’ll hear -and read a lot about this tool. Imagine it as a very long (and somewhat -magical) extension cord which connects the computer you are using now, such as -your laptop, to a remote computer’s terminal shell. You might already know what -a terminal shell is if you ever used the command line. -" -The compute nodes,https://docs.mila.quebec/Theory_cluster.html#the-compute-nodes,"The compute nodes -In the field of artificial intelligence, you will usually be on the hunt for -GPUs. In most clusters, the compute nodes are the ones with GPU capacity. -While there is a general paradigm to tend towards a homogeneous configuration -for nodes, this is not always possible in the field of artificial intelligence -as the hardware evolve rapidly as is being complemented by new hardware and so -on. Hence, you will often read about computational node classes. Some of which -might have different GPU models or even no GPU at all. For the Mila cluster you -will find this information in the Node profile description section. For -now, you should note that is important to keep in mind that you should be aware -of which nodes your code is running on. More on that later. -" -The storage nodes,https://docs.mila.quebec/Theory_cluster.html#the-storage-nodes,"The storage nodes -Some computers on a cluster function to only store and serve files. While the -name of these computers might matter to some, as a user, you’ll only be -concerned about the path to the data. More on that in the Processing data section. 
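-As a first concrete taste of the SSH connection described above, this is what
-plugging in that “extension cord” looks like in practice; the host and port
-below are those of the Mila login nodes covered later in this document, and
-<username> is a placeholder:
-# Open a remote shell on a cluster login node
-ssh <username>@login.server.mila.quebec -p 2222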
-" -Different nodes for different uses,https://docs.mila.quebec/Theory_cluster.html#different-nodes-for-different-uses,"Different nodes for different uses -It is important to note here the difference in intended uses between the -compute nodes and the login nodes. While the compute nodes are meant for heavy -computation, the login nodes are not. -The login nodes however are used by everyone who uses the cluster and care must -be taken not to overburden these nodes. Consequently, only very short and light -processes should be run on these otherwise the cluster may become inaccessible. -In other words, please refrain from executing long or compute intensive -processes on login nodes because it affects all other users. In some cases, you -will also find that doing so might get you into trouble. -" -UNIX,https://docs.mila.quebec/Theory_cluster.html#unix,"UNIX -All clusters typically run on GNU/Linux distributions. Hence a minimum -knowledge of GNU/Linux and BASH is usually required to use them. See the -following tutorial -for a rough guide on getting started with Linux. -" -The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"The workload manager -On a cluster, users don’t have direct access to the compute nodes but -instead connect to a login node and add jobs to the workload manager -queue. Whenever there are resources available to execute these jobs -they will be allocated to a compute node and run, which can be -immediately or after a wait of up to several days. -A job is comprised of a number of steps that will run one after the -other. This is done so that you can schedule a sequence of processes -that can use the results of the previous steps without having to -manually interact with the scheduler. -Each step can have any number of tasks which are groups of processes -that can be scheduled independently on the cluster but can run in -parallel if there are resources available. The distinction between -steps and tasks is that multiple tasks, if they are part of the same -step, cannot depend on results of other tasks because there are no -guarantees on the order in which they will be executed. -Finally each process group is the basic unit that is scheduled in the -cluster. It comprises of a set of processes (or threads) that can run -on a number of resources (CPU, GPU, RAM, …) and are scheduled -together as a unit on one or more machines. -Each of these concepts lends itself to a particular use. For multi-gpu -training in AI workloads you would use one task per GPU for data -paralellism or one process group if you are doing model -parallelism. Hyperparameter optimisation can be done using a -combination of tasks and steps but is probably better left to a -framework outside of the scope of the workload manager. -If this all seems complicated, you should know that all these things -do not need to always be used. It is perfectly acceptable to sumbit -jobs with a single step, a single task and a single process. -The available resource" -The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"s on the cluster are not infinite and it is the -workload manager’s job to allocate them. Whenever a job request comes -in and there are not enough resources available to start it -immediately, it will go in the queue. -Once a job is in the queue, it will stay there until another job -finishes and then the workload manager will try to use the newly freed -resources with jobs from the queue. 
The exact order in which the jobs
-will start is not fixed, because it depends on the local policies
-which can take into account the user priority, the time since the job
-was requested, the amount of resources requested and possibly other
-things. There should be a tool that comes with the manager where you
-can see the status of your queued jobs and why they remain in the
-queue.
-The workload manager will divide the cluster into partitions according
-to the configuration set by the admins. A partition is a set of
-machines typically reserved for a particular purpose. An example might
-be CPU-only machines for preprocessing, set up as a separate partition.
-It is possible for multiple partitions to share resources.
-There will always be at least one partition that is the default
-partition in which jobs without a specific request will go. Other
-partitions can be requested, but might be restricted to a group of
-users, depending on policy.
-Partitions are useful from a policy standpoint to ensure efficient use
-of the cluster resources and to avoid exhausting one resource type
-while blocking use of another. They are also useful for heterogeneous
-clusters where different hardware is mixed in and not all software is
-compatible with all of it (for example x86 and POWER CPUs).
-To ensure a fair share of the computing resources for all, the workload
-manager establishes limits on the amount of resources that a single
-user can us"
-The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"e at once. These can be hard limits, which prevent running
-jobs when you go over, or soft limits, which will let you run jobs
-only until some other job needs the resources.
-Admin policy will determine what those exact limits are for a
-particular cluster or user and whether they are hard or soft limits.
-The way soft limits are enforced is using preemption, which means that
-when another job with higher priority needs the resources that your
-job is using, your job will receive a signal that it needs to save its
-state and exit. It will be given a certain amount of time to do this
-(the grace period, which may be 0s) and then forcefully terminated if
-it is still running.
-Depending on the workload manager in use and the cluster configuration,
-a job that is preempted like this may be automatically rescheduled to
-have a chance to finish or it may be up to the job to reschedule
-itself.
-The other limit you can encounter applies to a job that goes over its
-declared limits. When you schedule a job, you declare how many
-resources it will need (RAM, CPUs, GPUs, …). Some of those may have
-default values and not be explicitly defined. For certain types of
-devices, like GPUs, access to units over your job limit is made
-unavailable. For others, like RAM, usage is monitored and your job
-will be terminated if it goes too far over. This makes it important
-to ensure you estimate resource usage accurately.
-Mila as well as the Digital Research Alliance of Canada use the workload
-manager Slurm to schedule and
-allocate resources on their infrastructure.
-Slurm client commands are available on the login nodes for you to submit
-jobs to the main controller and add your job to the queue. Jobs are of 2 types:
-batch jobs and interactive jobs.
-For practical examples of Slurm commands on the Mila cluster, see Running your code."
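-As a brief illustration of the two job types just mentioned, here is what each
-looks like with the Slurm client commands; the resource values and script name
-are placeholder assumptions, and each cluster documents its own expected flags:
-# Batch job: queue a script and return to your shell immediately
-sbatch --time=1:00:00 --cpus-per-task=4 --mem=16G job.sh
-
-# Inspect your queued and running jobs (and why they are still waiting)
-squeue -u $USER
-
-# Interactive job: get a shell on a compute node once resources are allocated
-salloc --time=1:00:00 --cpus-per-task=4 --mem=16G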
-Processing data,https://docs.mila.quebec/Theory_cluster.html#processing-data,"Processing data
-For processing the large amounts of data common in deep learning, either
-for dataset preprocessing or training, several techniques exist. Each
-has typical uses and limitations.
-"
-Data parallelism,https://docs.mila.quebec/Theory_cluster.html#data-parallelism,"Data parallelism
-The first technique is called data parallelism (aka task
-parallelism in formal computer science). You simply run lots of
-processes, each handling a portion of the data you want to
-process. This is by far the easiest technique to use and should be
-favored whenever possible. A common example of this is
-hyperparameter optimisation.
-For really small computations the time to set up multiple processes
-might be longer than the processing time itself and lead to waste. This can
-be addressed by bunching some of the processes together and doing
-sequential processing of sub-partitions of the data.
-On cluster systems it is also inadvisable to launch thousands of
-jobs: even if each job runs for a reasonable amount of time
-(several minutes at minimum), it is best to make larger groups
-until the number of jobs is in the low hundreds at most.
-Finally, another thing to keep in mind is that the transfer bandwidth
-is limited between the filesystems (see Filesystem concerns)
-and the compute nodes, and if you run too many jobs using too much data
-at once they may end up being no faster because they will spend
-their time waiting for data to arrive.
-"
-Model parallelism,https://docs.mila.quebec/Theory_cluster.html#model-parallelism,"Model parallelism
-The second technique is called model parallelism (which doesn’t
-have a single equivalent in formal computer science). It is used
-mostly when a single instance of a model will not fit in a computing
-resource (such as the GPU memory being too small for all the
-parameters).
-In this case, the model is split into its constituent parts, each
-processed independently, with their intermediate results communicated
-to each other to arrive at a final result.
-This is generally harder but necessary to work with larger, more
-powerful models like GPT.
-"
-Communication concerns,https://docs.mila.quebec/Theory_cluster.html#communication-concerns,"Communication concerns
-The main difference between these two approaches is the need for
-communication between the multiple processes. Some common training
-methods, like stochastic gradient descent, sit somewhere between the
-two, because they require some communication, but not a lot. Most
-people classify it as data parallelism since it sits closer to that
-end.
-In general, for data parallelism tasks or tasks that communicate
-infrequently, it doesn’t make a lot of difference where the processes
-sit because the communication bandwidth and latency will not have a
-lot of impact on the time it takes to complete the job. The
-individual tasks can generally be scheduled independently.
-On the contrary, for model parallelism you need to pay more attention
-to where your tasks are. In this case it is usually required to use
-the facilities of the workload manager to group the tasks so that they
-are on the same machine or machines that are closely linked to ensure
-optimal communication.
What is the best allocation depends on the
-specific cluster architecture available and the technologies it
-supports (such as InfiniBand,
-RDMA,
-NVLink or others).
-"
-Filesystem concerns,https://docs.mila.quebec/Theory_cluster.html#filesystem-concerns,"Filesystem concerns
-When working on a cluster, you will generally encounter several
-different filesystems. Usually there will be names such as ‘home’,
-‘scratch’, ‘datasets’, ‘projects’, ‘tmp’.
-The reason for having different filesystems available instead of a
-single giant one is to provide for different use cases. For example, the
-‘datasets’ filesystem would be optimized for fast reads but have
-slow write performance. This is because datasets are usually written
-once and then read very often for training.
-Different filesystems have different performance levels. For instance, backed
-up filesystems (such as $PROJECT in Digital Research Alliance of Canada
-clusters) provide more space and can handle large files but cannot sustain
-the highly parallel accesses typically required for high-speed model training.
-The set of filesystems provided by the cluster you are using should be
-detailed in the documentation for that cluster and the names can
-differ from those above. You should pay attention to their recommended
-use case in the documentation and use the appropriate filesystem for
-the appropriate job. There are cases where a job ran hundreds of times
-slower because it tried to use a filesystem that wasn’t a good fit for
-the job.
-One last thing to pay attention to is the data retention policy for
-the filesystems. This has two subpoints: how long the data is kept,
-and whether there are backups.
-Some filesystems will have a limit on how long they keep their
-files. Typically the limit is some number of days (like 90 days) but
-can also be ‘as long as the job runs’ for some.
-As for backups, some filesystems will not have a limit on data retention, but
-will also not have backups. For those it is important to maintain a
-copy of any crucial data somewhere else. The data will not be
-purposefully deleted, but the filesystem may fail and lose all or part
-of its data. If you have any data that is crucial for a paper or your
-thesis, keep an additional copy of it somewhere else.
-"
-Software on the cluster,https://docs.mila.quebec/Theory_cluster.html#software-on-the-cluster,"Software on the cluster
-This section aims to raise awareness of problems one can encounter when trying
-to run software on different computers, and how this is dealt with on typical
-computation clusters.
-The Mila cluster and the Digital Research Alliance of Canada clusters both
-provide various useful software and computing environments, which can be
-activated through the module system. Alternatively, you may build containers
-with your desired software and run them on compute nodes.
-Regarding Python development, we recommend using virtual environments to install
-Python packages in isolation.
-"
-Cluster software modules,https://docs.mila.quebec/Theory_cluster.html#cluster-software-modules,"Cluster software modules
-Modules are small files which modify your environment variables to point to
-specific versions of various software and libraries. For instance, a module
-might provide the python command to point to Python 3.7, another might
-activate CUDA version 11.0, another might provide the torch package, and so
-on.
-For more information, see The module command.
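-To get a feel for the module system before following that link, here is a
-sketch of typical commands; the module names and versions are examples, and
-what is actually available differs per cluster:
-module avail              # list software made available through modules
-module load python/3.7    # point the python command at this version
-module spider cuda        # show details and available versions of a module
-module list               # show what is currently loaded in your shell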
-" -Containers,https://docs.mila.quebec/Theory_cluster.html#containers,"Containers -Containers are a special form of isolation of software and its dependencies. A -container is essentially a lightweight virtual machine: it encapsulates a -virtual file system for a full OS installation, as well as a separate network -and execution environment. -For example, you can create an Ubuntu container in which you install various -packages using apt, modify settings as you would as a root user, and so on, -but without interfering with your main installation. Once built, a container can -be run on any compatible system. -For more information, see Using containers on clusters. -" -Python Virtual environments,https://docs.mila.quebec/Theory_cluster.html#python-virtual-environments,"Python Virtual environments -A virtual environment in Python is a local, isolated environment in which you -can install or uninstall Python packages without interfering with the global -environment (or other virtual environments). In order to use a virtual -environment, you first have to activate it. -For more information, see Virtual environments. -" -"Who, what, where is IDT",https://docs.mila.quebec/IDT.html#who-what-where-is-idt,"Who, what, where is IDT -This section seeks to help Mila researchers understand the mission and role of -the IDT team. -" -IDT’s mission,https://docs.mila.quebec/IDT.html#idt-s-mission,"IDT’s mission - -" -The IDT team,https://docs.mila.quebec/IDT.html#the-idt-team,"The IDT team -See https://mila.quebec/en/mila/team/?cat_id=143 -" -Purpose of this documentation,https://docs.mila.quebec/Purpose.html#purpose-of-this-documentation,"Purpose of this documentation -This documentation aims to cover the information required to run scientific -and data-intensive computing tasks at Mila and the available resources for its -members. -It also aims to be an outlet for sharing know-how, tips and tricks and examples -from the IDT team to the Mila researcher community. -" -Intended audience,https://docs.mila.quebec/Purpose.html#intended-audience,"Intended audience -This documentation is mainly intended for Mila researchers having access to the -Mila cluster. This access is determined by your researcher status. See -Roles and authorizations for more information. The core of the -information with this purpose can be found in the following section: -Computing infrastructure and policies. -However, we also aim to provide more general information which can be useful -outside the scope of using the Mila cluster. For instance, more general theory -on computational considerations and such. In this perspective, we hope the -documentation can be of use for all of Mila members. -" -Contributing,https://docs.mila.quebec/Purpose.html#contributing,"Contributing -See the following file for contribution guidelines : -# Contributing to the Mila Docs - -Thank you for your interest into making a better documentation for all at Mila. - -Here are some guidelines to help bring your contributions to life. 
-
-## What should be included in the Mila Docs
-
-* Mila cluster usage
-* Digital Research Alliance of Canada cluster usage
-* Job management tips / tricks
-* Research good practices
-* Software development good practices
-* Useful tools
-
-**_NOTE_**: Examples should aim to not consume much more than 1 GPU-hour and 2 CPU-hours
-
-## Issues / Pull Requests
-
-### Issues
-
-Issues can be used to report any error in the documentation, missing or unclear
-sections, broken tools or other suggestions to improve the overall
-documentation.
-
-### Pull Requests
-
-PRs are welcome and we value the contents of contributions over the appearance
-or functionality of the pull request. If you don't know how to write the proper
-markup in reStructuredText, simply provide the content you would like to add in
-the PR text form (which supports markdown) or with instructions on how to format
-the content. In the PR, reference the related issues like this:
-
-```
-Resolves: #123
-See also: #456, #789
-```
-
-If you would like to contribute directly to the source of the documentation, keep
-the line width to 80 characters or less. You can attempt to build the docs
-yourself to see if the formatting is right:
-
-```console
-python3 -m pip install -r docs/requirements.txt
-sphinx-build -b html docs/ docs/_build/
-```
-
-This will produce the html version of the documentation which you can navigate
-by opening the local file `docs/_build/index.html`.
-
-If you have any trouble building the docs, don't hesitate to open an issue to
-request help.
-
-Regarding the reStructuredText format"
-Contributing,https://docs.mila.quebec/Purpose.html#contributing,", you can simply provide the content
-you would like to add in markdown or plain text format if that is more convenient
-for you, and someone down the line will take responsibility for converting
-the format.
-
-## Sphinx / reStructuredText (reST)
-
-The markup language used for the Mila Docs is
-[reStructuredText](http://docutils.sourceforge.net/rst.html) and we follow the
-[Python’s Style Guide for documenting](https://docs.python.org/devguide/documenting.html#style-guide).
-
-Here are some reST syntax directives which are useful to know
-(more can be found in
-[Sphinx's reST Primer](https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html)):
-
-
-### Inline markup
-
-* one asterisk: `*text*` for *emphasis* (italics),
-* two asterisks: `**text**` for **strong emphasis** (boldface),
-* backquotes: ` ``text`` ` for `code samples`, and
-* external links: `` `Link text <URL>`_ ``.
-
-### Lists
-
-```reST
-* this is
-* a list
-
-  * with a nested list
-  * and some subitems
-
-* and here the parent list continues
-```
-
-### Sections
-
-```reST
-#################
-This is a heading
-#################
-```
-
-There are no heading levels assigned to certain characters as the structure is
-determined from the succession of headings. However, the Python documentation
-suggests the following convention:
-
-  * `#` with overline, for parts
-  * `*` with overline, for chapters
-  * `=`, for sections
-  * `-`, for subsections
-  * `^`, for subsubsections
-  * `""`, for paragraphs
-
-### Note box
-
-```reST
-.. note:: This is a long
-   long long note
-```
-
-### Collapsible boxes
-
-This is a local extension, not part of Sphinx itself. It works like this:
-
-```reST
-.. container:: toggle
-
-   .. container:: header
-
-      **Show/Hide Code**
-
-   .. code-block::
-      ...
-``` - - -" -Visual Studio Code,https://docs.mila.quebec/VSCode.html#visual-studio-code,"Visual Studio Code -One editor of choice for many researchers is VSCode. One feature of VSCode is -remote editing through SSH. This allows you to edit files on the cluster as if -they were local. You can also debug your programs using VSCode’s debugger, open -terminal sessions, etc. -" -Connecting to the cluster,https://docs.mila.quebec/VSCode.html#connecting-to-the-cluster,"Connecting to the cluster -VSCode cannot be used to edit code on the login nodes, because it is a heavy -enough process (a node process, plus the language server, linter, and -possibly other plugins depending on your configured environment) that there is a -risk of overloading the login nodes if too many researchers did it at the same -time. -Therefore, to use VSCode on the cluster, you first need to allocate a compute -node, then connect to that node. -The milatools package provides a command to make the operation easier. More -info can be found here. -" -Activating an environment,https://docs.mila.quebec/VSCode.html#activating-an-environment,"Activating an environment -Reference -To activate a conda or pip environment, you can open the command palette with -Ctrl+Shift+P and type “Python: Select interpreter”. This will prompt you for the -path to the Python executable for your environment. - -Tip -If you already have the environment activated in a terminal session, you can -run the command which python to get the path for this environment. This -path can be pasted into the interpreter selection prompt in VSCode to use -that same environment. - -" -Troubleshooting,https://docs.mila.quebec/VSCode.html#troubleshooting,"Troubleshooting -" -“Cannot reconnect”,https://docs.mila.quebec/VSCode.html#cannot-reconnect,"“Cannot reconnect” -When connecting to multiple compute nodes (and/or from multiple computers), some -instances may crash with that message because of conflicts in the lock files -VSCode installs in ~/.vscode-server (which is shared on all compute nodes). -To fix this issue, you can change this setting in your settings.json file: -{ ""remote.SSH.lockfilesInTmp"": true } - - -This will store the necessary lockfiles in /tmp on the compute nodes (which -are local to the node). -" -Debugger timeouts,https://docs.mila.quebec/VSCode.html#debugger-timeouts,"Debugger timeouts -Sometimes, slowness on the compute node or the networked filesystem might cause -the VSCode debugger to timeout when starting a remote debug process. As a quick -fix, you can add this to your ~/.bashrc or ~/.profile or equivalent -resource file for your preferred shell, to increase the timeout delay to 500 -seconds: -export DEBUGPY_PROCESS_SPAWN_TIMEOUT=500 - - -" -Computational resources outside of Mila,https://docs.mila.quebec/Extra_compute.html#computational-resources-outside-of-mila,"Computational resources outside of Mila -This section seeks to provide insights and information on computational -resources outside the Mila cluster itself. -" -Digital Research Alliance of Canada Clusters,https://docs.mila.quebec/Extra_compute.html#digital-research-alliance-of-canada-clusters,"Digital Research Alliance of Canada Clusters -The clusters named Beluga, Cedar, Graham, Narval and Niagara are -clusters provided by the Digital Research Alliance of Canada organisation (the Alliance). For Mila researchers, these -clusters are to be used for larger experiments having many jobs, multi-node -computation and/or multi-GPU jobs as well as long running jobs. 
If you use
-these resources for your research, please remember to acknowledge their use in
-your papers.
-
-Note
-Compute Canada ceased its operational responsibilities for supporting Canada’s
-national advanced research computing (ARC) platform on March 31, 2022. The services
-will be supported by the new Digital Research Alliance of Canada.
-https://ace-net.ca/compute-canada-operations-move-to-the-digital-research-alliance-of-canada-(the-alliance).html
-
-"
-Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"Current allocation description
-Clusters of the Alliance are shared with researchers across the country.
-Allocations are given by the Alliance to selected research groups to ensure
-a minimal amount of computational resources throughout the year.
-Depending on your affiliation, you will have access to different allocations. If
-you are a student at University of Montreal, you can have access to the
-rrg-bengioy-ad allocation described below. For students from other
-universities, you should ask your advisor to know which allocations you could
-have access to.
-From the Alliance’s documentation: An allocation is an amount of resources
-that a research group can target for use for a period of time, usually a year.
-To be clear, it is not a maximal amount of resources that can be used
-simultaneously; it is a weighting factor of the workload manager to balance
-jobs. For instance, even though we are allocated 400 GPU-years across all
-clusters, we can use more or less than 400 GPUs simultaneously depending on the
-history of usage from our group and other groups using the cluster at a given
-period of time. Please see the Alliance’s doc"
-Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"umentation for
-more information on how allocations and resource scheduling are configured for
-these installations.
-The table below provides information on the allocation for
-rrg-bengioy-ad for the period which spans from April 2022 to
-April 2023. Note that there are no special allocations for GPUs on
-Graham and therefore jobs with GPUs should be submitted with the
-account def-bengioy.
-| Cluster | CPUs (#) | CPU account    | GPU model | GPUs (#) | GPU SLURM type specifier | GPU account    |
-|---------|----------|----------------|-----------|----------|--------------------------|----------------|
-| Beluga  | 238      | rrg-bengioy-ad | V100-16G  | 77       | v100                     | rrg-bengioy-ad |
-| Cedar   | 34       | rrg-bengioy-ad | V100-32G  | 138      | v100l                    | rrg-bengioy-ad |
-| Graham  | 34       | rrg-bengioy-ad | various   | –        | –                        | def-bengioy    |
-| Narval  | 34       | rrg-bengioy-ad | A100-40G  | 185      | a100                     | rrg-bengioy-ad |
-"
-Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation
-To access the Alliance clusters you have to first create an account at
-https://ccdb.computecanada.ca. Use a password with at least 8 characters, mixed
-case letters, digits and special characters. Later you will be asked to create
-another password with those rules, and it’s really convenient if the two
-passwords are the same.
-Then, you have to apply for a role at
-https://ccdb.computecanada.ca/me/add_role, which basically means telling the
-Alliance that you are part of the lab so they know which clusters you can have
-access to, and track your usage.
-You will be asked for the CCRI. Please reach out to your
-sponsor to get the CCRI.
-
-You will need to wait for your sponsor to accept before being able to login
-to the Alliance clusters.
-"
-Clusters,https://docs.mila.quebec/Extra_compute.html#clusters,"Clusters
-
-Beluga: (Mila doc)
-(Digital Research Alliance of Canada doc)
-For most students, Beluga is the best choice for both CPU and GPU jobs because
-of larger allocations on this cluster.
-
-Narval: (Mila doc)
-(Digital Research Alliance of Canada doc)
-Narval is the newest cluster, and contains the most powerful GPUs (A100). If your
-job can benefit from the A100’s features, such as TF32 floating-point math, Narval
-is the best choice.
-
-Cedar: (Mila doc)
-(Digital Research Alliance of Canada doc)
-Cedar is a good alternative to Beluga if you absolutely need to have an internet connection
-on the compute nodes.
-
-Graham: (Mila doc)
-(Digital Research Alliance of Canada doc)
-We do not have a GPU allocation on Graham anymore but it remains an alternative for CPU jobs.
-
-Niagara: (Mila doc)
-(Digital Research Alliance of Canada doc)
-Niagara is not recommended for most students. It is a CPU-only cluster with unusual
-configurations. Access is not automatic; it is opt-in and must be requested via
-CCDB manually. Compute resources on Niagara are not assigned to jobs on a per-CPU
-basis, but on a per-node basis.
-
-
-"
-Beluga,https://docs.mila.quebec/Extra_compute.html#beluga,"Beluga
-Beluga is a cluster located at ÉTS in Montreal. It
-uses SLURM to schedule jobs. Its full documentation can be found here, and its current status
-here.
-You can access Beluga via ssh:
-ssh <username>@beluga.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-"
-Launching Jobs,https://docs.mila.quebec/Extra_compute.html#launching-jobs,"Launching Jobs
-Users must specify the resource allocation Group Name using the flag
---account=rrg-bengioy-ad. To launch a CPU-only job:
-sbatch --time=1:0:0 --account=rrg-bengioy-ad job.sh
-
-Note
-The account name will differ based on your affiliation.
-
-To launch a GPU job:
-sbatch --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 job.sh
-And to get an interactive session, use the salloc command:
-salloc --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1
-The full documentation for launching jobs on Beluga can be found here.
-"
-Beluga nodes description,https://docs.mila.quebec/Extra_compute.html#beluga-nodes-description,"Beluga nodes description
-Each GPU node consists of:
-
-40 CPU cores
-186 GB RAM
-4 NVIDIA V100 GPUs (16GB)
-
-
-Tip
-You should ask for at most 10 CPU cores and 32 GB of RAM per GPU you are
-requesting (as explained here),
-otherwise, your job will count for more than 1 allocation, and will take
-more time to get scheduled.
-
-"
-Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage
-| Storage        | Path                     | Usage                                                         |
-|----------------|--------------------------|---------------------------------------------------------------|
-| $HOME          | /home/<username>/        | Code, specific libraries                                      |
-| $HOME/projects | /project/rpp-bengioy     | Compressed raw datasets                                       |
-| $SCRATCH       | /scratch/<username>      | Processed datasets, experimental results, logs of experiments |
-| $SLURM_TMPDIR  | (local disk of the node) | Temporary job results                                         |
-They are roughly listed in order of increasing performance and optimized for
-different uses:
-
-The $HOME folder on NFS is appropriate for code and libraries which are
-small and read once. Do not write experimental results here!
-The $HOME/projects folder should only contain compressed raw datasets
-(processed datasets should go in $SCRATCH).
We have a limit on the
-size and number of files in $HOME/projects, so do not put anything else
-there. If you add a new dataset there, make sure it is readable by every
-member of the group, using chgrp -R rpp-bengioy <dataset>.
-The $SCRATCH space can be used for short-term storage. It has good
-performance and large quotas, but is purged regularly (every file that has
-not been used in the last 3 months gets deleted, but you receive an email
-before this happens).
-$SLURM_TMPDIR points to the local disk of the node on which a job is
-running. It should be used to copy the data on the node at the beginning of
-the job and write intermediate checkpoints. This folder is cleared after each
-job.
-
-When an experiment is finished, results should be transferred back to Mila
-servers.
-More details on storage can be found here.
-"
-Modules,https://docs.mila.quebec/Extra_compute.html#modules,"Modules
-Much software, such as Python or MATLAB, is already compiled and available on
-Beluga through the module command and its subcommands. Its full
-documentation can be found here.
-| Command                 | Description                           |
-|-------------------------|---------------------------------------|
-| module avail            | Displays all the available modules    |
-| module load <module>    | Loads <module>                        |
-| module spider <module>  | Shows specific details about <module> |
-In particular, if you wish to use Python 3.6 you can simply do:
-module load python/3.6
-
-Tip
-If you wish to use Python on the cluster, we strongly encourage you to
-read Alliance Python Documentation, and in particular the Pytorch and/or Tensorflow pages.
-
-The cluster has many Python packages (or wheels) already compiled for
-the cluster. See here for the
-details. In particular, you can browse the packages by doing:
-avail_wheels <package_name>
-Such wheels can be installed using pip. Moreover, the most efficient way to use
-modules on the cluster is to build your environment inside your job.
-See the script example below.
-"
-Script Example,https://docs.mila.quebec/Extra_compute.html#script-example,"Script Example
-Here is an sbatch script that follows good practices on Beluga:
-#!/bin/bash
-#SBATCH --account=rrg-bengioy-ad         # Yoshua pays for your job
-#SBATCH --cpus-per-task=6                # Ask for 6 CPUs
-#SBATCH --gres=gpu:1                     # Ask for 1 GPU
-#SBATCH --mem=32G                        # Ask for 32 GB of RAM
-#SBATCH --time=3:00:00                   # The job will run for 3 hours
-#SBATCH -o /scratch/<user>/slurm-%j.out  # Write the log in $SCRATCH
-
-# 1. Create your environment locally
-module load python/3.6
-virtualenv --no-download $SLURM_TMPDIR/env
-source $SLURM_TMPDIR/env/bin/activate
-pip install --no-index torch torchvision
-
-# 2. Copy your dataset on the compute node
-# IMPORTANT: Your dataset must be compressed in one single file (zip, hdf5, ...)!!!
-cp $SCRATCH/<dataset.zip> $SLURM_TMPDIR
-
-# 3. Eventually unzip your dataset
-unzip $SLURM_TMPDIR/<dataset.zip> -d $SLURM_TMPDIR
-
-# 4. Launch your job, tell it to save the model in $SLURM_TMPDIR
-#    and look for the dataset into $SLURM_TMPDIR
-python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR
-
-# 5. Copy whatever you want to save on $SCRATCH
-cp $SLURM_TMPDIR/<to_save> $SCRATCH
-
-
-"
-Using CometML and Wandb,https://docs.mila.quebec/Extra_compute.html#using-cometml-and-wandb,"Using CometML and Wandb
-The compute nodes for Beluga don’t have access to the internet,
-but there is a special module that can be loaded in order to allow
-training scripts to access some specific servers, which includes
-the necessary servers for using CometML and Wandb (“Weights and Biases”).
-module load httpproxy
-More documentation about this can be found here.
-"
-Graham,https://docs.mila.quebec/Extra_compute.html#graham,"Graham
-Graham is a cluster located at the University of Waterloo. It uses SLURM to schedule
-jobs. Its full documentation can be found here, and its current status here.
-You can access Graham via ssh:
-ssh <username>@graham.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-Since its structure is similar to Beluga’s, please look at the Beluga
-documentation, as well as relevant parts of the Digital Research Alliance of
-Canada Documentation.
-
-Note
-For GPU jobs the resource allocation Group Name is the same as on Beluga,
-so you should use the flag --account=rrg-bengioy-ad for GPU jobs.
-
-"
-Cedar,https://docs.mila.quebec/Extra_compute.html#cedar,"Cedar
-Cedar is a cluster located at Simon Fraser University. It uses SLURM to schedule
-jobs. Its full documentation can be found here, and its current status here.
-You can access Cedar via ssh:
-ssh <username>@cedar.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-Since its structure is similar to Beluga’s, please look at the Beluga
-documentation, as well as relevant parts of the Digital Research Alliance of
-Canada Documentation.
-
-Note
-We don’t have any CPU priority on Cedar; in this case you can
-use --account=def-bengioy for CPU jobs. Thus, it might take some time before
-they start.
-
-"
-Niagara,https://docs.mila.quebec/Extra_compute.html#niagara,"Niagara
-Niagara is a cluster located at the University of Toronto. It uses SLURM to schedule
-jobs. Its full documentation can be found here, and its current status here.
-You can access Niagara via ssh:
-ssh <username>@niagara.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-Since its structure is similar to Beluga’s, please look at the Beluga
-documentation, as well as relevant parts of the Digital Research Alliance of
-Canada Documentation.
-"
-FAQ,https://docs.mila.quebec/Extra_compute.html#faq,"FAQ
-"
-What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?,https://docs.mila.quebec/Extra_compute.html#what-to-do-with-importerror-lib64-libm-so-6-version-glibc-2-23-not-found,"What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?
-The structure of the file system is different from that of a classical Linux
-distribution, so your code has trouble finding libraries. See how to install binary packages.
-"
-Disk quota exceeded error on /project file systems,https://docs.mila.quebec/Extra_compute.html#disk-quota-exceeded-error-on-project-file-systems,"Disk quota exceeded error on /project file systems
-You have files in /project with the wrong permissions. See how to change
-permissions.
-"
-Computing infrastructure and policies,https://docs.mila.quebec/Information.html#computing-infrastructure-and-policies,"Computing infrastructure and policies
-This section seeks to provide factual information and policies on the Mila cluster computing environments.
-"
-Roles and authorizations,https://docs.mila.quebec/Information.html#roles-and-authorizations,"Roles and authorizations
-There are mainly two types of researcher statuses at Mila:
-
-Core researchers
-Affiliated researchers
-
-This is determined by Mila policy. Core researchers have access to the Mila
-computing cluster. See your supervisor’s Mila status to know what your own
-status is.
-" -Overview of available computing resources at Mila,https://docs.mila.quebec/Information.html#overview-of-available-computing-resources-at-mila,"Overview of available computing resources at Mila -The Mila cluster is to be used for regular development and relatively small -number of jobs (< 5). It is a heterogeneous cluster. It uses -SLURM to schedule jobs. -" -Mila cluster versus Digital Research Alliance of Canada clusters,https://docs.mila.quebec/Information.html#mila-cluster-versus-digital-research-alliance-of-canada-clusters,"Mila cluster versus Digital Research Alliance of Canada clusters -There are a lot of commonalities between the Mila cluster and the clusters from -Digital Research Alliance of Canada (the Alliance). At the time being, the -Alliance clusters where we have a large allocation of resources are beluga, -cedar, graham and narval. We also have comparable computational resources -in the Mila cluster, with more to come. -The main distinguishing factor is that we have more control over our own -cluster than we have over the ones at the Alliance. Notably, also, the compute -nodes in the Mila cluster all have unrestricted access to the Internet, which -is not the case in general for the Alliance clusters (although cedar does -allow it). -At the current time of this writing (June 2021), Mila students are advised to -use a healthy diet of a mix of Mila and Alliance clusters. This is especially -true in times when your favorite cluster is oversubscribed, because you can -easily switch over to a different one if you are used to it. -" -Guarantees about one GPU as absolute minimum,https://docs.mila.quebec/Information.html#guarantees-about-one-gpu-as-absolute-minimum,"Guarantees about one GPU as absolute minimum -There are certain guarantees that the Mila cluster tries to honor when it comes -to giving at minimum one GPU per student, all the time, to be used in -interactive mode. This is strictly better than “one GPU per student on average” -because it’s a floor meaning that, at any time, you should be able to ask for -your GPU, right now, and get it (although it might take a minute for the -request to be processed by SLURM). -Interactive sessions are possible on the Alliance clusters, and there are -generally special rules that allow you to get resources more easily if you -request them for a very short duration (for testing code before queueing long -jobs). You do not get the same guarantee as on the Mila cluster, however. 
-" -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description -| ('Name', 'Name') | ('GPU', 'Model') | ('GPU', 'Mem') | ('GPU', '#') | ('CPUs', 'CPUs') | ('Sockets', 'Sockets') | ('Cores/Socket', 'Cores/Socket') | ('Threads/Core', 'Threads/Core') | ('Memory (GB)', 'Memory (GB)') | ('TmpDisk (TB)', 'TmpDisk (TB)') | ('Arch', 'Arch') | ('Slurm Features', 'GPU Arch and Memory') | -|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|------------------------------------|------------------------------------|----------------------------------|------------------------------------|--------------------------|---------------------------------------------| -| GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | -| cn-a[001-011] | RTX8000 | 48 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | turing,48gb | -| cn-b[001-005] | V100 | 32 | 8 | 40 | 2 | 20 " -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | 1 | 384 | 3.6 | x86_64 | volta,nvlink,32gb | -| cn-c[001-040] | RTX8000 | 48 | 8 | 64 | 2 | 32 | 1 | 384 | 3 | x86_64 | turing,48gb | -| cn-g[001-026] | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 7 | x86_64 | ampere,nvlink,80gb | -| DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | -| cn-d[001-002] | A100 | 40 | 8 | 128 | 2 | 64 | 1 | 1024 | 14 | x86_64 | ampere,nvlink,40gb " -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | -| cn-d[003-004] | A100 | 80 | 8 | 128 | 2 | 64 | 1 | 2048 | 28 | x86_64 | ampere,nvlink,80gb | -| cn-e[002-003] | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 512 | 7 | x86_64 | volta,32gb | -| CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | -| cn-f[001-004] | nan | nan | nan | 32 | 1 | 32 | 1 | 256 | 10 | x86_64 | rome | -| cn-h[001-004] | nan | nan | nan | 64 | 2 | 32 " -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | 1 | 768 | 7 | x86_64 | milan | -| Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | -| kepler5 | V100 | 16 | 2 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | volta,16gb | -| TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | -| rtx[1,3-5,7] | titanrtx | 24 | 2 | 20 | 1 | 10 | 2 | 128 | 0.93 | x86_64 | turing,24gb | -" -Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers -" -DGX A100,https://docs.mila.quebec/Information.html#dgx-a100,"DGX A100 -DGX A100 nodes are NVIDIA appliances with 8 NVIDIA A100 Tensor Core GPUs. 
Each
-GPU has 40 GB of memory, for a total of 320 GB per appliance. The GPUs are
-interconnected via 6 NVSwitches which allow 4.8 TB/s of bi-directional bandwidth.
-In order to run jobs on a DGX A100, add the flags below to your Slurm
-commands:
---gres=gpu:a100:<number> --reservation=DGXA100
-
-
-"
-MIG,https://docs.mila.quebec/Information.html#mig,"MIG
-MIG (Multi-Instance GPU)
-is an NVIDIA technology allowing certain GPUs to be
-partitioned into multiple instances, each of which has a roughly proportional
-amount of compute resources, device memory and bandwidth to that memory.
-NVIDIA supports MIG on its A100 GPUs and allows slicing the A100 into up to 7
-instances. Although this can theoretically be done dynamically, the SLURM job
-scheduler does not support doing so in practice as it does not model
-reconfigurable resources very well. Therefore, the A100s must currently be
-statically partitioned into the required number of instances of every size
-expected to be used.
-The cn-g series of nodes include A100-80GB GPUs. One third have been
-configured to offer regular (non-MIG mode) a100l GPUs. The other two-thirds
-have been configured in MIG mode, and offer the following profiles:
-| Name                  | GPU Model | GPU Memory   | GPU Compute   | # (Cluster-wide) |
-|-----------------------|-----------|--------------|---------------|------------------|
-| a100l.1g.10gb a100l.1 | A100      | 10GB (1/8th) | 1/7th of full | 72               |
-| a100l.2g.20gb a100l.2 | A100      | 20GB (2/8th) | 2/7th of full | 108              |
-| a100l.3g.40gb a100l.3 | A100      | 40GB (4/8th) | 3/7th of full | 72               |
-These can be requested using a SLURM flag such as --gres=gpu:a100l.1
-The partitioning may be revised as needs and SLURM capabilities evolve. Other
-MIG profiles exist and could be introduced.
-
-Warning
-MIG has a number of important limitations,
-most notably that a GPU in MIG mode does not support graphics APIs
-(OpenGL/Vulkan), nor P2P over NVLink and PCIe. We have therefore chosen to
-limit every MIG job to exactly one MIG slice and no more. Thus,
---gres=gpu:a100l.3 will work (and request a size-3 slice of an
-a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting
-three size-1 slices) will not.
-"
-AMD,https://docs.mila.quebec/Information.html#amd,"AMD
-
-Warning
-As of August 20, 2019 the GPUs had to be returned to AMD. Mila will get
-more samples. You can join the amd slack channels to get the latest
-information.
-
-Mila has a few nodes equipped with MI50 GPUs.
-srun --gres=gpu -c 8 --reservation=AMD --pty bash
-
-# first-time setup of the AMD stack
-conda create -n rocm python=3.6
-conda activate rocm
-
-pip install tensorflow-rocm
-pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl
-"
-Data sharing policies,https://docs.mila.quebec/Information.html#data-sharing-policies,"Data sharing policies
-
-Note
-/network/scratch aims to support
-Access Control Lists (ACLs)
-to allow collaborative work on rapidly changing data, e.g. work-in-progress
-datasets, model checkpoints, etc.
-
-/network/projects aims to offer a collaborative
-space for long-term projects. Data that should be kept for a longer period
-than 90 days can be stored in that location, but first a request to Mila’s helpdesk has to be made to create the project
-directory.
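-As an illustration of what ACL-based sharing on /network/scratch could look
-like, here is a sketch using the standard Linux ACL tools; the collaborator
-name and path are hypothetical placeholders, and since ACL support is only
-stated as an aim above, check with IT support before relying on it:
-# Grant a collaborator read and traverse access to an experiment folder
-setfacl -R -m u:<collaborator>:rX /network/scratch/<u>/<username>/shared_exp
-
-# Verify which ACL entries are now in effect
-getfacl /network/scratch/<u>/<username>/shared_exp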
-" -Monitoring,https://docs.mila.quebec/Information.html#monitoring,"Monitoring -Every compute node on the Mila cluster has a Netdata -monitoring daemon allowing you to get a sense of the state of the node. -This information is exposed in two ways: - -For every node, there is a web interface from Netdata itself at .server.mila.quebec:19999. -This is accessible only when using the Mila wifi or through SSH tunnelling. - -SSH tunnelling: on your local machine, run - -ssh -L 19999:.server.mila.quebec:19999 -p 2222 -login.server.mila.quebec -or ssh -L 19999:.server.mila.quebec:19999 mila if you have -already setup your SSH Login, - - -then open http://localhost:19999 in your browser. - - -The Mila dashboard at dashboard.server.mila.quebec -exposes aggregated statistics with the use of grafana. -These are collected internally to an instance of prometheus. - -In both cases, those graphs are not editable by individual users, -but they provide valuable insight into the state of the whole cluster -or the individual nodes. -One of the important uses is to collect data about the health -of the Mila cluster and to sound the alarm if outages occur -(e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM). -" -Example with Netdata on cn-c001,https://docs.mila.quebec/Information.html#example-with-netdata-on-cn-c001,"Example with Netdata on cn-c001 -For example, if we have a job running on cn-c001, we can type -cn-c001.server.mila.quebec:19999 in a browser address bar and the following -page will appear. - -" -Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage,"Example watching the CPU/RAM/GPU usage -Given that compute nodes are generally shared -with other users who are also running jobs at the same time and -consuming resources, this is not generally a good way to profile your code -in fine details. -However, it can still be a very useful source of information -for getting an idea of whether the machine that you requested is being -used in its full capacity. -Given how expensive the GPUs are, it generally makes sense to try to -make sure that this resources is always kept busy. - - -CPU -iowait (pink line): High values means your model is waiting on IO a lot (disk or network). - - - - - - - - -CPU RAM -You can see how much CPU RAM is being used by your script in practice, -considering the amount that you requested (e.g. `sbatch --mem=8G ...`). -GPU usage is generally more important to monitor than CPU RAM. -You should not cut it so close to the limit that your experiments randomly fail -because they run out of RAM. However, you should not request blindly 32GB of RAM -when you actually require only 8GB. - - - - - - - - -GPU -Monitors the GPU usage using an nvidia-smi plugin for Netdata. -Under the plugin interface, select the GPU" -Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage," number which was allocated to -you. You can figure this out by running echo $SLURM_JOB_GPUS on the -allocated node or, if you have the job ID, -scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX -You should make sure you use the GPUs to their fullest capacity. -Select the biggest batch size if possible to increase GPU memory usage and -the GPU computational load. -Spawn multiple experiments if you can fit many on a single GPU. -Running 10 independent MNIST experiments on a single GPU will probably take -less than 10x the time to run a single one. 
This assumes that you have more
-experiments to run, because nothing is gained by gratuitously running experiments.
-You can request a less powerful GPU and leave the more powerful GPUs
-to other researchers who have experiments that can make the best use of them.
-Sometimes you really just need a k80 and not a v100.
-
-Other users or jobs
-If the node seems unresponsive or slow,
-it may be useful to check what other tasks are
-running at the same time on that node.
-This should not be an issue in general,
-but in practice it is useful to be able to
-inspect this to diagnose certain problems.
-
-"
-Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard
-
-"
-Storage,https://docs.mila.quebec/Information.html#storage,"Storage
-| Path                                         | Performance | Usage                                                                                 | Quota (Space/Files) | Backup | Auto-cleanup |
-|----------------------------------------------|-------------|---------------------------------------------------------------------------------------|---------------------|--------|--------------|
-| /network/datasets/                           | High        | Curated raw datasets (read only)                                                        | –                   | –      | –            |
-| $HOME or /home/mila/<u>/<username>/          | Low         | Personal user space; specific libraries, code, binaries                                 | 100GB/1000K         | Daily  | no           |
-| $SCRATCH or /network/scratch/<u>/<username>/ | High        | Temporary job results, processed datasets; optimized for small files                    | no                  | no     | 90 days      |
-| $SLURM_TMPDIR                                | Highest     | High-speed disk for temporary job results                                               | 4TB/-               | no     | at job end   |
-| /network/projects/<groupname>/               | Fair        | Shared space to facilitate collaboration between researchers; long-term project storage | 200GB/1000K         | Daily  | no           |
-| $ARCHIVE or /network/archive/<u>/<username>/ | Low         | Long-term personal storage                                                              | 500GB               | no     | no           |
-
-Note
-The $HOME file system is backed up once a day. For any file
-restoration request, file a request to Mila’s IT support with the path to the file or directory to
-restore, with the required date.
-
-
-Warning
-Currently there is no backup system for any other file systems of
-the Mila cluster. Storage local to personal computers, Google Drive and other
-related solutions should be used to back up important data.
-
-"
-$HOME,https://docs.mila.quebec/Information.html#home,"$HOME
-$HOME is appropriate for code and libraries which are small and read once,
-as well as the experimental results that would be needed at a later time (e.g.
-the weights of a network referenced in a paper).
-Quotas are enabled on $HOME for both disk capacity (blocks) and number of
-files (inodes). The limits for blocks and inodes are respectively 100GiB and 1
-million per user. The command to check the quota usage from a login node is:
-beegfs-ctl --cfgFile=/etc/beegfs/home.d/beegfs-client.conf --getquota --uid $USER
-"
-$SCRATCH,https://docs.mila.quebec/Information.html#scratch,"$SCRATCH
-$SCRATCH can be used to store processed datasets, work-in-progress datasets
-or temporary job results. Its block size is optimized for small files which
-minimizes the performance hit of working on extracted datasets.
-
-Note
-Auto-cleanup: this file system is cleared on a weekly basis;
-files not used for more than 90 days will be deleted.
-
-"
-$SLURM_TMPDIR,https://docs.mila.quebec/Information.html#slurm-tmpdir,"$SLURM_TMPDIR
-$SLURM_TMPDIR points to the local disk of the node on which a job is
-running.
"
projects,https://docs.mila.quebec/Information.html#projects,"projects
projects can be used for collaborative projects. It aims to ease the
sharing of data between users working on a long-term project.
Quotas are enabled on projects for both disk capacity (blocks) and number
of files (inodes). The limits for blocks and inodes are respectively 200GiB and
1 million per user and per group.

Note
It is possible to request higher quota limits if the project requires
it. File a request to Mila's IT support.
"
$ARCHIVE,https://docs.mila.quebec/Information.html#archive,"$ARCHIVE
The purpose of $ARCHIVE is to store data other than datasets that has to be kept
long-term (e.g. generated samples, logs, data relevant for paper submission).
$ARCHIVE is only available on the login nodes. Because this file system
is tuned for large files, it is recommended to archive your directories. For
example, to archive the results of an experiment in
$SCRATCH/my_experiment_results/, run the commands below from a login node:
cd $SCRATCH
tar cJf $ARCHIVE/my_experiment_results.tar.xz --xattrs my_experiment_results
Disk capacity quotas are enabled on $ARCHIVE. The soft limit per user is
500GB and the hard limit is 550GB. The grace time is 7 days. This means that one
can use more than 500GB for up to 7 days before the file system enforces the
quota; however, it is never possible to use more than 550GB.
The command to check the quota usage from a login node is df:
df -h $ARCHIVE

Note
There is NO backup of this file system.
"
datasets,https://docs.mila.quebec/Information.html#datasets,"datasets
datasets contains curated datasets for the benefit of the Mila community.
To request the addition of a dataset or a preprocessed dataset you think could
benefit the research of others, you can fill in this form. Datasets can also be browsed from the
web: Mila Datasets.
Datasets in datasets/restricted are restricted and require an explicit
request to gain access. Please submit a support ticket mentioning the dataset's
access group (e.g. scannet_users), your cluster username and the
approval of the group owner. You can find the dataset's access group by
listing the contents of /network/datasets/restricted with the ls command.
Those datasets are mirrored to the Alliance clusters in
~/projects/rrg-bengioy-ad/data/curated/ if they follow the Digital Research
Alliance of Canada's good practices on data.
To list the local datasets on an Alliance cluster, you can execute the
following command:
ssh [CLUSTER_LOGIN] -C ""projects/rrg-bengioy-ad/data/curated/list_datasets_cc.sh""
"
Data Transmission,https://docs.mila.quebec/Information.html#data-transmission,"Data Transmission
Multiple methods can be used to transfer data to/from the cluster:

rsync --bwlimit=10mb: this is the favored method, since the bandwidth can
be limited to prevent impacting the usage of the cluster for others.
Digital Research Alliance of Canada clusters: Globus.

An example rsync invocation is shown below.
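For instance, a sketch of copying a local directory to your scratch space
with a bandwidth cap. The paths are placeholders, and the mila host alias
comes from the SSH Config section further below:

# ON YOUR LOCAL MACHINE
rsync -av --bwlimit=10mb ./my_dataset/ mila:/network/scratch/<u>/<username>/my_dataset/

The trailing slashes make rsync copy the directory contents rather than
nesting the directory, and rerunning the same command resumes an interrupted
transfer, only sending the files that changed.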
-" -Quick Start,https://docs.mila.quebec/Userguide.html#quick-start,"Quick Start -Users first need login access to the cluster. It is -recommended to install milatools which will help in the set up of the -ssh configuration needed to securely and easily connect to the -cluster. -" -mila code,https://docs.mila.quebec/Userguide.html#mila-code,"mila code -milatools also makes it easy to run and debug code on the Mila cluster. Using -the mila code command will allow you to use VSCode on the server. Simply run: -mila code path/on/cluster - - -The details of the command can be found on the github page of the package. Note that you need to -first setup your ssh configuration using mila init before the mila code -command can be used. The initialisation of the ssh configuration is explained -here and on the github page of the package. -" -Logging in to the cluster,https://docs.mila.quebec/Userguide.html#logging-in-to-the-cluster,"Logging in to the cluster -To access the Mila Cluster clusters, you will need a Mila account. Please contact -Mila systems administrators if you don’t have it already. Our IT support service -is available here: https://it-support.mila.quebec/ -You will also need to complete and return an IT Onboarding Training to get -access to the cluster. Please refer to the Mila Intranet for more -informations: -https://sites.google.com/mila.quebec/mila-intranet/it-infrastructure/it-onboarding-training -IMPORTANT : Your access to the Cluster is granted based on your status at -Mila (for students, your status is the same as your main supervisor’ status), -and on the duration of your stay, set during the creation of your account. The -following have access to the cluster : Current Students of Core Professors - -Core Professors - Staff -" -SSH Login,https://docs.mila.quebec/Userguide.html#ssh-login,"SSH Login -You can access the Mila cluster via ssh: -# Generic login, will send you to one of the 4 login nodes to spread the load -ssh @login.server.mila.quebec -p 2222 - -# To connect to a specific login node, X in [1, 2, 3, 4] -ssh @login-X.login.server.mila.quebec -p 2222 -Four login nodes are available and accessible behind a load balancer. At each -connection, you will be redirected to the least loaded login-node. -The ECDSA, RSA and ED25519 fingerprints for Mila’s login nodes are: -SHA256:baEGIa311fhnxBWsIZJ/zYhq2WfCttwyHRKzAb8zlp8 (ECDSA) -SHA256:Xr0/JqV/+5DNguPfiN5hb8rSG+nBAcfVCJoSyrR0W0o (RSA) -SHA256:gfXZzaPiaYHcrPqzHvBi6v+BWRS/lXOS/zAjOKeoBJg (ED25519) - - - -Important -Login nodes are merely entry points to the cluster. They give you access -to the compute nodes and to the filesystem, but they are not meant to run -anything heavy. Do not run compute-heavy programs on these nodes, -because in doing so you could bring them down, impeding cluster access for -everyone. -This means no training or experiments, no compiling programs, no Python -scripts, but also no zip of a large folder or anything that demands a -sustained amount of computation. -Rule of thumb: never run a program that takes more than a few seconds on -a login node. - -Note -In a similar vein, you should not run VSCode remote SSH instances directly -on login nodes, because even though they are typically not very -computationally expensive, when many people do it, they add up! See -Visual Studio Code for specific instructions. 
"
mila init,https://docs.mila.quebec/Userguide.html#mila-init,"mila init
To make it easier to set up a productive environment, Mila publishes the
milatools package, which defines a mila init command that will
automatically perform some of the steps below for you. You can install it with
pip and use it, provided your Python version is at least 3.8:
$ pip install milatools
$ mila init
"
SSH Config,https://docs.mila.quebec/Userguide.html#ssh-config,"SSH Config
The login nodes support the following authentication mechanisms:
publickey,keyboard-interactive. If you would like to set an entry in your
.ssh/config file, please use the following recommendation:
Host mila
    User YOUR-USERNAME
    Hostname login.server.mila.quebec
    PreferredAuthentications publickey,keyboard-interactive
    Port 2222
    ServerAliveInterval 120
    ServerAliveCountMax 5

Then you can simply write ssh mila to connect to a login node. You will also
be able to use mila with scp, rsync and other such programs.

Tip
You can run commands on a login node with ssh directly, for example
ssh mila squeue -u '$USER' (remember to put single quotes around any
$VARIABLE you want to evaluate on the remote side, otherwise it will be
evaluated locally before ssh is even executed).
"
Passwordless login,https://docs.mila.quebec/Userguide.html#passwordless-login,"Passwordless login
To save you some repetitive typing, it is highly recommended to set up public
key authentication, which means you won't have to enter your password every time
you connect to the cluster.
# ON YOUR LOCAL MACHINE
# You might already have done this in the past, but if you haven't:
ssh-keygen  # Press ENTER 3x

# Copy your public key over to the cluster
# You will need to enter your password
ssh-copy-id mila
"
Connecting to compute nodes,https://docs.mila.quebec/Userguide.html#connecting-to-compute-nodes,"Connecting to compute nodes
If (and only if) you have a job running on compute node “cnode”, you are
allowed to SSH to it directly, if for some reason you need a second terminal.
That session will automatically end when your job is relinquished.
First, however, you need to have passwordless
SSH set up, either with a key present in your home directory or with an
ssh-agent. To generate a key pair on the login node:
# ON A LOGIN NODE
ssh-keygen  # Press ENTER 3x
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh

Then from the login node you can write ssh <node>. From your local
machine, you can use ssh -J mila USERNAME@<node> (-J represents a “jump”
through the login node, necessary because the compute nodes are behind a
firewall).
If you wish, you may also add the following wildcard rule in your .ssh/config:
Host *.server.mila.quebec !*login.server.mila.quebec
    HostName %h
    User YOUR-USERNAME
    ProxyJump mila

This will let you connect to a compute node with ssh <node>.server.mila.quebec.
"
Running your code,https://docs.mila.quebec/Userguide.html#running-your-code,"Running your code
"
SLURM commands guide,https://docs.mila.quebec/Userguide.html#slurm-commands-guide,"SLURM commands guide
"
Basic Usage,https://docs.mila.quebec/Userguide.html#basic-usage,"Basic Usage
The SLURM documentation
provides extensive information on the available commands to query the cluster
status or submit jobs.
Below are some basic examples of how to use SLURM.
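To start with, a few everyday status commands (all standard SLURM; the job ID
4323674 is just an illustrative value):

# List your own queued and running jobs
squeue -u $USER

# Show the partitions and the state of the nodes
sinfo

# Show detailed information about one job
scontrol show job 4323674

# Cancel a job
scancel 4323674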
-" -Submitting jobs,https://docs.mila.quebec/Userguide.html#submitting-jobs,"Submitting jobs -" -Batch job,https://docs.mila.quebec/Userguide.html#batch-job,"Batch job -In order to submit a batch job, you have to create a script containing the main -command(s) you would like to execute on the allocated resources/nodes. - 1#!/bin/bash - 2#SBATCH --job-name=test - 3#SBATCH --output=job_output.txt - 4#SBATCH --error=job_error.txt - 5#SBATCH --ntasks=1 - 6#SBATCH --time=10:00 - 7#SBATCH --mem=100Gb - 8 - 9module load python/3.5 -10python my_script.py - - -Your job script is then submitted to SLURM with sbatch (ref.) -sbatch job_script -sbatch: Submitted batch job 4323674 -The working directory of the job will be the one where your executed sbatch. - -Tip -Slurm directives can be specified on the command line alongside sbatch or -inside the job script with a line starting with #SBATCH. - -" -Interactive job,https://docs.mila.quebec/Userguide.html#interactive-job,"Interactive job -Workload managers usually run batch jobs to avoid having to watch its -progression and let the scheduler run it as soon as resources are available. If -you want to get access to a shell while leveraging cluster resources, you can -submit an interactive jobs where the main executable is a shell with the -srun/salloc (srun/salloc) commands -salloc -Will start an interactive job on the first node available with the default -resources set in SLURM (1 task/1 CPU). srun accepts the same arguments as -sbatch with the exception that the environment is not passed. - -Tip -To pass your current environment to an interactive job, add ---preserve-env to srun. - -salloc can also be used and is mostly a wrapper around srun if provided -without more info but it gives more flexibility if for example you want to get -an allocation on multiple nodes. -" -Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments -In order to accurately select the resources for your job, several arguments are -available. The most important ones are: -| Argument | Description | -|----------------------------|---------------------------------------------------------------------------| -| -n, –ntasks= | The number of task in your script, usually =1 | -| -c, –cpus-per-task= | The number of cores for each task | -| -t, –time=