diff --git "a/buster/data/documents.csv" "b/buster/data/documents.csv" deleted file mode 100644--- "a/buster/data/documents.csv" +++ /dev/null @@ -1,3222 +0,0 @@ -name,url,text -AI tooling and methodology handbook,https://docs.mila.quebec/Handbook.html#ai-tooling-and-methodology-handbook,"AI tooling and methodology handbook -This section seeks to provide researchers with insightful articles pertaining to -aspects of methodology in their work. -" -What is a computer cluster?,https://docs.mila.quebec/Theory_cluster.html#what-is-a-computer-cluster,"What is a computer cluster? -A computer cluster is a set -of loosely or tightly connected computers that work together so that, in many -respects, they can be viewed as a single system. -" -Parts of a computing cluster,https://docs.mila.quebec/Theory_cluster.html#parts-of-a-computing-cluster,"Parts of a computing cluster -To provide high performance computation capabilities, clusters can -combine hundreds to thousands of computers, called nodes, which are all -inter-connected with a high-performance communication network. Most nodes are -designed for high-performance computations, but clusters can also use -specialized nodes to offer parallel file systems, databases, login nodes and -even the cluster scheduling functionality as pictured in the image below. - -We will overview the different types of nodes which you can encounter on a -typical cluster. -" -The login nodes,https://docs.mila.quebec/Theory_cluster.html#the-login-nodes,"The login nodes -To execute computing processes on a cluster, you must first connect to a -cluster and this is accomplished through a login node. These so-called -login nodes are the entry point to most clusters. -Another entry point to some clusters such as the Mila cluster is the JupyterHub -web interface, but we’ll read about that later. For now let’s return to the -subject of this section; Login nodes. To connect to these, you would typically -use a remote shell connection. The most usual tool to do so is SSH. You’ll hear -and read a lot about this tool. Imagine it as a very long (and somewhat -magical) extension cord which connects the computer you are using now, such as -your laptop, to a remote computer’s terminal shell. You might already know what -a terminal shell is if you ever used the command line. -" -The compute nodes,https://docs.mila.quebec/Theory_cluster.html#the-compute-nodes,"The compute nodes -In the field of artificial intelligence, you will usually be on the hunt for -GPUs. In most clusters, the compute nodes are the ones with GPU capacity. -While there is a general paradigm to tend towards a homogeneous configuration -for nodes, this is not always possible in the field of artificial intelligence -as the hardware evolve rapidly as is being complemented by new hardware and so -on. Hence, you will often read about computational node classes. Some of which -might have different GPU models or even no GPU at all. For the Mila cluster you -will find this information in the Node profile description section. For -now, you should note that is important to keep in mind that you should be aware -of which nodes your code is running on. More on that later. -" -The storage nodes,https://docs.mila.quebec/Theory_cluster.html#the-storage-nodes,"The storage nodes -Some computers on a cluster function to only store and serve files. While the -name of these computers might matter to some, as a user, you’ll only be -concerned about the path to the data. More on that in the Processing data section. 
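-As a first concrete taste of the SSH connection described above, this is what
-plugging in that “extension cord” looks like in practice; the host and port
-below are those of the Mila login nodes covered later in this document, and
-<username> is a placeholder:
-# Open a remote shell on a cluster login node
-ssh <username>@login.server.mila.quebec -p 2222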
-" -Different nodes for different uses,https://docs.mila.quebec/Theory_cluster.html#different-nodes-for-different-uses,"Different nodes for different uses -It is important to note here the difference in intended uses between the -compute nodes and the login nodes. While the compute nodes are meant for heavy -computation, the login nodes are not. -The login nodes however are used by everyone who uses the cluster and care must -be taken not to overburden these nodes. Consequently, only very short and light -processes should be run on these otherwise the cluster may become inaccessible. -In other words, please refrain from executing long or compute intensive -processes on login nodes because it affects all other users. In some cases, you -will also find that doing so might get you into trouble. -" -UNIX,https://docs.mila.quebec/Theory_cluster.html#unix,"UNIX -All clusters typically run on GNU/Linux distributions. Hence a minimum -knowledge of GNU/Linux and BASH is usually required to use them. See the -following tutorial -for a rough guide on getting started with Linux. -" -The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"The workload manager -On a cluster, users don’t have direct access to the compute nodes but -instead connect to a login node and add jobs to the workload manager -queue. Whenever there are resources available to execute these jobs -they will be allocated to a compute node and run, which can be -immediately or after a wait of up to several days. -A job is comprised of a number of steps that will run one after the -other. This is done so that you can schedule a sequence of processes -that can use the results of the previous steps without having to -manually interact with the scheduler. -Each step can have any number of tasks which are groups of processes -that can be scheduled independently on the cluster but can run in -parallel if there are resources available. The distinction between -steps and tasks is that multiple tasks, if they are part of the same -step, cannot depend on results of other tasks because there are no -guarantees on the order in which they will be executed. -Finally each process group is the basic unit that is scheduled in the -cluster. It comprises of a set of processes (or threads) that can run -on a number of resources (CPU, GPU, RAM, …) and are scheduled -together as a unit on one or more machines. -Each of these concepts lends itself to a particular use. For multi-gpu -training in AI workloads you would use one task per GPU for data -paralellism or one process group if you are doing model -parallelism. Hyperparameter optimisation can be done using a -combination of tasks and steps but is probably better left to a -framework outside of the scope of the workload manager. -If this all seems complicated, you should know that all these things -do not need to always be used. It is perfectly acceptable to sumbit -jobs with a single step, a single task and a single process. -The available resource" -The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"s on the cluster are not infinite and it is the -workload manager’s job to allocate them. Whenever a job request comes -in and there are not enough resources available to start it -immediately, it will go in the queue. -Once a job is in the queue, it will stay there until another job -finishes and then the workload manager will try to use the newly freed -resources with jobs from the queue. 
The exact order in which the jobs
-will start is not fixed, because it depends on the local policies
-which can take into account the user priority, the time since the job
-was requested, the amount of resources requested and possibly other
-things. There should be a tool that comes with the manager where you
-can see the status of your queued jobs and why they remain in the
-queue.
-The workload manager will divide the cluster into partitions according
-to the configuration set by the admins. A partition is a set of
-machines typically reserved for a particular purpose. An example might
-be CPU-only machines for preprocessing, set up as a separate partition.
-It is possible for multiple partitions to share resources.
-There will always be at least one partition that is the default
-partition in which jobs without a specific request will go. Other
-partitions can be requested, but might be restricted to a group of
-users, depending on policy.
-Partitions are useful from a policy standpoint to ensure efficient use
-of the cluster resources and to avoid exhausting one resource type
-while blocking use of another. They are also useful for heterogeneous
-clusters where different hardware is mixed in and not all software is
-compatible with all of it (for example x86 and POWER CPUs).
-To ensure a fair share of the computing resources for all, the workload
-manager establishes limits on the amount of resources that a single
-user can us"
-The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"e at once. These can be hard limits, which prevent running
-jobs when you go over, or soft limits, which will let you run jobs
-only until some other job needs the resources.
-Admin policy will determine what those exact limits are for a
-particular cluster or user and whether they are hard or soft limits.
-The way soft limits are enforced is using preemption, which means that
-when another job with higher priority needs the resources that your
-job is using, your job will receive a signal that it needs to save its
-state and exit. It will be given a certain amount of time to do this
-(the grace period, which may be 0s) and then forcefully terminated if
-it is still running.
-Depending on the workload manager in use and the cluster configuration,
-a job that is preempted like this may be automatically rescheduled to
-have a chance to finish or it may be up to the job to reschedule
-itself.
-The other limit you can encounter applies to a job that goes over its
-declared limits. When you schedule a job, you declare how many
-resources it will need (RAM, CPUs, GPUs, …). Some of those may have
-default values and not be explicitly defined. For certain types of
-devices, like GPUs, access to units over your job limit is made
-unavailable. For others, like RAM, usage is monitored and your job
-will be terminated if it goes too far over. This makes it important
-to ensure you estimate resource usage accurately.
-Mila as well as the Digital Research Alliance of Canada use the workload
-manager Slurm to schedule and
-allocate resources on their infrastructure.
-Slurm client commands are available on the login nodes for you to submit
-jobs to the main controller and add your job to the queue. Jobs are of 2 types:
-batch jobs and interactive jobs.
-For practical examples of Slurm commands on the Mila cluster, see Running your code."
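-As a brief illustration of the two job types just mentioned, here is what each
-looks like with the Slurm client commands; the resource values and script name
-are placeholder assumptions, and each cluster documents its own expected flags:
-# Batch job: queue a script and return to your shell immediately
-sbatch --time=1:00:00 --cpus-per-task=4 --mem=16G job.sh
-
-# Inspect your queued and running jobs (and why they are still waiting)
-squeue -u $USER
-
-# Interactive job: get a shell on a compute node once resources are allocated
-salloc --time=1:00:00 --cpus-per-task=4 --mem=16G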
-Processing data,https://docs.mila.quebec/Theory_cluster.html#processing-data,"Processing data
-For processing the large amounts of data common in deep learning, either
-for dataset preprocessing or training, several techniques exist. Each
-has typical uses and limitations.
-"
-Data parallelism,https://docs.mila.quebec/Theory_cluster.html#data-parallelism,"Data parallelism
-The first technique is called data parallelism (aka task
-parallelism in formal computer science). You simply run lots of
-processes, each handling a portion of the data you want to
-process. This is by far the easiest technique to use and should be
-favored whenever possible. A common example of this is
-hyperparameter optimisation.
-For really small computations the time to set up multiple processes
-might be longer than the processing time itself and lead to waste. This can
-be addressed by bunching some of the processes together and doing
-sequential processing of sub-partitions of the data.
-On cluster systems it is also inadvisable to launch thousands of
-jobs: even if each job runs for a reasonable amount of time
-(several minutes at minimum), it is best to make larger groups
-until the number of jobs is in the low hundreds at most.
-Finally, another thing to keep in mind is that the transfer bandwidth
-is limited between the filesystems (see Filesystem concerns)
-and the compute nodes, and if you run too many jobs using too much data
-at once they may end up being no faster because they will spend
-their time waiting for data to arrive.
-"
-Model parallelism,https://docs.mila.quebec/Theory_cluster.html#model-parallelism,"Model parallelism
-The second technique is called model parallelism (which doesn’t
-have a single equivalent in formal computer science). It is used
-mostly when a single instance of a model will not fit in a computing
-resource (such as the GPU memory being too small for all the
-parameters).
-In this case, the model is split into its constituent parts, each
-processed independently, with their intermediate results communicated
-to each other to arrive at a final result.
-This is generally harder but necessary to work with larger, more
-powerful models like GPT.
-"
-Communication concerns,https://docs.mila.quebec/Theory_cluster.html#communication-concerns,"Communication concerns
-The main difference between these two approaches is the need for
-communication between the multiple processes. Some common training
-methods, like stochastic gradient descent, sit somewhere between the
-two, because they require some communication, but not a lot. Most
-people classify it as data parallelism since it sits closer to that
-end.
-In general, for data parallelism tasks or tasks that communicate
-infrequently, it doesn’t make a lot of difference where the processes
-sit because the communication bandwidth and latency will not have a
-lot of impact on the time it takes to complete the job. The
-individual tasks can generally be scheduled independently.
-On the contrary, for model parallelism you need to pay more attention
-to where your tasks are. In this case it is usually required to use
-the facilities of the workload manager to group the tasks so that they
-are on the same machine or machines that are closely linked to ensure
-optimal communication.
What is the best allocation depends on the
-specific cluster architecture available and the technologies it
-supports (such as InfiniBand,
-RDMA,
-NVLink or others).
-"
-Filesystem concerns,https://docs.mila.quebec/Theory_cluster.html#filesystem-concerns,"Filesystem concerns
-When working on a cluster, you will generally encounter several
-different filesystems. Usually there will be names such as ‘home’,
-‘scratch’, ‘datasets’, ‘projects’, ‘tmp’.
-The reason for having different filesystems available instead of a
-single giant one is to provide for different use cases. For example, the
-‘datasets’ filesystem would be optimized for fast reads but have
-slow write performance. This is because datasets are usually written
-once and then read very often for training.
-Different filesystems have different performance levels. For instance, backed
-up filesystems (such as $PROJECT in Digital Research Alliance of Canada
-clusters) provide more space and can handle large files but cannot sustain
-the highly parallel accesses typically required for high-speed model training.
-The set of filesystems provided by the cluster you are using should be
-detailed in the documentation for that cluster and the names can
-differ from those above. You should pay attention to their recommended
-use case in the documentation and use the appropriate filesystem for
-the appropriate job. There are cases where a job ran hundreds of times
-slower because it tried to use a filesystem that wasn’t a good fit for
-the job.
-One last thing to pay attention to is the data retention policy for
-the filesystems. This has two subpoints: how long the data is kept,
-and whether there are backups.
-Some filesystems will have a limit on how long they keep their
-files. Typically the limit is some number of days (like 90 days) but
-can also be ‘as long as the job runs’ for some.
-As for backups, some filesystems will not have a limit on data retention, but
-will also not have backups. For those it is important to maintain a
-copy of any crucial data somewhere else. The data will not be
-purposefully deleted, but the filesystem may fail and lose all or part
-of its data. If you have any data that is crucial for a paper or your
-thesis, keep an additional copy of it somewhere else.
-"
-Software on the cluster,https://docs.mila.quebec/Theory_cluster.html#software-on-the-cluster,"Software on the cluster
-This section aims to raise awareness of problems one can encounter when trying
-to run software on different computers, and how this is dealt with on typical
-computation clusters.
-The Mila cluster and the Digital Research Alliance of Canada clusters both
-provide various useful software and computing environments, which can be
-activated through the module system. Alternatively, you may build containers
-with your desired software and run them on compute nodes.
-Regarding Python development, we recommend using virtual environments to install
-Python packages in isolation.
-"
-Cluster software modules,https://docs.mila.quebec/Theory_cluster.html#cluster-software-modules,"Cluster software modules
-Modules are small files which modify your environment variables to point to
-specific versions of various software and libraries. For instance, a module
-might provide the python command to point to Python 3.7, another might
-activate CUDA version 11.0, another might provide the torch package, and so
-on.
-For more information, see The module command.
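-To get a feel for the module system before following that link, here is a
-sketch of typical commands; the module names and versions are examples, and
-what is actually available differs per cluster:
-module avail              # list software made available through modules
-module load python/3.7    # point the python command at this version
-module spider cuda        # show details and available versions of a module
-module list               # show what is currently loaded in your shell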
-" -Containers,https://docs.mila.quebec/Theory_cluster.html#containers,"Containers -Containers are a special form of isolation of software and its dependencies. A -container is essentially a lightweight virtual machine: it encapsulates a -virtual file system for a full OS installation, as well as a separate network -and execution environment. -For example, you can create an Ubuntu container in which you install various -packages using apt, modify settings as you would as a root user, and so on, -but without interfering with your main installation. Once built, a container can -be run on any compatible system. -For more information, see Using containers on clusters. -" -Python Virtual environments,https://docs.mila.quebec/Theory_cluster.html#python-virtual-environments,"Python Virtual environments -A virtual environment in Python is a local, isolated environment in which you -can install or uninstall Python packages without interfering with the global -environment (or other virtual environments). In order to use a virtual -environment, you first have to activate it. -For more information, see Virtual environments. -" -"Who, what, where is IDT",https://docs.mila.quebec/IDT.html#who-what-where-is-idt,"Who, what, where is IDT -This section seeks to help Mila researchers understand the mission and role of -the IDT team. -" -IDT’s mission,https://docs.mila.quebec/IDT.html#idt-s-mission,"IDT’s mission - -" -The IDT team,https://docs.mila.quebec/IDT.html#the-idt-team,"The IDT team -See https://mila.quebec/en/mila/team/?cat_id=143 -" -Purpose of this documentation,https://docs.mila.quebec/Purpose.html#purpose-of-this-documentation,"Purpose of this documentation -This documentation aims to cover the information required to run scientific -and data-intensive computing tasks at Mila and the available resources for its -members. -It also aims to be an outlet for sharing know-how, tips and tricks and examples -from the IDT team to the Mila researcher community. -" -Intended audience,https://docs.mila.quebec/Purpose.html#intended-audience,"Intended audience -This documentation is mainly intended for Mila researchers having access to the -Mila cluster. This access is determined by your researcher status. See -Roles and authorizations for more information. The core of the -information with this purpose can be found in the following section: -Computing infrastructure and policies. -However, we also aim to provide more general information which can be useful -outside the scope of using the Mila cluster. For instance, more general theory -on computational considerations and such. In this perspective, we hope the -documentation can be of use for all of Mila members. -" -Contributing,https://docs.mila.quebec/Purpose.html#contributing,"Contributing -See the following file for contribution guidelines : -# Contributing to the Mila Docs - -Thank you for your interest into making a better documentation for all at Mila. - -Here are some guidelines to help bring your contributions to life. 
-
-## What should be included in the Mila Docs
-
-* Mila cluster usage
-* Digital Research Alliance of Canada cluster usage
-* Job management tips / tricks
-* Research good practices
-* Software development good practices
-* Useful tools
-
-**_NOTE_**: Examples should aim to not consume much more than 1 GPU-hour and 2 CPU-hours
-
-## Issues / Pull Requests
-
-### Issues
-
-Issues can be used to report any error in the documentation, missing or unclear
-sections, broken tools or other suggestions to improve the overall
-documentation.
-
-### Pull Requests
-
-PRs are welcome and we value the contents of contributions over the appearance
-or functionality of the pull request. If you don't know how to write the proper
-markup in reStructuredText, simply provide the content you would like to add in
-the PR text form (which supports markdown) or with instructions on how to format
-the content. In the PR, reference the related issues like this:
-
-```
-Resolves: #123
-See also: #456, #789
-```
-
-If you would like to contribute directly to the source of the documentation, keep
-the line width to 80 characters or less. You can attempt to build the docs
-yourself to see if the formatting is right:
-
-```console
-python3 -m pip install -r docs/requirements.txt
-sphinx-build -b html docs/ docs/_build/
-```
-
-This will produce the html version of the documentation which you can navigate
-by opening the local file `docs/_build/index.html`.
-
-If you have any trouble building the docs, don't hesitate to open an issue to
-request help.
-
-Regarding the reStructuredText format"
-Contributing,https://docs.mila.quebec/Purpose.html#contributing,", you can simply provide the content
-you would like to add in markdown or plain text format if that is more convenient
-for you, and someone down the line will take responsibility for converting
-the format.
-
-## Sphinx / reStructuredText (reST)
-
-The markup language used for the Mila Docs is
-[reStructuredText](http://docutils.sourceforge.net/rst.html) and we follow the
-[Python’s Style Guide for documenting](https://docs.python.org/devguide/documenting.html#style-guide).
-
-Here are some reST syntax directives which are useful to know
-(more can be found in
-[Sphinx's reST Primer](https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html)):
-
-
-### Inline markup
-
-* one asterisk: `*text*` for *emphasis* (italics),
-* two asterisks: `**text**` for **strong emphasis** (boldface),
-* backquotes: ` ``text`` ` for `code samples`, and
-* external links: `` `Link text <URL>`_ ``.
-
-### Lists
-
-```reST
-* this is
-* a list
-
-  * with a nested list
-  * and some subitems
-
-* and here the parent list continues
-```
-
-### Sections
-
-```reST
-#################
-This is a heading
-#################
-```
-
-There are no heading levels assigned to certain characters as the structure is
-determined from the succession of headings. However, the Python documentation
-suggests the following convention:
-
-  * `#` with overline, for parts
-  * `*` with overline, for chapters
-  * `=`, for sections
-  * `-`, for subsections
-  * `^`, for subsubsections
-  * `""`, for paragraphs
-
-### Note box
-
-```reST
-.. note:: This is a long
-   long long note
-```
-
-### Collapsible boxes
-
-This is a local extension, not part of Sphinx itself. It works like this:
-
-```reST
-.. container:: toggle
-
-   .. container:: header
-
-      **Show/Hide Code**
-
-   .. code-block::
-      ...
-``` - - -" -Visual Studio Code,https://docs.mila.quebec/VSCode.html#visual-studio-code,"Visual Studio Code -One editor of choice for many researchers is VSCode. One feature of VSCode is -remote editing through SSH. This allows you to edit files on the cluster as if -they were local. You can also debug your programs using VSCode’s debugger, open -terminal sessions, etc. -" -Connecting to the cluster,https://docs.mila.quebec/VSCode.html#connecting-to-the-cluster,"Connecting to the cluster -VSCode cannot be used to edit code on the login nodes, because it is a heavy -enough process (a node process, plus the language server, linter, and -possibly other plugins depending on your configured environment) that there is a -risk of overloading the login nodes if too many researchers did it at the same -time. -Therefore, to use VSCode on the cluster, you first need to allocate a compute -node, then connect to that node. -The milatools package provides a command to make the operation easier. More -info can be found here. -" -Activating an environment,https://docs.mila.quebec/VSCode.html#activating-an-environment,"Activating an environment -Reference -To activate a conda or pip environment, you can open the command palette with -Ctrl+Shift+P and type “Python: Select interpreter”. This will prompt you for the -path to the Python executable for your environment. - -Tip -If you already have the environment activated in a terminal session, you can -run the command which python to get the path for this environment. This -path can be pasted into the interpreter selection prompt in VSCode to use -that same environment. - -" -Troubleshooting,https://docs.mila.quebec/VSCode.html#troubleshooting,"Troubleshooting -" -“Cannot reconnect”,https://docs.mila.quebec/VSCode.html#cannot-reconnect,"“Cannot reconnect” -When connecting to multiple compute nodes (and/or from multiple computers), some -instances may crash with that message because of conflicts in the lock files -VSCode installs in ~/.vscode-server (which is shared on all compute nodes). -To fix this issue, you can change this setting in your settings.json file: -{ ""remote.SSH.lockfilesInTmp"": true } - - -This will store the necessary lockfiles in /tmp on the compute nodes (which -are local to the node). -" -Debugger timeouts,https://docs.mila.quebec/VSCode.html#debugger-timeouts,"Debugger timeouts -Sometimes, slowness on the compute node or the networked filesystem might cause -the VSCode debugger to timeout when starting a remote debug process. As a quick -fix, you can add this to your ~/.bashrc or ~/.profile or equivalent -resource file for your preferred shell, to increase the timeout delay to 500 -seconds: -export DEBUGPY_PROCESS_SPAWN_TIMEOUT=500 - - -" -Computational resources outside of Mila,https://docs.mila.quebec/Extra_compute.html#computational-resources-outside-of-mila,"Computational resources outside of Mila -This section seeks to provide insights and information on computational -resources outside the Mila cluster itself. -" -Digital Research Alliance of Canada Clusters,https://docs.mila.quebec/Extra_compute.html#digital-research-alliance-of-canada-clusters,"Digital Research Alliance of Canada Clusters -The clusters named Beluga, Cedar, Graham, Narval and Niagara are -clusters provided by the Digital Research Alliance of Canada organisation (the Alliance). For Mila researchers, these -clusters are to be used for larger experiments having many jobs, multi-node -computation and/or multi-GPU jobs as well as long running jobs. 
If you use
-these resources for your research, please remember to acknowledge their use in
-your papers.
-
-Note
-Compute Canada ceased its operational responsibilities for supporting Canada’s
-national advanced research computing (ARC) platform on March 31, 2022. The services
-will be supported by the new Digital Research Alliance of Canada.
-https://ace-net.ca/compute-canada-operations-move-to-the-digital-research-alliance-of-canada-(the-alliance).html
-
-"
-Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"Current allocation description
-Clusters of the Alliance are shared with researchers across the country.
-Allocations are given by the Alliance to selected research groups to ensure
-a minimal amount of computational resources throughout the year.
-Depending on your affiliation, you will have access to different allocations. If
-you are a student at University of Montreal, you can have access to the
-rrg-bengioy-ad allocation described below. For students from other
-universities, you should ask your advisor to know which allocations you could
-have access to.
-From the Alliance’s documentation: An allocation is an amount of resources
-that a research group can target for use for a period of time, usually a year.
-To be clear, it is not a maximal amount of resources that can be used
-simultaneously; it is a weighting factor of the workload manager to balance
-jobs. For instance, even though we are allocated 400 GPU-years across all
-clusters, we can use more or less than 400 GPUs simultaneously depending on the
-history of usage from our group and other groups using the cluster at a given
-period of time. Please see the Alliance’s doc"
-Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"umentation for
-more information on how allocations and resource scheduling are configured for
-these installations.
-The table below provides information on the allocation for
-rrg-bengioy-ad for the period which spans from April 2022 to
-April 2023. Note that there are no special allocations for GPUs on
-Graham and therefore jobs with GPUs should be submitted with the
-account def-bengioy.
-| Cluster | CPUs (#) | CPU account    | GPU model | GPUs (#) | GPU SLURM type specifier | GPU account    |
-|---------|----------|----------------|-----------|----------|--------------------------|----------------|
-| Beluga  | 238      | rrg-bengioy-ad | V100-16G  | 77       | v100                     | rrg-bengioy-ad |
-| Cedar   | 34       | rrg-bengioy-ad | V100-32G  | 138      | v100l                    | rrg-bengioy-ad |
-| Graham  | 34       | rrg-bengioy-ad | various   | –        | –                        | def-bengioy    |
-| Narval  | 34       | rrg-bengioy-ad | A100-40G  | 185      | a100                     | rrg-bengioy-ad |
-"
-Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation
-To access the Alliance clusters you have to first create an account at
-https://ccdb.computecanada.ca. Use a password with at least 8 characters, mixed
-case letters, digits and special characters. Later you will be asked to create
-another password with those rules, and it’s really convenient if the two
-passwords are the same.
-Then, you have to apply for a role at
-https://ccdb.computecanada.ca/me/add_role, which basically means telling the
-Alliance that you are part of the lab so they know which clusters you can have
-access to, and track your usage.
-You will be asked for the CCRI. Please reach out to your
-sponsor to get the CCRI.
-
-You will need to wait for your sponsor to accept before being able to login
-to the Alliance clusters.
-"
-Clusters,https://docs.mila.quebec/Extra_compute.html#clusters,"Clusters
-
-Beluga: (Mila doc)
-(Digital Research Alliance of Canada doc)
-For most students, Beluga is the best choice for both CPU and GPU jobs because
-of larger allocations on this cluster.
-
-Narval: (Mila doc)
-(Digital Research Alliance of Canada doc)
-Narval is the newest cluster, and contains the most powerful GPUs (A100). If your
-job can benefit from the A100’s features, such as TF32 floating-point math, Narval
-is the best choice.
-
-Cedar: (Mila doc)
-(Digital Research Alliance of Canada doc)
-Cedar is a good alternative to Beluga if you absolutely need to have an internet connection
-on the compute nodes.
-
-Graham: (Mila doc)
-(Digital Research Alliance of Canada doc)
-We do not have a GPU allocation on Graham anymore but it remains an alternative for CPU jobs.
-
-Niagara: (Mila doc)
-(Digital Research Alliance of Canada doc)
-Niagara is not recommended for most students. It is a CPU-only cluster with unusual
-configurations. Access is not automatic; it is opt-in and must be requested via
-CCDB manually. Compute resources on Niagara are not assigned to jobs on a per-CPU
-basis, but on a per-node basis.
-
-
-"
-Beluga,https://docs.mila.quebec/Extra_compute.html#beluga,"Beluga
-Beluga is a cluster located at ÉTS in Montreal. It
-uses SLURM to schedule jobs. Its full documentation can be found here, and its current status
-here.
-You can access Beluga via ssh:
-ssh <username>@beluga.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-"
-Launching Jobs,https://docs.mila.quebec/Extra_compute.html#launching-jobs,"Launching Jobs
-Users must specify the resource allocation Group Name using the flag
---account=rrg-bengioy-ad. To launch a CPU-only job:
-sbatch --time=1:0:0 --account=rrg-bengioy-ad job.sh
-
-Note
-The account name will differ based on your affiliation.
-
-To launch a GPU job:
-sbatch --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 job.sh
-And to get an interactive session, use the salloc command:
-salloc --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1
-The full documentation for launching jobs on Beluga can be found here.
-"
-Beluga nodes description,https://docs.mila.quebec/Extra_compute.html#beluga-nodes-description,"Beluga nodes description
-Each GPU node consists of:
-
-40 CPU cores
-186 GB RAM
-4 NVIDIA V100 GPUs (16GB)
-
-
-Tip
-You should ask for at most 10 CPU cores and 32 GB of RAM per GPU you are
-requesting (as explained here),
-otherwise, your job will count for more than 1 allocation, and will take
-more time to get scheduled.
-
-"
-Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage
-| Storage        | Path                     | Usage                                                         |
-|----------------|--------------------------|---------------------------------------------------------------|
-| $HOME          | /home/<username>/        | Code, specific libraries                                      |
-| $HOME/projects | /project/rpp-bengioy     | Compressed raw datasets                                       |
-| $SCRATCH       | /scratch/<username>      | Processed datasets, experimental results, logs of experiments |
-| $SLURM_TMPDIR  | (local disk of the node) | Temporary job results                                         |
-They are roughly listed in order of increasing performance and optimized for
-different uses:
-
-The $HOME folder on NFS is appropriate for code and libraries which are
-small and read once. Do not write experimental results here!
-The $HOME/projects folder should only contain compressed raw datasets
-(processed datasets should go in $SCRATCH).
We have a limit on the
-size and number of files in $HOME/projects, so do not put anything else
-there. If you add a new dataset there, make sure it is readable by every
-member of the group, using chgrp -R rpp-bengioy <dataset>.
-The $SCRATCH space can be used for short-term storage. It has good
-performance and large quotas, but is purged regularly (every file that has
-not been used in the last 3 months gets deleted, but you receive an email
-before this happens).
-$SLURM_TMPDIR points to the local disk of the node on which a job is
-running. It should be used to copy the data on the node at the beginning of
-the job and write intermediate checkpoints. This folder is cleared after each
-job.
-
-When an experiment is finished, results should be transferred back to Mila
-servers.
-More details on storage can be found here.
-"
-Modules,https://docs.mila.quebec/Extra_compute.html#modules,"Modules
-Much software, such as Python or MATLAB, is already compiled and available on
-Beluga through the module command and its subcommands. Its full
-documentation can be found here.
-| Command                 | Description                           |
-|-------------------------|---------------------------------------|
-| module avail            | Displays all the available modules    |
-| module load <module>    | Loads <module>                        |
-| module spider <module>  | Shows specific details about <module> |
-In particular, if you wish to use Python 3.6 you can simply do:
-module load python/3.6
-
-Tip
-If you wish to use Python on the cluster, we strongly encourage you to
-read Alliance Python Documentation, and in particular the Pytorch and/or Tensorflow pages.
-
-The cluster has many Python packages (or wheels) already compiled for
-the cluster. See here for the
-details. In particular, you can browse the packages by doing:
-avail_wheels <package_name>
-Such wheels can be installed using pip. Moreover, the most efficient way to use
-modules on the cluster is to build your environment inside your job.
-See the script example below.
-"
-Script Example,https://docs.mila.quebec/Extra_compute.html#script-example,"Script Example
-Here is an sbatch script that follows good practices on Beluga:
-#!/bin/bash
-#SBATCH --account=rrg-bengioy-ad         # Yoshua pays for your job
-#SBATCH --cpus-per-task=6                # Ask for 6 CPUs
-#SBATCH --gres=gpu:1                     # Ask for 1 GPU
-#SBATCH --mem=32G                        # Ask for 32 GB of RAM
-#SBATCH --time=3:00:00                   # The job will run for 3 hours
-#SBATCH -o /scratch/<user>/slurm-%j.out  # Write the log in $SCRATCH
-
-# 1. Create your environment locally
-module load python/3.6
-virtualenv --no-download $SLURM_TMPDIR/env
-source $SLURM_TMPDIR/env/bin/activate
-pip install --no-index torch torchvision
-
-# 2. Copy your dataset on the compute node
-# IMPORTANT: Your dataset must be compressed in one single file (zip, hdf5, ...)!!!
-cp $SCRATCH/<dataset.zip> $SLURM_TMPDIR
-
-# 3. Eventually unzip your dataset
-unzip $SLURM_TMPDIR/<dataset.zip> -d $SLURM_TMPDIR
-
-# 4. Launch your job, tell it to save the model in $SLURM_TMPDIR
-#    and look for the dataset into $SLURM_TMPDIR
-python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR
-
-# 5. Copy whatever you want to save on $SCRATCH
-cp $SLURM_TMPDIR/<to_save> $SCRATCH
-
-
-"
-Using CometML and Wandb,https://docs.mila.quebec/Extra_compute.html#using-cometml-and-wandb,"Using CometML and Wandb
-The compute nodes for Beluga don’t have access to the internet,
-but there is a special module that can be loaded in order to allow
-training scripts to access some specific servers, which includes
-the necessary servers for using CometML and Wandb (“Weights and Biases”).
-module load httpproxy
-More documentation about this can be found here.
-"
-Graham,https://docs.mila.quebec/Extra_compute.html#graham,"Graham
-Graham is a cluster located at the University of Waterloo. It uses SLURM to schedule
-jobs. Its full documentation can be found here, and its current status here.
-You can access Graham via ssh:
-ssh <username>@graham.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-Since its structure is similar to Beluga’s, please look at the Beluga
-documentation, as well as relevant parts of the Digital Research Alliance of
-Canada Documentation.
-
-Note
-For GPU jobs the resource allocation Group Name is the same as on Beluga,
-so you should use the flag --account=rrg-bengioy-ad for GPU jobs.
-
-"
-Cedar,https://docs.mila.quebec/Extra_compute.html#cedar,"Cedar
-Cedar is a cluster located at Simon Fraser University. It uses SLURM to schedule
-jobs. Its full documentation can be found here, and its current status here.
-You can access Cedar via ssh:
-ssh <username>@cedar.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-Since its structure is similar to Beluga’s, please look at the Beluga
-documentation, as well as relevant parts of the Digital Research Alliance of
-Canada Documentation.
-
-Note
-We don’t have any CPU priority on Cedar; in this case you can
-use --account=def-bengioy for CPU jobs. Thus, it might take some time before
-they start.
-
-"
-Niagara,https://docs.mila.quebec/Extra_compute.html#niagara,"Niagara
-Niagara is a cluster located at the University of Toronto. It uses SLURM to schedule
-jobs. Its full documentation can be found here, and its current status here.
-You can access Niagara via ssh:
-ssh <username>@niagara.computecanada.ca
-where <username> is the username you created previously (see Account Creation).
-Since its structure is similar to Beluga’s, please look at the Beluga
-documentation, as well as relevant parts of the Digital Research Alliance of
-Canada Documentation.
-"
-FAQ,https://docs.mila.quebec/Extra_compute.html#faq,"FAQ
-"
-What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?,https://docs.mila.quebec/Extra_compute.html#what-to-do-with-importerror-lib64-libm-so-6-version-glibc-2-23-not-found,"What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?
-The structure of the file system is different from that of a classical Linux
-distribution, so your code has trouble finding libraries. See how to install binary packages.
-"
-Disk quota exceeded error on /project file systems,https://docs.mila.quebec/Extra_compute.html#disk-quota-exceeded-error-on-project-file-systems,"Disk quota exceeded error on /project file systems
-You have files in /project with the wrong permissions. See how to change
-permissions.
-"
-Computing infrastructure and policies,https://docs.mila.quebec/Information.html#computing-infrastructure-and-policies,"Computing infrastructure and policies
-This section seeks to provide factual information and policies on the Mila cluster computing environments.
-"
-Roles and authorizations,https://docs.mila.quebec/Information.html#roles-and-authorizations,"Roles and authorizations
-There are mainly two types of researcher statuses at Mila:
-
-Core researchers
-Affiliated researchers
-
-This is determined by Mila policy. Core researchers have access to the Mila
-computing cluster. See your supervisor’s Mila status to know what your own
-status is.
-" -Overview of available computing resources at Mila,https://docs.mila.quebec/Information.html#overview-of-available-computing-resources-at-mila,"Overview of available computing resources at Mila -The Mila cluster is to be used for regular development and relatively small -number of jobs (< 5). It is a heterogeneous cluster. It uses -SLURM to schedule jobs. -" -Mila cluster versus Digital Research Alliance of Canada clusters,https://docs.mila.quebec/Information.html#mila-cluster-versus-digital-research-alliance-of-canada-clusters,"Mila cluster versus Digital Research Alliance of Canada clusters -There are a lot of commonalities between the Mila cluster and the clusters from -Digital Research Alliance of Canada (the Alliance). At the time being, the -Alliance clusters where we have a large allocation of resources are beluga, -cedar, graham and narval. We also have comparable computational resources -in the Mila cluster, with more to come. -The main distinguishing factor is that we have more control over our own -cluster than we have over the ones at the Alliance. Notably, also, the compute -nodes in the Mila cluster all have unrestricted access to the Internet, which -is not the case in general for the Alliance clusters (although cedar does -allow it). -At the current time of this writing (June 2021), Mila students are advised to -use a healthy diet of a mix of Mila and Alliance clusters. This is especially -true in times when your favorite cluster is oversubscribed, because you can -easily switch over to a different one if you are used to it. -" -Guarantees about one GPU as absolute minimum,https://docs.mila.quebec/Information.html#guarantees-about-one-gpu-as-absolute-minimum,"Guarantees about one GPU as absolute minimum -There are certain guarantees that the Mila cluster tries to honor when it comes -to giving at minimum one GPU per student, all the time, to be used in -interactive mode. This is strictly better than “one GPU per student on average” -because it’s a floor meaning that, at any time, you should be able to ask for -your GPU, right now, and get it (although it might take a minute for the -request to be processed by SLURM). -Interactive sessions are possible on the Alliance clusters, and there are -generally special rules that allow you to get resources more easily if you -request them for a very short duration (for testing code before queueing long -jobs). You do not get the same guarantee as on the Mila cluster, however. 
-" -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description -| ('Name', 'Name') | ('GPU', 'Model') | ('GPU', 'Mem') | ('GPU', '#') | ('CPUs', 'CPUs') | ('Sockets', 'Sockets') | ('Cores/Socket', 'Cores/Socket') | ('Threads/Core', 'Threads/Core') | ('Memory (GB)', 'Memory (GB)') | ('TmpDisk (TB)', 'TmpDisk (TB)') | ('Arch', 'Arch') | ('Slurm Features', 'GPU Arch and Memory') | -|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|------------------------------------|------------------------------------|----------------------------------|------------------------------------|--------------------------|---------------------------------------------| -| GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | GPU Compute Nodes | -| cn-a[001-011] | RTX8000 | 48 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | turing,48gb | -| cn-b[001-005] | V100 | 32 | 8 | 40 | 2 | 20 " -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | 1 | 384 | 3.6 | x86_64 | volta,nvlink,32gb | -| cn-c[001-040] | RTX8000 | 48 | 8 | 64 | 2 | 32 | 1 | 384 | 3 | x86_64 | turing,48gb | -| cn-g[001-026] | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 7 | x86_64 | ampere,nvlink,80gb | -| DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | DGX Systems | -| cn-d[001-002] | A100 | 40 | 8 | 128 | 2 | 64 | 1 | 1024 | 14 | x86_64 | ampere,nvlink,40gb " -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | -| cn-d[003-004] | A100 | 80 | 8 | 128 | 2 | 64 | 1 | 2048 | 28 | x86_64 | ampere,nvlink,80gb | -| cn-e[002-003] | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 512 | 7 | x86_64 | volta,32gb | -| CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | CPU Compute Nodes | -| cn-f[001-004] | nan | nan | nan | 32 | 1 | 32 | 1 | 256 | 10 | x86_64 | rome | -| cn-h[001-004] | nan | nan | nan | 64 | 2 | 32 " -Node profile description,https://docs.mila.quebec/Information.html#node-profile-description," | 1 | 768 | 7 | x86_64 | milan | -| Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | Legacy GPU Compute Nodes | -| kepler5 | V100 | 16 | 2 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | volta,16gb | -| TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | TITAN RTX | -| rtx[1,3-5,7] | titanrtx | 24 | 2 | 20 | 1 | 10 | 2 | 128 | 0.93 | x86_64 | turing,24gb | -" -Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers -" -DGX A100,https://docs.mila.quebec/Information.html#dgx-a100,"DGX A100 -DGX A100 nodes are NVIDIA appliances with 8 NVIDIA A100 Tensor Core GPUs. 
Each
-GPU has 40 GB of memory, for a total of 320 GB per appliance. The GPUs are
-interconnected via 6 NVSwitches which allow 4.8 TB/s of bi-directional bandwidth.
-In order to run jobs on a DGX A100, add the flags below to your Slurm
-commands:
---gres=gpu:a100:<number> --reservation=DGXA100
-
-
-"
-MIG,https://docs.mila.quebec/Information.html#mig,"MIG
-MIG (Multi-Instance GPU)
-is an NVIDIA technology allowing certain GPUs to be
-partitioned into multiple instances, each of which has a roughly proportional
-amount of compute resources, device memory and bandwidth to that memory.
-NVIDIA supports MIG on its A100 GPUs and allows slicing the A100 into up to 7
-instances. Although this can theoretically be done dynamically, the SLURM job
-scheduler does not support doing so in practice as it does not model
-reconfigurable resources very well. Therefore, the A100s must currently be
-statically partitioned into the required number of instances of every size
-expected to be used.
-The cn-g series of nodes include A100-80GB GPUs. One third have been
-configured to offer regular (non-MIG mode) a100l GPUs. The other two-thirds
-have been configured in MIG mode, and offer the following profiles:
-| Name                  | GPU Model | GPU Memory   | GPU Compute   | # (Cluster-wide) |
-|-----------------------|-----------|--------------|---------------|------------------|
-| a100l.1g.10gb a100l.1 | A100      | 10GB (1/8th) | 1/7th of full | 72               |
-| a100l.2g.20gb a100l.2 | A100      | 20GB (2/8th) | 2/7th of full | 108              |
-| a100l.3g.40gb a100l.3 | A100      | 40GB (4/8th) | 3/7th of full | 72               |
-These can be requested using a SLURM flag such as --gres=gpu:a100l.1
-The partitioning may be revised as needs and SLURM capabilities evolve. Other
-MIG profiles exist and could be introduced.
-
-Warning
-MIG has a number of important limitations,
-most notably that a GPU in MIG mode does not support graphics APIs
-(OpenGL/Vulkan), nor P2P over NVLink and PCIe. We have therefore chosen to
-limit every MIG job to exactly one MIG slice and no more. Thus,
---gres=gpu:a100l.3 will work (and request a size-3 slice of an
-a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting
-three size-1 slices) will not.
-"
-AMD,https://docs.mila.quebec/Information.html#amd,"AMD
-
-Warning
-As of August 20, 2019 the GPUs had to be returned to AMD. Mila will get
-more samples. You can join the amd slack channels to get the latest
-information.
-
-Mila has a few nodes equipped with MI50 GPUs.
-srun --gres=gpu -c 8 --reservation=AMD --pty bash
-
-# first-time setup of the AMD stack
-conda create -n rocm python=3.6
-conda activate rocm
-
-pip install tensorflow-rocm
-pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl
-"
-Data sharing policies,https://docs.mila.quebec/Information.html#data-sharing-policies,"Data sharing policies
-
-Note
-/network/scratch aims to support
-Access Control Lists (ACLs)
-to allow collaborative work on rapidly changing data, e.g. work-in-progress
-datasets, model checkpoints, etc.
-
-/network/projects aims to offer a collaborative
-space for long-term projects. Data that should be kept for a longer period
-than 90 days can be stored in that location, but first a request to Mila’s helpdesk has to be made to create the project
-directory.
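-As an illustration of what ACL-based sharing on /network/scratch could look
-like, here is a sketch using the standard Linux ACL tools; the collaborator
-name and path are hypothetical placeholders, and since ACL support is only
-stated as an aim above, check with IT support before relying on it:
-# Grant a collaborator read and traverse access to an experiment folder
-setfacl -R -m u:<collaborator>:rX /network/scratch/<u>/<username>/shared_exp
-
-# Verify which ACL entries are now in effect
-getfacl /network/scratch/<u>/<username>/shared_exp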
-" -Monitoring,https://docs.mila.quebec/Information.html#monitoring,"Monitoring -Every compute node on the Mila cluster has a Netdata -monitoring daemon allowing you to get a sense of the state of the node. -This information is exposed in two ways: - -For every node, there is a web interface from Netdata itself at .server.mila.quebec:19999. -This is accessible only when using the Mila wifi or through SSH tunnelling. - -SSH tunnelling: on your local machine, run - -ssh -L 19999:.server.mila.quebec:19999 -p 2222 -login.server.mila.quebec -or ssh -L 19999:.server.mila.quebec:19999 mila if you have -already setup your SSH Login, - - -then open http://localhost:19999 in your browser. - - -The Mila dashboard at dashboard.server.mila.quebec -exposes aggregated statistics with the use of grafana. -These are collected internally to an instance of prometheus. - -In both cases, those graphs are not editable by individual users, -but they provide valuable insight into the state of the whole cluster -or the individual nodes. -One of the important uses is to collect data about the health -of the Mila cluster and to sound the alarm if outages occur -(e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM). -" -Example with Netdata on cn-c001,https://docs.mila.quebec/Information.html#example-with-netdata-on-cn-c001,"Example with Netdata on cn-c001 -For example, if we have a job running on cn-c001, we can type -cn-c001.server.mila.quebec:19999 in a browser address bar and the following -page will appear. - -" -Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage,"Example watching the CPU/RAM/GPU usage -Given that compute nodes are generally shared -with other users who are also running jobs at the same time and -consuming resources, this is not generally a good way to profile your code -in fine details. -However, it can still be a very useful source of information -for getting an idea of whether the machine that you requested is being -used in its full capacity. -Given how expensive the GPUs are, it generally makes sense to try to -make sure that this resources is always kept busy. - - -CPU -iowait (pink line): High values means your model is waiting on IO a lot (disk or network). - - - - - - - - -CPU RAM -You can see how much CPU RAM is being used by your script in practice, -considering the amount that you requested (e.g. `sbatch --mem=8G ...`). -GPU usage is generally more important to monitor than CPU RAM. -You should not cut it so close to the limit that your experiments randomly fail -because they run out of RAM. However, you should not request blindly 32GB of RAM -when you actually require only 8GB. - - - - - - - - -GPU -Monitors the GPU usage using an nvidia-smi plugin for Netdata. -Under the plugin interface, select the GPU" -Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage," number which was allocated to -you. You can figure this out by running echo $SLURM_JOB_GPUS on the -allocated node or, if you have the job ID, -scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX -You should make sure you use the GPUs to their fullest capacity. -Select the biggest batch size if possible to increase GPU memory usage and -the GPU computational load. -Spawn multiple experiments if you can fit many on a single GPU. -Running 10 independent MNIST experiments on a single GPU will probably take -less than 10x the time to run a single one. 
This assumes that you have more
-experiments to run, because nothing is gained by gratuitously running experiments.
-You can request a less powerful GPU and leave the more powerful GPUs
-to other researchers who have experiments that can make the best use of them.
-Sometimes you really just need a k80 and not a v100.
-
-Other users or jobs
-If the node seems unresponsive or slow,
-it may be useful to check what other tasks are
-running at the same time on that node.
-This should not be an issue in general,
-but in practice it is useful to be able to
-inspect this to diagnose certain problems.
-
-"
-Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard
-
-"
-Storage,https://docs.mila.quebec/Information.html#storage,"Storage
-| Path                                         | Performance | Usage                                                                                 | Quota (Space/Files) | Backup | Auto-cleanup |
-|----------------------------------------------|-------------|---------------------------------------------------------------------------------------|---------------------|--------|--------------|
-| /network/datasets/                           | High        | Curated raw datasets (read only)                                                        | –                   | –      | –            |
-| $HOME or /home/mila/<u>/<username>/          | Low         | Personal user space; specific libraries, code, binaries                                 | 100GB/1000K         | Daily  | no           |
-| $SCRATCH or /network/scratch/<u>/<username>/ | High        | Temporary job results, processed datasets; optimized for small files                    | no                  | no     | 90 days      |
-| $SLURM_TMPDIR                                | Highest     | High-speed disk for temporary job results                                               | 4TB/-               | no     | at job end   |
-| /network/projects/<groupname>/               | Fair        | Shared space to facilitate collaboration between researchers; long-term project storage | 200GB/1000K         | Daily  | no           |
-| $ARCHIVE or /network/archive/<u>/<username>/ | Low         | Long-term personal storage                                                              | 500GB               | no     | no           |
-
-Note
-The $HOME file system is backed up once a day. For any file
-restoration request, file a request to Mila’s IT support with the path to the file or directory to
-restore, with the required date.
-
-
-Warning
-Currently there is no backup system for any other file systems of
-the Mila cluster. Storage local to personal computers, Google Drive and other
-related solutions should be used to back up important data.
-
-"
-$HOME,https://docs.mila.quebec/Information.html#home,"$HOME
-$HOME is appropriate for code and libraries which are small and read once,
-as well as the experimental results that would be needed at a later time (e.g.
-the weights of a network referenced in a paper).
-Quotas are enabled on $HOME for both disk capacity (blocks) and number of
-files (inodes). The limits for blocks and inodes are respectively 100GiB and 1
-million per user. The command to check the quota usage from a login node is:
-beegfs-ctl --cfgFile=/etc/beegfs/home.d/beegfs-client.conf --getquota --uid $USER
-"
-$SCRATCH,https://docs.mila.quebec/Information.html#scratch,"$SCRATCH
-$SCRATCH can be used to store processed datasets, work-in-progress datasets
-or temporary job results. Its block size is optimized for small files which
-minimizes the performance hit of working on extracted datasets.
-
-Note
-Auto-cleanup: this file system is cleared on a weekly basis;
-files not used for more than 90 days will be deleted.
-
-"
-$SLURM_TMPDIR,https://docs.mila.quebec/Information.html#slurm-tmpdir,"$SLURM_TMPDIR
-$SLURM_TMPDIR points to the local disk of the node on which a job is
-running.
"
projects,https://docs.mila.quebec/Information.html#projects,"projects
projects can be used for collaborative projects. It aims to ease the
sharing of data between users working on a long-term project.
Quotas are enabled on projects for both disk capacity (blocks) and number
of files (inodes). The limits for blocks and inodes are respectively 200GiB and
1 million per user and per group.

Note
It is possible to request higher quota limits if the project requires
it. File a request to Mila's IT support.
"
$ARCHIVE,https://docs.mila.quebec/Information.html#archive,"$ARCHIVE
The purpose of $ARCHIVE is to store data other than datasets that has to be kept
long-term (e.g. generated samples, logs, data relevant for paper submission).
$ARCHIVE is only available on the login nodes. Because this file system
is tuned for large files, it is recommended to archive your directories. For
example, to archive the results of an experiment in
$SCRATCH/my_experiment_results/, run the commands below from a login node:
cd $SCRATCH
tar cJf $ARCHIVE/my_experiment_results.tar.xz --xattrs my_experiment_results
Disk capacity quotas are enabled on $ARCHIVE. The soft limit per user is
500GB and the hard limit is 550GB. The grace time is 7 days. This means that one
can use more than 500GB for up to 7 days before the file system enforces the
quota; however, it is never possible to use more than 550GB.
The command to check the quota usage from a login node is df:
df -h $ARCHIVE

Note
There is NO backup of this file system.
"
datasets,https://docs.mila.quebec/Information.html#datasets,"datasets
datasets contains curated datasets for the benefit of the Mila community.
To request the addition of a dataset or a preprocessed dataset you think could
benefit the research of others, you can fill in this form. Datasets can also be browsed from the
web: Mila Datasets.
Datasets in datasets/restricted are restricted and require an explicit
request to gain access. Please submit a support ticket mentioning the dataset's
access group (e.g. scannet_users), your cluster username and the
approval of the group owner. You can find the dataset's access group by
listing the contents of /network/datasets/restricted with the ls command.
Those datasets are mirrored to the Alliance clusters in
~/projects/rrg-bengioy-ad/data/curated/ if they follow the Digital Research
Alliance of Canada's good practices on data.
To list the local datasets on an Alliance cluster, you can execute the
following command:
ssh [CLUSTER_LOGIN] -C ""projects/rrg-bengioy-ad/data/curated/list_datasets_cc.sh""
"
Data Transmission,https://docs.mila.quebec/Information.html#data-transmission,"Data Transmission
Multiple methods can be used to transfer data to/from the cluster:

rsync --bwlimit=10mb: this is the favored method, since the bandwidth can
be limited to prevent impacting the usage of the cluster for others.
Digital Research Alliance of Canada clusters: Globus.

An example rsync invocation is shown below.
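For instance, a sketch of copying a local directory to your scratch space
with a bandwidth cap. The paths are placeholders, and the mila host alias
comes from the SSH Config section further below:

# ON YOUR LOCAL MACHINE
rsync -av --bwlimit=10mb ./my_dataset/ mila:/network/scratch/<u>/<username>/my_dataset/

The trailing slashes make rsync copy the directory contents rather than
nesting the directory, and rerunning the same command resumes an interrupted
transfer, only sending the files that changed.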
-" -Quick Start,https://docs.mila.quebec/Userguide.html#quick-start,"Quick Start -Users first need login access to the cluster. It is -recommended to install milatools which will help in the set up of the -ssh configuration needed to securely and easily connect to the -cluster. -" -mila code,https://docs.mila.quebec/Userguide.html#mila-code,"mila code -milatools also makes it easy to run and debug code on the Mila cluster. Using -the mila code command will allow you to use VSCode on the server. Simply run: -mila code path/on/cluster - - -The details of the command can be found on the github page of the package. Note that you need to -first setup your ssh configuration using mila init before the mila code -command can be used. The initialisation of the ssh configuration is explained -here and on the github page of the package. -" -Logging in to the cluster,https://docs.mila.quebec/Userguide.html#logging-in-to-the-cluster,"Logging in to the cluster -To access the Mila Cluster clusters, you will need a Mila account. Please contact -Mila systems administrators if you don’t have it already. Our IT support service -is available here: https://it-support.mila.quebec/ -You will also need to complete and return an IT Onboarding Training to get -access to the cluster. Please refer to the Mila Intranet for more -informations: -https://sites.google.com/mila.quebec/mila-intranet/it-infrastructure/it-onboarding-training -IMPORTANT : Your access to the Cluster is granted based on your status at -Mila (for students, your status is the same as your main supervisor’ status), -and on the duration of your stay, set during the creation of your account. The -following have access to the cluster : Current Students of Core Professors - -Core Professors - Staff -" -SSH Login,https://docs.mila.quebec/Userguide.html#ssh-login,"SSH Login -You can access the Mila cluster via ssh: -# Generic login, will send you to one of the 4 login nodes to spread the load -ssh @login.server.mila.quebec -p 2222 - -# To connect to a specific login node, X in [1, 2, 3, 4] -ssh @login-X.login.server.mila.quebec -p 2222 -Four login nodes are available and accessible behind a load balancer. At each -connection, you will be redirected to the least loaded login-node. -The ECDSA, RSA and ED25519 fingerprints for Mila’s login nodes are: -SHA256:baEGIa311fhnxBWsIZJ/zYhq2WfCttwyHRKzAb8zlp8 (ECDSA) -SHA256:Xr0/JqV/+5DNguPfiN5hb8rSG+nBAcfVCJoSyrR0W0o (RSA) -SHA256:gfXZzaPiaYHcrPqzHvBi6v+BWRS/lXOS/zAjOKeoBJg (ED25519) - - - -Important -Login nodes are merely entry points to the cluster. They give you access -to the compute nodes and to the filesystem, but they are not meant to run -anything heavy. Do not run compute-heavy programs on these nodes, -because in doing so you could bring them down, impeding cluster access for -everyone. -This means no training or experiments, no compiling programs, no Python -scripts, but also no zip of a large folder or anything that demands a -sustained amount of computation. -Rule of thumb: never run a program that takes more than a few seconds on -a login node. - -Note -In a similar vein, you should not run VSCode remote SSH instances directly -on login nodes, because even though they are typically not very -computationally expensive, when many people do it, they add up! See -Visual Studio Code for specific instructions. 
"
mila init,https://docs.mila.quebec/Userguide.html#mila-init,"mila init
To make it easier to set up a productive environment, Mila publishes the
milatools package, which defines a mila init command that will
automatically perform some of the steps below for you. You can install it with
pip and use it, provided your Python version is at least 3.8:
$ pip install milatools
$ mila init
"
SSH Config,https://docs.mila.quebec/Userguide.html#ssh-config,"SSH Config
The login nodes support the following authentication mechanisms:
publickey,keyboard-interactive. If you would like to set an entry in your
.ssh/config file, please use the following recommendation:
Host mila
    User YOUR-USERNAME
    Hostname login.server.mila.quebec
    PreferredAuthentications publickey,keyboard-interactive
    Port 2222
    ServerAliveInterval 120
    ServerAliveCountMax 5

Then you can simply write ssh mila to connect to a login node. You will also
be able to use mila with scp, rsync and other such programs.

Tip
You can run commands on a login node with ssh directly, for example
ssh mila squeue -u '$USER' (remember to put single quotes around any
$VARIABLE you want to evaluate on the remote side, otherwise it will be
evaluated locally before ssh is even executed).
"
Passwordless login,https://docs.mila.quebec/Userguide.html#passwordless-login,"Passwordless login
To save you some repetitive typing, it is highly recommended to set up public
key authentication, which means you won't have to enter your password every time
you connect to the cluster.
# ON YOUR LOCAL MACHINE
# You might already have done this in the past, but if you haven't:
ssh-keygen  # Press ENTER 3x

# Copy your public key over to the cluster
# You will need to enter your password
ssh-copy-id mila
"
Connecting to compute nodes,https://docs.mila.quebec/Userguide.html#connecting-to-compute-nodes,"Connecting to compute nodes
If (and only if) you have a job running on compute node “cnode”, you are
allowed to SSH to it directly, if for some reason you need a second terminal.
That session will automatically end when your job is relinquished.
First, however, you need to have passwordless
SSH set up, either with a key present in your home directory or with an
ssh-agent. To generate a key pair on the login node:
# ON A LOGIN NODE
ssh-keygen  # Press ENTER 3x
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh

Then from the login node you can write ssh <node>. From your local
machine, you can use ssh -J mila USERNAME@<node> (-J represents a “jump”
through the login node, necessary because the compute nodes are behind a
firewall).
If you wish, you may also add the following wildcard rule in your .ssh/config:
Host *.server.mila.quebec !*login.server.mila.quebec
    HostName %h
    User YOUR-USERNAME
    ProxyJump mila

This will let you connect to a compute node with ssh <node>.server.mila.quebec.
"
Running your code,https://docs.mila.quebec/Userguide.html#running-your-code,"Running your code
"
SLURM commands guide,https://docs.mila.quebec/Userguide.html#slurm-commands-guide,"SLURM commands guide
"
Basic Usage,https://docs.mila.quebec/Userguide.html#basic-usage,"Basic Usage
The SLURM documentation
provides extensive information on the available commands to query the cluster
status or submit jobs.
Below are some basic examples of how to use SLURM.
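To start with, a few everyday status commands (all standard SLURM; the job ID
4323674 is just an illustrative value):

# List your own queued and running jobs
squeue -u $USER

# Show the partitions and the state of the nodes
sinfo

# Show detailed information about one job
scontrol show job 4323674

# Cancel a job
scancel 4323674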
-" -Submitting jobs,https://docs.mila.quebec/Userguide.html#submitting-jobs,"Submitting jobs -" -Batch job,https://docs.mila.quebec/Userguide.html#batch-job,"Batch job -In order to submit a batch job, you have to create a script containing the main -command(s) you would like to execute on the allocated resources/nodes. - 1#!/bin/bash - 2#SBATCH --job-name=test - 3#SBATCH --output=job_output.txt - 4#SBATCH --error=job_error.txt - 5#SBATCH --ntasks=1 - 6#SBATCH --time=10:00 - 7#SBATCH --mem=100Gb - 8 - 9module load python/3.5 -10python my_script.py - - -Your job script is then submitted to SLURM with sbatch (ref.) -sbatch job_script -sbatch: Submitted batch job 4323674 -The working directory of the job will be the one where your executed sbatch. - -Tip -Slurm directives can be specified on the command line alongside sbatch or -inside the job script with a line starting with #SBATCH. - -" -Interactive job,https://docs.mila.quebec/Userguide.html#interactive-job,"Interactive job -Workload managers usually run batch jobs to avoid having to watch its -progression and let the scheduler run it as soon as resources are available. If -you want to get access to a shell while leveraging cluster resources, you can -submit an interactive jobs where the main executable is a shell with the -srun/salloc (srun/salloc) commands -salloc -Will start an interactive job on the first node available with the default -resources set in SLURM (1 task/1 CPU). srun accepts the same arguments as -sbatch with the exception that the environment is not passed. - -Tip -To pass your current environment to an interactive job, add ---preserve-env to srun. - -salloc can also be used and is mostly a wrapper around srun if provided -without more info but it gives more flexibility if for example you want to get -an allocation on multiple nodes. -" -Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments -In order to accurately select the resources for your job, several arguments are -available. The most important ones are: -| Argument | Description | -|----------------------------|---------------------------------------------------------------------------| -| -n, –ntasks= | The number of task in your script, usually =1 | -| -c, –cpus-per-task= | The number of cores for each task | -| -t, –time=