glenn-jocher committed on
Commit
d5d275b
1 Parent(s): 7a6870b

Amazon AWS EC2 startup and re-startup scripts (#2185)

* Amazon AWS EC2 startup and re-startup scripts

* Create resume.py

* cleanup

utils/aws/__init__.py ADDED
File without changes
utils/aws/mime.sh ADDED
@@ -0,0 +1,26 @@
+ # AWS EC2 instance startup 'MIME' script https://aws.amazon.com/premiumsupport/knowledge-center/execute-user-data-ec2/
+ # This script will run on every instance restart, not only on first start
+ # --- DO NOT COPY ABOVE COMMENTS WHEN PASTING INTO USERDATA ---
+
+ Content-Type: multipart/mixed; boundary="//"
+ MIME-Version: 1.0
+
+ --//
+ Content-Type: text/cloud-config; charset="us-ascii"
+ MIME-Version: 1.0
+ Content-Transfer-Encoding: 7bit
+ Content-Disposition: attachment; filename="cloud-config.txt"
+
+ #cloud-config
+ cloud_final_modules:
+ - [scripts-user, always]
+
+ --//
+ Content-Type: text/x-shellscript; charset="us-ascii"
+ MIME-Version: 1.0
+ Content-Transfer-Encoding: 7bit
+ Content-Disposition: attachment; filename="userdata.txt"
+
+ #!/bin/bash
+ # --- paste contents of userdata.sh here ---
+ --//
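The multipart layout in mime.sh can also be generated rather than pasted by hand. The following is a minimal sketch, not part of this commit, that uses Python's standard `email.mime` classes to wrap a shell script in the same two-part structure (a cloud-config part followed by a shell-script part). The function name `build_user_data` and the sample script body are hypothetical. The standard library writes a closing `--//--` delimiter, which differs slightly from the hand-written file above.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Same cloud-config payload as in mime.sh: run user scripts on every boot.
CLOUD_CONFIG = "#cloud-config\ncloud_final_modules:\n- [scripts-user, always]\n"


def build_user_data(script_body: str) -> str:
    """Return a multipart/mixed document: cloud-config part, then shell script part.

    Hypothetical helper for illustration; it mirrors the layout of mime.sh.
    """
    msg = MIMEMultipart("mixed", boundary="//")

    config = MIMEText(CLOUD_CONFIG, "cloud-config", "us-ascii")
    config.add_header("Content-Disposition", "attachment", filename="cloud-config.txt")

    script = MIMEText(script_body, "x-shellscript", "us-ascii")
    script.add_header("Content-Disposition", "attachment", filename="userdata.txt")

    msg.attach(config)
    msg.attach(script)
    return msg.as_string()


if __name__ == "__main__":
    # Placeholder script body; in practice this would be the contents of userdata.sh.
    print(build_user_data("#!/bin/bash\necho 'hello from user data'\n"))
```

Generating the document this way avoids accidentally copying the leading comment lines into the user data field, which the header of mime.sh warns against.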
utils/aws/resume.py ADDED
@@ -0,0 +1,34 @@
+ # Resume all interrupted trainings in yolov5/ dir including DDP trainings
+ # Usage: $ python utils/aws/resume.py
+
+ import os
+ from pathlib import Path
+
+ import torch
+ import yaml
+
+ port = 0  # --master_port
+ path = Path('').resolve()
+ for last in path.rglob('*/**/last.pt'):
+     ckpt = torch.load(last)
+     if ckpt['optimizer'] is None:  # no optimizer state saved, nothing to resume
+         continue
+
+     # Load opt.yaml
+     with open(last.parent.parent / 'opt.yaml') as f:
+         opt = yaml.load(f, Loader=yaml.SafeLoader)
+
+     # Get device count
+     d = opt['device'].split(',')  # devices
+     nd = len(d)  # number of devices
+     ddp = nd > 1 or (nd == 0 and torch.cuda.device_count() > 1)  # distributed data parallel
+
+     if ddp:  # multi-GPU
+         port += 1
+         cmd = f'python -m torch.distributed.launch --nproc_per_node {nd} --master_port {port} train.py --resume {last}'
+     else:  # single-GPU
+         cmd = f'python train.py --resume {last}'
+
+     cmd += ' > /dev/null 2>&1 &'  # redirect output to /dev/null and run in the background
+     print(cmd)
+     os.system(cmd)
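The command selection in resume.py can be isolated into a pure function for testing without GPUs or checkpoints. This is a minimal sketch under that assumption; the name `build_resume_command` and the example checkpoint paths are hypothetical and not part of this commit. It only covers the `nd > 1` case, because `''.split(',')` returns `['']`, so an empty device string counts as one device and takes the single-GPU branch.

```python
def build_resume_command(device: str, last: str, port: int) -> str:
    """Return the shell command that resumes the run whose checkpoint is `last`.

    Hypothetical helper mirroring the branch logic in resume.py.
    `device` is the value stored under 'device' in opt.yaml, e.g. '' or '0,1'.
    """
    devices = device.split(',')  # '' -> [''], '0,1' -> ['0', '1']
    nd = len(devices)  # number of listed devices, never 0 for a string input
    if nd > 1:  # multi-GPU: launch one process per listed device
        return (f'python -m torch.distributed.launch --nproc_per_node {nd} '
                f'--master_port {port} train.py --resume {last}')
    return f'python train.py --resume {last}'  # single-GPU or default device


if __name__ == '__main__':
    # Example paths are placeholders for illustration only.
    print(build_resume_command('0,1', 'exp/weights/last.pt', port=1))
    print(build_resume_command('', 'exp2/weights/last.pt', port=1))
```

Each multi-GPU run needs its own `--master_port`, which is why resume.py increments `port` before building every distributed command.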
utils/aws/userdata.sh ADDED
@@ -0,0 +1,26 @@
+ #!/bin/bash
+ # AWS EC2 instance startup script https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
+ # This script will run only once on first instance start (for a re-start script see mime.sh)
+ # /home/ubuntu (ubuntu) or /home/ec2-user (amazon-linux) is working dir
+ # Use >300 GB SSD
+
+ cd home/ubuntu
+ if [ ! -d yolov5 ]; then
+   echo "Running first-time script." # install dependencies, download COCO, pull Docker
+   git clone https://github.com/ultralytics/yolov5 && sudo chmod -R 777 yolov5
+   cd yolov5
+   bash data/scripts/get_coco.sh && echo "Data done." &
+   sudo docker pull ultralytics/yolov5:latest && echo "Docker done." &
+   # python -m pip install --upgrade pip && pip install -r requirements.txt && python detect.py && echo "Requirements done." &
+ else
+   echo "Running re-start script." # resume interrupted runs
+   i=0
+   list=$(docker ps -qa) # container list i.e. $'one\ntwo\nthree\nfour'
+   while IFS= read -r id; do
+     ((i++))
+     echo "restarting container $i: $id"
+     docker start $id
+     # docker exec -it $id python train.py --resume # single-GPU
+     docker exec -d $id python utils/aws/resume.py
+   done <<<"$list"
+ fi
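One detail of the restart branch: a bash here-string always supplies at least one line, so when `docker ps -qa` prints nothing the loop body still runs once with an empty `id`. The sketch below, which is illustrative and not part of this commit, shows the same restart plan in Python with blank lines filtered out. The helper names `container_ids` and `restart_commands` are hypothetical, and the commands are returned as argument lists rather than executed.

```python
from typing import List


def container_ids(ps_output: str) -> List[str]:
    """Split the output of `docker ps -qa` into container IDs, skipping blank lines."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]


def restart_commands(ps_output: str) -> List[List[str]]:
    """Return, per container, the start command followed by the detached resume command."""
    commands = []
    for cid in container_ids(ps_output):
        commands.append(['docker', 'start', cid])
        commands.append(['docker', 'exec', '-d', cid, 'python', 'utils/aws/resume.py'])
    return commands


if __name__ == '__main__':
    # Sample output with placeholder IDs, matching the comment in userdata.sh.
    for command in restart_commands('one\ntwo\n'):
        print(' '.join(command))
```

The argument lists could be passed to `subprocess.run` on a host where Docker is installed.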