zhangchao20 committed on
Commit
42ddd4f
·
2 Parent(s): e02252b 1098b6c

update podsys_v1.8

Files changed (1)
  1. README.md +146 -0
README.md CHANGED
@@ -1,3 +1,149 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
+
5
+ # Overview
6
+
7
+ PODsys targets AI cluster deployment. It provides a complete toolchain covering infrastructure installation, environment deployment, user management, system monitoring, and resource scheduling, with the aim of offering an open-source, efficient, compatible, and easy-to-use deployment solution for intelligent cluster environments.
8
+
9
+ To achieve this, PODsys bundles dozens of drivers, software packages, and other installation packages required for AI cluster deployment, and provides a set of scripts that simplify deployment. With these tools, users can deploy an entire cluster using a few simple commands.
10
+
11
+ - Environment deployment and management: PODsys provides quick tools for environment deployment and management, including quick installation, configuration, and updating of cluster environments. It also includes the operating system, NVIDIA drivers, InfiniBand drivers and other necessary software base packages, to provide users with a complete GPU cluster environment. Users can manage cluster nodes, add or remove nodes, and monitor node status and performance with simple commands.
12
+
13
+ - User management and permission control: PODsys has a comprehensive user management and permission control mechanism. Administrators can create and manage user accounts and assign different permissions and resource quotas. This allows each user or team to flexibly allocate resources in the cluster and ensures the security of the cluster.
14
+
15
+ - System monitoring and performance optimization: PODsys provides comprehensive system monitoring and performance optimization capabilities to help users monitor the status and performance indicators of the cluster in real time. Through a visual interface, users can view cluster resource usage, job execution, and performance bottlenecks to adjust cluster configurations and optimize job performance in a timely manner.
16
+
17
+ - Resource scheduling and job management: PODsys provides resource scheduling and job management functions that automatically schedule and manage jobs according to users' needs, keeping cluster resource utilization and job execution efficiency high.
18
+
19
+ # User Guide
20
+
21
+ Choose one machine from the cluster as the management node, and the remaining machines as compute nodes. All PODsys deployment operations are performed on the management node.
22
+
23
+ ## 1. Installation Steps on the Management Node
24
+
25
+ ### 1.1 Installing OS through BMC
26
+
27
+ Install Ubuntu Server on the management node. Set:
28
+
29
+ ```shell
30
+ version : 22.04.2
31
+ hostname : mu01
32
+ username : nexus
33
+ ```
34
+
35
+ ### 1.2 Running install_manager.sh
36
+
37
+ Download podsys-v1.8.tgz, then run:
38
+
39
+ ```shell
40
+ $ echo "dd594ca770a09af5654c999925cf9af2 podsys-v1.8.tgz" | md5sum --check
41
+ $ sudo tar -xzvf podsys-v1.8.tgz -C /home/nexus/
42
+ $ cd podsys
43
+ $ sudo ./install_manager.sh
44
+ $ sudo ./verify_installation.sh
45
+ ```
46
+
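The checksum step above can also gate the extraction, so that a corrupted download is never unpacked. The following is a minimal sketch, assuming GNU `md5sum` is available; the helper name `verify_md5` is hypothetical and not part of PODsys.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with PODsys):
# succeed only if FILE matches the expected MD5 digest.
verify_md5() {
    local expected="$1" file="$2"
    [ -f "$file" ] || { echo "missing file: $file" >&2; return 2; }
    # Feed md5sum a "<digest>  <path>" line, the same layout md5sum prints.
    # --status suppresses output; only the exit code reports the result.
    echo "${expected}  ${file}" | md5sum --check --status
}

# Usage, with the digest and archive name from the commands above:
# verify_md5 dd594ca770a09af5654c999925cf9af2 podsys-v1.8.tgz \
#     && sudo tar -xzvf podsys-v1.8.tgz -C /home/nexus/
```

Chaining the extraction with `&&` means `tar` only runs when the digest matches.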
47
+ ### 1.3 Modifying iplist.txt
48
+
49
+ ```shell
50
+ $ sudo vim /workspace/iplist.txt
51
+ ```
52
+ Example of "iplist.txt":
53
+ | Serial Number | hostname | IP address | gateway | DNS | IPoIB |
54
+ | ------------- | -------- | ------------ | ----------- | --------------- | -------------- |
55
+ | 24XR20001 | node01 | 192.168.0.11 | 192.168.0.1 | 114.114.114.114 | 192.168.100.11 |
56
+ | 24XR20002 | node02 | 192.168.0.12 | 192.168.0.1 | 114.114.114.114 | 192.168.100.12 |
57
+
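A typo in an address here would only surface later, during node installation, so it may be worth checking the file before continuing. The sketch below is illustrative only: it assumes one node per line with six whitespace-separated fields in the column order of the table above, which may not match the real `iplist.txt` layout. The function names `is_ipv4` and `check_iplist` are hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if the argument is a dotted-quad IPv4 address with octets 0-255.
is_ipv4() {
    local ip="$1" octet
    [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
    local IFS=.
    for octet in $ip; do
        # 10# forces base 10 so that "08" is not parsed as octal.
        (( 10#$octet <= 255 )) || return 1
    done
}

# ASSUMPTION: six whitespace-separated fields per line, in the order
# serial, hostname, IP, gateway, DNS, IPoIB. The real file may differ.
check_iplist() {
    local file="$1" serial host ip gw dns ipoib extra addr lineno=0 status=0
    # "|| [ -n "$serial" ]" keeps a final line that lacks a trailing newline.
    while read -r serial host ip gw dns ipoib extra || [ -n "$serial" ]; do
        lineno=$((lineno + 1))
        [ -z "$serial" ] && continue            # skip blank lines
        if [ -z "$ipoib" ] || [ -n "$extra" ]; then
            echo "line $lineno: expected 6 fields" >&2
            status=1
            continue
        fi
        for addr in "$ip" "$gw" "$dns" "$ipoib"; do
            is_ipv4 "$addr" || { echo "line $lineno: bad address '$addr'" >&2; status=1; }
        done
    done < "$file"
    return "$status"
}

# Usage: check_iplist /workspace/iplist.txt && echo "iplist.txt looks consistent"
```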
58
+ ## 2. Installation Steps on the Compute Node
59
+
60
+ **The installation steps on the compute node are to be executed on the management node.**
61
+
62
+ ### 2.1 Modifying config.yaml
63
+
64
+ ```shell
65
+ $ vim workspace/config.yaml
66
+ ```
67
+
68
+ Default contents are as follows:
69
+
70
+ ```shell
71
+ manager_ip:192.168.0.11
72
+ manager_nic:enp61s0f0
73
+ compute_passwd:123
74
+ compute_storage:sda
75
+ ```
76
+ - manager_ip: The IP address of the management node.
77
+ - manager_nic: The NIC identifier of the management node.
78
+ - compute_passwd: The user password for the compute node.
79
+ - compute_storage: The disk on which the compute node operating system is installed.
80
+ **Warning: Installation overwrites all existing data on this disk.**
81
+
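Because a wrong `compute_storage` value leads to data loss, it can help to read the settings back before running the installer. The sketch below assumes the file keeps the flat `key:value` layout shown above; `get_config_value` and `check_config` are hypothetical helpers, not PODsys commands.

```shell
#!/usr/bin/env bash
# Print the value stored under KEY in a flat "key:value" file.
# Returns 1 if the key is absent or its value is empty.
get_config_value() {
    local key="$1" file="$2" value
    # Split each line at the first colon only, then trim the value.
    value=$(awk -v k="$key" '
        { pos = index($0, ":") }
        pos > 0 && substr($0, 1, pos - 1) == k {
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t\r]+$/, "", v)
            print v
            exit
        }' "$file")
    [ -n "$value" ] && printf '%s\n' "$value"
}

# Hypothetical pre-flight check: all four documented keys must be set.
check_config() {
    local file="$1" key missing=0
    for key in manager_ip manager_nic compute_passwd compute_storage; do
        get_config_value "$key" "$file" > /dev/null \
            || { echo "config: '$key' is missing or empty" >&2; missing=1; }
    done
    return "$missing"
}

# Usage:
# check_config workspace/config.yaml \
#     && echo "target disk: $(get_config_value compute_storage workspace/config.yaml)"
```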
82
+ ### 2.2 Running install_compute.sh
83
+
84
+ ```shell
85
+ $ sudo ./install_compute.sh
86
+ ```
87
+
88
+ You will then enter a Docker terminal with the command prompt "root@podsys:/$".
89
+ Power on the compute nodes that have no operating system installed to initiate automatic installation.
90
+ Compute nodes that already have an operating system must be set to boot via PXE (Preboot Execution Environment).
91
+ After the installation is complete, the compute nodes will shut down automatically.
92
+ Once all compute nodes are powered off, type "exit" to finish the installation of the compute node.
93
+
94
+ ## 3. Configuration of Cluster Parallel Environments
95
+
96
+ All of the following operations are performed on the management node. Before configuring the services, run the following commands:
97
+
98
+ ```shell
99
+ $ cd podsys
100
+ $ sudo ./config_server.sh -pre
101
+ ```
102
+
103
+ ### 3.1 Configuration of NFSoRDMA (NFS over Remote Direct Memory Access)
104
+
105
+ - Configuration on the Management Node:
106
+
107
+ ```shell
108
+ $ sudo ./config_server.sh -nfs [share directory]
109
+ ```
110
+
111
+ - Configuration on the Compute Node:
112
+
113
+ ```shell
114
+ $ sudo ./config_client.sh -IPoIB
115
+ $ sudo ./config_client.sh -nfs [serverIP] [share directory] [local directory]
116
+ ```
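After the client step, one way to confirm that the share really uses the RDMA transport is to inspect the mount table. This is a sketch under the assumption that the result is a standard Linux NFS mount whose options in `/proc/mounts` include `proto=rdma`; the function name `is_nfs_rdma_mount` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed if DIR appears in the mount table as an NFS mount using RDMA.
# The second argument defaults to /proc/mounts and exists mainly for testing.
is_nfs_rdma_mount() {
    local dir="$1" mounts="${2:-/proc/mounts}"
    # Mount table fields: device, mount point, fs type, options, dump, pass.
    awk -v d="$dir" '
        $2 == d && $3 ~ /^nfs/ && $4 ~ /(^|,)proto=rdma(,|$)/ { found = 1 }
        END { exit !found }' "$mounts"
}

# Usage, with the local directory passed to config_client.sh:
# is_nfs_rdma_mount /path/to/localdirectory && echo "mounted over RDMA"
```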
117
+ **NIS and OpenLDAP both provide user management. Configure only one of them.**
118
+
119
+ ### 3.2 Configuration of NIS (Network Information Service)
120
+
121
+ - Configuration on the Management Node:
122
+
123
+ ```shell
124
+ $ sudo ./config_server.sh -nis [serverIP]
125
+ $ sudo /usr/lib/yp/ypinit -m
126
+ $ sudo make -C /var/yp
127
+ ```
128
+
129
+ - Configuration on the Compute Node:
130
+
131
+ ```shell
132
+ $ cd /home/nexus/podsys_manager
133
+ $ ./config_client.sh -nis [nis server ip]
134
+ $ sudo yptest
135
+ ```
136
+
137
+ ### 3.3 Configuration of OpenLDAP
138
+
139
+ - Configuration on the Management Node:
140
+
141
+ ```shell
142
+ $ sudo ./config_server.sh -ldap [serverIP] [ldap-password]
143
+ ```
144
+
145
+ - Configuration on the Compute Node:
146
+
147
+ ```shell
148
+ $ sudo ./config_client.sh -ldap [serverIP] [ldap-password]
149
+ ```