lizejun committed on
Commit
0efb4d3
1 Parent(s): ad85ccf

Update README.md

Files changed (1)
  1. README.md +148 -0
README.md CHANGED
@@ -1,3 +1,151 @@
  ---
  license: apache-2.0
  ---
+
+ # Overview
+
+ PODsys targets AI cluster deployment. It provides a complete toolchain covering infrastructure installation, environment deployment, user management, system monitoring, and resource scheduling, with the aim of being an open-source, efficient, compatible, and easy-to-use deployment solution for AI cluster environments.
+
+ To achieve this, PODsys integrates dozens of drivers, software packages, and other installation packages required for AI cluster deployment, and provides a set of scripts that simplify deployment. With these tools, users can deploy an entire cluster with a few simple commands.
+
+ - Environment deployment and management: PODsys provides tools for quickly installing, configuring, and updating cluster environments. It bundles the operating system, NVIDIA drivers, InfiniBand drivers, and other required base packages to give users a complete GPU cluster environment. Users can add or remove cluster nodes and monitor node status and performance with simple commands.
+
+ - User management and permission control: PODsys includes a user management and permission control mechanism. Administrators can create and manage user accounts and assign permissions and resource quotas. Each user or team can then be allocated resources in the cluster flexibly while the cluster stays secure.
+
+ - System monitoring and performance optimization: PODsys provides system monitoring that lets users track cluster status and performance indicators in real time. Through a visual interface, users can view cluster resource usage, job execution, and performance bottlenecks, and then adjust cluster configuration and optimize job performance promptly.
+
+ - Resource scheduling and job management: PODsys schedules and manages jobs automatically according to users' needs, to keep cluster resource utilization and job execution efficiency high.
+
+ # User Guide
+
+ Choose one machine in the cluster as the management node; the remaining machines are compute nodes. All PODsys deployment operations are performed on the management node.
+
+ ## Installation Steps on the Management Node
+
+ ### Installing OS through BMC
+
+ Install Ubuntu Server on the management node through the BMC, using the following settings:
+
+ ```shell
+ version : 22.04.2
+ hostname : mu01
+ username : nexus
+ ```
+
+ ### Running install_manager.sh
+
+ Download podsys-v1.8.tgz, then verify its checksum, extract it, and run the installation and verification scripts:
+
+ ```shell
+ $ echo "dd594ca770a09af5654c999925cf9af2 podsys-v1.8.tgz" | md5sum --check
+ $ sudo tar -xzvf podsys-v1.8.tgz -C /home/nexus/
+ $ cd podsys
+ $ sudo ./install_manager.sh
+ $ sudo ./verify_installation.sh
+ ```
+
+ ### Generating iplist.txt
+
+ ```shell
+ $ sudo vim /workspace/iplist.txt
+ ```
+
+ Example of `iplist.txt`:
+
+ | Serial Number | hostname | IP address   | gateway     | DNS             | IPoIB          |
+ | ------------- | -------- | ------------ | ----------- | --------------- | -------------- |
+ | 24XR20001     | node01   | 192.168.0.11 | 192.168.0.1 | 114.114.114.114 | 192.168.100.11 |
+ | 24XR20002     | node02   | 192.168.0.12 | 192.168.0.1 | 114.114.114.114 | 192.168.100.12 |
+
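Typos in the address columns are easy to miss before deployment. The following is a minimal sketch of a sanity check for the IPv4 fields. The function name `is_valid_ipv4` is hypothetical and not part of PODsys, and the check says nothing about how PODsys itself parses `iplist.txt`.

```shell
# Hypothetical helper (not part of PODsys): succeed only for a dotted-quad
# IPv4 address. The body runs in a subshell so the IFS change stays local.
is_valid_ipv4() (
    # Reject empty input, any character other than digits and dots,
    # and a trailing dot (which field splitting would silently ignore).
    case "$1" in
        ''|*[!0-9.]*|*.) return 1 ;;
    esac
    IFS=.
    set -- $1                      # split the address into its octets
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        case "$octet" in
            ''|????*) return 1 ;;  # empty octet or more than three digits
        esac
        [ "$octet" -le 255 ] || return 1
    done
    return 0
)

# Check the address fields of the first example row.
for addr in 192.168.0.11 192.168.0.1 114.114.114.114 192.168.100.11; do
    if is_valid_ipv4 "$addr"; then
        echo "ok: $addr"
    else
        echo "invalid: $addr"
    fi
done
```

All four example addresses pass this check; a value such as `192.168.0.300` would be reported as invalid.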
+ ## Installation Steps on the Compute Node
+
+ **All compute node installation steps are executed on the management node.**
+
+ ### Modifying config.yaml
+
+ ```shell
+ $ vim workspace/config.yaml
+ ```
+
+ Default contents are as follows:
+
+ ```shell
+ manager_ip:192.168.0.11
+ manager_nic:enp61s0f0
+ compute_passwd:123
+ compute_storage:sda
+ ```
+
+ - manager_ip: The IP address of the management node.
+ - manager_nic: The NIC identifier of the management node.
+ - compute_passwd: The user password for the compute nodes.
+ - compute_storage: The disk on each compute node where the operating system is installed.
+   **Warning: Installation overwrites the existing data on this disk.**
+
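Because a wrong `compute_storage` value leads to data loss, it can be worth printing the configured values before starting the installation. The sketch below reads values from a file that holds one `key:value` pair per line, as in the default contents shown above. The helper name `get_config_value` is hypothetical and not part of PODsys, and the demonstration runs on a temporary copy of the defaults instead of the real file.

```shell
# Hypothetical helper (not part of PODsys): print the value stored for a key
# in a file that holds one "key:value" pair per line.
get_config_value() {
    key=$1
    file=$2
    # Keep only lines that start with "key:", strip that prefix and any
    # whitespace after the colon, and print the first match.
    sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1
}

# Demonstration on a temporary copy of the default contents.
tmp_config=$(mktemp)
cat > "$tmp_config" <<'EOF'
manager_ip:192.168.0.11
manager_nic:enp61s0f0
compute_passwd:123
compute_storage:sda
EOF

echo "manager_ip      = $(get_config_value manager_ip "$tmp_config")"
echo "manager_nic     = $(get_config_value manager_nic "$tmp_config")"
echo "compute_storage = $(get_config_value compute_storage "$tmp_config")"

rm -f "$tmp_config"
```

On the management node, the printed `manager_nic` value can then be passed to `ip link show` to confirm that the interface exists.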
+ ### Running install_compute.sh
+
+ ```shell
+ $ sudo ./install_compute.sh
+ ```
+
+ The script opens a Docker terminal with the command prompt `root@podsys:/$`. From there:
+
+ - Power on the compute nodes that have no operating system installed to initiate automatic installation.
+ - Compute nodes that already have an operating system must be set to boot in PXE (Preboot Execution Environment) mode.
+ - After the installation is complete, the compute nodes shut down automatically.
+ - Once all compute nodes are powered off, type `exit` to finish the compute node installation.
+
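Deciding when every compute node has powered off can be tedious on a large cluster. As a rough sketch, the loop below treats a node as down when it no longer answers a single ping. This is an assumption: a node can be unreachable without being powered off, so the power state reported by the BMC remains the more reliable signal. The helper name `all_nodes_down` and the `PROBE` variable are hypothetical and not part of PODsys.

```shell
# Command used to probe one address; override PROBE to use another check.
PROBE=${PROBE:-ping -c 1 -W 1}

# Hypothetical helper (not part of PODsys): succeed when none of the given
# addresses answers the probe.
all_nodes_down() {
    for ip in "$@"; do
        # PROBE is intentionally unquoted so its options are split into words.
        if $PROBE "$ip" >/dev/null 2>&1; then
            return 1   # at least one node still responds
        fi
    done
    return 0
}

# Example usage (left commented out because it blocks until the nodes stop
# responding): poll the two example nodes every 30 seconds.
# until all_nodes_down 192.168.0.11 192.168.0.12; do sleep 30; done
```

The polling loop would be run from a second terminal on the management node while the Docker terminal opened by `install_compute.sh` stays open.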
+ ## Configuration of Cluster Parallel Environments
+
+ All of the following operations are performed on the management node. Before configuring any of the services below, run:
+
+ ```shell
+ $ cd podsys
+ $ sudo ./config_server.sh -pre
+ ```
+
+ ### Configuration of NFSoRDMA (NFS over Remote Direct Memory Access)
+
+ - Configuration on the Management Node:
+
+ ```shell
+ $ sudo ./config_server.sh -nfs [share directory]
+ ```
+
+ - Configuration on the Compute Node:
+
+ ```shell
+ $ sudo ./config_client.sh -IPoIB
+ $ sudo ./config_client.sh -nfs [serverIP] [share directory] [local directory]
+ ```
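
After the client configuration, it can be useful to confirm that the share is mounted over RDMA and not over TCP. The following sketch inspects a mount table in the `/proc/mounts` format and looks for an NFS entry whose options include `proto=rdma`. The helper name `is_nfs_rdma_mount` is hypothetical and not part of PODsys, and the sample entry (server address, paths, and options) is illustrative only.

```shell
# Hypothetical helper (not part of PODsys): succeed if the mount table lists
# the given directory as an NFS mount whose options include proto=rdma.
is_nfs_rdma_mount() {
    dir=$1
    table=${2:-/proc/mounts}
    # Fields in /proc/mounts: device, mount point, type, options, dump, pass.
    awk -v dir="$dir" '
        $2 == dir && $3 ~ /^nfs/ && $4 ~ /(^|,)proto=rdma(,|$)/ { found = 1 }
        END { exit !found }
    ' "$table"
}

# Demonstration on an illustrative mount table entry.
sample=$(mktemp)
cat > "$sample" <<'EOF'
192.168.100.1:/share /mnt/share nfs4 rw,relatime,vers=4.2,proto=rdma,port=20049 0 0
EOF

if is_nfs_rdma_mount /mnt/share "$sample"; then
    echo "/mnt/share is mounted over RDMA"
else
    echo "/mnt/share is not mounted over RDMA"
fi

rm -f "$sample"
```

On a compute node, calling the function with only the local directory reads the live `/proc/mounts`.
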
+
+ **NIS and OpenLDAP both provide user management; configure only one of them.**
+
+ ### Configuration of NIS (Network Information Service)
+
+ - Configuration on the Management Node:
+
+ ```shell
+ $ sudo ./config_server.sh -nis [serverIP]
+ $ sudo /usr/lib/yp/ypinit -m
+ $ sudo make -C /var/yp
+ ```
+
+ - Configuration on the Compute Node:
+
+ ```shell
+ $ cd /home/nexus/podsys_manager
+ $ ./config_client.sh -nis [nis server ip]
+ $ sudo yptest
+ ```
+
+ ### Configuration of OpenLDAP
+
+ - Configuration on the Management Node:
+
+ ```shell
+ $ sudo ./config_server.sh -ldap [serverIP] [ldap-password]
+ ```
+
+ - Configuration on the Compute Node:
+
+ ```shell
+ $ sudo ./config_client.sh -ldap [serverIP] [ldap-password]
+ ```
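
Whichever directory service is chosen, one way to check the result on a compute node is to ask the name service switch whether an account resolves. The sketch below wraps `getent passwd` for that purpose. The helper name `user_is_visible` is hypothetical and not part of PODsys, and the check assumes the client configuration has added NIS or LDAP as a source for `passwd` lookups.

```shell
# Hypothetical helper (not part of PODsys): succeed if the account resolves
# through the name service switch (local files, NIS, or LDAP).
user_is_visible() {
    getent passwd "$1" >/dev/null 2>&1
}

# Demonstration with the local root account; on a compute node, substitute
# an account that exists only in NIS or OpenLDAP.
if user_is_visible root; then
    echo "root resolves"
else
    echo "root does not resolve"
fi
```
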