---
license: apache-2.0
---

# Overview

PODsys focuses on AI cluster deployment scenarios. It provides a complete toolchain covering infrastructure installation, environment deployment, user management, system monitoring, and resource scheduling, with the aim of creating an open-source, efficient, compatible, and easy-to-use deployment solution for intelligent cluster system environments.

To achieve this, PODsys integrates dozens of drivers, software packages, and other installation packages required for AI cluster deployment, and provides a range of scripting tools to simplify deployment. Using these tools, users can deploy the entire cluster with a few simple commands.

- Environment deployment and management: PODsys provides tools for quickly installing, configuring, and updating cluster environments. It also includes the operating system, NVIDIA drivers, InfiniBand drivers, and other necessary base software packages, giving users a complete GPU cluster environment. Users can manage cluster nodes, add or remove nodes, and monitor node status and performance with simple commands.
- User management and permission control: PODsys has a comprehensive user management and permission control mechanism. Administrators can create and manage user accounts and assign different permissions and resource quotas. This allows resources in the cluster to be allocated flexibly to each user or team while keeping the cluster secure.
- System monitoring and performance optimization: PODsys provides system monitoring and performance optimization capabilities that help users track the status and performance indicators of the cluster in real time. Through a visual interface, users can view cluster resource usage, job execution, and performance bottlenecks, so that they can adjust cluster configurations and optimize job performance in a timely manner.
- Resource scheduling and job management: PODsys provides efficient resource scheduling and job management functions that automatically schedule and manage jobs according to users' needs, to keep cluster resource utilization and job execution efficiency high.

# User Guide

Choose one machine in the cluster as the management node; the remaining machines serve as compute nodes. All PODsys deployment operations are performed on the management node.

## Installation Steps on the Management Node

### Installing OS through BMC

Install Ubuntu Server on the management node with the following settings:

```shell
version : 22.04.2
hostname : mu01
username : nexus
```

### Running install_manager.sh

Download podsys-v1.8.tgz, then run:

```shell
$ echo "dd594ca770a09af5654c999925cf9af2  podsys-v1.8.tgz" | md5sum --check
$ sudo tar -xzvf podsys-v1.8.tgz -C /home/nexus/
$ cd podsys
$ sudo ./install_manager.sh
$ sudo ./verify_installation.sh
```

### Generating iplist.txt

```shell
$ sudo vim /workspace/iplist.txt
```

Example of "iplist.txt":

| Serial Number | hostname | IP address   | gateway     | DNS             | IPoIB          |
| ------------- | -------- | ------------ | ----------- | --------------- | -------------- |
| 24XR20001     | node01   | 192.168.0.11 | 192.168.0.1 | 114.114.114.114 | 192.168.100.11 |
| 24XR20002     | node02   | 192.168.0.12 | 192.168.0.1 | 114.114.114.114 | 192.168.100.12 |

## Installation Steps on the Compute Node

**The installation steps for the compute nodes are executed on the management node.**

### Modifying config.yaml

```shell
$ vim workspace/config.yaml
```

Default contents are as follows:

```shell
manager_ip:192.168.0.11
manager_nic:enp61s0f0
compute_passwd:123
compute_storage:sda
```

- manager_ip: The IP address of the management node.
- manager_nic: The NIC identifier of the management node.
- compute_passwd: The user password for the compute nodes.
- compute_storage: The disk on which the compute node operating system is installed.
"Warning: This will overwrite the original data on the hard drive" ### Running install_compute.sh ```shell $ sudo ./install_compute.sh ``` Then you will enter a Docker terminal interface with a command prompt of "root@podsys:/$" Start the compute nodes without an installed operating system toinitiate automatic installation. If the compute nodes already have an operating system, they need to be placed in PXE (Preboot Execution Environment) mode. After the installation is complete, the compute nodes will shut down automatically. Once all compute nodes are powered off, type "exit" to finish the installation of the compute node. ## Configuration of Cluster Parallel Environments All following operations are performed on the management node. Before setting the relevant services, run the following commands: ```shell $ cd podsys $ sudo ./config_server.sh -pre ``` ### Configuration of NFSoRDMA (NFS over Remote Direct Memory Access) - Configuration on the Management Node: ```shell $ sudo ./config_server.sh -nfs [share directory] ``` - Configuration on the Compute Node: ```shell $ sudo ./config_client.sh -IPoIB $ sudo ./config_client.sh -nfs [serverIP] [share directory] [localdirectory] ``` **NIS and OpenLDAP are both used for user management, and you need to choose one of them for configuration.** ### Configuration of NIS (Network Information Service) - Configuration on the Management Node: ```shell $ sudo ./config_server.sh -nis [serverIP] $ sudo /usr/lib/yp/ypinit –m $ sudo make -C /var/yp ``` - Configuration on the Compute Node: ```shell $ cd /home/nexus/podsys_manager $ ./config_client.sh -nis [nis server ip] $ sudo yptest ``` ### Configuration of OpenLDAP - Configuration on the Management Node: ```shell $ sudo ./config_server.sh -ldap [serverIP] [ldap—password] ``` - Configuration on the Compute Node: ```shell $ sudo ./config_client.sh -ldap [serverIP] [ldap—password] ```