Some of the software I use requires Slurm, so I had to set up Slurm on my own machine.
Installing Slurm requires root privileges.
Configuring munge
wget https://github.com/dun/munge/releases/download/munge-0.5.14/munge-0.5.14.tar.xz
rpmbuild -tb --without verify munge-0.5.14.tar.xz
cd /root/rpmbuild/RPMS/x86_64
rpm -ivh munge-0.5.14-1.el7.x86_64.rpm \
munge-devel-0.5.14-1.el7.x86_64.rpm munge-libs-0.5.14-1.el7.x86_64.rpm
Create the key
sudo -u munge /usr/sbin/mungekey -v
# mungekey: Info: Created "/etc/munge/munge.key" with 1024-bit key
The generated munge.key file must be distributed to all compute nodes.
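One way to distribute the key is to copy it over SSH and then fix ownership and permissions on each node. The sketch below is only an illustration under several assumptions: the host names node01 and node02 are hypothetical, root SSH access to the nodes is assumed, and the helper names run and distribute_key are made up for this example. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Sketch: push /etc/munge/munge.key to the compute nodes.
# NODES holds hypothetical host names; replace them with your own.
NODES="node01 node02"
KEY=/etc/munge/munge.key
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

distribute_key() {
  local node
  for node in $NODES; do
    run scp -p "$KEY" "root@${node}:${KEY}"
    # The key should be readable only by the munge user on every node.
    run ssh "root@${node}" "chown munge:munge ${KEY} && chmod 600 ${KEY}"
  done
}

distribute_key
```

Running the script with DRY_RUN=0 performs the copy for real; munge on each node has to be restarted afterwards so that it picks up the new key.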
Start the daemon
systemctl enable munge
systemctl start munge
# Check the status
systemctl status munge
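Beyond systemctl status, a credential round trip confirms that munge can actually encode and decode with the key. The munge -n | unmunge pipeline is the standard local check; the wrapper function check_munge is a hypothetical helper added here so the script also behaves sensibly on a machine without munge.

```shell
#!/bin/bash
# Sketch: verify that munge can create and decode a credential locally.
check_munge() {
  if ! command -v munge >/dev/null 2>&1; then
    echo "munge is not installed"
    return 2
  fi
  # Encode an empty payload and decode it again; this fails when munged
  # is not running or the key cannot be used.
  if munge -n | unmunge >/dev/null 2>&1; then
    echo "munge credential round trip OK"
  else
    echo "munge credential round trip FAILED"
    return 1
  fi
}

check_munge || true
```

With several nodes, the same idea can be applied across hosts, for example munge -n | ssh node01 unmunge, where node01 is a placeholder host name.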
Method 1: Install from RPM
Download page: https://www.schedmd.com/downloads.php
Because the system is CentOS 7, I downloaded version 19.05; version 20.11 may no longer support Python 2.
wget https://download.schedmd.com/slurm/slurm-19.05.8.tar.bz2
yum install pam-devel perl-Switch -y
rpmbuild -ta slurm-19.05.8.tar.bz2
cd /root/rpmbuild/RPMS/x86_64
rpm --install slurm-*.rpm
Create the slurm user
adduser slurm
Create the configuration file (this step is critical)
mkdir -p /etc/slurm
touch /etc/slurm/slurm.conf
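The contents of /etc/slurm/slurm.conf are generated with https://slurm.schedmd.com/configurator.html. The following options need to be set:
- SlurmctldHost: the output of hostname -f
- NodeName: also the output of hostname -f, but run on the compute node; with a single host it is the same value as SlurmctldHost
- ComputeNodeAddress: the IP address of the compute node; leave it empty when there is only one node
- PartitionName: the partition name; change it to batch
- CPUs: leave empty
- CoresPerSocket: the actual number of physical CPU cores, for example 96
- ThreadsPerCore: set to 2 if hyper-threading is enabled
- RealMemory: the server's memory size, in MB
- SlurmUser: Slurm requires a dedicated user; use the slurm user created above
- StateSaveLocation: must be changed to /var/spool/slurmd, otherwise permission errors occur
Finally, one more line is needed: CgroupMountpoint=/sys/fs/cgroup. It goes into /etc/slurm/cgroup.conf, as shown in the troubleshooting section.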
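To make the options above more concrete, here is a minimal sketch that prints the corresponding part of a slurm.conf for a single-node setup. Everything in it is an assumption for illustration: the helper name render_slurm_fragment, the host name node01.example.com, and the hardware numbers are placeholders, and the file produced by the configurator contains many more settings than these five lines.

```shell
#!/bin/bash
# Sketch: print the host, node and partition lines of slurm.conf.
# Arguments: host name, cores per socket, threads per core, memory in MB.
# All values passed below are placeholders, not measured settings.
render_slurm_fragment() {
  local host="$1" cores="$2" threads="$3" mem_mb="$4"
  cat <<EOF
SlurmctldHost=${host}
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmd
NodeName=${host} CoresPerSocket=${cores} ThreadsPerCore=${threads} RealMemory=${mem_mb} State=UNKNOWN
PartitionName=batch Nodes=${host} Default=YES MaxTime=INFINITE State=UP
EOF
}

# Example call with hypothetical values for a single-node machine.
render_slurm_fragment node01.example.com 96 2 257000
```

The output is meant to be compared against the generated file rather than used in place of it.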
Start the slurmctld and slurmd daemons
# Control node
systemctl enable slurmctld
systemctl start slurmctld
systemctl status slurmctld
# Compute node
systemctl enable slurmd
systemctl start slurmd
systemctl status slurmd
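Method 2: Via the OpenHPC repository
Testing the installation
Once the installation is finished, create a file named test.sbatch with the following contents to test it: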
#!/bin/bash
#SBATCH -J test # Job name
#SBATCH -o job.%j.out # Name of stdout output file (%j expands to %jobId)
#SBATCH -N 1 # Total number of nodes requested
#SBATCH -n 2 # Total number of MPI tasks requested
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
# Launch MPI-based executable
echo "Test output from Slurm Testjob"
NODEFILE=$(generate_pbs_nodefile)
cat "$NODEFILE"
sleep 20
Submit the job
sbatch ./test.sbatch
# Submitted batch job 2
Check the status
squeue
If a job.X.out file is produced, Slurm has been configured successfully.
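The submit, poll and inspect steps can also be scripted. The following is a rough sketch, not a polished tool: the function names parse_job_id, job_output_file and submit_and_wait are hypothetical, and it assumes the job writes to job.<jobid>.out in the current directory, as the -o line in test.sbatch requests. It relies on sbatch --parsable, which prints only the job ID (optionally followed by a semicolon and the cluster name), and on squeue -h -j to poll a single job.

```shell
#!/bin/bash
# sbatch --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
parse_job_id() {
  echo "${1%%;*}"
}

# Mirrors the "#SBATCH -o job.%j.out" line in test.sbatch.
job_output_file() {
  echo "job.$1.out"
}

submit_and_wait() {
  local script="$1" raw jobid
  raw=$(sbatch --parsable "$script") || return 1
  jobid=$(parse_job_id "$raw")
  [ -n "$jobid" ] || return 1
  # squeue prints a line for the job while it is pending or running.
  while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
    sleep 5
  done
  cat "$(job_output_file "$jobid")"
}

if command -v sbatch >/dev/null 2>&1 && [ -f ./test.sbatch ]; then
  submit_and_wait ./test.sbatch
else
  echo "sbatch or ./test.sbatch not found; nothing submitted"
fi
```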
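Possible errors and solutions
When running rpm --install, you may hit the following error. It means the Perl Switch module is missing; install it with yum install perl-Switch.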
error: Failed dependencies:
perl(Switch) is needed by slurm-openlava-19.05.8-1.el7.x86_64
perl(Switch) is needed by slurm-torque-19.05.8-1.el7.x86_64
The slurmd daemon fails to start
# systemctl start slurmd
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
Running systemctl status slurmd.service as suggested shows the following errors:
error: Node configuration differs from hardware: Procs=1:192(hw) Boards=1:1(hw) SocketsPerBoard=1:4(hw) ...e=1:2(hw)
error: cgroup namespace 'freezer' not mounted. aborting
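The first error is caused by incorrect hardware information entered under "Compute Machines" in https://slurm.schedmd.com/configurator.html.
The second error occurs because the default cgroup configuration does not work on this system; apply the following change: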
echo CgroupMountpoint=/sys/fs/cgroup >> /etc/slurm/cgroup.conf
Reference: https://stackoverflow.com/questions/62641323/error-cgroup-namespace-freezer-not-mounted-aborting
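For the first error, slurmd -C is a convenient cross-check: it prints the hardware layout that slurmd detects on the local machine in slurm.conf syntax, so it can be compared with the configured NodeName line. A minimal sketch, assuming the configuration lives at /etc/slurm/slurm.conf as above; the function name show_node_mismatch is hypothetical.

```shell
#!/bin/bash
# Sketch: show detected vs. configured node hardware side by side.
CONF="${CONF:-/etc/slurm/slurm.conf}"

show_node_mismatch() {
  if ! command -v slurmd >/dev/null 2>&1; then
    echo "slurmd is not installed"
    return 2
  fi
  echo "Detected hardware:"
  # slurmd -C prints a NodeName=... line describing the local hardware.
  slurmd -C | grep '^NodeName='
  echo "Configured in ${CONF}:"
  grep '^NodeName=' "$CONF"
}

show_node_mismatch || true
```

If the two lines disagree on values such as CoresPerSocket or ThreadsPerCore, adjusting the NodeName line in slurm.conf to match the detected values and restarting slurmd is the usual next step.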
References
Configuring Slurm: https://slurm.schedmd.com/configurator.html
Single-node Slurm: http://docs.nanomatch.de/technical/SimStackRequirements/SingleNodeSlurm.html
munge configuration: https://github.com/dun/munge/wiki/Installation-Guide
Installing and using Slurm: http://wiki.casjc.com/?p=378