DGX Cluster
The Nvidia DGX cluster is a Slurm-based high-performance computing cluster for GPU-intensive jobs, mainly machine learning training tasks. The cluster consists of 5 Nvidia DGX H100 compute nodes providing a total of 1120 CPUs, 10 TB of RAM, and 40 Nvidia H100 GPUs, each with 80 GB of VRAM.
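To make the aggregate figures easier to relate to a single job request, the short Python sketch below works out the per-node breakdown. It assumes the five nodes are identically configured, which the description implies but does not state outright; the variable names are illustrative only.

```python
# Aggregate figures as stated in the cluster description.
NODES = 5
TOTAL_CPUS = 1120
TOTAL_RAM_TB = 10
TOTAL_GPUS = 40
VRAM_PER_GPU_GB = 80

# Assumption: all nodes are identical, so the totals divide evenly.
cpus_per_node = TOTAL_CPUS // NODES
ram_per_node_tb = TOTAL_RAM_TB / NODES
gpus_per_node = TOTAL_GPUS // NODES

# Total GPU memory across the whole cluster.
total_vram_gb = TOTAL_GPUS * VRAM_PER_GPU_GB

print(f"Per node: {cpus_per_node} CPUs, {ram_per_node_tb} TB RAM, {gpus_per_node} GPUs")
print(f"Cluster-wide VRAM: {total_vram_gb} GB")
```

Under that assumption, a job confined to one node can use at most 8 GPUs; anything larger would have to span multiple nodes.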
Our role is to manage the cluster. We are currently onboarding new users, setting up additional storage resources, and building an accounting system to track each user's resource usage.