Last modified: October 06 2023.
Contact: hpchelp [at] iitd.ac.in

HPC DOWNTIME UPDATES


Dear All,

Access to the HPC facility will tentatively be restored at approximately 5 pm today.

The following are the major changes:

1) The file system has been changed.

HOME: 1.2PB (total); Default home quota is now 100GB, expandable to 2TB as before.

SCRATCH: 4.5PB (total); Default scratch quota is now 25TB. This can be increased on a case-by-case basis upon approval by the HPC Steering Committee.

All active HPC users' HOME data has been transferred.

All scratch data of users with less than 5TB of scratch usage has been transferred to the new file system. All other users will need to transfer their data from the previous scratch file system (/scratchsfa) to the new file system. The "serial" queue is reserved for data-transfer jobs; these jobs draw on the "low" queue budget, with a limit of one job of 4 CPUs per user and a maximum walltime of 168:00:00. If the data is not fully transferred within one week, the process needs to be repeated.
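As an illustration only, a minimal transfer job for the serial queue might look like the sketch below. The queue name, CPU count and walltime limit come from this notice; the directory name "mydata" and the path /new/scratch/path are placeholders to be replaced with your own source directory and your directory on the new scratch file system. The ready-made migration script described further down this page is the recommended route.

#!/bin/bash
#PBS -N scratch_transfer
#PBS -P [Your departmental code]
#PBS -q serial
#PBS -l select=1:ncpus=4
#PBS -l walltime=168:00:00
# Copy one directory from the old scratch ($SCRATCHSFA) to the new scratch.
# "mydata" and /new/scratch/path are placeholders; adjust to your own paths.
rsync -a $SCRATCHSFA/mydata/ /new/scratch/path/mydata/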

Users can transfer their entire old scratch data to the new scratch (limit of 25TB) by following the instructions on the HPC website (https://supercomputing.iitd.ac.in).

Users can transfer data out of old scratch from klogin1.hpc.iitd.ac.in or klogin2.hpc.iitd.ac.in via rsync/scp/sftp etc. Old scratch is available on /scratchsfa (environment variable $SCRATCHSFA). We recommend using rsync wherever possible. User who have more than 25TB of SCRATCH data need to transfer only the required data to the current file system.
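For example, from klogin1 or klogin2 one might first check usage and then copy only what is needed (the directory name is illustrative, and /new/scratch/path stands for your directory on the new scratch file system):

du -sh $SCRATCHSFA/*

rsync -avP $SCRATCHSFA/needed_dir/ /new/scratch/path/needed_dir/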

All production nodes in the standard/low/high and other queues, except the serial queue, have only the new file system. The serial queue facility will be available till 31st October 2023; thereafter, the old scratch file system will not be available. All users should ensure that all HOME and SCRATCH data has been successfully transferred before 31st October 2023.

2) Jobs: all queued jobs have been put on user hold. Users can release jobs as required using qrls after confirming the job working directory and required files (see the example below).

3) Nodes: All Icelake and Skylake nodes are available for use. Most Haswell nodes are also available. School of AI nodes: 3 nodes are available. An additional flag "interlink" is available to choose the "nvlink"-based nodes or the "pci"-based nodes, e.g. select=1:ncpus=2:ngpus=2:interlink=nvlink:centos=amdepyc. N.B. node names have been changed to scai0[1-3].
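For example, held jobs can be listed and then released one at a time (the job ID is a placeholder; held jobs appear with state 'H' in qstat output):

qstat -u $USER

qrls [job id]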

List of users having 5TB or more of data in SCRATCH

Please click on this link to see the list: Users_List


To migrate data from the old storage to the new storage, please copy the PBS script using the command below and submit the job with your departmental code:


cp /home/apps/skeleton/migration_pbs.sh $HOME

qsub -P [Your departmental code] migration_pbs.sh
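The migration job can then be monitored with the usual PBS commands, for example:

qstat -u $USER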

PADUM: Hybrid High Performance Computing Facility at IITD

How to use and set up the environment.

You will need an SSH client to connect to the cluster using your IITD Kerberos credentials. CPU login is available through ssh to hpc.iitd.ac.in (use your IITD credentials). To copy data, use scp to hpc.iitd.ac.in. GPU or MIC (Xeon Phi) nodes can be accessed directly through gpu.hpc.iitd.ac.in and mic.hpc.iitd.ac.in respectively. Please avoid using the gpu and mic login nodes for large data transfers.
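For example (replace [kerberos id] with your IITD Kerberos user name; the file name is illustrative): the first command opens a CPU login session, the second copies a local file to your HOME, and the third opens a GPU login session.

ssh [kerberos id]@hpc.iitd.ac.in

scp myfile.tar.gz [kerberos id]@hpc.iitd.ac.in:~/

ssh [kerberos id]@gpu.hpc.iitd.ac.in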

Once logged in to the system, you have access to your home (backed up) and scratch (not backed up) directories. Please generate an SSH key pair in your .ssh directory to start using PBS (see the example below). Please report issues to hpchelp[@]iitd.ac.in.
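A typical way to generate the key pair is shown below. The step of appending the public key to authorized_keys is an assumption about what PBS needs here (passwordless access between cluster nodes); skip it if your setup already works.

ssh-keygen -t rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys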

 

Hardware Specifications

  • Icelake Nodes (December 2022 onward):
  • Total number of nodes: 32
    CPU nodes: 16 (256GB RAM)
    GPU nodes: 16 (512GB RAM)
    (with two Nvidia A100 cards on each GPU node)

  • Basic Configuration:
    GPU: NVIDIA A100 (40GB)
    CPU: 2x Intel Xeon 2680 (32 cores 2.6 GHz) "Icelake"
  • All connected over 200G HDR Infiniband.

  • Skylake Nodes (July, 2019 onward):
  • Total number of nodes: 184
    CPU nodes: 144
    GPU nodes: 40
    (with one Nvidia V100 card each: 17 nodes)
    (with two Nvidia V100 cards each: 23 nodes)

  • Basic Configuration:
    GPU: NVIDIA V100 (32GB 5120 CUDA cores)
    CPU: 2x Intel Xeon G-6148 (20 cores 2.4 GHz) "Skylake"
    RAM: 96GB
  • 8 CPU nodes and 40 GPU nodes have 192 GB RAM each



  • Haswell Nodes (November, 2015 onward):
  • Total number of compute nodes: 420
    CPU nodes: 259
    GPU accelerated nodes: 161
  • Basic configuration:
    GPU: 2x NVIDIA K40 (12GB, 2880 CUDA cores)
    [Please contact hpchelp[@]iitd.ac.in]
    CPU: 2x E5-2680 v3 2.5GHz/12-Core "Haswell"
    RAM: 62 GB

  • 12 CPU and 8 GPU nodes have 500 GB RAM each


  • The cluster can be accessed through 4 general login nodes and 2 GPU login nodes.


  • Storage:
    HOME space 1.2 PB
    SCRATCH space 4.5 PB


  • Primary network: Fully non-blocking FDR 56Gbps InfiniBand. Latency ~700ns.