GridKa Technical Overview
High-Throughput Compute Farm
The GridKa compute farm consists of approximately 860 compute nodes (2022). The setup is optimized for the independent high-throughput jobs common in high-energy physics and astroparticle physics computing, which do not require low-latency interconnects between the compute nodes. GPUs are available for R&D and production compute jobs. The HTCondor batch systems manage ~60,000 logical CPU cores and 56 GPUs (2022).
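To put the farm figures in proportion, the short Python sketch below derives two averages from the numbers quoted above. The variable names are illustrative, and the results are averages only: the text does not say how cores or GPUs are distributed across individual nodes.

```python
# Figures quoted in the text (2022). Averages only; actual node sizes
# and the placement of GPUs are not stated in the source.
NODES = 860             # "approximately 860 compute nodes"
LOGICAL_CORES = 60_000  # "~60,000 logical CPU cores"
GPUS = 56

# Average number of logical cores per compute node.
cores_per_node = LOGICAL_CORES / NODES

# Ratio of logical CPU cores to GPUs across the whole farm.
cores_per_gpu = LOGICAL_CORES / GPUS

print(f"average logical cores per node: {cores_per_node:.1f}")
print(f"logical cores per GPU:          {cores_per_gpu:.0f}")
```

The ratio of roughly a thousand CPU cores per GPU fits the description of a farm built mainly for independent CPU jobs, with GPUs as a smaller additional resource.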
To serve as a data hub for the Worldwide LHC Computing Grid, GridKa operates a large software-defined online storage installation. Based on IBM Spectrum Scale™ with an internal InfiniBand network, the GridKa online storage is highly scalable in capacity and performance. User access is provided through the dCache and XRootD middleware. As of 2021, ~52 PB with a total throughput of 120 GB/s were available to users.
The GridKa offline storage system provides the capacity for efficient long-term storage of the experiments' raw data. It will provide almost 100 PB of capacity for the four LHC experiments and Belle II. We started out using IBM Spectrum Protect™ with an Oracle SL8500 library and T10KD drives. Since enterprise drives in Oracle libraries are no longer supported, we have started migrating to a Spectra Logic TFinity® library with TS1160 drives and, at the same time, are changing the software layer to the High Performance Storage System (HPSS).
High-bandwidth wide area network (WAN) connections are essential to receive data directly from CERN and to transfer data to and from other WLCG centers across the globe. Two 100 Gbit/s connections to CERN and two 100 Gbit/s connections to the internet allow GridKa to cope with the data rates expected during LHC Run 3.
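As a rough illustration of what these link speeds mean for data volumes, the following Python sketch converts the quoted line rates into bytes per second and estimates an idealized transfer time. It assumes decimal units (1 PB = 10^15 bytes), that both CERN links can carry traffic simultaneously at full line rate, and no protocol overhead. None of these assumptions come from the text, and the function names are hypothetical.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


def transfer_hours(petabytes: float, gbit_per_s: float) -> float:
    """Idealized time in hours to move `petabytes` at `gbit_per_s`.

    Assumes decimal units and a fully utilized link with no overhead.
    """
    total_bytes = petabytes * 1e15
    bytes_per_s = gbit_per_s * 1e9 / 8
    return total_bytes / bytes_per_s / 3600


# Figures from the text: two 100 Gbit/s connections to CERN.
CERN_LINKS = 2
LINK_GBIT_PER_S = 100

# Assumption: both links are usable at the same time.
cern_total_gbit = CERN_LINKS * LINK_GBIT_PER_S

print(f"aggregate CERN bandwidth: {gbit_to_gbyte_per_s(cern_total_gbit):.1f} GB/s")
print(f"ideal time to move 1 PB:  {transfer_hours(1, cern_total_gbit):.1f} h")
```

Under these assumptions the two CERN links together carry about 25 GB/s, so moving one petabyte takes roughly 11 hours at best. Real transfers will be slower because of protocol overhead and shared use of the links.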
The internal network backbone connects the online storage system, management servers, and compute nodes.