DOE has decided to use AMD’s Epyc CPUs and Radeon Instinct GPUs for its El Capitan supercomputer at Lawrence Livermore National Lab!
To review, DOE, under the CORAL-2 Procurement, was to acquire three exascale supercomputers.
Aurora (1 exaflops) was awarded to Intel for Argonne National Lab, detailed here.
Frontier (1.5 exaflops) was awarded to HPE/Cray, using AMD’s custom Milan CPU plus Radeon Instinct, for Oak Ridge National Lab, detailed here. I have also written about it, here.
Now DOE has selected AMD yet again for the third and final machine, El Capitan (2 exaflops), using AMD’s 4th-generation Epyc, code-named Genoa, plus Radeon Instinct, detailed here.
The current top two supercomputers in the world, Summit (Oak Ridge) and Sierra (Livermore), both use IBM’s Power9 CPU, Nvidia’s Volta GPU and the InfiniBand interconnect from Mellanox (now owned by Nvidia). It was actually somewhat expected that El Capitan would go to Nvidia. Yet for the next-generation exascale supercomputers, IBM and Nvidia are completely shut out!
As this Tom’s Hardware article noted:
“As such, it’s telling that the DOE selected AMD’s next-gen platforms, as it highlights that its next-gen products are more suitable for the project than either Intel or Nvidia’s future offerings. It’s also noteworthy that the system has a particular focus on AI and machine learning workloads.”
Nvidia famously bought Mellanox ($6.9 B) in 2019 for its interconnect technology. Yet all three exascale machines are now going to companies that can provide BOTH CPUs and GPUs, namely AMD and Intel. It seems quite clear that for this kind of performance leap (ten times over the current machines), the CPU and GPU must work together in great synergy. AMD will be offering its Infinity Fabric 2 for Frontier, and IF 3 for El Capitan. In short, AMD’s Infinity Fabric and Cray’s Slingshot networking have won out over Nvidia/Mellanox! In addition, Intel is still working on its Xe GPU, and if it falters, as is rumored, Aurora may yet go to AMD.
More importantly, these projects also provide $100 M each for software development. While Nvidia uses its proprietary CUDA, AMD has embraced ROCm, which is open source. As these supercomputers use ROCm, any further refinement of ROCm bodes well for AMD. AMD has also emphasized that CUDA code can be easily ported over to ROCm, so there is no reason to be held hostage to Nvidia’s hardware. In short, AMD stands to gradually break through Nvidia’s tight hold on the data center GPU and machine learning/AI markets.
In summary, this is a great win for AMD. That DOE is willing to select AMD over Intel/IBM/Nvidia is an excellent affirmation of AMD’s future products (Milan and Genoa) and its road map. All this will trickle down to the data centers over the next few years.