Examining Power Efficiency in Rack-Scale AI Systems

Ally Winning, European Editor, PSD


A talk with Ajith Jain, Global Vice President High Performance Computing Business, Vicor, on changing power demands in HPC systems.




Can you explain how and why power demands are changing for today’s high performance computing (HPC) installations?

Semiconductor process geometries are advancing rapidly, and we are now working with loads designed on 7nm all the way down to 3nm process nodes. As process nodes shrink, the current density (Ion/W) increases and the supply voltage drops accordingly. For example, core rails of 0.7V down to 0.5V have become commonplace these days, with steady-state currents on ASIC core rails exceeding 1kA and associated peak currents of ~2kA for short durations.
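To put rough numbers on that relationship (the core power figure below is an illustrative assumption, not from the interview): for a fixed core power, lowering the rail voltage pushes the rail current up in direct proportion.

```python
# Illustrative only: rail current for a fixed core power at two supply voltages.
# CORE_POWER is an assumed example figure, not tied to any specific ASIC.

CORE_POWER = 500.0  # W, assumed ASIC core power

for v_core in (0.7, 0.5):
    i_core = CORE_POWER / v_core  # I = P / V
    print(f"{v_core} V rail: {i_core:.0f} A steady-state")
```

At 0.7V this hypothetical core draws roughly 714A; at 0.5V the same power requires 1,000A, which is why kiloamp-class core rails now appear at these nodes.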

Lately, HPC designs fuelled by AI/ChatGPT are being built with more GPUs, high-speed interconnect devices, and high-bandwidth memory (HBM) banks, significantly increasing the power consumption of the compute blade. We have seen power increase by around 2.5 times on a two-socket compute blade over the past decade, from 1kW to 2.5kW per blade. This increase leads to a great deal of input current, especially if the distributed voltage rails are below 48V (12V being the popular choice). These system designs also call for unique, high-efficiency thermal management systems that remove the dissipated heat while remaining cost effective. Overall, HPC installations for the AI world have been a complicated problem to solve compared with the more straightforward hyperscale data centres.
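The input-current penalty of a 12V bus can be sketched quickly. Using the 2.5kW blade figure above and an assumed lumped distribution-path resistance (illustrative, not a measured value), the current and I²R loss at 12V versus 48V are:

```python
# Illustrative: bus current and distribution I^2*R loss for a 2.5 kW blade.
# PATH_R is an assumed lumped bus-bar/connector resistance, for illustration only.

def bus_current(power_w, bus_v):
    """Current drawn from the distribution bus: I = P / V."""
    return power_w / bus_v

def distribution_loss(power_w, bus_v, path_r_ohm):
    """Loss dissipated in the distribution path: I^2 * R."""
    i = bus_current(power_w, bus_v)
    return i**2 * path_r_ohm

BLADE_POWER = 2500.0  # W, per the ~2.5 kW blade figure above
PATH_R = 0.002        # ohm, assumed distribution-path resistance

for v in (12.0, 48.0):
    i = bus_current(BLADE_POWER, v)
    loss = distribution_loss(BLADE_POWER, v, PATH_R)
    print(f"{v:>4.0f} V bus: {i:6.1f} A, {loss:5.1f} W lost in distribution")
```

Moving from 12V to 48V cuts the bus current by a factor of four and, because loss scales with I², the distribution loss by a factor of sixteen for the same path resistance.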


What are the issues and challenges in supplying this power?


HPC designs have started using more GPUs, high-speed interconnect devices, and large banks of high-bandwidth memory (HBM), all of which consume a lot of power. It becomes particularly hard to power these loads when they draw large currents on a single aggregated voltage domain/rail, approaching a Thermal Design Current (TDC) of more than 1kA with peak currents of around 2kA.

One solution the industry has tried is to divide the compute elements into chiplets, thereby spreading the main rail current across four or more compute elements. There too, owing to higher transistor density and finer geometries, currents have risen to levels where conventional PWM-based multiphase synchronous buck converters have started to run into Power Delivery Network (PDN) issues.

Most applications have also seen an increase in HBM capacity. It is commonplace to use multiple banks of HBM, with rail currents topping 220A. Signal integrity and PDN issues are critical when powering HBM rails, which poses challenges for PCB layout and the placement of power components.

High-speed interconnect and network connectivity is another high-current rail, in the realm of 1kA TDC, and it is also sensitive to noise. The signal integrity requirements of this rail dictate strict layout rules for power components. In most cases, this means power components cannot be placed around the periphery of the ASIC on the top side of the PCB; vertical power delivery techniques are better suited to this.

What architectures or technologies can be used to meet these challenges?


Managing the PDN is a challenge in HPC applications, requiring unique power delivery techniques that minimise I²R losses and so allow maximum power to be delivered from the power converter to the load. The most commonly used lateral power delivery methods pose several challenges, including higher PDN impedance, signal integrity issues for noise-sensitive loads (HBM and networking/high-speed interconnects), and dv/dt issues for the ASIC. These issues can be addressed using a combination of lateral and vertical power delivery, or pure-play vertical power delivery techniques.
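The I²R argument can be made concrete with assumed PDN resistances (the figures below are illustrative sketches, not measured lateral or vertical values): at a 1kA TDC, even a few hundred micro-ohms in the lateral path dissipates substantial power, and shortening the path vertically shrinks that loss in proportion to the resistance.

```python
# Illustrative: I^2*R loss on a 1 kA core rail for two assumed PDN resistances.
# The resistance values are hypothetical examples, not Vicor measurements.

def pdn_loss(current_a, pdn_r_ohm):
    """Power dissipated in the PDN: I^2 * R."""
    return current_a**2 * pdn_r_ohm

I_TDC = 1000.0        # A, core-rail thermal design current from the text
R_LATERAL = 200e-6    # ohm, assumed lateral path across the PCB
R_VERTICAL = 25e-6    # ohm, assumed short vertical path under the ASIC

print(f"Lateral : {pdn_loss(I_TDC, R_LATERAL):.0f} W lost in the PDN")
print(f"Vertical: {pdn_loss(I_TDC, R_VERTICAL):.0f} W lost in the PDN")
```

Under these assumptions, the lateral path burns about 200W in the PDN alone, versus roughly 25W for the vertical path, which is why shortening the last-inch delivery path matters so much at kiloamp currents.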


What benefits do these technologies bring to the servers and their users?

These techniques offer a number of benefits. One of the major ones is reduced PDN complexity, which lowers cost and improves reliability through the use of fewer parts. They also deliver higher signal integrity and better bandwidth for interconnect and networking loads, improving the accuracy of transmitted data. Additionally, they enable the highest possible current density, ensuring the right amount of power is available in the smallest possible space.


What solutions does Vicor offer to meet these needs?

Vicor offers a range of lateral power delivery solutions with current multipliers that can be placed in close proximity to loads without fear of signal integrity issues. It also offers combined lateral and vertical power delivery solutions that use the best of both worlds, delivering almost a factor-of-eight reduction in PDN impedance while also offering higher current density compared with pure-play lateral-only designs.

Vertical power delivery is the holy grail for all high-current ASICs, including GPUs, NPUs and other devices serving AI/ML workloads, which cannot afford to have power components sprinkled around the periphery of the ASIC. This delivery technique, combined with Vicor's high-density current multipliers, delivers the highest current density in the industry, as well as offering the highest signal integrity, a crucial factor for these workloads.