It turns out that if you completely uproot the way data centers have been built for the past decade, there are going to be some growing pains. As much as the headlines are all about the rise of AI, the reality on the ground is a lot of headaches.
When speaking to systems integrators and others who scale up large compute systems, we hear a continuous stream of grumbling about the difficulty of getting large GPU clusters up and running.
The chief complaint we hear concerns liquid cooling. GPU systems run hot, with individual GPUs drawing on the order of 1,000 watts and plans for them to reach 2,000 watts in the not so distant future. Traditional air cooling will not suffice here, which has driven widespread adoption of liquid cooling, and with it a surge in the stock price of companies like Vertiv that deploy those systems. But liquid cooling is new technology for the data center, and there are not enough people familiar with installing it, with the result that liquid cooling is currently the number one failure item in data centers. There are all kinds of reasons for this, but they basically boil down to the fact that water and electronics do not mix well. The industry will sort this out eventually, but it is a good example of the growing pains we are seeing across data centers.
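The arithmetic behind the cooling problem is easy to sketch. The per-rack GPU count, overhead factor, and air-cooling ceiling below are illustrative assumptions, not figures from any vendor:

```python
# Back-of-envelope rack power estimate. All constants here are
# illustrative assumptions, not vendor specifications.
GPU_WATTS = 1_000        # per-GPU draw in the ballpark discussed above
GPUS_PER_RACK = 72       # a dense rack-scale design (assumption)
OVERHEAD = 1.3           # CPUs, NICs, fans, power conversion (assumption)

rack_kw = GPU_WATTS * GPUS_PER_RACK * OVERHEAD / 1_000
print(f"Estimated rack load: {rack_kw:.0f} kW")

# Traditional air cooling is often said to top out somewhere in the
# tens of kilowatts per rack (rule of thumb, assumption here).
AIR_COOLING_LIMIT_KW = 40
print("Exceeds air-cooling ceiling:", rack_kw > AIR_COOLING_LIMIT_KW)
```

Even with generous assumptions, a dense GPU rack lands several times above what air can plausibly carry away, which is why the industry had little choice but to move to liquid.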
There are also all kinds of problems configuring GPUs. Again, this is not surprising: most data center professionals have an arsenal of experience to guide them in configuring CPUs, but for most of them GPUs are new territory. On top of that, Nvidia tends to sell complete designs, which introduces a whole other range of complications. For instance, the firmware and BIOS systems for Nvidia hardware are not entirely new, but they are just different and underdeveloped enough to cause delays and more than the usual number of bugs. Then factor in Nvidia's networking layer, and it is easy to see how frustrating the whole process has become. There are just a lot of new things for these professionals to learn in a very short time frame.
In the grand scheme of things, all of this is just a speed bump. None of these are serious problems, and no one is going to cancel AI because of them. That being said, in the near term we expect these problems to become more pronounced and more high profile. There are going to be hyperscalers who delay or slow down their GPU roll-outs to work out these wrinkles. Or, to be more precise, we are going to hear more about these delays, because they have already started.