NVIDIA at Hot Chips: And One Ring to Rule Them All

NVIDIA’s Blackwell platform made its major debut at Hot Chips this month, and it is just short of amazing. What is most fascinating about it is that it is somewhat similar to what AMD did with Threadripper. Building something this powerful is extremely risky because it falls outside the performance and power envelopes of any other data center part.

Many folks argue that it isn’t power efficient because it runs too hot. But NVIDIA has addressed this with warm water-cooling, which Lenovo has been using for several years as a showcase for more efficient, effective cooling. The systems are typically sealed so that, other than the initial filling and minor losses due to leaks, they don’t use much water, and certainly far less than chilled systems use. While Blackwell is very power hungry, the amount of work it does compared to the alternatives makes it power efficient, though the power delivery systems in the related data center may need to be upgraded depending on part density.

Let’s talk Blackwell this week and why parts like this are so unusual.

Why Parts Like Blackwell Are So Unusual

I posted on Facebook the other day about the Chevrolet Corvair and how, when Ralph Nader came out with the book Unsafe at Any Speed, it effectively killed the car. Even after the Corvair’s safety problems had been addressed, car companies were unwilling to take chances on extremely innovative designs.

When Ford introduced the Edsel about a decade earlier, it met the same fate because its design had more in common with cars that came out a decade later. These instances slowed automotive innovation significantly.

Blackwell is the most advanced and innovative product of its type to hit the market, but innovation like this comes with so much risk that most CEOs are unwilling to chance it. For instance, even though AMD was massively successful with the business version of Threadripper, it has never attempted to repeat that effort with its GPU lines, so it was NVIDIA, not AMD, that came out with Blackwell. 

Microsoft had the same experience with Windows 95. It was a massively successful launch, but it created so many support problems that Microsoft never attempted another launch like it, and subsequent innovations like Windows Vista and Windows 8 failed spectacularly.

This is one of NVIDIA CEO Jensen Huang’s primary advantages. He is willing to take these risks and properly fund his efforts. This willingness led to NVIDIA’s surprise success with AI, and now with Blackwell. No other firm appears willing to take similar risks, even though Huang’s risks have clearly paid off spectacularly.

Performance Results

Initial test reports of 30x performance increases from Blackwell systems are starting to come out, and the platform’s benefits are becoming more obvious. Interestingly, just as Lenovo found when it implemented warm water-cooling in its customers’ data centers, these data centers are quieter and a far better environment to work in. Cooling requirements decline because the cooling shifts from the HVAC system used for air-cooling to the warm water-cooling system, which doesn’t require fans. Granted, the power supplies still have fans, as do the networking components, but only for now, since I expect some of these may become warm water-cooled eventually to keep the racks consistent. But the noise level is night-and-day better, which has to be a benefit for the poor folks working in them.

At the Hot Chips conference, Ali Heydari, the director of data center cooling and infrastructure at NVIDIA (wonder how he fits that title on a business card!), presented advanced designs that are capable of moving a data center from air to warm water-cooling efficiently while maximizing both the cooling and energy savings from this approach. This also showcases one of NVIDIA’s largest advantages, Omniverse, which allows the creation of photorealistic and interoperable simulations of these data center designs so they can be rapidly and cheaply iterated until the design is fully optimized. This is a huge advantage for NVIDIA and any of its customers who use this technology.

Wrapping Up

Blackwell is an amazing product. What’s troubling is that few other companies are willing to take the kind of risks that create business dominance. That unwillingness is one of the major problems facing U.S. companies in general. You see it in the automotive industry in the contrast between Hyundai, Porsche and U.S. automakers. You saw it when Apple brought out the iPhone and kicked everyone’s butt, and you see it now with Blackwell. But these are the exceptions to our current behavior, which is extremely risk averse.

If the country wants to remain competitive, making bets like Blackwell should be encouraged so they are more the norm. Otherwise, innovations like this will come increasingly from countries like China. 

Oh, and the reference in the headline is from The Lord of the Rings because, in a way, Blackwell is now the world’s one ring, and it is nothing short of awesome.