Bending Metal: Combinatorics

A few weeks ago we took a look at the business behind delivering working servers to data centers, and the changing relationship among the OEMs, ODMs and their customers. Here we want to dig a little deeper into the subject to explore the many hurdles in the process of server design and the ways in which this adds so much friction to the industry, with some important implications for the semis companies looking to supply it. Unsurprising spoiler – Nvidia is shaking this all up.

The heart of the problem is that everyone wants a slightly different server. The hyperscalers all have their own thoughts on what makes a good server. For example, Meta famously used to like the cheapest components on the reasonable assumption that they would all break eventually, so better to save some money and build software that was resilient to failures. Google wants to design every aspect of their system. AWS and Azure also each have their preferred approaches. The design of a server depends heavily on what software will run on it and so server designs tend towards as much diversity as there is in software, which is to say infinite.

All servers have some combination of digital processor (once a CPU, now a GPU or AI accelerator as well), some form of storage (SSDs and still a lot of hard drives), and networking (Ethernet vs InfiniBand, optical, etc.). And within each of those categories there are dozens, if not hundreds, of different offerings. For instance, if a company needs a server to run a database, it will need a lot of storage but can probably get by with a less expensive CPU. Another company may want a server to manage some aspect of its network, in which case it will need to bulk up on the networking elements as well as some specialized processors, but far less in the way of storage. This all sounds straightforward enough on paper, but there is considerable expense required to actually get these systems built.
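To make the combinatorics concrete, here is a minimal sketch of how quickly configuration counts explode. The category names and option counts are hypothetical illustrations, not a real bill of materials.

```python
# A minimal sketch of the combinatorial explosion in server configurations.
# Every category name and option count below is a hypothetical placeholder.
from math import prod

component_options = {
    "cpu_or_accelerator": 12,   # hypothetical number of viable SKUs
    "storage": 20,              # SSD/HDD mixes and capacities
    "networking": 8,            # Ethernet vs InfiniBand, NIC speeds
    "memory_config": 10,        # DIMM counts and capacities
    "chassis_and_power": 6,     # form factors, PSU options
}

# Total distinct configurations if every combination were allowed.
total_configs = prod(component_options.values())
print(f"Possible configurations: {total_configs:,}")  # 115,200 with these toy numbers
```

Even with these modest, made-up numbers the count runs into six figures, which is the point: customers can imagine far more server varieties than anyone is willing to pay to design.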

Customers may want their own version of a server, but someone actually has to design that server. As much as we focus on those key components of a server, there are many, many more required. Look at a server board and the processor, storage and networking typically occupy less than 30% of the volume of the unit or surface area of the board. There are hundreds of smaller semis and passive components. Some of those perform high-level functions, like the BMC chip, which provides some control of the physical components, but most are needed just to move the electrons around the board in a stable fashion. This is raw electrical engineering, which almost none of us think about, but it is critical for the server to operate. Someone has to design all of that using sophisticated CAD software. This process can take multiple weeks and cost several hundred thousand dollars. This may not sound like much compared to the scale of the data center market, but it does limit the number of servers that ultimately get designed. Customers may want infinite varieties of servers, but no one is willing to pay infinite dollars for all those designs.

So the question becomes: who will pay for these designs? For many years, the answer was simple – Intel would pay for or somehow subsidize most of them. They sold most of the processors, and offering these designs helped to effectively lock customers into choosing their silicon. The OEMs – like Dell and HPE – also designed their offerings, balancing the need to find some lowest common denominator for their customers against the expense of each design. Ultimately, the hyperscalers wanted something different and began paying for their own designs or got them as loss leaders from the ODMs. This process was a major barrier to entry for new silicon vendors, who found that prospective customers expected them to pay for the server designs but still wanted those custom formats. A big factor in the rise of custom silicon is the way in which it allowed the hyperscalers to assert control of everyone’s product roadmap, and that is especially true in the server design space.

At some level, these big customers recognized that these server design costs were not serving their own interest. This led to various initiatives to standardize server design and organizations like the Open Compute Project (OCP). And while OCP has had some notable successes, no one ever really expected them to completely untangle this knot. Designing servers remains a considerable expense, limiting entry to the market on multiple fronts. We know some semis start-ups that have their own server design teams, which is staggeringly expensive for pre-revenue companies.

That uncomfortable status quo is starting to change because of – you guessed it – AI. Nvidia has brought a new model to the market, or at least a new flavor of the old model. For almost its entire history, Nvidia has had to design its own hardware products, going back to the days of its first graphics cards. The company carried that practice into its AI products. For starters, Nvidia designed its own servers such as the DGX. As these have taken off, Nvidia has opened up the process more widely, ‘encouraging’ the ODMs and OEMs to offer their own designs as well via systems like the MGX. In many ways, this harkens back to the days when Intel provided many server designs, but it is much more advanced. Nvidia would really like to sell customers their entire stack – CPU, GPU, networking, marked-up memory – and while they will sell all of that individually, they have made it much easier to just buy the whole stack. The hyperscalers will mostly not go for this and continue to design their own systems, but for almost everyone else the Nvidia way is the easiest.

And remember, all of this happens before ‘revenue’ really kicks in. Once designed, someone needs to build a prototype. Then those prototypes need to be tested. The big customers will then want to run a series of trials – a single server, then a rack of servers, then a cluster of a few racks. Only once all those are run do volume orders start to come in. Given that lead times on semis from the foundries can be months long, silicon vendors need to incur the cost of buying chips, creating a major working capital requirement.
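As a rough illustration of that working capital strain, here is a back-of-the-envelope sketch. Every figure is a hypothetical placeholder, not vendor or foundry data.

```python
# A back-of-the-envelope sketch of the working capital requirement described above.
# All figures are hypothetical placeholders chosen only to show the shape of the problem.

unit_cost = 2_000              # hypothetical cost per chip paid to the foundry, USD
units_ordered = 10_000         # hypothetical initial volume commitment
foundry_lead_time_months = 5   # foundry lead times can run several months
qualification_months = 4       # prototype, rack and cluster trials before volume orders

# Cash the vendor must front before meaningful revenue arrives.
cash_outlay = unit_cost * units_ordered
months_before_revenue = foundry_lead_time_months + qualification_months

print(f"Cash tied up: ${cash_outlay:,}")                              # $20,000,000
print(f"Months before revenue starts to offset it: {months_before_revenue}")  # 9
```

With these toy numbers, a vendor is carrying eight figures of inventory cost for the better part of a year before a single volume order pays it back, which is exactly the kind of burden a pre-revenue start-up struggles to shoulder.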

Selling against this is incredibly challenging for all the vendors. Intel and AMD can (mostly) hold their own here, but for most new entrants this process represents a large, often insurmountable barrier to entry.
