Can You Duplicate the CSPs’ Server Architecture On-Premises?

[Originally published on LinkedIn 21 June 2024]

When you rent compute capacity from a cloud service provider (CSP), you can simply instantiate a virtual machine (VM) using pre-built configurations. Typically, you can choose from general-purpose, storage-optimized, compute-optimized, memory-optimized, or accelerator-optimized families of virtual machines. For example, with Google, the C4 family of virtual machines provides 3.75 GB of memory per virtual CPU (vCPU), and you can specify anywhere from 2 to 192 vCPUs per VM. For memory-intensive applications such as SAP HANA, the M2 family can provide 416 vCPUs and a whopping 11,776 GB of memory (28 GB/vCPU).
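As a concrete illustration, here is a minimal Python sketch of instantiating a C4 VM with Google’s google-cloud-compute client library. The project ID, zone, machine type name, and boot image are illustrative assumptions, not recommendations for any particular environment.

```python
# Minimal sketch: create a C4 VM with the google-cloud-compute client library.
# Project ID, zone, machine type, and boot image are illustrative assumptions.
from google.cloud import compute_v1

project_id = "my-project"      # hypothetical project
zone = "us-central1-a"         # hypothetical zone

instance = compute_v1.Instance()
instance.name = "demo-c4-vm"
# A 2-vCPU C4 shape; the name is assumed to follow the c4-standard-N pattern.
instance.machine_type = f"zones/{zone}/machineTypes/c4-standard-2"

boot_disk = compute_v1.AttachedDisk()
boot_disk.boot = True
boot_disk.auto_delete = True
boot_disk.initialize_params = compute_v1.AttachedDiskInitializeParams(
    source_image="projects/debian-cloud/global/images/family/debian-12"
)
instance.disks = [boot_disk]

nic = compute_v1.NetworkInterface()
nic.network = "global/networks/default"
instance.network_interfaces = [nic]

# insert() returns a long-running operation; wait for the VM to be created.
client = compute_v1.InstancesClient()
operation = client.insert(project=project_id, zone=zone, instance_resource=instance)
operation.result()
```

The point is that nothing in this request references a physical machine; you ask for a shape (vCPUs and memory), and the CSP’s control plane places it somewhere in its fleet.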

Regardless of which VM instance you choose, you’re instantiating a virtual machine from a set of compute and memory resources provided to you by the CSP. These resources are not associated with a physical machine because the CSP has rearchitected computers to disaggregate the resources.

What Does Pizza Have to Do with Server Architecture?

To increase the density of computers in data centers, vendors optimized power, cooling, storage and memory, and developed a standard form factor: a flat server that fits into an industry standard 19-inch-wide telco rack. These servers eventually shrank in height to just one rack unit (1U, 1.75 inches), resembling a pizza box.

However, stuffing a standard six-foot rack with 35 or more pizza box servers presents some significant challenges. First, you have a lot of cables: with redundant power supplies and multiple network interfaces, you may have 200 or more. Second, you have heat management: while a single pizza box may have adequate cooling, 35 stacked on top of each other can catastrophically increase the heat load, introducing reliability problems. Third, maintenance becomes more difficult, as you must remove a server from the rack to get to its internal components.

Modern CSP Server Architecture

Over time, the CSPs have optimized the architecture. First, they moved to caseless designs, where they racked the server motherboards without the cases. Then they created custom server motherboards that could plug into a backplane (a custom blade server), solving some of the cabling and heat management issues.

Eventually, the CSPs developed completely custom designs that ignore commercial server architecture in favor of a completely optimized solution. Most importantly, the CSPs no longer design a rack as a set of individual servers. Instead, the entire rack is an atomic unit containing a set of CPU and memory resources that can be arbitrarily grouped by software into multiple virtual machines.

Rent vs. Buy

Unfortunately, cloud-scale rack computing is available only for rent; it is off limits to companies that need to purchase systems because they must run at scale on premises. The commercial market is still stuck with sub-rack building blocks, each an atomic unit containing its own compute, memory, storage, power, cooling, and so on. The CSPs’ custom rack-scale systems are proprietary (in both architecture and firmware/software), so anyone who wants to duplicate the CSP rack-scale computing architecture on-premises must act as their own system integrator, taking responsibility for assembling and integrating the hardware and software.

How big of a problem is it that you cannot buy rack-scale systems for your own on-premises data centers? Notwithstanding the many cost, performance, and efficiency benefits that come with the cloud, organizations are continuing to invest in on-premises systems. Indeed, only approximately 5% of global IT spend goes to cloud services, and on-premises investment continues without abatement.

The Need for Cloud-Like Rack-Scale Architecture for On-Premises Data Centers

Why are organizations continuing to build their own data centers? Because there remain really good reasons to run on-premises in 2024, such as:

  • Risk management
  • Regulatory compliance
  • Latency
  • Economics

Some of the many industries that need and continue to invest in massive on-premises compute clusters include governments, the defense sector, financial services, oil and gas exploration, large SaaS application developers, managed service providers, and more.

Can Oxide Fill This Need?

At Cloud Field Day 20, Steve Tuck and Bryan Cantrill described how they founded Oxide Computer Company to solve this problem. Oxide is a systems company that has developed and commercialized its own rack-scale architecture.

It’s important to think of Oxide as delivering a rack-scale atomic computing system composed of integrated hardware and software that is just like the custom proprietary CSP systems. In the process of rearchitecting servers and rack-scale computing, the team has given deep thought to every component in the system. The result is a complete re-think of computing with disaggregated compute, memory, and storage where the rack is the atomic unit.

The Software

Arguably, the most important part of the system is the software. Realizing this, Oxide has developed custom software for the entire system. Oxide believes in being completely transparent and has published its software to GitHub under the MPL-2.0 open-source license. [Part of being transparent includes publishing on its website a series of Requests for Discussion (RFDs) documenting its thought process on topics spanning everything from architecture decisions to its hiring process and how it views and values its partnerships.]

Servers vs. Resources

The most important part of the software is how the system presents itself to users. In a traditional rack of servers, each server is directly exposed. Users must install an operating system or VM hypervisor on each individual server, and each individual server is managed independently of the others.

The Oxide software takes the CSP approach. Instead of exposing individual and independent servers, Oxide has built a hypervisor running on the rack (not on individual nodes in the rack) that exposes and manages all the resources within the rack. Thus, administrators can take a rack system containing thousands of processor cores, terabytes of memory, and petabytes of storage and group and allocate the resources as needed without regard to a physical server.

Thus, the Oxide system presents itself to users as a set of resources, and users allocate and access these resources just as they do in the cloud — by instantiating a virtual machine.

The system is entirely API driven, and Oxide’s CLI and GUI make API calls to manage the system. The intent is that much of the interaction with the system will be automated and machine-driven rather than performed by humans in real time.

Some key logical constructs include:

  • Silos — an allocation of resources (CPU, memory, storage, etc.) into a collection that is cryptographically separated and has its own API endpoint. This is how Oxide supports multitenancy in a rack and ensures tenant separation and privacy.
  • Projects — different efforts across an organization, which may have different users with different privileges assigned. Projects can contain instances, disks, snapshots, images, VPCs (virtual private cloud networks), floating IPs, and access control.
  • Instances — virtual machines.

To create an instance, you specify the number of vCPUs, memory, boot disk image, and more, just like you would with any other hypervisor and virtual machine environment. This instance runs on the rack (within the constraints of a silo), without regard to any physical CPU or server.
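As a minimal sketch of what that looks like in practice, the snippet below creates an instance by calling the rack’s HTTP API directly with Python’s requests library. The endpoint URL, path, query parameter, and request-body fields are assumptions for illustration only (check Oxide’s API documentation for the real interface); the Oxide CLI and GUI issue equivalent API calls on your behalf.

```python
# Illustrative sketch: create an Oxide instance via the rack's HTTP API.
# The endpoint path, query parameter, and body fields are assumptions, not
# Oxide's documented interface; the CLI and GUI make equivalent calls.
import requests

OXIDE_API = "https://oxide.example.internal"   # hypothetical silo API endpoint
TOKEN = "device-token-from-oxide-auth"         # hypothetical auth token

payload = {
    "name": "build-runner-01",
    "description": "CI build runner",
    "ncpus": 4,                     # number of vCPUs
    "memory": 16 * 1024**3,         # memory in bytes (16 GiB)
    "hostname": "build-runner-01",
}

resp = requests.post(
    f"{OXIDE_API}/v1/instances",
    params={"project": "ci-infrastructure"},    # the project within the silo
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # the control plane decides where on the rack the instance runs
```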

Duplicating the Cloud Experience On-Premises

So how would you use Oxide to duplicate the cloud experience on-premises? Just as Google Cloud users can instantiate a virtual machine without knowing about the underlying hardware, so can Oxide users. All it takes is using the CLI or GUI, or invoking the correct API calls.

Just as Google Cloud has developed a self-service interface enabling a user to instantiate a pre-defined virtual machine (e.g., a C4 instance with 2 vCPUs and 3.75 GB of memory per vCPU), an Oxide administrator can use API calls to provide a similar self-service capability.

For example, administrators can build a self-service environment that duplicates the cloud experience for their developers. In such an environment, a DevOps engineer could invoke an API call (or use the CLI or GUI) to instantiate a preconfigured VM, just as they can with Google. A smart admin might even duplicate cloud VM configurations on-premises, which can make the on-premises compute infrastructure an extension of the cloud infrastructure (or vice versa), enabling users to easily move workloads back and forth between environments.
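As a sketch of that idea, an administrator could keep a small catalog that mirrors cloud machine shapes, so a developer can request the same vCPU/memory profile on-premises that they use in the cloud. The shape names, the memory ratio (taken from the C4 example above), and the provision() helper are illustrative only.

```python
# Hypothetical sketch: mirror a few cloud machine shapes as on-premises instance
# shapes so workloads can request the same vCPU/memory profile in either place.
GIB = 1024**3

CLOUD_SHAPES = {
    # shape name         vCPUs and memory (3.75 GiB per vCPU, per the C4 example)
    "c4-standard-2":  {"ncpus": 2,  "memory": int(2 * 3.75 * GIB)},
    "c4-standard-8":  {"ncpus": 8,  "memory": int(8 * 3.75 * GIB)},
    "c4-standard-32": {"ncpus": 32, "memory": int(32 * 3.75 * GIB)},
}

def provision(shape_name: str, instance_name: str) -> dict:
    """Return an instance-create payload for a cloud-equivalent shape."""
    shape = CLOUD_SHAPES[shape_name]
    return {"name": instance_name, "hostname": instance_name, **shape}

# A DevOps engineer asks for the same shape they would use in the cloud:
print(provision("c4-standard-8", "batch-worker-07"))
```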

Thus, the role of the administrator shifts from managing 30 – 40 individual servers that just happen to be physically stacked and racked together to managing a set of resources (processors, memory, storage, and networking) that can be arbitrarily grouped together as needed.

Custom Everything

Oxide’s obsession with rethinking every component and every interaction led it to create custom software that optimizes both the individual components and the interactions between them. As an example, Oxide has developed its own boot firmware and software. It has even developed its own network switch software, including its own P4 compiler (P4 is a programming language for specifying how the packet-forwarding plane processes packets), so that it can optimize the capabilities and performance of the network switch. Custom networking software features include delay-driven multipath (DDM), which optimizes inter-node packet routes based on current real-time loads and delays in the rack.

The Hardware

Just as with the software, Oxide has customized every hardware component. Oxide realized that in many of the data centers of the organizations it works with, floor space is in short supply, yet there is plenty of unused vertical space above the standard six-foot rack. Thus, the Oxide rack is tall – almost eight feet. It’s designed to maximize vertical real estate while still being able to roll through standard building and data center doors.

This extra vertical height was put to good use — both to increase compute density and to increase the efficiency of the cooling system. The processors are installed into the system on compute sleds. Rather than sticking with the 1U form factor, which uses 20mm fans, Oxide made the compute sleds 100mm high, enabling it to use 80mm fans. The larger fans can run at a slower speed while providing the same cooling capacity, reducing both the power consumed for cooling and the noise generated by the fans. Oxide claims a 12x improvement in fan power consumption.

How important is this? In the traditional pizza-box server rack architecture, approximately 25% of the power — which could be in the 12 – 15 kW range — is consumed by the cooling fans.
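Using only the figures above (and treating them as rough approximations), the savings look like this:

```python
# Back-of-the-envelope arithmetic using the figures in the paragraph above:
# roughly 25% of a 12-15 kW rack goes to cooling fans, and Oxide claims a
# 12x improvement in fan power consumption.
rack_power_kw = (12 + 15) / 2            # midpoint of the 12-15 kW range
fan_share = 0.25                         # ~25% consumed by cooling fans
fan_power_kw = rack_power_kw * fan_share

oxide_fan_power_kw = fan_power_kw / 12   # claimed 12x improvement
savings_kw = fan_power_kw - oxide_fan_power_kw

print(f"Traditional fan power:   ~{fan_power_kw:.2f} kW")        # ~3.38 kW
print(f"With a 12x improvement:  ~{oxide_fan_power_kw:.2f} kW")  # ~0.28 kW
print(f"Power freed for compute: ~{savings_kw:.2f} kW per rack") # ~3.09 kW
```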

Oxide is working closely with its component partners, going so far as to co-develop custom circuit boards for its power supplies so that it can collect detailed telemetry and closely control power (including limiting overall power consumption). For example, an administrator could configure the rack to consume no more than 8 kW, and the power controller would manage the power consumption of the CPUs and other components.
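Conceptually, a rack-level power cap is a feedback loop: read telemetry from the power supplies and nudge per-sled CPU power limits up or down to stay under the cap. The sketch below is purely illustrative and is not Oxide’s controller; the telemetry and control hooks are hypothetical.

```python
# Purely conceptual sketch of a rack-level power cap; not Oxide's implementation.
# read_rack_power_w() and set_sled_cpu_limit_w() are hypothetical hooks.
import time

RACK_CAP_W = 8_000                     # administrator-configured cap (8 kW)
MIN_LIMIT_W, MAX_LIMIT_W = 120, 280    # assumed per-sled CPU power limit range
STEP_W = 5                             # adjustment per control interval

def power_cap_loop(read_rack_power_w, set_sled_cpu_limit_w, sled_ids):
    limit_w = MAX_LIMIT_W
    while True:
        rack_power = read_rack_power_w()                  # telemetry from the PSU boards
        if rack_power > RACK_CAP_W:
            limit_w = max(MIN_LIMIT_W, limit_w - STEP_W)  # throttle the CPUs
        elif rack_power < 0.95 * RACK_CAP_W:
            limit_w = min(MAX_LIMIT_W, limit_w + STEP_W)  # relax the limit
        for sled in sled_ids:
            set_sled_cpu_limit_w(sled, limit_w)
        time.sleep(1)
```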

Similarly, Oxide has replaced the traditional server baseboard management controller (BMC) used for out-of-band management with an embedded service processor and has eliminated the BIOS (which dates back to the days of CP/M and the first PCs in 1981).

The Oxide rack supports 2048 AMD Milan CPU cores, 32 TB RAM, 1024 TB raw storage, and 6.4 Tb/s layer 2/3 network switching. The rack includes two network switches that can switch 6 billion packets per second. The compute sleds connect to the system through a custom backplane that provides both power and networking. This means that the system is cable-less — there are no cables interconnecting the CPUs to the built-in network switches, nor are there power cables to each sled.

Speaking of power, the rack consumes up to 15 kW of three-phase AC power, and Oxide’s fanatical devotion to reworking every bit of the architecture enables it to optimize compute density and compute efficiency (CPU cores per square foot and CPU cores per watt).

Security

Near and dear to my heart, Oxide designed security in (rather than bolting it on). Each sled includes a hardware root of trust to secure the boot process and provide a foundation for encryption and the software chain of trust. Thus, Oxide has a “secure-by-design” system, including the boot process (a conceptual sketch follows the list below):

  • During manufacturing, each hardware root of trust is provisioned with a unique private key and matching certificate, enabling public-key cryptography and validation that the system was produced by Oxide.
  • On power on, the hardware root of trust cryptographically validates its own firmware.
  • As the rack boots, each physical CPU is held in reset while the firmware is measured. This measurement enables system software to know what firmware was booted by asking the root of trust.
  • Verifiable secret sharing provides remote attestation, protects data at rest, and protects against boot-time attacks.
  • A secure secret storage service avoids keeping secrets in RAM and has strict ACLs limiting access and use of secrets.
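The conceptual sketch below shows what validating a sled at boot amounts to: check the root of trust’s certificate back to Oxide’s manufacturing authority, check that the firmware measurements were signed by that root of trust, and compare them against known-good values. It is not Oxide’s implementation; every function and value in it is hypothetical.

```python
# Conceptual sketch of boot-time attestation; not Oxide's implementation.
# The verification callbacks and measurement values are hypothetical.
KNOWN_GOOD_FIRMWARE = {
    # component          expected SHA-256 measurement (placeholder values)
    "rot-firmware":      "a1b2...placeholder",
    "sled-bootloader":   "c3d4...placeholder",
}

def attest_sled(rot_cert_chain, signed_measurements,
                verify_cert_chain, verify_signature) -> bool:
    """Return True if the sled's root of trust and measured firmware check out."""
    # 1. Prove the root of trust was provisioned by Oxide during manufacturing.
    if not verify_cert_chain(rot_cert_chain, trusted_root="oxide-manufacturing-ca"):
        return False
    # 2. Prove the reported measurements really came from that root of trust.
    if not verify_signature(rot_cert_chain[0], signed_measurements):
        return False
    # 3. Compare each reported firmware measurement against a known-good value.
    return all(
        signed_measurements.get(component) == expected
        for component, expected in KNOWN_GOOD_FIRMWARE.items()
    )
```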

The Analyst Take

It may be shocking to some, but there are thousands of organizations that need to deploy massive cloud-scale compute power on-premises. Until Oxide, if you were one of these organizations, you were stuck with decades-old computer and rack architecture. Oxide makes it possible for organizations to use the same architecture as the CSPs and to realize the same benefits, including:

  • simplified system management
  • cloud computing software paradigm
  • cloud rack-scale hardware consumption model
  • power and cooling savings
  • increased compute density
  • increased flexibility
  • increased hardware efficiency
  • increased DevOps engineering productivity

Disaggregation may be an avant-garde exercise for Michelin-starred haute cuisine restaurants.

For modern organizations that require on-premises cloud-scale computing resources, Oxide’s investment in re-thinking and re-architecting the server rack, creating a cloud-scale rack that disaggregates CPU, memory, storage, and networking and duplicates the cloud environment on-premises, is a game-changer.
