The Heart Of Modern Systems Engineering

The trade-offs between integrated and component-based architectures, at Oxide Computers is a deliberate departure from the prevailing model of commodity hardware running a generic OS (usually Linux) and a standard hypervisor (like KVM).

Let's break down each component of the Oxide stack, analyze its role, and then discuss the risks and advantages of their deep integration.

Introduction: The Oxide Philosophy

Oxide's goal is to build a true "cloud computer" for on-premises data centers. Instead of selling servers, they sell a fully integrated rack that is a cohesive, software-defined system. The philosophy is that by co-designing the hardware, firmware, and system software, you can eliminate entire classes of problems related to security, performance, observability, and manageability that plague the commodity server world. This is a modern take on the vertically integrated model of companies like Sun Microsystems, which is no coincidence given the team's heritage.


1. Hardware: The Foundation

While developers don't explicitly ask for hardware, it's the bedrock upon which everything else is built. Oxide designs its own motherboards, chassis, network fabric, and, crucially, a custom Service Processor (SP) that acts as the hardware root of trust.

  • What it is: Bespoke, rack-scale hardware designed from first principles. This includes custom sleds with AMD CPUs, a 100GbE backplane network fabric, and a control plane run by their custom Rust-based Service Processor.

  • Role in Integration: The hardware is not a generic platform; it's the specific target for all the software above it. The drivers in illumos are written for this exact hardware. The firmware is designed to initialize this exact hardware. The hypervisor is aware of the specific I/O and CPU capabilities.


2. Firmware: Hubris and openSIL

In a traditional server, the firmware (BIOS/UEFI) is a complex, proprietary, and often insecure binary blob from vendors like AMI or Phoenix. Oxide rejected this model.

  • What it is: Oxide's firmware stack, named Hubris, is a small, security-focused operating system written in Rust for their Service Processor. It is responsible for the earliest stages of boot and system management. For the main CPUs (AMD EPYC), Oxide is a key partner and contributor to openSIL (Open-Source Silicon Initialization Library). openSIL is an AMD-led initiative to replace the proprietary AGESA blob, which has historically been the black box responsible for memory training and CPU initialization.

  • Role in Integration:

    • Security: By using Rust for Hubris and contributing to an open-source openSIL, Oxide eliminates vast, unauditable binary blobs from the boot process. The entire chain of trust, from the hardware root of trust to the OS, becomes verifiable.

    • Simplicity & Speed: Traditional UEFI is a full-blown OS with its own driver models, network stacks, and applications. It's slow and complex. Oxide's firmware does the bare minimum required to initialize the hardware and hand off control to the operating system's bootloader. This results in dramatically faster and more reliable boot times.

Advantages of Oxide's Firmware Approach

  • Security & Auditability: The entire firmware stack is open source and written in memory-safe Rust (Hubris). This allows for full auditability, removing the "firmware black box" which is a common vector for sophisticated attacks.

  • Robustness & Simplicity: By jettisoning the complexity of UEFI, the attack surface is drastically reduced. The boot process is simpler, faster, and less prone to esoteric bugs.

  • Deep Integration: The firmware is designed to do one thing: prepare the hardware for Helios (Oxide's OS). It passes critical hardware information directly to the OS in a clean, well-defined manner.

Risks of Oxide's Firmware Approach

  • Maturity: openSIL is a relatively new initiative. While it has AMD's backing, it doesn't have the decades of battlefield testing that proprietary AGESA and UEFI have (for better or worse). There's a risk of undiscovered hardware initialization bugs.

  • Vendor Dependence (AMD): The success of openSIL is heavily dependent on AMD's continued commitment. If AMD were to deprioritize it, Oxide's firmware strategy for future CPUs would be at risk.

  • Ecosystem: The ecosystem around UEFI is massive (secure boot certificates, specific drivers, etc.). By moving away from it, Oxide must build its own tooling and processes. This is not a risk for their integrated product but would be if these components were used a la carte.


3. Kernel: illumos Drivers and Helios

Instead of Linux, Oxide chose illumos as the foundation for their operating system, which they call Helios. illumos is an open-source fork of OpenSolaris.

  • What it is: illumos is a Unix operating system renowned for its robust, production-grade features developed at Sun Microsystems. Key features include:

    • ZFS: Widely considered the most advanced filesystem available, with built-in data integrity, snapshots, and software-defined storage capabilities.

    • DTrace: A dynamic tracing framework that provides unparalleled, real-time observability into the kernel and applications with minimal performance impact.

    • Zones: A mature, lightweight, OS-level virtualization (container) technology with strong security isolation.

    • SMF (Service Management Facility): A robust framework for managing system services and their dependencies.

  • Role in Integration: Helios is the connective tissue. The drivers are written specifically for the Oxide hardware. ZFS provides the storage substrate for the entire system, including the hypervisor. DTrace provides observability across the entire stack.

Advantages of Using illumos

  • Proven Technologies: ZFS, DTrace, and Zones are legendary for their stability and advanced capabilities. They are not add-ons; they are fundamental parts of the OS.

  • Integrated Design: illumos was designed from the ground up as a cohesive system. This aligns perfectly with Oxide's philosophy. For example, the networking stack (Crossbow) and storage (ZFS) are deeply integrated, enabling high-performance I/O.

  • Unmatched Observability: DTrace is the "killer app." It allows Oxide engineers and customers to debug complex performance problems across hardware, kernel, hypervisor, and applications in a way that is extremely difficult to replicate on Linux.

Risks of Using illumos

  • Niche Ecosystem: The illumos community and application ecosystem are vastly smaller than Linux's. While Oxide provides a Linux guest environment (via their hypervisor), the host OS itself has a limited pool of developers and third-party software.

  • Driver Support: This is a major issue for generic illumos deployments, but Oxide sidesteps it by controlling 100% of their hardware. They write their own drivers. However, this means you cannot add a third-party PCIe card and expect it to work.

  • Hiring and Knowledge Base: Finding engineers with deep illumos expertise is much harder than finding Linux experts. Customers need to trust Oxide's team to manage the platform, as their own staff is unlikely to have prior experience.


4. Hypervisor: Propolis

Oxide did not use KVM (the Linux standard) or Xen. They built their hypervisor, Propolis, on top of the illumos kernel, using the bhyve hypervisor technology originally from FreeBSD.

  • What it is: Propolis is Oxide's control plane and management layer for virtualization. It uses bhyve, a modern, Type-2 hypervisor that is well-suited for integration into a general-purpose OS like illumos. Propolis is responsible for VM lifecycle management, storage, networking, and exposing a cloud-like API.

  • Role in Integration: This is where the magic happens. Propolis is not a generic hypervisor; it's an illumos-native application that leverages all the underlying OS features.

    • Storage: A VM's virtual disk is simply a ZFS zvol. This means creating a VM, snapshotting it, or cloning it is an instantaneous, metadata-only ZFS operation.

    • Observability: Using DTrace, you can trace an I/O request from an application inside a Linux VM, through the Propolis hypervisor, down into the ZFS storage stack in Helios, and all the way to the hardware driver. This is the "god mode" of observability.

    • Networking: Propolis wires guest networking directly into the illumos Crossbow networking stack, allowing for fine-grained resource control and high performance.

Advantages of Propolis

  • Extreme Performance: By leveraging ZFS for storage operations, tasks that are slow and cumbersome in other systems (like cloning a 100GB VM) become instantaneous and space-efficient.

  • Holistic Management: The entire system—from hardware health to VM performance—is managed through a single, cohesive API. There is no separation between the storage layer, the compute layer, and the management layer; they are one and the same.

  • Superior Debuggability: When a VM is slow, DTrace can pinpoint the exact cause, whether it's in the guest OS, the hypervisor, the host OS kernel, or a hardware interaction. This drastically reduces the mean time to resolution for complex problems.

Risks of Propolis

  • Maturity and Feature Parity: bhyve and Propolis are less mature than KVM or VMware ESXi. While they have the core features, they may lack some of the esoteric or enterprise-edge features that have been built into competitors over decades (e.g., certain types of live migration, complex GPU pass-through configurations).

  • Lock-in: Propolis is a bespoke component. The APIs and management tools are unique to Oxide. Migrating workloads from Oxide to another platform (or vice-versa) is more involved than migrating between two KVM-based clouds. You are buying into the Oxide ecosystem, not just a hypervisor.

  • Guest Support: While bhyve has good support for modern guest OSes (Linux, Windows, BSDs), the range of tested and officially supported guest configurations may be smaller than that for KVM or ESXi initially.


The Synthesis: How Integration Creates a Superior System

The true advantage of Oxide is not in any single component but in their virtuous cycle of integration:

  1. Hardware is designed for the Firmware: The Service Processor and motherboard are built with the Hubris/openSIL firmware in mind.

  2. Firmware is designed for the OS: The firmware does the minimum work necessary and hands off a clean, well-understood state to the Helios kernel, enabling a fast and secure boot.

  3. OS is designed for the Hardware: Helios drivers are written for the exact hardware, leading to optimal performance and stability without the bloat of supporting thousands of devices.

  4. Hypervisor is designed for the OS: Propolis uses the core strengths of illumos (ZFS, DTrace, Zones) to deliver performance, observability, and manageability that would be impossible to achieve by simply layering KVM on top of a generic Linux distribution.

This tight coupling creates an experience that is more like a cloud service (e.g., EC2) or a mainframe than a collection of commodity servers. The primary risk is also its primary advantage: it's a single, opinionated system from a single vendor. You are betting on Oxide's engineering, vision, and long-term viability. In return, you get a system that promises to be more secure, more performant, and dramatically easier to operate.

Resources for More Information

Connect: Join Univrs.io

Last updated