Enfabrica: Accelerating AI GPU Communication
Massive datasets are the lifeblood of AI models, fueling training and enabling accurate predictions. The insatiable hunger for data has profound implications for the underlying network infrastructure, pushing the boundaries of traditional computer and networking architecture.
The Current State of AI Computer Networking Architecture
Present-day AI networking relies heavily on a hierarchical structure of interconnected components. This hierarchy typically comprises:
- GPUs process vast amounts of data in parallel.
- PCIe switches connect multiple GPUs within a server, facilitating communication and data exchange.
- RDMA NICs (Remote Direct Memory Access Network Interface Cards) enable direct memory access between GPUs across different servers, minimizing CPU involvement and improving data transfer speeds.
- Network switches form the backbone of the leaf-spine network, connecting servers and facilitating communication across the data center.
This traditional approach, while functional, suffers from several key limitations that hinder the scalability and efficiency of AI workloads:
- Inter-GPU Communication Bottlenecks: As the number of GPUs in a cluster increases, the hierarchical nature of the network creates bottlenecks. Data often has to traverse multiple switches and NICs, adding latency and reducing overall throughput.
- Limited Bandwidth and Resilience: Existing architectures struggle to keep up with the exponential growth in bandwidth demands of AI workloads. Moreover, a failure at a single point, such as a dropped cable link, can disrupt an entire training job and force a costly restart from the last checkpoint.
- Lack of Composability: Traditional architectures lack the flexibility to support diverse AI applications that require different combinations of compute and memory resources. This rigidity hinders innovation and adaptability in AI development.
- Escalating Total Cost of Ownership (TCO): Scaling AI infrastructure with traditional components significantly increases TCO due to the cost of networking hardware, power consumption, and cooling requirements.
Enfabrica’s Solution: The Accelerated Compute Fabric
Enfabrica proposes a radical departure from the conventional approach with its Accelerated Compute Fabric (ACF) technology. ACF embraces a MegaNIC concept, consolidating the functions of PCIe switching, RDMA, and first-tier network switching into a single, high-bandwidth, highly resilient device.
The ACF achieves its remarkable performance and efficiency through a unique architectural design. The solution integrates multiple high-speed Ethernet NICs, interconnected by internal crossbar switches. These switches create a high-bandwidth, non-blocking fabric that allows data to flow seamlessly between any connected port. A key innovation is the separation of packet header processing and payload transfer. The NICs within the ACF process packet headers and make forwarding decisions, while the payload data is directly transferred between endpoints via DMA (Direct Memory Access), bypassing the NICs and minimizing latency. This approach allows for extremely efficient data movement, crucial for the demands of AI workloads.
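The split between header processing and payload transfer can be illustrated with a toy model. This is only a sketch of the general idea, not Enfabrica's implementation: the `Header` fields, the `routes` table, and the `forward` and `dma_copy` helpers are all hypothetical names, and a buffer-to-buffer copy stands in for hardware DMA.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Header:
    dst_port: int     # hypothetical destination port on the fabric
    payload_len: int  # number of payload bytes to move


def forward(header: Header, routes: Dict[int, str]) -> str:
    """Control path: look only at the header and pick an egress endpoint."""
    return routes[header.dst_port]


def dma_copy(src: memoryview, dst: memoryview, length: int) -> None:
    """Data path: move payload bytes buffer-to-buffer without parsing them."""
    dst[:length] = src[:length]


# Hypothetical routing table (fabric port -> endpoint) and endpoint memory.
routes = {0: "gpu0", 1: "gpu1"}
memory = {"gpu0": bytearray(16), "gpu1": bytearray(16)}

payload = bytearray(b"gradients")
hdr = Header(dst_port=1, payload_len=len(payload))

# The forwarding decision never touches the payload bytes.
target = forward(hdr, routes)
dma_copy(memoryview(payload), memoryview(memory[target]), hdr.payload_len)
```

The point of the sketch is that `forward` never reads the payload, so the cost of the forwarding decision does not grow with message size.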
The ACF’s architecture provides:
- Converged PCIe and Ethernet Crossbar: ACF integrates PCIe switching and Ethernet networking capabilities, creating a direct, low-latency path for data transfer between GPUs and across the network. This consolidation eliminates intermediate hops, reducing latency and boosting performance.
- Massive Bandwidth and Path Diversity: ACF provides a substantial increase in bandwidth, supporting up to 3.2 terabits per second on the network side and 5 terabits per second on the host/accelerator side. This bandwidth capacity, coupled with multipath capabilities, ensures high throughput and mitigates the impact of component failures.
- Programmable Transport and Congestion Control: ACF incorporates a programmable transport layer that runs on a standard CPU, so customers can customize congestion control mechanisms and tailor network behavior to specific workloads. This flexibility supports efficient scaling and adaptation to evolving demands.
- Composability and Heterogeneity: ACF’s architecture supports diverse compute and memory resources, including GPUs, CPUs, storage, and CXL-attached memory. This enables the creation of tailored systems optimized for specific AI applications, fostering innovation and adaptability.
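To make the programmable congestion control item above more concrete, here is a minimal sketch of what a swappable congestion control policy could look like in software. It uses additive-increase/multiplicative-decrease (AIMD), a generic textbook scheme chosen for illustration, not something the article attributes to ACF. The `Policy` type, the `run` helper, the 10 Gbps step, and the 400 Gbps cap are all assumptions.

```python
from typing import Callable, List

# A policy maps (current rate in Gbps, congestion seen?) to a new rate.
Policy = Callable[[float, bool], float]


def aimd(rate_gbps: float, congested: bool) -> float:
    """Additive increase, multiplicative decrease (illustrative constants)."""
    if congested:
        return max(1.0, rate_gbps / 2)   # halve on congestion, floor at 1 Gbps
    return min(400.0, rate_gbps + 10.0)  # otherwise add 10 Gbps, cap at 400


def run(policy: Policy, start_gbps: float, signals: List[bool]) -> List[float]:
    """Apply a policy to a sequence of congestion signals; return the rates."""
    rates = []
    rate = start_gbps
    for congested in signals:
        rate = policy(rate, congested)
        rates.append(rate)
    return rates


history = run(aimd, 100.0, [False, False, True, False])
print(history)  # [110.0, 120.0, 60.0, 70.0]
```

Because `run` accepts any function with the `Policy` signature, a different scheme can be substituted without touching the rest of the loop, which is the kind of flexibility a software-defined transport is meant to offer.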
With Enfabrica’s ACF, each GPU is directly connected to all Ethernet interfaces in the chip rather than just a single NIC. This raises the peak throughput available to each GPU to that of the fabric’s network side (3.2 Tbps). At AI Field Day 5, Rochan Sankar, Enfabrica’s CEO, said, “The role of a PCI networking card has no relevance in AI going forward.”
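The effect of connecting each GPU to every Ethernet interface can be checked with some back-of-the-envelope arithmetic. The sketch below assumes the 3.2 Tbps network side is split into eight equal ports; the port count is an assumption for illustration and is not stated in the article. The `peak_per_gpu` helper is likewise hypothetical.

```python
FABRIC_GBPS = 3200  # network-side figure quoted in the article (3.2 Tbps)
PORTS = 8           # assumption: eight equal ports, not stated in the article
PORT_GBPS = FABRIC_GBPS // PORTS


def peak_per_gpu(links_reachable: int, links_failed: int = 0) -> int:
    """Peak network bandwidth (Gbps) one GPU can reach over healthy links."""
    healthy = max(0, links_reachable - links_failed)
    return healthy * PORT_GBPS


# GPU pinned to a single NIC versus a GPU that can use every fabric port.
single_nic = peak_per_gpu(1)
fabric = peak_per_gpu(PORTS)

# The same comparison after one link fails.
single_nic_after_failure = peak_per_gpu(1, links_failed=1)
fabric_after_failure = peak_per_gpu(PORTS, links_failed=1)

print(single_nic, fabric)                              # 400 3200
print(single_nic_after_failure, fabric_after_failure)  # 0 2800
```

Under these assumptions, a single link failure cuts off a GPU that depends on one NIC, while a GPU that can reach all ports loses one eighth of its peak. These are peak figures for one GPU; when several GPUs transmit at once they share the same 3.2 Tbps.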
In addition to AI training workloads, the ACF’s high-bandwidth memory access capabilities can also benefit inference workloads and Retrieval Augmented Generation (RAG) by providing a large, shared memory pool accessible by multiple GPUs with low latency. “We think this is huge for RAG because RAG is effectively going to be about 75% the retrieval part and what this can do is effectively reduce and make the fleet more efficient,” Mr. Sankar said.
Potential Disadvantages of Enfabrica’s Solution
While Enfabrica’s solution offers compelling advantages, some potential disadvantages merit consideration:
- Hardware Dependency: ACF requires modifications to existing server designs, making it incompatible with current off-the-shelf systems. This may hinder adoption, particularly for organizations with existing infrastructure investments.
- Single Point of Failure: While ACF mitigates numerous points of failure through its multipath architecture, the device itself represents a single point of failure. A failure at the ACF level could disrupt all connected GPUs. Though the failure rate is estimated to be low due to the device’s design, it’s still a factor to consider.
- Limited Compatibility: ACF currently targets existing InfiniBand verbs and RoCE rather than Ultra Ethernet. Enfabrica frames this as a pragmatic choice driven by customer requirements: there is an urgent need for solutions that address the scalability challenges of current AI deployments. By prioritizing compatibility with established technologies, Enfabrica aims to provide a readily deployable solution for immediate needs while keeping an eye on future advancements like Ultra Ethernet.
Why This Matters
AI workloads, particularly large language models, demand enormous amounts of data to be moved, processed, and stored. This data deluge necessitates high-bandwidth, low-latency architectures to avoid bottlenecks that can cripple AI performance.
Enfabrica, a startup focused on revolutionizing network infrastructure for AI, recognizes this challenge and proposes a radical shift in approach. Instead of treating networking as a peripheral concern, Enfabrica places it at the heart of AI computing, arguing that the network’s role in AI is evolving beyond mere connectivity to become a critical determinant of performance and scalability.
Enfabrica’s core value proposition lies in its ability to address the key challenges of AI networking:
- Reduced TCO: By collapsing multiple components into a single device and optimizing data flow, ACF significantly reduces the cost of AI infrastructure, freeing up resources for compute power.
- Enhanced Performance: The high bandwidth, low latency, and multipath capabilities of ACF unlock the full potential of GPUs, accelerating training and inference tasks.
- Improved Resilience: The robust architecture and failure recovery mechanisms of ACF minimize downtime and ensure consistent operation, vital for large-scale AI deployments.
- Future-Proofing AI Infrastructure: The programmable transport layer and support for diverse compute and memory resources empower organizations to adapt to evolving AI workloads and future technologies.
Enfabrica’s ACF represents a significant leap forward in AI networking, enabling the realization of increasingly complex and demanding AI applications. As AI continues to advance, solutions like Enfabrica’s will play a crucial role in unlocking AI’s full potential and shaping the future of computing.