Optical Fiber And The Generative AI Revolution

How is optical fiber connectivity advancing the generative AI revolution?

By Mustafa Keskin
Published: June 6, 2024

What comes to mind when you think about Artificial Intelligence (AI)? For me, it all began last November with a post from an old friend on LinkedIn, expressing how impressed they were with ChatGPT. After eventually signing up myself, what truly captivated me was its ability to provide human-like answers that were both contextually appropriate and technically sound.

Its limitations were also clear, of course – almost like interacting with an intelligent but slightly dull human friend. It would respond with bullet-pointed answers and consistently remind me that it was, in fact, an AI model, urging me to take its responses with a grain of skepticism. What I found most appealing was the way the answers appeared on the screen – letter by letter, word by word, as if typed by a human on the other end of the connection.

Fast forward six months, and now when I type a question for ChatGPT, it responds so rapidly that it leaves me a bit dizzy. What transpired during these past six months? What changes were implemented by the creators of ChatGPT?

Most likely, OpenAI has scaled the inference capacity of their AI cluster to accommodate the demands of over 100 million subscribers. NVIDIA, a leading AI chip maker, is reported to have supplied around 20,000 graphics processing units (GPUs) to support the development of ChatGPT. Moreover, there are plans for significantly increased GPU usage, with speculation that their upcoming AI model may require as many as 10 million GPUs.

GPU cluster architecture – the foundation of generative AI

Now, let's take a step back. Wrapping my head around the concept of 20,000 GPUs is manageable, but the thought of optically connecting 10 million GPUs to perform intelligent tasks is quite the challenge.

After a couple of hours of scouring the internet, I stumbled upon various design guides detailing how to build high-performance networks that provide the high-speed connectivity required for AI workloads.

Let’s discuss how we can create GPU clusters by initially configuring smaller setups and then gradually expanding them to incorporate thousands of GPUs. We’ll use NVIDIA design guidelines as the example here, which are rooted in the tradition of High-Performance Computing (HPC) networks.

According to NVIDIA’s recommendations in this set of design guidelines, large GPU clusters are built from smaller units – scalable units, or pods – of 256 GPUs each. Each pod consists of 8 compute racks and 2 middle-of-the-row networking racks. Connections within and between these pods are established over InfiniBand, a high-speed, low-latency switching protocol, using NVIDIA’s Quantum-2 switches.
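For a rough feel of the scale involved, here is a minimal sketch – assuming the 256-GPU pod size and the 8 compute + 2 networking rack split described above – that sizes a cluster from a target GPU count:

```python
import math

GPUS_PER_POD = 256         # one scalable unit (pod) in the design guideline
COMPUTE_RACKS_PER_POD = 8  # compute racks per pod
NETWORK_RACKS_PER_POD = 2  # middle-of-the-row networking racks per pod

def pod_plan(target_gpus: int) -> dict:
    """Rough pod and rack counts for a cluster built from 256-GPU pods."""
    pods = math.ceil(target_gpus / GPUS_PER_POD)
    return {
        "pods": pods,
        "gpus": pods * GPUS_PER_POD,
        "compute_racks": pods * COMPUTE_RACKS_PER_POD,
        "network_racks": pods * NETWORK_RACKS_PER_POD,
    }

print(pod_plan(16_384))
# {'pods': 64, 'gpus': 16384, 'compute_racks': 512, 'network_racks': 128}
```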

Current InfiniBand switches use 800G OSFP ports, each carrying two 400G Next Data Rate (NDR) ports. Each 400G port uses 8 fibers, resulting in 64x400G ports per switch. It's highly likely that the forthcoming generation of switches, whatever name they carry, will adopt Extreme Data Rate (XDR) speeds. That translates to 64x800G ports per switch, again with 8 fibers per port – mostly single-mode fiber. This 4-lane (8-fiber) pattern is a recurring motif in the InfiniBand roadmap, summarized in Table-1, with even faster speeds to come.

Table-1: InfiniBand roadmap link speeds

Full name                  1X (lane)   4X*
Enhanced Data Rate (EDR)   25G         100G
High Data Rate (HDR)       50G         200G
Next Data Rate (NDR)       100G        400G
Extreme Data Rate (XDR)    200G        800G
Gigantic Data Rate (GDR)   400G        1600G

* Link speeds specified in Gb/s at 4X (4 lanes)
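The 4-lane pattern can be spelled out directly: a 4X port aggregates four lanes, and with two fibers per lane (one transmit, one receive) every generation keeps landing on 8 fibers per port. An illustrative sketch, using the lane rates from Table-1:

```python
# InfiniBand roadmap lane rates in Gb/s, per Table-1
LANE_RATE_GBPS = {"EDR": 25, "HDR": 50, "NDR": 100, "XDR": 200, "GDR": 400}

LANES_PER_PORT = 4   # a 4X port aggregates four lanes
FIBERS_PER_LANE = 2  # one transmit fiber + one receive fiber per lane

for gen, lane_rate in LANE_RATE_GBPS.items():
    port_rate = lane_rate * LANES_PER_PORT       # e.g. NDR: 4 x 100G = 400G
    fibers = LANES_PER_PORT * FIBERS_PER_LANE    # always 8 fibers per 4X port
    print(f"{gen}: {port_rate}G over {fibers} fibers")
```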

When it comes to the cabling approach, the prevailing best practice in the high-performance computing (HPC) world has been to employ point-to-point Active Optical Cables (AOCs) – assemblies in which an optical transceiver is permanently terminated on each end of the optical cable that links them.

However, with the introduction of the latest 800G (twin-port NDR) OSFP interfaces sporting Multifiber Push-On (MPO) optical connectors, the landscape has shifted from AOCs to MPO-MPO passive patch cords for point-to-point connections. When considering a single 256-GPU pod, point-to-point connections pose no significant issues. My personal approach would be to opt for MPO jumpers for a more streamlined setup.

Operating at Scale

Things remain relatively smooth up to this point, but challenges emerge at larger scale – for example, 16K GPUs, which requires interconnecting 64 of these 256-GPU pods – due to the rail-optimized nature of the compute fabric used for these high-performance GPU clusters. In a rail-optimized setup, the same-numbered host channel adapter (HCA) port from every compute system is connected to the same leaf switch.

This set-up is said to be vital for maximizing deep learning (DL) training performance in a multi-job environment. A typical H100 compute node is equipped with four dual-port QSFP adapters, translating to 8 uplink ports – one independent uplink per GPU – that connect to eight distinct leaf switches, thereby establishing an 8-rail-optimized fabric.

This design works seamlessly when dealing with a single pod featuring 256 GPUs. But what if the goal is to construct a fabric containing 16,384 GPUs? In such a scenario, two additional layers of switching become necessary. The first leaf switch of each pod connects to every switch in spine group one (SG1), the second leaf switch of each pod connects to every switch in SG2, and so forth. To achieve a fully realized fat-tree topology, a third layer – the core switching group (CG) – must be integrated.
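To make the wiring rule concrete, here is a minimal sketch of the rail-to-leaf and leaf-to-spine-group assignment described above. The naming (pod0-leaf0, SG1) is purely illustrative, and the 32-nodes-per-pod figure simply follows from 256 GPUs at 8 GPUs per node:

```python
RAILS = 8           # uplink ports per compute node, one per GPU
NODES_PER_POD = 32  # 32 nodes x 8 GPUs = 256 GPUs per pod

def leaf_switch(pod: int, rail: int) -> str:
    """Rail-optimized: rail r of every node in a pod lands on the same leaf switch r."""
    return f"pod{pod}-leaf{rail}"

def spine_group(rail: int) -> str:
    """The rail-r leaf switch in every pod uplinks to spine group r."""
    return f"SG{rail + 1}"

# All 32 nodes in pod 0 send their rail-0 uplink to the same leaf switch...
assert {leaf_switch(0, rail=0) for _ in range(NODES_PER_POD)} == {"pod0-leaf0"}

# ...and each rail's leaf switch, in every pod, uplinks to its own spine group.
print([spine_group(r) for r in range(RAILS)])  # ['SG1', 'SG2', ..., 'SG8']
```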

Let's revisit the numbers for a 16,384-GPU cluster once more. Establishing connections between compute nodes and leaf switches (8 per pod) requires 16,384 cables – 256 MPO patch cords per pod. As the network expands, the task of establishing leaf-spine and spine-core connections becomes more challenging: multiple point-to-point MPO patch cords must first be bundled and then pulled across distances that can range from 50 to 500 meters. The resulting switch and cable counts are summarized below.

Compute Node Count        2048
GPU Count                 16384
Pod Count                 64
Leaf SW Count             512
Spine SW Count            512
Core SW Count             256
Node-Leaf Cable Count     16384
Leaf-Spine Cable Count    16384
Spine-Core Cable Count    16384
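The counts in the table can be reproduced with a few lines of arithmetic. This is an illustrative sketch, not a vendor sizing tool; it assumes 64-port switches with half the ports facing up at the leaf and spine layers (a non-blocking fat tree) and all core ports facing down:

```python
GPUS = 16_384
GPUS_PER_NODE = 8
GPUS_PER_POD = 256
LEAFS_PER_POD = 8    # one leaf switch per rail
SWITCH_PORTS = 64    # 64 x 400G ports per switch

nodes = GPUS // GPUS_PER_NODE                              # 2048
pods = GPUS // GPUS_PER_POD                                # 64
leaf_switches = pods * LEAFS_PER_POD                       # 512
node_leaf_cables = GPUS                                    # 16384 (one uplink per GPU)
leaf_spine_cables = leaf_switches * (SWITCH_PORTS // 2)    # 16384 (half the leaf ports face up)
spine_switches = leaf_spine_cables // (SWITCH_PORTS // 2)  # 512
spine_core_cables = spine_switches * (SWITCH_PORTS // 2)   # 16384
core_switches = spine_core_cables // SWITCH_PORTS          # 256 (all core ports face down)

print(nodes, pods, leaf_switches, spine_switches, core_switches)
print(node_leaf_cables, leaf_spine_cables, spine_core_cables)
```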

Could there be a more efficient approach? One option is to employ a structured cabling system with a two-patch-panel design, using high-fiber-count MPO trunks – 144 fibers, for example. This way, we can consolidate 18 MPO patch cords (18x8=144) into a single Base-8 trunk cable that can be pulled through the data hall in one go. With patch panels suitable for 8-fiber connectivity and MPO adapter panels at the endpoints, we can then break the trunks out and connect them to our rail-optimized fabric. This method eliminates the need to bundle and pull numerous individual MPO patch cords.

To illustrate, consider the scenario where 256 uplinks are required from each pod for a non-blocking fabric. We can pull 15x144-fiber trunks from each pod, yielding 15x18=270 uplinks – achieved with just 15 cable jackets. This setup also leaves 270-256=14 spare connections, which can serve as backups or be used for storage or management network connections.
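The same trunk math, as a short sketch assuming Base-8 links and 144-fiber trunks:

```python
import math

UPLINKS_PER_POD = 256   # leaf-spine uplinks leaving each 256-GPU pod
FIBERS_PER_UPLINK = 8   # Base-8 MPO link
TRUNK_FIBERS = 144      # high-fiber-count MPO trunk

links_per_trunk = TRUNK_FIBERS // FIBERS_PER_UPLINK            # 144 / 8 = 18
trunks_per_pod = math.ceil(UPLINKS_PER_POD / links_per_trunk)  # ceil(256 / 18) = 15
links_provided = trunks_per_pod * links_per_trunk              # 15 x 18 = 270
spare_links = links_provided - UPLINKS_PER_POD                 # 270 - 256 = 14

print(links_per_trunk, trunks_per_pod, links_provided, spare_links)  # 18 15 270 14
```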

Ultimately, AI has made significant progress in comprehending our questions, and we’ll witness its continued evolution. When it comes to enabling this transition, seeking cabling solutions that can support extensive GPU clusters – whether they comprise 16K or 24K GPUs – is an important part of the puzzle and a challenge that the optical connectivity industry is already rising to meet.


Mustafa Keskin

With over 19 years of experience in the optical fiber industry, he is an accomplished professional currently serving as the Application Solutions Manager at Corning Optical Communications in Berlin, Germany. He excels in determining architectural solutions for datacenter and carrier central office spaces, drawing from industry trends and customer insight research. Previously, he played an important role in the development of the EDGE8 optical cabling system for datacenters as part of a global team, and his expertise extends to publishing articles on innovative applications, such as the utilization of Corning mesh modules in spine and leaf network architecture.

