

From GTC to OFC (22): Optical Networks for Machine Learning

2025-05-09    Cfol

This topic was one of the most interesting ones for me as the editor, and it was also the only short course this year whose registration closed early: SC359, "Networks for Data Centers and Machine Learning", presented by Hong Liu and Ryohei Urata of Google. Interestingly, when I arrived at the venue that day, this was the only session without a gatekeeper checking tickets, and people could enter freely. Since I had not registered, I did not dare to sit in on the lecture. Instead, I managed to borrow the lecture notes after the class, and I would like to share some of the content with you.


The first concept worth sharing is the warehouse-scale computer. There is very little introduction to this concept in Chinese online. Roughly defined, a warehouse-scale computer is a hierarchically organized system equipped with a large number of processors, capable of exploiting both request-level and data-level parallelism. These systems form the core of the cloud infrastructure at companies such as Google and Amazon and are crucial for handling both interactive and batch applications in the cloud. Hong Liu and her colleagues put it this way: "It's not a warehouse that stores computers, but a computer the size of a warehouse or a campus, in which all the nodes cooperate with one another. To achieve this, a low-cost interconnect is required to connect these servers and switches."


The second point concerns the parallelism required of such a computer cluster. A single server includes hardware such as CPUs, DRAM, hard disks, and flash memory. Roughly 40-80 servers, together with switches, form a rack, and dozens of racks, along with further switches, form a cluster. Traditional data-center clusters keep the hardware as cheap as possible rather than pursuing high performance; software is used to provide system-level reliability, and strong parallelism is required because the workloads of the Internet are inherently parallel. In addition, overall performance is limited by I/O capability. The new generation of AI computing clusters demands more of each server: specialized CPUs/GPUs/TPUs, high single-machine performance, and even greater parallelism, mainly to meet the needs of large language models (LLMs).


Thirdly, from 2011 to 2021, traffic inside Google's data-center network grew 235-fold. According to the lecture notes, data-center computing clusters require the network to be as non-blocking as possible (giving application and software engineers enough freedom to parallelize their services), with a bandwidth-rich architecture, low end-to-end latency, redundancy, and reliability. The performance of the whole network is therefore determined mainly by the topology (torus ring, Clos, folded Clos, fat tree), routing, and flow control, while cost is driven mostly by the topology, the switching ASICs, and the various interconnect options. Google's data-center architecture has now reached its sixth generation, and the total bandwidth of the Clos fabric has grown from 2 Tbps to 10 Tbps, 100 Tbps, 200 Tbps, 1.3 Pbps, and finally 6.5 Pbps.
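The generation-over-generation bandwidth figures quoted above imply a striking cumulative growth factor, which a few lines of Python make explicit (the per-generation values are the ones cited in the talk, expressed in Tbps):

```python
# Total Clos-fabric bandwidth across Google's six data-center network
# generations, in Tbps, as quoted in the lecture notes.
generations = [2, 10, 100, 200, 1300, 6500]

growth = generations[-1] / generations[0]
print(f"cumulative growth: {growth:.0f}x")  # cumulative growth: 3250x
```

So the fabric bandwidth grew by a factor of 3,250 over the six generations, a useful companion number to the 235x traffic growth from 2011 to 2021.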


On the topic of network topology, the lecture notes also devoted a page to NxN non-blocking optical cross-connect products. Commercial products with more than 100 ports at 100G per port are already available.


Fourthly, switches with a high radix can support higher aggregate bandwidth but are harder to build. Over nearly 20 years, Google's switches have evolved from Firehose in 2006 to Jupiter 4.0 in 2024, with bandwidth capacity increasing by almost 5,000 times.
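The relationship between radix and switch capacity is simple to state: a switch's aggregate bandwidth is its port count (radix) times the per-port rate. The port counts and rates below are illustrative assumptions for the trend, not the actual Firehose or Jupiter specifications:

```python
# Aggregate switch bandwidth = radix (ports) x per-port rate.
# The example figures are hypothetical, chosen only to show the trend.
def switch_bandwidth_tbps(radix: int, port_gbps: int) -> float:
    return radix * port_gbps / 1000.0

early = switch_bandwidth_tbps(radix=48, port_gbps=10)     # mid-2000s class
modern = switch_bandwidth_tbps(radix=512, port_gbps=100)  # high-radix ASIC
print(f"{early} Tbps -> {modern} Tbps")  # 0.48 Tbps -> 51.2 Tbps
```

Raising radix and per-port rate together is what multiplies capacity, which is why high-radix ASICs are so valuable and so hard to manufacture.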


Fifthly, on the data-center network for machine learning specifically. The Google TPU v5p, launched at the end of 2023, delivers 459 teraFLOPS (trillion floating-point operations per second) in bfloat16 (16-bit floating point) or 918 teraOPS (trillion integer operations per second) in Int8 (8-bit integer), supports 95 GB of high-bandwidth memory, and can move data at 2.76 TB/s. The machine-learning network places heavy demands on parallelism and can be divided into the host network at the top, the scale-out network in the middle (low latency; DCN or InfiniBand), and the scale-up network at the base (from 10 GPUs to 1,000 TPUs).
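The quoted TPU v5p numbers are internally consistent: the Int8 figure is exactly double the bfloat16 figure, since 8-bit operands pack twice as densely as 16-bit ones on the same multipliers. A quick sanity check, plus a roofline-style ratio derived from the quoted HBM bandwidth (the 166 FLOPs/byte figure is my own arithmetic, not from the notes):

```python
# TPU v5p figures as quoted in the lecture notes.
bf16_tflops = 459    # bfloat16 throughput, teraFLOPS
int8_tops = 918      # Int8 throughput, teraOPS
hbm_tb_s = 2.76      # HBM bandwidth, TB/s

# Int8 throughput is exactly 2x bfloat16 throughput.
assert int8_tops == 2 * bf16_tflops

# Roofline-style ratio: bf16 FLOPs the chip must do per byte of HBM
# traffic to stay compute-bound rather than memory-bound.
flops_per_byte = bf16_tflops / hbm_tb_s
print(f"{flops_per_byte:.0f} FLOPs/byte")  # 166 FLOPs/byte
```

That high arithmetic-intensity threshold is one reason LLM workloads lean so heavily on the interconnect: anything that cannot keep data local must be fed over the network instead.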


The bandwidth requirement of the scale-up network is at least an order of magnitude higher than that of the scale-out network; it emphasizes local communication, is more sensitive to $/Gbps and pJ/bit, is very latency-sensitive, and requires a very simple endpoint architecture. The scale-out network serves highly optimized "collective" libraries, and the applications are very sensitive to latency jitter. A sentence in the lecture notes caught my attention: "An ML superpod is not only about bandwidth but also about scale." To explain: the design of a superpod must balance high compute with high-speed communication. Larger pods mean lower DCN bandwidth requirements and more flexible model architectures. Optical interconnect technology is mainly used to build larger systems for high-bandwidth, short-reach, low-latency applications. The choice of FEC is crucial for low latency (light in optical fiber incurs about 5 ns of delay per metre; a signal in copper about 4.5 ns per metre).
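The per-metre delay figures above translate directly into an end-to-end propagation budget. A minimal sketch, where the link lengths are illustrative assumptions rather than values from the notes:

```python
# Propagation delay per metre, from the figures in the lecture notes.
NS_PER_M_FIBER = 5.0    # light in optical fiber
NS_PER_M_COPPER = 4.5   # electrical signal in copper

def propagation_ns(length_m: float, ns_per_m: float) -> float:
    return length_m * ns_per_m

# A hypothetical 100 m fiber run across a pod vs a 2 m in-rack copper cable:
print(propagation_ns(100, NS_PER_M_FIBER))  # 500.0 (ns)
print(propagation_ns(2, NS_PER_M_COPPER))   # 9.0 (ns)
```

At pod scale, propagation alone already costs hundreds of nanoseconds, which is why a heavyweight FEC that adds further fixed latency can dominate the budget and why FEC selection matters so much here.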

[Figure: units of computing power]


Sixthly, Hong Liu and her team also devoted considerable space to progress in optical-module technology; let me focus on the content regarding 200G per lane. This is the foundation of 1.6T and higher-speed optical modules and the key to improving AI network performance today. The main difficulty lies in the electrical channel. One solution is a 2 nm CMOS process; other possibilities include optical chiplets, CPO, and co-packaged copper. To increase optical bandwidth, the levers are the number of optical channels per dimension (WDM, SDM, I/Q, polarization), the symbol rate (10G, 25G, 50G), and the number of bits encoded per symbol per dimension (2, 2.5, or 3 b/symbol). Each technology has its own advantages and disadvantages. In the current debate between IM-DD and coherent technology, the key lies in the bandwidth of the devices themselves: to reach the same per-wavelength rate, different modulation schemes place different demands on device bandwidth. 200 Gbps per channel may be the crossover point between IM-DD and coherent today. If device bandwidth cannot break through about 40 GHz, then to reach a higher per-wavelength rate, coherent technology may be the only solution (I/Q and polarization multiplexing may also be options).
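The trade-off between symbol rate, bits per symbol, and the extra dimensions coherent transmission adds can be sketched as a simple product. This is a generic illustration of the levers named above, ignoring FEC and coding overhead, and the example baud rates are round-number assumptions rather than any specific standard's figures:

```python
# Per-lane rate = symbol rate x bits per symbol x multiplexed dimensions.
# FEC/coding overhead is ignored; baud rates are illustrative.
def lane_rate_gbps(baud_g: float, bits_per_symbol: float,
                   dimensions: int = 1) -> float:
    return baud_g * bits_per_symbol * dimensions

# IM-DD PAM4 (2 b/symbol) needs ~106 GBd of device bandwidth for ~212 Gbps:
print(lane_rate_gbps(106.25, 2))  # 212.5

# Coherent DP-16QAM (4 b/symbol, 2 polarizations) reaches 200 Gbps
# at only 25 GBd, trading device bandwidth for modulation complexity:
print(lane_rate_gbps(25, 4, 2))   # 200.0
```

The sketch shows why ~200 Gbps per lane is the crossover: IM-DD gets there only by pushing device bandwidth near its practical limit, while coherent gets there with modest baud rates by spending bits/symbol and polarization instead.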


Finally, it is impossible not to mention Google's OCS, a technology that drew wide attention when TPU v4 was released. In my view, Hong Liu and her team did not rate OCS especially highly. Compared with packet switching, circuit switching sends all packets over the same physical path with no store-and-forward stage, giving the minimum end-to-end latency. Compared with a traditional Clos-based switching architecture, Google's Apollo architecture replaces the original spine switches with OCS, together with circulators (a key exhibit of Triple-stone this year) and WDM optical transceiver modules. Removing the spine layer reduces cost, power consumption, and latency while supporting a more flexible architecture. Google sees flexibility as the biggest advantage of introducing OCS, along with cost reduction. The disadvantages, compared with ASIC-based switching, are limited functionality, slow switching speed, the need for a basic control plane, and high reliability requirements. The OCS technical approaches listed in the lecture notes include MEMS, robotic, piezoelectric, guided-wave, and wavelength switching.


There were in fact many discussions at this year's OFC about the optical-communication technologies required by AI network architectures; almost every session was full, which shows how much everyone cares about this question. Many issues remain worth exploring, such as the relationship between copper and optics, whether OCS is really necessary, and whether higher per-channel bandwidth can be achieved. Everyone is welcome to discuss them with us at our CFCF Optical Connection Conference (June 16-18 in Suzhou).