AI clusters require top-notch connectivity components, given that a single slow GPU link or failed connection can significantly impede performance.

As AI networking environments become more dependent on optical networking components, enterprises need to pay close attention to the reliability of those high-speed optics, Cisco warns.

Surging AI investments are on track to reach $5.2 trillion by 2030, according to McKinsey Quarterly. Data center traffic, too, is exploding, with back-end traffic growing 10x every two years and adoption of 800G and 1.6T speeds on the upswing, reports Dell’Oro Group.

A staggering number of GPUs are involved in those networks, which raises questions, said Bill Gartner, senior vice president and general manager of Cisco’s optical systems and optics business. “What’s the implication on the interconnects of all those GPUs? And, in particular, what’s the implication for reliability of that interconnect technology?” Gartner said in an interview with Network World.

Reliability is essential given this dramatic shift in data center infrastructure, he noted in a recent blog post: “An analysis from Meta found that a single slow GPU link or failed network connection can reduce cluster performance by up to 40%, leaving expensive GPUs idle and investments stranded. Unlike traditional IP networking, AI workloads require the highest levels of link integrity to ensure continuous training and inference. This puts immense pressure on high-speed optics, which must operate under the intense conditions found in today’s AI data centers.”

Gartner pointed to a 2024 report from researchers at SemiAnalysis that looked at the failure rate of optical transceivers in very large GPU clusters and found that, given the huge number of optical links in a 100,000-GPU deployment, failures could hamper training progress: “One of the most common problems encountered is Infiniband/RoCE link failure. Even if each NIC to leaf switch link had a mean time to failure rate of 5 years, due to the high number of transceivers, it would only take 26.28 minutes for the first job failure on a brand new, working cluster. Without fault recovery through memory reconstruction, more time would be spent restarting the training run in 100,000 GPU clusters due to optics failures, than it would be advancing the model forward,” SemiAnalysis stated.

“And that’s actually pretty consistent with what we’re hearing from customers, primarily hyperscalers, about the pain they’re experiencing with link failures, whether it’s a link flap or whether it’s an actual optic failure in their AI infrastructure,” Gartner said.

Because the GPUs are operating in parallel, the implications are different from those in a classical network. “If we were on video and an error burst occurs, TCP/IP does a pretty good job of bridging that and retransmitting,” Gartner said. “But in AI infrastructure, because the GPUs are operating in parallel, it’s very sensitive to issues that might occur on one link. All those GPUs are exchanging information and synchronized, and so basically, you have to kind of stop the workload and back up to a checkpoint and restart the workload. And that can result in a 40% reduction in the performance of the cluster when you have these link errors occurring.”

“It really suggests to our customers that they need to be focused much more on reliability of the optic,” Gartner said.
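The arithmetic behind that 26.28-minute figure is easy to reproduce. The sketch below is a back-of-envelope calculation, assuming (as the SemiAnalysis scenario implies) one NIC-to-leaf link per GPU and statistically independent failures; the variable names are ours, for illustration only.

```python
# Back-of-envelope reproduction of the SemiAnalysis figure quoted above.
# Assumption (ours): with N independent links, each with the same mean
# time to failure (MTTF), the expected time to the first failure across
# the cluster is roughly MTTF / N.

MTTF_YEARS = 5        # per-link MTTF cited by SemiAnalysis
NUM_LINKS = 100_000   # one NIC-to-leaf link per GPU in a 100,000-GPU cluster

minutes_per_year = 365 * 24 * 60          # 525,600
mttf_minutes = MTTF_YEARS * minutes_per_year

time_to_first_failure = mttf_minutes / NUM_LINKS
print(f"Expected time to first link failure: {time_to_first_failure:.2f} minutes")
# -> 26.28 minutes, matching the SemiAnalysis report
```

Under this simple model, that works out to roughly 50 link failures per day across the cluster, which is why the report argues that, without fault recovery, restarts could consume more time than training.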
Reliability testing reveals weaknesses

In the past, Cisco conducted a reliability test for which it acquired 20 different optics from different suppliers, Gartner recalled. “These were 100G and 400G optics at the time,” all compliant with industry standards, and yet “none of those optics passed our stress test,” he said.

Cisco’s testing environments vary conditions such as temperature, humidity, the voltage level the optic sees on the host, and the skew between the signals coming from the host. “We do all of those things in various combinations,” Gartner said.

While optics might technically comply with industry standards, “what we know is that if they were put into a stressful environment … they wouldn’t perform,” he said, “and so that’s the thing that we’re trying to raise awareness of for our customers.”

“When you start looking at all the different combinations, putting everything at its limit and maybe even a little beyond its limit, how does the optic perform?” Gartner said. “This is kind of a warning flare that we’re sending up to our enterprise customers as well as the neocloud players and the hyperscalers that this is an area that they should be focusing more on.”
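To make the combinatorial nature of that testing concrete, here is a minimal sketch of the kind of sweep Gartner describes. The condition axes mirror the ones he names (temperature, humidity, host voltage, signal skew), but the specific levels and the run_stress_test() hook are hypothetical placeholders, not Cisco’s actual procedure.

```python
# Minimal sketch of a combinatorial stress sweep over the conditions
# Gartner names. All levels below are assumed values for illustration;
# they are not Cisco's actual test limits.
from itertools import product

# Each axis runs from nominal operation up to the limit and slightly beyond.
conditions = {
    "temperature_c":  [25, 55, 70, 75],            # ambient through beyond-limit
    "humidity_pct":   [30, 60, 85, 90],
    "host_voltage_v": [3.135, 3.3, 3.465, 3.5],    # e.g., a 3.3 V rail at +/-5% and past it
    "skew_ps":        [0, 50, 100, 120],           # host-to-module signal skew
}

def run_stress_test(combo: dict) -> bool:
    """Hypothetical hook: drive the optic under 'combo' and report pass/fail."""
    raise NotImplementedError

combos = [dict(zip(conditions, values)) for values in product(*conditions.values())]
print(f"{len(combos)} condition combinations to sweep")  # 4**4 = 256 combinations
```

The point of the exercise is the multiplication: even four levels on four axes yields 256 combinations, which is how an optic can meet each individual spec and still fail when conditions stack.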
Optics can account for a significant portion of hardware costs

Enterprise customers are large consumers of Cisco’s optics, Gartner said, and he made a point of noting how optics pricing is evolving as a share of overall hardware cost. “At 10G, the optic represented about 10% of the cost of the box. The switch or router port was 90%. And at 100G, it’s roughly 50/50. And at 400G, and certainly at 800G, more than half of the cost is in the optic,” he said. “So instead of looking at optics as an accessory with your routers and switches, think about buying optics across your portfolio, because it’s a much bigger part of the spend than it was at 10G.”

The optical components in an enterprise AI cluster can include short-reach optics inside a rack to connect GPUs; higher-speed links to devices very close to the rack; 400G/800G pluggable optics for tying together clusters and leaf fabrics; and coherent pluggable optics for data center interconnects linking facilities kilometers apart.

For its part, Cisco offers a full platform of optical choices across all phases of the optical networking spectrum, from building the optical components (through its Acacia division) and ASICs to providing the software, security and management packages to control the entire networked system. Enterprise-focused competitors such as Arista, HPE/Juniper, Ciena, and Nokia offer certain parts of the full range of optical networking gear; more often than not, they depend on partners, merchant optics and other sources to complete their optical offerings.

Looking ahead: CPO and LPO reliability

As for future challenges, co-packaged optics (CPO) and linear-drive pluggable optics (LPO) will need to pass reliability tests as well. “The theory is that CPO will ultimately have higher reliability,” Gartner said. In CPO, a photonic integrated circuit is connected to the silicon, while the laser is pluggable and separate. “So, if you think about the CPO packaged assembly, that’s the silicon plus the optic chip – it’s basically all a bunch of silicon. It should have very high reliability,” he said.

But caution is required, Gartner added. “The thing that I think we have to be cautious about as an industry is that CPO package assembly will have 1,000 or more optics connections, and that means fiber attached to pieces of silicon. And I think the industry is going to go through a learning curve of making that a very high yield and very highly reliable.” Adding orders of magnitude more fiber connections is going to impact yield early in the CPO lifecycle, he said, “and yield problems generally translate to field problems.”

“Longer term, CPO has the potential of delivering higher reliability. But in the short run, I think we must be very cautious as an industry to make sure that we haven’t introduced issues in the field because of challenges in manufacturing, and especially in the fiber attach,” Gartner said.

It’s different with LPOs, he said. “You’ve got one pluggable that might have to be replaced. So, you’ve certainly got less of a blast radius with LPO.”
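To put rough numbers on the fiber-attach concern Gartner raises for CPO: with 1,000 or more connections per assembly, per-connection yield compounds multiplicatively. The per-attach yields in this sketch are assumed values, purely for illustration.

```python
# Rough illustration of why 1,000+ fiber-attach points per CPO assembly
# put so much pressure on per-connection yield. Per-attach yields below
# are assumed values, not measured data.

FIBER_ATTACHES = 1_000  # "1,000 or more optics connections" per assembly

for per_attach_yield in (0.9999, 0.99999):
    assembly_yield = per_attach_yield ** FIBER_ATTACHES
    print(f"per-attach yield {per_attach_yield:.3%} -> assembly yield {assembly_yield:.1%}")
# 99.990% -> ~90.5%   99.999% -> ~99.0%
```

Even a “four nines” attach process leaves roughly one assembly in ten defective at that connection count, illustrating the yield-to-field-problems dynamic Gartner warns about.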