
How Cerebras Made Inference 3X Faster: The Innovation Behind the Speed

5 min read · Oct 26, 2024


Cerebras Systems has broken its previous industry record for inference performance, achieving 2,100 tokens/second on Llama 3.1 70B. This is significantly faster than any known GPU solution or hyperscale cloud offering. Cerebras Inference enables faster processing for large models, unlocking new AI use cases with real-time, high-quality responses and increased user engagement.
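To put that throughput in perspective, here is a small back-of-the-envelope sketch of what 2,100 output tokens per second means for response latency, assuming decode throughput dominates and ignoring time-to-first-token and network overhead (the reply lengths are hypothetical examples):

```python
# Rough latency estimate at the reported decode throughput.
# Assumption: 2,100 output tokens/s; time-to-first-token and
# network overhead are ignored for simplicity.

TOKENS_PER_SECOND = 2_100          # reported Llama 3.1 70B throughput
reply_lengths = [100, 500, 2_000]  # hypothetical response sizes in tokens

for n_tokens in reply_lengths:
    seconds = n_tokens / TOKENS_PER_SECOND
    print(f"{n_tokens:>5} tokens -> ~{seconds:.2f} s to generate")
# e.g. a 500-token answer streams out in roughly a quarter of a second
```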

Background

Cerebras has made a significant leap in AI inference speed, achieving a threefold increase in performance through a single software release, a feat that typically takes GPUs two to three years with new hardware generations. According to Cerebras CEO Andrew Feldman, their solution is enabling early adopters and AI developers to tackle projects previously deemed impossible on GPU-based systems. This new level of performance is poised to transform AI innovation by drastically reducing latency in applications.

How It Was Achieved

Cerebras Inference is driven by the CS-3 system and the Wafer Scale Engine-3 (WSE-3) processor, which deliver high performance and throughput without compromises. The WSE-3 addresses the memory bandwidth bottleneck that limits GPU inference, and the service is compatible with the OpenAI Chat Completions API, offering a lower-cost alternative to other cloud solutions with easy API access. Cerebras Systems, a team of computer architects and engineers focused on accelerating generative AI, builds the world’s largest and fastest AI supercomputer, the CS-3 system. By clustering WSE-3 chips, Cerebras creates scalable supercomputers, and with this technology powering Cerebras Inference, it provides groundbreaking speeds for advanced AI applications.
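Because the service speaks the OpenAI Chat Completions protocol, an existing OpenAI client can typically be pointed at it by swapping the base URL. Below is a minimal sketch using the openai Python package; the endpoint URL, environment variable, and model identifier are assumptions for illustration, so check the Cerebras documentation for the exact values:

```python
# Minimal sketch: calling an OpenAI-compatible Chat Completions endpoint.
# The base_url, env variable, and model name below are assumptions based on
# the article's claim of OpenAI API compatibility; consult the provider docs
# for the exact values and authentication details.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env variable
)

response = client.chat.completions.create(
    model="llama3.1-70b",                    # assumed model identifier
    messages=[
        {"role": "user", "content": "Summarize wafer-scale inference in one sentence."}
    ],
)
print(response.choices[0].message.content)
```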

Wafer Scale Engine-3

WSE-3 AI Chip

The world’s largest AI chip, the Cerebras Wafer Scale Engine 3, packs over 4 trillion transistors into 46,225 mm² of silicon fabricated on TSMC’s 5 nm process. It has 900,000 cores and 44 GB of memory, distributed alongside the cores to keep data and compute close together.
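A quick worked calculation shows why distributing the memory matters: dividing the stated totals across the cores gives each core roughly 49 KB of local memory on about 0.05 mm² of silicon. This is only a sketch derived from the figures quoted above:

```python
# Back-of-the-envelope per-core figures derived from the specs above.
AREA_MM2 = 46_225          # die area in mm²
CORES = 900_000            # cores on the wafer
ON_CHIP_MEMORY_GB = 44     # memory distributed next to the cores

mem_per_core_kb = ON_CHIP_MEMORY_GB * 1e9 / CORES / 1e3
area_per_core_mm2 = AREA_MM2 / CORES

print(f"~{mem_per_core_kb:.0f} KB of on-chip memory per core")  # ~49 KB
print(f"~{area_per_core_mm2:.3f} mm² of silicon per core")      # ~0.051 mm²
```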

Size Comparison: WSE-3 vs. NVIDIA H100

What Sets It Apart

The WSE-3 (Wafer Scale Engine 3) and the Nvidia H100 show a stark contrast in several key specifications. The WSE-3 features a massive chip size of 46,225 mm² compared to the H100’s 814 mm². This size advantage allows the WSE-3 to house 900,000 cores, vastly surpassing the H100’s 16,896 FP32 cores and 528 Tensor cores. The WSE-3 also offers 44 gigabytes of on-chip memory, while the H100 has only 0.05 gigabytes. Memory bandwidth shows a similar gap: the WSE-3 delivers 21 petabytes per second versus the H100’s 0.003 petabytes per second. Fabric bandwidth is just as lopsided, with the WSE-3 achieving 214 petabits per second against the H100’s 0.0576 petabits per second. These differences reflect the WSE-3’s architecture, designed for massive data throughput and on-chip computation, in contrast to the more traditional approach of the Nvidia H100.
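The gap is easier to appreciate as ratios. The sketch below simply recomputes the multiples implied by the figures quoted above; no numbers beyond those in the comparison are assumed:

```python
# Spec ratios implied by the WSE-3 vs. H100 comparison above.
specs = {
    # metric: (WSE-3, H100)
    "chip area (mm²)":         (46_225, 814),
    "on-chip memory (GB)":     (44, 0.05),
    "memory bandwidth (PB/s)": (21, 0.003),
    "fabric bandwidth (Pb/s)": (214, 0.0576),
}

for metric, (wse3, h100) in specs.items():
    print(f"{metric:<26} {wse3 / h100:,.0f}x")
# chip area ~57x, on-chip memory ~880x,
# memory bandwidth ~7,000x, fabric bandwidth ~3,715x
```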

CS-3 system

The CS-3 is a compact system that offers high performance computing, delivering results in minutes or hours that would typically take days on large multi-rack clusters. It is capable of delivering the performance of many graphics processing units in a single unit, reducing the need for multiple servers and lowering costs. The CS-3 is efficient, compact, and enables researchers to push their work further with cluster-scale computing in a single device.

The CS-3 offers AI researchers and data scientists the ability to test more ideas quickly with superior AI compute performance. It provides performance gains in a more efficient package, offering significant compute advantages over GPUs with lower power consumption.

Comparing the Cerebras CS-3 with the B200

The Cerebras CS-3 demonstrates remarkable performance and scalability compared to traditional systems like the B200, DGX B200, and GB200 NVL72. With a staggering 1,200,000 GB of memory, it dwarfs the memory capacities of these other systems, which peak at 13,500 GB for the NVL72. The CS-3 achieves 125 PFLOPs in FP16 performance, significantly outperforming the B200’s 4.4 PFLOPs and DGX B200’s 36 PFLOPs, though it falls short of the NVL72’s 360 PFLOPs. In terms of bandwidth, the CS-3 offers 26,750 TB/s, which far exceeds the NVLink fabric bandwidth of the other systems, such as 130 TB/s for the NVL72. Despite consuming a high power of 23,000 watts, the CS-3 maintains an efficiency of 0.005 PFLOPs/W, which is comparable to or better than most competing systems. When scaled, the CS-3 shows impressive improvements, offering up to 28.4 times the performance of the B200 and significantly higher PFLOPs per watt ratios against all other systems, showcasing the advantages of Cerebras’ design for specific AI workloads.
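As a sanity check on the efficiency and speedup figures quoted above, this short sketch recomputes them from the stated peak FP16 performance and the CS-3’s power draw; only numbers already given in this comparison are used:

```python
# Recomputing the headline ratios from the figures quoted above.
CS3_PFLOPS_FP16 = 125
CS3_POWER_W = 23_000

peers_pflops_fp16 = {"B200": 4.4, "DGX B200": 36, "GB200 NVL72": 360}

efficiency = CS3_PFLOPS_FP16 / CS3_POWER_W
print(f"CS-3 efficiency: {efficiency:.4f} PFLOPs/W")  # ~0.0054, i.e. the quoted 0.005

for name, pflops in peers_pflops_fp16.items():
    print(f"CS-3 vs {name}: {CS3_PFLOPS_FP16 / pflops:.1f}x FP16 throughput")
# B200 ~28.4x, DGX B200 ~3.5x, GB200 NVL72 ~0.3x (the NVL72 leads on raw FP16)
```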

Summary

In short, Cerebras delivered a 3x inference performance improvement on the Wafer Scale Engine, reaching 2,100 tokens per second on Llama 3.1 70B. That gain equals more than a hardware generation’s worth of performance in a single software release. The team continues to optimize its software and hardware capabilities, and plans to expand model selection, context lengths, and API features in the near future.
