AWS goes further with the Graviton3 server chip

It’s always an exciting time when a new compute engine hits the market, and interest is especially strong in any new Arm server chip entry. At this point, Amazon Web Services is by far the largest consumer of Arm-based server processors in the world, with its Graviton line of processors and Nitro line of DPUs. The latter sits in all of the company’s servers to offload network processing, security, and storage from server CPUs, and the former is becoming an increasingly important part of the AWS server fleet, which numbers in the millions of machines.

At this week’s AWS re:Invent conference, chief executive officer Adam Selipsky talked briefly about the next-generation Graviton3 processors, which the cloud computing giant’s Annapurna Labs chip design division designed and had etched by foundry partner Taiwan Semiconductor Manufacturing Co. But Selipsky did not give out many of the feeds and speeds that we like to see so we can make comparisons with prior generations of Graviton, other Arm server chips, and the X86 and Power processors on the market today. Peter DeSantis, senior vice president of utility computing at AWS, gave a keynote that thankfully provided a bit more detail on Graviton3, and more details leaked out in another session that we have not yet been able to see, but that provides even more insight into what Amazon is doing.

In our impatience, as we await feedback from AWS, we have tried to figure out how the Graviton3 might be configured and what impact it might have on the AWS fleet over the coming year, when it should become more widely available through EC2 instances.

Let’s start with what we know. Here is the graph Selipsky showed:

And here’s a picture of a three-node Graviton3 server tray that DeSantis showed off, with a few basic feeds and speeds.

If you thought that AWS was going to get on the Core Count Express and drive up to 96 or 128 cores with Graviton3, and use a process shrink down to 5 nanometers to help push up the frequency a bit too, you will be surprised to learn that the cloud provider instead settled on 64 cores and barely changed the clock speed, with a 100 MHz increase up to 2.6 GHz compared to Graviton2, which we detailed at launch and for which we did a price/performance analysis when X2 and R6 instances based on it became available.

Just to recap, here’s a table we’ve put together showing the feeds and speeds of the three generations of Graviton processors:

Items in bold red italics are estimates for data that AWS has not provided.

In his opening keynote, DeSantis was very clear about why AWS went in the direction it did with the Graviton3 chip.

“Like I said last year, the most important thing we do with Graviton is to stay focused on the performance of real world applications – your applications,” DeSantis explained. “When you’re developing a new chip, it can be tempting to design a chip based on those sticker stats – CPU frequency or number of cores. And while these things are important, they are not the end goal. The end goal is the best performance and the lowest cost for actual workloads.”

Wait a minute. That doesn’t sound like something Intel or AMD would say. … And that is exactly why hyperscalers and cloud builders are designing their own silicon. They have the critical mass to be able to afford it, and they are keen to drive up the use of their services while spending as little as possible on the silicon they control.

Rather than trying to make the Graviton3 chip bigger with more cores or faster with higher clock speeds, what AWS did instead was make the cores themselves much beefier. And to be very precise, it looks like AWS has moved from Arm Holdings’ “Ares” Neoverse N1 core, used in Graviton2, to the “Perseus” Neoverse N2 core with Graviton3.

There are rumors that it uses the “Zeus” V1 core, which has a pair of 256-bit SVE vector units, but the diagrams we have seen only show a total of 256 bits of SVE, and the N2 core has a pair of 128-bit SVE units, so it looks like the N2 core is what is being used. We are seeking confirmation from AWS on this right now. The V1 core was aimed more at HPC and AI workloads than at traditional general-purpose compute work. (We detailed the Neoverse roadmap and the V1 and N2 core plans back in April.)

AWS is also apparently moving to some sort of chiplet design, but not in the way that AMD has done and Intel will do with their respective Epyc and Xeon SP processors.

The V1 core is bigger in a lot of ways than the N1 core, and it is this fact that allows AWS to drive more performance. There are also bigger vector units, 256-bit SVE units to be precise, which allow more data, often at lower precision for AI workloads in particular, to be chewed through, dramatically increasing performance per clock cycle.

The N1 core used in the Graviton2 chip had a 4-wide to 8-wide instruction fetch unit and a 4-wide instruction decoder that fed an 8-wide issue unit, which included two SIMD units, two load/store units, three arithmetic units, and one branch unit. With the “Perseus” N2 core used in Graviton3, there is an 8-wide fetch unit that feeds a 5-wide to 8-wide decode unit, which in turn feeds a 15-wide issue unit, basically twice as wide as that of the N1 core used in Graviton2. The vector engines are twice as wide (and support BFloat16 mixed-precision operations), and the load/store, arithmetic, and branch units are all doubled up as well. To get more performance, compilers have to keep as many of these units doing something useful as possible.
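To put those wider vector engines in perspective, here is a back-of-envelope peak throughput calculation. It assumes two 256-bit SVE pipes per core, each capable of one fused multiply-add per cycle, at the stated 2.6 GHz clock; the pipe count and FMA capability are our assumptions for illustration, not AWS-confirmed figures.

```python
# Back-of-envelope fp32 peak for Graviton3, under our assumptions:
# two 256-bit SVE pipes per core, one FMA per lane per cycle, 2.6 GHz.
GHZ = 2.6
CORES = 64
SVE_PIPES = 2
LANES_FP32 = 256 // 32       # eight fp32 lanes per 256-bit pipe
FLOPS_PER_FMA = 2            # one multiply plus one add per lane per cycle

per_core_gflops = SVE_PIPES * LANES_FP32 * FLOPS_PER_FMA * GHZ
chip_gflops = per_core_gflops * CORES
print(per_core_gflops)   # 83.2 GFLOPS fp32 per core
print(chip_gflops)       # 5324.8 GFLOPS fp32 across 64 cores
```

That is peak on paper, of course; real workloads only approach it when the compiler or hand-tuned kernels keep both vector pipes busy, which is exactly the point DeSantis was making.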

According to the report from SemiAnalysis, which is presumably based on a presentation given at re:Invent 2021, the 64 cores of the Graviton3 chip are on one chiplet, the two PCI-Express 5.0 controllers each get their own chiplet, and the four DDR5 memory controllers each get their own chiplet, for a total of seven chiplets. These are linked together using 55 micron microbump technology, and the Graviton3 package is soldered directly onto a motherboard rather than being plugged into a socket. All of this cuts costs and, importantly, reduces the heat that would otherwise have been generated pushing signals across much larger bumps. We are going back to AWS to find out more about this. Stay tuned.

The important thing to note in the Graviton3 design is not the cores, but the DDR5 memory and PCI-Express 5.0 peripherals that will be used to feed those cores. The Graviton3 is the first chip to deliver PCI-Express 5.0 and DDR5, and the former can deliver the same high bandwidth with half the number of lanes of its PCI-Express 4.0 predecessor, while the latter can deliver 50 percent more memory bandwidth at the same capacity and in the same power envelope. When you are AWS, you can control your hardware stack, have someone make a PCI-Express 5.0 controller and DDR5 memory sticks for you, and be on the cutting edge of technology. We believe that the L1 and L2 caches on the cores that Graviton3 uses will be the same as with Graviton2, but that the L3 cache will be doubled. But as the bold red italics show, this is just a guess on our part.
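As a sanity check on that 50 percent memory bandwidth claim, here is a rough calculation assuming eight memory channels on both chips, with DDR4-3200 on Graviton2 and DDR5-4800 on Graviton3; the channel counts and transfer rates are our guesses for illustration, not figures AWS has confirmed.

```python
# Rough DDR4 vs DDR5 bandwidth comparison under our assumed configs.
BYTES_PER_TRANSFER = 8       # 64-bit data path per channel
CHANNELS = 8                 # assumed for both generations

g2_mbs = CHANNELS * 3200 * BYTES_PER_TRANSFER   # DDR4-3200 on Graviton2
g3_mbs = CHANNELS * 4800 * BYTES_PER_TRANSFER   # DDR5-4800 on Graviton3

print(g2_mbs / 1000)     # 204.8 GB/s
print(g3_mbs / 1000)     # 307.2 GB/s
print(g3_mbs / g2_mbs)   # 1.5, i.e. the 50 percent uplift AWS quotes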

There has been some talk of Graviton3 using 60 percent less power than Graviton2, but we believe these were special cases, and as far as we can tell, Graviton3 will come in at roughly the same 100 watt thermal design point. We will try to get clarification on this from AWS. There is also a lot of talk that Graviton3 is an ARMv9 architecture chip, but that is not the case if it uses either the V1 core or the N2 core, and it does use one of those. ARMv9 is coming, but not quite yet. Look for that with Graviton4, perhaps.

Performance improvements for applications moving from Graviton2 to Graviton3 will vary depending on the nature of the applications. Here’s what the performance improvements look like for web infrastructure workloads:

That cited 25 percent performance improvement is a low-end estimate, not an average to be taken at face value. As you can see, the NGINX web server sees a 60 percent performance increase moving to Graviton3.

There are similar performance bumps for applications that have elements of their code that can be vectorized, such as video encoding and encryption, and run through those SVE units:

One of the workloads that can be run through those SVE vector units is machine learning inference, and this is where Graviton3 is really going to shine, with support for BFloat16 data on those 256-bit vectors:

We strongly suspect that the middle bar in the above graph is meant to indicate Graviton3 – FP32, not Graviton2 – FP32. And as people who move fast and do our share of typos, we’re not going to be judging at all. …
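For the curious, BFloat16 keeps the full fp32 exponent range but truncates the mantissa to 7 bits, which is why fp32 models can be dropped to BF16 for inference so cheaply. A minimal sketch of the conversion (simple truncation toward zero for clarity; real hardware typically rounds to nearest):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate an fp32 value to bfloat16: keep the sign, the full 8-bit
    exponent, and the top 7 mantissa bits; zero out the low 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159265))   # 3.140625, about three decimal digits kept
```

The same dynamic range as fp32 with half the bits per value is what lets those 256-bit vector units chew through twice as many inference operands per cycle.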

We look forward to doing a more in-depth dive on Graviton3 as soon as we can.
