Untrue

Friday, Sep 14, 2018 · 1100 words · approx 5 mins to read

That’s so hilariously untrue that I don’t even know where to start.

I tweeted that at someone recently, in response to their saying that there are only so many ways you can skin a cat¹ when it comes to GPU design. What I’ll write here as a riposte doesn’t just apply to GPUs; it applies to any complex semiconductor device. GPUs are simply what I’m most familiar with.

The person’s thesis was simple: GPU designs from different vendors are converging, and, after a little prodding, they defended the thesis with the throwaway comment about skinning the cat. I figure that those unfamiliar with the inner workings of modern GPU design might like to understand why it’s untrue.

At the highest level, a GPU is a set of datapaths that allow for programmable processing of structured input data. You put data in, there’s (stateful) programmatic transformation that you define and provide, and new data pops out.

Drilling down into the details, that notion of programmable processing (by which I mean that the datapaths are not completely fixed in function and run user-defined logic) is what tells you that, by definition, there are a great many ways to design something that implements what modern GPUs need to provide.

The design space is only getting larger, too, because of how much more configurable and programmable a GPU needs to be in order to support today’s graphics APIs, with even more programmability coming in the future.

Say you had to design a GPU from scratch that implemented support for just the Vulkan 1.1 Core API and passed all of the conformance tests. You could do that with no more than a single ALU datapath that can only conditionally branch and move single-bit values. It would be really slow, even on a leading-edge foundry process, due to the limits on transistor performance. Not even IBM Research and their fancy graphene transistors are saving that particular day.

You could also design it, say, so that in aggregate it was a SIMD machine that was 7680 datapaths wide to let it work on 2 full scanlines of a UHD (3840 x 2160) framebuffer in parallel, with 256 MiB of embedded SRAM to let it fit a full 256-bit per pixel UHD framebuffer on the chip without tiling. Maybe you also want it to run at 4 GHz or so.
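As a quick sanity check on those figures, here is a minimal Python sketch. The constant and function names are illustrative only; the inputs are the numbers quoted above, and it assumes the framebuffer is stored with no padding or alignment overhead.

```python
# Back-of-the-envelope check of the hypothetical wide-SIMD design.
WIDTH, HEIGHT = 3840, 2160   # UHD resolution, from the text
BITS_PER_PIXEL = 256         # per-pixel storage, from the text
SCANLINES_IN_FLIGHT = 2      # scanlines processed in parallel, from the text

MIB = 1024 ** 2


def framebuffer_bytes(width, height, bits_per_pixel):
    """Raw framebuffer size in bytes, assuming no padding or alignment."""
    return width * height * bits_per_pixel // 8


# One datapath per pixel across two full scanlines.
lanes = WIDTH * SCANLINES_IN_FLIGHT

# Footprint of the whole framebuffer, in MiB.
fb_mib = framebuffer_bytes(WIDTH, HEIGHT, BITS_PER_PIXEL) / MIB

print(f"datapaths: {lanes}")
print(f"framebuffer: {fb_mib} MiB")
```

The framebuffer works out to a little over 253 MiB, so it does squeeze into 256 MiB of SRAM, with very little left over for anything else.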

Maybe you also feel like independent FMAs are the dominant instruction issued in most shader programs (often they are!), so you design it to have independent dual-issue FMA per datapath ’lane’. Maybe you’re certain that real-time graphics needs better support for double-precision math, so each independent FMA is now a full-rate IEEE754 binary64 one.

To drive that machine at full-rate you’d need operand gather hardware that can collect 7680 x 2 (FMAs) x 3 (a * b + c) x 64-bit inputs every cycle. 360 KiB of inputs to be precise.

That 360 KiB of read bandwidth per cycle at 4 GHz is roughly 1300 TiB/s that you need to source from some storage structure on the chip somewhere. Thirteen hundred tebibytes per second across one section of the chip’s internal wiring. You need another 33% on top for writing the results somewhere (a * b + c = d, remember).
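The operand arithmetic can be reproduced with a short Python sketch. The names are illustrative, and it assumes every FMA on every lane issues every cycle with no operand reuse, which is the worst case the text describes.

```python
# Operand bandwidth for the hypothetical 7680-lane, dual-issue FP64 FMA machine.
LANES = 7680                 # datapaths, from the text
FMAS_PER_LANE = 2            # dual-issue FMA per lane
INPUTS_PER_FMA = 3           # a * b + c
BYTES_PER_VALUE = 8          # IEEE 754 binary64
CLOCK_HZ = 4_000_000_000     # 4 GHz

KIB = 1024
TIB = 1024 ** 4

# Bytes that operand gather must collect every cycle.
read_bytes_per_cycle = LANES * FMAS_PER_LANE * INPUTS_PER_FMA * BYTES_PER_VALUE

# Sustained read bandwidth at the target clock.
read_tib_per_s = read_bytes_per_cycle * CLOCK_HZ / TIB

# Each FMA also produces one result (d) that has to be written somewhere.
write_bytes_per_cycle = LANES * FMAS_PER_LANE * BYTES_PER_VALUE

print(f"reads per cycle: {read_bytes_per_cycle // KIB} KiB")
print(f"read bandwidth: {read_tib_per_s:.0f} TiB/s")
print(f"write overhead: {write_bytes_per_cycle / read_bytes_per_cycle:.0%}")
```

The read figure comes out at about 1341 TiB/s, which the text rounds down to 1300, and the writes add one third on top because each FMA has one output for its three inputs.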

Good luck designing and manufacturing something like that today on any foundry process, never mind the rest of the logic you’d need for a functioning design. You need to skin the cat differently. I chose some pithy examples that look at the problem from the very top level of understanding at two very different ends of the GPU design spectrum, discarding the real implementation details that GPU hardware designers need to consider, to make the point.

The reality is those implementation details mean that even though both of those designs, assuming you could even manufacture the latter, would support the same mentioned API, they only do so similarly at a certain macro level of detail.

Knowing that it depends on viewpoint, let’s cut the person a break and zoom in to the level at which most people consider GPUs today: their basic structure and the associated “speeds and feeds”. Let’s also assume that we’re talking about designs that are competitive and manufacturable, unlike the two above. So something realistic that you could ask a foundry to make, that would play today’s games and support today’s mainstream graphics APIs, and that would be competitive with other GPUs on the market.

Using the excellent GPUDB, we can avoid another theoretical analysis and take a look at Pascal and Turing, two real-world examples of the point I’m trying to make. Comparing GeForce GTX 1080 Ti and the latest GeForce RTX 2080 is incredibly illuminating: both GPUs are wide SIMD machines with similar memory bandwidth and overall throughputs. The older product is faster on paper too, if you look at those top-level headline figures.
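To put rough numbers on "faster on paper", here is a minimal Python sketch of the usual headline calculations. The specifications are assumptions taken from NVIDIA's published reference figures as best recalled (not read from GPUDB), and the function names are illustrative; shipping cards clock differently, so treat the outputs as approximate.

```python
# Headline "speeds and feeds" from public reference specs (assumed values).

def peak_fp32_tflops(fp32_lanes, boost_mhz):
    """Peak FP32 rate, counting one FMA per lane per clock as 2 operations."""
    return 2 * fp32_lanes * boost_mhz / 1e6


def mem_bandwidth_gbs(bus_bits, gbps_per_pin):
    """Peak DRAM bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_bits / 8 * gbps_per_pin


# Assumed reference specs: (FP32 lanes, boost MHz, bus width bits, Gbps per pin)
GTX_1080_TI = (3584, 1582, 352, 11)   # GP102, Pascal
RTX_2080 = (2944, 1710, 256, 14)      # TU104, Turing

for name, (lanes, mhz, bus, rate) in [("GTX 1080 Ti", GTX_1080_TI),
                                      ("RTX 2080", RTX_2080)]:
    tflops = peak_fp32_tflops(lanes, mhz)
    gbs = mem_bandwidth_gbs(bus, rate)
    print(f"{name}: {tflops:.2f} TFLOPS FP32, {gbs:.0f} GB/s")
```

On these assumed figures the older part leads on both counts, by around a tenth in each case.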

Yet the newer Turing-implementing TU104 GPU in the GeForce RTX 2080 is often significantly faster in practice, and that’s before we start to take into account the biggest micro-architectural differences in Turing compared to Pascal, which underpins the GP102 GPU that powers the GeForce GTX 1080 Ti: hardware ray-tracing, and new-ish ML-focused datapaths. In today’s games the changes to the main arithmetic datapaths are likely the main driving force behind Turing’s improved per-clock performance.

Turing and Pascal are similar architectures from a particular viewpoint, from the same chip designer no less, with similar on-paper potential, but there is significant lower-level divergence in the implementation details that makes the newer design faster in practice, despite its clear on-paper disadvantages.

With that level of generation-to-generation evolution in GPU design, refuting the person’s original thesis that there are only so many ways to implement a GPU in the real world is easy. The contrived and frankly ridiculous examples above were proof, but even when things look similar they can hide differences significant enough to really matter.

GPUs are not getting more similar as time goes by. Wide SIMD with lots of memory bandwidth is not the best way to describe a GPU compared to a CPU, and analysing GPU architectural convergence at that level means you’ll miss all of the interesting developments in GPUs.

¹ As an animal lover, what a horrible turn of phrase. I hate using it.