How to build a great (graphics) benchmark

Introduction

Hardware benchmarking is my (professional) life. At Imagination I lead a talented group of engineers whose entire existence is predicated on designing, developing, maintaining and running GPU benchmarks. The data they generate drives an entire class of activity within our wider business: pre-sales, when we’re convincing the customer it’s a good idea to take our technology; later in the lifecycle, when the customer has silicon back and wants to understand interesting facets of performance in the context of a full system-on-chip; and input to future generations of our architecture and its specific configurations.

Custom architecture microbenchmarks have always been part of what sets Beyond3D apart from its peers, and not just while I was its steward. Further back, at Hexus in the early 2000s, great performance analysis was one of the foundational parts of the site that I built, and it’s something that still sets them apart today, nearly a decade later. So I’ve been benchmarking, either for a living or for the love of 3D, for nearly 15 years.

That set of bona fides hopefully gives me decent insight into how to write great performance analysis software, specifically benchmarks, for low-level hardware analysis. I’m going to write as much of it down as I can, in the context of using it to do a public performance analysis of graphics hardware. I say public, but your public might only exist in private, as mine does in a great many cases; either way, there’s always an audience. Conveniently, that’s the first key thing to think about.

Know your data’s audience

The most important thing about a benchmark is knowing what it helps explain to your audience. In the context of graphics performance analysis that’s often lost completely. You really have no business using the data generated by any benchmark, especially industry-standard high-level graphics benchmarks, if you don’t understand what the data is telling you. “Bigger is better” is nowhere near enough insight. As the Tech Report have found out to critical acclaim, a deeper understanding of even simple-looking metrics pays huge dividends in helping people understand what the hardware is doing.

Let’s take the most famous example, 3DMark. If memory serves, it has always generated a weighted composite score at the end of its runtime. It runs more than one test, puts the results into a formula, and outputs a weighted representation of what happened. Faster (bigger) is better in an absolute sense, but it doesn’t tell you anything about how the system generated the score. That doesn’t really matter if all you’re shooting for is the world record, but it does matter if you want to compare one system to another. GPU0’s score being higher than GPU1’s doesn’t actually tell you that GPU0 is better without significant extra insight. I said “system generated the score” for a reason, since there’s always a system influence in GPU performance analysis.

There are two groups that need to care about their audience here. Futuremark, because they wrote 3DMark, and the users who want to understand the data it generates to tell someone something useful (and in their case the audience might well be themselves, of course!).

Futuremark’s obligation is to explain exactly what the tests are doing, including the inner workings of the renderer and exactly what’s being drawn, and that means exactly how it uses the API, so that the audience using it can know what actually happened. “3D rendering happened” isn’t good enough. The user’s obligation is to use that information to understand the benchmark as much as possible, educating themselves enough to explain the outcome to someone. That someone might be them, and they might choose to do so by proxy (using a technical review site like The Tech Report or Anandtech), but that education is paramount.

If you’re the user and want to do it by proxy, you need to know, not just trust, that the reviewer understands the benchmark on your behalf and has understood your need to interpret the data usefully via their prose. Ask them if you don’t know! If you’re the reviewer you need to have educated yourself on behalf of the user, and the benchmark vendor has to have understood you as the audience and given you everything you need to do that. Ask the benchmark vendor if you don’t know! Saying faster is better is useless at every step of the way, for all concerned.

Can you cut corners on knowing your audience? Only at your peril; your business is probably at risk if you do.

Repeatability

It should be obvious, but I’ve seen it done terribly a million times. A benchmark needs to do the same thing under the same conditions every time it is run. For certain kinds of performance analysis it’s not the benchmark’s responsibility to know everything about the conditions it executes under; that responsibility is always the user’s, even if the benchmark helps you in some way to keep the conditions consistent.

So I mean not just a repeatable workload, but also a repeatable test environment. Basic science, you’ll agree, but the environment is often completely overlooked. If you’re benchmarking a smartphone, did you turn off the radios? That’s obvious. Did you also turn off the built-in diagnostic software that lights up the CPU and tries to send crash logs and usage data periodically? You probably clicked OK when the OS asked you about that, because you probably want to use the smartphone as a smartphone later.

Are you doing battery testing? You are? Great! Did you do it in a thermally controlled environment to level the playing field for the battery chemistries, which are temperature sensitive to a high degree? Hopefully.

You get the idea.

For the workload itself, repeatability is hopefully obvious. There’s huge utility in pseudorandom fuzzing of runtime parameters for certain classes of benchmarks, especially microbenchmarks doing parameterised testing over a wide range of possible inputs. But the workload should be as deterministic as possible for all of the inputs under its control. Do the exact same thing every damn time.
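One easy way to keep that kind of fuzzing repeatable is to drive it from a pseudorandom generator with a fixed seed, so every run walks exactly the same sequence of test parameters. A minimal C++ sketch, where the seed, the parameter ranges and runCase are all hypothetical rather than taken from any real benchmark:

    #include <cstdio>
    #include <random>

    int main() {
        // Fixed seed: every run generates the exact same parameter sequence,
        // so results from different runs and devices stay comparable.
        std::mt19937 rng(0xB3CD1234u);

        std::uniform_int_distribution<int>    textureSize(64, 2048);   // texels per side
        std::uniform_real_distribution<float> anisotropy(1.0f, 16.0f); // filtering level

        for (int i = 0; i < 8; ++i) {
            const int   size  = textureSize(rng);
            const float aniso = anisotropy(rng);
            std::printf("case %d: %dx%d texture, %.1fx anisotropy\n", i, size, size, aniso);
            // runCase(size, aniso); // hypothetical: execute the actual workload here
        }
        return 0;
    }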

Time

Speaking of time, it’s perhaps the most important possible input of them all, especially in graphics. It’s also obviously the single most important thing you will want to measure in a benchmark, and you will always measure it in some way even if you didn’t explicitly mean to.

In a high-level benchmark you’ll want to animate something. It’ll be tempting to use the wall clock to drive that animation, but that makes the benchmark completely useless in the many environments that don’t have a “normal” view of time. Instead, drive your animation from a fixed timestep. That guarantees the same number of frames get rendered, which more readily satisfies the repeatability aspect.
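As a minimal C++ sketch of the idea, with updateScene and renderFrame standing in for whatever the benchmark actually does: the animation clock advances by a fixed amount per frame, and the wall clock is never consulted.

    #include <cstdio>

    int main() {
        const double kFixedStep  = 1.0 / 60.0; // seconds of simulated time per frame
        const int    kFrameCount = 600;        // always render exactly this many frames

        double simulatedTime = 0.0;
        for (int frame = 0; frame < kFrameCount; ++frame) {
            // updateScene(simulatedTime); // hypothetical: advance the animation to this time
            // renderFrame();              // hypothetical: submit this frame's API calls
            simulatedTime += kFixedStep;   // never sample the wall clock here
        }
        std::printf("rendered %d frames covering %.1f s of animation\n",
                    kFrameCount, simulatedTime);
        return 0;
    }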

If you can’t control that (say you’re live profiling a game as it runs), make sure you’re aware that the frame count will be variable over the time you record it, and factor that into your analysis. The Tech Report do a fantastic job there.

Then there’s stamping time itself. You always want to know how long the workload took. Sometimes you sample time at the same point before and after the workload happened. Sometimes you just count clock cycles on the hardware and know what frequency those cycles happen at. You’re measuring time regardless.

If you’re taking samples from a running, separate clock, that absolutely has to be a highly-reliable, stable, very high resolution clock. I’ve seen benchmarks have an OS-independent time sampler, and hook it up to a clock on a certain OS that has an oscillating 1 second tick, moderated by NTP. One drifting second today on my GPUs can mean upwards of 500 million clock cycles. That’s a lot of potential work to miscount because your clock sampling is terrible.

Work hard to have a battle-tested view of time that you can really rely on, with microsecond accuracy. It’s difficult, but it makes all of your data that much more worthwhile. Think of it as a multiplier for the accuracy of your benchmark’s data.

Travelling back in time due to outside factors is just a complete no-no. Make sure you understand the properties of your clock source when it comes to that.
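As a sketch of a sane starting point in C++, std::chrono::steady_clock is specified to be monotonic, so it never travels backwards when NTP or the wall clock interferes, and it’s usually high resolution; the loop below is just a stand-in for the real workload.

    #include <chrono>
    #include <cstdio>

    int main() {
        // Monotonic clock: immune to NTP adjustments and wall-clock changes.
        using clock = std::chrono::steady_clock;

        const auto start = clock::now();
        double sink = 0.0;
        for (int i = 0; i < 10000000; ++i) sink += i * 0.5; // stand-in for the real workload
        const auto end = clock::now();

        const long long elapsedUs =
            std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        const long long tickNs =
            std::chrono::duration_cast<std::chrono::nanoseconds>(clock::duration(1)).count();

        std::printf("workload took %lld us (clock ticks every %lld ns, result %f)\n",
                    elapsedUs, tickNs, sink);
        return 0;
    }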

Source code access

This isn’t really applicable for the general public (although I remain convinced that it can be a really unique and positive thing for a public graphics benchmark to have). Source code access is non-negotiable if you want my money. I have to be able to see every aspect of what’s happening in your benchmark for me to pay for it, without tracing it. Tracing is too brittle.

I also need to be able to modify it for my purposes. You probably don’t support my platform’s custom windowing system under Linux unless you’ve worked with me before. You certainly don’t support my custom logging and timing hooks that are unified across all of my performance analysis software, even 3rd party software like yours.

You might not even support Linux but I need your software to execute there, so I need to be able to port it at my own behest.

Your source licensing terms need to be amenable to that. I have to be able to see it and modify it, and I’m more than happy to do so under terms you specify. I regularly work with licensing models that let me do what I want with it to support my operating environment, without fundamentally altering the repeatable aspect of the workload’s execution. In other words, I can tinker to support my analysis environment, but I can’t change your API calls when reporting performance. Fine by me.

Source code quality

If you’re giving someone source access or releasing it publicly, it has to be easy to work with and modify. Use whatever frameworks, libraries and build system you want, but I need to be able to easily follow its flow, especially when it’s loading data and making graphics API calls, and I need to be able to easily integrate my own code into it (see the previous section).

It’s a point that’s worth stressing, especially if you’re in the licensed performance analysis tooling business. I’ve lost count of the number of times I’ve lost time to ‘working’ with code that’s hard to follow, understand and manipulate. Don’t make your bugs my bugs.

For top marks, integrate with common runtime debugging platforms and use common debugging extensions that exist right next to the graphics API.

Build repeatability

If you give someone a source drop or let them check out a stable tag direct from your repository, it must build repeatably and cleanly for all of the platforms you say you support. I’ve lost count of the number of times we’ve integrated a new source drop and one or more platforms or configurations is broken because they’re untested and unloved.

Resolution independence

A graphics benchmark is useless if it doesn’t let you change resolution. Increasingly, the physical display resolution isn’t a factor in modern rendering for many reasons, so just ignore it and make the rendering resolution configurable in your software. Choose sensible defaults, but make it completely adjustable.

If you have a fixed on-screen resolution because of the platform, always add support for rendering to a customisable resolution, then present that to the display with a simple filter (or 1:1 mapped if it makes sense). The upscaling or downscaling filter makes the rendering workload inherit a new set of caching characteristics, so make sure the filters you use are well documented and understandable (and make sense for the image you’re rendering).
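A rough sketch of that pattern against OpenGL ES 3.0, assuming a current context and window-system setup that’s been done elsewhere; createOffscreenTarget and presentToDisplay are names of my own invention, not from any particular benchmark. Render into an off-screen colour target at whatever resolution was configured, then blit it to the display-sized default framebuffer with a documented, simple filter.

    #include <GLES3/gl3.h>

    // Create an off-screen render target at the benchmark's configured resolution,
    // independent of whatever the physical display happens to be.
    GLuint createOffscreenTarget(GLsizei width, GLsizei height, GLuint* outColour) {
        GLuint fbo = 0, colour = 0;

        glGenTextures(1, &colour);
        glBindTexture(GL_TEXTURE_2D, colour);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, colour, 0);

        *outColour = colour;
        return fbo;
    }

    // Present the off-screen result to the default framebuffer with a simple,
    // well-understood linear filter; the scaling step is explicit and documentable.
    void presentToDisplay(GLuint fbo, GLsizei srcW, GLsizei srcH,
                          GLsizei displayW, GLsizei displayH) {
        glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
        glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
        glBlitFramebuffer(0, 0, srcW, srcH,
                          0, 0, displayW, displayH,
                          GL_COLOR_BUFFER_BIT, GL_LINEAR);
    }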

V-sync independence

A graphics benchmark is even more useless if it doesn’t attempt to avoid V-sync. If you’re used to the Windows world then you’re used to the driver controlling that. However if you’re developing a benchmark on Android or iOS then you have to avoid it by rendering off-screen. Just adjusting the workload so it’s slower than V-sync’s rate isn’t good enough.

Because of the way many embedded GPUs work, rendering to an off-screen surface and never sampling from it is a recipe for disaster. Drivers are clever: they will buffer up the work to see whether you really need it done, throw it away if they can, and certainly defer it for as long as possible. Therefore, for every frame of rendering, you must consume the rendered output.

The common techniques today absorb the (usually) small cost of a downsampling blit every frame, copying the contents to another surface that will be displayed on-screen at some point, either immediately or some way down the line. Tiled mosaics of off-screen renders are common and preferred today, because they introduce very little extra into the rendering workload and guarantee consumption of all of the rendering, so drivers can’t skip frames.
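A sketch of one frame of that mosaic approach, again against OpenGL ES 3.0 and reusing an off-screen FBO like the one created in the earlier sketch; renderAndConsumeFrame and drawScene are hypothetical names, and the 4x4 grid is just an example.

    #include <GLES3/gl3.h>

    // One frame of a v-sync-independent loop: render the full workload off-screen,
    // then consume it by blitting into a tile of the on-screen surface so the
    // driver can never defer or discard the rendering.
    void renderAndConsumeFrame(GLuint offscreenFbo, GLsizei offW, GLsizei offH,
                               GLsizei displayW, GLsizei displayH, int frameIndex) {
        // Draw the real workload at full off-screen resolution.
        glBindFramebuffer(GL_FRAMEBUFFER, offscreenFbo);
        glViewport(0, 0, offW, offH);
        // drawScene(); // hypothetical: the benchmark's actual draw calls

        // Pick a tile in a 4x4 mosaic so successive frames land in different
        // spots; every frame's output is provably consumed.
        const int grid  = 4;
        const int tileW = displayW / grid;
        const int tileH = displayH / grid;
        const int cell  = frameIndex % (grid * grid);
        const int tx    = cell % grid;
        const int ty    = cell / grid;

        glBindFramebuffer(GL_READ_FRAMEBUFFER, offscreenFbo);
        glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
        glBlitFramebuffer(0, 0, offW, offH,
                          tx * tileW, ty * tileH,
                          (tx + 1) * tileW, (ty + 1) * tileH,
                          GL_COLOR_BUFFER_BIT, GL_LINEAR);
    }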

Mixed-precision computation

Embedded graphics is a different world entirely to the one you might be used to on the desktop. Low power dominates architectural decisions and plays into every compromise an SoC vendor will make. The end result is that modern embedded GPUs don’t do what their distant desktop cousins do and use IEEE 754 single precision floating point everywhere.

The architectures, almost all of them, support lower precision for power and performance reasons. FP32 is not required for quality on all of the pixels drawn, especially outside of games. If you’re writing a graphics benchmark, whether it’s a high-level game test or low-level microbenchmark, make sure you make good use of mixed precision rendering where it makes sense to.

If a portion of your shader can be executed at higher performance or lower power (and sometimes both) because you’ve allowed the hardware to execute it at a lower precision than FP32, definitely go ahead and do so. There’s no downside on architectures that don’t support mixed-precision rendering and lots of upsides on those that do.
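As an illustration of what that looks like at the shader level, here’s a hypothetical GLSL ES 3.00 fragment shader (shown as a C++ string constant) that defaults to mediump and only asks for highp where the extra range genuinely matters; the uniform and varying names are made up for the example.

    // Hypothetical GLSL ES 3.00 fragment shader illustrating mixed precision:
    // mediump where quality allows, highp only where range or precision demand it.
    static const char* kMixedPrecisionFragmentShader = R"(#version 300 es
    precision mediump float;            // default: lower precision for most of the maths

    uniform sampler2D uAlbedo;
    uniform mediump vec3 uLightDir;     // normalised light direction
    in highp vec2 vTexCoord;            // texture coordinates often do want highp
    in mediump vec3 vNormal;
    out mediump vec4 oColour;

    void main() {
        mediump vec3 albedo = texture(uAlbedo, vTexCoord).rgb;
        mediump float ndotl = max(dot(normalize(vNormal), uLightDir), 0.0);
        oColour = vec4(albedo * ndotl, 1.0);
    }
    )";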

Reach out to each vendor’s performance analysis team if you’re writing a benchmark and you’re unsure what to do there. We’ll all give you much the same advice, “try it out and test it as much as possible”, and we’ll all help you iron out potential issues.

Do the work knowing that’s exactly what game vendors do. I put a lot of time in with the engine guys to make sure they understand how mixed-precision rendering works (and not just for my architecture, but in general) to make sure it gets used properly. If you’re using popular middleware like UE4 or Unity, those guys have already done most of that hard work and absolutely make use of mixed precision when they can.

Balance your high-level workloads

This is especially hard to get right for high-level game-like workloads, especially those that want to be future-looking in some way and try to model what games are likely to do in a generation or two’s time.

If you’re doing something atypical to modern rendering, especially on mobile, do your research. Don’t blindly copy desktop techniques. Don’t blindly translate desktop shaders line-by-line from HLSL to GLSL (see the section above for the main reason why).

You want each submitted frame to have a balance of work wherever possible. To draw any good looking pixel you have to have great input geometry with triangles big and small. You need to sample multiple times from multiple types of texture map, with different kinds of filters and sampling modes. That data ingress should be balanced in your frames with a correspondingly sensible balance of compute work and data egress, be that render-to-texture or final frame presentation.

Post-processing makes it particularly difficult to balance: heavy post-processing makes it incredibly easy to engineer a bandwidth-heavy frame with much less emphasis on computational workload or traditional data ingress. So be careful with how you engineer your rendering and make sure it doesn’t lean too far in one direction. All of rendering can benefit from that understanding: in the middle of a single frame of rendering there are multiple in-flight data flows, all interconnected, all with the potential to bottleneck each other (with both forward and backward pressure, which is counter-intuitive), usually sandwiched around a healthy helping of compute.

Be aware of how data flows through a GPU, of how much ingress bandwidth you have for every flop of compute, and how that maps to egress on the way out to the screen. Use that knowledge to balance what you’re doing.
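A back-of-the-envelope way to sanity check that balance is to compare the bytes-per-flop your frame demands against the bytes-per-flop the hardware can actually deliver. The C++ sketch below does exactly that with entirely hypothetical numbers for both the SoC and the frame; substitute figures for the hardware and workload you actually care about.

    #include <cstdio>

    int main() {
        // Hypothetical SoC budget: what the hardware can move and compute.
        const double dramBandwidthGBs = 12.8;   // GB/s of memory bandwidth available to the GPU
        const double computeGflops    = 200.0;  // peak FP32 GFLOPS
        const double hwBytesPerFlop   = (dramBandwidthGBs * 1e9) / (computeGflops * 1e9);

        // Hypothetical frame: what the workload demands.
        const double bytesPerFrame = 180e6;     // texture and geometry in, render targets out
        const double flopsPerFrame = 2.5e9;     // shader arithmetic
        const double wlBytesPerFlop = bytesPerFrame / flopsPerFrame;

        std::printf("hardware budget : %.3f bytes/flop\n", hwBytesPerFlop);
        std::printf("workload demand : %.3f bytes/flop\n", wlBytesPerFlop);
        std::printf("%s\n", wlBytesPerFlop > hwBytesPerFlop
                        ? "frame leans bandwidth-bound; trim data flow or add compute"
                        : "frame leans compute-bound; trim arithmetic or add data flow");
        return 0;
    }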

Microbenchmarks

Microbenchmarks are a special case, and one that I fully believe deserves more time understanding and developing than high-level benchmarks. Because they’re (hopefully!) singular in focus and quite simple, the interplay between your shader code, its balance of TRI:ALU:TEX:EXPORT, and the interaction with the shader compiler is often on a knife-edge.

One shader with perfect 8:1 ALU:TEX on one architecture will become an out-of-balance mess on others. Write once, measure anywhere, isn’t a thing. Even if you’re close to the metal somehow, you have to be very careful. The shader compiler isn’t yours and it implements a huge array of clever optimisations specifically designed to outsmart the developer and shave precious shader cycles off any way it can.
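One common way to cope is to generate the microbenchmark shaders programmatically, so the ALU:TEX ratio is a parameter you can sweep per architecture rather than something baked into hand-written GLSL. The C++ sketch below shows the idea with a hypothetical makeAluTexShader helper; the dependent multiply-add chain feeds the final colour, which makes it harder (though never impossible) for the compiler to throw the work away.

    #include <cstdio>
    #include <string>

    // Generate a hypothetical GLSL ES fragment shader with a configurable ALU:TEX
    // ratio. Each ALU operation consumes the previous result and the final colour
    // depends on all of it, so dead-code elimination can't simply remove the work.
    std::string makeAluTexShader(int aluOpsPerFetch) {
        std::string src =
            "#version 300 es\n"
            "precision mediump float;\n"
            "uniform sampler2D uTex;\n"
            "uniform mediump vec2 uScale;\n"
            "in highp vec2 vUv;\n"
            "out mediump vec4 oColour;\n"
            "void main() {\n"
            "    mediump vec4 acc = texture(uTex, vUv);\n";
        for (int i = 0; i < aluOpsPerFetch; ++i) {
            // Dependent multiply-add chain driven by uniforms the compiler can't fold.
            src += "    acc = acc * uScale.x + uScale.y;\n";
        }
        src += "    oColour = acc;\n"
               "}\n";
        return src;
    }

    int main() {
        std::printf("%s", makeAluTexShader(8).c_str()); // the 8:1 ALU:TEX variant
        return 0;
    }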

These days the compiler often has one eye on saving power, too, via various mechanisms in the architecture. Make a change to find architectural peak on one SoC and you’ll move further away from it on another. It’s an arms race, but one you have to fight.

To this end, arm yourself with all the weapons the IHV can give you in terms of accurate, information-rich profiling tools, the same ones it uses all the time itself. Sign all the NDAs you need to, to get access to cycle-estimating disassembling compilers, the ISA documentation and the secret run-time profiling tools with the hidden counters that tell you what’s really happening.

Take time to think about data ingress and egress again, and about the memory hierarchy of the GPU, when designing microbenchmarks. Pay especially close attention to the memories outside the GPU boundary, which affect the measured performance of data flow. If you’re attempting at all to measure pixels or texels through a GPU, remember they’re the fundamental data and are cached as aggressively as possible almost everywhere you can think of.

The memory hierarchy internal to a GPU is rarely what you think it is. Ask lots of questions and think hard about GPU architecture as you go. Think like a journalist trying to uncover a scoop. Think about where we IHVs might hide facets of performance information in caches here and there in the architecture.

Think about the fundamentals of modern 3D rendering, too. Things rarely happen completely independently of each other. Quads of things are your friend, be they pixels or texels, especially when sampling and filtering. Intimately familiarise yourself with what SIMD means in the context of a GPU. It’s subtly different to SIMD in the context of a CPU, mostly because of the memory hierarchy and the sheer width of the machine.

Scheduling

Scheduling in modern GPUs is a reasonable technological manifestation of black magic. Even on ‘simple’ IMRs, there’s more to getting the GPU to do something interesting than just submitting API calls. The driver is always making a bunch of decisions on your behalf, followed by a bunch more decisions that happen inside the GPU afterwards, about when to submit work and why.

A modern GPU is always a unified shader architecture. Multi-purpose units that can do all of the supported kinds of programmable work, wrapped in weird black boxes of state at either end: rasteriser and pixel export. Those unified shader cores initially need to be told what to start working on by the client API driver, via lower-level kernel interfaces, but internally they’re able to figure certain things out themselves, as well. That’s because they’re in possession of more useful information about the current occupancy of the GPU than the client driver is.

Think of the client API driver as the first level of the scheduling hierarchy. It collects up really big batches of work, lots and lots of things at a time. Draw calls that can span hundreds of thousands of triangles. Post-processing shaders that can read and write millions of pixels over and over again in single API calls, with just three vertices to kick it all off. That’s the quantum of work the client API driver deals in.

The GPU works at a much lower level of granularity. I work with GPUs that process anywhere from a single pixel at a time to a small power-of-2 number of pixels at a time. Really: a single pixel at a time for some of our GPUs. There’s a scheduler, usually a combination of hardware and software for all of the interesting GPUs I can think of, that figures out how to map those large bodies of work from the client API driver onto the hardware underneath, making its way through the workload a chunk at a time until it’s complete.

Midway through that top-level workload something else might need to be run, either instead of it or alongside it. The client API driver rarely makes those decisions, so you need to be aware of the other scheduling level underneath you (and usually there’s a hierarchy of those lower-level schedulers!) that’s trying to extract the most efficient use of the GPU’s execution resources underneath your view of things at the software level, where you’re busy issuing top-level commands.

A single graphics API call today is heavily macroscopic. That’s changing, but it’s still going to end up a fair level above the raw GPU ISA. Games ISVs are getting much better at understanding the bridge between the two levels (and in many cases now are actively helping to shape and develop it, which is the best thing to happen to graphics in a very long time).

Benchmark ISVs would do well to get in bed with their game ISV compatriots as they attempt to really understand the GPU underneath. Doing so at the scheduling level is key, even if it is incredibly opaque and very much subject to change right now, as the APIs go through a much-needed period of intense flux.

Automation

Automation is often overlooked in modern graphics benchmarks, especially as modern mobile operating systems move away from having any kind of scriptability, but a benchmark needs to be automation-friendly to a good degree. On Android that means passing parameters to your intent via ADB. On iOS that means a configuration file in iTunes shared device storage, rather than Xcode.

Notice how the conduits to get configuration data for automation onto the devices are fully local, and don’t need the Internet or complex developer tools. That’s critical for automated testing of devices that can’t ever go near the public Internet and don’t need developer environments to execute.
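A minimal C++ sketch of what the device-side half of that can look like: a simple key=value configuration file read from a local path, so the automation harness only needs adb push or iTunes file sharing to set up a run. The loadConfig helper, the run.cfg filename and the example paths in the comments are all hypothetical.

    #include <cstdio>
    #include <fstream>
    #include <map>
    #include <string>

    // Read simple key=value pairs from a local path the automation harness can
    // write to without network access or developer tools, for example:
    //   Android: adb push run.cfg /sdcard/run.cfg          (path is hypothetical)
    //   iOS:     drop run.cfg into the app's Documents dir via iTunes file sharing
    std::map<std::string, std::string> loadConfig(const std::string& path) {
        std::map<std::string, std::string> cfg;
        std::ifstream file(path);
        std::string line;
        while (std::getline(file, line)) {
            const auto eq = line.find('=');
            if (eq == std::string::npos || line[0] == '#') continue; // skip comments and junk
            cfg[line.substr(0, eq)] = line.substr(eq + 1);
        }
        return cfg;
    }

    int main() {
        const auto cfg = loadConfig("run.cfg"); // hypothetical filename
        for (const auto& kv : cfg)
            std::printf("%s = %s\n", kv.first.c_str(), kv.second.c_str());
        return 0;
    }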

Conclusion

The big lessons are pretty easily summarised:

  • Know what your benchmark is trying to say, either as the producer or as the consumer, and make it easy to interpret properly;
  • Make sure it makes a big effort to be repeatable, in workload and environment (the latter bit might be your responsibility as the user);
  • If you support multiple platforms, support them equally well;
  • Measure time accurately and without drift at all costs;
  • If applicable, make it very easy to follow, modify and build your source code;
  • Test, test, test and test again;
  • Make sure you can easily avoid resolution and v-sync limits;
  • If you’re writing microbenchmarks (and even if you’re not), understand the compiler and machine underneath you;
  • Understand how work gets submitted and executed in depth;
  • Make sure it’s easy to automate and run again and again, and doesn’t need Internet access.

Notice how none of those things really apply to graphics specifically, despite me talking about graphics the whole way through? They’re just good things to do if you’re benchmarking anything really.

So if you’re writing a graphics benchmark, especially one that’s cross-platform and/or cross-API, and especially if it contains GPU architecture independent microbenchmarks, feel free to get in touch if you want some help.

And if you only take away one thing, it’s to know your audience.