Custom PC - UK (2020-01)


The GPU needs to do two things: stream
in the hierarchy in the right order and test
the ray against the outer boxes and then
the triangles inside, up to that maximum
depth. Turing accelerates both of those
things, pulling in the hierarchy as efficiently
as possible, and dedicating some fixed
logic to test a couple of boxes or triangles
in parallel for every ray it’s working on.
As alluded to earlier, the logic is fixed
function, so it can’t be used for any other
calculations, simply because of the
arithmetic and dataflow complexity. You
could run the intersection code on the SM,
but it would be slow and consume the core
when it could be running more common
pixel, compute or vertex work. It’s this
fixed-function hardware that isn’t available
on the smaller TU116 and TU117 variants,
as discussed earlier.
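
To make that work concrete, here’s a minimal C++ sketch of the
kind of test the RT cores perform in fixed logic: the classic
‘slab’ intersection of a ray against an axis-aligned bounding box.
The structure layout and names here are our own simplification for
illustration; Nvidia hasn’t published the hardware’s actual node
formats.

#include <algorithm>  // std::min, std::max
#include <cfloat>     // FLT_MAX
#include <cstdio>
#include <utility>    // std::swap

// A ray stored with the reciprocal of its direction precomputed, a
// common trick that turns the divides in the slab test into multiplies.
struct Ray  { float origin[3]; float inv_dir[3]; };
struct AABB { float lo[3]; float hi[3]; };

// Classic slab test: clip the ray's parametric interval against each
// pair of axis-aligned planes; if the interval survives all three
// axes, the ray hits the box and traversal descends into it.
bool ray_hits_box(const Ray& r, const AABB& b) {
    float tmin = 0.0f, tmax = FLT_MAX;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (b.lo[axis] - r.origin[axis]) * r.inv_dir[axis];
        float t1 = (b.hi[axis] - r.origin[axis]) * r.inv_dir[axis];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
        if (tmin > tmax) return false;  // interval emptied: a miss
    }
    return true;
}

int main() {
    Ray ray  = {{0, 0, 0}, {1, 1, 1}};  // ray along (1,1,1), inv_dir = 1/d
    AABB box = {{1, 1, 1}, {2, 2, 2}};
    puts(ray_hits_box(ray, box) ? "hit" : "miss");  // prints "hit"
}

Run per node, per ray, millions of times a frame, it’s easy to see
why pulling this work off the SM pays off.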


MEMORY HIERARCHY,
EXPORT AND GDDR6
So that’s the architectural machinery that
Turing uses to get its main computational
jobs done, but there’s one last aspect to
touch on before we tie it all up: the memory
hierarchy and getting finished work out to
graphics memory to be displayed. Each
SM’s L1 cache feeds into a partition of the L2.
TU102, for instance, has a faintly ridiculous
6MB of total L2 cache, split into partitions
that every SM’s L1 can reach. The L2
is then connected to the outside world via a
fabric and a set of memory controllers.
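
As a rough illustration of how that cache might be carved up,
here’s a back-of-envelope sum. The assumption that the full TU102
pairs one L2 partition with each of its twelve 32-bit memory
controllers is ours, not an official Nvidia breakdown.

#include <cstdio>

int main() {
    const int total_l2_kb        = 6 * 1024;  // 6MB of L2 in total
    const int memory_controllers = 12;        // 384-bit bus / 32 bits each
    // Prints 512KB: one L2 slice per controller under our assumption
    printf("L2 per partition: %dKB\n", total_l2_kb / memory_controllers);
}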


Finished work, usually pixels, has to
be exported out to Turing’s very large
GDDR6 memory in an optimal way. Graphics
memory accesses come in bursts, as do pixel
accesses. For instance, TU102 can output 96
finished pixels per clock in Nvidia Titan RTX
form, and those can be HDR pixels without any
performance penalty. So potentially 768 bytes
(6,144 bits!) of data makes its way out of the
back end of the hardware in any given cycle,
and the GPU has to maintain writing that out to
memory at full rate to achieve peak fill rate.
That means the last bit of the memory
hierarchy, from the export hardware through
L2 into the fabric and then out through the
connected GDDR6 memory – each 32 bits
wide and usually connected to a single GDDR6
chip – needs to be free-flowing and efficient. It
might not sound like a lot of data at first glance,
but given the clock speed of these GPUs, it
amounts to a huge amount of data per second.
This can be particularly tricky when it comes to
marshalling those bits over actual wires to the
GDDR6 chips that live next to the GPU.
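
To put a number on it, here’s a quick back-of-envelope sum using
the figures above; the boost clock is our assumption (Titan RTX is
rated at around 1.77GHz).

#include <cstdio>

int main() {
    const double pixels_per_clock = 96.0;    // TU102 in Titan RTX form
    const double bytes_per_pixel  = 8.0;     // 64-bit HDR, e.g. FP16 RGBA
    const double clock_hz         = 1.77e9;  // assumed boost clock
    const double bytes_per_clock  = pixels_per_clock * bytes_per_pixel;
    // Prints: 768 bytes per clock, roughly 1359GB/s at peak fill
    printf("%.0f bytes per clock, roughly %.0fGB/s at peak fill\n",
           bytes_per_clock, bytes_per_clock * clock_hz / 1e9);
}

That’s over a terabyte of pixel data per second at peak, which is
exactly why every link in that chain has to stay free-flowing.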

STATE OF PLAY
Nvidia took several gambles with the launch
of Turing. On the consumer side, it has added
several brand-new rendering features, such
as VRS and real-time ray tracing, and these
will need developer support to bear fruit.
We’ve already seen a few headline examples
of games with ray-tracing support, such as
Battlefield V, Call of Duty: Modern Warfare
and Minecraft, so the signs are reasonably
positive that it’s here to stay. However, with
the latest games consoles powered by
AMD APUs, and future consoles looking like
they might follow suit, we may yet see ray
tracing flounder if AMD doesn’t add this
feature too.
Otherwise, Nvidia has continued to push
the performance envelope far enough that
the overall graphics card market situation
remains much as it has been for many
years: Nvidia rules the high end, while
AMD fights for the mid-range and low-end
markets. Navi is the closest AMD has come in
recent years to truly competing with Nvidia at
the top, but it still isn’t quite there yet, leaving
Nvidia free to charge essentially what it likes
for its top-tier Turing products.
What’s perhaps most telling about Turing,
though, is the inclusion of dedicated hardware
for non-graphical calculations. The Tensor
cores are an explicit nod to the explosive
growth in requirements for fast training and
inference hardware for machine learning. A big
chunk of Nvidia’s money now comes from that
kind of customer, and it’s growing faster than
its traditional graphics business.
Likewise, the SM changes for integer
co-issue and the strong optimisations in
the memory hierarchy, especially around
L1 partitioning, bandwidth and latency, and
the overall L2 size, hint at optimising for
other workloads, run not by games on a
machine like your PC but on a giant grid in a
supercomputer or in the cloud somewhere.
We said in our recent RDNA deep dive
(Issue 193) that AMD had remembered how to
build a GPU, rather than a compute monster.
In many ways, Turing swings the other way,
but crucially Nvidia hasn’t sacrificed gaming
performance in the process.

THE FUTURE OF NVIDIA’S
GPU ARCHITECTURE –
AMPERE
Looking to the future, and what might be in
store for Nvidia’s next GPU architecture, it will
be interesting to see just how much further
the company pursues the dedicated compute
market and whether we finally see a more
direct split in hardware designs. It seems
unlikely at this stage that its top-tier chip
designs will be split up, but when ray tracing
hardware is no use to compute applications
(as far as we’re aware), and Tensor cores are
no use for gaming, there’s clearly some silicon
area to be saved by making the break.



Image caption: Turing is the first GPU architecture to introduce concurrent execution of integer and floating-point operations
