Custom PC - UK (2020-01)


The GPU needs to do two things: stream
in the hierarchy in the right order and test
the ray against the outer boxes and then
the triangles inside, up to that maximum
depth. Turing accelerates both of those
things, pulling in the hierarchy as efficiently
as possible, and dedicating some fixed
logic to test a couple of boxes or triangles
in parallel for every ray it’s working on.
As alluded to earlier, the logic is fixed
function, so it can’t be used for any other
calculations, simply because of the
arithmetic and dataflow complexity. You
could run the intersection code on the SM,
but it would be slow and consume the core
when it could be running more common
pixel, compute or vertex work. It’s this
fixed-function hardware that isn’t available
on the smaller TU116 and TU117 variants,
as discussed earlier.
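
To make that work concrete, here’s a minimal C++ sketch of the
kind of test the RT cores perform in fixed logic: the classic
‘slab’ intersection of a ray against an axis-aligned bounding box.
The structure layout and names here are our own simplification for
illustration; Nvidia hasn’t published the hardware’s actual node
formats.

#include <algorithm>  // std::min, std::max
#include <cfloat>     // FLT_MAX
#include <cstdio>
#include <utility>    // std::swap

// A ray stored with the reciprocal of its direction precomputed, a
// common trick that turns the divides in the slab test into multiplies.
struct Ray  { float origin[3]; float inv_dir[3]; };
struct AABB { float lo[3]; float hi[3]; };

// Classic slab test: clip the ray's parametric interval against each
// pair of axis-aligned planes; if the interval survives all three
// axes, the ray hits the box and traversal descends into it.
bool ray_hits_box(const Ray& r, const AABB& b) {
    float tmin = 0.0f, tmax = FLT_MAX;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (b.lo[axis] - r.origin[axis]) * r.inv_dir[axis];
        float t1 = (b.hi[axis] - r.origin[axis]) * r.inv_dir[axis];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
        if (tmin > tmax) return false;  // interval emptied: a miss
    }
    return true;
}

int main() {
    Ray ray  = {{0, 0, 0}, {1, 1, 1}};  // ray along (1,1,1), inv_dir = 1/d
    AABB box = {{1, 1, 1}, {2, 2, 2}};
    puts(ray_hits_box(ray, box) ? "hit" : "miss");  // prints "hit"
}

Run per node, per ray, millions of times a frame, it’s easy to see
why pulling this work off the SM pays off.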


MEMORY HIERARCHY,
EXPORT AND GDDR6
So that’s the architectural machinery that
Turing uses to get its main computational
jobs done, but there’s one last aspect to
touch on before we tie it all up: the memory
hierarchy and getting finished work out to
graphics memory to be displayed. Each
SM’s L1 cache feeds into a partition of the L2.
TU102, for instance, has a faintly ridiculous
6MB of total L2 cache, split into partitions
that every SM’s L1 can reach. The L2
is then connected to the outside world via a
fabric and a set of memory controllers.
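
As a rough illustration of how that cache might be carved up,
here’s a back-of-envelope sum. The assumption that the full TU102
pairs one L2 partition with each of its twelve 32-bit memory
controllers is ours, not an official Nvidia breakdown.

#include <cstdio>

int main() {
    const int total_l2_kb        = 6 * 1024;  // 6MB of L2 in total
    const int memory_controllers = 12;        // 384-bit bus / 32 bits each
    // Prints 512KB: one L2 slice per controller under our assumption
    printf("L2 per partition: %dKB\n", total_l2_kb / memory_controllers);
}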


Finished work, usually pixels, has to
be exported out to Turing’s very large
GDDR6 memory in an optimal way. Graphics
memory accesses come in bursts, as do pixel
accesses. For instance, TU102 can output 96
finished pixels per clock in Nvidia Titan RTX
form, and those can be HDR pixels without any
performance penalty. So potentially 768 bytes
(6,144 bits!) of data makes its way out of the
back end of the hardware in any given cycle,
and the GPU has to maintain writing that out to
memory at full rate to achieve peak fill rate.
That means the last bit of the memory
hierarchy, from the export hardware through
L2 into the fabric and then out through the
connected GDDR6 memory – each 32 bits
wide and usually connected to a single GDDR6
chip – needs to be free-flowing and efficient. It
might not sound like a lot of data at first glance,
but given the clock speed of these GPUs, it
amounts to a huge amount of data per second.
This can be particularly tricky when it comes to
marshalling those bits over actual wires to the
GDDR6 chips that live next to the GPU.
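
To put a number on it, here’s a quick back-of-envelope sum using
the figures above; the boost clock is our assumption (Titan RTX is
rated at around 1.77GHz).

#include <cstdio>

int main() {
    const double pixels_per_clock = 96.0;    // TU102 in Titan RTX form
    const double bytes_per_pixel  = 8.0;     // 64-bit HDR, e.g. FP16 RGBA
    const double clock_hz         = 1.77e9;  // assumed boost clock
    const double bytes_per_clock  = pixels_per_clock * bytes_per_pixel;
    // Prints: 768 bytes per clock, roughly 1359GB/s at peak fill
    printf("%.0f bytes per clock, roughly %.0fGB/s at peak fill\n",
           bytes_per_clock, bytes_per_clock * clock_hz / 1e9);
}

That’s over a terabyte of pixel data per second at peak, which is
exactly why every link in that chain has to stay free-flowing.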

STATE OF PLAY
Nvidia took several gambles with the launch
of Turing. On the consumer side, it has added
several brand-new rendering features, such
as VRS and real-time ray tracing, and these
will need developer support to bear fruit.
We’ve already seen a few headline examples
of games with ray-tracing support, such as
Battlefield V, Call of Duty: Modern Warfare
and Minecraft, so the signs are reasonably
positive that it’s here to stay. However, with
the latest games consoles powered by
AMD APUs, and future consoles looking like
they might follow suit, we may yet see ray
tracing flounder if AMD doesn’t add this
feature too.
Otherwise, Nvidia has continued to push
the performance envelope far enough that
the overall graphics card market situation
remains much as it has been for many
years: Nvidia rules the high end, while
AMD fights for the mid-range and low-end
markets. Navi is the closest AMD has come in
recent years to truly competing with Nvidia at
the top, but it still isn’t quite there yet, leaving
Nvidia free to charge essentially what it likes
for its top-tier Turing products.
What’s perhaps most telling about Turing,
though, is the inclusion of dedicated hardware
for non-graphical calculations. The Tensor
cores are an explicit nod to the explosive
growth in requirements for fast training and
inference hardware for machine learning. A big
chunk of Nvidia’s money now comes from that
kind of customer, and it’s growing faster than
its traditional graphics business.
Likewise, the SM changes for integer
co-issue and the strong optimisations in
the memory hierarchy, especially around
L1 partitioning, bandwidth and latency, and
the overall L2 size, hint at optimising for
other workloads, run not by games on a
machine like your PC but on a giant grid in a
supercomputer or in the cloud somewhere.
We said in our recent RDNA deep dive
(Issue 193) that AMD had remembered how to
build a GPU, rather than a compute monster.
In many ways, Turing swings the other way,
but crucially Nvidia hasn’t sacrificed gaming
performance in the process.

THE FUTURE OF NVIDIA’S
GPU ARCHITECTURE –
AMPERE
Looking to the future, and what might be in
store for Nvidia’s next GPU architecture, it will
be interesting to see just how much further
the company pursues the dedicated compute
market and whether we finally see a more
direct split in hardware designs. It seems
unlikely at this stage that its top-tier chip
designs will be split up, but when ray tracing
hardware is no use to compute applications
(as far as we’re aware), and Tensor cores are
no use for gaming, there’s clearly some silicon
area to be saved by making the break.



Image caption: Turing is the first GPU architecture to introduce concurrent execution of integer and floating-point operations
