Custom PC - UK (2020-01)

So even though we discussed earlier that there
are two Turing architecture derivatives, one
for the big variants and another for the much
smaller TU116 and TU117, there isn’t a
difference in their respective SM
microarchitectures other than the removal of
the Tensor cores and RT cores.
At the top level, each SM in the pair that
shares a texture unit inside a TPC has 64 of
what Nvidia calls ‘CUDA cores’. However, they
don’t operate together as a monolithic block.
Instead, the SM is made up of four 16-wide
arithmetic logic unit (ALU) structures, giving
Turing a machine width of 16. Why does the
machine width matter?
Branching and other forms of divergent
processing. Because it’s too expensive for
each core to have all of the support logic to
decide what it’s doing — a cache, register
storage, plus instruction decoder, scheduler
and dispatch blocks — those things can
instead be shared among a group of ALUs to
make it cheaper to implement.
The wider the machine, the cheaper it is
to make, because you need fewer instances
of that supporting logic, but that means
more performance is lost if one of the cores
in a collective block wants to do something
different from the others. Make the machine
narrower and performance is more robust
but cost rises. So the GPU designer chooses
a compromise width that makes sense for
the workloads they think will be executed.
Nvidia chose 16, one of the narrowest designs
we’ve seen in a long time, with almost
everyone else using 32 or wider. It’s a clear
statement from the company that Turing
should do well executing divergent
shader code.
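
If you want to see what that divergence penalty looks like from the software side, here’s a toy CUDA kernel, purely our own illustration rather than anything from Nvidia, in which threads that execute together are forced down different branches. The hardware runs both sides of the if back to back, masking off the lanes that didn’t take each path, and that masked-off time is exactly the wasted work a narrower machine suffers less of.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: threads that execute together take different branches.
// Both sides of the 'if' end up being executed one after the other,
// with the non-participating lanes masked off each time.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;   // even lanes run this path first...
    else
        out[i] = in[i] + 1.0f;   // ...then the odd lanes run this one
}

int main()
{
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = float(i);

    divergent<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0]=%.1f out[1]=%.1f\n", out[0], out[1]);   // 0.0 and 2.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}

If every lane in a group takes the same side of the branch, nothing is wasted; the cost only appears when they disagree, which is why the width of the group matters.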

CACHES, THREAD SCHEDULERS AND REGISTER FILES
That quartet of 16-wide vector units shares
a rather large 96 KB cache that Turing can
partition dynamically depending on what’s
executing. When running graphics code,
it’s split 64/32 in favour of a general L1
cache and a combined texture cache and
register overspill. In compute mode, it can
instead be split to give a kernel either 32 KB
or 64 KB of shared memory, with the rest
acting as L1 cache.
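
Games can’t steer that split directly, but on the compute side CUDA does expose a hint for it: cudaFuncAttributePreferredSharedMemoryCarveout. The sketch below is our own minimal example (the kernel and buffer sizes are made up) of asking the driver to prefer the larger shared-memory carveout for a kernel that stages data through shared memory; the hardware is still free to pick the partition it thinks is best.

#include <cstdio>
#include <cuda_runtime.h>

// A kernel that stages data through shared memory, so a larger
// shared-memory carveout (and smaller L1) may suit it better.
__global__ void stage_through_shared(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * 0.5f;
}

int main()
{
    // Ask for the maximum shared-memory carveout for this kernel.
    // The value is a percentage; cudaSharedmemCarveoutMaxShared is
    // a convenience constant meaning 100 per cent.
    cudaFuncSetAttribute(stage_through_shared,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    stage_through_shared<<<n / 256, 256>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %.2f\n", out[0]);   // 0.50

    cudaFree(in);
    cudaFree(out);
    return 0;
}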

TURING’S SM
Turing’s shader core inside a TPC is broken up
into groups of vector processors that Nvidia
calls Streaming Multiprocessors, or SMs, and
it’s the most complex, efficient and highest-
throughput shader core in the company’s
history. Designed to chew through a wide
mix of workloads as well as common
graphics tasks, the Turing SM is a future-
looking shader core design that’s arguably
overengineered for running your games.
Because of the inherently complex
nature of a GPU, Nvidia can’t really afford
to specialise a different SM for each of the
Turing-based products it wants to create. In

an ideal world, it might want to leave the SM
we’ll describe here to products created for
the professional compute, cloud compute,
remote rendering, machine learning and
other similar markets, and create a much
simpler one for gaming products. But
the engineering, validation, testing and other
costs associated with designing big GPUs
today mean it’s prohibitive in the extreme to
do that, despite Nvidia’s huge resources.

Variable rate shading allows a game to optimise the rendering complexity of different parts of a frame to balance performance and visual fidelity




The texture unit shared by an SM pair can also
return results out of order, and not every GPU
can keep up with this. For example, AMD’s latest
Navi architecture, released just this summer, has
only just gained the same ability.
What this allows for is that, say, the first
SM in the TPC (let’s call it SM0) issues a
texture sample request to the texture unit,
then shortly afterwards the second SM (SM1)
issues another request.
Ordinarily, the texture unit would only
return SM1’s requested data after it has
dealt with the request from SM0 (because that
was the order of operations sent to it), even if
that means holding back data that’s ready to
be sent to SM1. However, with out-of-order
processing, if the texture unit has finished
processing the data for SM1 first, it can
deliver it first.
This is so crucial because textures
generally live off-chip in the GDDR5 or
GDDR6 memory, and fetching that data
can take quite a while. However, there’s
also a small on-chip texture cache that’s
much quicker to access. So, in our example,
maybe the texture data for SM0 is in off-
chip memory but the data for SM1 is in the
texture cache. In this scenario, it would
almost certainly be quicker to return the
data for SM1 first.
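
None of this is visible to software, but a tiny mock-up makes the idea concrete. The sketch below is purely illustrative, with latencies and names invented by us and no resemblance to how the hardware is actually built; it just contrasts returning two requests in issue order with returning whichever is ready first.

#include <algorithm>
#include <cstdio>
#include <vector>

// Purely illustrative model: each request records who issued it and how
// long its data takes to arrive (cache hit versus off-chip fetch).
struct Request {
    const char *owner;   // which SM asked for the data
    int latency;         // cycles until the data is ready
};

int main()
{
    std::vector<Request> requests = {
        {"SM0", 300},   // misses the texture cache, goes out to GDDR6
        {"SM1", 20},    // hits the small on-chip texture cache
    };

    // In-order return: SM1 queues behind SM0 even though its data is ready.
    printf("In-order return:\n");
    for (const Request &r : requests)
        printf("  %s served (data ready after %d cycles)\n", r.owner, r.latency);

    // Out-of-order return: whoever's data is ready first is served first.
    std::sort(requests.begin(), requests.end(),
              [](const Request &a, const Request &b) { return a.latency < b.latency; });
    printf("Out-of-order return:\n");
    for (const Request &r : requests)
        printf("  %s served (data ready after %d cycles)\n", r.owner, r.latency);

    return 0;
}
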
In aggregate, that eight texels per clock rate
leads to some monstrous total throughputs
for a big configuration like TU102. Six GPCs,
each with six TPCs, every TPC able to deliver
eight texels per clock at the 1.35GHz base clock
of an Nvidia RTX Titan, means a minimum of
388.8 billion textured samples every second.
To put that into perspective, 4K60 is only
around 500 Mpixels/s, so at full texturing
rate, an RTX Titan could texture every pixel
almost 800 times.
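
If you want to check the arithmetic yourself, it only takes a few lines; the figures below are the ones quoted above, nothing more.

#include <cstdio>

int main()
{
    // Full TU102 / RTX Titan figures as quoted in the text.
    const double gpcs = 6.0, tpcs_per_gpc = 6.0, texels_per_tpc_per_clock = 8.0;
    const double base_clock_hz = 1.35e9;

    // 6 * 6 * 8 = 288 texels per clock, times 1.35GHz = 388.8 billion per second.
    double texels_per_second = gpcs * tpcs_per_gpc * texels_per_tpc_per_clock * base_clock_hz;

    // 4K at 60fps is roughly 498 million pixels per second.
    double pixels_4k60 = 3840.0 * 2160.0 * 60.0;

    printf("Texture rate     : %.1f Gtexels/s\n", texels_per_second / 1e9);
    printf("4K60 pixel rate  : %.1f Mpixels/s\n", pixels_4k60 / 1e6);
    printf("Texels per pixel : %.0f\n", texels_per_second / pixels_4k60);   // ~781
    return 0;
}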


