Custom PC - UK (2020-01)

So even though we discussed earlier that there
are two Turing architecture derivatives, one
for the big variants and another for the much
smaller TU116 and TU117, there isn’t a
difference in their respective SM
microarchitectures other than the removal of
the Tensor cores and RT cores.
At the top level, each SM in the pair that
shares a texture unit inside a TPC has 64 of
what Nvidia calls ‘CUDA cores’. However, they
don’t operate together as a monolithic block.
Instead, the SM is made up of four 16-wide
arithmetic logic unit (ALU) structures, giving
Turing a machine width of 16. Why does the
machine width matter?
Branching and other forms of divergent
processing. Because it’s too expensive for
each core to have all of the support logic to
decide what it’s doing — a cache, register
storage, plus instruction decoder, scheduler
and dispatch blocks — those things can
instead be shared among a group of ALUs to
make it cheaper to implement.
The wider the machine, the cheaper it is
to make, because you need fewer instances
of that supporting logic, but that means
more performance is lost if one of the cores
in a collective block wants to do something
different from the others. Make the machine
narrower and performance is more robust
but cost rises. So the GPU designer chooses
a compromise width that makes sense for
the workloads they think will be executed.
Nvidia chose 16, one of the narrowest designs
we’ve seen in a long time, with almost
everyone else using 32 or wider. It’s a clear
statement from the company that Turing
should do well executing divergent
shader code.
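
If you want to see what that divergence penalty looks like from the software side, here’s a toy CUDA kernel, purely our own illustration rather than anything from Nvidia, in which threads that execute together are forced down different branches. The hardware runs both sides of the if back to back, masking off the lanes that didn’t take each path, and that masked-off time is exactly the wasted work a narrower machine suffers less of.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: threads that execute together take different branches.
// Both sides of the 'if' end up being executed one after the other,
// with the non-participating lanes masked off each time.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;   // even lanes run this path first...
    else
        out[i] = in[i] + 1.0f;   // ...then the odd lanes run this one
}

int main()
{
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = float(i);

    divergent<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0]=%.1f out[1]=%.1f\n", out[0], out[1]);   // 0.0 and 2.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}

If every lane in a group takes the same side of the branch, nothing is wasted; the cost only appears when they disagree, which is why the width of the group matters.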

CACHES, THREAD SCHEDULERS AND REGISTER FILES
That quartet of 16-wide vector units shares
a rather large 96 KB cache that Turing can
partition dynamically depending on what’s
executing. When running graphics code,
it’s split 64/32 in favour of a general L1
cache and a combined texture cache and
register overspill. In compute mode, it can
instead be split to give a kernel either 32 KB
or 64 KB of shared memory, with the rest
acting as L1 cache.
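
Games can’t steer that split directly, but on the compute side CUDA does expose a hint for it: cudaFuncAttributePreferredSharedMemoryCarveout. The sketch below is our own minimal example (the kernel and buffer sizes are made up) of asking the driver to prefer the larger shared-memory carveout for a kernel that stages data through shared memory; the hardware is still free to pick the partition it thinks is best.

#include <cstdio>
#include <cuda_runtime.h>

// A kernel that stages data through shared memory, so a larger
// shared-memory carveout (and smaller L1) may suit it better.
__global__ void stage_through_shared(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * 0.5f;
}

int main()
{
    // Ask for the maximum shared-memory carveout for this kernel.
    // The value is a percentage; cudaSharedmemCarveoutMaxShared is
    // a convenience constant meaning 100 per cent.
    cudaFuncSetAttribute(stage_through_shared,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    stage_through_shared<<<n / 256, 256>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %.2f\n", out[0]);   // 0.50

    cudaFree(in);
    cudaFree(out);
    return 0;
}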

TURING’S SM
Turing’s shader core inside a TPC is broken up
into groups of vector processors that Nvidia
calls Streaming Multiprocessors, or SMs, and
it’s the most complex, efficient and highest-
throughput shader core in the company’s
history. Designed to chew through a wide
mix of workloads as well as common
graphics tasks, the Turing SM is a future-
looking shader core design that’s arguably
overengineered for running your games.
Because of the inherently complex
nature of a GPU, Nvidia can’t really afford
to specialise a different SM for each of the
Turing-based products it wants to create. In

an ideal world, it might want to leave the SM
we’ll describe here to products created for
the professional compute, cloud compute,
remote rendering, machine learning and
other similar markets, and create a much
simpler one for gaming products. But
the engineering, validation, testing and other
costs associated with designing big GPUs
today mean it’s prohibitive in the extreme to
do that, despite Nvidia’s huge resources.

Variable rate shading allows a game to optimise the rendering complexity of different parts of a frame to balance performance and visual fidelity




The texture unit shared by an SM pair can also
return results out of order, and not every GPU
can keep up with this. For example, AMD’s latest
Navi architecture, released just this summer, has
only just gained the same ability.
What this allows for is that, say, the first
SM in the TPC (let’s call it SM0) issues a
texture sample request to the texture unit,
then shortly afterwards the second SM (SM1)
issues another request.
Ordinarily, the texture unit would only
return SM1’s requested data after it has
dealt with the request from SM0 (because that
was the order of operations sent to it), even if
that means holding back data that’s ready to
be sent to SM1. However, with out-of-order
processing, if the texture unit has finished
processing the data for SM1 first, it can
deliver it first.
This is so crucial because textures
generally live off-chip in the GDDR5 or
GDDR6 memory, and fetching that data
can take quite a while. However, there’s
also a small on-chip texture cache that’s
much quicker to access. So, in our example,
maybe the texture data for SM0 is in off-
chip memory but the data for SM1 is in the
texture cache. In this scenario, it would
almost certainly be quicker to return the
data for SM1 first.
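
None of this is visible to software, but a tiny mock-up makes the idea concrete. The sketch below is purely illustrative, with latencies and names invented by us and no resemblance to how the hardware is actually built; it just contrasts returning two requests in issue order with returning whichever is ready first.

#include <algorithm>
#include <cstdio>
#include <vector>

// Purely illustrative model: each request records who issued it and how
// long its data takes to arrive (cache hit versus off-chip fetch).
struct Request {
    const char *owner;   // which SM asked for the data
    int latency;         // cycles until the data is ready
};

int main()
{
    std::vector<Request> requests = {
        {"SM0", 300},   // misses the texture cache, goes out to GDDR6
        {"SM1", 20},    // hits the small on-chip texture cache
    };

    // In-order return: SM1 queues behind SM0 even though its data is ready.
    printf("In-order return:\n");
    for (const Request &r : requests)
        printf("  %s served (data ready after %d cycles)\n", r.owner, r.latency);

    // Out-of-order return: whoever's data is ready first is served first.
    std::sort(requests.begin(), requests.end(),
              [](const Request &a, const Request &b) { return a.latency < b.latency; });
    printf("Out-of-order return:\n");
    for (const Request &r : requests)
        printf("  %s served (data ready after %d cycles)\n", r.owner, r.latency);

    return 0;
}
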
In aggregate, that eight texels per clock rate
leads to some monstrous total throughputs
for a big configuration like TU102. Six GPCs,
each with six TPCs, every TPC able to deliver
eight texels per clock at the 1.35GHz base clock
of an Nvidia RTX Titan, means a minimum of
388.8 billion textured samples every second.
To put that into perspective, 4K60 is only
around 500 Mpixels/s, so at full texturing
rate, an RTX Titan could texture every pixel
almost 800 times.
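
If you want to check the arithmetic yourself, it only takes a few lines; the figures below are the ones quoted above, nothing more.

#include <cstdio>

int main()
{
    // Full TU102 / RTX Titan figures as quoted in the text.
    const double gpcs = 6.0, tpcs_per_gpc = 6.0, texels_per_tpc_per_clock = 8.0;
    const double base_clock_hz = 1.35e9;

    // 6 * 6 * 8 = 288 texels per clock, times 1.35GHz = 388.8 billion per second.
    double texels_per_second = gpcs * tpcs_per_gpc * texels_per_tpc_per_clock * base_clock_hz;

    // 4K at 60fps is roughly 498 million pixels per second.
    double pixels_4k60 = 3840.0 * 2160.0 * 60.0;

    printf("Texture rate     : %.1f Gtexels/s\n", texels_per_second / 1e9);
    printf("4K60 pixel rate  : %.1f Mpixels/s\n", pixels_4k60 / 1e6);
    printf("Texels per pixel : %.0f\n", texels_per_second / pixels_4k60);   // ~781
    return 0;
}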


