Custom PC - UK (2020-01)

(Antfer) #1

up that allowed Nvidia to head off AMD’s first
Navi10 products at the pass, plus the simpler
TU116 and TU117 implementations of Turing,
it’s time to dive a bit deeper.
We’ve thrown a lot of top-level product
configuration data at you, the GPC and SM
counts in particular, so now it’s time to take a
look at what a Turing GPC actually is and what
the SMs inside of a GPC look like, along with
the rest of the GPC microarchitecture that
makes up what Nvidia have named after our
geniusWorldWarII codebreaker.


We’llincludetheflagshipray-tracing
hardware,plusthemorestaidandexpected
bitsoftheirdesigntoo,comparingand
contrastingtoNvidia’spriordesignsand
AMD’scurrenthardwarewhereit makes
sense,togiveyoua viewofwhyNvidiaput
Turingtogetherthewayit did.


THETURINGGPC
TheGraphicsProcessingCluster(GPC)
isthetop-levelbuilding blockineachof
Nvidia’sarchitecturesinrecentmemory,
andimplementsalmostallofthecore
processing that the GPU undertakes. In fact,
in a six-GPC part such as the TU102, you
could disable five of the six clusters and the
remaining one would run your games just
fine! Let’s look at what a GPC contains.


RASTERISER
Each GPC contains a single rasteriser
that’s responsible for generating pixels
on which the SMs run pixel shading. It
would be unremarkable compared to
the rasteriser before it in Nvidia’s Pascal
architecture, if it weren’t for the fact the
addition of a new rasterisation feature:
variable rate shading, or VRS.
Traditionally, every pixel gets shaded at
the same rate, usually once unless MSAA is
enabled, but with VRS, the rasteriser is capable
of telling the SMs that for a given block of 16
x 16 pixels it generates, the SMs can shade
them at a reduced rate. That reduced shading
rate lets the GPU save on processing every
individual pixel, saving power, and increasing
performance and throughput.


There’s support for several different
rates, as you’d expect from the variable part
of the name: 1x1 (do nothing different), 2x2
and 4x4 are the obvious ones, but there are
non-square footprints too, with 1x2, 2x1,
4x2 and 2x4 all supported as well.
So for each 256 pixel block that the
rasteriser would traditionally generate and
send down to the SMs in a GPC, Turing can
effectively say, ‘for each 4 x 4 pixel footprint,
I’ll shade just 1 pixel and broadcast that
resulttoall16’.

Thegamedeveloperhascontrolover
theratesina coupleofways,andcrucially
forimagequality,theshadingratepatterns
canbenon-squareandthereforecarefully
tailoredtothekindsofgeometrythatare
likelytofallinsidea rasterisertile.Saythe
developerknowsthatina particularscreen
region,trianglesarelikelytobetalland
skinny.Theycanapplyoneofthe‘taller’
shadingrates—2x4and1x2—totakeinto
accounttheshapeofwhat’slikelytobeinthat
rasterisedregion.

The throughput of the rasteriser in a Turing
GPC is unchanged from Pascal and can process
a single input triangle per clock, generating
16 output pixels per clock from it for an SM to
work on. That scales per GPC, so a six-GPC
part such as TU102 has an aggregate rasteriser
throughput of six input triangles per clock.

TPC
Each GPC contains a variable number of
Texture Processing Clusters or TPCs. A TPC
is a collectionoftwoSMsthatsharea texture
unitandpolymorphengine.Thesharingof
textureprocessingandgeometryprocessing
(thepolymorphengine)acrossseveralblocks
ofGPUshadersis a commonarrangement
thesedaysthatallowsfora betterbalanceof
requiredprocessingforthedifferentstepsof
thegraphicspipeline.
Nvidia’stextureunitinTuringis incredibly
powerfulandcansendeightfullyfiltered
texturesamples(texels)totheSMsper
clock,evenwithbilinearfilteringandforwide
HDRpixelswith 64 bitsofdataeach.It’salso
capableofservicingrequestsoutoforder,
whichis somethingNvidiaarchitectures
havebeenofferingfora whilenow,while
competingdesignsareonlyjustcatching

THE GRAPHICS PROCESSING CLUSTER GPC IS


THE TOPLEVEL BUILDING BLOCK, IMPLEMENTING


ALMOST ALL THE CORE PROCESSING


The full diagram of TU102 hints at the sheer scale
and complexity of this 18.6 billion-transistor chip
Free download pdf