Custom PC – October 2019

Processor (WGP). We’ll consider RDNA at that level now, in collections of WGPs, with the CU arrangement of GCN as our context. That pair of profound changes in the microarchitecture achieves two key goals: reducing the inefficiency of branchy code, and getting a higher aggregate throughput for smaller waves. Let’s work through a short but illustrative example. Imagine a 64-thread wave dispatched to each machine. GCN accepts that into a CU, and because it runs the CU as a collection of four 16-wide SIMDs that run over four clocks, all 64 threads have to go to one of those four SIMDs in the CU. The dispatch takes four cycles to execute and leaves three of the four SIMDs idle. RDNA accepts the dispatch into a WGP, and because the SIMDs only take one clock to run a wave, and there’s a separate decoder and issue path for each SIMD, the work is distributed across both and finishes in a single clock with no idle SIMDs. The end result is that, for smaller dispatches, which actually make up a surprising amount of modern rendering, RDNA can work on them more efficiently than GCN.
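
To make the arithmetic in that example concrete, here’s a minimal Python sketch of the cycle counting. It’s an illustrative toy model rather than a simulator: the function name and the assumption that each SIMD can accept one wave per clock are ours, while the SIMD widths and counts are the ones described above.

def dispatch_cost(threads, simd_count, simd_width, wave_locked_to_one_simd):
    # Returns (clocks, idle_simds) for a single dispatch of `threads` threads.
    if wave_locked_to_one_simd:
        # GCN-style: the whole 64-thread wave is pinned to one 16-wide SIMD
        # and iterates over it, leaving the other SIMDs in the CU unused.
        clocks = threads // simd_width           # 64 / 16 = 4 clocks
        idle_simds = simd_count - 1              # three SIMDs sit idle
    else:
        # RDNA-style: the dispatch is split into one-clock waves and spread
        # across every SIMD in the WGP, each with its own issue path.
        waves = threads // simd_width            # 64 / 32 = 2 waves
        clocks = -(-waves // simd_count)         # ceil(2 / 2) = 1 clock
        idle_simds = max(0, simd_count - waves)  # no SIMDs sit idle
    return clocks, idle_simds

print("GCN CU  :", dispatch_cost(64, simd_count=4, simd_width=16, wave_locked_to_one_simd=True))
print("RDNA WGP:", dispatch_cost(64, simd_count=2, simd_width=32, wave_locked_to_one_simd=False))
# GCN CU  : (4, 3) -> four clocks with three idle SIMDs
# RDNA WGP: (1, 0) -> one clock with no idle SIMDs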


It provides a very healthy speed-up in modern shader code just from those changes alone, but RDNA doesn’t stop there.

SCALAR PERFORMANCE
Whereas the SIMDs in a GCN CU shared a single scalar arithmetic logic unit (ALU), which is useful for any task that only needs to be executed once for any given wave, RDNA has a pair of those same scalar ALUs per WGP, one assigned to each SIMD. The scalar-to-SIMD ratio is effectively doubled in RDNA as a result, including the scalar register space and scalar cache, which AMD call the K-cache (K$). Because those scalar units work on behalf of the SIMDs in AMD GPU designs, feeding them data, they form an important part of the processing chain in the shader core. While it was rare for the scalar unit to become a bottleneck in GCN, in RDNA it physically can’t be a bottleneck for the SIMD hardware.
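
As a rough sketch of what ‘executed once for any given wave’ means, the Python below separates per-wave work from per-thread work. The names and values are hypothetical, and in a real shader this split is worked out by the compiler, which emits scalar and vector instructions accordingly.

WAVE_SIZE = 32   # one RDNA wave

def shade_wave(base_address, material_scale, lane_indices):
    # Scalar work: identical for all 32 threads in the wave, so the hardware
    # only needs to compute it once, on the scalar ALU.
    table_offset = base_address + material_scale * 16

    # Vector work: different for every thread, so it runs across the SIMD lanes.
    return [table_offset + lane * material_scale for lane in lane_indices]

results = shade_wave(base_address=1024, material_scale=2,
                     lane_indices=range(WAVE_SIZE))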

GCN distributes work across groups of four 16-wide SIMD units
in a collective block it calls the Compute Unit (CU)


DOUBLED VGPR AND LDS BANDWIDTH
Each RDNA WGP has 256KB of vector
general purpose registers (VGPRs) to use
for intermediate storage as it runs shader
programs. As with almost all programmable GPU
architectures since they were invented, the
maximum number of waves in flight on the
machine is bound by the number of
registers available to run them.
Say that your shader program needs ten
registers and your machine has 20 available.
You can run two waves of threads, each
consuming ten of 20 registers. That basic
principle applies to modern GPUs, just with
much bigger numbers: RDNA has 256KB of
VGPRs per WGP. Each VGPR is 32 x 32-bits
in size (so 128B), giving you 2,048 (262,144
/ 128) VGPRs per WGP. Even with ten waves
in flight on the WGP, that’s still roughly 200
VGPRs per wave. Contrast that with the
architectural limit of modern x86-64, which
has just 16 registers available for programs to
use (under the hood there are more, but still!);
GPUs are just completely different beasts.
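
That register budget is easy to verify with a few lines of Python. The sizes come straight from the figures in this section, and the occupancy rule at the end is just the ten-out-of-twenty example restated.

VGPR_FILE_BYTES = 256 * 1024                 # 256KB of VGPRs per WGP
VGPR_BYTES = 32 * 4                          # each VGPR is 32 lanes x 32 bits = 128B

total_vgprs = VGPR_FILE_BYTES // VGPR_BYTES  # 262,144 / 128 = 2,048 VGPRs per WGP
waves_in_flight = 10
print(total_vgprs // waves_in_flight)        # 204, the 'roughly 200' VGPRs per wave

# The occupancy rule from the simple example: a shader needing ten registers on
# a machine with 20 available can keep two waves in flight at once.
def waves_that_fit(registers_available, registers_per_wave):
    return registers_available // registers_per_wave

assert waves_that_fit(20, 10) == 2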
Lastly, let’s talk about Local Data Share
(LDS), which is a special, very fast, very
flexible local memory shared
among the SIMDs in a WGP
or CU. It’s no bigger in RDNA
compared with GCN at 64KB,
but because there are fewer
SIMDs competing for its usage,
the effective bandwidth doubles
for the aggregate WGP structure compared with a GCN CU. It’s the same tactic AMD took with the scalar ALU and its resources, and the VGPR pool sizing: keep it as before but share it between less SIMD hardware.
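
The LDS claim is simply a sharing argument, sketched below. The LDS size is the 64KB quoted above; the bandwidth is left as a normalised 1.0 because no absolute figure is quoted here.

LDS_BYTES = 64 * 1024          # same 64KB LDS in both GCN and RDNA
LDS_BANDWIDTH = 1.0            # normalised; no absolute figure is quoted

gcn_share_per_simd  = LDS_BANDWIDTH / 4   # four SIMDs in a CU compete for it
rdna_share_per_simd = LDS_BANDWIDTH / 2   # only two SIMDs in a WGP do

print(rdna_share_per_simd / gcn_share_per_simd)   # 2.0 -> effective bandwidth doubles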

DOING MORE PER CLOCK IN THE WGP
We’ve now established the new WGP structure and new basic execution model,

RDNA distributes its work across a pair of 32-wide SIMD
units in a collective block it calls the Workgroup Processor


BECAUSE THERE ARE FEWER SIMDS COMPETING FOR LDS USAGE, THE EFFECTIVE BANDWIDTH DOUBLES
