Custom PC – October 2019

Processor (WGP). We’ll consider RDNA at that level now, in collections of WGPs, with the CU arrangement of GCN as our context. That pair of profound changes in the microarchitecture achieves two key goals: reducing the inefficiency of branchy code, and getting a higher aggregate throughput for smaller waves. Let’s work through a short but illustrative example. Imagine a 64-thread wave dispatched to each machine. GCN accepts that into a CU, and because it runs the CU as a collection of four 16-wide SIMDs that run over four clocks, all 64 threads have to go to one of those four SIMDs in the CU. The dispatch takes four cycles to execute and leaves three of the four SIMDs idle. RDNA accepts the dispatch into a WGP, and because the SIMDs only take one clock to run a wave, and there’s a separate decoder and issue path for each SIMD, the work is distributed across both and finishes in a single clock with no idle SIMDs. The end result is that, for smaller dispatches, which actually make up a surprising amount of modern rendering, RDNA can work on them more efficiently than GCN.
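
To make the arithmetic in that example concrete, here’s a minimal Python sketch of the cycle counting. It’s an illustrative toy model rather than a simulator: the function name and the assumption that each SIMD can accept one wave per clock are ours, while the SIMD widths and counts are the ones described above.

def dispatch_cost(threads, simd_count, simd_width, wave_locked_to_one_simd):
    # Returns (clocks, idle_simds) for a single dispatch of `threads` threads.
    if wave_locked_to_one_simd:
        # GCN-style: the whole 64-thread wave is pinned to one 16-wide SIMD
        # and iterates over it, leaving the other SIMDs in the CU unused.
        clocks = threads // simd_width           # 64 / 16 = 4 clocks
        idle_simds = simd_count - 1              # three SIMDs sit idle
    else:
        # RDNA-style: the dispatch is split into one-clock waves and spread
        # across every SIMD in the WGP, each with its own issue path.
        waves = threads // simd_width            # 64 / 32 = 2 waves
        clocks = -(-waves // simd_count)         # ceil(2 / 2) = 1 clock
        idle_simds = max(0, simd_count - waves)  # no SIMDs sit idle
    return clocks, idle_simds

print("GCN CU  :", dispatch_cost(64, simd_count=4, simd_width=16, wave_locked_to_one_simd=True))
print("RDNA WGP:", dispatch_cost(64, simd_count=2, simd_width=32, wave_locked_to_one_simd=False))
# GCN CU  : (4, 3) -> four clocks with three idle SIMDs
# RDNA WGP: (1, 0) -> one clock with no idle SIMDs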


It provides a very healthy speed-up in modern shader code just from those changes alone, but RDNA doesn’t stop there.

SCALAR PERFORMANCE
Whereas the SIMDs in a GCN CU shared a single scalar arithmetic logic unit (ALU), which is useful for any task that only needs to be executed once for any given wave, RDNA has a pair of those same scalar ALUs per WGP, one assigned to each SIMD. The scalar-to-SIMD ratio is effectively doubled in RDNA as a result, including the scalar register space and scalar cache, which AMD call the K-cache (K$). Because those scalar units work on behalf of the SIMDs in AMD GPU designs, feeding them data, they form an important part of the processing chain in the shader core. While it was rare for the scalar unit to become a bottleneck in GCN, in RDNA it physically can’t be a bottleneck for the SIMD hardware.
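
As a rough sketch of what ‘executed once for any given wave’ means, the Python below separates per-wave work from per-thread work. The names and values are hypothetical, and in a real shader this split is worked out by the compiler, which emits scalar and vector instructions accordingly.

WAVE_SIZE = 32   # one RDNA wave

def shade_wave(base_address, material_scale, lane_indices):
    # Scalar work: identical for all 32 threads in the wave, so the hardware
    # only needs to compute it once, on the scalar ALU.
    table_offset = base_address + material_scale * 16

    # Vector work: different for every thread, so it runs across the SIMD lanes.
    return [table_offset + lane * material_scale for lane in lane_indices]

results = shade_wave(base_address=1024, material_scale=2,
                     lane_indices=range(WAVE_SIZE))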

GCN distributes work across groups of four 16-wide SIMD units
in a collective block it calls the Compute Unit (CU)


DOUBLED VGPR AND LDS BANDWIDTH
Each RDNA WGP has 256KB of vector
general purpose registers (VGPRs) to use
for intermediate storage as it runs shader
programs. As with almost all programmable GPU
architectures since they were invented, the
maximum number of waves in flight on the
machine is bound by the number of
registers available to run them.
Say that your shader program needs ten
registers and your machine has 20 available.
You can run two waves of threads, each
consuming ten of 20 registers. That basic
principle applies to modern GPUs, just with
much bigger numbers: RDNA has 256KB of
VGPRs per WGP. Each VGPR is 32 x 32-bits
in size (so 128B), giving you 2,048 (262,144
/ 128) VGPRs per WGP. Even with ten waves
in flight on the WGP, that’s still roughly 200
VGPRs per wave. Contrast that with the
architectural limit of modern x86-64, which
has just 16 registers available for programs to
use (under the hood there are more, but still!);
GPUs are just completely different beasts.
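
That register budget is easy to verify with a few lines of Python. The sizes come straight from the figures in this section, and the occupancy rule at the end is just the ten-out-of-twenty example restated.

VGPR_FILE_BYTES = 256 * 1024                 # 256KB of VGPRs per WGP
VGPR_BYTES = 32 * 4                          # each VGPR is 32 lanes x 32 bits = 128B

total_vgprs = VGPR_FILE_BYTES // VGPR_BYTES  # 262,144 / 128 = 2,048 VGPRs per WGP
waves_in_flight = 10
print(total_vgprs // waves_in_flight)        # 204, the 'roughly 200' VGPRs per wave

# The occupancy rule from the simple example: a shader needing ten registers on
# a machine with 20 available can keep two waves in flight at once.
def waves_that_fit(registers_available, registers_per_wave):
    return registers_available // registers_per_wave

assert waves_that_fit(20, 10) == 2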
Lastly, let’s talk about Local Data Share
(LDS), which is a special, very fast, very
flexible local memory shared
among the SIMDs in a WGP
or CU. It’s no bigger in RDNA
compared with GCN at 64KB,
but because there are fewer
SIMDs competing for its usage,
the effective bandwidth doubles
for the aggregate WGP structure compared with a GCN CU. It’s the same tactic AMD took with the scalar ALU and its resources, and the VGPR pool sizing: keep it as before but share it between less SIMD hardware.
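
The LDS claim is simply a sharing argument, sketched below. The LDS size is the 64KB quoted above; the bandwidth is left as a normalised 1.0 because no absolute figure is quoted here.

LDS_BYTES = 64 * 1024          # same 64KB LDS in both GCN and RDNA
LDS_BANDWIDTH = 1.0            # normalised; no absolute figure is quoted

gcn_share_per_simd  = LDS_BANDWIDTH / 4   # four SIMDs in a CU compete for it
rdna_share_per_simd = LDS_BANDWIDTH / 2   # only two SIMDs in a WGP do

print(rdna_share_per_simd / gcn_share_per_simd)   # 2.0 -> effective bandwidth doubles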

DOING MORE PER CLOCK IN THE WGP
We’ve now established the new WGP structure and new basic execution model,

RDNA distributes its work across a pair of 32-wide SIMD
units in a collective block it calls the Workgroup Processor


BECAUSE THERE ARE FEWER SIMDS COMPETING FOR LDS USAGE, THE EFFECTIVE BANDWIDTH DOUBLES
