Custom PC – October 2019

While AMD’s compiler could dynamically and transparently select between the native Wave32 mode and Wave64 execution mode for some shaders, for performance, we reckon it’s mostly there for backwards compatibility with GCN, which runs in its own effective ‘Wave64’ mode over four cycles all the time. Some AMD customers highly value a measure of backwards compatibility, so it makes sense for it to be an operating mode of the RDNA microarchitecture, while still being a feature AMD can use in PC GPUs, such as the Radeon RX 5700 series, and future RDNA-based designs.
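
To put a number on the ‘over four cycles’ remark, here is a minimal sketch that works out how many issue cycles one wavefront needs on a SIMD of a given width. The SIMD widths (16 lanes for GCN, 32 for RDNA) aren’t stated in this passage and are assumptions taken from AMD’s public descriptions, and `issue_cycles` is a name made up for the example. It ignores everything else that affects real issue rates.

```python
import math

def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Cycles needed to push one wavefront through a SIMD of the given width."""
    return math.ceil(wave_size / simd_width)

# Assumed SIMD widths (not given in the text above): GCN = 16 lanes, RDNA = 32 lanes.
GCN_SIMD_WIDTH = 16
RDNA_SIMD_WIDTH = 32

gcn_wave64 = issue_cycles(64, GCN_SIMD_WIDTH)    # the 'four cycles' case
rdna_wave32 = issue_cycles(32, RDNA_SIMD_WIDTH)  # native RDNA mode
rdna_wave64 = issue_cycles(64, RDNA_SIMD_WIDTH)  # compatibility mode on RDNA
```

Under those assumptions a Wave64 on RDNA needs two passes through the wider SIMD rather than GCN’s four, which is why it can plausibly sit alongside native Wave32 as a compatibility mode.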

CACHE AND MEMORY
All these core microarchitectural changes in RDNA are backed up by some fresh thinking in the caches and memory hierarchy compared with GCN. First up, cache sizes: each RDNA implementation is free to choose its own sizes for the main caches, so we can only really talk about what’s in Navi 10. It has 4MB of Level 2 (L2) cache, which is the last cache on the chip before the external GDDR6 memory and the PCI-E 4 buses, depending on which one the GPU needs to talk to. All the blocks in the total design are clients of the L2 cache, so they all talk to the outside world through that cache. Then there’s 512KB of L1 cache for every ten WGPs in the Navi 10 Shader Engine.

Each Shader Engine in Navi 10 has two of those ten-WGP sets, so there’s 1MB of total L1 cache per Shader Engine, or 2MB for the whole chip. Then there are a number of 0th-level caches below those L1 caches, one each for the SIMDs (the V$ L0), instruction fetch and issue hardware (the I$ L0), and the scalar K-cache (the K$ L0). AMD doesn’t disclose the size of the L0 caches in RDNA unfortunately, so we can’t tell you what they are, but they’re likely in the 8-32KB range.
AMD also doubled the amount of bandwidth from the L0 cache hierarchy into the shader core, so the WGPs can make better use of those caches when they hit. AMD also reduced the number of clock cycles it takes to talk to each level of the cache hierarchy. That’s been a traditional deficit for very cache-sensitive programs running on AMD GPU designs when compared with competing architectures. AMD recognised that situation and addressed it in RDNA, heavily reducing access latency to L2 cache by up to 24 per cent. Access latency to the GDDR6 memory through the external memory fabric after the L2 cache is also reduced by up to 7 per cent.
There’s also a sweet change to how the cache hierarchy can respond to both requests and misses. Each kind of main request to the GDDR6 memory by the hardware, be it a general load, store or texture sample, now has its own request queue in the hardware. Compared with the shared queue in GCN, that lets RDNA’s memory system respond more efficiently to mixed request loads and service them separately, without any heavy accesses of any one type starving the others. It’s rare to hear about improvements in the memory system in a GPU, with vendors mainly focusing on the compute ability or other aspects, but that’s a great improvement for RDNA – it really helps general graphics workloads that tend to have a solid mix of both load/store memory activity and texture accesses.
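
The starvation point is easier to see with a toy model. The sketch below compares a single shared FIFO with one FIFO per request type, servicing one request per step. It’s only an illustration: the round-robin arbitration between queues is our assumption (AMD hasn’t said how the hardware picks between them), all requests are treated as arriving at once, and the function names are invented for the example.

```python
from collections import deque

def shared_queue_order(requests):
    """Service requests strictly in arrival order from one shared FIFO."""
    return list(requests)

def per_type_queue_order(requests, kinds=("load", "store", "texture")):
    """Service requests from one FIFO per kind, visiting the kinds round-robin.

    Round-robin is an assumed arbitration policy, chosen only for illustration.
    """
    queues = {kind: deque() for kind in kinds}
    for req in requests:
        queues[req[0]].append(req)
    order = []
    while any(queues.values()):
        for kind in kinds:
            if queues[kind]:
                order.append(queues[kind].popleft())
    return order

def wait_for(order, request):
    """Number of requests serviced before the given one."""
    return order.index(request)

# A heavy burst of texture samples arrives just ahead of one load and one store.
burst = [("texture", i) for i in range(8)] + [("load", 0), ("store", 0)]

shared = shared_queue_order(burst)
split = per_type_queue_order(burst)

load_wait_shared = wait_for(shared, ("load", 0))
load_wait_split = wait_for(split, ("load", 0))
```

In this model the lone load sits behind all eight texture samples in the shared queue, but goes out straight away when each type has its own queue. The total work is the same in both cases; only the ordering changes.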

Finally for this section, there’s a change to how the L2 cache can work if a request from the rest of the GPU misses in the L2 cache and needs to go out to the memory fabric and either GDDR6 or the PCI-E machinery. In GCN, a miss would block subsequent L2 accesses, even if they would have hit and been able to return immediately; they had to wait for the missed request to come back through the last-level fabric. In RDNA the L2 system now supports hit-under-miss. As the name suggests, any hit that happens under an in-flight miss can be serviced by the L2 system and return data to the GPU so it can keep going. Best of all, continued L2 hits can also return, which allows RDNA to be blocked less by L2 miss transactions and keep working.
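
Here’s a rough sketch of why hit-under-miss matters, comparing a blocking cache with one that lets hits through while a miss is outstanding. The latencies are hypothetical round numbers, not AMD’s figures, and the model assumes in-order issue with only one miss in flight at a time, which is the textbook definition of hit-under-miss rather than anything AMD has confirmed about the RDNA L2.

```python
HIT_CYCLES = 1     # hypothetical latency of an L2 hit
MISS_CYCLES = 20   # hypothetical latency of an L2 miss going out to memory

def blocking_completion_times(accesses):
    """Each access waits for the previous one to finish, hit or miss."""
    now = 0
    done = []
    for kind in accesses:
        now += MISS_CYCLES if kind == "miss" else HIT_CYCLES
        done.append(now)
    return done

def hit_under_miss_completion_times(accesses):
    """Hits keep flowing while a single miss is outstanding.

    A second miss has to wait until the first one has returned.
    """
    now = 0            # cycle at which the next access can start
    miss_returns = 0   # cycle at which the outstanding miss (if any) comes back
    done = []
    for kind in accesses:
        if kind == "miss":
            start = max(now, miss_returns)   # only one miss in flight at a time
            miss_returns = start + MISS_CYCLES
            done.append(miss_returns)
            now = start + 1                  # the lookup itself occupies one cycle
        else:
            now += HIT_CYCLES
            done.append(now)
    return done

accesses = ["miss", "hit", "hit", "hit"]
blocking = blocking_completion_times(accesses)
overlapped = hit_under_miss_completion_times(accesses)
```

With these made-up latencies, the three hits complete at cycles 21, 22 and 23 in the blocking model, but at cycles 2, 3 and 4 with hit-under-miss, while the miss itself returns at cycle 20 either way.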

REBALANCING RDNA GPUS
The vast majority of differences described in RDNA so far have been in service of graphics workloads first and foremost. From co-issue of FMA and SFU, to reworking the entire WGP structure to be more efficient with smaller, short-lived and/or branchy dispatches (work that’s less common in typical GPGPU workloads), to giving load, store and sampler accesses their own queues into the memory system, all the way to the really fast 64bpp bilinear sampling rate in the texture hardware. There’s no way to look at RDNA and not realise that AMD’s focus is graphics acceleration in games. All of that would be lost, though, if AMD decided to scope out RDNA-based designs as inappropriately wide as the biggest Vega GPUs in terms of the new shader core.

There’s 4MB of L2 cache, 512KB of L1 cache for every ten WGPs and a number of 0th-level caches


Navi 10 has 2,560 SIMD lanes, 80 scalar ALUs and 40 high-performance four-sample-per-clock texture units

