Custom PC – October 2019

While AMD’s compiler could dynamically and transparently select between the native Wave32 mode and Wave64 execution mode for some shaders, for performance, we reckon it’s mostly there for backwards compatibility with GCN, which runs in its own effective ‘Wave64’ mode over four cycles all the time. Some AMD customers highly value a measure of backwards compatibility, so it makes sense for it to be an operating mode of the RDNA microarchitecture, while still being a feature AMD can use in PC GPUs, such as the Radeon RX 5700 series, and future RDNA-based designs.

CACHE AND MEMORY

All these core microarchitectural changes in RDNA are backed up by some fresh thinking in the caches and memory hierarchy compared to GCN. First up, cache sizes: each RDNA implementation is free to choose its own sizes for the main caches, so we can only really talk about what’s in Navi 10. It has 4MB of Level 2 (L2) cache, which is the last cache on the chip before the external GDDR6 memory and the PCI-E 4 buses, depending on which one the GPU needs to talk to. All the blocks in the total design are clients of the L2 cache, so they all talk to the outside world through that cache. Then there’s 512KB of L1 cache for every ten WGPs in the Navi 10 Shader Engine.

Each Shader Engine in Navi 10 has two of those ten WGP sets, so there’s 1MB of total L1 cache per Shader Engine, or 2MB for the whole chip. Then there are a number of 0th-level caches below those L1 caches, one each for the SIMDs (the V$ L0), instruction fetch and issue hardware (the I$ L0), and the scalar K-cache (the K$ L0). AMD doesn’t disclose the size of the L0 caches in RDNA unfortunately, so we can’t tell you what they are, but they’re likely in the 8-32KB range. AMD also doubled the amount of bandwidth from the L0 cache hierarchy into the shader core, so the WGPs can make better use of those caches when they hit.

AMD also reduced the amount of clock cycles it takes to talk to each level of the cache hierarchy. That’s been a traditional deficit for very cache-sensitive programs running on AMD GPU designs when compared with competing architectures. AMD recognised that situation and addressed it in RDNA, heavily reducing access latency to L2 cache by up to 24 per cent. Access latency to the GDDR6 memory through the external memory fabric after the L2 cache is also reduced by up to 7 per cent.

There’s also a sweet change to how the cache hierarchy can respond to both requests and misses. Each kind of main request to the GDDR6 memory by the hardware, be it a general load, store or texture sample, now has its own request queue in the hardware. Compared with the shared queue in GCN, that lets RDNA’s memory system respond more efficiently to mixed request loads and service them separately, without any heavy accesses of any one type starving the others. It’s rare to hear about improvements in the memory system in a GPU, with vendors mainly focusing on the compute ability or other aspects, but that’s a great improvement for RDNA – it really helps general graphics workloads that tend to have a solid mix of both load/store memory activity and texture accesses.
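To see why separate queues help with mixed request loads, here is a toy Python sketch. It isn't a model of AMD's hardware: the round-robin arbitration between queues, the function names and the assumption that all requests are already waiting when servicing starts are ours, purely for illustration. It compares how long a single load waits behind a burst of texture samples in one shared queue versus per-type queues.

```python
from collections import deque

def shared_queue_order(requests):
    """Service requests strictly in arrival order from one shared FIFO."""
    return list(requests)

def per_type_queue_order(requests, kinds=("load", "store", "texture")):
    """Give each request kind its own FIFO and pick from them round-robin.

    Round-robin is an illustrative assumption, not AMD's disclosed policy.
    """
    queues = {kind: deque() for kind in kinds}
    for req in requests:
        queues[req[0]].append(req)
    order = []
    while any(queues.values()):
        for kind in kinds:
            if queues[kind]:
                order.append(queues[kind].popleft())
    return order

def wait_slots(order, target):
    """Number of requests serviced before the target request."""
    return order.index(target)

# A burst of eight texture samples arrives just ahead of a single load.
burst = [("texture", i) for i in range(8)] + [("load", 0)]

shared_wait = wait_slots(shared_queue_order(burst), ("load", 0))
split_wait = wait_slots(per_type_queue_order(burst), ("load", 0))
print("shared queue wait:", shared_wait)
print("per-type queue wait:", split_wait)
```

In this sketch the load sits behind all eight texture requests in the shared queue, but is picked up straight away once each request type has its own queue.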

Finally for this section, there’s a change to how the L2 cache can work if a request from the rest of the GPU misses in the L2 cache and needs to go out to the memory fabric and either GDDR6 or the PCI-E machinery. In GCN, a miss would block subsequent L2 accesses, even if they might hit and be able to return immediately, rather than wait for the request to come back through the last level fabric. In RDNA the L2 system now supports hit-under-miss. As the name suggests, any hit that happens under an in-flight miss can be serviced by the L2 system and return data to the GPU so it can keep going. Best of all, continued L2 hits can also return, which allows RDNA to be blocked less by L2 miss transactions and keep working.
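The difference between a blocking L2 and hit-under-miss can be sketched with a small Python timing model. The latencies used here (one cycle for a hit, ten for a miss), the one-access-per-cycle issue rate and the limit of one miss in flight are made-up assumptions for illustration, not figures from AMD.

```python
def completion_times(accesses, hit_latency=1, miss_latency=10,
                     hit_under_miss=False):
    """Return the cycle at which each access completes.

    accesses is a sequence of booleans: True for an L2 hit, False for a miss.
    One access issues per cycle unless the cache is blocked. Latencies are
    illustrative values only.
    """
    done = []
    cycle = 0          # cycle at which the next access can issue
    miss_returns = 0   # cycle at which the outstanding miss returns
    for is_hit in accesses:
        if is_hit:
            if not hit_under_miss:
                # Blocking cache: a hit must wait for the in-flight miss.
                cycle = max(cycle, miss_returns)
            done.append(cycle + hit_latency)
        else:
            # Either model allows only one miss in flight in this sketch.
            cycle = max(cycle, miss_returns)
            miss_returns = cycle + miss_latency
            done.append(miss_returns)
        cycle += 1
    return done

# One miss followed by three hits.
pattern = [False, True, True, True]
blocking = completion_times(pattern, hit_under_miss=False)
overlapped = completion_times(pattern, hit_under_miss=True)
print("blocking:      ", blocking)
print("hit-under-miss:", overlapped)
```

With these assumed numbers, the three hits in the blocking model all finish after the miss returns, while with hit-under-miss they complete while the miss is still in flight.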

REBALANCING RDNA GPUS

The vast majority of differences described in RDNA so far have been in service of graphics workloads first and foremost. From co-issue of FMA and SFU, to reworking the entire WGP structure to be more efficient with smaller, short-lived and/or branchy dispatches – work that’s less common in typical GPGPU workloads – to giving load, store and sampler accesses their own queues into the memory system, all the way to the really fast 64bpp bilinear sampling rate in the texture hardware. There’s no way to look at RDNA and not realise that AMD’s focus is graphics acceleration in games. All of that would be lost, though, if AMD decided to scope out RDNA-based designs as inappropriately wide as the biggest Vega GPUs in terms of the new shader core.


Navi 10 has 2,560 SIMD lanes, 80 scalar ALUs and 40 high-performance four-sample-per-clock texture units

