Custom PC – October 2019

Whereas the SIMDs in a GCN CU shared a single scalar ALU, RDNA has one assigned to each SIMD

and how it more efficiently shares available
per-WGP resources, such as the VGPR pool,
LDS and the scalar hardware that assists
the WGP’s SIMDs in their work. AMD could
have reasonably signed off the WGP design
there and called it done, but it also took a
close look at the kinds of shader programs
that modern games ask the hardware to
execute, across the full spectrum of vertex,
pixel, compute and tessellation. It also took
the compiler-driven way of executing the
new geometry pipeline into account.
A special function unit (SFU) lives alongside
the main SIMD hardware in the CU in GCN,
and its job is to efficiently run special kinds
of arithmetic instructions that don’t tend to
come up too often in shader programs and,
because of how the hardware needs to
implement them, shouldn’t really be a part of
the primary arithmetic pipeline in the SIMD.
The primary SIMD pipeline in RDNA
runs floating point at IEEE754-spec single
precision, just like a modern CPU, including
support for weird floating-point numbers that
affect how you execute, such as denormals
(numbers very close to, but not quite, zero).
It implements that with support for a fused
multiply-add (FMA) instruction. That 2-op
instruction runs both the multiply and the
add part of the arithmetic in a single cycle. It’s
also capable of just executing a multiply or
an add if that’s all that’s needed.
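To make the FMA and denormal points a little more concrete, here is a minimal Python sketch. It isn't RDNA code, and the helper names (to_f32, mul_then_add, fused_mul_add) are made up for illustration. It assumes the standard IEEE 754 behaviour in which a fused multiply-add rounds once, after the add, whereas a separate multiply and add rounds twice. The text above only talks about cycle counts, so treat the rounding angle as background rather than something AMD has stated.

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add happens."""
    product = to_f32(a * b)
    return to_f32(product + c)


def fused_mul_add(a: float, b: float, c: float) -> float:
    """Fused multiply-add: a * b + c is formed exactly, then rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting via double could, in rare cases, round twice. The example
    # values below are exactly representable, so that does not happen here.
    return to_f32(float(exact))


# Illustrative inputs, all exactly representable in single precision.
a = to_f32(1.0 + 2.0**-13)
b = to_f32(1.0 - 2.0**-13)
c = -1.0

unfused = mul_then_add(a, b, c)   # the product rounds to 1.0, so the result is 0.0
fused = fused_mul_add(a, b, c)    # keeps the tiny residual, -(2 ** -26)

# Denormals: the smallest positive single-precision value, next to the
# smallest normal one, both built directly from their bit patterns.
smallest_denormal = struct.unpack("<f", struct.pack("<I", 1))[0]
smallest_normal = struct.unpack("<f", struct.pack("<I", 0x00800000))[0]

print(unfused, fused)
print(smallest_denormal, smallest_normal)
```

The inputs are picked so that the exact product sits just below 1.0. Rounding it before the add throws the difference away, while the fused version keeps it.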
But that’s pretty much all it
does, at least for floating point.
There’s no support for the
aforementioned special functions,
such as sin, cos, log, pow, exp,
and other similar arithmetic
instructions, which tend to have a much
longer latency than one cycle at their target
accuracy. Because the latency to generate
the right result for those functions can be long
and variable, support for those essential but
special functions has always been separate
in AMD GPU architectures, and indeed those
from almost all other GPU makers too.
In GCN, to issue an instruction to the SFU,
you steal an instruction slot in the instruction
stream that you’re running, so you can’t issue
to the main SIMD while you’re issuing to the
SFU, and you wait for the SFU result to come
back before you can do more main SIMD
work. That’s easy to implement, but kind of
sucks because it takes issue slots away from
the main SIMD hardware, which is almost
always more commonly needed in normal
shader programs. The innovation in RDNA
is to catch up with competing architectures
and still take an issue slot away from the
SIMD hardware, but to then run the SFU in
parallel. In RDNA, the SFU takes four cycles
to return the result of a particular op. In GCN
that would block the main SIMD hardware
for all four clocks. In RDNA, it only blocks for
one clock, and the main SIMD can get back
to work for the three other clocks that are left
before the SFU comes back with its result. If
the compiler can find work to do in those
three clocks, then RDNA gets (much) more
efficient at running those co-issuable
FMA + SFU opportunities that sometimes
crop up in modern shaders. It’s doable for the
compiler, especially because it’s not a huge
instruction window to fill up before the SFU
will come back with any result.
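The difference between the two issue models can be sketched with a toy cycle counter. This is a back-of-the-envelope model rather than a description of the real scheduler. The function name total_cycles and the "fma"/"sfu" op labels are hypothetical. It assumes every FMA after an SFU op is independent of the SFU result, and that only one SFU op is in flight at a time. The text above confirms neither assumption.

```python
def total_cycles(program, parallel_sfu, sfu_latency=4):
    """Count clocks for a list of ops, each either "fma" or "sfu".

    parallel_sfu=False models the GCN-style behaviour described above:
    the main SIMD is blocked for the whole SFU latency.
    parallel_sfu=True models the RDNA-style behaviour: the SFU op costs
    one issue clock and then runs alongside the main SIMD.
    """
    clock = 0       # next free issue slot on the main SIMD
    sfu_ready = 0   # clock at which the last SFU result is available
    for op in program:
        if op == "sfu":
            if parallel_sfu:
                # Assumption: wait for any earlier SFU op to finish first.
                clock = max(clock, sfu_ready)
                sfu_ready = clock + sfu_latency
                clock += 1  # only the issue slot is taken from the SIMD
            else:
                clock += sfu_latency  # SIMD stalls until the result returns
                sfu_ready = clock
        elif op == "fma":
            clock += 1
        else:
            raise ValueError(f"unknown op: {op}")
    # The program is finished once the SIMD is idle and the SFU result is back.
    return max(clock, sfu_ready)


# One SFU op followed by three independent FMAs: the best case for overlap.
program = ["sfu", "fma", "fma", "fma"]
gcn_style = total_cycles(program, parallel_sfu=False)
rdna_style = total_cycles(program, parallel_sfu=True)
print(gcn_style, rdna_style)
```

With nothing to fill the three spare clocks (a program of just ["sfu"]), both models take the same four clocks, which is why the gain depends on the compiler finding independent work.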

TEXTURE HARDWARE CHANGES, FP16 AND WAVE64

The only texture hardware change of note in RDNA that we can see is that the design is now capable of full-rate 64 bits per pixel (bpp) and BC6H format bilinear filtering, which is twice the performance in Vega. Commonly requested by game developers over the last few years, AMD upped its texture performance ante and made its sampler hardware full-rate for basically all surface formats you might want to filter in common graphics workloads now. It’s a great future-looking change to the texture hardware design in response to strong developer demand.

Shader programs in modern games, especially pixel shaders, have plenty of opportunity for executing parts of the shader program in reduced precision, typically 16-bit floating point (FP16). Like GCN, RDNA still supports executing FP16 at double rate for some instructions (and 16-bit integer too, but that’s less common) by running those instructions on two halves of a packed 32-bit VGPR. Put one 16-bit value in the top half of the register, and one in the bottom half, and the hardware can run a dual-rate instruction for a decent subset of the SIMD’s arithmetic instruction capability. Competing architectures can do a similar process now too.

Lastly, before we can move on from the new WGP that makes up the shader core, AMD made a point of telling the public that RDNA can also operate in a special mode it calls Wave64. As we figured out earlier, GPUs have choices when it comes to wave size and how many clocks over which to run the wave. In RDNA, Wave64 mode is a 64-wide wave that runs over two cycles on the 32-wide SIMDs that make up the WGP.
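Two of the ideas above are easy to sketch in a few lines of Python: packing a pair of FP16 values into one 32-bit register, and working out how many passes a wave needs on a SIMD of a given width. The names pack_half2, unpack_half2, packed_add and wave_passes are invented for illustration. Which value goes in which half is an arbitrary choice here. The sketch only mimics the data layout, not how the hardware schedules anything.

```python
import struct


def pack_half2(lo: float, hi: float) -> int:
    """Pack two FP16 values into one 32-bit word (lo in bits 0-15, hi in bits 16-31)."""
    (lo_bits,) = struct.unpack("<H", struct.pack("<e", lo))
    (hi_bits,) = struct.unpack("<H", struct.pack("<e", hi))
    return (hi_bits << 16) | lo_bits


def unpack_half2(word: int) -> tuple[float, float]:
    """Split a 32-bit word back into its two FP16 values, returned as (lo, hi)."""
    (lo,) = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))
    (hi,) = struct.unpack("<e", struct.pack("<H", (word >> 16) & 0xFFFF))
    return lo, hi


def packed_add(x: int, y: int) -> int:
    """Add two packed registers half by half, rounding each result back to FP16."""
    x_lo, x_hi = unpack_half2(x)
    y_lo, y_hi = unpack_half2(y)
    return pack_half2(x_lo + y_lo, x_hi + y_hi)


def wave_passes(wave_size: int, simd_width: int = 32) -> int:
    """Number of passes a wave needs on a SIMD of the given width (rounded up)."""
    return -(-wave_size // simd_width)


reg_a = pack_half2(1.5, -2.0)   # 0xC0003E00: -2.0 in the top half, 1.5 in the bottom
reg_b = pack_half2(0.25, 0.5)
total = unpack_half2(packed_add(reg_a, reg_b))   # (1.75, -1.5)

print(hex(reg_a), total)
print(wave_passes(32), wave_passes(64))   # Wave64 on a 32-wide SIMD takes two passes
```

A single packed_add call here does two additions on one register-sized value, which is the same bookkeeping that lets the hardware treat one 32-bit VGPR lane as two 16-bit operations.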

RDNA ALSO CHANGES HOW WORK IS DISTRIBUTED ACROSS A COLLECTION OF THOSE SIMD UNITS
