Custom PC – October 2019

Whereas the SIMDs in a GCN CU shared a
single scalar ALU, RDNA has one assigned
to each SIMD

and how it more efficiently shares available
per-WGP resources, such as the VGPR pool,
LDS and the scalar hardware that assists
the WGP’s SIMDs in their work. AMD could
have reasonably signed off the WGP design
there and called it done, but it also took a
close look at the kinds of shader programs
that modern games ask the hardware to
execute, across the full spectrum of vertex,
pixel, compute and tessellation. It also took
the compiler-driven way of executing the
new geometry pipeline into account.
A special function unit (SFU) lives alongside
the main SIMD hardware in the CU in GCN,
and its job is to efficiently run special kinds
of arithmetic instructions that don’t tend to
come up too often in shader programs and,
because of how the hardware needs to
implement them, shouldn’t really be a part of
the primary arithmetic pipeline in the SIMD.
The primary SIMD pipeline in RDNA
runs floating point at IEEE754-spec single
precision, just like a modern CPU, including
support for weird floating-point numbers that
affect how you execute, such as denormals
(numbers very close to, but not quite, zero).
It implements that with support for a fused
multiply-add (FMA) instruction. That 2-op
instruction runs both the multiply and the
add part of the arithmetic in a single cycle. It’s
also capable of just executing a multiply or
an add if that’s all that’s needed.
But that’s pretty much all it
does, at least for floating point.
There’s no support for the
aforementioned special functions,
such as sin, cos, log, pow, exp,
and other similar arithmetic
instructions, which tend to have a much
longer latency than one cycle at their target
accuracy. Because the latency to generate
the right result for those functions can be long
and variable, support for those essential but
special functions has always been separate
in AMD GPU architectures, and indeed those
from almost all other GPU makers too.
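To make the fused multiply-add described earlier a little more concrete, here is a minimal sketch of why fusing the two operations matters numerically: a fused instruction rounds once, at the end, rather than after the multiply and again after the add. The function names are made up for illustration, and the sketch uses Python's double-precision floats and exact rational arithmetic to emulate the single rounding. It is not single precision and says nothing about how the RDNA hardware is built.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Multiply and round, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulate an FMA: form a*b + c exactly, then round only once.

    Fraction arithmetic is exact, and float() on the result applies a
    single correctly rounded conversion back to a double.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so the exact product, 1 - 2**-54, cannot be stored in a
# double and gets rounded to 1.0 before the add in the unfused version.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With these inputs the unfused version loses the small term entirely, while the fused version keeps it.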
In GCN, to issue an instruction to the SFU,
you steal an instruction slot in the instruction
stream that you’re running, so you can’t issue
to the main SIMD while you’re issuing to the
SFU, and you wait for the SFU result to come
back before you can do more main SIMD
work. That’s easy to implement, but kind of
sucks because it takes issue slots away from
the main SIMD hardware, which is almost
always more commonly needed in normal
shader programs. The innovation in RDNA is
to catch up with competing architectures and
still take an issue slot away from the SIMD
hardware, but to then run the SFU in parallel.
In RDNA, the SFU takes four cycles to return
the result of a particular op. In GCN that would
block the main SIMD hardware for all four
clocks. In RDNA, it only blocks for one clock,
and the main SIMD can get back to work for
the three other clocks that are left before the
SFU comes back with its result. If the compiler
can find work to do in those three clocks, then
RDNA gets (much) more efficient at running
those co-issuable FMA + SFU opportunities
that sometimes crop up in modern shaders.
It’s doable for the compiler, especially because
it’s not a huge instruction window to fill up
before the SFU will come back with any result.
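The difference between the two issue schemes can be sketched with a toy cycle count. This is a back-of-envelope model rather than a simulator. The four-cycle SFU latency comes from the text above, but the function names, the one-instruction-per-clock assumption and the idea of a fixed number of independent FMAs per SFU op are all simplifications invented for illustration.

```python
SFU_LATENCY = 4  # cycles for an SFU result, the figure quoted in the text


def gcn_cycles(sfu_ops, fma_ops):
    """Blocking model: each SFU op stalls the main SIMD for its full latency."""
    return sfu_ops * SFU_LATENCY + fma_ops


def rdna_cycles(sfu_ops, fma_ops, independent_fma_per_sfu):
    """Parallel model: an SFU op costs one issue slot, then the remaining
    cycles of its latency can be filled with independent FMA work, if the
    compiler can find any."""
    shadow = SFU_LATENCY - 1  # the three clocks left after the issue slot
    fillable = min(independent_fma_per_sfu, shadow)
    overlapped = min(fma_ops, sfu_ops * fillable)
    return sfu_ops * SFU_LATENCY + fma_ops - overlapped


# A hypothetical shader fragment: 10 special-function ops and 30 FMAs.
print("GCN-style:          ", gcn_cycles(10, 30))
print("RDNA-style, no fill:", rdna_cycles(10, 30, 0))
print("RDNA-style, filled: ", rdna_cycles(10, 30, 3))
```

In this model, the filled case hides every FMA behind the SFU latency, and the no-fill case is no better than the blocking scheme, which is why the gain depends on what the compiler can schedule into those three clocks.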

TEXTURE HARDWARE CHANGES, FP16 AND WAVE64
The only texture hardware change of note
in RDNA that we can see is that the design
is now capable of full-rate 64 bits per pixel
(bpp) and BC6H format bilinear filtering,
which is twice the performance of Vega.
Game developers have commonly requested
this over the last few years, so AMD upped its texture
performance ante and made its sampler
hardware full-rate for basically all surface
formats you might want to filter in common
graphics workloads now. It’s a great future-
looking change to the texture hardware design
in response to strong developer demand.
Shader programs in modern games,
especially pixel shaders, have plenty
of opportunity for executing
parts of the shader program in
reduced precision, typically 16-bit
floating point (FP16). Like GCN,
RDNA still supports executing
FP16 at double rate for some
instructions (and 16-bit integer
too, but that’s less common) by running
those instructions on two halves of a packed
32-bit VGPR. Put one 16-bit value in the top
half of the register, and one in the bottom
half, and the hardware can run a dual-rate
instruction for a decent subset of the SIMD’s
arithmetic instruction capability. Competing
architectures can do a similar process now too.
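The packing idea is easy to show in software. The sketch below puts two FP16 values into the low and high halves of one 32-bit word, then adds two such words half by half. The helper names are hypothetical, and Python's struct module stands in for the register. It illustrates the data layout only, not the hardware's instruction set.

```python
import struct


def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word: lo in bits 0-15, hi in 16-31."""
    (lo_bits,) = struct.unpack("<H", struct.pack("<e", lo))
    (hi_bits,) = struct.unpack("<H", struct.pack("<e", hi))
    return (hi_bits << 16) | lo_bits


def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 values (lo, hi)."""
    (lo,) = struct.unpack("<e", struct.pack("<H", word & 0xFFFF))
    (hi,) = struct.unpack("<e", struct.pack("<H", (word >> 16) & 0xFFFF))
    return lo, hi


def packed_add(x, y):
    """Add both halves in one call, mimicking a dual-rate packed instruction.

    Each sum is formed in double precision and rounded to FP16 when it is
    repacked; struct raises OverflowError if a sum exceeds the FP16 range.
    """
    x_lo, x_hi = unpack_half2(x)
    y_lo, y_hi = unpack_half2(y)
    return pack_half2(x_lo + y_lo, x_hi + y_hi)


a = pack_half2(1.0, 2.0)
b = pack_half2(0.5, 0.25)
print(hex(a))
print(unpack_half2(packed_add(a, b)))
```

One 32-bit register slot carries two results per instruction, which is where the double rate comes from.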
Lastly, before we can move on from the
new WGP that makes up the shader core,
AMD made a point of telling the public that
RDNA can also operate in a special mode
it calls Wave64. As we figured out earlier,
GPUs have choices when it comes to wave
size and how many clocks over which to
run the wave. In RDNA, Wave64 mode is a
64-wide wave that runs over two cycles on
the 32-wide SIMDs that make up the WGP.
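As a rough sketch of that arithmetic, a 64-wide wave on a 32-wide SIMD is simply two passes over the lanes. The function below is a hypothetical model that counts one cycle per pass. Real hardware scheduling is more involved than this.

```python
SIMD_WIDTH = 32  # lanes per SIMD, as described in the text


def execute_wave(op, lanes, simd_width=SIMD_WIDTH):
    """Apply op to every lane of a wave, one SIMD-width slice per cycle.

    Returns the per-lane results and the number of passes (cycles) taken.
    """
    results = []
    cycles = 0
    for start in range(0, len(lanes), simd_width):
        chunk = lanes[start:start + simd_width]
        results.extend(op(value) for value in chunk)
        cycles += 1
    return results, cycles


wave32 = list(range(32))
wave64 = list(range(64))

_, cycles32 = execute_wave(lambda v: v * 2, wave32)
out64, cycles64 = execute_wave(lambda v: v * 2, wave64)
print("Wave32 cycles:", cycles32)
print("Wave64 cycles:", cycles64)
```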

RDNA ALSO CHANGES HOW
WORK IS DISTRIBUTED ACROSS A
COLLECTION OF THOSE SIMD UNITS