Custom PC - UK (2020-01)

The SM, or streaming multiprocessor, is the processing heart of a Turing GPU

Each SM has a 96KB block of local storage that can be split either 64KB/32KB or 32KB/64KB between shared memory – used so that each CUDA core can quickly share data with the others in the collective vector unit – and a general L1 cache. The compiler decides what to do based on the data being used and how it’s shared between CUDA cores. No other vendor has this clever cache partitioning technology in its GPU.
The cache also has double the bandwidth compared with the similar L1 structure in Pascal’s TPC, and there’s lower latency to service a hit if the data the SM requests is already in the L1, ready to be returned.
Each 16-wide vector unit has its own thread scheduler that takes care of tracking and managing all of the threads of work being processed, along with a very large, 16,384-entry register file (RF) that’s banked to support efficient register access. The scheduler can dispatch up to 32 ready threads per clock to the underlying 16-wide machine, and that 16K RF means each thread can make use of up to 1,024 32-bit registers before the hardware has to cut down the number of threads that can run in parallel. That’s a huge amount of RF space for a modern GPU of any kind, with the total for a TU102 chip being a massive 18MB.
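To put that register budget in perspective, here’s a back-of-the-envelope sketch (plain C++, and the function is our own illustration – only the 16,384-entry and 1,024-register figures come from the description above) of how register usage per thread trades off against the number of threads that can stay resident:

```cpp
#include <cstdio>

// Illustrative only: how many threads can stay resident on one 16-wide
// vector unit if each thread claims a given number of 32-bit registers.
// The 16,384-entry register file and 1,024-register per-thread figure are
// the ones quoted above; everything else is made up for the example.
int max_resident_threads(int regs_per_thread) {
    const int rf_entries     = 16384;  // 32-bit registers per vector unit
    const int max_per_thread = 1024;   // per-thread ceiling quoted above
    if (regs_per_thread > max_per_thread) regs_per_thread = max_per_thread;
    return rf_entries / regs_per_thread;
}

int main() {
    // A thread hogging the full 1,024 registers leaves room for only 16
    // resident threads; a leaner 64-register thread allows 256 of them.
    printf("%d\n", max_resident_threads(1024)); // prints 16
    printf("%d\n", max_resident_threads(64));   // prints 256
}
```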

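Going back to that shared memory/L1 split for a moment: the split is chosen for you, but CUDA does let code hint at a preference per kernel through a function attribute. A minimal sketch – the kernel here is a throwaway placeholder; only the carveout attribute itself is the real API:

```cpp
#include <cuda_runtime.h>

// Throwaway kernel that stages data through shared memory.
__global__ void tile_kernel(float* out) {
    __shared__ float tile[1024];            // 4KB of shared memory
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main() {
    // Hint that this kernel would rather have the larger shared-memory
    // carveout; the runtime treats it as a preference, not a guarantee.
    cudaFuncSetAttribute(tile_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    float* out = nullptr;
    cudaMalloc(&out, 1024 * sizeof(float));
    tile_kernel<<<1, 1024>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```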
TURING CUDA CORES
For the first time in any high-performance GPU, each individual shader core, or CUDA core, on Turing can run both floating point and integer computation at the same time, and at full rate no less, by virtue of having two fully independent datapaths. Graphics code doesn’t tend to have a lot of integer operations mixed in with the much more common floating point stuff, but that mix is on the rise as a general trend, so Turing’s SM configuration is a future-looking bet.
On the floating point side, the most
common operation is a fused multiply-add
(FMA), but Turing has a rich instruction
set architecture (ISA) and supports quite
a bit more than just that fused 2-FLOP
instruction and its decomposed multiply or
add. Dual-rate 16-bit floating point is also
present in the smaller Turing processors, so
you don’t miss out on this core performance
improvement even on the cheaper parts.
Lots of graphics computation can be done
at lower arithmetic precision, and the little
Turings are now well positioned to exploit
that in a main SM datapath.
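To make the FMA talk concrete, here’s a tiny CUDA sketch (our own example, not anything from the article) using the single-precision fmaf() intrinsic and the packed-half __hfma2() intrinsic, the latter performing two FP16 multiply-adds in a single instruction – which is how the dual-rate FP16 path gets used:

```cpp
#include <cuda_fp16.h>

__global__ void fma_demo(const float* a, const float* b, const float* c,
                         float* d, const __half2* ha, const __half2* hb,
                         const __half2* hc, __half2* hd, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Single-precision fused multiply-add: 2 FLOPs in one instruction.
        d[i] = fmaf(a[i], b[i], c[i]);
        // Packed FP16: two half-precision FMAs per instruction, which is
        // where the doubled FP16 throughput comes from.
        hd[i] = __hfma2(ha[i], hb[i], hc[i]);
    }
}
```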

On the integer side, the Turing ISA is similarly rich, but the base throughput is simple one-op work such as adds or selects, so the real innovation is being able to run it entirely in parallel with the floating point datapath.
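As a rough illustration (again, our own toy example) of why that matters, consider the sort of kernel below: the integer index arithmetic and the float FMA are different instruction types, and on Turing they can be issued down separate datapaths rather than taking turns in one pipeline – though whether they actually overlap on any given cycle is up to the hardware scheduler.

```cpp
// Toy kernel mixing integer and floating point work. On Turing the integer
// index/compare instructions and the float FMA can go to separate
// datapaths; on older architectures they competed for the same issue slots.
__global__ void mixed_ops(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // integer pipe
    if (i < n) {
        int idx = (i * stride) & (n - 1);            // integer pipe (n assumed a power of two)
        float x = in[idx];
        out[i] = fmaf(x, 1.5f, 0.25f);               // floating point pipe
    }
}
```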

TURING TENSOR CORE
The two biggest headline features added to Turing over previous Nvidia architectures are Tensor cores and ray-tracing hardware. Starting with the Tensor
cores, these are for machine learning (ML)
code and, as far as we’re aware, the graphics
compiler for Turing never emits code that
runs on them.
So what do they do? They’re laid out in a fixed arrangement, spread out across the total TPC hardware. Remember the 16-wide main vector datapath? Two Tensor cores ride alongside each of those vector units and probably share some of that main datapath hardware in some way, although Nvidia doesn’t detail exactly how. Each one is able to run 64 full-rate 16-bit floating point fused multiply-adds (FP16 FMAs) per clock. Not only that, but it can run 8-bit integer multiply-adds at twice that rate, and 4-bit integer multiply-adds at 2x again. So 128 INT8 and 256 INT4 multiply-adds per core, per clock.
With 576 of those cores spread across
the entirety of the chip, that equates to
up to 113.8TFLOPS of Tensor processing
on the TU102. In comparison, for normal
floating point operations, it can only deliver
14.2TFLOPS for FP32 or 28.5TFLOPS for
FP16. This massive step up in performance
for these types of calculations means
Turing-based GPUs are in huge demand for
the ever-increasing world of AI computing.
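Developers don’t usually touch the Tensor cores directly, though; CUDA exposes them through warp-level matrix operations (the WMMA API), where a whole warp cooperates on one small matrix multiply-accumulate. A minimal sketch follows – the 16×16×16 tile size and layouts are just one of the supported configurations, and the kernel itself is our own illustration:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for 16x16 tiles. A and B are FP16 and the
// accumulator is FP32 – the mixed-precision mode the Tensor cores are built
// for. Launch with at least one full warp of 32 threads.
__global__ void tensor_tile(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);        // C = 0 for this sketch
    wmma::load_matrix_sync(a, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);        // the Tensor core operation
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```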

RAY TRACING
Ray tracing is a completely different method
for rendering aspects of a graphical scene,
so it can’t easily be accelerated by the
general-purpose processors that make
up the bulk of a GPU, and instead requires
some specialised hardware. Nvidia’s initial
implementation in Turing takes a couple
of the key baby steps, accelerating parallel
triangle and box testing, and the streaming
in of the scene hierarchy into which your
game traces rays.
In a nutshell, you aim a ray at a mesh
of triangles organised into a hierarchy of
bounding boxes, with a particular direction
and maximum depth specified for the ray.
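That bounding-box testing is exactly the kind of work the new hardware does in fixed function, many rays at a time. Written out in plain code – a software sketch of the standard ‘slab’ test, not a description of how Nvidia’s units are actually built – it looks something like this:

```cpp
struct Ray {
    float ox, oy, oz;   // origin
    float dx, dy, dz;   // direction
    float tmax;         // maximum distance the ray is allowed to travel
};

// Classic slab test: does the ray hit the axis-aligned box [lo, hi]
// somewhere between t = 0 and t = ray.tmax? The RT hardware evaluates tests
// like this (and ray/triangle tests) while walking the bounding-box
// hierarchy, so the shader cores don't have to.
bool ray_hits_box(const Ray& r, const float lo[3], const float hi[3]) {
    float tmin = 0.0f, tmax = r.tmax;
    const float o[3] = { r.ox, r.oy, r.oz };
    const float d[3] = { r.dx, r.dy, r.dz };
    for (int axis = 0; axis < 3; ++axis) {
        float inv = 1.0f / d[axis];            // assumes a non-zero component
        float t0 = (lo[axis] - o[axis]) * inv;
        float t1 = (hi[axis] - o[axis]) * inv;
        if (inv < 0.0f) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;         // slabs don't overlap: miss
    }
    return true;
}
```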
