Custom PC - UK (2020-01)

The SM, or streaming multiprocessor, is the processing heart of a Turing GPU

Each SM has a 96KB block of local storage that can be split either 64KB/32KB or 32KB/64KB between shared memory – used so that each CUDA core can quickly share data with the others in the collective vector unit – and a general L1 cache. The compiler decides what to do based on the data being used and how it’s shared between CUDA cores. No other vendor has this clever cache partitioning technology in its GPU.
The cache also has double the bandwidth compared with the similar L1 structure in Pascal’s TPC, and there’s lower latency to service a hit if the data the SM requests is already in the L1, ready to be returned.
Each 16-wide vector unit has its own thread scheduler that takes care of tracking and managing all of the threads of work being processed, along with a very large, 16,384-entry register file (RF) that’s banked to support efficient register access. The scheduler can dispatch up to 32 ready threads per clock to the underlying 16-wide machine, and that 16K RF means each thread can make use of up to 1,024 32-bit registers before the hardware has to cut down the number of threads that can run in parallel. That’s a huge amount of RF space for a modern GPU of any kind, with the total for a TU102 chip being a massive 18MB.
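To put that register budget in perspective, here’s a back-of-the-envelope sketch (plain C++, and the function is our own illustration – only the 16,384-entry and 1,024-register figures come from the description above) of how register usage per thread trades off against the number of threads that can stay resident:

```cpp
#include <cstdio>

// Illustrative only: how many threads can stay resident on one 16-wide
// vector unit if each thread claims a given number of 32-bit registers.
// The 16,384-entry register file and 1,024-register per-thread figure are
// the ones quoted above; everything else is made up for the example.
int max_resident_threads(int regs_per_thread) {
    const int rf_entries     = 16384;  // 32-bit registers per vector unit
    const int max_per_thread = 1024;   // per-thread ceiling quoted above
    if (regs_per_thread > max_per_thread) regs_per_thread = max_per_thread;
    return rf_entries / regs_per_thread;
}

int main() {
    // A thread hogging the full 1,024 registers leaves room for only 16
    // resident threads; a leaner 64-register thread allows 256 of them.
    printf("%d\n", max_resident_threads(1024)); // prints 16
    printf("%d\n", max_resident_threads(64));   // prints 256
}
```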

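Going back to that shared memory/L1 split for a moment: the split is chosen for you, but CUDA does let code hint at a preference per kernel through a function attribute. A minimal sketch – the kernel here is a throwaway placeholder; only the carveout attribute itself is the real API:

```cpp
#include <cuda_runtime.h>

// Throwaway kernel that stages data through shared memory.
__global__ void tile_kernel(float* out) {
    __shared__ float tile[1024];            // 4KB of shared memory
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main() {
    // Hint that this kernel would rather have the larger shared-memory
    // carveout; the runtime treats it as a preference, not a guarantee.
    cudaFuncSetAttribute(tile_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    float* out = nullptr;
    cudaMalloc(&out, 1024 * sizeof(float));
    tile_kernel<<<1, 1024>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```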
TURING CUDA CORES
For the first time in any high-performance GPU, each individual shader core, or CUDA core, on Turing can run both floating point and integer computation at the same time, and at full rate no less, by virtue of having two fully independent datapaths. Graphics code doesn’t tend to have a lot of integer operations mixed in with the much more common floating point stuff, but that mix is on the rise as a general trend, so Turing’s SM configuration is a future-looking bet.
On the floating point side, the most
common operation is a fused multiply-add
(FMA), but Turing has a rich instruction
set architecture (ISA) and supports quite
a bit more than just that fused 2-FLOP
instruction and its decomposed multiply or
add. Dual-rate 16-bit floating point is also
present in the smaller Turing processors, so
you don’t miss out on this core performance
improvement even on the cheaper parts.
Lots of graphics computation can be done
at lower arithmetic precision, and the little
Turings are now well positioned to exploit
that in a main SM datapath.
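To make the FMA talk concrete, here’s a tiny CUDA sketch (our own example, not anything from the article) using the single-precision fmaf() intrinsic and the packed-half __hfma2() intrinsic, the latter performing two FP16 multiply-adds in a single instruction – which is how the dual-rate FP16 path gets used:

```cpp
#include <cuda_fp16.h>

__global__ void fma_demo(const float* a, const float* b, const float* c,
                         float* d, const __half2* ha, const __half2* hb,
                         const __half2* hc, __half2* hd, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Single-precision fused multiply-add: 2 FLOPs in one instruction.
        d[i] = fmaf(a[i], b[i], c[i]);
        // Packed FP16: two half-precision FMAs per instruction, which is
        // where the doubled FP16 throughput comes from.
        hd[i] = __hfma2(ha[i], hb[i], hc[i]);
    }
}
```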

On the integer side, the Turing ISA is similarly rich, but the base throughput is simple one-op work such as adds or selects, so the real innovation is being able to run it entirely in parallel with the floating point datapath.
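As a rough illustration (again, our own toy example) of why that matters, consider the sort of kernel below: the integer index arithmetic and the float FMA are different instruction types, and on Turing they can be issued down separate datapaths rather than taking turns in one pipeline – though whether they actually overlap on any given cycle is up to the hardware scheduler.

```cpp
// Toy kernel mixing integer and floating point work. On Turing the integer
// index/compare instructions and the float FMA can go to separate
// datapaths; on older architectures they competed for the same issue slots.
__global__ void mixed_ops(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // integer pipe
    if (i < n) {
        int idx = (i * stride) & (n - 1);            // integer pipe (n assumed a power of two)
        float x = in[idx];
        out[i] = fmaf(x, 1.5f, 0.25f);               // floating point pipe
    }
}
```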

TURING TENSOR CORE
The two biggest headline features added to Turing over previous Nvidia architectures are Tensor cores and ray-tracing hardware. Starting with the Tensor
cores, these are for machine learning (ML)
code and, as far as we’re aware, the graphics
compiler for Turing never emits code that
runs on them.
So what do they do? They’re laid out in a fixed arrangement, spread out across the total TPC hardware. Remember the 16-wide main vector datapath? Two Tensor cores ride alongside each of those vector units and probably share some of that main datapath hardware in some way, although Nvidia doesn’t detail exactly how. Each one is able to run 64 full-rate 16-bit floating point fused multiply-adds (FP16 FMAs) per clock. Not only that, but it can run 8-bit integer multiply-adds at twice that rate, and 4-bit integer multiply-adds at 2x again. So 128 INT8 and 256 INT4 multiply-adds per core, per clock.
With 576 of those cores spread across
the entirety of the chip, that equates to
up to 113.8TFLOPS of Tensor processing
on the TU102. In comparison, for normal
floating point operations, it can only deliver
14.2TFLOPS for FP32 or 28.5TFLOPS for
FP16. This massive step up in performance
for these types of calculations means
Turing-based GPUs are in huge demand for
the ever-increasing world of AI computing.
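Developers don’t usually touch the Tensor cores directly, though; CUDA exposes them through warp-level matrix operations (the WMMA API), where a whole warp cooperates on one small matrix multiply-accumulate. A minimal sketch follows – the 16×16×16 tile size and layouts are just one of the supported configurations, and the kernel itself is our own illustration:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for 16x16 tiles. A and B are FP16 and the
// accumulator is FP32 – the mixed-precision mode the Tensor cores are built
// for. Launch with at least one full warp of 32 threads.
__global__ void tensor_tile(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);        // C = 0 for this sketch
    wmma::load_matrix_sync(a, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);        // the Tensor core operation
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```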

RAY TRACING
Ray tracing is a completely different method
for rendering aspects of a graphical scene,
so it can’t easily be accelerated by the
general-purpose processors that make
up the bulk of a GPU, and instead requires
some specialised hardware. Nvidia’s initial
implementation in Turing takes a couple
of the key baby steps, accelerating parallel
triangle and box testing, and the streaming
in of the scene hierarchy into which your
game traces rays.
In a nutshell, you aim a ray at a mesh
of triangles organised into a hierarchy of
bounding boxes, with a particular direction
and maximum depth specified for the ray.
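That bounding-box testing is exactly the kind of work the new hardware does in fixed function, many rays at a time. Written out in plain code – a software sketch of the standard ‘slab’ test, not a description of how Nvidia’s units are actually built – it looks something like this:

```cpp
struct Ray {
    float ox, oy, oz;   // origin
    float dx, dy, dz;   // direction
    float tmax;         // maximum distance the ray is allowed to travel
};

// Classic slab test: does the ray hit the axis-aligned box [lo, hi]
// somewhere between t = 0 and t = ray.tmax? The RT hardware evaluates tests
// like this (and ray/triangle tests) while walking the bounding-box
// hierarchy, so the shader cores don't have to.
bool ray_hits_box(const Ray& r, const float lo[3], const float hi[3]) {
    float tmin = 0.0f, tmax = r.tmax;
    const float o[3] = { r.ox, r.oy, r.oz };
    const float d[3] = { r.dx, r.dy, r.dz };
    for (int axis = 0; axis < 3; ++axis) {
        float inv = 1.0f / d[axis];            // assumes a non-zero component
        float t0 = (lo[axis] - o[axis]) * inv;
        float t1 = (hi[axis] - o[axis]) * inv;
        if (inv < 0.0f) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;         // slabs don't overlap: miss
    }
    return true;
}
```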
