Generation of Random Numbers on GPUs
The last topic in this chapter is the use of devices for massively parallel operations, namely
General-Purpose Graphics Processing Units (GPGPUs, or simply GPUs). To use an
Nvidia GPU, we need to have CUDA (Compute Unified Device Architecture, cf.
https://developer.nvidia.com) installed. An easy way to harness the power of Nvidia GPUs
is to use NumbaPro, a performance library by Continuum Analytics that dynamically
compiles Python code for the GPU (or a multicore CPU).
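As a brief, illustrative sketch of that idea (not a transcript from this chapter), the following shows how NumbaPro's vectorize decorator can compile a simple element-wise function for the GPU. The function name gpu_add and the array sizes are made up for this example, and the 'gpu' target name follows NumbaPro-era usage (the later open-source Numba package calls the corresponding target 'cuda'):
import numpy as np
from numbapro import vectorize

@vectorize(['float64(float64, float64)'], target='gpu')
def gpu_add(a, b):
    # compiled into a CUDA kernel and applied element-wise on the GPU
    return a + b

x = np.random.standard_normal(1000000)
y = np.random.standard_normal(1000000)
z = gpu_add(x, y)  # inputs are copied to the GPU, added, results copied back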
The scope of this chapter does not allow us to go into the details of GPU programming with
Python. However, there is one financial field that can benefit strongly from the use
of a GPU: Monte Carlo simulation and, in particular, (pseudo)random number generation.
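To see why this matters, consider a minimal, NumPy-only sketch of a Monte Carlo estimator for a European call option under geometric Brownian motion; all parameter values are purely illustrative:
import numpy as np

S0, K, T, r, sigma = 100., 105., 1.0, 0.05, 0.2  # illustrative option parameters
I = 1000000  # number of simulated paths

z = np.random.standard_normal(I)  # drawing the variates is typically the costly step
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
C0 = np.exp(-r * T) * np.mean(np.maximum(ST - K, 0))  # Monte Carlo estimate
Every simulated path requires at least one standard normal draw, so speeding up the generation of the variates speeds up the whole simulation.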
In what follows, we use the native CUDA library curand to generate random numbers on
the GPU:
In [97]: import numpy as np
         from numbapro.cudalib import curand
As the benchmark case, we define a function, using NumPy, that delivers a two-dimensional
array of standard normally distributed pseudorandom numbers:
In [98]: def get_randoms(x, y):
             rand = np.random.standard_normal((x, y))
             return rand
First, let’s check if it works:
In [99]: get_randoms(2, 2)
Out[99]: array([[-0.30561007,  1.33124048],
                [-0.04382143,  2.31276888]])
Now the function for the Nvidia GPU:
In [100]: def get_cuda_randoms(x, y):
              rand = np.empty(x * y, dtype=np.float64)
              # rand serves as a container for the random numbers
              # the curand routines fill only 1-dimensional arrays
              prng = curand.PRNG(rndtype=curand.PRNG.XORWOW)
              # the rndtype argument selects the random number algorithm
              prng.normal(rand, 0, 1)  # fill the container with standard normals
              rand = rand.reshape((x, y))
              # to be "fair", reshape rand to 2 dimensions
              return rand
Again, a brief check of the functionality:
In [101]: get_cuda_randoms(2, 2)
Out[101]: array([[ 1.07102161,  0.70846868],
                 [ 0.89437398, -0.86693007]])
And a first comparison of the performance:
In [102]: %timeit a = get_randoms(1000, 1000)
          10 loops, best of 3: 72 ms per loop
In [103]: %timeit a = get_cuda_randoms(1000, 1000)
          100 loops, best of 3: 14.8 ms per loop
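For this problem size, the GPU-based version is roughly five times faster in this run. Note that get_cuda_randoms still allocates the result array on the host and lets the library copy the numbers back from the GPU. If the random numbers are consumed by further computations on the device, this transfer can be avoided. The following sketch assumes NumbaPro's cuda.device_array helper and that PRNG.normal accepts device arrays; the function name get_cuda_randoms_device is made up here, and the code is meant as an illustration rather than measured output:
import numpy as np
from numbapro import cuda
from numbapro.cudalib import curand

def get_cuda_randoms_device(x, y):
    d_rand = cuda.device_array(x * y, dtype=np.float64)  # array lives on the GPU
    prng = curand.PRNG(rndtype=curand.PRNG.XORWOW)
    prng.normal(d_rand, 0, 1)  # fill the device array in place
    return d_rand  # call copy_to_host() only when the host actually needs the data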
Now, a more systematic routine to compare the performance:
In [104]: import time as t
          step = 1000
          def time_comparsion(factor):
              cuda_times = list()
              cpu_times = list()
              for j in range(1, 10002, step):
                  i = j * factor