Using PyCUDA's ElementWiseKernel for performing pointwise computations
We will now see how to program our own point-wise (or equivalently, element-wise) operations directly onto our GPU with the help of PyCUDA's ElementWiseKernel function. This is where our prior knowledge of C/C++ programming will become useful—we'll have to write a little bit of inline code in CUDA C, which is compiled externally by NVIDIA's nvcc compiler and then launched at runtime by our code via PyCUDA.
We use the term kernel quite a bit in this text; by kernel, we always mean a function that is launched directly onto the GPU by CUDA. We will use several functions from PyCUDA that generate templates and design patterns for different types of kernels, easing our transition into GPU programming.
Let's dive right in; we're going to start by explicitly rewriting the code to multiply each element of a gpuarray object by 2 in CUDA-C; we will use the ElementwiseKernel function from PyCUDA to generate our code. You should try typing the following code directly into an IPython console. (The less adventurous can just download this from this text's Git repository, which has the filename simple_element_kernel_example0.py):
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time
from pycuda.elementwise import ElementwiseKernel
host_data = np.float32( np.random.random(50000000) )
gpu_2x_ker = ElementwiseKernel(
"float *in, float *out",
"out[i] = 2*in[i];",
"gpu_2x_ker")
Let's take a look at how this is set up; this is, of course, several lines of inline C. We first set the input and output variables in the first line ( "float *in, float *out" ), which will generally be in the form of C pointers to allocated memory on the GPU. In the second line, we define our element-wise operation with "out[i] = 2*in[i];", which will multiply each point in in by two and place this in the corresponding index of out.
Note that PyCUDA automatically sets up the integer index i for us. When we use i as our index, ElementwiseKernel will automatically parallelize our calculation over i among the many cores in our GPU. Finally, we give our piece of code its internal CUDA C kernel name ( "gpu_2x_ker" ). Since this refers to CUDA C's namespace and not Python's, it's fine (and also convenient) to give this the same name as in Python.
Now, let's do a speed comparison:
def speedcomparison():
t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()
print 'total time to compute on CPU: %f' % (t2 - t1)
device_data = gpuarray.to_gpu(host_data)
# allocate memory for output
device_data_2x = gpuarray.empty_like(device_data)
t1 = time()
gpu_2x_ker(device_data, device_data_2x)
t2 = time()
from_device = device_data_2x.get()
print 'total time to compute on GPU: %f' % (t2 - t1)
print 'Is the host computation the same as the GPU computation? : {}'.format(np.allclose(from_device, host_data_2x) )
if __name__ == '__main__':
speedcomparison()
Now, let's run this program:
Whoa! That doesn't look good. Let's run the speedcomparison() function a few times from IPython:
As we can see, the speed increases dramatically after the first time we use a given GPU function. Again, as with the prior example, this is because PyCUDA compiles our inline CUDA C code the first time a given GPU kernel function is called using the nvcc compiler. After the code is compiled, then it is cached and re-used for the remainder of a given Python session.
Now, let's cover something else important before we move on, which is very subtle. The little kernel function we defined operates on C float pointers; this means that we will have to allocate some empty memory on the GPU that is pointed to by the out variable. Take a look at this portion of code again from the speedcomparison() function:
device_data = gpuarray.to_gpu(host_data)
# allocate memory for output
device_data_2x = gpuarray.empty_like(device_data)
As we did before, we send a NumPy array over to the GPU (host_data) via the gpuarray.to_gpu function, which automatically allocates data onto the GPU and copies it over from the CPU space. We will plug this into the in part of our kernel function. In the next line, we allocate empty memory on the GPU with the gpuarray.empty_like function. This acts as a plain malloc in C, allocating an array of the same size and data type as device_data, but without copying anything. We can now use this for the out part of our kernel function. We now look at the next line in speedcomparison() to see how to launch our kernel function onto the GPU (ignoring the lines we use for timing):
gpu_2x_ker(device_data, device_data_2x)
Again, the variables we set correspond directly to the first line we defined with ElementwiseKernel (here being, "float *in, float *out").