GPU Computing with OpenCL

Hi, today I have something very nice for you: GPU computing with OpenCL.

Abstract: Have you ever wanted to sum 10 million random numbers stored in a C array? Did you know that on a GPU this can be roughly 20,000 times faster than on a CPU? This post is about GPU computing, and all examples are written in C using OpenCL 1.2+. A basic understanding of C is necessary to follow along. From a technical standpoint, the goal of each example is to compute the same thing on both the CPU and the GPU, compare the two results, and discuss all aspects of the difference, so you can expect many of the following examples in all the mentioned flavours. All code examples are available at https://github.com/sowson/gpucomp. After these simple examples, which should make you comfortable with OpenCL, there is another key project of the author at https://github.com/sowson/darknet. It is a fork of https://github.com/pjreddie/darknet, a very successful implementation of Convolutional Neural Networks (CNNs) with the famous YOLOv1, YOLOv2 and YOLOv3 models, including YOLOv3-SPP, and it is quite unique, with more than 15.4k stars and 9.4k forks on GitHub. That last-mentioned project is written in CUDA, not OpenCL, and it is faster than the OpenCL implementation I made, but the OpenCL version runs not only on NVIDIA but also on Intel, AMD and ARM-based platforms. Some computation and training examples will be presented on CIFAR-10, an industry-standard data set for validating CNN engines. This post also covers some aspects of image classification on images and video files; Gartner's publications for business make it very clear that image classification on images and video streams is now one of the hottest research disciplines in artificial intelligence on the planet.

Introduction: Have you ever wanted to accelerate a CPU-bound program written in C by, for example, 20,000 times? Did you know that Darknet, a Convolutional Neural Network (CNN) engine written in C that leverages CUDA to use the GPU, has more than 15.4k stars and more than 9.4k forks on GitHub? Did you know that GPU and FPGA accelerators, together with the OpenCL library, are ready to make this happen? Are you looking for a good study path to become an artificial intelligence specialist who can speed up detection algorithms from minutes to nanoseconds? If you answered yes to any of these questions, this post is definitely for you! Modern research and development units in many industries need the fastest possible computation for decision making and for improving their business with artificial intelligence. Gartner researchers, who provide predictions and summaries for businesses of many kinds, say that deep learning on CNNs is the hottest subject for researchers around the globe, and all of this computation can be done on GPUs instead of CPUs. Many robotic, autonomous, agent-based solutions need GPU computing to accelerate their computations every day. Airlines, car manufacturers, the entertainment industry and even the marketing units of many companies are looking for AI engineers who can write efficient algorithms to improve decision making. If you want to become one of them, or want to hire one of them, this post is also for you!

What I expect from you is basic knowledge of the C language; thanks to CLion, the great IDE for C programmers from JetBrains® that runs on Apple® macOS, GNU/Linux and Microsoft® Windows, you will be able to follow the examples presented in this post easily. All examples were tested on the following computers. The first is a PC workstation running GNU/Linux Ubuntu 18.04, accelerated by an Intel® i7-5960X CPU and two NVIDIA® Titan RTX GPU cards. The second is an Apple® MacBook Pro 2018 13’’ with an Intel® i7-8559U CPU, an Intel® Iris Plus Graphics 655 GPU and a Sonnet Breakaway Puck AMD® Radeon 570 eGPU. The third environment is an Asus® Tinker Board S, a small form-factor single-board computer with a Mali-T760 GPU. All of them support at least OpenCL 1.2. Why do I believe OpenCL 1.2 and the C language are a perfect combination? First, because all the presented devices and test environments support this standard and this programming language. They may seem old, but they are very powerful and can run very fast code, code that can be even faster on the GPU than on the CPU, and this post is all about that. The CPUs, GPUs and computers listed above are generally available in the mainstream consumer market; nothing here is reserved for big data centres or supercomputing laboratories. The author believes this is an important aspect as well: to bring the power of computing, algorithms, deep learning and Convolutional Neural Networks (CNNs) to researchers with a limited budget.
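Before we dive in, it is worth checking which OpenCL devices your own machine exposes. Below is a minimal, self-contained sketch (my illustration for orientation, not part of the gpucomp repository) that lists every platform and device together with the OpenCL version it supports:

// ocl_devices.c: list OpenCL platforms, devices and supported versions.
// A minimal sketch for orientation, not part of the gpucomp repository.
// Build on GNU/Linux: gcc ocl_devices.c -lOpenCL
// Build on macOS:     clang ocl_devices.c -framework OpenCL
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                       8, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; ++d) {
            char name[256], version[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_VERSION,
                            sizeof(version), version, NULL);
            printf("platform %u, device %u: %s (%s)\n", p, d, name, version);
        }
    }
    return 0;
}

On each of the machines mentioned above, the reported device version should read "OpenCL 1.2" or higher.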

First Impression: Let me show you a comparison of two simple implementations that compute the sum of all elements of an array of 10 million random integers. I believe this example will show you the power of GPU computing and give you a first impression of it. All examples were well tested; the author wants to give you a good foundation of examples to teach you GPU computing. Now we can answer the question from the abstract about summing 10 million random numbers: how do you do it in C on the CPU, and how do you accelerate it on the GPU? Let’s start with the first nice and clean C code example.

//CPU (Intel i7-5960X)
// N and _gen_randoms (the array of N random integers)
// are defined elsewhere in the gpucomp example sources.
int sum()
{
    int sum = 0;

    const int* set = _gen_randoms;

    // Plain sequential accumulation over all N elements.
    int i = 0;
    for(i = 0; i < N; ++i) {
        sum += set[i];
    }

    return sum;
}

//GPU (NVIDIA Titan RTX)
// Lock-free atomic addition built from atom_xchg: the accumulator is
// swapped out, the local value is added, and the result is swapped back
// in; if another work-item deposited a value in between, the leftover is
// picked up and the loop repeats until the exchange comes back clean.
inline void sum(
__global int* a,
int v)
{
    int s = v;
    int n = 0;
    int o = 0;
    
    do {
        n = s + atom_xchg(a, o);
        s = o + atom_xchg(a, n);
    }
    while (s != o);
}

__kernel void sum_kernel(
__global int *set,
__global int* out)
{
    // Compute this work-item's global index from its group and local ids.
    int i = (get_group_id(0) +
             get_group_id(1) *
             get_num_groups(0))
           * get_local_size(0) +
             get_local_id(0);

    // Each work-item adds one element into the shared accumulator.
    sum(out, set[i]);
}
RESULT
EX01: SUM INT
SUM of 10 000 000 random numbers:
Compare 1:
Exec of sum_cpu took 20 000 ticks
with result 25 002 358.
Exec of sum_gpu took 0 ticks
with result 25 002 358.

When you look at the above example, you may think it is rather little code, and you are basically right. All the OpenCL plumbing, such as the context, the command queue and the kernel compilation, is left out; if you go to the examples site at https://github.com/sowson/gpucomp you will find all the missing pieces. My intention here is to give you the essence of the OpenCL code: first the pure C code for the CPU, then the pure kernel code for the GPU. Be aware that the GPU code uses atomic functions that are not part of some OpenCL implementations; for example, the Mali GPU implementation you may use on the Asus Tinker Board S does not provide these atomic extensions. That is not a blocker for using OpenCL for Convolutional Neural Networks in general.
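For readers who want to see the omitted plumbing in one place, the sketch below condenses the usual host-side sequence: pick a device, create a context and a command queue, build the program from the kernel source string, and launch. It is a simplification for readability, not the exact gpucomp code; sum_kernel_source is an assumed name for the string holding the kernel text, and error handling is omitted.

// Condensed host-side OpenCL boilerplate: context, queue, program, launch.
// Simplified sketch, not the exact gpucomp code; error checks omitted.
cl_int err;
cl_platform_id platform;
cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

// Compile the kernel that ships as a C string (assumed name).
cl_program prog = clCreateProgramWithSource(ctx, 1, &sum_kernel_source,
                                            NULL, &err);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(prog, "sum_kernel", &err);

// Move the data to the device, bind arguments, run, read the result back.
int zero = 0;
cl_mem set_g = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              N * sizeof(int), (void*)_gen_randoms, &err);
cl_mem out_g = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              sizeof(int), &zero, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &set_g);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_g);

size_t global = N;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

int result = 0;
clEnqueueReadBuffer(queue, out_g, CL_TRUE, 0, sizeof(int),
                    &result, 0, NULL, NULL);

With that plumbing in mind, the next example below is a more realistic computation: a small neural network evaluated on both the CPU and the GPU.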

// CPU Neural Network Computation
void ex06_net_cpu() {
    network net = *ex06_net;
    int i;
    // Fill the input layer with random values from 1 to 4.
    layer I = net.L[0];
    for(i = 0; i < I.n; ++i)
    {
        I.N[i] = (float)(rand() % 4 + 1);
    }
    clock_t t = ex06_benchmark_start();
    // Forward pass: propagate from layer 1 to the output layer.
    for(i = 1; i < net.l; ++i)
    {
        layer L = net.L[i];     // current layer
        layer B = net.L[i-1];   // previous layer feeding into it
        int j;
        int w = 0;
        for(j = 0; j < L.n; ++j) {
            // Weighted sum of the previous layer's outputs for neuron j.
            float sum = .0f;
            int k;
            for(k = 0; k < L.b; ++k, ++w) {
                sum += B.N[k] * L.W[w];
            }
            // Apply the layer's activation function.
            L.N[j] = L.A(sum);
        }
    }
    layer O = net.L[net.l - 1];
    ex06_benchmark_stop(t, "net_cpu");
    printf(" with result %.4f.\n", O.N[0]);
}
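The network and layer types used above come from the example sources, not from this post. To keep the listing readable on its own, here is a hypothetical sketch of the fields the code relies on; the actual definitions in gpucomp may differ:

// Hypothetical sketch of the types used by ex06; the real gpucomp
// definitions may differ. Ng and Wg wrap the device-side (GPU) buffers.
typedef struct { cl_mem org; } gpu_buf;   // assumed wrapper for a GPU buffer

typedef struct layer {
    int n;               // number of neurons in this layer
    int b;               // number of neurons in the previous layer
    float* N;            // neuron outputs (host side)
    float* W;            // weights, n * b floats, one row per neuron
    float (*A)(float);   // activation function used on the CPU path
    int a;               // activation id passed to the kernel (LINEAR, TANH)
    gpu_buf Ng, Wg;      // device-side copies of N and W
} layer;

typedef struct network {
    int l;       // number of layers
    layer* L;    // the layers; L[0] is the input layer
} network;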

// GPU Kernel for Neural Network Computation
// b: size of the previous layer; B: previous layer outputs;
// W: weight matrix (one row of b weights per neuron);
// N: this layer's outputs; a: activation id.
static const char* const ex06_net_kernel_source
= CONVERT_KERNEL_TO_STRING(
        typedef enum {
            LINEAR,
            TANH
        } activation;

        float activation_linear(float x);
        float activation_tanh(float x);

        float activation_linear(float x)
        {return x;}
        float activation_tanh(float x)
        {return (2.f/(1 + exp(-2.f*x)) - 1);}

        __kernel void net_kernel(
        int b,
        __global float *B,
        __global float* W,
        __global float* N,
        int a)
        {
            // One work-item per neuron j of the current layer.
            int j =
            (get_group_id(0) +
            get_group_id(1) * 
            get_num_groups(0)) * 
            get_local_size(0) + 
            get_local_id(0);

            // Start at neuron j's row of the weight matrix, matching
            // the CPU version's w = j * b + k layout.
            int w = j * b;
            int k = 0;

            N[j] = 0.f;

            for(k = 0; k < b; ++k, ++w) {
                N[j] += B[k] * W[w];
            }

            N[j] = a == LINEAR
            ? activation_linear(N[j]) :
                   a == TANH
            ? activation_tanh(N[j])   :
                   0;
        }
);
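The CONVERT_KERNEL_TO_STRING macro embeds the kernel source directly in the C file. The usual way to do this is one line of preprocessor stringification; the following is my sketch of that trick, and the repository's version may differ slightly:

// Turns the kernel body into a C string literal at compile time.
// Variadic, so commas at the top level of the source survive intact.
// The usual stringification trick; gpucomp's version may differ.
#define CONVERT_KERNEL_TO_STRING(...) #__VA_ARGS__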

// GPU Neural Network Computation
void ex06_net_gpu()
{
    network net = *ex06_net;
    int i;
    // Fill the input layer the same way as in the CPU run.
    layer I = net.L[0];
    for(i = 0; i < I.n; ++i)
    {
        I.N[i] = (float)(rand() % 4 + 1);
    }
    // Copy the input layer to the device.
    opencl_push_array(I.Ng, I.N);
    clock_t t = ex06_benchmark_start();
    // Forward pass: one kernel launch per layer, one work-item per neuron.
    for(i = 1; i < net.l; ++i)
    {
        layer L = net.L[i];
        layer B = net.L[i-1];
        dim2 dimGrid = opencl_gridsize(L.n);
        // 10 = five (pointer, size) argument pairs for the kernel.
        opencl_kernel(
        ex06_net_kernel[0], dimGrid, 10,
        &L.b, sizeof(cl_int),
        &B.Ng.org, sizeof(cl_mem),
        &L.Wg.org, sizeof(cl_mem),
        &L.Ng.org, sizeof(cl_mem),
        &L.a, sizeof(cl_int)
        );
    }
    layer O = net.L[net.l - 1];
    ex06_benchmark_stop(t, "net_gpu");
    // Read the output layer back from the device.
    opencl_pull_array(O.Ng, O.N);
    printf(" with result %.4f.\n", O.N[0]);
}
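The opencl_kernel helper hides the argument binding and the launch. Judging only from the call site above (a kernel, a grid, an argument count, then pointer/size pairs), a minimal version might look like the sketch below; the queue variable and the dim2 fields are my assumptions, and the real gpucomp implementation may differ:

// Minimal sketch of a varargs launch helper, reconstructed from the call
// site above; not the exact gpucomp implementation.
#include <stdarg.h>

void opencl_kernel(cl_kernel kernel, dim2 grid, int argc, ...)
{
    va_list args;
    va_start(args, argc);
    // argc counts the varargs, which arrive as (pointer, size) pairs.
    for (int i = 0; i < argc / 2; ++i) {
        void*  value = va_arg(args, void*);
        size_t size  = va_arg(args, size_t);
        clSetKernelArg(kernel, (cl_uint)i, size, value);
    }
    va_end(args);

    // opencl_queue: the globally initialized command queue (assumed name);
    // grid.x and grid.y are assumed to be the global work dimensions.
    size_t global[2] = { (size_t)grid.x, (size_t)grid.y };
    clEnqueueNDRangeKernel(opencl_queue, kernel, 2, NULL,
                           global, NULL, 0, NULL, NULL);
}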

The example above shows how easily you can model a neural network of almost any kind, for example a perceptron or other architectures. Note that I did not give you the result of this computation; please run the benchmarks on your own. With this post I would like to encourage you to experiment more with GPU computing. My journey started with https://github.com/sowson/darknet, and that repository already contains usage examples, so I will add only two nice CIFAR-10 training results below.

[Figure: CIFAR-10 training with the CUDA version of Darknet]

[Figure: CIFAR-10 training with the OpenCL version of Darknet]

Quite similar, aren't they? There are some differences: OpenCL is slower than CUDA, but the shape of the learning curve looks very similar. There are a few more things you may check on your own, such as validating the computation results… I am leaving you with this exercise… Enjoy!
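If you want to reproduce such a training run yourself, the upstream Darknet documentation describes a CIFAR-10 classifier recipe; with the data set unpacked under data/cifar, training comes down to a single command (the cfg file names follow the upstream examples):

./darknet classifier train cfg/cifar.data cfg/cifar_small.cfg

The OpenCL fork at https://github.com/sowson/darknet builds and runs the same way, just on an OpenCL device instead of a CUDA one.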

p ;).
