
2.4.1 Introduction

In order to make an application run on GPUs, one has to use either the CUDA programming language or the directive-based OpenACC approach.
The main idea is that you start from a program that runs in the normal way on the host CPU(s), but too slowly.

To speed the program up, the GPUs need to be brought into action. This is accomplished by adding CUDA C++ code
(or even CUDA Fortran, in case the PGI compiler is used) or by inserting #pragma acc or !$acc directives to denote the code regions eligible for GPU execution.

2.4.2 CUDA approach

Here is an example of a CUDA program that computes the classic AXPY vector operation in double precision, i.e. a * x + y -> y. The program ( daxpy.cu ) looks as follows :

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void daxpy(int n, double a,
                      const double *x,
                      double *y)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
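  // grid-stride loop: each thread handles elements tid, tid+stride, tid+2*stride, ...,
  // so the kernel also works when the grid does not cover all n elements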
  while (tid < n) {
    y[tid] += a * x[tid];
    tid += stride;
  }
}

double sum_up(int n, const double *y)
{
  double sum = 0;
  for (int j=0; j<n; ++j) sum += y[j];
  return sum;
}

int main()
{
  int n = 1<<27; // = 128M

  // allocate Host arrays
  double *h_x = (double *)malloc(sizeof(*h_x)*n); // x
  double *h_y = (double *)malloc(sizeof(*h_y)*n); // y

  // allocate Device arrays
  double *d_x, *d_y;
  cudaMalloc((void **)&d_x, sizeof(*d_x)*n); // x
  cudaMalloc((void **)&d_y, sizeof(*d_y)*n); // y

  double a = 2.0;
  int vlen = 256; // == blockdim.x
  dim3 blockdim = dim3(vlen,1,1);
  dim3 griddim = dim3((n+blockdim.x-1)/blockdim.x,1,1);

  fprintf(stderr,
          "n=%d : vlen=%d : griddim = %d %d %d : blockdim.x = %d %d %d\n",
          n,vlen,griddim.x,griddim.y,griddim.z,blockdim.x,blockdim.y,blockdim.z);

  // initialize x & y on Host

  for (int j=0; j<n; ++j) {
    h_x[j] =  j;
    h_y[j] = -j;
  }

  // transfer them to Device

  cudaMemcpy(d_x,h_x,sizeof(*h_x)*n,cudaMemcpyHostToDevice);
  cudaMemcpy(d_y,h_y,sizeof(*h_y)*n,cudaMemcpyHostToDevice);

  // perform DAXPY on GPU : a * x + y -> y

  daxpy<<<griddim,blockdim>>>(n, a, d_x, d_y);

  // copy updated "y" back to Host

  cudaMemcpy(h_y,d_y,sizeof(*h_y)*n,cudaMemcpyDeviceToHost);

  // check & display results

  double dn = n;
  double sum = sum_up(n,h_y), check_sum = (dn * (dn - 1))/2;
  double diff = sum-check_sum;
  fprintf(stderr,"daxpy(n=%d): sum=%g : check_sum=%g : diff=%g\n",
          n,sum,check_sum,diff);

  // free memories

  free(h_x); free(h_y);
  cudaFree(d_x); cudaFree(d_y);

  return 0;
}
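
For brevity, the CUDA API return codes are ignored in the example above. In real applications it is good practice to check them; a minimal sketch of such a check – the CUDA_CHECK macro below is our own helper, not part of the CUDA toolkit – could look like this :

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// abort with a readable message whenever a CUDA runtime call fails
#define CUDA_CHECK(call) do {                                      \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess) {                                      \
      fprintf(stderr, "CUDA error at %s:%d : %s\n",                \
              __FILE__, __LINE__, cudaGetErrorString(err));        \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

// usage, e.g.:
//   CUDA_CHECK( cudaMalloc((void **)&d_x, sizeof(*d_x)*n) );
//   CUDA_CHECK( cudaMemcpy(d_x, h_x, sizeof(*h_x)*n, cudaMemcpyHostToDevice) );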
 

We then proceed by using the following module environment for compilation & linking :

module purge
module load gcc/4.8.2 cuda/6.0
nvcc -O3 --restrict -arch=compute_35 -code=sm_35 daxpy.cu -o daxpy.x.gnu

It is paramount to supply the options " -arch=compute_35 -code=sm_35 " in order to generate correct and optimal code for the attached GPUs (Tesla K40, sm_35) on the Taito-GPU system.
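
If you are unsure which compute capability the attached GPUs actually report, you can query it at run time with the CUDA runtime API. A small stand-alone sketch (the file name devquery.cu is just our example), compiled with the same nvcc invocation as above :

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  for (int d = 0; d < ndev; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    // a Tesla K40 reports compute capability 3.5 -> use -arch=compute_35 -code=sm_35
    printf("device %d : %s : compute capability %d.%d\n",
           d, prop.name, prop.major, prop.minor);
  }
  return 0;
}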

Please note that we are using NVIDIA's CUDA C/C++ compiler front-end "nvcc". Since the module "gcc/4.8.2" was also loaded, nvcc ends up invoking the gcc/g++ back-end compiler from 4.8.2 – not the one on the system default path.

If you would like to use the PGI environment instead, the following set of commands does the job :

module purge
module load pgi/14.4 cuda/6.0
nvcc -O3 --restrict -arch=compute_35 -code=sm_35 daxpy.cu -o daxpy.x.pgi

Note, however, that here the gcc/g++ compiler comes from the underlying Linux system – i.e. not from 4.8.2 – since the module environment at CSC does not allow two different compilers to co-exist. To figure out which libraries get linked, run the ldd command against your GPU executable. With the PGI compiler environment you should see :

% ldd daxpy.x.pgi
        linux-vdso.so.1 =>  (0x00007fff50361000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f46b7c39000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f46b7a1b000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f46b7817000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f46b7511000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f46b728c000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f46b7076000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f46b6ce2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f46b7e52000)

And with "gcc/4.8.2" you should see :

% ldd daxpy.x.gnu
        linux-vdso.so.1 =>  (0x00007fffb1bff000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f753ee78000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f753ec5a000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f753ea56000)
        libstdc++.so.6 => /appl/opt/gcc/4.8.2/lib64/libstdc++.so.6 (0x00007f753e74d000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f753e4c8000)
        libgcc_s.so.1 => /appl/opt/gcc/4.8.2/lib64/libgcc_s.so.1 (0x00007f753e2b2000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f753df1e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f753f091000)

2.4.3 OpenACC approach

It is often easier to use OpenACC directives than the CUDA language, but then you need the PGI compiler to interpret your OpenACC directives.
These directives can be embedded into C/C++ or Fortran codes. OpenACC closely resembles OpenMP directives and is often a good starting point for
moving parts of your code onto GPU accelerators. One beauty of this approach is that you can move your application to accelerators gradually, and there is
no need to learn CUDA.

For CUDA programmers there is also good news : your existing CUDA code interoperates with OpenACC. So there is no need to throw away
well-functioning CUDA kernels – just invoke them (usually via a wrapper) from within an OpenACC region, as sketched below. It is said that CUDA codes still perform 10-20% faster than
the equivalent OpenACC constructs, but this gap is clearly narrowing with newer releases of OpenACC compilers.
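
The usual mechanism for this is the OpenACC host_data use_device construct, which hands the device addresses of OpenACC-managed arrays to a CUDA wrapper. A minimal sketch, assuming the daxpy kernel from the CUDA example above has been compiled with nvcc into a wrapper called daxpy_wrapper (our own name, not a library routine) :

// daxpy_wrapper.cu : compiled with nvcc; the __global__ daxpy kernel from
// daxpy.cu is assumed to be defined (or declared) in this file
extern "C" void daxpy_wrapper(int n, double a, const double *d_x, double *d_y)
{
  daxpy<<<(n + 255)/256, 256>>>(n, a, d_x, d_y);
  cudaDeviceSynchronize();
}

/* caller.c : compiled with pgcc -acc */
extern void daxpy_wrapper(int n, double a, const double *x, double *y);

void call_daxpy(int n, double a, const double *x, double *y)
{
#pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
#pragma acc host_data use_device(x,y)
    {
      daxpy_wrapper(n, a, x, y);  /* x and y are device addresses here */
    }
  }
}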

The corresponding DAXPY code written in OpenACC is rather simple. We provide both C/C++ and Fortran versions.

2.4.3.1 C/C++ version of OpenACC DAXPY

Here is the C/C++ version of our OpenACC DAXPY ( daxpy.c ) :

#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

#define RESTRICT restrict

double init(int n,
                   double *RESTRICT x,
                   double *RESTRICT y)
{
#pragma acc parallel present(x,y)
  {
#pragma acc loop
    for (int j=0; j<n; ++j) x[j] = j;
#pragma acc loop
    for (int j=0; j<n; ++j) y[j] = -j;
  }
  return 2.0;
}

void daxpy(int n, double a,
           const double *RESTRICT x,
           double *RESTRICT y)
{
#pragma acc parallel loop present(x,y)
  for (int j=0; j<n; ++j)
    y[j] += a * x[j];
}

double sum_up(int n, const double *y)
{
  double sum = 0;
#pragma acc parallel loop reduction(+:sum)
  for (int j=0; j<n; ++j) sum += y[j];
  return sum;
}

int main()
{
  int n = 1<<27; // = 128M
  double x[n], y[n];

#ifdef _OPENACC
  // It's recommended to perform one-off initialization of GPU-devices before any OpenACC regions
  acc_init(acc_get_device_type()) ; // from openacc.h
#endif

#pragma acc data create(x[0:n],y[0:n])
  {
    // initialize vectors -- directly on GPU device
    double a = init(n,x,y);

    // call daxpy with 128M elements
    daxpy(n, a, x, y);

    double dn = n;
    double sum = sum_up(n,y), check_sum = (dn * (dn - 1))/2;
    double diff = sum-check_sum;
    fprintf(stderr,"daxpy(n=%d): sum=%g : check_sum=%g : diff=%g\n",
            n,sum,check_sum,diff);
  }

  return 0;
}

Please pay attention to the #pragma acc directives : they dictate how, where and when your data will be used on the GPU.
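
In the code above, create allocates the arrays on the device only, and present asserts that they are already there; no host-device transfers take place. If the host actually needed to supply the input or receive the result, you would use the copyin / copy / copyout clauses instead. A minimal hypothetical variant of the data region :

#pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
    // x is copied to the device on entry; y is copied in on entry
    // and copied back to the host on exit
    daxpy(n, a, x, y);
  }
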
To compile & link the application, the PGI environment must be used. Unless you have a strict requirement to use the C++ compiler (pgc++),
it is often more convenient to deploy PGI's C compiler with C99 standard extensions:

module purge

module load pgi/14.4 cuda/6.0
module list

set -xe
pgcc -c99 -O3 -Minfo=all -acc -ta=tesla:cc35,6.0,kepler+ daxpy.c -o daxpy.x.acc
ldd daxpy.x.acc

Since we asked for compiler feedback via -Minfo=all , we are rewarded with a rather exhaustive listing of messages. In this case they reveal how well
we succeeded in our attempts at keeping the GPU busy – at least in theory. In addition, the dynamic libraries are also shown:

Currently Loaded Modules:
  1) pgi/14.4    2) cuda/6.0
+ pgcc -c99 -O3 -Minfo=all -acc -ta=tesla:cc35,6.0,kepler+ daxpy.c -o daxpy.x.acc
init:
     13, Generating present(x[:])
         Generating present(y[:])
         Accelerator kernel generated
         16, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
         18, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
     13, Generating Tesla code
daxpy:
     27, Generating present(x[:])
         Generating present(y[:])
         Accelerator kernel generated
         28, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
     27, Generating Tesla code
sum_up:
     35, Accelerator kernel generated
         36, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
             Sum reduction generated for sum
     35, Generating present_or_copyin(y[:n])
         Generating Tesla code
main:
     50, Generating create(x[:n])
         Generating create(y[:n])
+ ldd daxpy.x.acc
        linux-vdso.so.1 =>  (0x00007fff6fbf0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003625600000)
        libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003628e00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003625a00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003625e00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003625200000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003624e00000)

2.4.3.2 Fortran version of OpenACC DAXPY

The corresponding Fortran version ( daxpy.F90 ) looks as follows (please pay attention to the !$acc directives, which enable OpenACC) :

function init(n, x, y) result(a)
implicit none
integer, intent(in)  :: n
real(8), intent(out) :: x(n), y(n)
real(8) :: a
integer :: j
!$acc parallel present(x,y) private(j)
!$acc loop
do j=1,n
   x(j) = j-1
enddo
!$acc loop
do j=1,n
   y(j) = -(j-1)
enddo
!$acc end parallel
a = 2.0_8
end function init

subroutine daxpy(n, a, x, y)
implicit none
integer, intent(in)  :: n
real(8), intent(in) :: a
real(8), intent(in) :: x(n)
real(8), intent(inout) :: y(n)
integer :: j
!$acc parallel loop present(x,y)
do j=1,n
  y(j) = y(j) + a * x(j)
enddo
!$acc end parallel
end subroutine daxpy

function sum_up(n, y) result(sum)
implicit none
integer, intent(in)  :: n
real(8), intent(in) :: y(n)
real(8) :: sum
integer :: j
sum = 0
!$acc parallel loop reduction(+:sum) private(j)
do j=1,n
   sum = sum + y(j)
enddo
!$acc end parallel
end function sum_up

program main
use openacc, only : acc_init, acc_device_nvidia
implicit none
integer, parameter :: n = 2**27 ! = 128M
  real(8) :: x(n), y(n)
  real(8) :: a, dn, sum, check_sum, diff
  real(8), external :: init, sum_up

#ifdef _OPENACC
  ! It's recommended to perform one-off initialization of GPU-devices before any OpenACC regions
  call acc_init(acc_device_nvidia)
#endif

!$acc data create(x(1:n),y(1:n))
  ! initialize vectors -- directly on GPU device
  a = init(n,x,y)

  ! call daxpy with 128M elements
  call daxpy(n, a, x, y)

  dn = n
  sum = sum_up(n,y); check_sum = (dn * (dn - 1))/2
  diff = sum-check_sum

  write(0,1000) n,sum,check_sum,diff
  1000 format("daxpy(n=",i0,"): sum=",g12.5," : check_sum=",g12.5," : diff=",g12.5)
!$acc end data

end program main

Compilation goes as follows (please also observe the importance of the -mcmodel=medium flag : the two 1 GB static arrays do not fit within the 2 GB limit of the default memory model, so the program would not link correctly without it) :

module purge
module load pgi/14.4 cuda/6.0
module list

set -xe
pgfortran -O3 -Minfo=all -mcmodel=medium -acc -ta=tesla:cc35,6.0,kepler+ daxpy.F90 -o daxpy.x.acc

ldd daxpy.x.acc

The output from this compilation & linking sequence looks like this :

Currently Loaded Modules:
  1) pgi/14.4    2) cuda/6.0
+ pgfortran -O3 -Minfo=all -mcmodel=medium -acc -ta=tesla:cc35,6.0,kepler+ daxpy.F90 -o daxpy.x.acc
init:
      7, Generating present(x(:))
         Generating present(y(:))
         Accelerator kernel generated
          9, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
         13, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
      7, Generating Tesla code
daxpy:
     27, Generating present(x(:))
         Generating present(y(:))
         Accelerator kernel generated
         28, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
     27, Generating Tesla code
sum_up:
     41, Accelerator kernel generated
         42, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
         43, Sum reduction generated for sum
     41, Generating present_or_copyin(y(:n))
         Generating Tesla code
main:
     61, Generating create(x(:))
         Generating create(y(:))
+ ldd daxpy.x.acc
        linux-vdso.so.1 =>  (0x00007fff42aec000)
        libaccapi.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libaccapi.so (0x00007f9ece3be000)
        libaccg.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libaccg.so (0x00007f9ece2a9000)
        libaccn.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libaccn.so (0x00007f9ece192000)
        libaccg2.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libaccg2.so (0x00007f9ece087000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003625600000)
        libpgmp.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgmp.so (0x00007f9ecdeea000)
        libnuma.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libnuma.so (0x00007f9ecdde9000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003625a00000)
        libpgf90.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgf90.so (0x00007f9ecd970000)
        libpgf90_rpm1.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgf90_rpm1.so (0x00007f9ecd86e000)
        libpgf902.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgf902.so (0x00007f9ecd75a000)
        libpgf90rtl.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgf90rtl.so (0x00007f9ecd635000)
        libpgftnrtl.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgftnrtl.so (0x00007f9ecd501000)
        libpgc.so => /homeappl/appl_taito/opt/pgi/14.4/linux86-64/14.4/libso/libpgc.so (0x00007f9ecd391000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003626600000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003625e00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003625200000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003624e00000)
