
3.4.1 Introduction 

There are several execution modes available on the Bull-MIC partition:

  • host CPU-only mode
  • host CPU with offload to the MIC-cards using the OpenMP 4.0 target directive
  • MIC native mode, which runs only on the MIC-card(s) and requires compilation & linking with the -mmic option
  • MPI (+ OpenMP) versions that
    • run solely on the CPU-node(s)
    • run solely on the MIC-cards (with one or more MPI-tasks per card)
    • run in symmetric (or better: hybrid) mode, where two executables are built, one for the CPUs and one for the MICs, and they are launched via MPI mechanisms

The host CPU-only mode does not utilize the MIC-cards at all unless offload directives are deployed, and this is the preferred way of using the system at the moment. All MPI communication, if applicable, happens between host CPU MPI-tasks only.

The second mode is to build solely for the MIC-cards, i.e. for MIC native mode. No offload is required, and no host CPU involvement either. The application build process is carried out on the host CPU, and CSC's auto-offload execution (AOE) will automatically detect that the executable is meant for a MIC-card.

At the moment we recommend that you use MIC native mode without MPI only, as we have serious problems getting the MICs to work over InfiniBand. We expect Intel to resolve these problems soon.

3.4.1.1 The recommended build directory 

We recommend building your application under your $TMPDIR before copying the executable(s) to $WRKDIR, from where you can submit them to the SLURM batch queue :

mkdir $TMPDIR/my_software # for example ...

< copy your software source codes, tarballs etc. here >

cd $TMPDIR/my_software

module purge
module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3
module list

< your compilation and linking sequence for C/C++/Fortran programs e.g. using make >

cp my_executables $WRKDIR
pushd $WRKDIR

# Submit batch job
sbatch my_jobfile
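The job file itself is not shown above; a minimal sketch of what my_jobfile could contain is given below. The partition name, node count, and time limit are placeholders (assumptions, not CSC-verified values), so check the actual batch queue settings for the Bull-MIC partition before use:

```shell
#!/bin/bash
# Hypothetical SLURM job file (my_jobfile) -- all #SBATCH values below
# are placeholders; substitute your site's real partition name and limits
#SBATCH --job-name=my_job
#SBATCH --partition=mic        # placeholder partition name
#SBATCH --nodes=1
#SBATCH --time=00:30:00

cd $WRKDIR
./my_executables
```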

3.4.2 OpenMP builds (for host CPU and native MIC modes)

3.4.2.1 Host CPU-only build 

The following compilation and linking sequences are valid for host CPU-only builds (OpenMP & offloads require the -openmp flag) :

  • C-programs :

% mpiicc -openmp cprog.c -o cprog.x

  • C++ programs :

% mpiicpc -openmp cxxprog.cpp -o cxxprog.x

  • Fortran programs :

% mpiifort -openmp fprog.F90 -o fprog.x

3.4.2.2 Hello World in C : pure OpenMP, no offloads – for host CPU and native MIC runs 

The "Hello World" for host CPU-only is :

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>

#ifdef _OPENMP
#include <omp.h>
#else
#error Must be compiled with -openmp flag
#endif

int main(int argc, char *argv[])
{

  int maxth = omp_get_max_threads();
  int j, count = 0;
  char hostname[64];

#pragma omp parallel reduction(+:count) default(none) shared(maxth)
#pragma omp for private(j)
  for (j = 0; j < maxth; ++j) ++count;

  gethostname(hostname, sizeof(hostname));
  printf("%s: Hello World from '%s' !! The number of threads = %d and count = %d\n",
         argv[0], hostname, maxth, count);
}
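Once built, the number of threads the program reports (maxth) can be steered with the standard OMP_NUM_THREADS environment variable. An illustrative run, reusing the executable name from the C build example above:

```shell
# Set the OpenMP thread count before launching (illustrative run)
export OMP_NUM_THREADS=8
./cprog.x
# With 8 threads, the program reports maxth = 8 and count = 8
```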

3.4.2.3 Native MIC build 

When building a native MIC executable, you need to add the compiler option -mmic (and usually -openmp), and your MIC executable is ready to run. It cannot be easier!

  • C-programs :

% mpiicc -mmic -openmp cprog.c -o cprog.x

  • C++ programs :

% mpiicpc -mmic -openmp cxxprog.cpp -o cxxprog.x

  • Fortran programs :

% mpiifort -mmic -openmp fprog.F90 -o fprog.x

Exactly the same "Hello World" as for the host CPUs, without a single line of change, compiles and runs on the MIC-cards as well.
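A native MIC binary cannot be executed directly on the host CPU. On this system, CSC's auto-offload execution (AOE) mentioned above detects the MIC executable automatically; on a generic Intel MPSS installation one way to launch it by hand is to copy it to the card and run it over ssh. The card hostname (mic0) and target directory below are assumptions:

```shell
# Illustrative manual launch of a native binary on the first MIC card
# (card hostname mic0 and /tmp target directory are assumptions)
scp cprog.x mic0:/tmp/
ssh mic0 /tmp/cprog.x
```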

3.4.3 Offload builds

3.4.3.1 Hello World in C : using OpenMP4.0 target -directive 

In order to make our "Hello World" program run in offload mode, we need to modify it slightly (not too badly). Yet the compilation is exactly as if you were building for the host CPUs (but remember the -openmp flag, of course).

The following code chooses one of the devices (= MIC-cards) to run the parallel construct on. Please notice that we grab the number of threads available on the device by running a "superfluous" parallel region; this also initializes the device, making the subsequent parallel region faster.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>

#ifdef _OPENMP
#include <omp.h>
#else
#error Must be compiled with -openmp flag
#endif

int main(int argc, char *argv[])
{

  int maxth = omp_get_max_threads() ; /* Number of threads -- on host CPU */
  int j, count = 0;
  char hostname[64];
  int devnum = (argc > 1) ? atoi(argv[1]) : 0; /* Choose MIC-card 0 or 1 */
  int numdevs = omp_get_num_devices() ;

  devnum %= numdevs;

#pragma omp target map(from:hostname) device(devnum)
  {
    /* This also wakes up the device in concern !! */
#pragma omp parallel default(none) shared(maxth, hostname)
#pragma omp single
    {
      maxth = omp_get_max_threads() ;
      gethostname(hostname, sizeof(hostname));
    }

#pragma omp parallel reduction(+:count) num_threads(maxth) default(none) shared(maxth)
    {
#pragma omp for simd private(j)
      for (j = 0; j < maxth; ++j) ++count;
    }
  }

  printf("%s: Hello World from '%s', MIC-card#%d out of %d !! The # of threads = %d and count = %d\n",
         argv[0], hostname, devnum, numdevs, maxth, count);
} 

Compilation & linking sequence is as follows :

module purge
module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3
module list
set -xe
mpiicc -openmp hello4.c -o hello4_offload.x
ldd ./hello4_offload.x

And typical output looks something like this :

Currently Loaded Modules:
  1) intel/14.0.1    2) mkl/11.1.1    3) intelmpi/4.1.3
+ mpiicc -openmp hello4.c -o hello4_offload.x
+ ldd ./hello4_offload.x
        linux-vdso.so.1 =>  (0x00007fff775ee000)
        libmpigf.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpigf.so.4 (0x00007fa3bab59000)
        libmpi_mt.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpi_mt.so.4 (0x00007fa3ba4db000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa3ba2c7000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa3ba0bf000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa3b9ea1000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa3b9c1d000)
        libiomp5.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007fa3b9902000)
        liboffload.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/liboffload.so.5 (0x00007fa3b96d0000)
        libcilkrts.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libcilkrts.so.5 (0x00007fa3b9492000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007fa3b918c000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa3b8f75000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa3b8be1000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa3bad8b000)
        libimf.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libimf.so (0x00007fa3b8719000)
        libsvml.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libsvml.so (0x00007fa3b7b22000)
        libirng.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libirng.so (0x00007fa3b791b000)
        libintlc.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libintlc.so.5 (0x00007fa3b76c4000) 
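To confirm that the target region really executed on a MIC-card, the Intel offload runtime can print a per-offload report; setting the OFFLOAD_REPORT environment variable (values 1–3 give increasing detail) before the run is usually enough:

```shell
# Ask the Intel offload runtime to report each offload (levels 1-3)
export OFFLOAD_REPORT=2
./hello4_offload.x 0     # offload to MIC-card 0
./hello4_offload.x 1     # offload to MIC-card 1
```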

3.4.3.2 OpenMP4.0 offload version of Fortran DAXPY

Just to verify that we can use OpenMP target directives successfully in Fortran too, the DAXPY program familiar from GPU/OpenACC work has been ported to OpenMP 4.0. This is the outcome of the port; it looks pretty similar to the OpenACC version. The code sections within #ifdef __MIC__ ... #endif show that the code actually runs on the MIC side, given that the corresponding debug output is displayed. Please pay attention to the use of the !$omp declare target directive : it makes sure the subroutines/functions get generated for MIC offloading.

function init(n, x, y) result(a)
use omp_lib
implicit none
!$omp declare target
integer, intent(in)  :: n
real(8), intent(out) :: x(n), y(n)
real(8) :: a
integer :: j
#ifdef __MIC__
write(0,'(a,i0)') 'init: I am running on MIC#',omp_get_default_device()
#endif
!$omp parallel private(j)
!$omp do
do j=1,n
   x(j) = j-1
enddo
!$omp end do nowait
!$omp do
do j=1,n
   y(j) = -(j-1)
enddo
!$omp end do nowait
!$omp end parallel
a = 2.0_8
end function init

subroutine daxpy(n, a, x, y)
use omp_lib
implicit none
!$omp declare target
integer, intent(in)  :: n
real(8), intent(in) :: a
real(8), intent(in) :: x(n)
real(8), intent(inout) :: y(n)
integer :: j
#ifdef __MIC__
write(0,'(a,i0)') 'daxpy: I am running on MIC#',omp_get_default_device()
#endif
!$omp parallel do private(j)
do j=1,n
  y(j) = y(j) + a * x(j)
enddo
end subroutine daxpy

function sum_up(n, y) result(sum)
use omp_lib
implicit none
!$omp declare target
integer, intent(in)  :: n
real(8), intent(in) :: y(n)
real(8) :: sum
integer :: j
#ifdef __MIC__
write(0,'(a,i0)') 'sum_up: I am running on MIC#',omp_get_default_device()
#endif
sum = 0
!$omp parallel do reduction(+:sum) private(j)
do j=1,n
   sum = sum + y(j)
enddo
end function sum_up

program main
use omp_lib
implicit none
!$omp declare target(init,daxpy,sum_up)
  integer, parameter :: n = 2**27 ! = 128M
  real(8), allocatable :: x(:), y(:)
  real(8) :: a, dn, sum, check_sum, diff
  real(8), external :: init, sum_up
  integer :: devnum, numdevs

  numdevs = omp_get_num_devices()
  write(0,'(a,i0)') 'main: Number of MIC-devices = ',numdevs
  devnum = 0
  if (numdevs > 0) devnum = mod(1,numdevs) ! Trying to run on MIC#1 ...

  allocate(x(n), y(n))

!$omp target map(alloc: x(1:n),y(1:n)) device(devnum) if (numdevs > 0)
#ifdef __MIC__
  write(0,'(a,i0)') 'main: I am running on MIC#',omp_get_default_device()
#endif
  ! initialize vectors -- directly on MIC device
  a = init(n,x,y)

  ! call daxpy with 128M elements
  call daxpy(n, a, x, y)

  dn = n
  sum = sum_up(n,y); check_sum = (dn * (dn - 1))/2
  diff = sum-check_sum
!$omp end target

  write(0,1000) n,sum,check_sum,diff
  1000 format("daxpy(n=",i0,"): sum=",g12.5," : check_sum=",g12.5," : diff=",g12.5)

  deallocate(x,y)

end program main
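The check_sum used in main can be sanity-checked on the host without any MIC involvement: with x(j) = j-1, y(j) = -(j-1) and a = 2, each element ends up as y(j) = -(j-1) + 2(j-1) = j-1, so the total over j = 1..n is n(n-1)/2. A quick awk sketch replaying the arithmetic for a small n (the value 1000 is arbitrary):

```shell
# Replay the daxpy arithmetic for a small n and compare against n*(n-1)/2
n=1000
sum=$(awk -v n="$n" 'BEGIN { a = 2; s = 0
  for (j = 1; j <= n; j++) { x = j - 1; y = -(j - 1); y += a * x; s += y }
  printf "%d", s }')
check=$(( n * (n - 1) / 2 ))
echo "sum=$sum check_sum=$check"
# prints: sum=499500 check_sum=499500
```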

 

The following sequence of commands will compile this Fortran program :

module purge
module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3
module list
set -xe
mpif90 -openmp daxpy.F90 -o daxpy.x.omp4
ldd ./daxpy.x.omp4

With typical output being :

Currently Loaded Modules:
  1) intel/14.0.1    2) mkl/11.1.1    3) intelmpi/4.1.3
+ mpif90 -openmp daxpy.F90 -o daxpy.x.omp4
+ ldd ./daxpy.x.omp4
        linux-vdso.so.1 =>  (0x00007fffdaca0000)
        libmpigf.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpigf.so.4 (0x00007f6dd3b47000)
        libmpi_mt.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpi_mt.so.4 (0x00007f6dd34c9000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f6dd32b5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f6dd30ad000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6dd2e8f000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f6dd2c0b000)
        libiomp5.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007f6dd28f0000)
        liboffload.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/liboffload.so.5 (0x00007f6dd26be000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f6dd232a000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6dd2114000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f6dd1e0d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6dd3d79000)
        libimf.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libimf.so (0x00007f6dd1946000)
        libsvml.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libsvml.so (0x00007f6dd0d4e000)
        libirng.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libirng.so (0x00007f6dd0b47000)
        libintlc.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libintlc.so.5 (0x00007f6dd08f1000)

3.4.3.3 Important remark on building host CPU offload versions

Be careful with OpenMP 4.0 offload mode when your compile and link steps are separate. When OpenMP target directives are encountered, the compiler automatically creates one object file (say foo.o) for the host CPU and another one for the MIC (here fooMIC.o). In order to link them properly you need to create an archive file with the xiar command (not with the ar command) and put both incarnations of the object files there. Then you link against that library.

Here is the rather tedious procedure at the moment (assuming all files may contain OpenMP target directives) :

mpiicc -c -openmp main.c foo.c bar.c
xiar -crv libiar.a main.o mainMIC.o foo.o fooMIC.o bar.o barMIC.o # or just: xiar -crv libiar.a *.o
mpiicc main.o -L. -liar -o main_offload.x
ldd main_offload.x
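Since xiar accepts the usual ar options, listing the archive contents is a quick way to verify that both the host and the MIC incarnations of each object file actually made it into the library (file names follow the example above):

```shell
# List the archive: both foo.o and fooMIC.o etc. should appear
xiar -t libiar.a
```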
