
Let's run our pure OpenMP "Hello World" application first on the host CPUs only, then on the MIC cards only.
We then run the OpenMP 4.0 offload version of "Hello World" and, finally, the OpenMP 4.0 offload version of the Fortran DAXPY.


Recall that our module setup is as follows :

% module purge
% module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3

The host CPU version was compiled and linked as follows :

% mpiicc -openmp hello.c -o hello_cpu.x

The native MIC-card version was built as follows :

% mpiicc -mmic -openmp hello.c -o hello_mic.x

We provide the following SLURM-script (hello.slurm) :

#!/bin/bash
#SBATCH -N 1
#SBATCH -p mic
#SBATCH -t 00:05:00
#SBATCH -J hello
#SBATCH -o hello.out.%j
#SBATCH -e hello.out.%j
#SBATCH --gres=mic:2
#SBATCH --exclusive
#SBATCH

module purge
module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3
module list

set -xve

cd ${SLURM_SUBMIT_DIR:-.}
pwd

# Running the host CPU (no offload) only:
ldd ./hello_cpu.x
./hello_cpu.x

# Running the native MIC (-mmic) version:
micrun $(which ldd) ./hello_mic.x
./hello_mic.x # CSC's auto-offload execution (AOE) takes care

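Note that ./hello_mic.x is started directly from the host shell; CSC's auto-offload execution (AOE) takes care of running it on the MIC card. We submit the script with the sbatch command :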

% sbatch hello.slurm
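
Since the output file name contains the job ID, it can be handy to derive that name from what sbatch prints. The following is a minimal sketch, assuming sbatch replies in its usual form "Submitted batch job <jobid>"; the function name outfile_for_reply and the job ID 123456 are made up for illustration.

```shell
#!/bin/bash
# Derive the name of the job's output file from sbatch's reply.
# Assumption: the reply has the form "Submitted batch job <jobid>".
outfile_for_reply() {
    local reply="$1"
    local jobid="${reply##* }"   # keep only the last space-separated word
    echo "hello.out.${jobid}"    # matches "#SBATCH -o hello.out.%j"
}

# Made-up reply for illustration; on the cluster one would use
#   reply=$(sbatch hello.slurm)
reply="Submitted batch job 123456"
outfile=$(outfile_for_reply "$reply")
echo "$outfile"
```

Once the job has finished, the file named in $outfile can be inspected with cat or less.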

The output file ( hello.out.<slurm_jobid_number> ) looks as follows :

Currently Loaded Modules:
  1) intel/14.0.1    2) mkl/11.1.1    3) intelmpi/4.1.3

cd ${SLURM_SUBMIT_DIR:-.}
+ cd /wrk/sbs/GuideBull/MickeyMouse/chello/openmp
pwd
+ pwd
/wrk/sbs/GuideBull/MickeyMouse/chello/openmp

# Running the host CPU (no offload) only:
ldd ./hello_cpu.x
+ ldd ./hello_cpu.x
        linux-vdso.so.1 =>  (0x00007ffffddff000)
        libmpigf.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpigf.so.4 (0x00007effab611000)
        libmpi_mt.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpi_mt.so.4 (0x00007effaaf93000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007effaad7f000)
        librt.so.1 => /lib64/librt.so.1 (0x00007effaab77000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007effaa959000)
        libm.so.6 => /lib64/libm.so.6 (0x00007effaa6d5000)
        libiomp5.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007effaa3ba000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007effaa1a3000)
        libc.so.6 => /lib64/libc.so.6 (0x00007effa9e0f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007effab843000)
./hello_cpu.x
+ ./hello_cpu.x
./hello_cpu.x: Hello World from 'm3' !! The number of threads = 24 and count = 24

# Running the native MIC (-mmic) version:
micrun $(which ldd) ./hello_mic.x
which ldd)
which ldd
++ which ldd
+ micrun /usr/bin/ldd ./hello_mic.x
        linux-vdso.so.1 =>  (0x00007fff3bfff000)
        libmpigf.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/mic/lib/libmpigf.so.4 (0x00007fe34101c000)
        libmpi_mt.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/mic/lib/libmpi_mt.so.4 (0x00007fe340946000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fe340742000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fe34053a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe34031d000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fe3400ee000)
        libiomp5.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/mic/libiomp5.so (0x00007fe33fddd000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fe33fbcb000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fe33f873000)
        /lib64/ld-linux-k1om.so.2 (0x00007fe34124c000)
./hello_mic.x # CSC's auto-offload execution (AOE) takes care
+ ./hello_mic.x
./hello_mic.x: Hello World from 'm3-mic0' !! The number of threads = 244 and count = 244
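
A quick sanity check on such output is to confirm that the reported thread count and the counter agree. The sketch below assumes the output line contains the text "threads = N and count = M", as in the lines above; the function name check_hello_line is hypothetical.

```shell
#!/bin/bash
# Compare the "threads" and "count" values found in one hello output line.
# Assumption: the line contains "threads = <N> and count = <M>".
check_hello_line() {
    local line="$1"
    local pattern='s/.*threads = \([0-9][0-9]*\) and count = \([0-9][0-9]*\).*'
    local threads count
    threads=$(printf '%s\n' "$line" | sed -n "${pattern}/\1/p")
    count=$(printf '%s\n' "$line" | sed -n "${pattern}/\2/p")
    if [ -n "$threads" ] && [ "$threads" = "$count" ]; then
        echo "OK threads=${threads}"
    else
        echo "MISMATCH"
    fi
}

# The two result lines from the output above:
check_hello_line "./hello_cpu.x: Hello World from 'm3' !! The number of threads = 24 and count = 24"
check_hello_line "./hello_mic.x: Hello World from 'm3-mic0' !! The number of threads = 244 and count = 244"
```

To apply the check to a real output file, the matching lines could be selected with grep 'Hello World' and fed to the function one at a time.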


With the same module setup as before, we build our MIC-offload version as follows :

% mpiicc -openmp hello4.c -o hello4_offload.x

Using the following SLURM-script we run the job :

#!/bin/bash
#SBATCH -N 1
#SBATCH -p mic
#SBATCH -t 00:05:00
#SBATCH -J hello4
#SBATCH -o hello4.out.%j
#SBATCH -e hello4.out.%j
#SBATCH --gres=mic:2
#SBATCH --exclusive
#SBATCH

module purge
module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3
module list

set -xve

cd ${SLURM_SUBMIT_DIR:-.}
pwd

ldd ./hello4_offload.x

./hello4_offload.x 1 # Using card#1


The output file ( hello4.out.<slurm_jobid_number> ) looks as follows :

Currently Loaded Modules:
  1) intel/14.0.1    2) mkl/11.1.1    3) intelmpi/4.1.3

cd ${SLURM_SUBMIT_DIR:-.}
+ cd /wrk/sbs/GuideBull/MickeyMouse/chello/openmp4_offload
pwd
+ pwd
/wrk/sbs/GuideBull/MickeyMouse/chello/openmp4_offload

ldd ./hello4_offload.x
+ ldd ./hello4_offload.x
        linux-vdso.so.1 =>  (0x00007fffa63ff000)
        libmpigf.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpigf.so.4 (0x00007f3590567000)
        libmpi_mt.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpi_mt.so.4 (0x00007f358fee9000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f358fcd5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f358facd000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f358f8af000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f358f62b000)
        libiomp5.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007f358f310000)
        liboffload.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/liboffload.so.5 (0x00007f358f0de000)
        libcilkrts.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libcilkrts.so.5 (0x00007f358eea0000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f358eb9a000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f358e983000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f358e5ef000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3590799000)
        libimf.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libimf.so (0x00007f358e127000)
        libsvml.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libsvml.so (0x00007f358d530000)
        libirng.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libirng.so (0x00007f358d329000)
        libintlc.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libintlc.so.5 (0x00007f358d0d2000)

./hello4_offload.x 1 # Using card#1
+ ./hello4_offload.x 1
./hello4_offload.x: Hello World from 'm3-mic1', MIC-card#1 out of 2 !! The # of threads = 240 and count = 240
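
The job reserves two cards (--gres=mic:2) but the script above uses only card#1. To exercise each card in turn, the run line could be generated in a loop. This is only a sketch: it assumes the cards are numbered from 0 (suggested by the host names 'm3-mic0' and 'm3-mic1'), and it prints the command lines instead of executing them. The function name offload_commands is hypothetical.

```shell
#!/bin/bash
# Print one run command per MIC card (dry run, nothing is executed).
# Assumption: cards are numbered 0 .. num_cards-1.
offload_commands() {
    local num_cards="$1"
    local card
    for (( card = 0; card < num_cards; card++ )); do
        echo "./hello4_offload.x ${card}"
    done
}

# Two cards, mirroring "#SBATCH --gres=mic:2"
offload_commands 2
```

In the SLURM script itself, the echo would be replaced by the actual call to ./hello4_offload.x.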


The OpenMP 4.0 offload version of the Fortran DAXPY is run with the following script :

#!/bin/bash
#SBATCH -N 1
#SBATCH -p mic
#SBATCH -t 00:05:00
#SBATCH -J daxpy
#SBATCH -o daxpy.out.%j
#SBATCH -e daxpy.out.%j
#SBATCH --gres=mic:2
#SBATCH --exclusive
#SBATCH

module purge
module load intel/14.0.1 mkl/11.1.1 intelmpi/4.1.3
module list

set -xve

cd ${SLURM_SUBMIT_DIR:-.}
pwd

ldd ./daxpy.x.omp4

./daxpy.x.omp4

And the output looks as follows (compare this with the GPU/OpenACC version output, if you like) :

Currently Loaded Modules:
  1) intel/14.0.1    2) mkl/11.1.1    3) intelmpi/4.1.3

cd ${SLURM_SUBMIT_DIR:-.}
+ cd /wrk/sbs/GuideBull/MickeyMouse/daxpy/openmp4_fortran
pwd
+ pwd
/wrk/sbs/GuideBull/MickeyMouse/daxpy/openmp4_fortran

ldd ./daxpy.x.omp4
+ ldd ./daxpy.x.omp4
        linux-vdso.so.1 =>  (0x00007fff253ff000)
        libmpigf.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpigf.so.4 (0x00007f3a9d8bf000)
        libmpi_mt.so.4 => /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/lib/libmpi_mt.so.4 (0x00007f3a9d241000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f3a9d02d000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f3a9ce25000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3a9cc07000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f3a9c983000)
        libiomp5.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libiomp5.so (0x00007f3a9c668000)
        liboffload.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/liboffload.so.5 (0x00007f3a9c436000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f3a9c0a2000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f3a9be8c000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f3a9bb85000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3a9daf1000)
        libimf.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libimf.so (0x00007f3a9b6be000)
        libsvml.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libsvml.so (0x00007f3a9aac6000)
        libirng.so => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libirng.so (0x00007f3a9a8bf000)
        libintlc.so.5 => /appl/opt/cluster_studio_xe2013/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libintlc.so.5 (0x00007f3a9a669000)

./daxpy.x.omp4
+ ./daxpy.x.omp4
main: Number of MIC-devices = 2
main: I am running on MIC#1
init: I am running on MIC#1
daxpy: I am running on MIC#1
sum_up: I am running on MIC#1
daxpy(n=134217728): sum= 0.90072E+16 : check_sum= 0.90072E+16 : diff=  0.0000
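
The DAXPY source is not shown on this page, so what check_sum actually computes is not known here. As an observation only: for n = 134217728 the printed value 0.90072E+16 agrees, to the digits shown, with the closed form n(n+1)/2, which would fit a test whose result vector holds the values 1..n. The sketch below evaluates that closed form with the shell's 64-bit integer arithmetic; the function name gauss_sum is hypothetical.

```shell
#!/bin/bash
# Closed-form sum 1 + 2 + ... + n in the shell's integer arithmetic.
# Assumption (not confirmed by the source): the DAXPY check sum is n(n+1)/2.
gauss_sum() {
    local n="$1"
    echo $(( n * (n + 1) / 2 ))
}

# n as printed in the output above (2^27)
gauss_sum 134217728
```

For this n the result is 9007199321849856, that is about 0.90072E+16.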

 

 

 
