This wiki contains end user documentation for the CSC HPC Prototype 'Hybrid', intended for assessing Xeon Phi and GPGPUs.

For support requests, send e-mail to proto-support@postit.csc.fi

PRACE Spring School 2013 Lectures on Xeon Phi

System Configuration

The system is a cluster consisting of 10 T-Platforms T-Blade V200 blades. 

The node coprocessor/accelerator configuration is as follows:

The system currently has only local disk. There are two main directories:

In the near future these directories will be merged.

Xeon Phi-specific configuration

Unlike GPGPUs, each Xeon Phi runs a very stripped-down Linux with the busybox shell. The Xeon Phi can be accessed using ssh. Each Phi can be accessed with two hostnames: 

Do not access the MICs of the compute nodes directly to run jobs! Use the batch job queuing system.

Diagram of the node internal topology of a Xeon Phi node

Usage Policy

The system is intended to be used by the following groups:

The system is a prototype, which has the following implications:

Applying for an account

To apply for an account, send the following information to proto-support@postit.csc.fi

Logging In

The system can be accessed by logging into hybrid.csc.fi using ssh.

When you log in for the first time, please create a passphraseless ssh key pair to access the MIC cards:

$ ssh-keygen -f $HOME/.ssh/id_rsa -N ''; cp $HOME/.ssh/id_rsa.pub $HOME/.ssh/authorized_keys2

The OS is CentOS 6.3, which is based on RedHat Enterprise Linux.

To switch between different compilers, libraries, etc., use the module command. Information on the module command can be found in the Vuori user guide.

Running Jobs

master (frontend node)

The master node (frontend) and its Xeon Phi coprocessor are time-shared (they cannot be reserved by a single user). The master node is intended for code development, compiling and porting. Do not run computationally intensive, memory-hungry or long-running jobs there.

Compute nodes

The batch job queue manager is SLURM. The basic usage is similar to using SLURM on CSC's Taito supercluster or most other commodity clusters.

The following sections (Xeon Phi development and NVidia Tesla development) have basic examples on how to run jobs via SLURM and focus on the special features necessitated by the GPUs and Phis. Please refer to the Taito SLURM documentation for more detailed information about using SLURM (constructing batch job scripts etc.).

In this prototype cluster, SLURM daemons run on both the compute node hosts and the Xeon Phi cards. This is an experimental feature still under development and currently unique to the CSC cluster.

There are the following partitions available:

On occasion some nodes may be down or drained for experiments. To check node availability, use the sinfo command.
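
As a minimal sketch of such a check, the helper below filters node-oriented sinfo output down to the nodes that are currently idle. The function name idle_nodes is hypothetical (it is not installed on the system); the sinfo options used in the comment (-h, -N, -o with the %N and %T fields) are standard SLURM options.

```shell
# Hypothetical helper: list the unique nodes reported as "idle".
# Reads "<nodename> <state>" lines on stdin, as printed by
#   sinfo -h -N -o "%N %T"
idle_nodes() {
    awk '$2 == "idle" { print $1 }' | sort -u
}

# On the cluster (not executed here):
#   sinfo -h -N -o "%N %T" | idle_nodes
```

The sort -u step is there because a node that belongs to several partitions is listed once per partition in node-oriented output.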

Currently all nodes are in EXCLUSIVE mode. This means that each job reserves its nodes completely.

Xeon Phi Development

The following tools are available for Xeon Phi:

This document currently describes only the practical usage of the Phi in the prototype system. To learn about developing for Xeon Phi, here are a few starting points:

The following types of execution models are supported:

Executable Auto-Offloading

The Phi nodes have Executable Auto-Offloading (EAO) enabled by default. This feature is developed at CSC and is not currently in the standard Xeon Phi distribution.

With this feature, any executable in the K1OM (MIC) binary format that the user tries to run on the host will transparently be executed on the Xeon Phi coprocessor card instead. The execution is performed using the /usr/bin/micrun script.

By default, all environment variables with the MIC_ prefix are passed to the binary with the prefix stripped away. For example, MIC_LD_LIBRARY_PATH is passed as LD_LIBRARY_PATH.
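
The naming rule can be illustrated with a few lines of shell. This is only a sketch of the prefix stripping described above, not the micrun implementation; the function name strip_mic_prefix and the path /opt/example/lib are hypothetical.

```shell
# Illustration only: print the variables a MIC binary would receive,
# i.e. every MIC_-prefixed variable with the prefix removed.
# This is NOT how micrun is implemented, just the naming rule.
strip_mic_prefix() {
    env | sed -n 's/^MIC_\([A-Za-z_][A-Za-z0-9_]*=\)/\1/p' | sort
}

(
    # Hypothetical values, exported only inside this subshell
    export MIC_LD_LIBRARY_PATH=/opt/example/lib
    export MIC_OMP_NUM_THREADS=60
    strip_mic_prefix
)
```

Variables without the MIC_ prefix do not appear in the output, which mirrors the statement that only prefixed variables are passed on by default.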

EAO can be disabled by setting the environment variable MICRUN_DISABLE (e.g. export MICRUN_DISABLE=1).

Offload programming model

The Intel compilers support offload compilation automatically. This means either offloading a code section using offload pragmas or calling an offload-enabled library (e.g. MKL).

In order to run offload jobs, one needs to set the GRES (Generic Resource Scheduling) parameter '--gres=mic:1'. For example:

$ srun --gres=mic:1 ./hello

If this is not set, the user will get the following warning:

offload warning: OFFLOAD_DEVICES device number -1 does not correspond to a physical device
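
A job script could test for this situation before starting the program. The sketch below assumes that SLURM exports OFFLOAD_DEVICES with the allocated device numbers when the mic GRES has been granted, and that the variable is unset or -1 otherwise; this is inferred from the warning text above and has not been verified on this system. The function name have_mic_allocation is hypothetical.

```shell
# Hypothetical pre-flight check for offload jobs.
# Assumption: inside a job that was granted --gres=mic:N, OFFLOAD_DEVICES
# holds the allocated device numbers; without the GRES it is unset or -1.
have_mic_allocation() {
    case "${OFFLOAD_DEVICES:-}" in
        ""|-1) return 1 ;;
        *)     return 0 ;;
    esac
}

if have_mic_allocation; then
    echo "Xeon Phi allocated: OFFLOAD_DEVICES=$OFFLOAD_DEVICES"
else
    echo "no Xeon Phi allocated; rerun with: srun --gres=mic:1 ..." >&2
fi
```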

Native programming model

Currently the Intel and GNU compilers support native compilation for the Phi. However, only the Intel compiler can exploit vectorization, so it should always be used for performance-critical code. The GNU compilers can be used to compile non-performance-critical support programs and libraries.

We recommend that you use the .mic suffix in binaries to differentiate MIC binaries from x86_64 binaries.

While the instructions here discuss OpenMP, any other pthread-based programming model could potentially be used in a similar way.

Native OpenMP code

To compile OpenMP code natively, you can use the -mmic flag.

$ module load intel
$ icc -mmic -openmp hello.c -o hello.mic

To run, use the mic partition, for example:

$ srun ./hello.mic 

Native MPI+OpenMP

To compile:

$ module load intel mic-impi
$ mpiicc -mmic -openmp ompmpihello.c -o ompmpihello.mic

The job is run in the michost queue using a wrapper script.

The executable is launched in a SLURM batch job script using the mpirun-mic -m <mic_binary> command.
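
A batch job script along these lines might look as follows. This is a sketch under assumptions: it treats the michost queue as a SLURM partition named michost, the job name and time limit are arbitrary, and MIC_PPN is taken to be the number of MIC MPI tasks per node. The snippet writes the script to a file called mic_job.sh (a hypothetical name), which would then be submitted with sbatch mic_job.sh.

```shell
# Write an example job script to mic_job.sh; submit it with: sbatch mic_job.sh
cat > mic_job.sh <<'EOF'
#!/bin/bash
# Job name and wall-clock limit are arbitrary example values
#SBATCH -J mpiomphello
#SBATCH -t 00:10:00
# Assumption: the "michost queue" is a SLURM partition of that name
#SBATCH -p michost
# Number of nodes
#SBATCH -N 2

# Assumption: MIC MPI tasks per node and OpenMP threads per MIC task
export MIC_PPN=4
export MIC_OMP_NUM_THREADS=60

mpirun-mic -m /share/mic/examples/mpiomphello/mpiomphello.mic
EOF
```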

The following environment variables and SLURM parameters are used to control the MPI task and OpenMP thread count.

The following example runs:

$ MIC_PPN=4 MIC_OMP_NUM_THREADS=60 srun -N 2 mpirun-mic \
  -m /share/mic/examples/mpiomphello/mpiomphello.mic 
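
To see what such settings add up to, the arithmetic can be sketched in shell. The helper name mic_job_size is hypothetical, and the calculation assumes that MIC_PPN is the number of MIC MPI tasks per node and MIC_OMP_NUM_THREADS the number of OpenMP threads per MIC task.

```shell
# Hypothetical helper: summarize the size of a native MIC job.
#   $1 = number of nodes (srun -N)
#   $2 = MIC MPI tasks per node (MIC_PPN, assumed meaning)
#   $3 = OpenMP threads per MIC task (MIC_OMP_NUM_THREADS)
mic_job_size() {
    nodes=$1 ppn=$2 threads=$3
    tasks=$((nodes * ppn))
    echo "MPI tasks: $tasks"
    echo "OpenMP threads per node: $((ppn * threads))"
    echo "OpenMP threads in total: $((tasks * threads))"
}

mic_job_size 2 4 60
# prints:
#   MPI tasks: 8
#   OpenMP threads per node: 240
#   OpenMP threads in total: 480
```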


In the symmetric model, MPI tasks are executed both on the host and on the MIC.

To create a symmetric job, please compile the application for both the host architecture and MIC.

$ module load intel impi
$ mpiicc -openmp mpiomphello.c -o mpiomphello
$ mpiicc -mmic -openmp mpiomphello.c -o mpiomphello.mic

Symmetric jobs are run in the michost queue using a wrapper script called mpirun-mic.

The mpirun-mic script takes two parameters:

The following environment variables and SLURM parameters are used to control the MPI task and OpenMP thread count.

The following example runs:

$ MIC_PPN=4 MIC_OMP_NUM_THREADS=60 OMP_NUM_THREADS=2 srun -n 12 --tasks-per-node 6 mpirun-mic \
  -m /share/mic/examples/mpiomphello/mpiomphello.mic \
  -c /share/mic/examples/mpiomphello/mpiomphello

The ranks are numbered consecutively, with the CPU tasks first and the MIC tasks next on each node. For the example above, the ranks would be placed as follows:

Node     Host ranks   MIC ranks
node 1   0-5          6-9
node 2   10-15        16-19
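
The placement rule can be expressed as a short calculation. The helper rank_layout below is hypothetical and only reproduces the numbering described above; it assumes the same number of host and MIC tasks on every node.

```shell
# Hypothetical helper: print the rank ranges per node for a symmetric job.
#   $1 = number of nodes
#   $2 = host MPI tasks per node (srun --tasks-per-node)
#   $3 = MIC MPI tasks per node (MIC_PPN)
rank_layout() {
    nodes=$1 host=$2 mic=$3
    per_node=$((host + mic))
    n=0
    while [ "$n" -lt "$nodes" ]; do
        first=$((n * per_node))   # first rank on this node
        echo "node $((n + 1)): host $first-$((first + host - 1)), MIC $((first + host))-$((first + per_node - 1))"
        n=$((n + 1))
    done
}

rank_layout 2 6 4
# prints:
#   node 1: host 0-5, MIC 6-9
#   node 2: host 10-15, MIC 16-19
```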

Debugging with Intel IDB


Debugging with TotalView



NVidia Kepler Development

The following tools are available for NVidia Kepler:

In order to run offload jobs, one needs to set the GRES (Generic Resource Scheduling) parameter '--gres=gpu:1'.

$ module load cuda
$ srun -p gpu --gres=gpu:1 ./gpuhello