
The script described below moves data between an EUDAT data node and an external GridFTP server (staging data in and out). It uses the GlobusOnline service (hereafter GO) via its API to manage the transfers "in the cloud", so you need a GO account and the required GO endpoints configured.
In particular, please associate your X.509 certificate with your GO account under https://www.globusonline.org/account/ManageIdentities
Moreover, if you would like to transfer data from a laptop where no GridFTP server is available, please consider installing GlobusConnect.

Thanks to GO, you can monitor your data transfers in the GO web GUI.

The picture below gives an overview of the working principles of the script. The use case describes a user of the VPH community who:

  1. uploads files to EUDAT using the script (where they are replicated by the "Safe Replication hooks");
  2. moves the files, identified by their PID, to HECToR, the HPC machine at EPCC.

This second picture gives a more detailed view of the process of moving data from EUDAT to an HPC machine: the data is first identified (by its PID in the picture, but other patterns are possible), then GO is used to perform a third-party transfer.

 

To work correctly, the script needs to query the iRODS server hosting your files, so some rules must be installed on that server (see below). They may eventually be included in the eudat.re file, but for now they have to be added manually to one of the rule files used by the server (such as core.re).

Prerequisites:

  • python 2.7 (pip is suggested)
  • grid-proxy-utils (in particular grid-proxy-init)
  • python-m2crypto (script version > 1.1)[1]
    or
  • myproxy commands (in particular myproxy-init)
  • icommands

[1] You should also replace the file m2.py installed with the GO API with a new one, available on the SVN (see below).

The installation is quite simple:

  • download all the needed files (see list below) from the EUDAT SVN (either with an svn client or as the corresponding tar)
  • install the Globus Online API, for example:
    pip install globusonline-transfer-api-client
  • configure your iRODS environment to point to the iRODS server enabled with the needed rules

The script consists of the following files:

  • datastagerconfig.py.example: template for the required datastagerconfig.py file (copy it to datastagerconfig.py; script version > 1.1)
  • PIDselecter.r: the rule which gives you the PID of a file given its URL
  • URLselecter.r: the rule which gives you the URL of a file given its PID
  • seedselecter.r: the rule which gives you the PATH of a file given its seismological data (specific for the EPOS use case)

  • datastager.py: the main script
  • datamover.py: the script which invokes GO
  • seedselecter.py: the script which constructs the path of the seismological data (specific for the EPOS use case)

  • example: some concrete examples
  • pid.file.example: an example of PID file, i.e. a list of PIDs (to be used with the -PF option)
  • task.file.example: an example of path file, i.e. a list of file names (to be used with the -pF option)
  • url.file.example: an example of URL file, i.e. a list of URLs (to be used with the -UF option)
  • README: some further information
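
The list files above (PID, path and URL files) can be read with a few lines of Python. The sketch below is only an illustration and not code taken from datastager.py: it assumes one entry per line, which is not stated on this page, and the helper name read_list_file is hypothetical.

```python
def read_list_file(path):
    """Return the entries of a PID/URL/path list file.

    Assumption: the file holds one entry per line. Blank lines and
    surrounding whitespace are ignored.
    """
    entries = []
    with open(path) as handle:
        for line in handle:
            entry = line.strip()
            if entry:
                entries.append(entry)
    return entries
```

For example, read_list_file("pid.file.example") would return the PIDs as a Python list, ready to be processed one by one.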

The script works as follows:

  1. calculate the list of files to be transferred (for example, extracting the iRODS path from the PID)
  2. activate the endpoints involved in the transfer, trying in order:
    1. auto-activation (good for GlobusConnect)
    2. failing that, MyProxy activation (if you have a MyProxy server associated with your endpoint)
    3. failing that, local activation, which uses your local proxy and requires the python-m2crypto package
  3. delegate GO to do the transfer
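
The activation order in step 2 is a simple fallback chain. The sketch below shows the pattern only: the function activate_endpoint and the strategy callables are hypothetical stand-ins, not the actual code of datamover.py or calls of the GO API client.

```python
def activate_endpoint(endpoint, strategies):
    """Try each (name, callable) pair in order.

    Returns the name of the first strategy that succeeds; raises
    RuntimeError if all of them fail. Each callable is a hypothetical
    stand-in that takes the endpoint name and raises on failure.
    """
    errors = []
    for name, activate in strategies:
        try:
            activate(endpoint)
            return name
        except Exception as exc:
            # A real script would catch the specific error of the GO client.
            errors.append("%s: %s" % (name, exc))
    raise RuntimeError("could not activate %s (%s)"
                       % (endpoint, "; ".join(errors)))
```

It would be called with the three methods in the order listed above, for example activate_endpoint("irods-dev", [("auto", try_auto), ("myproxy", try_myproxy), ("local", try_local)]), where the three try_* functions are placeholders for the real activation calls.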

Note 1 In datastager.py or datastagerconfig.py, iPATH (around line 12) stores the path of the icommands, as in the following line:

iPATH='/home/jack/CINECA/GridTools/iRODS/iRODS/clients/icommands/bin/'

To use datastager.py, adapt this value to your installation.
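
A quick way to check that iPATH points to a usable icommands directory is sketched below. This is an illustrative helper, not part of the script: the function names are hypothetical, and the list of icommands to check (ils, irule) is only an assumption about what the script needs.

```python
import os

# Value taken from the example above; replace it with your own path.
iPATH = '/home/jack/CINECA/GridTools/iRODS/iRODS/clients/icommands/bin/'


def icommand(name, ipath=iPATH):
    """Return the full path of an icommand (e.g. 'ils') under ipath."""
    return os.path.join(ipath, name)


def missing_icommands(names, ipath=iPATH):
    """Return the icommands in names that are not executable under ipath."""
    return [name for name in names
            if not os.access(icommand(name, ipath), os.X_OK)]
```

An empty list from missing_icommands(['ils', 'irule']) suggests that the path is set correctly.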

Note 2 In datastager.py or datastagerconfig.py, urlendpoint (around line 14) maps each server URL to its GO endpoint, as in the following line:

urlendpoint={'data.repo.cineca.it': "cinecaRepoSingl", 'irods-dev.cineca.it': "irods-dev"}

To use datastager.py, add your own mapping between plain URLs and GO endpoints to it, or inform the developers.
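
To illustrate how such a mapping can be used, the sketch below resolves the GO endpoint for a given data URL by its host name. It is not the lookup code of datastager.py: the function endpoint_for_url is hypothetical, and the irods:// URL in the usage note is an assumed form.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

# Mapping copied from the example above.
urlendpoint = {'data.repo.cineca.it': "cinecaRepoSingl",
               'irods-dev.cineca.it': "irods-dev"}


def endpoint_for_url(url, mapping=urlendpoint):
    """Return the GO endpoint configured for the host of url."""
    host = urlparse(url).hostname
    if host not in mapping:
        raise KeyError("no GO endpoint configured for host %r" % host)
    return mapping[host]
```

With the mapping above, a URL such as irods://data.repo.cineca.it:1247/some/collection/file (hypothetical path) resolves to the endpoint cinecaRepoSingl, while a host that is missing from the dictionary raises a KeyError.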

For the latest usage, see datastager.py -h:

usage: datastager.py [-h] [-d] [-p PATH] [-u USER] [-y YEAR] [-n NETWORK]
[-c CHANNEL] [-s STATION] [-P PID] [-PF PIDFILE] [-U URL]
[-UF URLFILE] [-t TASKID] [-pF PATHFILE] [--ss SRC_SITE]
[--ds DST_SITE] [--sd SRC_DIR] [--dd DST_DIR]
{in,out} {seed,pid,url,taskid}

Data stager: move a bounce of data inside or outside iRODS via GridFTP.
The -d options requires both positional arguments.

positional arguments:
{in,out} the direction of the stage: in or out
{seed,pid,url,taskid}
the description of your data

optional arguments:
-h, --help show this help message and exit
-d, --details a longer description and some usage examples
-p PATH, --path PATH the path of your iRODS collection if staging out or the local file name if staging in
-u USER, --username USER
your username on globusonline.org
--ss SRC_SITE the GridFTP src server as GO endpoint
--ds DST_SITE the GridFTP dst server as GO endpoint
--sd SRC_DIR the GridFTP src directory
--dd DST_DIR the GridFTP dst directory

taskid:
Options specific to stage in taskid

-t TASKID, --taskid TASKID
the taskID of your transfer
-pF PATHFILE, --pathFile PATHFILE
the file listing your files (alternative to -p)

seed:
Options specific to seed

-y YEAR, --year YEAR the year of interest
-n NETWORK, --network NETWORK
the network of interest
-c CHANNEL, --channel CHANNEL
the channel of interest
-s STATION, --station STATION
the station of interest

url:
Options specific to url (mutually exclusive)

-U URL, --url URL the URL of your data
-UF URLFILE, --urlfile URLFILE
the file listing the URL(s) of your data

pid:
Options specific to pid (mutually exclusive)

-P PID, --pid PID the PID of your data
-PF PIDFILE, --pid-file PIDFILE
the file listing the PID(s) of your data
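
As an illustration of how the options above combine, the sketch below builds the argument list for a stage-out by PID. It uses only flags shown in the help text, but which combination the script actually requires is not stated here (check datastager.py -d), and the function name and all values in the usage note are hypothetical.

```python
def stage_out_by_pid_args(pid, user, dst_site, dst_dir):
    """Build the command line for staging out one object identified by PID.

    All flags come from the help text above; the chosen combination is
    an assumption, not a documented recipe.
    """
    return ["datastager.py", "out", "pid",
            "-P", pid,          # the PID of the data
            "-u", user,         # username on globusonline.org
            "--ds", dst_site,   # destination GridFTP server as GO endpoint
            "--dd", dst_dir]    # destination GridFTP directory
```

The resulting list can be passed to subprocess.call, for example subprocess.call(stage_out_by_pid_args("842/example-pid", "myuser", "myuser#hpc", "/scratch/myuser/")), where every value is a placeholder.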

Note 3, for data node admins. The rules to be installed on the server side follow:

# Rules for data staging
getSEED(*path,*year,*network,*channel,*station,*response) {
msiWriteRodsLog("get SEED path associated to: ZONE, GROUP, USER, YEAR, NETWORK, STATION and CHANNEL: $userNameClient, *path *year *network *channel *station ", *status);
msiExecCmd("seedselecter.py","-p *path -y *year -n *network -c *channel -s *station ", "null", "null", "null", *out);
msiGetStdoutInExecCmdOut(*out, *response);
msiWriteRodsLog("search handle response = *response", *status);
}

getPID(*object_path,*response) {
msiWriteRodsLog("get PID associated to: USER, OBJPATH : $userNameClient, *object_path", *status);
msiExecCmd("epicclient.py","os /repo/home/userprod/proirod1/iRODS-3.2/modules/EUDAT-PID/cmd/credentials search URL \"*object_path\"", "null", "null", "null", *out);
msiGetStdoutInExecCmdOut(*out, *response);
msiWriteRodsLog("search handle response = *response", *status);
}

getURL(*object_pid,*response) {
msiWriteRodsLog("get URL associated to: USER, PID : $userNameClient, *object_pid", *status);
msiExecCmd("epicclient.py","os /repo/home/userprod/proirod1/iRODS-3.2/modules/EUDAT-PID/cmd/credentials read --key URL *object_pid", "null", "null", "null", *out);
msiGetStdoutInExecCmdOut(*out, *response);
msiWriteRodsLog("search handle response = *response", *status);
}

 
