
The page describes how data replication with PID management can be implemented. 

Version: 1.1 Alpha

Author: Willem Elbers (The Language Archive - MPI-TLA)

 

Accessing EPIC Service

A Python script (epicclient.py, available in the SVN repository at https://svn.eudat.eu/EUDAT/Services/DataManagement/PID/iRODS-Integration/Python/) has been developed by BSC to interface with the EPIC API.

Initially the embedPython module was used to call these functions directly from iRODS. However, because of a severe security issue, the embedPython module is no longer used and should not be used.

The epicclient.py script is still used, but it is now called through a shell script with the msiExecCmd microservice.

Implementation

High level approach

An iRODS rule base is in development to implement _scenario 2_ as discussed in pid.discussion.pptx. The implementation requires an exchange of information between the source collection and the target collection. In particular, the target repository must be able to inform the source repository about the PID that was assigned to the replica. This asynchronous communication is realized by writing to special files (.pid): the target repository writes the handle into a .pid file, and the source repository reads the file and updates its own handle accordingly (adding the new location).
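
The exchange described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: a local directory stands in for the iRODS collection, a dictionary stands in for the handle record, and the function names (write_pid_file, read_pid_file, add_location), the example handles and the one-line file layout are hypothetical, not part of the EUDAT rule base.

```python
import os
import tempfile


def write_pid_file(directory, object_name, handle):
    """Target side: store the handle assigned to the replica in a .pid file."""
    path = os.path.join(directory, object_name + ".pid")
    with open(path, "w") as f:
        f.write(handle + "\n")
    return path


def read_pid_file(path):
    """Source side: read the handle that the target wrote."""
    with open(path) as f:
        return f.read().strip()


def add_location(record, replica_handle):
    """Source side: add the replica to the locations of the source record."""
    if replica_handle not in record["locations"]:
        record["locations"].append(replica_handle)
    return record


def demo():
    # Hypothetical handles, used only to show the flow of information.
    source_record = {"handle": "842/source-example", "locations": []}
    with tempfile.TemporaryDirectory() as shared:
        pid_path = write_pid_file(shared, "test.txt", "842/replica-example")
        replica = read_pid_file(pid_path)
    return add_location(source_record, replica)


if __name__ == "__main__":
    print(demo())
```

The two sides never call each other directly; the file is the only point of contact, which is what makes the communication asynchronous.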

Scenario 2 from pid.discussion.pptx:

The implementation requires that the source zone can write in some space of the destination zone. The current implementation is based on "command" files that trigger policies in iRODS via the acPostProcForPut hook. This allows for decoupled communication between the two zones and follows the alternative described in the Issue section.

All required rules for the replication and PID management are defined in a single rule base file. In addition a number of simple C-based microservices are required to provide more flexible options to write to the iRODS log files and handle system timestamps.

Command files

A number of "command" files can be written to trigger certain policies with respect to replication and PID management. The following command files have currently been defined:

  1. *.replicate: triggers a replication
    contains: (pid,source,destination)
      pid: PID of the DO from the parent archive
      source:
      destination:

  2. *.pid.create: triggers a PID creation action
    contains: (pid,destination)
      pid: PID of the DO from the parent archive
      destination:

  3. *.pid.update: triggers a PID update action
    contains: (pid,new_pid)
      pid: PID of the DO from the parent archive
      new_pid:
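
To make the command file definitions above concrete, the following Python sketch maps each file suffix to its field names and parses the content. The on-disk layout is an assumption (a single comma-separated line, optionally wrapped in parentheses), and parse_command is a hypothetical helper, not a rule from the EUDAT rule base.

```python
# Field names per command file suffix, taken from the list above.
COMMAND_FIELDS = {
    ".replicate": ("pid", "source", "destination"),
    ".pid.create": ("pid", "destination"),
    ".pid.update": ("pid", "new_pid"),
}


def parse_command(filename, content):
    """Return (suffix, fields) for a command file name and its content."""
    for suffix, names in COMMAND_FIELDS.items():
        if filename.endswith(suffix):
            # Assumption: one line, comma-separated, optional parentheses.
            values = [v.strip() for v in content.strip().strip("()").split(",")]
            if len(values) != len(names):
                raise ValueError(
                    "expected %d fields, got %d" % (len(names), len(values))
                )
            return suffix, dict(zip(names, values))
    raise ValueError("not a command file: " + filename)
```

A call such as parse_command("test.replicate", "842/abc,/vzMPI/bin/test.txt,/vzMPI-REPLIX/bin/test.txt") would return the suffix together with a dictionary holding pid, source and destination.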

Besides the file-based trigger, a replication can also be started directly via the doReplication() rule. This approach could be used to integrate the triggering of a replication into the community software stack. The same goes for the PID management actions: the createPID() and updatePIDWithNewChild() rules can also be called directly.

Implementation Details

The following sequence diagrams show the currently implemented flow.

  1. Starting a replication by writing a .replicate file.

(the delayed monitor, 1.1.1.4.2.1, is described in the next diagram)

This diagram describes how to trigger a replication, the actual rsync of the object, the PID management in the destination repository (B) and the launch of the update monitor.

2. The delayed monitor launched in the source repository, monitoring the destination archive for a .pid.update file.

(this rule should be started with a "repeat double until success", to keep trying until the .pid.update file is available. If necessary a timeout can be configured as well.)

This diagram describes how a PID update command is triggered, which is used for the notification of the source repository about a new replica location by the destination repository.
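
The retry behaviour of the delayed monitor (keep trying with a doubling delay until the file is available, with an optional timeout) can be sketched in Python as follows. This is an illustration only: it polls a local path rather than an iRODS collection, and wait_for_file and its parameters are hypothetical names, not part of the rule base.

```python
import os
import time


def wait_for_file(path, first_delay=1.0, timeout=None, sleep=time.sleep):
    """Poll until `path` exists, doubling the delay after each failed attempt.

    Returns True when the file is found, and False when the next wait would
    push the accumulated waiting time past `timeout` (in seconds).
    `sleep` can be replaced to test the function without real waiting.
    """
    delay = first_delay
    waited = 0.0
    while not os.path.exists(path):
        if timeout is not None and waited + delay > timeout:
            return False
        sleep(delay)
        waited += delay
        delay *= 2
    return True
```

Doubling the delay keeps the load on the destination zone low when the .pid.update file takes a long time to appear, while the timeout bounds how long a failed replication is monitored.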

 

Issue

While implementing the actual calls to the EPIC API I have run into an issue with the above described workflow related to the last step 1.1.1.4.1.1.4.1.2: updatePID().

After rsyncing the data object, repo A writes the .pid file in repo B as userA. This triggers acPostProcForPut in repo B, which runs with the privileges of the remote user userA. Finally, repo B is supposed to write a .pid file in repo A, but the remote userA running the rules in repo B is not allowed to do this in repo A. iRODS fails with status = -39000 (SYS_PROXYUSER_NO_PRIV).

If this proves to be a problem, we might go for an alternative where repo B writes the last .pid file in repo B, in a collection that is monitored by repo A with a delayed periodic rule.

Any ideas or suggestions? For some reason this wasn't a problem in my test environment. I am not exactly sure why, but the user configuration there is quite relaxed in contrast to a more production like environment.

Alternative

The original approach has been updated to reflect the implementation of this alternative.

I've implemented an alternative where repo A starts to monitor a location in repo B. This seems to work and has the advantage, as pointed out by Jedrzej, that repo B does not need write access in repo A. I will be working on finalizing this set of policies towards the end of the week so that the other communities can test them as well.

Installation and Configuration

Available in SVN: https://svn.eudat.eu/EUDAT/Services/DataManagement/PID/EUDAT-PID/ : branches/release-1.0

These installation notes are also included in the module (install.txt)

Dependencies 

  1. Epic API client script
    1. requires a EPIC account
  2. Epic API shell wrapper scripts
    1. to be called via msiExecCmd
  3. Microservices
    1. EUDAT module
      1. msiWriteToLog
      2. msiBytesBufToStr
    2. guinot module (only for iRODS version < 3.1)
      1. msiGetFormattedSystemTime

Installation

The install.txt in the checked out svn has the latest installation instructions.

1. Compile modules

Enable the modules in "<iRODS>/modules/EUDAT-PID/info.txt" and "<iRODS>/modules/guinot/info.txt" and re-run irodssetup

OR

Enable the modules and recompile manually

enable modules and recompile
./scripts/configure --enable-EUDAT-PID
./scripts/configure --enable-guinot
# then recompile iRODS
make clean
make

2. Install rulebase

Create a symbolic link to the eudat rulebase

rule base symbolic link
ln -s <irods>/modules/EUDAT-PID/rulebase/eudat-v1-beta.re <irods>/server/config/reConfigs/eudat.re

Edit <irods>/server/config/server.config and append ",eudat" to "reRuleSet" 

(make sure to include the comma and no spaces)
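
The edit to server.config can also be scripted. The Python sketch below appends a rule set name to the reRuleSet entry without introducing spaces and without adding it twice. It assumes that server.config holds one whitespace-separated "key value" pair per line (for example "reRuleSet   core"); append_rule_set is a hypothetical helper, so check the result against your own server.config before using it.

```python
def append_rule_set(config_text, name="eudat"):
    """Append `name` to the reRuleSet line of a server.config text, once."""
    out = []
    for line in config_text.splitlines():
        parts = line.split()
        # Assumption: the entry looks like "reRuleSet   core[,other...]".
        if len(parts) == 2 and parts[0] == "reRuleSet":
            rule_sets = parts[1].split(",")
            if name not in rule_sets:
                rule_sets.append(name)
            line = "reRuleSet   " + ",".join(rule_sets)
        out.append(line)
    return "\n".join(out) + "\n"
```

The function works on text rather than on the file itself, so the caller decides how to read, back up and write the configuration.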

3. Configure iRODS hooks

Edit the <irods>/server/config/reConfigs/core.re file and add the following two acPostProcForPut hooks:

iRODS system hooks
acPostProcForPut {
  ON($objPath like "\*.replicate") {
    processReplicationCommandFile($objPath);
  }
} 

acPostProcForPut {
  ON($objPath like "\*.pid.\*") {
    processPIDCommandFile($objPath);
  }
}
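
To check which object paths would be routed to which rule, the two patterns can be mimicked in Python. This is an approximation: it uses shell-style matching from the standard fnmatch module, which is assumed here to behave like the iRODS "like" operator for these simple wildcard patterns, and classify is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def classify(obj_path):
    """Return the name of the rule a path would be routed to, if any."""
    if fnmatchcase(obj_path, "*.replicate"):
        return "processReplicationCommandFile"
    if fnmatchcase(obj_path, "*.pid.*"):
        return "processPIDCommandFile"
    return None
```

Under this approximation, test.pid.create and test.pid.update are both handled by processPIDCommandFile, while a plain test.pid file (with nothing after ".pid") matches neither pattern.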

Properly configure the default resource in <irods>/server/config/reConfigs/core.re

4. Install the scripts

shell scripts
ln -s /srv/irods/current/modules/EUDAT-PID/cmd/epicclient.py ../../server/bin/cmd/epicclient.py

Check the permissions on the script and make sure it is executable by the irods user: chmod u+x cmd/epicclient.py

Update the python credential file with the proper credentials: modules/EUDAT-PID/cmd/credential_example

5. Shared space for command files

Create a shared space in all zones, as configured in the getSharedCollection function of the eudat.re rulebase.
- defaults to "<zone>/replicate"
- make sure all users involved in the replication can write in this collection.

Usage example

By calling the triggerReplication rule a .replicate file will be written in the specified location.

Replication example
replicate {
    triggerReplication(*path,*pid,*source,*destination);
}
INPUT *pid="842/07cc0858-edb9-11e1-a27d-005056ae635a",*source="/vzMPI/bin/test.txt",*destination="/vzMPI-REPLIX/bin/test.txt",*path="/vzMPI/replicate/test.replicate"
OUTPUT ruleExecOut
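
When replications are triggered from community software, the INPUT line of such a rule file can be generated instead of written by hand. The Python sketch below is one possible way to do this; command_file_path and irule_input are hypothetical helpers, and the path layout assumes the default shared collection "<zone>/replicate" mentioned above.

```python
def command_file_path(zone, name):
    """Path of a .replicate file in the default shared collection of `zone`."""
    return "/%s/replicate/%s.replicate" % (zone, name)


def irule_input(**params):
    """Format keyword arguments as an INPUT line, in the order given."""
    pairs = ['*%s="%s"' % (key, value) for key, value in params.items()]
    return "INPUT " + ",".join(pairs)


if __name__ == "__main__":
    print(irule_input(
        pid="842/07cc0858-edb9-11e1-a27d-005056ae635a",
        source="/vzMPI/bin/test.txt",
        destination="/vzMPI-REPLIX/bin/test.txt",
        path=command_file_path("vzMPI", "test"),
    ))
```

Run as a script, this prints the same INPUT line as in the example above.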
