The Finnish Grid Infrastructure is coming to an end. In <insert date here> it will be replaced by FGCI.
- Inform users about this change.
- Decrease the maxtime of the grid queues to 30 minutes in order to drain Grid jobs from all clusters.
- Remove all FGI endpoints from GOCDB.
Detailed procedure below.
There are various reasons to decommission FGI, notably:
- Hardware is old and unsupported;
- FGI and FGCI aren't fully compatible, so properly maintaining two software stacks considering the available resources isn't realistic;
- Decommissioning FGI will allow us to center everyone's attention on FGCI.
As soon as possible, and no later than 31.12.2016.
Each FGI site admin is responsible for erasing FGI endpoints before the deadline.
In order to decommission FGI we'll follow EGI's decommissioning procedure (PROC12 Production Service Decommissioning).
Service Centre decommissioning steps
- Actions tagged RC are the responsibility of the Resource Centre Operations Manager (site admins).
- Actions tagged RP are the responsibility of the Resource Infrastructure Operations Manager (CSC admins).
- Actions tagged OC are the responsibility of the Operations Centre.
- Stateful service example: Storage Elements.
- The Resource Centre Operations Manager opens a GGUS ticket, which will be used as parent ticket to track the whole process. The ticket must remain in an open status until the service is removed from GOCDB. The ticket has to be assigned to the Resource Infrastructure Operations Centre (NGI).
- The Resource Centre Operations Manager contacts the Resource Provider regional staff, communicating the decommissioning plan for the service.
- The Resource Centre Operations Manager announces through the broadcast tool to NGI_FI that it is starting the decommissioning procedure:
- Announce a detailed timeline for the decommissioning, and that the Resource Centre Manager will start a downtime of the service to prevent any further usage. The timeline must clearly list the deadlines for the VO Managers' actions.
- The timeline is recorded in the parent ticket.
- The broadcast link is recorded in the parent ticket.
- The downtime should start no earlier than 15 days and no later than one month after the broadcast.
- State that the aim is to remove the service in XX weeks (min 6 weeks for stateful services - e.g. Storage elements).
- [If the service is a storage or data management service] After the announcement of the service decommissioning, the Resource Centre MAY disable VO write access to prevent further VO activity, except for infrastructure VOs. (If selective permissions are not possible, the service must remain writable until the beginning of the downtime.)
- [If the service is a CE or a workload management service] After the announcement of the service decommissioning, the Resource Centre MAY disable VO job submissions to prevent further VO activity, except for monitoring jobs (set the maxtime of the grid queue to 30 minutes):
|[root@aesyle-install ~]# scontrol update PartitionName=grid MaxTime=30:00|
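A site with more than one grid partition may want to generate the `scontrol` command above for each of them and review the result before running anything. The following is a minimal sketch under that assumption: the helper names `build_maxtime_command` and `plan_drain` are hypothetical, and the script only prints the commands, it does not execute them.

```python
import shlex


def build_maxtime_command(partition="grid", maxtime="30:00"):
    """Return the argument list for capping a Slurm partition's MaxTime.

    The defaults reproduce the command shown in this procedure:
    scontrol update PartitionName=grid MaxTime=30:00
    """
    if not partition or any(ch.isspace() for ch in partition):
        raise ValueError("partition name must be non-empty and contain no whitespace")
    return ["scontrol", "update", f"PartitionName={partition}", f"MaxTime={maxtime}"]


def plan_drain(partitions, maxtime="30:00"):
    """Return one printable command line per partition (dry run only)."""
    return [shlex.join(build_maxtime_command(p, maxtime)) for p in partitions]


if __name__ == "__main__":
    # "grid" is the partition name used in this procedure; any other
    # name passed here would be site-specific.
    for line in plan_drain(["grid"]):
        print(line)
```

Printing instead of executing keeps the sketch safe to run on any machine; the output can be pasted into a root shell on the cluster's install node once checked.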
- [If the service is a storage element] In the time between the announcement of the decommissioning and the beginning of the downtime, the VO Manager SHOULD check the volume of data stored by the VO at the site. If it is big enough to require more than one month to be moved, the VO Manager can ask to reschedule the downtime period.
- If no communications are sent to the Resource Centre by the first week of downtime the schedule can be considered agreed by all VO Managers.
- If there are multiple SEs being decommissioned together, the total amount of data to move could be bigger, and VOs may be informed about that.
- Any request to reschedule MUST be supported by technical reasons (e.g. total amount of data to move / site max data transfer throughput).
- [If the service is a central service like VOMS or LFC for a given VO] VO Manager, Resource Centre Operations Manager and Resource Infrastructure Operations Manager should discuss finding a new Resource Centre for hosting these services, taking into account pre-existing agreement between VO and NGI. For international VOs, this discussion could be held at the EGI level, especially if a solution cannot be easily found within that Resource Infrastructure Provider.
- According to the dates announced in the broadcast or differently agreed in step 4, the Resource Centre puts the service in downtime to prevent any further usage. This downtime shall last for the scheduled period or until phase 5 is over, whichever is shorter.
- The downtime must be recorded in the parent ticket.
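The timing rules from the announcement step (downtime no earlier than 15 days and no later than one month after the broadcast, and at least 6 weeks before a stateful service is removed) can be checked with a few lines of date arithmetic. This is a sketch only: it reads "one month" as 30 days and counts the 6 weeks from the broadcast date, both of which are assumptions, and the function names are hypothetical.

```python
from datetime import date, timedelta

MIN_DAYS_AFTER_BROADCAST = 15
MAX_DAYS_AFTER_BROADCAST = 30   # assumption: "one month" taken as 30 days
MIN_WEEKS_STATEFUL = 6          # assumption: counted from the broadcast date


def downtime_window(broadcast):
    """Return (earliest, latest) allowed start dates for the downtime."""
    earliest = broadcast + timedelta(days=MIN_DAYS_AFTER_BROADCAST)
    latest = broadcast + timedelta(days=MAX_DAYS_AFTER_BROADCAST)
    return earliest, latest


def is_valid_downtime_start(broadcast, start):
    """True if the proposed downtime start falls inside the allowed window."""
    earliest, latest = downtime_window(broadcast)
    return earliest <= start <= latest


def earliest_stateful_removal(broadcast):
    """Earliest removal date for a stateful service such as a Storage Element."""
    return broadcast + timedelta(weeks=MIN_WEEKS_STATEFUL)


if __name__ == "__main__":
    broadcast = date(2016, 11, 1)  # hypothetical broadcast date
    print(downtime_window(broadcast))
    print(earliest_stateful_removal(broadcast))
```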
If the service is a stateful service containing VO data:
- Once the service is in downtime and closed for write access (if possible), the Resource Centre Operations Manager opens one child ticket of the procedure's parent ticket for each of the N VO Managers of the N VOs the service supports.
- The VOs are given up to the amount of time agreed in step 4 to retrieve their data from the service being decommissioned. During this period, the Resource Centre should make sure that the service works for the different VOs to allow them to migrate their data. The VO Managers can state any specific requirements in their child ticket. For instance:
- Request from the Resource Centre Operations Manager, in the child ticket, the time limit needed to retrieve the data.
- (If the service is an SE) Request from VO central services admins the list of LFNs/DNs still having SURLs on SEs at that Resource Centre.
- The VO Manager MUST notify the Resource Centre, if possible through the GGUS child ticket, when the data transfer is completed.
- If the service's data cannot be migrated using the user interface (e.g. if there is the need to have access to a database dump) the Resource Centre administrators should cooperate with the VO Managers.
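A reschedule request has to be backed by technical reasons such as the total amount of data to move divided by the site's maximum transfer throughput. The sketch below shows that arithmetic. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 10^9 bit/s), a 30-day reading of "one month", and an optional `efficiency` factor for the share of the link actually usable; all names and figures are illustrative, not measured values.

```python
SECONDS_PER_DAY = 86400


def transfer_days(total_tb, throughput_gbit_s, efficiency=1.0):
    """Estimate the days needed to move total_tb terabytes.

    throughput_gbit_s is the site's maximum transfer rate in Gbit/s;
    efficiency (0 < efficiency <= 1) scales it to a sustained rate.
    """
    if total_tb < 0 or throughput_gbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid volume, throughput, or efficiency")
    total_bits = total_tb * 1e12 * 8
    bits_per_second = throughput_gbit_s * 1e9 * efficiency
    return total_bits / bits_per_second / SECONDS_PER_DAY


def needs_reschedule(total_tb, throughput_gbit_s, window_days=30, efficiency=1.0):
    """True if the estimated transfer time exceeds the downtime window."""
    return transfer_days(total_tb, throughput_gbit_s, efficiency) > window_days


if __name__ == "__main__":
    # Hypothetical figures: 100 TB over a 1 Gbit/s link.
    print(round(transfer_days(100, 1), 2), needs_reschedule(100, 1))
```

When several SEs are decommissioned together, the volumes can be summed before calling `transfer_days`, since they share the same site throughput.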
If the service does not contain user/state persistent data (e.g. CE):
- Once the service is in downtime, the interface can be closed to prevent users from starting new tasks on the service, while still allowing them to retrieve the output of tasks submitted before the beginning of the downtime.
- At the end of the scheduled downtime period or when step 6 is completed and validated:
- The service is set to "production=N" "monitored=N" in the GOCDB.
- Once the service disappears from Nagios, it must be removed from the Resource Centre GIIS (e.g. Site-BDII).
- The downtime is terminated.
- All these actions must be recorded in the parent ticket.
- At this point the service is no longer listed in the top-BDIIs of EGI. If hardware is closed down, the Resource Centre will need to address this, possibly informing these users that their data could be at risk.
- Logs are to be kept at the Resource Centre and remain available for the period required by the Grid Security Traceability and Logging Policy (90 days) after the service has been removed from the Resource Centre GIIS and its public interfaces are no longer accessible to users. In case of inquiries related to security incidents, this period could be extended. Note: if the logs are saved elsewhere, the service's hardware can be disposed of.
- Service is removed from GOCDB.
- This action must be recorded in the parent ticket.
- Parent ticket is closed.
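The 90-day log retention period mentioned above can be turned into a concrete date once the service has been removed from the Resource Centre GIIS. A minimal sketch, assuming the period is counted in calendar days from the removal date; the function name and the `extension_days` parameter (for security-incident extensions) are hypothetical.

```python
from datetime import date, timedelta

LOG_RETENTION_DAYS = 90  # from the Grid Security Traceability and Logging Policy


def logs_retained_until(removed_from_giis, extension_days=0):
    """Return the date until which the service logs must stay available."""
    if extension_days < 0:
        raise ValueError("extension_days must not be negative")
    return removed_from_giis + timedelta(days=LOG_RETENTION_DAYS + extension_days)


if __name__ == "__main__":
    # Hypothetical removal date, chosen to match the decommissioning deadline.
    print(logs_retained_until(date(2016, 12, 31)))
```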