
Please share your experiences, performance results, makefiles, etc. here. You can add your comments to the end of this page (add a suitable subheader) or create a child page from the upper right corner menu.

 

The queue system

Torbjörn:

I think the 6 hour limit works really well. My estimates of when a job would start were suddenly about right, even with a rather large number of jobs running. There are still significant margins of error, of course, but this is far better than the typical best guess at Louhi, which for a large job is something like "will probably start within two weeks, plus or minus two weeks". It will of course get worse in production, but even so you can be reasonably sure that if you haven't run much lately, your job will start in about six hours. Large jobs also seem to start reasonably quickly, even when there are a lot of short jobs in line. I like these settings.

Atte:

Thanks for the comment! Short queues make fair share really work, but restarting jobs often causes some unwanted manual work, and based on our tour of the universities there are still a number of codes that can't be run efficiently (or at all) in short queues. In production the queues will likely be longer, with preference given to large jobs.


4 Comments

  1. I also prefer short limits on queues. This also seems to be the policy at Tier-0 centers, though there the limit seems to be 12 or 24 hours. For me, 6h actually works fine.

    I guess that most of the codes which cannot be "checkpointed" (by hand) actually do not use very many cores. Thus, perhaps only a special queue for small jobs should have a long runtime limit.
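To illustrate what checkpointing "by hand" under a short queue limit can look like, here is a minimal sketch. It is not taken from any code discussed on this page: the file name `state.json`, the function names, and the toy computation are all assumptions made for the example.

```python
import json
import os
import time

CHECKPOINT = "state.json"  # hypothetical checkpoint file name


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file first, then rename, so a killed job
    cannot leave a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, budget_s, path=CHECKPOINT, every=100):
    """Advance a toy computation until it finishes or the time budget is spent."""
    state = load_state(path)
    start = time.monotonic()
    while state["step"] < n_steps:
        if time.monotonic() - start > budget_s:
            break  # stop before the queue limit would kill the job
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state
```

In a batch job, `budget_s` would be set somewhat below the queue limit (for example 5.5 hours for a 6 hour queue), and the same job would be resubmitted until `state["step"]` reaches `n_steps`. How much margin is needed depends on how long one step and one checkpoint write take in the real code.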

     

  2. Would it be possible to get some form of development queue, where short 10-15 min jobs on a few nodes would get high priority? There have been times during the initial pilot user phase when it was very frustrating to debug and test things, because you had to wait days for short runs to complete.

    1. Thanks for the comment. A test/development queue, gputest, has been added to taito-gpu:

       

      [GPU-Env ~]$ sinfo -p gputest
      PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
      gputest    up     15:00      2      idle   g[46-47]


      1. Hi,

        I also want to note that, even though there is now a small test queue, we ask users not to monopolize the machine with long jobs, even if those are currently possible. It is a PRACE prototype, and its most important function is to let users test their codes and try to improve scaling etc. If you run jobs with lots of nodes, please keep the runs short so that others also get to try the resource with reasonable queuing times.

        Cheers,

        Atte