[gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

bergman at merctech.com bergman at merctech.com
Fri Jan 5 15:51:42 UTC 2018

In the message dated: Tue, 02 Jan 2018 09:11:51 +0000,
The pithy ruminations from William Hay on 
<Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged> were:
=> On Fri, Dec 22, 2017 at 05:55:26PM -0500, bergman at merctech.com wrote:
=> > True, but even with that info, there doesn't seem to be any universal
=> > way to tell an arbitrary GPU job which GPU to use -- they all default
=> > to device 0.
=> With Nvidia GPUs we use a prolog script that manipulates lock files
=> to select a GPU then chgrp's the selected /dev/nvidia? file so the group is

Can you provide a copy of the scripts?

I understand the part about the chgrp, but how does the prolog tell an
arbitrary program which GPU to use? My understanding was that software
defaults to GPU #0, and some packages may use a different GPU #, if
they are aware of multiple GPUs and if they accept an option to use a
specified device.

I'm unclear on how the prolog restricts the GPU software (theano,
tensorflow, caffe, FSL, locally-developed code, etc) to a particular
device.

=> the group associated with the job.   An epilog script undoes all of this.  
=> The /dev/nvidia? files permissions are set to be inaccessible to anyone 
=> other than owner(root) and the group.  However you have to pass
=> a magic option to the kernel to prevent permissions from being reset
=> whenever anyone tries to access the device.
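
To check my understanding of the lock-file and chgrp steps, here is a
rough sketch of what I imagine the prolog and epilog do. It is not your
actual script: the function names, the lock directory layout, and the
0660 mode are all my own assumptions. Note that nothing in it tells the
application which device index to use, which is the part I'm still
missing.

```python
import os


def claim_gpu(lock_dir, num_gpus, job_id):
    """Prolog side: atomically claim the first free GPU index.

    Returns the index, or None if every GPU is already locked.
    The lock-file naming (gpu<N>.lock) is hypothetical.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for idx in range(num_gpus):
        lock_path = os.path.join(lock_dir, f"gpu{idx}.lock")
        try:
            # O_EXCL makes creation fail if another prolog holds this lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(job_id))  # record the owner for debugging
        return idx
    return None


def release_gpu(lock_dir, idx):
    """Epilog side: drop the lock so the GPU can be reassigned."""
    try:
        os.remove(os.path.join(lock_dir, f"gpu{idx}.lock"))
    except FileNotFoundError:
        pass  # already released


def restrict_device(idx, gid, dev_template="/dev/nvidia{}"):
    """Hand the device node to the job's group (needs root).

    Assumes the /dev/nvidia<N> naming from the quoted description.
    """
    path = dev_template.format(idx)
    os.chown(path, -1, gid)  # chgrp only; owner stays root
    os.chmod(path, 0o660)    # owner and group only (assumed mode)
```

In this sketch the prolog would call claim_gpu() and then
restrict_device() with the job's group id, and the epilog would reset
the group and call release_gpu().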


Does this affect things like "nvidia-smi" (user-land, accesses all GPUs,
but does not run jobs)?



=> This seems to be a fairly bullet proof way of restricting jobs to
=> their assigned GPU.
=> William
