[gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged

William Hay w.hay at ucl.ac.uk
Mon Jan 8 09:44:35 UTC 2018


On Fri, Jan 05, 2018 at 10:51:42AM -0500, bergman at merctech.com wrote:
> In the message dated: Tue, 02 Jan 2018 09:11:51 +0000,
> The pithy ruminations from William Hay on 
> <Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged> were:
> => On Fri, Dec 22, 2017 at 05:55:26PM -0500, bergman at merctech.com wrote:
> => > True, but even with that info, there doesn't seem to be any universal
> => > way to tell an arbitrary GPU job which GPU to use -- they all default
> => > to device 0.
> => 
> => With Nvidia GPUs we use a prolog script that manipulates lock files
> => to select a GPU then chgrp's the selected /dev/nvidia? file so the group is
> 
> Can you provide a copy of the scripts?

The scripts themselves are not well written.  They could easily be
improved, but as they work for us we haven't bothered.  I really wouldn't
suggest anyone else use them.

> 
> I understand the part about the chgrp, but how does the prolog tell an
> arbitrary program which GPU to use? My understanding was that software
> defaults to GPU #0, and some packages may use a different GPU #, if
> they are aware of multiple GPUs and if they accept an option to use a
> specified device.

So far all the software we've seen defaults to using either the first
gpu it can access or all the gpus it can access.  If you prevent it from
accessing gpu #0 by tweaking the permissions on the file in /dev it will
try to use gpu #1 instead.
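
As a rough illustration of the point above, the following sketch lists
which device files the current user could open.  The function name and
the can_access parameter are made up for this example; only os.access
and the /dev/nvidia? naming come from real systems.  Note that os.access
always reports success for root, so this is only meaningful when run as
the job's user.

```python
import os


def accessible_devices(paths, can_access=None):
    """Return the device paths the current user may open read/write.

    can_access is a hypothetical hook so the check can be replaced in
    tests; by default it asks the OS via os.access.
    """
    if can_access is None:
        def can_access(path):
            return os.access(path, os.R_OK | os.W_OK)
    return [path for path in paths if can_access(path)]


# Example use on a GPU node (not executed here):
#   import glob
#   print(accessible_devices(sorted(glob.glob("/dev/nvidia[0-9]*"))))
```

If the prolog has restricted /dev/nvidia0 to another job's group, a list
like this would be expected to omit it.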

A bit of googling around
(https://devtalk.nvidia.com/default/topic/754848/cuda-errors-when-permissions-on-dev-nvidia-are-not-666/)
suggests you might also be able to use the environment variable
CUDA_VISIBLE_DEVICES, although it can be overridden by users and doesn't
interact well with direct permission twiddling.
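
For anyone who wants to try the environment-variable route, a minimal
sketch follows.  It is not something we run; env_for_gpus is a
hypothetical helper, and the only real name in it is CUDA_VISIBLE_DEVICES
itself, which takes a comma-separated list of device indices.

```python
import os


def env_for_gpus(indices, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    indices is an iterable of GPU index numbers; base_env defaults to
    the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return env


# Example use (not executed here; "./my_gpu_job" is a placeholder):
#   import subprocess
#   subprocess.run(["./my_gpu_job"], env=env_for_gpus([1]))
```

Since the job can simply reset the variable, this is advisory only and
does not replace restricting the device files.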



> 
> I'm unclear on how the prolog restricts the GPU software (theano,
> tensorflow, caffe, FSL, locally-developed code, etc) to use of a
> particular device.

By changing the permissions on that device so they don't have access to it.
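
To make the idea concrete, here is a sketch of the lock-file-plus-chgrp
approach.  This is NOT our prolog; every function name, the lock
directory layout and the 0660 mode are assumptions chosen for
illustration.  It relies only on standard calls: os.open with
O_CREAT|O_EXCL to claim a lock atomically, and os.chown/os.chmod to hand
the device to the job's group.

```python
import os


def claim_gpu(lock_dir, gpu_count):
    """Return the index of the first GPU we could lock, or None."""
    for index in range(gpu_count):
        lock_path = os.path.join(lock_dir, "gpu%d.lock" % index)
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists,
            # so two prologs cannot claim the same GPU.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            continue
        os.close(fd)
        return index
    return None


def release_gpu(lock_dir, index):
    """Remove the lock file; an epilog would call something like this."""
    os.remove(os.path.join(lock_dir, "gpu%d.lock" % index))


def grant_to_group(device_path, gid):
    """Give the job's group access to the device (needs root)."""
    os.chown(device_path, -1, gid)   # -1 leaves the owner unchanged
    os.chmod(device_path, 0o660)     # owner and group only, no "other"
```

A prolog along these lines would call claim_gpu, then grant_to_group on
the matching /dev/nvidia? file with the job's group id; the epilog would
reverse both steps.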



> => the group associated with the job.   An epilog script undoes all of this.  
> => The /dev/nvidia? files permissions are set to be inaccessible to anyone 
> => other than owner(root) and the group.  However you have to pass
> => a magic option to the kernel to prevent permissions from being reset
> => whenever anyone tries to access the device.
> 
> Details?

In this post:
http://gridengine.org/pipermail/users/2017-February/009581.html

> 
> Does this affect things like "nvidia-smi" (user-land, accesses all GPUs,
> but does not run jobs)?

It should.  If you want it to scan everything (for load sensor
purposes or otherwise) you can run it as root or the user owning the
/dev/nvidia? files if that isn't root.
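
As a sketch of what such a scan might look like when run as root, the
following parses per-GPU utilization from nvidia-smi.  The function names
are hypothetical, and this only covers collecting the numbers; turning
them into a load sensor report is left out.  It assumes an nvidia-smi
that supports the --query-gpu and --format=csv options.

```python
import subprocess

# Ask for one "index, utilization" line per GPU, without header or units.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Map GPU index to utilization percent from nvidia-smi CSV output."""
    result = {}
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue
        index, util = fields
        # Skip GPUs that report a non-numeric value such as "[N/A]".
        if index.isdigit() and util.isdigit():
            result[int(index)] = int(util)
    return result


def query_utilization():
    """Run nvidia-smi and return the parsed utilization map."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_utilization(out)
```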

William