[gridengine users] building Son of Grid Engine
Tina.Friedrich at diamond.ac.uk
Wed May 22 15:03:29 UTC 2013
On 22/05/13 15:37, Reuti wrote:
>> Hi Reuti,
>>>> Have finally decided to look into upgrading our SGE 6.2 installation - mainly to see if it helps with my job scheduling problem.
>>>> I'm trying to build Son of Grid Engine - succeeded actually.
>>>> Currently trying to make it run / import my old configuration.
>>>> Which mostly worked. Couple of niggles.
>>>> Our setup is SGE_ROOT on shared NFS file system, SGE running as a
>>>> non-root user. I'd quite like to keep it that way (it worked well
>>>> for us).
>>> The real and effective user is not root? I wonder how to change to a
>>> different user during execution then. Often this can be seen:
>>> $ ps -e -o user,ruser,group,rgroup,command
>>> USER RUSER GROUP RGROUP COMMAND
>>> sgeadmin root gridware root /usr/sge/bin/lx24-x86/sge_execd
>> The real and effective user is not root, and never was. Never caused us any problems. The NFS share is exported with root_squash.
> This is quite interesting. And are all jobs running under their own user accounts, or do you use one common user account for all jobs?
Jobs are running as the user that submitted them, yes. No common
account. Been set up like that since we installed it.
Haven't had time to progress with this setup much; is there any
documentation on how the 'builtin' qrsh etc. works? At the moment, my
test installation works and I can submit jobs (and they run), but
interactive sessions don't work - I get a commlib error:
[kdf51254@ws112 ~]$ qrsh
error: commlib error: got read error (closing
Didn't have that problem on my old 6.2 installation :)
>>>> Managed to build & install, got the qmaster running, managed to
>>>> execds. However, at least inst_sge.sh -upd-execd simply refuses to work
>>>> if you're not root, if I remember correctly (not helping!).
>>>> Script(s) sometimes say 'You are not installing as user >root< -
>>>> Can't set the file owner/group and permissions'. It would help if they'd
>>>> tell me (without digging through them) what files they're trying to
>>>> chown/chmod and what they're trying to chown/chmod it to - so I can fix
>>>> that, if there is a problem. Goes for a lot of these sort of errors (to
>>>> do with running as non-root) - if it fails to do something, it would
>>>> really help to know what it failed to do.
>>>> The other thing is that I keep having to run it with -nobincheck,
>>>> far as I can tell simply because I didn't build qmon. Annoying - should
>>>> it not just check for actually required binaries?
>>>> Importing my old installation / upgrading from my old installation
>>>> didn't quite work. Mostly did, it seems, which is something. No error
>>>> that I'd seen during the import/upgrade, but none of my queues are
>>>> there. Host groups are; exec hosts are; complexes look okay; global
>>>> config looks right. PEs aren't there; trying to create the PEs from the
>>>> config files I originally created them from I get 'error: required
>>>> attribute "qsort_args" is missing'. Assume that's the root problem (i.e.
>>>> did not manage to import PEs, thus can't import queues). Anyone else had
>>>> issues with that? Should the save_config script have caught that?
>>> The "qsort_args" is new therein. You dumped the old configuration
>>> using $SGE_ROOT/util/upgrade_modules/save_sge_config.sh? Then it should
>>> work to add just this line to the generated text file for the PEs in the
>>> created directory with the text files.
>> I indeed dumped the config using said script. Was just wondering whether the script was supposed to add a default qsort_args line, or whether the import script should at least warn you that it's missing and will thus not work? (Or the export script tell you?)
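A minimal sketch of the workaround described above: append a qsort_args line to every dumped PE file that lacks one. The directory argument, the helper names and the value NONE are assumptions for illustration, not something the save/import scripts do; check sge_pe(5) and the layout of your own dump before using it.

```python
from pathlib import Path

# Assumed default value; verify against sge_pe(5) for your version.
DEFAULT_LINE = "qsort_args NONE"


def ensure_qsort_args(pe_file: Path, line: str = DEFAULT_LINE) -> bool:
    """Append `line` to pe_file unless qsort_args is already set.

    Returns True if the file was modified.
    """
    text = pe_file.read_text()
    for existing in text.splitlines():
        fields = existing.split()
        if fields and fields[0] == "qsort_args":
            return False  # attribute already present, leave the file alone
    if text and not text.endswith("\n"):
        text += "\n"  # keep the new attribute on its own line
    pe_file.write_text(text + line + "\n")
    return True


def patch_pe_dir(pe_dir: Path) -> list[str]:
    """Patch every regular file in pe_dir (hypothetical location of the
    dumped PE files); return the names of the files that were changed."""
    changed = []
    for pe_file in sorted(pe_dir.iterdir()):
        if pe_file.is_file() and ensure_qsort_args(pe_file):
            changed.append(pe_file.name)
    return changed
```

Call patch_pe_dir() with the directory holding the dumped PE files. The patched files could then be loaded one at a time with qconf -Ap <file>, before the queues that reference those PEs are imported.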
>>>> And now for the important question :). My execds currently are a
>>>> mix of RHEL5 and RHEL6; SoGE got compiled on RHEL6, doesn't work on RHEL5
>>> Do you use the old original execds or the newly compiled one?
>>> If you use the new ones: maybe compiling all on RHEL5 and executing
>>> these on RHEL6 might have better chances to work.
>> I shall try that; I was just wondering if anyone already knows of a way to make them work on both.
>>>> Also, all nodes and the master/shadow hosts get software upgrades
>>>> quite regularly
>>> I would fear that with updates to the nodes all the software you use
>>> also needs to be revalidated, i.e. running the test-suite for all.
>>> Otherwise a change to e.g. a mathematical library may lead to different
>>> results after an update.
>> The cluster node configuration is very similar to our standard workstation(s) - and there is a lot of software people are using on both. A lot of it compiled (and/or written) in house, and in a central location. So the risk of said libraries being out of sync (as it were) with the standard workstation setup (and hence, things that work on workstations not working on the cluster or vice versa) is - to us - much more of a concern. So, cluster nodes get upgraded along with the rest of the estate.
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442