[gridengine users] disaster recovery of grid engine setup..

Paul Simpson paul at realisestudio.com
Tue Nov 6 10:47:31 UTC 2012


Following on, we've managed to get a quick replacement system up and
running, so we can deliver. Many thanks.

Two questions while I have your ears/eyes:
- We have recently implemented a rescheduling facility. However, we find
that while the grid restarts a job on a new machine, it fails to properly
kill off the old one. We use one level of wrapper (which does get
terminated), but the child process does not get killed. That leaves an
overstressed machine, which leads to knock-on errors. From reading this
list, we are not alone in suffering from this. Can anyone shed light on
this, in either positive or negative ways? I.e., can/should this work, or
is this a known bug/'feature'?
- We are currently using 6.2u5, but this is rather old. Would anyone
recommend upgrading, and if so, why?

Again, many thanks for your collective help.

Regards,

Paul
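
On the first question above, one possible cause is a wrapper that exits on
a signal without passing it on to the process it started. The following is
a minimal sketch of a wrapper that forwards termination to its direct
child. It is not the wrapper used at this site; the name run_and_forward
and the choice of signals are assumptions for illustration. It assumes the
job is submitted with "qsub -notify", so that a catchable SIGUSR2 arrives
before the final SIGKILL (worth checking against the local qsub man page).
SIGKILL itself cannot be caught, and grandchildren of the child are not
covered here.

```python
import signal
import subprocess
import sys


def run_and_forward(cmd, forward=(signal.SIGTERM, signal.SIGUSR2)):
    """Run cmd and terminate it if this wrapper receives one of `forward`.

    Returns the child's exit status (negative if it died from a signal).
    """
    child = subprocess.Popen(cmd)

    def handler(signum, frame):
        # Pass the termination request on to the direct child.
        # Popen.terminate() is a no-op if the child has already been reaped.
        child.terminate()

    # Install the handler, remembering what was there before.
    previous = {sig: signal.signal(sig, handler) for sig in forward}
    try:
        return child.wait()
    finally:
        # Restore the original handlers.
        for sig, old in previous.items():
            signal.signal(sig, old)


if __name__ == "__main__" and len(sys.argv) > 1:
    rc = run_and_forward(sys.argv[1:])
    # Map "killed by signal N" to the conventional shell status 128 + N.
    sys.exit(rc if rc >= 0 else 128 - rc)
```

Used as the job script's entry point, this would be called as
"wrapper.py real_command args..." (wrapper.py being a hypothetical file
name), so the wrapper only exits after the child has gone.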
On 5 Nov 2012 13:25, "Paul Simpson" <paul at realisestudio.com> wrote:

> many thanks all - we're wading through this now.  what a great community!
> :)
>
>
>
> On 5 November 2012 13:18, MacMullan, Hugh <hughmac at wharton.upenn.edu> wrote:
>
>> Version control: definitely THE way to go Tina! (Adding to my task list).
>> :)
>>
>> On Nov 5, 2012, at 8:02 AM, "Tina Friedrich" <
>> Tina.Friedrich at diamond.ac.uk> wrote:
>>
>> > Hi Paul,
>> >
>> > don't know about everything, but e.g. for complexes - have a look in
>> > the spool directory, there's a 'centry' subdirectory
>> > ("$SGE_ROOT/$SGE_CELL/spool/qmaster/centry" for me). That has an ASCII
>> > file for every complex with all the configuration for it.
>> >
>> > There's likewise a subdirectory 'pe' with the PE configuration,
>> > hostgroups, ...
>> >
>> > Tina
>> >
>> > PS: ...I do all my configuration from files that I keep in subversion
>> > (especially queue config, complex config). I find it makes this sort of
>> > thing lots easier ;)
>> >
>> > On 05/11/12 12:07, Paul Simpson wrote:
>> >> hi grid gurus,
>> >>
>> >> i've had a bad w/end where the disk which stored the db filled up. the
>> >> grid came down and i couldn't fix the db using db_recover -c - which
>> >> meant no grid engine (6.2u5).
>> >>
>> >> we need to get the system back up asap (like yesterday). so, we've
>> >> installed a fresh version which is coming up. however, we've got a load
>> >> of complexes, host groups, share-trees, parallel envs, etc. etc. that i
>> >> can't seem to recover from the old system.
>> >>
>> >> i've looked through all the old dirs - but can't find any text files.
>> >> can anyone suggest how this config information could possibly be
>> >> recovered? typically, this has happened a day before a huge deadline -
>> >> so time is not on our side.
>> >>
>> >> -paul
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> users mailing list
>> >> users at gridengine.org
>> >> https://gridengine.org/mailman/listinfo/users
>> >
>> >
>> > --
>> > Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
>> > Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
>>
>
>
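
Picking up the suggestion in the quoted thread to keep the configuration
in files under version control, here is a rough sketch of dumping the
configuration to text files with qconf while the qmaster is still healthy.
It does not recover anything from a damaged database. The function and
file names are assumptions for illustration; the qconf show/list options
used (-sc, -sstree, -ssconf, -sconf, -shgrpl/-shgrp, -spl/-sp, -sql/-sq)
should be checked against the qconf man page of the installed version.

```python
import pathlib
import subprocess

# Single-object dumps: output file name -> qconf arguments.
SINGLE = {
    "complexes": ["-sc"],
    "sharetree": ["-sstree"],
    "scheduler": ["-ssconf"],
    "global_conf": ["-sconf"],
}
# Per-object dumps: kind -> (flag that lists names, flag that shows one object).
LISTED = {
    "hostgroup": ("-shgrpl", "-shgrp"),
    "pe": ("-spl", "-sp"),
    "queue": ("-sql", "-sq"),
}


def qconf(args):
    """Run qconf and return its stdout ('' if it printed nothing or failed)."""
    result = subprocess.run(["qconf", *args], capture_output=True, text=True)
    return result.stdout


def dump_config(outdir, run=qconf):
    """Write one text file per configuration object into outdir.

    `run` takes a list of qconf arguments and returns the output as a string;
    it is a parameter so the function can be exercised without a live cluster.
    """
    out = pathlib.Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, args in SINGLE.items():
        path = out / name
        path.write_text(run(args))
        written.append(path)
    for kind, (list_flag, show_flag) in LISTED.items():
        for obj in run([list_flag]).split():
            path = out / f"{kind}_{obj}"
            path.write_text(run([show_flag, obj]))
            written.append(path)
    return written
```

Running dump_config("sge-config") from cron and committing the resulting
directory would give a text copy of the complexes, host groups, share tree
and parallel environments to reload from after a reinstall.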