[gridengine users] error with qsub -sync y
Brad.Dobbie at caviumnetworks.com
Mon Oct 10 13:12:04 UTC 2011
We occasionally see this at our site, and it is pretty catastrophic when it happens. Most users have switched to qrsh to avoid the bug. I agree that it happens in batches, for periods of 5-10 minutes. I first ran into it about a year ago, and more posts describing the problem have started to show up on Google since then.
I tried adding the "-t 1" option to define a range_list, but that did not help.
Most recently, I tried stopping the qmaster and starting it again. That seemed to fix the problem, but it is not a very good solution. We probably won't see the issue again for months. It seems to depend on network load or NFS load, but I have no data to back up that claim.
Our cluster uses RHEL5.4 and SGE 6.2u5.
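For anyone who has to stay on qsub -sync y, a retry wrapper may be enough to ride out a bad period, since the failures seem to come in batches of 5-10 minutes. Below is a minimal Python sketch, not something we run in production. It assumes that qsub prints the error on stderr and that no job has been submitted when the error appears (if that is wrong, a retry could submit the job twice). The names run_qsub and submit_with_retry, the attempt count, and the delay are all made up for illustration.

```python
import subprocess
import time

# Error text as quoted in this thread (spelling is as printed by qsub).
TRANSIENT_ERROR = "range_list containes no elements"


def run_qsub(args):
    """Run `qsub -sync y <args>` and return (exit_code, stderr_text)."""
    proc = subprocess.run(
        ["qsub", "-sync", "y", *args],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


def submit_with_retry(args, runner=run_qsub, max_attempts=5,
                      delay_seconds=60.0, sleep=time.sleep):
    """Retry the submission while the transient error shows up on stderr.

    Assumption: the error means the job was NOT submitted, so retrying is safe.
    Returns (exit_code_of_last_attempt, number_of_attempts_made).
    """
    code = 1
    for attempt in range(1, max_attempts + 1):
        code, err = runner(args)
        if TRANSIENT_ERROR not in err:
            # Either success or a different failure; do not retry those.
            return code, attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    return code, max_attempts
```

Usage would be something like submit_with_retry(["myjob.sh"]), where myjob.sh is a placeholder for your own job script. With the defaults the wrapper waits up to about four minutes in total, so the delay may need to be raised to cover a full 5-10 minute bad period.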
On Oct 8, 2011, at 1:25 PM, Daniel Povey wrote:
> I have been getting occasional errors when using qsub -sync y. It prints out the error message:
> Unable to initialize environment because of error: range_list containes no elements
> This is not reproducible, but seems to occur in batches. This is with GE 6.2R5.
> Looking for this online gives little information that is useful-- it seems to be a bug in qsub.
> Is anyone familiar? What is the best way to debug this? I don't have root on the machines concerned.