[gridengine users] Fwd: eqw for qsub jobs

Reuti reuti at staff.uni-marburg.de
Wed Sep 28 15:41:41 UTC 2016


> On 28.09.2016 at 17:06, Dan Hyatt <dhyatt at dsgmail.wustl.edu> wrote:
> 
> Thanks,
> 
> What you said suggests it is something the user is doing. But she says some of the jobs are working and some are being dumped "because it's full".

Maybe with "full" she refers to the disk space on the nodes and not to any output of SGE.
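
If disk space is the suspect, a quick check on an execution host could look like the sketch below. The helper name `free_fraction`, the choice of the temp directory and the 5% threshold are assumptions for illustration only, not anything SGE defines.

```python
import shutil
import tempfile

def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    return usage.free / usage.total if usage.total else 0.0

# Hypothetical choice: the temp directory. Substitute the job's output
# or scratch directory on the execution host.
path = tempfile.gettempdir()
frac = free_fraction(path)
# The 5% threshold is an arbitrary illustration, not an SGE setting.
status = "nearly full" if frac < 0.05 else "ok"
print(f"{path}: {frac:.1%} free ({status})")
```

Running this on the node where a failed job was dispatched, against the directory the job writes to, would confirm or rule out a full filesystem.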

-- Reuti


> On 09/28/2016 09:41 AM, Chris Dagdigian wrote:
>> 
>> I think the "queue instance dropped because ... full" is not related to your user/job problem. The dropped message is a sign from the job placement process that the queue instance was skipped during the active host select-and-job-dispatch round because it had no more job slots free to take new work. This would be a normal status alert on an active cluster with lots of jobs in 'qw' state. No big deal basically unless you think a resource, quota or some other thing is interfering.
>> 
>> State "Eqw" usually means that something went badly wrong with a job. It often points to a significant issue, such as the user's UID/GID not existing on the execution host, or it could be as simple as user error in a script (permission denied, path not found, etc.).
>> 
>> What does "qstat -j <jobID>" tell you about the jobs in Eqw state? Any interesting spool logs from the compute nodes or qmaster?
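
With this many jobs in Eqw, the "qstat -j" calls can be generated from the listing instead of typed by hand. The following is a minimal sketch that assumes the column layout of the qstat listing quoted in this thread (job ID first, state fifth); the function name `eqw_job_ids` is made up for illustration.

```python
# Sample rows copied from the qstat listing in this thread.
SAMPLE = """\
1144122 0.55500 sas64      username       Eqw   09/27/2016 22:54:45                                    1
1144125 0.55500 sas64      username       Eqw   09/27/2016 22:55:35                                    1
1144500 0.55500 sas64      username       Eqw   09/27/2016 23:37:21                                    1
"""

def eqw_job_ids(qstat_text):
    """Return the job IDs whose state column is exactly 'Eqw'."""
    ids = []
    for line in qstat_text.splitlines():
        # Drop mail quote markers such as '>>> ' before splitting.
        fields = line.lstrip("> ").split()
        # Assumed columns: job-ID, prior, name, user, state, date, time, slots.
        if len(fields) >= 5 and fields[0].isdigit() and fields[4] == "Eqw":
            ids.append(fields[0])
    return ids

for job_id in eqw_job_ids(SAMPLE):
    # Print the command rather than run it, so this works without SGE installed.
    print(f"qstat -j {job_id}")
```

The output of each "qstat -j" call should include the error reason for that job. Once the cause is fixed, the error state can typically be cleared with "qmod -cj <jobID>".
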
>> 
>> Chris
>> 
>> 
>> 
>> 
>> Dan Hyatt wrote:
>>> 
>>> I am trying to narrow down what would cause this. I searched Google and the SGE resources and could not find a reason for
>>> 
>>>  queue instance "VeryHighMem@blade5-5-8" dropped because it is full
>>>  queue instance "HighMem@blade5-1-4" dropped because it is full
>>> 
>>> This is that one user almost every shop has: incredible at her work, but the cause of about 90% of the technical problems because of bad choices.
>>> 
>>> 
>>> Why would SGE queue the jobs for everyone else but suddenly drop this user's jobs "because it is full"?
>>> 
>>> Lots of jobs went to "Eqw", as shown in the following:
>>> 1144122 0.55500 sas64      username       Eqw   09/27/2016 22:54:45                                    1
>>> 1144125 0.55500 sas64      username       Eqw   09/27/2016 22:55:35                                    1
>>> 1144127 0.55500 sas64      username       Eqw   09/27/2016 22:56:25                                    1
>>> 1144130 0.55500 sas64      username       Eqw   09/27/2016 22:57:15                                    1
>>> 1144134 0.55500 sas64      username       Eqw   09/27/2016 22:58:05                                    1
>>> 1144139 0.55500 sas64      username       Eqw   09/27/2016 22:58:55                                    1
>>> 1144142 0.55500 sas64      username       Eqw   09/27/2016 22:59:46                                    1
>>> 1144145 0.55500 sas64      username       Eqw   09/27/2016 23:00:36                                    1
>>> 1144151 0.55500 sas64      username       Eqw   09/27/2016 23:01:26                                    1
>>> 1144156 0.55500 sas64      username       Eqw   09/27/2016 23:02:16                                    1
>>> 1144161 0.55500 sas64      username       Eqw   09/27/2016 23:03:06                                    1
>>> 1144165 0.55500 sas64      username       Eqw   09/27/2016 23:03:56                                    1
>>> 1144169 0.55500 sas64      username       Eqw   09/27/2016 23:04:46                                    1
>>> 1144174 0.55500 sas64      username       Eqw   09/27/2016 23:05:36                                    1
>>> 1144177 0.55500 sas64      username       Eqw   09/27/2016 23:06:26                                    1
>>> 1144182 0.55500 sas64      username       Eqw   09/27/2016 23:07:17                                    1
>>> 1144186 0.55500 sas64      username       Eqw   09/27/2016 23:08:07                                    1
>>> 1144196 0.55500 sas64      username       Eqw   09/27/2016 23:08:57                                    1
>>> 1144204 0.55500 sas64      username       Eqw   09/27/2016 23:09:47                                    1
>>> 1144212 0.55500 sas64      username       Eqw   09/27/2016 23:10:37                                    1
>>> 1144217 0.55500 sas64      username       Eqw   09/27/2016 23:11:27                                    1
>>> 1144221 0.55500 sas64      username       Eqw   09/27/2016 23:12:17                                    1
>>> 1144224 0.55500 sas64      username       Eqw   09/27/2016 23:13:08                                    1
>>> 1144225 0.55500 sas64      username       Eqw   09/27/2016 23:13:58                                    1
>>> 1144227 0.55500 sas64      username       Eqw   09/27/2016 23:14:48                                    1
>>> 1144232 0.55500 sas64      username       Eqw   09/27/2016 23:15:38                                    1
>>> 1144236 0.55500 sas64      username       Eqw   09/27/2016 23:16:28                                    1
>>> 1144244 0.55500 sas64      username       Eqw   09/27/2016 23:17:18                                    1
>>> 1144255 0.55500 sas64      username       Eqw   09/27/2016 23:18:09                                    1
>>> 1144265 0.55500 sas64      username       Eqw   09/27/2016 23:18:59                                    1
>>> 1144276 0.55500 sas64      username       Eqw   09/27/2016 23:19:49                                    1
>>> 1144286 0.55500 sas64      username       Eqw   09/27/2016 23:20:39                                    1
>>> 1144295 0.55500 sas64      username       Eqw   09/27/2016 23:21:29                                    1
>>> 1144306 0.55500 sas64      username       Eqw   09/27/2016 23:22:19                                    1
>>> 1144316 0.55500 sas64      username       Eqw   09/27/2016 23:23:09                                    1
>>> 1144326 0.55500 sas64      username       Eqw   09/27/2016 23:23:59                                    1
>>> 1144335 0.55500 sas64      username       Eqw   09/27/2016 23:24:49                                    1
>>> 1144344 0.55500 sas64      username       Eqw   09/27/2016 23:25:39                                    1
>>> 1144351 0.55500 sas64      username       Eqw   09/27/2016 23:26:30                                    1
>>> 1144359 0.55500 sas64      username       Eqw   09/27/2016 23:27:20                                    1
>>> 1144366 0.55500 sas64      username       Eqw   09/27/2016 23:28:10                                    1
>>> 1144374 0.55500 sas64      username       Eqw   09/27/2016 23:29:00                                    1
>>> 1144416 0.55500 sas64      username       Eqw   09/27/2016 23:29:50                                    1
>>> 1144482 0.55500 sas64      username       Eqw   09/27/2016 23:30:40                                    1
>>> 1144484 0.55500 sas64      username       Eqw   09/27/2016 23:31:30                                    1
>>> 1144485 0.55500 sas64      username       Eqw   09/27/2016 23:32:20                                    1
>>> 1144486 0.55500 sas64      username       Eqw   09/27/2016 23:33:10                                    1
>>> 1144487 0.55500 sas64      username       Eqw   09/27/2016 23:34:00                                    1
>>> 1144491 0.55500 sas64      username       Eqw   09/27/2016 23:34:51                                    1
>>> 1144498 0.55500 sas64      username       Eqw   09/27/2016 23:35:41                                    1
>>> 1144499 0.55500 sas64      username       Eqw   09/27/2016 23:36:31                                    1
>>> 1144500 0.55500 sas64      username       Eqw   09/27/2016 23:37:21                                    1
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
> 




