[gridengine users] CPU time limit exceeded

Lars van der bijl lars at realisestudio.com
Tue Mar 13 11:46:40 UTC 2012


On 13 March 2012 12:32, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 13.03.2012 at 12:03, Lars van der bijl wrote:
>
>> On 13 March 2012 11:18, Reuti <reuti at staff.uni-marburg.de> wrote:
>>> Hi,
>>>
>>> On 13.03.2012 at 10:59, Lars van der bijl wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> We're having the following problem.
>>>>
>>>> Randomly, on some tasks, we start getting "CPU time limit exceeded". We
>>>
>>> Do you see that in the SGE messages file on the execution host, or where do you get the statement?
>>>
>>
>> We get this in our stderr output.
>
> Then I would say it's not a limit by SGE. Can you set up any time limit in the application itself?

Not that I am aware of. The application is rendering an image and has
nothing set up to kill it after a time limit.
We do have a limit on memory.
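
One way to check whether a CPU time limit is inherited from the
environment rather than set by the application is to print the limit
from inside the job. The following is a minimal sketch, assuming Python
is available on the execution host; the function name describe_cpu_limit
is hypothetical. On Linux the kernel sends SIGXCPU once the soft limit is
reached, and the shell typically reports that as "CPU time limit exceeded".

```python
import resource


def describe_cpu_limit():
    """Return the soft and hard RLIMIT_CPU values of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)

    def fmt(value):
        # RLIM_INFINITY means no limit is set.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value} s"

    return f"soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    # Run this at the start of a task to log the limits it inherited.
    print(describe_cpu_limit())
```

If this prints anything other than "unlimited" for the soft limit, some
layer (queue, shell profile, or wrapper) is setting a CPU limit.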


>
>
>>>> don't specify a time limit. We do specify h_vmem.
>>>> This only happens on some tasks and not others, even between identical tasks
>>>> from a batch on the same machine.
>>>
>>> It could be a limit set in the queue definition (h_cpu), or one specified for particular jobs (-l h_cpu=...).
>>>
>>> The time for an SGE limit is usually mentioned in the messages file. Is it always the same time?
>>>
>>
>> 03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121
>> 03/13/2012 05:41:24|worker|nano|W|job 61607.131 failed on host louie
>> general rescheduling on application error because: 03/13/2012 05:41:23
>> [0:10105]: exit_status of job start = 100
>
> So, the job was rescheduled (do you know why?), but the restart failed and put the job in error status (because of exit code 100). Do you see this?

To force SGE to error out or retry, we check the exit status of the
task in the prolog. If it is anything other than 0 and the task has
retries left, the prolog exits with 99; otherwise it exits with 100.
We always have tasks that depend on the output, and we don't want them to start.
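
The decision described above can be sketched roughly as follows. This is
an illustration, not our actual script: the function name
prolog_exit_code and its parameters are hypothetical, and it assumes the
prolog exits 0 when the task status is 0.

```python
def prolog_exit_code(task_status: int, retries_left: int) -> int:
    """Map a task's exit status to the exit code returned by the prolog."""
    if task_status == 0:
        # Assumption: a clean task lets the job proceed normally.
        return 0
    if retries_left > 0:
        # 99 asks SGE to reschedule the job.
        return 99
    # 100 puts the job into error state, so dependent tasks do not start.
    return 100
```

For example, prolog_exit_code(1, 2) gives 99 and prolog_exit_code(1, 0)
gives 100.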

Could a SIGXCPU or a SIGTERM cause this?
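
To tell whether a recorded status came from a signal, a helper like the
one below could be used. It is only a sketch: it assumes the status
follows the common shell convention of 128 plus the signal number (on
most Linux systems SIGXCPU is 24, so 152, and SIGTERM is 15, so 143), and
the name signal_from_status is hypothetical. How SGE itself records the
status may differ.

```python
import signal


def signal_from_status(status: int):
    """Return the signal name encoded in a shell-style status, or None."""
    if status <= 128:
        # Ordinary exit codes carry no signal information.
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # The number does not correspond to a known signal on this platform.
        return None
```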


>
> Can you elaborate in some way on what is going on there in detail - is it supposed to fail if it's just rescheduled without cleaning any former files or so?
>
> -- Reuti
>
>
>> Unless [0:10105] is the limit, I'm not sure.
>>
>>
>>
>>> -- Reuti
>


