[gridengine users] qmake retry after connection error
Joachim Gabler
jgabler at univa.com
Tue Jun 14 14:45:00 UTC 2011
Hi Ido,
Ido Tamir wrote:
> Hi,
> we use qmake to parallelize the illumina/solexa pipeline. It's a make-based system that
> operates on many files to generate some output.
>
> However, often under load we get errors like:
>
> error: commlib error: got select error (Connection reset by peer)
> error: executing task of job 7980306 failed: failed sending task to XXX at XXX.xxx: can't find connection
>
> Then we have to restart the pipeline.
>
> I tried the make options -k (keep going) and -i (ignore), and it keeps working, but the result is broken.
> -r is not available for qmake.
>
> Is there a possibility to retry a certain number of times if this error comes up - and only this
> error? Sometimes there are missing files etc... then it should fail.
> But this is simply a node not answering in a specified amount of time.
> Is there a possibility to extend the timeout?
Setting gdi_timeout=<timeout> in the global configuration (qconf
-mconf), attribute qmaster_params, increases the receive timeout for
the requests done by qmake (via qrsh -inherit).
See also man page sge_conf.5, section about qmaster_params.
You can try whether it helps, but I have my doubts.
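As a rough illustration of the setting described above, the relevant
line in the editor opened by qconf -mconf might look like the fragment
below. The value 120 is only a placeholder, not a recommendation, and
the unit is assumed to be seconds; check sge_conf(5) for your version.
If qmaster_params already lists other parameters, the new entry would
be appended to that list rather than replacing it.

```
qmaster_params               gdi_timeout=120
```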
From the error message "Connection reset by peer" I would guess it
really would require a retry.
You can configure gdi_retries=n, where n > 0, in the global
configuration, attribute qmaster_params, to have failed client
requests retried.
Unfortunately this takes effect for all clients (qsub, qstat, ...) except
for the qrsh -inherit used by qmake.
I'll file an issue to make sure this gets fixed.
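
Until qrsh -inherit honors gdi_retries, one possible workaround is to
retry from outside qmake. The following is only a sketch under
assumptions, not something qmake or Grid Engine provides: the function
names, the retry count and the delay are hypothetical, and the marker
strings are simply copied from the error messages quoted above. It
reruns the command only when the output contains one of those
connection errors; any other failure (missing files etc.) is returned
immediately.

```python
import subprocess
import sys
import time

# Substrings taken from the error messages quoted in this thread.
TRANSIENT_MARKERS = (
    "commlib error",
    "can't find connection",
)


def is_transient(output):
    """Return True if the output contains one of the connection error markers."""
    return any(marker in output for marker in TRANSIENT_MARKERS)


def run_with_retry(cmd, max_tries=3, delay=0.0):
    """Run cmd; rerun it only when it fails with a connection error.

    Returns a tuple (exit_code, tries_used).
    """
    for attempt in range(1, max_tries + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so both streams are scanned
            universal_newlines=True,
        )
        sys.stdout.write(proc.stdout)  # pass the captured output through
        if proc.returncode == 0:
            return 0, attempt
        if not is_transient(proc.stdout):
            # Any other failure is reported at once, without a retry.
            return proc.returncode, attempt
        if attempt < max_tries:
            time.sleep(delay)  # give the node some time before retrying
    return proc.returncode, max_tries
```

It could be called as run_with_retry(["qmake", ...], max_tries=3,
delay=30.0) with your usual qmake arguments. This relies on make
skipping targets that are already up to date when it is rerun, so it
should be used without -k or -i. A task that died mid-write may leave
a partial output file behind that make then treats as complete, so the
results are still worth checking.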
Best regards,
Joachim
> Thank you very much for your answers,
> ido