[gridengine users] CFX tight integration

William Hay w.hay at ucl.ac.uk
Thu Nov 24 14:56:33 UTC 2011


On 24 November 2011 12:59, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 24.11.2011 um 12:51 schrieb William Hay:
>
>> Are there any instructions for getting CFX working under tight
>> integration?  It appears to work OK loosely integrated but it doesn't
>> appear to work under our existing integrations.  If we use an rsh-like
>> wrapper around qrsh I get the following output:
>>
>> + cfx5solve -max-elapsed-time '14 [min]' -def EYEball2.def
>> -start-method 'HP MPI Distributed Parallel' -par-dist
>
> HP MPI (aka Platform MPI aka IBM MPI) understands a hostfile in the MPICH(1) format, depending on the version of HP MPI.
>
> We don't have this particular application (but others which use HP MPI), but I had to play around with MPI_REMSH to get it working, i.e. point it at `rsh` to override the default `ssh` and thereby use SGE's rsh wrapper.
>
CFX is a particularly evil piece of software that, while using HP-MPI,
avoids the provided tools.  As a result it ignores MPI_REMSH but
responds to CFX5RSH instead.  I'm pretty sure it is calling the
wrapper I intend regardless.
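
For anyone reproducing this setup, here is a minimal Python sketch of the two pieces involved: building the -par-dist host list in the 'host*slots' form shown in the command above from the contents of SGE's $PE_HOSTFILE, and setting CFX5RSH in the environment handed to cfx5solve. The function names, the wrapper path and the queue name all.q in the sample are illustrative assumptions, not something taken from CFX or SGE documentation.

```python
import os


def par_dist_from_pe_hostfile(text):
    """Turn $PE_HOSTFILE contents into CFX's 'host*slots,host*slots' form.

    Each PE_HOSTFILE line starts with: hostname slots queue ...
    Only the first two fields are used here.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        entries.append(f"{host}*{slots}")
    return ",".join(entries)


def solver_environment(wrapper_path, base_env=None):
    """Copy the environment and point CFX5RSH at the rsh-like wrapper."""
    env = dict(os.environ if base_env is None else base_env)
    env["CFX5RSH"] = wrapper_path  # wrapper_path is a hypothetical location
    return env


# Sample hostfile contents; the queue name all.q is an assumption.
sample = (
    "usertest08 4 all.q@usertest08 UNDEFINED\n"
    "usertest07 4 all.q@usertest07 UNDEFINED\n"
)
print(par_dist_from_pe_hostfile(sample))
```

The resulting string would be passed as the -par-dist argument, and the dictionary from solver_environment() as the environment of the cfx5solve process.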
>
>> 'usertest08*4,usertest07*4'
>> An error has occurred in cfx5solve:
>>
>> Unable to determine type of remote host usertest07 as no data was returned.
>>
>> An error has occurred in cfx5solve:
>>
>> Remote connection to usertest07 exited with return code 129.
>
> This is SIGHUP (128 + 1); maybe your application starts the communication and closes the connection before all data has been returned. This could be a consequence of there being no asynchronous `qrsh -inherit` in SGE, i.e. after the connection is closed, all child processes on the slave machine are killed. This was the reason I supplied start_mpich2.c (the C application, not the script) for the former daemon-based startup of MPICH2: it forks on the master node of the parallel job, and this way the processes (i.e. daemons) on the slaves continue to run.
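
The 128 + n reading of the return code can be checked with a small Python sketch. It assumes a POSIX system and the usual shell convention that a status above 128 means the process was killed by signal (status - 128); a program can also return such a value on its own, so the result is only a likely interpretation. The function name is made up for illustration.

```python
import signal


def describe_exit_status(code):
    """Interpret an exit status using the shell's 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with this number is known on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_status(129))
```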
>
>
>> Check that you
>> have typed the hostname correctly, that you have an account "ccaawih" on
>> the specified host with permission to rsh from this host, and that
>> (particularly for Windows hosts) it is running an rsh daemon.  You can use
>> the following command to check the connection to a UNIX machine:
>>
>>  rsh usertest07 uname
>
> I'm not aware that the standard HP MPI will do such things.
>
> But anyway: this should be caught by the SGE rsh-wrapper and then just be routed to a slave host, right?
>
> Which communication method have you set up in SGE (rsh_daemon/rsh_command)? There was an issue with the PVM integration, where the builtin method doesn't close a port: http://arc.liv.ac.uk/pipermail/gridengine-users/2009-February/023140.html

ssh with a perl wrapper around sshd to enable saving the job details
for pam_sge_qrsh.
>
>
>> or the following command if it is a Windows machine:
>>
>>  rsh usertest07 cmd /c echo working
>>
>> An error has occurred in cfx5solve:
>>
>> The architecture string for host usertest07 could not be determined.
>>
>> Googling around suggests people have had CFX run under tight
>> integration but I can't find any documentation as to how.
>>
>> When I have the wrapper report what it is doing, it looks like
>> cfx5solve is calling the wrapper with
>> <host> -n echo TRUE as arguments.
>>
>> The commands it suggests work with the wrapper.
>
> What kind of wrapper? The rsh-wrapper from SGE?
>
> -- Reuti
>
Yes, with some mods to ignore a few more harmless switches that some
MPI implementations use when they think they are calling ssh (like -q
and -x).  I also commented out the echoing of the qrsh command before
it is run.
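
To make the argument handling concrete, here is a rough Python sketch of the translation described above. It is not the actual SGE rsh wrapper (which is a shell script), only an illustration under assumptions: the function name is made up, -q and -x are simply dropped, and -n is mapped to qrsh's -nostdin option.

```python
IGNORED_SWITCHES = {"-q", "-x"}  # ssh-style switches mentioned in this thread


def build_qrsh_command(argv):
    """Map rsh-style arguments to a 'qrsh -inherit' command line."""
    host = None
    nostdin = False
    rest = list(argv)
    # Consume switches and the host name; stop at the first word of the
    # remote command so that its own arguments are left untouched.
    while rest:
        arg = rest[0]
        if arg in IGNORED_SWITCHES:
            rest.pop(0)
        elif arg == "-n":
            nostdin = True
            rest.pop(0)
        elif host is None:
            host = rest.pop(0)
        else:
            break
    if host is None:
        raise ValueError("no host given")
    cmd = ["qrsh", "-inherit"]
    if nostdin:
        cmd.append("-nostdin")  # assumed equivalent of rsh's -n
    return cmd + [host] + rest


# The call cfx5solve makes, as reported by the wrapper:
print(build_qrsh_command(["usertest07", "-n", "echo", "TRUE"]))
# → ['qrsh', '-inherit', '-nostdin', 'usertest07', 'echo', 'TRUE']
```

Any switch not listed here would be mistaken for the host name, so a real wrapper needs to know every option its callers may pass.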
>
>>
>> William