[gridengine users] command runs in grid engine but does not complete.

Dan Hyatt dhyatt at dsgmail.wustl.edu
Tue Jun 9 13:51:02 UTC 2015


I should explain the queues better.

I have:

normal  - default
HiMem   - jobs needing up to 30GB per node (max 2 jobs per host)
HiCPU   - same, except the jobs run at 100% CPU for weeks, so also max 2
jobs per host
short   - kills the job after 60 minutes
vshort  - kills the job after 30 minutes
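As a sanity check on the short queues, a sketch assuming standard SGE and that the 60/30-minute kills are enforced as the queues' hard run-time limits (the exact limit values shown are assumptions):

```shell
# Sketch, assuming SGE, and assuming the kill-after-N-minutes behaviour is
# implemented via each queue's hard run-time limit (h_rt):
qconf -sq short  | grep h_rt    # expect something like: h_rt 01:00:00
qconf -sq vshort | grep h_rt    # expect something like: h_rt 00:30:00
```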

so I run

qsub -q normal -cwd ........
or
qsub -q HiMem -cwd .....

and if normal is full, the job goes to HiMem, then tries HiCPU, then
short, then vshort...
I want the job to wait until the queue it is assigned to has a free slot.
I don't want it to try the next queue.
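For anyone reading along, a minimal sketch of pinning a job to a single queue, assuming standard SGE syntax ("normal" and "job.sh" are placeholder names):

```shell
# Sketch, assuming SGE; "normal" and "job.sh" are placeholders.
# -hard makes the following -q request a hard requirement, so the
# scheduler leaves the job queued (state "qw") until a slot opens in
# "normal" instead of dispatching it to another queue.
qsub -hard -q normal -cwd job.sh
```

If jobs still spill into other queues, it is worth checking whether a site-wide default request file is adding extra queue requests.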



On the other issue...

Previously, when an execute node sent an error to the grid, the node was
pulled out of service automatically.
Now I have to ask the user which node the job failed on; they tell me and
I start testing.

qstat -f would tell me which nodes were bad (or the dirty word qmon
would tell me which were not healthy).
Now it just keeps sending jobs to the bad nodes.
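A sketch of the manual workaround, assuming standard SGE admin commands; "bad-node" is a placeholder hostname:

```shell
# Sketch, assuming standard SGE admin tools; "bad-node" is a placeholder.

# Show all queue instances, with an explanation of any error (E) state
# the scheduler has recorded for them:
qstat -f -explain E

# Take every queue instance on a suspect host out of service so no new
# jobs land there, then re-enable it after testing:
qmod -d '*@bad-node'
qmod -e '*@bad-node'

# Clear a queue instance's error state once the underlying problem is
# fixed:
qmod -c '*@bad-node'
```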


Thank You!

On 06/09/2015 08:25 AM, MacMullan, Hugh wrote:
> Hi Dan:
>
> qsub -cwd -b y 'metal < input.file > log.file'
>
> Might work, if users/you still want to work without script files.
>
> Default queue -- to $SGE_ROOT/default/common/sge_request add:
>
> -q all.q
>
> (or whatever your default queue is) That sets a hard queue, which can be overridden with '-q someotherqueue'.
>
> Not sure about the last ... generally nodes will only error out if the error is a GE error, not a job error, so maybe they're just job errors? Do simple jobs work?
>
> -Hugh
>
> -----Original Message-----
> From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of Dan Hyatt
> Sent: Tuesday, June 09, 2015 9:02 AM
> To: users at gridengine.org
> Subject: Re: [gridengine users] command runs in grid engine but does not complete.
>
>
> As my grid engine was stable and working, and that was more important
> than anything else, I left it mostly running, minimally administered,
> for a year.
>
> When I went to add 6 more nodes to the grid, I inadvertently and
> stupidly reconfigured the grid breaking it.
>
> So I am trying to get the grid back to where my users expect it.
>
>
> I found what my problem was and a work around...
>
> the command was
>
> qsub -cwd -b y metal < input.file > log.file   and it was choking
> (the submitting shell applies those redirections to qsub itself, not to
> metal on the execute node)
>
> when I put   metal < input.file > log.file   into a script file and ran
> qsub -cwd script.bash   it works fine
>
>
> The next two issues I have been googling, and poring over the
> documentation for...
>
> When I send a job to a queue, if the queue is busy it sends it to the
> next queue (defeating the purpose of separate queues in my env). How do
> I set the queues to run jobs ONLY in the appointed queue?
>
> The execute nodes were updated, and some are not playing well in the
> sandbox. When the grid sends a job to one of them, it hangs and sends
> an error, but that blade is not removed from the execute node list like
> it was before.
> Is there an easy way to manually test the execute nodes (there are 180)?
> And why are bad nodes no longer being removed from the available nodes
> as before? Previously a bad node would be marked unusable, so when I
> listed the execute nodes I could see it was bad, and it would not
> accept jobs.
>
>
> On 06/08/2015 02:10 PM, Alex Chekholko wrote:
>> What was the "grid reconfiguration"?
>>
>> On 06/08/2015 11:42 AM, Dan Hyatt wrote:
>>> We are running a binary program called metaanalysis, which the user says
>>> was working prior to a grid reconfiguration.
>>>
>>>
>>> qsub -cwd -b y /dsg_cent/bin/metal < c22srcfile.txt > c22SBP.log
>>>
>>> This starts, runs, creates the logs, and then fails to create the data
>>> files
>>> qsub -cwd -b y  /dsg_cent/bin/metal < c22srcfile.txt > c22SBP.log
>>>
>>> -rw-rw-r-- 1 aldi   genetics 8523209 Jun  8 09:53 c22GENOA.SBP.EA.M1.csv
>>> -rw-rw-r-- 1 aldi   genetics 8660667 Jun  8 09:53 c22FamHS.SBP.ea.M1.csv
>>> -rw-rw-r-- 1 aldi   genetics 6025412 Jun  8 09:53
>>> c22HYPERGEN.SBP.EA.M1.csv
>>> -rw-rw-r-- 1 aldi   genetics    2061 Jun  8 09:53 c22srcfile.txt
>>> -rw-rw-r-- 1 dhyatt genetics      43 Jun  8 13:40 c22SBP.log
>>> -rw-r--r-- 1 dhyatt genetics       0 Jun  8 13:40 metal.e1043
>>> -rw-r--r-- 1 dhyatt genetics    2743 Jun  8 13:40 metal.o1043
>>> [dhyatt at blade5-2-1 c22
>>>
>>>    The control/output files indicate everything ran (there are .o and .e
>>> files), but no data files were created.
>>>
>>>
>>> The command line works fine and creates the data files. But I need to
>>> run large jobs on the queue.
>>>
>>> -rw-rw-r-- 1 aldi   genetics  8523209 Jun  8 09:53
>>> c22GENOA.SBP.EA.M1.csv
>>> -rw-rw-r-- 1 aldi   genetics  8660667 Jun  8 09:53
>>> c22FamHS.SBP.ea.M1.csv
>>> -rw-rw-r-- 1 aldi   genetics  6025412 Jun  8 09:53
>>> c22HYPERGEN.SBP.EA.M1.csv
>>> -rw-rw-r-- 1 aldi   genetics     2061 Jun  8 09:53 c22srcfile.txt
>>> -rw-rw-r-- 1 dhyatt genetics  8177082 Jun  8 13:39 METAANALYSIS1.TBL
>>> -rw-rw-r-- 1 dhyatt genetics     1054 Jun  8 13:39
>>> METAANALYSIS1.TBL.info
>>> -rw-rw-r-- 1 dhyatt genetics 10487038 Jun  8 13:39 METAANALYSIS2.TBL
>>> -rw-rw-r-- 1 dhyatt genetics     1316 Jun  8 13:39
>>> METAANALYSIS2.TBL.info
>>> -rw-rw-r-- 1 dhyatt genetics     5030 Jun  8 13:39 c22SBP.log
>>>
>>> any thoughts?
>>>
>>> Dan
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users



