[gridengine users] Anyone using S-GAE reporting app with Univa grid engine?

Chris Dagdigian dag at sonsorol.org
Tue Mar 3 12:16:16 UTC 2015


I'll give some impressions of S-GAE since I have it installed in a lot 
of places ...

- It's a good basic reporting tool for monthly metrics.
- I don't use all of the features; I mainly use the full cluster "view"
- In the full cluster view there are 4-6 PNG graphics that I just 
generate and copy/embed into a written document

The basic metrics that I like are:

  - Job count broken down by percentage of successful/failed jobs (job 
success % is a great top-line metric)
  - Cluster exec time (bar graph showing longest / shortest / avg job info)
  - Slots-per-job graph (a great way to show that only 1% of jobs use MPI 
or the threaded PE hack)
  - Top ten users by memory consumption
  - Top ten users by raw job count
  - Top ten users by absolute exec time

Generic observations:

  - It's not super fast at ingest; it does a qacct on every job in the 
accounting file, parses the data and loads it into the DB. I usually let 
ingest cook overnight

  - It can be tuned for ingest with various memory, mysql and ramdisk 
methods

  - It's not fast at viewing - tons of temporary mysql tables are made 
in $TMP just to show the front cluster view page

  - It can take 10 minutes just to render the HTML main page after we've 
loaded metrics for the month; lots of action in /tmp with temporary 
mysql files

  - By default it will reject jobs for which the username does not exist 
on the local host - this is crappy when I take someone's accounting file 
and run it through my own S-GAE server running on AWS or elsewhere. I had 
to write scripts that parse the accounting file for usernames, generate a 
unique list and then create dummy accounts on the local system (sketched 
after this list). The rejections are easy to miss if you don't pay 
attention to the logs

  - Errors in the logs about being unable to ingest or create summary 
views may look like SQL or database problems at first, but 99% of the 
time it means the system filled /tmp to 100% and simply bombed out trying 
to execute a procedure

  - There are certain things that can ONLY be done in the web interface, 
which kills me when I set up, or repeatedly tear down and rebuild, a 
metrics system. You can't configure the known queues or other parameters 
via a script or a config file; each time you install or reinstall you 
need to step through the web pages. Multiple point-and-click steps are 
required to register each cluster queue, which is painful on big systems 
where I may be destroying and rebuilding the S-GAE setup multiple times. 
It's basically a human-interaction / UI hassle
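
For the username issue above, the workaround is simple enough to sketch. 
The accounting file is colon-delimited and (going from memory) the job 
owner is the fourth field; the useradd flags and nologin shell path below 
are assumptions for a generic Linux box, so treat this as a sketch of the 
idea rather than the exact script I run:

    #!/usr/bin/env python3
    # Sketch: pull the unique job owners out of an SGE/UGE accounting file
    # and create placeholder local accounts so S-GAE stops rejecting jobs.
    # Assumes the classic colon-delimited accounting(5) format with the
    # owner in field 4 -- verify against your own file first.
    import pwd
    import subprocess
    import sys

    def owners(acct_path):
        seen = set()
        with open(acct_path) as fh:
            for line in fh:
                if line.startswith("#"):   # skip comment/header lines
                    continue
                fields = line.split(":")
                if len(fields) > 3:
                    seen.add(fields[3])
        return sorted(seen)

    if __name__ == "__main__":
        for user in owners(sys.argv[1]):
            try:
                pwd.getpwnam(user)         # account already exists locally
            except KeyError:
                # Dummy no-login account purely so S-GAE's user lookup works
                subprocess.check_call(["useradd", "--no-create-home",
                                       "--shell", "/sbin/nologin", user])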


Tuning:

  - S-GAE needs huge /tmp space and may fail subtly unless you are 
careful about watching the logs (a quick pre-flight check is sketched 
below)
  - For a cluster that does 1-2 million jobs a month we need a 100GB /tmp 
partition to run metrics
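
Because the failure mode is so quiet, a pre-flight check on /tmp before 
kicking off ingest or a report run saves a lot of head-scratching. A 
trivial sketch; the 100 GB threshold is just the figure above and should 
be tuned to your own job volume:

    #!/usr/bin/env python3
    # Sketch: refuse to start an S-GAE ingest/report run if /tmp is too small.
    # 100 GB matches the rough figure above for 1-2 million jobs/month.
    import shutil
    import sys

    REQUIRED_GB = 100

    free_gb = shutil.disk_usage("/tmp").free / 2**30
    if free_gb < REQUIRED_GB:
        sys.exit("only %.0f GB free in /tmp, need ~%d GB -- aborting"
                 % (free_gb, REQUIRED_GB))
    print("/tmp looks OK: %.0f GB free" % free_gb)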


For fixed installs that run metrics monthly I just configure the server 
to use a big /tmp partition and decide whether I can get away with 
turning on the in-memory accounting file handling on a given system.

When running on the Amazon cloud doing a one-off analysis of a client's 
accounting file, I've found that I could make things go far, far faster 
by doing the following (rough sketch after the list):

  - Running on a spot node with lots of memory
  - Carving a ramdisk out of some of the RAM and mounting it as /ramdisk
  - Relocating the mysql database data/table files into /ramdisk
  - Applying some of the MySQL tuning advice from Google to the 
mysql.conf file
  - Keeping the accounting file in the /ramdisk/ path
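
The ramdisk/MySQL shuffle is really just a handful of shell commands; here 
is a rough sketch of the sequence driven from Python. The tmpfs size, the 
MySQL datadir and the service name are assumptions for a generic Linux 
box, and I use a bind mount rather than editing datadir= in my.cnf just to 
keep the sketch short -- either way works:

    #!/usr/bin/env python3
    # Sketch of the "everything in RAM" one-off setup described above.
    # Keep the original datadir around; this is for throwaway analysis nodes.
    import shutil
    import subprocess

    RAMDISK = "/ramdisk"
    SIZE = "64G"                       # carve this out of the spot node's RAM
    MYSQL_DATADIR = "/var/lib/mysql"   # distro default; check your my.cnf

    def run(*cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # 1. Carve a tmpfs ramdisk out of memory and mount it
    run("mkdir", "-p", RAMDISK)
    run("mount", "-t", "tmpfs", "-o", "size=" + SIZE, "tmpfs", RAMDISK)

    # 2. Stop MySQL and copy its data/table files onto the ramdisk
    run("systemctl", "stop", "mysql")
    run("rsync", "-a", MYSQL_DATADIR + "/", RAMDISK + "/mysql/")

    # 3. Bind-mount the ramdisk copy over the datadir so my.cnf needs no
    #    edit, then restart MySQL (the usual my.cnf tuning applies on top)
    run("mount", "--bind", RAMDISK + "/mysql", MYSQL_DATADIR)
    run("systemctl", "start", "mysql")

    # 4. Park the accounting file on the ramdisk as well before ingest
    shutil.copy("/path/to/accounting", RAMDISK + "/accounting")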






