[gridengine users] Anyone using S-GAE reporting app with Univa grid engine?
beckerje at mail.nih.gov
Tue Mar 3 14:04:29 UTC 2015
Thanks, some comments/questions inline.
On Tue, Mar 03, 2015 at 07:16:16AM -0500, Chris Dagdigian wrote:
>- It's a good basic reporting tool for monthly metrics.
Is that the smallest resolution it supports? XDMoD can drill down to
daily, which we find very useful.
> - Job count shown as a percentage of success/failed jobs (job success
>% is a great top-line metric)
That's actually a really nice metric. I don't know if XDMoD supports
that out of the box. That said, the charts it makes are nice, and
there's a custom reporting system.
> - Cluster exec time (bar graph showing longest / shortest / avg job info)
> - Slots per job graph (great way to show that only 1% of jobs use
>MPI or threaded PE hack)
> - Top ten users by memory consumption
> - Top ten users by raw job count
> - Top ten users by absolute exec time
XDMoD has similar stuff.
> - It's not super fast at ingest; it does a qacct on every job in the
>accounting file, parses the data and loads into db; I usually let it
>cook overnight on ingest
Seriously? A full "qacct -j <jobid>" on each job? That's got to be
slow. *MoD at least groks the raw accounting logs.
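For comparison, a rough sketch of reading the raw accounting file
directly instead of forking qacct per job. Field positions come from
the SGE accounting(5) man page (colon-delimited: $4=owner,
$6=job_number, $13=exit_status, $14=ru_wallclock); the function name
and example path are mine, not S-GAE's or XDMoD's actual code:

```shell
#!/bin/sh
# Summarize jobs straight from the raw accounting file, one line per
# job, no qacct fork. Assumes the stock colon-delimited format from
# accounting(5).
summarize_acct() {
    # reads accounting records on stdin, prints job/owner/exit/wallclock
    awk -F: 'NF >= 14 { printf "%s %s exit=%s wall=%ss\n", $6, $4, $13, $14 }'
}

# e.g. (path is an example):
# summarize_acct < /opt/sge/default/common/accounting
```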
> - It can be tuned for ingest with various memory, mysql and ramdisk
> - It's not fast at viewing - tons of temporary mysql tables are made
>in $TMP just to show the front cluster view page
We run XDMoD on a small VM and it works just fine. To be fair, the
mysql server that stores the data is a relatively large physical box.
> - It can take 10 minutes just to render the HTML main page after
>we've loaded metrics for the month; lots of action in /tmp with
>temporary mysql files
Yeah, nothing like that here, even for the worst case graphs.
> - By default it will reject jobs for which the username does not
>exist on localhost - this is crappy for situations where I take
>someone's accounting file and run it through my own S-GAE server
>running on AWS cloud or elsewhere. I had to make scripts that parse
>the accounting file for usernames, generate a uniq list and then make
>fake dummy accounts on the local system. This is problematic if you
>don't pay attention to the logs
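The workaround described above (extract usernames, uniq them, create
dummy local accounts) could be sketched roughly like this. The
function name is mine, the owner field position is from accounting(5),
and the useradd flags assume a RHEL-ish box; run the account-creation
loop as root:

```shell
#!/bin/sh
# Print the unique job owners (field 4) found in an accounting file.
unique_owners() {
    awk -F: 'NF >= 14 { print $4 }' "$1" | sort -u
}

# Then, as root, create throwaway no-login accounts for any that are
# missing on the local system (flags are an assumption):
# unique_owners accounting | while read -r u; do
#     id "$u" >/dev/null 2>&1 || useradd -M -s /sbin/nologin "$u"
# done
```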
XDMoD deals with that issue as well. One thing that you cannot do, yet,
is cleanly map all of XDMoD's organizational hierarchies directly into
SGE's. For example, we would really like to map SGE's
Division/Project/User into XDMoD, but it's not perfect. Projects are
most important to us, but to get them to show up in the charts, we
have to map them to PIs.
> - Errors in the logs about being unable to ingest or create summary
>views may make you think at first about SQL or database problems but
>99% of the time it means that the system ran /tmp to 100% full and
>just bombed out trying to execute a procedure
Sometimes we get funky logs, but given that we push somewhere around 1.5
million jobs a week, losing a few is not a big deal.
> - There are certain things that can ONLY be done in the web interface
>that kill me when I set up or repeatedly setup and rebuild a metric
>system. You can't configure the known queues or other parameters via a
>script or a config file. Each time you install or reinstall you need
>to step through the web page. There are multiple point and click
>events required to register each cluster queue, which is painful on big
>systems where I may be destroying and rebuilding the S-GAE system
>multiple times. It's a human interaction / UI hassle basically
A lot of the group/queue/user stuff is auto-generated from the logs, so
that's a good thing.
There are some "admin" type things that are UI-only, but I do most of the
setup via puppet pushing RPMs and json files around. It's not 100%
automated, but about as close as I care to make it.
Authentication on the webpages is a bit odd. It's not basic HTTP auth,
it doesn't tie into Kerberos, and there are some odd hooks in place
that assume you're part of the U. Buffalo system. However, it works.
> - S-GAE needs huge /tmp space and may fail subtly unless you are
>careful about watching the logs
> - For a cluster that does between 1-2million jobs a month we need a
>100GB /tmp partition to run metrics
XDMoD does all of this in the database. It "shreds" the raw log files,
and stuffs them into the DB directly. Those records are then ingested
from one DB to another DB for aggregation and storage. (There are 6
different databases for XDMoD, which is a bit odd. The schemas are
fairly sane though.)
>When running on the Amazon cloud doing a 1-off analysis on an accounting
>file from a client, I've found that I could make things go far far faster by:
> - Running on a spot node with lots of memory
> - Carving out a ramdisk out of some of the ram and mounting it as /ramdisk
> - Relocating the mysql database data/table files into /ramdisk
> - Applying some of the mysql tuning advice from google to the server config
> - Keeping the accounting file in /ramdisk/ path
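The ramdisk steps above might look roughly like the following (run as
root; the sizes, paths, and the datadir move are all examples for a
stock mysql install, not a tested recipe — everything in the ramdisk
is lost on reboot):

```shell
#!/bin/sh
# Carve a tmpfs ramdisk and relocate the mysql data files into it.
mkdir -p /ramdisk
mount -t tmpfs -o size=64g tmpfs /ramdisk

systemctl stop mysqld                 # or: service mysqld stop
cp -a /var/lib/mysql /ramdisk/mysql   # copy, don't move, so a reboot isn't fatal
# Point datadir at the ramdisk in /etc/my.cnf:
#   [mysqld]
#   datadir=/ramdisk/mysql
systemctl start mysqld

# Keep the input accounting file in RAM too (example path):
# cp /path/to/accounting /ramdisk/
```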
Probably all useful for XDMoD as well, if it fits your environment. I
pull the accounting logs off a moderately powerful, moderately
overloaded netapp. For running a nightly ingest, it's certainly fast
enough.
Jesse Becker (Contractor)