[gridengine users] Huge amount of files generated in local disk
prod.feng at gmail.com
Mon Jan 26 16:15:37 UTC 2015
I just found some strange behavior in SGE 2011.
One user's job generates more than 1 million small files on local
disk ($TEMPDIR). This seems to make execd very busy, and from the
qmaster's side the node appears lost and unavailable, even though I
can still log in via ssh. On the node, execd generates heavy I/O (a
few hundred KB/s to a few MB/s). Some nodes survive and return to
normal; others eventually fail. (These jobs also use a lot of memory,
so it looks like those nodes failed when their RAM was used up.) I am
wondering whether execd processes the files that a job generates.
Or does execd do something else to communicate with qmaster when
there are a lot of job-generated files?