[gridengine users] Obtaining the number of granted slots in C++

Txema Heredia txema.llistes at gmail.com
Fri Nov 29 11:41:32 UTC 2013


Hi all,

We are having some problems with jobs using a C++ binary program that, 
simply put, ignores all slot allocations.

The C code in question calls "sysconf(_SC_NPROCESSORS_ONLN);" 
to determine how many threads it can open, and uses pthreads to 
parallelize.
The problem is that this call returns all the online cores, not just the 
ones assigned by SGE or by core binding. So we end up with 12 jobs 
on a node, each with 1 slot assigned by SGE and bound to a single 
core, but each running 12 threads that fight for CPU 
cycles on that single core. The load average then skyrockets and the 
node is unusable until the CPU storm has passed.

I have been investigating a little and I haven't found any 
"out-of-the-box" method to have C report the "granted" number of cores. 
All the direct methods (single-function call) I have tested report the 
total number of cores in the system:
sysconf(_SC_NPROCESSORS_ONLN);
sysconf(_SC_NPROCESSORS_CONF);
get_nprocs_conf ();
get_nprocs ();

The only method I have found ( 
http://stackoverflow.com/questions/4586405/get-number-of-cpus-in-linux-using-c 
) that reports the proper number of assigned cores requires writing a 
function that loops over all the cores and checks the job's affinity 
for each one.
This method (apparently) works. At least it reports the number of 
cores the job is bound to.

For reference, this is the code I tested:

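#define _GNU_SOURCE   /* exposes sched_getaffinity() and the CPU_* macros when built as C */
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/sysinfo.h>
#include <stdio.h>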


int GetCPUCount()
{
         cpu_set_t cs;
         CPU_ZERO(&cs);
         sched_getaffinity(0, sizeof(cs), &cs);

         int count = 0;
         for (int i = 0; i < get_nprocs(); i++)
         {
                 if (CPU_ISSET(i, &cs))
                         count++;
         }
         return count;
}


int main(int argc, char* argv[]){
         long sc = sysconf(_SC_NPROCESSORS_ONLN);
         long sc_conf = sysconf(_SC_NPROCESSORS_CONF);
         long nprocs_conf = get_nprocs_conf ();
         long nprocs = get_nprocs ();
         long sched = GetCPUCount();

         /* the values are long, so print them with %ld */
         printf("sysconf(_SC_NPROCESSORS_ONLN) = %ld\n",sc);
         printf("sysconf(_SC_NPROCESSORS_CONF) = %ld\n",sc_conf);
         printf("get_nprocs_conf() = %ld\n",nprocs_conf);
         printf("get_nprocs() =  %ld\n",nprocs);
         printf("sched_getaffinity = %ld\n",sched);
}
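
A shorter variant of the affinity check above, as a sketch only: glibc 
provides a CPU_COUNT macro that counts the bits set in a cpu_set_t, so the 
explicit loop over get_nprocs() is not needed. The function name 
get_allowed_cpu_count is made up for this example, and the code assumes 
Linux with glibc (sched_getaffinity and the CPU_* macros are not portable).

```c
#define _GNU_SOURCE   /* needed for sched_getaffinity() and CPU_* macros */
#include <sched.h>

/* Hypothetical helper: number of CPUs the calling process is allowed to
 * run on according to its affinity mask, or -1 if the mask cannot be read. */
int get_allowed_cpu_count(void)
{
    cpu_set_t cs;

    CPU_ZERO(&cs);
    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(cs), &cs) != 0)
        return -1;

    /* CPU_COUNT returns the number of CPUs present in the set */
    return CPU_COUNT(&cs);
}
```

Unlike the loop version, this also checks the return value of 
sched_getaffinity, so a failed call is not silently reported as 0 cores.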


After submitting it in a job, these are the results:

#1-slot, core binding=1
qsub -cwd -l h_vmem=500M -binding linear:1 -b y ./test_n_procs

sysconf(_SC_NPROCESSORS_ONLN) = 12
sysconf(_SC_NPROCESSORS_CONF) = 12
get_nprocs_conf() = 12
get_nprocs() =  12
sched_getaffinity = 1

#3-slots, core binding=1
qsub -cwd -l h_vmem=500M -pe threaded 3 -binding linear:1 -b y ./test_n_procs

sysconf(_SC_NPROCESSORS_ONLN) = 12
sysconf(_SC_NPROCESSORS_CONF) = 12
get_nprocs_conf() = 12
get_nprocs() =  12
sched_getaffinity = 1

#3-slots, core binding=3
qsub -cwd -l h_vmem=500M -pe threaded 3 -binding linear:3 -b y ./test_n_procs

sysconf(_SC_NPROCESSORS_ONLN) = 12
sysconf(_SC_NPROCESSORS_CONF) = 12
get_nprocs_conf() = 12
get_nprocs() =  12
sched_getaffinity = 3

#3-to-6-slots, core binding=6
qsub -cwd -l h_vmem=500M -pe threaded 3-6 -binding linear:6 -b y ./test_n_procs

sysconf(_SC_NPROCESSORS_ONLN) = 12
sysconf(_SC_NPROCESSORS_CONF) = 12
get_nprocs_conf() = 12
get_nprocs() =  12
sched_getaffinity = 6
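
Note that in the "3-slots, core binding=1" run the affinity count follows 
the binding (1), not the number of granted slots (3). A possible 
complement, as a sketch under assumptions: Grid Engine exports an NSLOTS 
variable into the job environment, so a program could read it and take the 
smaller of NSLOTS and the affinity count. The function names 
env_positive_int and pick_thread_count, the 65536 sanity limit and the 
"take the minimum" policy are choices made for this example, not something 
SGE prescribes.

```c
#define _GNU_SOURCE   /* needed for sched_getaffinity() and CPU_* macros */
#include <sched.h>
#include <stdlib.h>
#include <errno.h>

/* Hypothetical helper: parse a positive integer from environment variable
 * `name`. Returns 0 if the variable is unset, empty, or not a sane number. */
int env_positive_int(const char *name)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return 0;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 1 || v > 65536)
        return 0;
    return (int)v;
}

/* Hypothetical helper: pick a thread count from the granted slots (NSLOTS,
 * assumed to be set by the scheduler) and the CPU affinity mask. Uses the
 * smaller of the two when both are known, and never returns less than 1. */
int pick_thread_count(void)
{
    int slots = env_positive_int("NSLOTS");
    int bound = 0;
    cpu_set_t cs;

    CPU_ZERO(&cs);
    if (sched_getaffinity(0, sizeof(cs), &cs) == 0)
        bound = CPU_COUNT(&cs);

    if (slots > 0 && bound > 0)
        return slots < bound ? slots : bound;
    if (slots > 0)
        return slots;
    if (bound > 0)
        return bound;
    return 1;   /* nothing known: fall back to a single thread */
}
```

Outside of a job, where NSLOTS is not set, this falls back to the affinity 
count alone, so the same binary should still behave sensibly on a 
workstation.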



Has anyone encountered this problem before? Is there a more elegant 
solution? Is there a way that doesn't require reprogramming all the 
software that faces this problem?

Thanks in advance,

Txema


