GridEngine parallel jobs/tasks show up as occupying only one core #289

mightybigcar · 2017-04-12T01:22:53Z

A job or task submitted with a parallel environment specification such as
-pe make 16
will occupy 16 cores, but qtop shows it as occupying only a single core. This leads naive users to think the cluster is being under-utilized.

PS I might chase this bug myself, but it could take a while. If someone else with actual experience with the qtop code wants to jump on it in the meantime, feel free.

fgeorgatos · 2017-05-08T12:08:27Z

@mightybigcar : thanks for raising this.

we've been seeing this for a while and in fact it has been discussed before, because indeed it makes you think that only 1 out of X (X = 12, 16, 20, whatever) cores is being allocated.

when we had a first crack at it, it turned out that under SGE it was not very reliable to identify when the whole node is allocated. do you have a reliable means to establish that? (we'd like to hear a method). I could be a beta tester for this, because the need here is similar.

sfranky · 2017-06-25T15:09:44Z

@mightybigcar also, could you kindly provide an xml file showing how -pe make 16 manifests in there?

fgeorgatos · 2017-06-26T23:21:37Z

hi @mightybigcar :

in PR #295, using the queue names world or whole on a node will have the effect you described. it's understandable that this is not the same as what you asked for, but there is a need to deterministically identify which jobs get expanded. your feedback?

mightybigcar · 2017-06-27T01:28:25Z

@sfranky ,

Here's the fragment I think you're looking for:

  <JB_script_size>0</JB_script_size>
  <JB_pe>make</JB_pe>
  <JB_pe_range>
    <ranges>
      <RN_min>16</RN_min>
      <RN_max>16</RN_max>
      <RN_step>1</RN_step>
    </ranges>
  </JB_pe_range>
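
As a rough illustration of how the requested slot count could be read from a fragment like this, here is a minimal Python sketch. It is not taken from the qtop code: the `<job>` wrapper element and the `requested_slots` helper are hypothetical, and the range only reflects what the job requested, which may differ from what the scheduler actually granted.

```python
import xml.etree.ElementTree as ET

# The fragment from this comment, wrapped in a hypothetical <job> element
# so that it parses as a standalone document.
FRAGMENT = """
<job>
  <JB_script_size>0</JB_script_size>
  <JB_pe>make</JB_pe>
  <JB_pe_range>
    <ranges>
      <RN_min>16</RN_min>
      <RN_max>16</RN_max>
      <RN_step>1</RN_step>
    </ranges>
  </JB_pe_range>
</job>
"""


def requested_slots(job_elem):
    """Return (pe_name, min_slots, max_slots) for one job element.

    Jobs without a JB_pe element are treated as serial: (None, 1, 1).
    """
    pe_name = job_elem.findtext("JB_pe")
    if not pe_name:
        return None, 1, 1
    rn_min = job_elem.findtext("JB_pe_range/ranges/RN_min")
    rn_max = job_elem.findtext("JB_pe_range/ranges/RN_max")
    lo = int(rn_min) if rn_min else 1
    hi = int(rn_max) if rn_max else lo
    return pe_name, lo, hi


if __name__ == "__main__":
    print(requested_slots(ET.fromstring(FRAGMENT)))
```

For the fragment above this reports the `make` environment with a minimum and maximum of 16 slots.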

I've also attached the full qstat output.

Cheers,
Chris
qstat-364174.txt

mightybigcar · 2017-06-27T01:29:25Z

Fsck! I banana fingered the touchpad and accidentally closed this? Is it possible to reopen it? Sorry about that.....

sfranky · 2017-06-27T07:58:18Z

Thanks for that, I'll incorporate it into the system! issue reopened 👍

sfranky · 2017-06-27T08:20:21Z

btw the queue names that follow this rule are customizable in qtopconf.yaml

mightybigcar · 2017-08-09T02:09:04Z

@fgeorgatos Here's a qstat.xml output with the parallel environment info (look for requested_pe).
qstat.txt

mightybigcar · 2017-08-09T02:20:12Z

Hi @fgeorgatos,

> when we had a first crack at it, it turned out that it was not very reliable to identify when the whole node is allocated, under SGE - do you have a reliable means to establish that? (we'd like to hear a method).

For determining whether a node is fully allocated, I use a rather simplistic approach and look at the qstat.xml for a given node. For example:

[sgeadmin@barrel ~]$ qstat -f -q *@doppelbock -xml
<?xml version='1.0'?>
<job_info  xmlns:xsd="http://gridscheduler.svn.sourceforge.net/viewvc/gridscheduler/trunk/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=11">
  <queue_info>
    <Queue-List>
      <name>all.q@doppelbock</name>
      <qtype>BP</qtype>
      <slots_used>64</slots_used>
      <slots_resv>0</slots_resv>
      <slots_total>64</slots_total>
      <load_avg>49.93000</load_avg>
      <arch>linux-x64</arch>
    </Queue-List>
    <Queue-List>
      <name>background.q@doppelbock</name>
      <qtype>BIP</qtype>
      <slots_used>0</slots_used>
      <slots_resv>0</slots_resv>
      <slots_total>64</slots_total>
      <load_avg>49.93000</load_avg>
      <arch>linux-x64</arch>
      <state>S</state>
    </Queue-List>
    <Queue-List>
      <name>mapreduce.q@doppelbock</name>
      <qtype>BIP</qtype>
      <slots_used>0</slots_used>
      <slots_resv>0</slots_resv>
      <slots_total>64</slots_total>
      <load_avg>49.93000</load_avg>
      <arch>linux-x64</arch>
      <state>S</state>
    </Queue-List>
    <Queue-List>
      <name>simulation.q@doppelbock</name>
      <qtype>BP</qtype>
      <slots_used>0</slots_used>
      <slots_resv>0</slots_resv>
      <slots_total>64</slots_total>
      <load_avg>49.93000</load_avg>
      <arch>linux-x64</arch>
      <state>S</state>
    </Queue-List>
  </queue_info>
  <job_info>
  </job_info>
</job_info>
[sgeadmin@barrel ~]$

Since we always map one slot per logical CPU (Opteron core or Xeon hyperthread), I simply sum up the values for slots_used.  If the result is equal to the number of logical CPUs, then the node is fully booked.  If it's greater than the number of CPUs, I consider the node overbooked.
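
The summing approach described above could be sketched in Python roughly as follows. This is only an illustration under the stated one-slot-per-logical-CPU assumption, not qtop's implementation: the function names `node_slots_used` and `classify_node` are made up for the example, the sample XML is a trimmed copy of the output shown earlier, and the logical CPU count has to be supplied by the caller.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the `qstat -f -q *@doppelbock -xml` output above.
SAMPLE = """<job_info>
  <queue_info>
    <Queue-List>
      <name>all.q@doppelbock</name>
      <slots_used>64</slots_used>
      <slots_total>64</slots_total>
    </Queue-List>
    <Queue-List>
      <name>background.q@doppelbock</name>
      <slots_used>0</slots_used>
      <slots_total>64</slots_total>
    </Queue-List>
  </queue_info>
</job_info>"""


def node_slots_used(qstat_xml):
    """Sum slots_used over every Queue-List entry in the XML text."""
    root = ET.fromstring(qstat_xml)
    total = 0
    for queue in root.iter("Queue-List"):
        text = queue.findtext("slots_used", default="0")
        total += int(text) if text and text.strip() else 0
    return total


def classify_node(slots_used, logical_cpus):
    """Compare used slots with the node's logical CPU count."""
    if slots_used > logical_cpus:
        return "overbooked"
    if slots_used == logical_cpus:
        return "fully booked"
    return "has free slots"


if __name__ == "__main__":
    used = node_slots_used(SAMPLE)
    # 64 logical CPUs is an assumption taken from slots_total in the sample.
    print(used, classify_node(used, logical_cpus=64))
```

Note that this only works when the XML passed in covers a single host, as in the per-node query above; output spanning several hosts would need to be grouped by the host part of each queue instance name first.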

fgeorgatos assigned sfranky, fgeorgatos and mightybigcar May 8, 2017

fgeorgatos added the enhancement label May 8, 2017

fgeorgatos added this to the 0.9.201705XX milestone May 8, 2017

mightybigcar closed this as completed Jun 27, 2017

sfranky reopened this Jun 27, 2017
