Not Running: Insufficient amount of resource: ncpus Even Though Resources Are Available



Problem: Jobs are stuck in the queue with the comment "Insufficient amount of resource: ncpus" even though resources are available.
Troubleshooting: Start by checking the daemon log files and the tracejob output. At the default logging level these did not give any detailed information, so we increased the log events.

To increase logging verbosity in the server, scheduler, and MoM:
For the server: qmgr -c "s s log_events=2047"
For the scheduler: set log_filter to 0 in the sched_config file (/var/spool/PBS/sched_priv/sched_config)
For the MoM: add $logevent 0xffffffff to the MoM config file
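The three changes above can be applied roughly as follows. This is a sketch assuming the default PBS_HOME of /var/spool/PBS; the kill -HUP steps make the scheduler and MoM re-read their config files without a restart.

```shell
# Server: log all event classes (2047 = all event bits set)
qmgr -c "set server log_events = 2047"

# Scheduler: log everything by setting log_filter to 0
sed -i 's/^log_filter.*/log_filter: 0/' /var/spool/PBS/sched_priv/sched_config
kill -HUP "$(pgrep -x pbs_sched)"   # scheduler re-reads sched_config

# MoM: enable full event logging
echo '$logevent 0xffffffff' >> /var/spool/PBS/mom_priv/config
kill -HUP "$(pgrep -x pbs_mom)"     # MoM re-reads its config
```

Remember to revert these settings afterwards; full logging grows the log files quickly.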


While troubleshooting the problem, I checked the tracejob output for the job:
[root@hn2017 sched_priv]# tracejob 16998

Job: 16998.hn2017

11/15/2017 14:17:59  L    Considering job to run
11/15/2017 14:17:59  L    Insufficient amount of resource: ncpus
11/15/2017 14:17:59  S    enqueuing into routeq, state 1 hop 1
11/15/2017 14:17:59  S    dequeuing from routeq, state 1
11/15/2017 14:17:59  S    enqueuing into pdd, state 1 hop 1
11/15/2017 14:17:59  S    Job Queued at request of ushak@hn2017, owner = ushak@hn2017, job name = Case1_coarse, queue = pdd
11/15/2017 14:17:59  S    Job Modified at request of Scheduler@hn2017
11/15/2017 14:17:59  A    queue=routeq
11/15/2017 14:17:59  A    queue=pdd
11/15/2017 14:25:05  L    Considering job to run
11/15/2017 14:25:05  L    Insufficient amount of resource: ncpus
11/15/2017 14:25:30  L    Considering job to run
11/15/2017 14:25:30  L    Insufficient amount of resource: ncpus
11/15/2017 14:30:43  L    Considering job to run
11/15/2017 14:30:43  L    Insufficient amount of resource: mem (R: 30gb A: 0kb T: 0kb)
11/15/2017 14:30:43  S    enqueuing into pdd, state 1 hop 1
11/15/2017 14:30:43  S    Requeueing job, substate: 11 Requeued in queue: pdd
11/15/2017 14:30:43  S    Job Modified at request of Scheduler@hn2017
11/15/2017 14:30:43  L    Job will never run with the resources currently configured in the complex
11/15/2017 14:30:58  L    Considering job to run
11/15/2017 14:30:58  L    Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:58  L    Failed to satisfy subchunk: 1:ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:58  L    Insufficient amount of resource: ncpus
11/15/2017 14:30:58  S    Job Modified at request of Scheduler@hn2017
11/15/2017 14:30:59  L    Considering job to run
11/15/2017 14:30:59  L    Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:59  L    Failed to satisfy subchunk: 1:ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:59  L    Insufficient amount of resource: ncpus
11/15/2017 14:39:13  S    Asked external license server for 4 cpu licenses, got 4
11/15/2017 14:39:13  S    Allocated 4 cpu licenses, float avail global 645815, float avail local 0, used locally 101
11/15/2017 14:39:13  L    Considering job to run
11/15/2017 14:39:13  L    Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:39:13  L    Allocated one subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:39:13  S    Job Run at request of Scheduler@hn2017 on exec_vnode (cn002:ncpus=4:mem=31457280kb:lscratch=31457280kb)
11/15/2017 14:39:13  L    Job run
11/15/2017 14:39:22  A    user=ushak group=engr project=_pbs_project_default jobname=Case1_coarse queue=pdd ctime=1510735679 qtime=1510735679 etime=1510735679
                          start=1510736962 exec_host=cn002/1*4 exec_vnode=(cn002:ncpus=4:mem=31457280kb:lscratch=31457280kb) Resource_List.abaqus_lic=8
                          Resource_List.mem=30720mb Resource_List.mpiprocs=4 Resource_List.ncpus=4 Resource_List.nodect=1 Resource_List.place=shared
                          Resource_List.select=1:ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb Resource_List.software=Abaqus resource_assigned.mem=31457280kb
                          resource_assigned.ncpus=4
11/15/2017 14:58:26  S    Python spawn status 0 exit value 0
Note that the trace above spans two logging levels: the earlier entries were recorded with the default, filtered log events, and the later, more detailed entries with full logging enabled (log_filter 0 and log_events 2047).
1) I checked the tracejob output together with the scheduler logs in /var/spool/PBS/sched_logs:
11/15/2017 14:30:58;0400;pbs_sched;Node;16998.hn2017;Evaluating subchunk: ncpus=4:mem=30720mb:mpiprocs=4:lscratch=30gb
11/15/2017 14:30:58;0400;pbs_sched;Node;cn001;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn002;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn003;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn005;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn004;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn006;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn007;Job would conflict with reservation or top job
11/15/2017 14:30:58;0400;pbs_sched;Node;cn008;Insufficient amount of resource: ncpus (R: 4 A: 1 T: 28)
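PBS log lines are semicolon-delimited (date-time; event code; daemon; object type; object name; message), so individual fields are easy to pull out with awk. A small example using the cn008 line from above:

```shell
# One sched_log line copied from the excerpt above;
# fields are: date;event-code;daemon;object-type;object-name;message
line='11/15/2017 14:30:58;0400;pbs_sched;Node;cn008;Insufficient amount of resource: ncpus (R: 4 A: 1 T: 28)'

# Print just the object name and the message
echo "$line" | awk -F';' '{print $5 ": " $6}'
# cn008: Insufficient amount of resource: ncpus (R: 4 A: 1 T: 28)
```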
The log shows that on most nodes the job would conflict with a reservation or a top job, and on cn008 there are not enough ncpus (in these comments R is the amount requested, A the amount available, T the total configured). Checking the waiting jobs with qstat (the -s option includes the scheduler comment):

16868.hn2017                   vinaydh         fastq           JD_J1                --     8   224  960gb   --  Q  --
    --
   Not Running: Insufficient amount of resource: mem (R: 960gb A: 1004256164kb T: 1580972964kb)
16869.hn2017                   vinaydh         fastq           JD_J2                --     8   224  960gb   --  Q  --
    --
   Not Running: Insufficient amount of resource: mem (R: 960gb A: 1004256164kb T: 1580972964kb)
16870.hn2017                   vinaydh         fastq           JD_J3                --     8   224  960gb   --  Q  --
    --
   Not Running: Insufficient amount of resource: mem (R: 960gb A: 1004256164kb T: 1580972964kb)
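The numbers in the comment explain why these three jobs themselves cannot start: each requests 960gb, but the largest available amount (the A: value) falls just short. A quick conversion shows the shortfall:

```shell
# R: 960gb requested vs A: 1004256164kb available (values from the comment above)
req_kb=$((960 * 1024 * 1024))   # 960gb in kb = 1006632960
avail_kb=1004256164
echo "requested=${req_kb}kb available=${avail_kb}kb shortfall=$((req_kb - avail_kb))kb"
```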

PROBLEM: These jobs (16868, 16869, 16870) have been waiting for a long time. They sit in the fastq queue (priority 100), while the recently submitted jobs go to the pdd queue (priority 80). Because fastq has higher priority than pdd, the scheduler holds resources for these big jobs, so even though resources are available it does not allow other small jobs to run. Only once the big jobs run on the cluster will the other jobs be allowed through; it is essentially a bottleneck problem.
SOLUTION: Either kill these big jobs or reduce the priority of that particular queue; the small jobs will then start running.
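Either remedy is a one-liner. A sketch, assuming you want pdd jobs scheduled ahead of fastq (the priority value 50 below is just an example; anything below pdd's 80 works):

```shell
# Option 1: remove the blocking jobs
qdel 16868 16869 16870

# Option 2: lower the fastq queue priority below pdd's (80)
qmgr -c "set queue fastq priority = 50"
```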
