Checklist: Cluster Software Upgrade

What should I do before upgrading cluster software?
Pre-Upgrade
1) Take a backup of the old version:
            a) important configuration files
            b) log parameters
            c) settings such as PBS queue configuration and hook data
2) Upgrade the software.
Post-Upgrade
3) Once the upgrade is done, check that all the old parameters (queues, hooks, node details) are still configured. For example, in PAS a new installation resets FILE EXPIRATION to the default of 14 days, so we need to raise it to at least the maximum number of days, or else all the old data will be deleted (Bala's experience). NOTE: If anything goes wrong at any point, we should be ready to restore the data, or else we will lose the customer's confidence.
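
The backup step can be sketched as a small shell function. This is only an illustration under assumptions: the function name backup_pbs_config, the example paths, and the idea that hook files live under the PBS home directory are not from the notes above and should be verified for your installation. qmgr -c "print server" is used to dump the queue and server settings in a replayable form, and it is skipped when qmgr is not available on the host.

```shell
#!/bin/sh
# Sketch: archive a PBS configuration directory and dump the qmgr settings
# before an upgrade. Function name and paths are assumptions for illustration.

backup_pbs_config() {
    src_dir=$1      # directory holding the configuration, e.g. /var/spool/PBS (assumed)
    dest_dir=$2     # directory where the backup is written
    stamp=$(date +%Y%m%d-%H%M%S)

    mkdir -p "$dest_dir" || return 1

    # Archive the whole configuration directory (sched_priv, mom_priv, ...).
    tar -czf "$dest_dir/pbs-config-$stamp.tar.gz" \
        -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    # Dump queue and server settings as qmgr commands that can be replayed.
    # Skipped when qmgr is not installed on this host.
    if command -v qmgr >/dev/null 2>&1; then
        qmgr -c "print server" > "$dest_dir/qmgr-server-$stamp.txt"
    fi

    # Print the archive path so the caller can record or verify it.
    echo "$dest_dir/pbs-config-$stamp.tar.gz"
}

# Example (run as root on the PBS server host):
#   backup_pbs_config /var/spool/PBS /root/pbs-backup
```

Keeping the archive and the qmgr dump side by side makes it easier to compare the old and new settings after the upgrade, and to restore them if something goes wrong.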

Same NIS/NFS Setup, but One System Has an Issue


System setup: NIS/NFS, with home directories also configured.
Problem: When users run source <NFS-sharepath>/script_location, it works properly on one system, but on another system the same user ID gets "command not found".
NOTE: So the application side is fine; the problem is only on the system side.
How I troubleshot it
System A (has the problem)
System B (works fine)
1) # which command_name
Command is not found.
2) PATH was not set properly, so I set the proper PATH environment variable.
3) I tried again and faced the same problem. Since it is a shell script, I executed it in verbose mode and found that one particular command in the script fails with "command not found".
4) I then invoked that single command directly and faced the same problem.
5) So I checked whether that command is a binary file or a script; it turned out to be a ksh shell script. ksh was not installed on this machine, unlike on the other system. Once ksh was installed, the problem was solved.
# which command_name
Command is found.
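
The check in step 5 can be done quickly with a small helper that reads the "#!" line of a script and tests whether that interpreter exists. This is a minimal sketch; the function name check_interpreter is made up for illustration. Note that a script starting with "#!/usr/bin/env ksh" would pass this check as long as env exists, so it is only a first-pass test.

```shell
#!/bin/sh
# Sketch: report whether the interpreter named on a script's "#!" line exists.
# check_interpreter is a hypothetical helper name.

check_interpreter() {
    script=$1

    # The interpreter is named on the first line of the script.
    first_line=$(head -n 1 "$script")
    case $first_line in
        '#!'*) ;;
        *) echo "no shebang line in $script"; return 2 ;;
    esac

    # Strip "#!" and following spaces, then keep only the interpreter path.
    interp=$(printf '%s\n' "$first_line" | sed 's/^#![[:space:]]*//' | awk '{print $1}')

    if [ -x "$interp" ]; then
        echo "ok: $interp"
        return 0
    fi
    echo "missing interpreter: $interp"
    return 1
}

# Example:
#   check_interpreter "$(command -v command_name)"
```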

mount.nfs: Requested NFS Version or Transport Protocol Is Not Supported



[root@admin ~]# mount blrqa:/ATS /mnt
mount.nfs: requested NFS version or transport protocol is not supported
NOTE: The same mount works on one system but fails on another. The Linux version is the same on both.
System A (not working)
System B (working)
1) By default, System A mounts with NFS version 3. I changed the version to 4 and tried again, and then it worked.

By default, System B mounts with NFS version 4.
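
Forcing the NFS version is done with the vers= mount option. The sketch below only prints the mount command for a given version so it can be reviewed before running it as root; the helper name nfs_mount_cmd is an assumption for illustration, while blrqa:/ATS and /mnt are taken from the example above.

```shell
#!/bin/sh
# Sketch: print the mount command for an explicit NFS version.
# nfs_mount_cmd is a hypothetical helper; it does not mount anything itself.

nfs_mount_cmd() {
    vers=$1     # NFS protocol version to request, e.g. 3 or 4
    remote=$2   # server:/exported/path
    mnt=$3      # local mount point
    printf 'mount -t nfs -o vers=%s %s %s\n' "$vers" "$remote" "$mnt"
}

# Example: show the commands for version 4 and version 3.
#   nfs_mount_cmd 4 blrqa:/ATS /mnt
#   nfs_mount_cmd 3 blrqa:/ATS /mnt
# After a successful mount, "nfsstat -m" shows which version is in use.
```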

Not Running: Insufficient amount of resource: ncpus Even Though Resources Are Available



Problem: Jobs stay in the queue showing "Insufficient amount of resource: ncpus" even though resources are available.
Troubleshooting: First check the log files and tracejob. These did not give us enough detail, so we increased the log events.

To increase the verbosity of the server, scheduler, and MoM:
For server: qmgr -c "s s log_events=2047"
For scheduler: set log_filter to 0 in the sched_config file (/var/spool/PBS/sched_priv/sched_config)
For MoM: add to the MoM config file: $logevent 0xffffffff
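
The scheduler change can be scripted. This is a minimal sketch, assuming sched_config already contains a line of the form "log_filter: <number>"; the helper name set_log_filter is made up for illustration. After the edit, the scheduler usually has to re-read its configuration (for example through a restart) before the new value takes effect.

```shell
#!/bin/sh
# Sketch: set the log_filter value in a sched_config file.
# Assumes an existing "log_filter: <number>" line; set_log_filter is a
# hypothetical helper name.

set_log_filter() {
    conf=$1     # path to sched_config
    value=$2    # new log_filter value, e.g. 0 to log everything
    tmp=$(mktemp) || return 1

    # Rewrite only the log_filter line; every other line is copied unchanged.
    sed "s/^log_filter:.*/log_filter: $value/" "$conf" > "$tmp" || return 1

    # Copy back over the original so its ownership and permissions are kept.
    cp "$tmp" "$conf" && rm -f "$tmp"
}

# Example (run as root on the PBS server host):
#   set_log_filter /var/spool/PBS/sched_priv/sched_config 0
```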


Then, while troubleshooting the problem, I checked tracejob for the job.
[root@hn2017 sched_priv]# tracejob 16998

Job: 16998.hn2017

11/15/2017 14:17:59  L    Considering job to run
11/15/2017 14:17:59  L    Insufficient amount of resource: ncpus
11/15/2017 14:17:59  S    enqueuing into routeq, state 1 hop 1
11/15/2017 14:17:59  S    dequeuing from routeq, state 1
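
To compare what the job asks for with what the nodes can actually offer, it helps to list the free CPUs per node. The sketch below is an illustration only: it assumes the usual pbsnodes -a layout, where each node name starts in the first column and its attributes are indented lines such as "resources_available.ncpus = 8". The helper name free_ncpus is made up.

```shell
#!/bin/sh
# Sketch: read "pbsnodes -a" style output on stdin and print the free CPUs
# per node (available minus assigned). free_ncpus is a hypothetical helper.

free_ncpus() {
    awk '
        # An unindented, non-empty line starts a new node record.
        /^[^ \t]/ { node = $1 }
        # Attribute lines look like: "     name = value", so the value is $3.
        /resources_available\.ncpus *=/ { avail[node] = $3 }
        /resources_assigned\.ncpus *=/  { used[node]  = $3 }
        END {
            for (n in avail)
                printf "%s %d\n", n, avail[n] - used[n]
        }
    ' | sort
}

# Example:
#   pbsnodes -a | free_ncpus
#   qstat -f 16998 | grep Resource_List
```

If the requested ncpus in Resource_List is larger than the free count on every node, the scheduler message is expected even when the cluster as a whole looks idle.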

Kernel Error During Nvidia Installation

# rpm --nodeps -ivh --force nvidia-gfxG03-kmp-default-331.38_k3.4.6_2.10-2
Preparing...                          ################################# [100%]
Updating / installing...
   1:nvidia-gfxG03-kmp-default- 331.38_##################### ############ [100%]
make: Entering directory '/usr/src/linux-4.4.21-69-obj/x86_64/default'
  CC [M]  /usr/src/kernel-modules/nvidia-331.38-default/nv.o
In file included from /usr/src/linux-4.4.21-69/include/uapi/linux/stddef.h:1:0,
                 from /usr/src/linux-4.4.21-69/include/linux/stddef.h:4,
                 from /usr/src/linux-4.4.21-69/include/uapi/linux/posix_types.h:4,
                 from /usr/src/linux-4.4.21-69/include/uapi/linux/types.h:13,
                 from /usr/src/linux-4.4.21-69/include/linux/types.h:5,
                 from /usr/src/linux-4.4.21-69/include/uapi/linux/capability.h:16,
                 from /usr/src/linux-4.4.21-69/include/linux/capability.h:15,
                 from /usr/src/linux-4.4.21-69/include/linux/sched.h:15,
                 from /usr/src/linux-4.4.21-69/include/linux/utsname.h:5,
                 from /usr/src/kernel-modules/nvidia-331.38-default/nv-linux.h:44,
                 from /usr/src/kernel-modules/nvidia-331.38-default/nv.c:13:
/usr/src/linux-4.4.21-69/include/asm-generic/qrwlock.h: In function ‘queued_write_trylock’:
/usr/src/linux-4.4.21-69/include/asm-generic/qrwlock.h:106:36: warning: comparison between signed and unsigned integer expressions [-Wsi
           cnts, cnts | _QW_LOCKED) == cnts);
                                    ^
/usr/src/linux-4.4.21-69/include/linux/compiler.h:165:40: note: in definition of macro ‘likely’