
Background

One of our customers runs a set of business processes that span the following two SAP technical systems:

  • SAP NetWeaver BPM Enterprise Portal
  • SAP NetWeaver MDM server

At the beginning of 2014, a new initiative was created which eventually became a new project for us, basically aimed at:

  • Adding and improving some of the features this end-to-end solution offers to its user community
  • Moving the infrastructure to newer hardware that is more efficient in terms of power consumption and space optimization, in line with the global High Density Data Centers strategy
  • Continuing to improve the solution's stability and support according to the best practices recommended by SAP, in this case as the software vendor

The plan was to move each existing environment supporting this solution (i.e. sandbox, development, consolidation, etc.) according to the following general infrastructure approach:

  • Non-productive systems would be moved to virtualized servers
  • Productive systems would be moved to physical blade servers

In both cases, virtual and physical, a new OS layer based on Red Hat Enterprise Linux 6.3 (RHEL 6.3) would also be installed on the new servers as part of the move. The solution was initially running on Red Hat Enterprise Linux 5.5 (RHEL 5.5) platforms.

To accommodate the new OS release and remain compliant with the SAP requirements published in the Product Availability Matrix (PAM), these moves also required upgrading the underlying Oracle databases. Altogether, this would put us in the right position to move a step further in the plan and technically upgrade the SAP products to SAP NetWeaver 7.31 and SAP NetWeaver MDM 7.1, including the latest support packages.

However, theory is theory and the real world is the real world. If you have been dealing with IT for some years, you know what this means.

Upon completion of the development environment moves and upgrades, the development team noticed a loss of performance, although at that time there was still some uncertainty about it. It was not until the upgrade of the test environment that the performance issue was clearly identified, once the team began testing both current and new developments.

It was at this point that some initial actions were performed on the infrastructure to try to improve performance. Among others:

  • Review and improve I/O speed
  • Increase hardware resources
  • Review OS parameters
  • Review Oracle parameters
  • Increase Oracle buffers
  • Activate Oracle HugePages according to the recommendation in SAP note 1672954 (a quick verification check is shown right after this list)
  • Reorganize most accessed Oracle tables and indexes
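
As an aside, a quick way to verify that the HugePages pool was actually allocated and picked up is to inspect /proc/meminfo (our usual sanity check, not part of the SAP note itself):

$ grep -i huge /proc/meminfo    # HugePages_Total/Free/Rsvd should reflect the configured pool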

There was a performance gain after all the above actions. Nonetheless, the overall response time of the upgraded system was still not acceptable, nor as good as the team expected.

So we got SAP support on board to help us with this issue.

In parallel, based on the development team's daily experience, the MDM server side was identified as a potential root cause of the issue. This definitely helped us, as SAP Basis experts, to narrow down the issue a bit.

They then developed a Java performance test that simulated how the SAP BPM portal called different methods of the SAP MDM API to connect to the SAP MDM server and extract information from it. It was designed to run completely independently of the SAP NetWeaver Java AS, so that we could take the SAP BPM portal fully out of the picture. It also measured and displayed the response time of each of the most representative MDM API calls.

This performance test was of great help to us, as we will see next. Additionally, after several performance test executions and comparisons with production, i.e. our reference for what should be considered acceptable, it was confirmed that the issue had to come from the SAP MDM server side.

Based on this assumption, several other actions were taken to narrow down the real root cause a little bit more:

  • On one hand, we ruled out the virtualization layer as the culprit, since the same behavior occurred on non-virtualized (physical) servers
  • On the other, we also ruled out the Oracle and MDM 7.1 upgrades, since several performance test executions on the old RHEL 5.5 servers all ended with an acceptable response time

Therefore, the immediate outcome of the above points clearly told us where the real root cause lay: it was not the virtualization layer, the SAP MDM patch level, nor the Oracle release; the issue had to be in the different RHEL releases.

SYSSTAT and SYSDIG to the rescue

We now knew that something had happened at the OS level. But the billion-dollar question was... what could it be?

Luckily, we have system tools like SYSSTAT and SYSDIG.

There are many other tools available on Linux, but we picked these two basically because:

  • SYSSTAT offers a very wide range of OS statistics in terms of CPU, I/O, memory, paging and networking
  • SYSDIG combines strace, tcpdump and lsof in a single tool with a very powerful, user-friendly filtering capability, and it is amazingly easy to extend through so-called chisels, which are nothing more than Lua scripts on top of it that let you analyze the SYSDIG event stream

Additionally, both tools offer a very easy way to run in the background while the process to be debugged is executing, and to analyze the trace data offline later on.

The idea was then simply to observe how the OS behaved while a performance test was executing on the new RHEL 6.3 server. So we ran a test execution between 17:35 CET and 17:42 CET while gathering tracing information using the tools above.
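
For reference, a background capture along these lines takes just a couple of commands. The snippet below is an illustrative sketch (file names and sampling intervals are our own choices, not necessarily the exact ones used that day):

# SYSSTAT: besides the standard sadc cron collection, take finer-grained
# samples during the test window (10-second samples for 10 minutes)
$ sar -o /tmp/perf_test.sa 10 60 >/dev/null 2>&1 &

# SYSDIG: record every OS event into a compressed trace for later offline analysis
$ sysdig -z -w sysdig.scap.gz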

We will show next how helpful these system tools can be, even for SAP administrators.

We started by looking at the overall performance KPIs gathered by the sadc daemon, in terms of CPU usage, disk I/O, memory, paging and networking throughput.

Of these, the only relevant one that showed an indication of something potentially wrong was the overall CPU usage, shown in figure 1. At first glance it did not really indicate any overall CPU shortage, since the CPU was steadily idling at around 90% of the overall compute capacity.

Overall CPU usage
Figure 1

Just to confirm that there was no CPU shortage, we plotted the run queue length over time in figure 2. Taking into account that the Linux server had 16 Intel Xeon cores, we could take a breath and finally assume our server was properly sized in terms of CPU (we would even dare to say it was oversized).

Rate of run queue length
Figure 2
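
For reference, the data behind figures 1 and 2 lives in the binary files written by sadc and can be replayed with sar. Assuming a default sysstat setup (saDD is a placeholder for the day of month), views like these would produce similar numbers:

$ sar -u -f /var/log/sa/saDD -s 17:30:00 -e 17:45:00    # overall CPU usage (%user, %system, %idle)
$ sar -q -f /var/log/sa/saDD -s 17:30:00 -e 17:45:00    # run queue length and load averages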


However, a quite noticeable fact in figure 1 was the relatively high percentage of CPU usage in kernel mode (%system) during the performance test execution. Zooming in a bit makes this even clearer (figure 3). System CPU usage measures the portion of CPU time that OS processes and threads spend in kernel mode, which is mainly due to interrupt handling or system calls. Additionally, in figure 3 we could check whether there were any other CPU usage components we could not perceive in figure 1 due to the graph scale. For instance, the portion of CPU time in user mode (%user) was quite low, and there was no CPU load in terms of interrupt handling, either hard or soft (%irq and %soft).

Overall CPU usage - zoom
Figure 3

Based on these inputs, we next focused on analyzing that apparently high percentage of kernel mode CPU consumption. So we went a step deeper and plotted the percentage of system CPU usage split by physical core (figure 4). Recall that the server had 16 Intel Xeon cores.

system usage per core
Figure 4

Figure 4 showed that some of the cores kept relatively busy executing in kernel mode during the performance test execution. As we pointed out before, figure 3 showed no remarkable CPU load for handling interrupts, which made us suspect system calls as a potential cause.
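
The per-core breakdown in figure 4 corresponds to sar's per-processor statistics; assuming the same sadc data file as before, it can be reproduced with:

$ sar -P ALL -f /var/log/sa/saDD -s 17:30:00 -e 17:45:00    # CPU usage broken down per core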

Another question we asked ourselves was whether that CPU consumption was really caused by the SAP MDM server and not by some other process or processes. Looking at figures 3 and 4, we might think at first glance that the SAP MDM server (which at OS level can be identified by the process name mds-r) was behind it, since we can clearly see the edges where the performance test started (~17:35) and ended (~17:42). However, to demonstrate that process mds-r really was the one, we used the SYSDIG trace together with the set of chisels provided by this tool.

So we started by running the chisel topprocs_cpu, which displays the top processes by CPU time during the time window the trace spans:

$ sysdig -r sysdig.scap.gz -c topprocs_cpu
CPU%      Process
------------------------------
33.17%    mds-r
24.23%    sh
8.87%     sysdig
1.55%     pidstat
1.11%     uuidd
0.95%     oracle
0.42%     collectl
0.39%     top
0.33%     bash
0.29%     agent

Note that this chisel includes the CPU time consumed by each OS process and all its child threads or tasks.

We could immediately see that our assumption was essentially right: the CPU usage in kernel mode came from the SAP MDM server. But there was also an unexpected visitor, the OS process sh, i.e. the standard OS shell, which was the second most CPU-demanding process. Together they took almost 60% of the CPU scheduling time during the performance test execution.
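
Incidentally, this kind of per-process and per-thread CPU breakdown can also be cross-checked live with SYSSTAT's pidstat (an illustrative command of our own, not part of the captured session):

$ pidstat -u -t -C mds-r 5    # %usr/%system per mds-r thread, sampled every 5 seconds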

The question now was... was this somehow related to the very bad response time of the SAP MDM server on RHEL 6.3? To answer that, we asked the SYSDIG trace again.

First of all, since the SAP MDM server (mds-r) is a multithreaded application, we figured out the PID and TIDs (thread IDs) of the mds-r processes during the MDM test execution:

$ sysdig -r sysdig.scap.gz -c ps proc.name=mds-r
TID     PID     USER        VIRT       RES        FDLIMIT   CMD
9135    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9140    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9141    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9142    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9143    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9144    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9145    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9160    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9161    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9162    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9163    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9172    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9173    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9174    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9175    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9176    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9202    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9203    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9204    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9205    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9206    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9207    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9208    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9209    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9210    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9211    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9212    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9213    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9222    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9223    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9224    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9308    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9309    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9310    9135    c5wadm      3.33P      1.20P      429496729 mds-r
9311    9135    c5wadm      3.33P      1.20P      429496729 mds-r

From the above list we could identify that the root (parent) SAP MDM server process had a PID of 9135 (i.e. the process whose PID equals its TID).
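
The same parent/thread relationship can also be checked on a live system with a plain ps call (shown here just as an illustration; the list above came from the trace):

$ ps -C mds-r -L -o pid,lwp,nlwp,cmd    # the main thread is the one whose LWP equals the PID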

Next we used the chisel topscalls_time, filtered by process name mds-r, to see the most time-consuming system calls made by any of the child threads under the SAP MDM server:

$ sysdig -r sysdig.scap.gz -c topscalls_time proc.name=mds-r | head -10
Time      Syscall
------------------------------
110.16s   clone
108ms     recvfrom
105ms     open
86ms      write
68ms      rt_sigaction
64ms      access
59ms      futex
54ms      getdents

The SAP MDM server threads were spending quite a lot of time cloning themselves to create new child processes (system call clone). But what was the purpose behind this? Let's see what mds-r was doing while calling clone.

$ sysdig -r sysdig.scap.gz "proc.name=mds-r and evt.type=clone"
1611796 17:35:05.473209914 9 mds-r (9203) > clone
1611917 17:35:05.486698498 9 mds-r (9203) < clone res=24208(mds-r) exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=9203(mds-r) pid=9135(mds-r) ptid=9134(mds.sapMD0_MDS7) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=314280 vm_size=3495004 vm_rss=1261504 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201
1611930 17:35:05.486748793 4 mds-r (24208) < clone res=0 exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=24208(mds-r) pid=24208(mds-r) ptid=9203(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=0 vm_size=3495004 vm_rss=1230668 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201
1613149 17:35:05.541253689 9 mds-r (9311) > clone
1613245 17:35:05.554007496 9 mds-r (9311) < clone res=24210(mds-r) exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=9311(mds-r) pid=9135(mds-r) ptid=9134(mds.sapMD0_MDS7) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=340888 vm_size=3495004 vm_rss=1261508 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201
1613255 17:35:05.554055706 4 mds-r (24210) < clone res=0 exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=24210(mds-r) pid=24210(mds-r) ptid=9311(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=0 vm_size=3495004 vm_rss=1230672 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201
1614966 17:35:05.674883839 9 mds-r (9209) > clone
1615075 17:35:05.688886348 9 mds-r (9209) < clone res=24212(mds-r) exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=9209(mds-r) pid=9135(mds-r) ptid=9134(mds.sapMD0_MDS7) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=320484 vm_size=3495004 vm_rss=1261508 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201
1615089 17:35:05.688931156 11 mds-r (24212) < clone res=0 exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=24212(mds-r) pid=24212(mds-r) ptid=9209(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=0 vm_size=3495004 vm_rss=1230672 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201
1616652 17:35:05.721102315 0 mds-r (9309) > clone
…
…

Firstly, the timing matched the start of the performance test execution (17:35), which means the SAP MDM server started making these system calls as soon as it got work to do.

The first three highlighted events above, #1611796, #1611917 and #1611930, show how an SAP MDM thread with TID=9203 (one of the child tasks of the root MDM process with PID=9135) cloned itself, producing a new process/task with TID=24208. Moreover, the same clone call pattern kept repeating over the whole performance test execution, even though we only show the very first events here. If you don't believe us, see figure 5, which shows the rate of new tasks created by the OS.

rate of new tasks per second
Figure 5

All these repeating clone system calls certainly contributed to the high percentage of system CPU usage.
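
Incidentally, the task creation rate plotted in figure 5 is exactly what sar reports with its -w option (again assuming the default sadc data file, with saDD as a placeholder for the day of month):

$ sar -w -f /var/log/sa/saDD -s 17:30:00 -e 17:45:00    # proc/s = tasks created per second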

Next we took the newly cloned process with TID=24208 as a sample, to see what happened once it was cloned and ready to run. First we got the raw time of event #1611930 in the SYSDIG trace:

$ sysdig -r sysdig.scap.gz -p"%evt.rawtime" evt.num=1611930
1412350505486748793

And then we asked SYSDIG what it could tell us about TID=24208 within 15ms of that event (output truncated for simplicity):

$ sysdig -r sysdig.scap.gz -p "%evt.num %evt.time.s, name: %proc.name, ppid: %proc.ppid, pid: %proc.pid, tid: %thread.tid, syscall: %evt.type(%evt.args)" "thread.tid=24208 and evt.dir=< and evt.around[1412350505486748793]=15"
1611930 17:35:05, name: mds-r, ppid: 9203, pid: 24208, tid: 24208, syscall: clone(res=0 exe=/usr/sap/MD0/MDS55/exe/mds-r args= tid=24208(mds-r) pid=24208(mds-r) ptid=9203(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=0 vm_size=3495004 vm_rss=1230668 vm_swap=0 flags=512(CLONE_PARENT_SETTID) uid=77582 gid=201 )
1611935 17:35:05, name: mds-r, ppid: 9203, pid: 24208, tid: 24208, syscall: rt_sigaction()
1611938 17:35:05, name: mds-r, ppid: 9203, pid: 24208, tid: 24208, syscall: rt_sigaction()
1611941 17:35:05, name: mds-r, ppid: 9203, pid: 24208, tid: 24208, syscall: rt_sigprocmask()
1612036 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: execve(res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24208(sh) pid=24208(sh) ptid=9203(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=11 vm_size=1176 vm_rss=20 vm_swap=0 )
1612038 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: brk(res=1ADE000 vm_size=1176 vm_rss=52 vm_swap=0 )
1612040 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: mmap(res=7F7773E63000 vm_size=1180 vm_rss=104 vm_swap=0 )
1612042 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: access()
1612044 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/MDS55/exe/tls/x86_64/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612046 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/usr/sap/MD0/MDS55/exe/tls/x86_64 )
1612048 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/MDS55/exe/tls/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612050 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/usr/sap/MD0/MDS55/exe/tls )
1612052 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/MDS55/exe/x86_64/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612054 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/usr/sap/MD0/MDS55/exe/x86_64 )
1612056 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/MDS55/exe/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612058 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=0 path=/usr/sap/MD0/MDS55/exe )
1612060 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/oracle/client/11x_64/instantclient/tls/x86_64/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612062 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/oracle/client/11x_64/instantclient/tls/x86_64 )
1612064 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/oracle/client/11x_64/instantclient/tls/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612066 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/oracle/client/11x_64/instantclient/tls )
1612068 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/oracle/client/11x_64/instantclient/x86_64/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612070 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/oracle/client/11x_64/instantclient/x86_64 )
1612072 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/oracle/client/11x_64/instantclient/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612074 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=0 path=/oracle/client/11x_64/instantclient )
1612076 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/SYS/exe/run/tls/x86_64/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612078 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/usr/sap/MD0/SYS/exe/run/tls/x86_64 )
1612080 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/SYS/exe/run/tls/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612082 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/usr/sap/MD0/SYS/exe/run/tls )
1612084 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: open(fd=-2(ENOENT) name=/usr/sap/MD0/SYS/exe/run/x86_64/libtinfo.so.5 flags=1(O_RDONLY) mode=0 )
1612086 17:35:05, name: sh, ppid: 9203, pid: 24208, tid: 24208, syscall: stat(res=-2(ENOENT) path=/usr/sap/MD0/SYS/exe/run/x86_64 )
…
…

Note that event #1612036 is highlighted above because it explains pretty well what the SAP MDM server was really doing behind the scenes:

  1. The SAP MDM server thread with TID=9203 made a clone system call to create a new child task with TID=24208
  2. The child task with TID=24208 then made an execve system call to execute the following OS command:

sh -c "/usr/sbin/uuidd -t >/dev/null 2>/dev/null"

Once again, this clone & execve call pattern repeated over and over during the performance test execution, which contributed to the number of newly created tasks shown in figure 5 and had a direct impact on the high percentage of system CPU usage.
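
To get a feeling for the per-call overhead, one can simply time a single invocation of the exact command the MDM server spawns (according to the util-linux documentation, uuidd -t requests a time-based UUID from the running uuidd daemon):

$ time sh -c "/usr/sbin/uuidd -t >/dev/null 2>/dev/null"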

If you are not convinced yet that the processes mds-r and sh were related, we can go back to the SYSDIG trace and check whether there were events generated by processes named sh that had the SAP MDM server root process with PID=9135 among their ancestors. The answer was yes:

$ sysdig -r sysdig.scap.gz "proc.apid=9135 and proc.name=sh" | wc -l
4516641

And when we added the execve system call to the filter, we saw a long list of executions of the same uuidd command shown above (output truncated for clarity):

$ sysdig -r sysdig.scap.gz "proc.name=sh and proc.apid=9135 and evt.type=execve"
1612036 17:35:05.498922922 9 sh (24208) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24208(sh) pid=24208(sh) ptid=9203(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=11 vm_size=1176 vm_rss=20 vm_swap=0
1612417 17:35:05.500185224 1 sh (24209) > execve
1613376 17:35:05.567260858 5 sh (24210) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24210(sh) pid=24210(sh) ptid=9311(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1613757 17:35:05.568702690 9 sh (24211) > execve
1615194 17:35:05.701212609 12 sh (24212) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24212(sh) pid=24212(sh) ptid=9209(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1615571 17:35:05.702471439 1 sh (24213) > execve
1616904 17:35:05.748568557 12 sh (24214) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24214(sh) pid=24214(sh) ptid=9309(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=11 vm_size=1176 vm_rss=20 vm_swap=0
1617293 17:35:05.749920607 1 sh (24215) > execve
1618420 17:35:05.784438664 2 sh (24216) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24216(sh) pid=24216(sh) ptid=9223(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1618799 17:35:05.785717490 9 sh (24217) > execve
1620159 17:35:05.852282737 11 sh (24219) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24219(sh) pid=24219(sh) ptid=9308(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=11 vm_size=1176 vm_rss=20 vm_swap=0
1620540 17:35:05.853519534 1 sh (24220) > execve
1621131 17:35:05.887054697 11 sh (24221) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24221(sh) pid=24221(sh) ptid=9210(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1621508 17:35:05.888330270 1 sh (24222) > execve
1626367 17:35:05.993531715 9 sh (24223) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24223(sh) pid=24223(sh) ptid=9203(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1626746 17:35:05.994787054 1 sh (24224) > execve
1642708 17:35:06.033695673 3 sh (24225) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24225(sh) pid=24225(sh) ptid=9205(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=11 vm_size=1176 vm_rss=16 vm_swap=0
1643085 17:35:06.035070792 9 sh (24226) > execve
1654793 17:35:06.063841178 4 sh (24227) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24227(sh) pid=24227(sh) ptid=9311(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1655895 17:35:06.065124835 9 sh (24228) > execve
1672916 17:35:06.157488119 4 sh (24229) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24229(sh) pid=24229(sh) ptid=9207(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1673299 17:35:06.158880771 9 sh (24230) > execve
1673883 17:35:06.190812483 9 sh (24231) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24231(sh) pid=24231(sh) ptid=9161(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1674260 17:35:06.192099826 1 sh (24232) > execve
1674801 17:35:06.223490331 9 sh (24233) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24233(sh) pid=24233(sh) ptid=9209(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1675186 17:35:06.224751652 1 sh (24234) > execve
1677490 17:35:06.424546019 4 sh (24235) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24235(sh) pid=24235(sh) ptid=9309(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1677868 17:35:06.425974264 9 sh (24236) > execve
1679041 17:35:06.463249140 11 sh (24237) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24237(sh) pid=24237(sh) ptid=9223(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1679420 17:35:06.464599439 1 sh (24238) > execve
1680469 17:35:06.496347012 11 sh (24239) < execve res=0 exe=sh args=-c./usr/sbin/uuidd -t >/dev/null 2>/dev/null. tid=24239(sh) pid=24239(sh) ptid=9308(mds-r) cwd=/usr/sap/MD0/MDS55/work fdlimit=1024 pgft_maj=0 pgft_min=10 vm_size=1176 vm_rss=16 vm_swap=0
1680849 17:35:06.497569382 12 sh (24240) > execve
…
…

Therefore, we could conclude at this point that the SAP MDM server eventually ended up spawning an OS shell to run the uuidd command and obtain a unique id. The chain of system calls this generated, together with the fact that it kept repeating all along the performance test execution, had a direct impact on the high percentage of system CPU usage observed in figures 1, 3 and 4 above.

This was in fact a very important finding. The binary /usr/sbin/uuidd is provided by the RPM package uuidd.x86_64. Comparing RHEL 6.3 (where the SAP MDM server did not behave as expected) with RHEL 5.5, this RPM package was installed only on RHEL 6.3; it did not exist on RHEL 5.5. This obviously led us to think that the presence of this RPM package somehow made the SAP MDM server misbehave. Unfortunately, the reason for this misbehavior could not be found in the OS traces; that is something to be asked of SAP as the software vendor.

Nonetheless, after uninstalling the RPM package uuidd.x86_64 from the RHEL 6.3 server and running the performance test once again, we could prove our finding was fundamentally right. This in turn made managers and team members very happy, since we could go live with the new infrastructure without jeopardizing the project deadlines.
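
For completeness, the package check and removal were done with standard rpm commands, along these lines:

$ rpm -qf /usr/sbin/uuidd    # confirm which package owns the binary
$ sudo rpm -e uuidd          # remove the package, then re-run the performance test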
