I had received a call from a customer that their SAP system server was consuming Full CPU cores while normally the use was 5% on average. This was after an Oracle11G upgrade had taken place so it immediately pointed towards the switch from Oracle10G to Oracle11G.
I logged on once I got home. I had my kids with me which is the normal situation, my future wife drives them to grand-ma in the morning, and I pick them up in the evening. To my delight they were like little angels and I could logon without the need for superhero moves.
Searching for the root cause
The logs (SAP logs and Oracle logs) didn't show anything special and when checking the statistics in the SAP systems only showed some trace of the elevated CPU use and the slowdown caused by it.
I did check the SAP system thoroughly for clues (system administrator transactions, logging, tracing, expensive SQL statements, AWR reporting for Oracle). Too bad Database Analysis for Diagnostics scenario is not yet available (it should be available in Solution Manager 7.1).
Residing to other sources
After those checks I started looking on sources such as SAP Service Marketplace, Oracle Metalink, SCN and Google (in that order). The first search I did I used the keywords "oracle CPU" in SAP Service Marketplace. This brings up SAP Note 712624 - High CPU consumption by Oracle. A SAP Note with thirty-one points on possible causes for high CPU consumption.
I couldn't really pinpoint the cause though based on the instructions I could find on any of these sources. After some searching around on the net I couldn't find anything that was really conclusive and could give the proper root cause.
The page includes a case study of Oracle on resolving High CPU Usage on Oracle Servers (interesting content). Further down on the page you can also find SQL statements used to check CPU consumption.
Further down on the webpage mentioned previously you can find instructions to use a Windows administrative tool QSlice. It can be used to check which part of ORACLE.EXE is consuming the CPU.
This is also possible using other tools such as the Oracle Administration Assistant for Windows can also provide Process Information of individual threads.
When you open QSlice you will have a similar overview as in picture 1.1. You can see here the process on the bottom takes the CPU (oracle.exe).
To check which part of oracle.exe is consuming that much CPU you can double click on oracle.exe in the second column to view the details.
The total CPU% on picture 1.2 is actually around 50%. The reason is because I performed a stop/start of SAP + Oracle earlier on the same day before business hours which freed of the CPU power.
Now when you look at the first column in picture 1.2 you will see the TID (Thread ID). The SQL statements mentioned on the webpage used this TID to try and figure out what the thread is actually doing.
In this case TID 13e0 and 1778 are consuming CPU power. The notation here of the TID is Hexadecimal, to be able to use it in SQL statements. First convert the TIDs to Decimal format. You can do this through SQL or using a simple calculator which you Google up in two seconds.
Hexadecimal to Decimal conversion: 13e0 equals 5088 1778 equals 6008
You can see now we have identified the Oracle process ID (PID = 49) and the Windows thread id in decimal format (SPID = 5088).
You can see now for the second thread ID in QSlice we have identified the Oracle process ID (PID = 50) and the Windows thread id in decimal format (SPID = 6008).
In the source mentioned earlier there are also SQL statements to find out which Oracle session is related to a certain process. However since there were no abnormalities in SAP (transaction DB02) I didn't expect any return. I did perform the SQL statement to see which session is connected but I got a "no rows returned" as result.
Since I still didn't really have enough information I wanted to go a step further and dump what Oracle was doing. You can find instructions to perform those kinds of actions on the same source mentioned in this blog.
Now you can create a trace of what these processes are doing:
Enable the trace by setting the following event:
Disable the trace again after a little bit of time with the following statement:
This dumps a trace file to the location which is specified by parameter user_dump_dest. To check the location where the trace file is written you can show the parameter value:show parameter user_dump_dest;
I enabled the trace and let it run for some minutes. I went to check the content of the trace file; to my surprise it was empty. The only additional information I had found is the fact that it concerns shadow process since it's visible when you perform the oradebug command in the ORACLE.EXE (SHAD).
An Oracle shadow process is typically created when a connection towards Oracle is made. SAP also uses the mechanism. I was also communicating with the IT department of my customer as good communication and status updates etc aid the cause of finding the proper solution.
Since the CPU usage was 0% upon startup and it went up to 25% after a while, stayed there and then bumper up to 50% and after some more time 75% and eventually 100% it seemed the issue was triggered by a periodic job.
To blog or not to?
I saw the serious discussion on the link blog earlier today so I had to build in an interval at this point. I love story telling and for me blogging works best for me, I cannot write this way if I would have to write a white-paper.
I love troubleshooting, yes it can bring additional work hours and yes it's a challenge and people will be anxious but for me troubleshooting is creating art (hence my blog the art of troubleshooting). Since I wanted to give it some more momentum I decided to draw a small piece of art (see picture).
Around 9PM my future wife arrived at home and started to give comments on the crime scene that was left behind by the kids inside the house. I did let the little angels watch DVD movies, play around, eat and drink, colour in colouring books and probably also on colour on items they shouldn't colour on.
I waved it off by saying "This is IT". It sounded a lot like the one-liner from 300 "This is Sparta". The only problem was mine didn't have any impact.
Then I probably skipped half of the conversation and I answered without knowing what I answered but I was focused on the greater cause. Finding the root cause and solving the issue. Most likely I promised to clean up the mess so I could continue. Apparently I tend to talk and answer in my sleep also. Good thing my brain is always alert for trick questions.
This is IT
Alright I'm tilting towards too much non technical content so back to the problem. Where were we? Yes so I called the customers IT persons and discussed what I had seen so far. So now we were looking for a job or program that would trigger at certain timings but outside of SAP since there was no single trace that SAP itself (SAP ERP) was involved in any way. While I was talking on the phone I saw another CPU core being fully utilized in QSlice so I had an exact timing. With that timing and the traces of the other timings the customer was able to find out that it was a Cognos job that was triggered on that timing.
It makes sense once the puzzle is complete
Since we wanted to be sure, he launched a job manually and we saw the CPU bumping up another 25% so we had found the root cause of the issue. After checking it seemed the Oracle client used by Cognos was too old (or the Cognos software was already too old) and when the Oracle client connected to Oracel11G, on the Cognos side, it throws an error and closes down and on Oracle11G side the shadow process keeps using CPU.
This is great
By the time the test was done it was 11:15pm. The great thing about the whole story is how we (customer and me) teamed up. The mission was accomplished and the customer was very happy that the root cause was found. Since there were persons working on SAP at that timing I set an alarm to wake up at 4:45am again to stop/start SAP+Oracle to free up the resources (it had been running at 75% CPU while I was sleeping). A few twitters the next morning are traces of the excellent teamwork that was done.
I also want to point out that system administration contains lots of area's (certainly not narrowed down to only SAP software).
I graduated as a developer and as a developer, having to use coding from can be a challenge. In the past I have spend time on building a modification for Unreal Tournament 2004 (PC game). To be able to create modifications, add content and so on, you first have to understand the coding that exists to be able to reuse parts and code new parts.
This is also very often the case as a system administrator, we have to gather the pieces of the puzzle and put them together. If a SAP application fails, the administrator is contacted and he has to find out what caused it. It's possible it doesn't have anything to do with a problem caused by the SAP software itself.