Recently, Gene Kim released his second novel, "The Unicorn Project", and even gave it away for free during its launch. It is a follow-up of sorts to "The Phoenix Project". There are some subtle differences in the storyline, though: "The Phoenix Project" focused on the IT operations side, while "The Unicorn Project" is more about the application side of DevOps.
Figure 1. The Phoenix Project and The Unicorn Project
What rings true and resonates with me, though, is his fast-paced and incredible story of the disasters that can happen in production systems. Reflecting on my experience over the past twenty-odd years, I would like to share my worst SAP production outage disaster and then invite you to share yours in the comments section.
First, though, the mission criticality of a production system depends on the business process it serves for the organization. For example, an SAP CRM (Customer Relationship Management) and ISU (Industry Solution-Utilities) Billing system would be considered highly mission critical, as any outage would be disastrous and would impact downstream billing runs and customer interactions for service officers.
On the other hand, an SAP HR module running payroll might be unavailable for two days and still not have any significant business impact, because payroll runs only every two weeks. However, the integrity and validity of its transactions need to be 100% accurate. Recall the episode in The Unicorn Project where Gene described Parts Unlimited's failure to pay its part-time workers because of null entries posted to its payroll database.
Therefore, it is useful to understand the categories of IT outages and their key causes in general. Here I refer to Uptime Institute's data on the number of reported IT outages, their categories, and the impact of service outages.
These are self-explanatory, and Uptime Institute has these key findings:
In 2018, most publicly reported outages (79%) fell into Categories 1 to 3.
From a trend-line perspective, there are some significant drops. The proportion of Level 5 outages (severe, business-critical outages) dropped from 9% to 4% between 2016 and 2018.
The proportion of Level 4 (serious) outages also dropped significantly, from 47% in 2016 to 17% in 2018.
Figure 4: Breakdown of Outages (Source: Uptime Intelligence, January 2019)
There are two reasons for this:
"the reporting of outages, in social media and then picked up by mainstream media, is increasing as more people are affected (due to higher adoption of public cloud, SaaS, and managed hosted services) and it is easier to spread the news, even about smaller outages; and
IT-based outages, which are now more common than full data center outages, are more likely to be partial and, while certainly disruptive, can often have a lower impact than a complete data center outage, which may affect all applications and create cascading effects"
Uptime further uncovered:
"However, when viewed in context, the combination of facilities issues (power, cooling, fire, and fire suppression) are still the biggest cause (at 32%) of outages. In addition, many of the failures classed as IT and network were actually caused by a power-related problem at a single component, system or rack level – which we have not generally classed as a data center power problem."
Figure 5: Causes of Outages (Source: Uptime Institute)
This implies that up to 60+% of IT outages may have a power-related root cause at the data center, system, or component level! This is probably one of the key reasons why more and more organizations are keen to migrate to public hyperscale cloud platforms like Microsoft Azure.
My Worst "Phoenix" type of SAP Production Outage Experience
This disastrous incident is seared permanently into my memory, and I recall it vividly now. It happened sometime in the early 2000s and was widely reported at the time because it had a significant business impact on hundreds of thousands (maybe even over a million) of customers (I am leaving out dates, places, and names).
I was on standby as part of my job as a mission-critical SAP technical support consultant for a hardware vendor at that time. I was activated by my manager and told to get on-site immediately. Upon arrival, I met up with the customer; an Active Global Support consultant from SAP (the company) was already there troubleshooting. So I joined in to perform the analysis, and that took us through the night.
Apparently, the SAP production system was generating thousands of ABAP dumps every minute. It was extremely weird, because every single transaction seemed to be dumping with many errors. This was a 4.5B system, and I remember that the SAP Change and Transport System (CTS), or STMS, was more basic at that time. Many customers still had a hangover from the SAP 3.x/4.0 era and were still using tp commands because they were not yet comfortable with the GUI for transports.
This organization, though, had been using STMS for some time, and apparently the operator had clicked one of the icons in Figure 6.
Please take a guess in the comments section: which is the wrong button, (1) or (2)?
Figure 6: SAP Transport System icon
By clicking one of those transport truck buttons, the operator had apparently caused STMS to overwrite the production system. The issue was that, over a period of years, no housekeeping had been done on the tp buffer and transport directory. Theoretically, the transport requests in the buffer queue should be applied in sequence, each overwriting the last. But because there were thousands of them, many were faulty, and there was no way to revert the damage done. The application was now broken, and no business users could use it.
Management declared a disaster by morning and the Disaster Recovery Plan (DRP) was activated. However, because this was an application error, there was no point in failing over to the secondary site: the issue had also been replicated to the warm standby instance there.
This production system was 12 TB in size (considered very large in the early 2000s; I believe it was the largest in the country/region). Backups were still on LTO tapes, and a full database restoration was necessary. LTO was considered quite advanced then, but restoration was still very slow because there were too many small data files and we could not achieve a high restore throughput.
It took nearly 2 days (I cannot recall the exact timing, but my colleagues and I took turns) for us to completely restore and recover the whole SAP production system. The customer took another half a day to perform all the application post-checks before management made the call to release the system back to the business users. In the end, around 3 business days were lost, resulting in a significant business impact (I have no quantifiable metric of the business loss).
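As a rough back-of-the-envelope check, the figures above imply a fairly modest effective restore throughput. The sketch below assumes decimal units (1 TB = 10^12 bytes) and exactly 48 hours for the restore and recovery; both are assumptions, since the exact duration is not recalled, and the function name is purely illustrative.

```python
def effective_throughput_mb_per_s(size_tb: float, hours: float) -> float:
    """Average restore rate in MB/s, using decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes)."""
    size_bytes = size_tb * 1e12
    seconds = hours * 3600
    return size_bytes / seconds / 1e6

# Assumed inputs: a 12 TB database and roughly 48 hours of restore and recovery.
rate = effective_throughput_mb_per_s(12, 48)
print(f"{rate:.1f} MB/s")  # 69.4 MB/s, which is 250 GB per hour
```

This is an average over the whole window, so it also absorbs tape changes, log recovery, and any pauses; the sustained rate while tapes were actually streaming would have been somewhat higher.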
This incident happened nearly twenty years ago, but it remains as clear in my mind as if it were yesterday. In hindsight, this human error was totally preventable:
Basis housekeeping should have been performed regularly on the tp buffer and transport directory
Operations team procedures should have been reinforced through training and strictly followed
SAP Notes should have been applied to STMS to hide the unnecessary buttons
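The housekeeping lesson can be illustrated with a small guard that inspects an import queue before anyone triggers a mass import. This is only a sketch: the QueuedRequest record, the request IDs, and the thresholds are hypothetical, and it does not call tp, STMS, or any SAP interface. It simply shows the idea of refusing an "import all" when the queue holds stale or excessive entries.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QueuedRequest:
    request_id: str   # hypothetical transport request ID
    queued_on: date   # date the request entered the import queue

def stale_requests(queue, today, max_age_days=90):
    """Return the requests that have been waiting longer than max_age_days."""
    return [r for r in queue if (today - r.queued_on).days > max_age_days]

def safe_to_import_all(queue, today, max_age_days=90, max_queue_len=50):
    """Allow a mass import only when the queue is short and has no stale entries."""
    if len(queue) > max_queue_len:
        return False
    return not stale_requests(queue, today, max_age_days)

# Illustrative data only.
queue = [
    QueuedRequest("DEVK900001", date(2001, 1, 15)),
    QueuedRequest("DEVK900002", date(2003, 6, 1)),
]
today = date(2003, 6, 10)
print([r.request_id for r in stale_requests(queue, today)])  # ['DEVK900001']
print(safe_to_import_all(queue, today))                      # False
```

The thresholds of 90 days and 50 requests are arbitrary placeholders; the point is that a queue left uncleaned for years would fail such a check long before an operator reached the import button.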
This would probably be classed as "Other/External", since it was an application transport release issue, and it resulted in a Category 5 (severe) outage. If SAP technical best-practice operational procedures had been followed, this would not have occurred.
I would like to invite you to answer which button, (1) or (2), was the wrong one to click, and to share a short story about your own worst "Phoenix"-type SAP production outage disaster.