Scale and complexity of SAP's public cloud landscape
I have written previously about the scale, growth rate and complexity of SAP's public cloud landscape. Since that article, the landscape has grown to over 15 million cloud resources and nearly 12,000 active cloud accounts in the global environment. That is a lot to keep an eye on for policy compliance.
Beyond scaling the Chef InSpec infrastructure so that we can scan all of these cloud accounts, we have a separate problem to contend with: the size, complexity and variation within SAP as an organization, from product and platform organizations to internal usage across all board areas for a wide variety of use cases. SAP has over 100,000 employees, about 25,000 of them in various development business units. Whatever the size of the environment itself, how do you make sure that you manage the organization?
Context is Everything: Data Enrichment
Regardless of the quality of scans from security tooling - which is a separate challenge in its own right, maybe for another time - when the organization is the size of SAP's, making sure that the resulting alerts end up in the right hands and get acted on is not a trivial exercise. Security tooling generally doesn't make it easy to bring organizational context into the security scans, so we do it the other way around.
Each scan - passed or failed - is associated with a particular cloud account. Within our Multi Cloud team, we maintain a database of cloud accounts that records the main responsible person, security officer and additional contacts (like distribution lists), a set of additional security attributes that tell us what the cloud account is used for, and a cost center number used for billing. This information must be refreshed at least once every six months (more frequently if needed). Otherwise follow-up processes kick in with the colleague responsible for the account, which in the most severe cases could lead to the account being locked and ultimately deleted (luckily it has never come that far!).
The cost center used for billing the account is tied via an internal system to the organizational hierarchy, which can run 7-9 layers deep. We perform a daily lookup for all the cost centers, which means that when there are organizational changes, the account metadata is updated along with them.
As part of the data pipeline, our compliance scans are enriched with the organizational hierarchy and the additional security metadata for the account, and then sent to storage and search. Because we store the data together with the enrichment, any historical analysis is based on the shape of the organization as it was at the time of the scan, not on the current organization, which is likely to have changed over longer reporting periods.
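To make the enrichment step a bit more concrete, here is a minimal sketch in Python. It is not our actual pipeline: the field names, the `ACCOUNTS` and `ORG_HIERARCHY` lookups and the `enrich_scan` function are all hypothetical stand-ins for the account database and the cost center lookup described above. The point it illustrates is that the organizational path is copied into the stored record, so later reorganizations do not rewrite history.

```python
from datetime import date

# Hypothetical in-memory stand-ins for the account database and the
# cost-center-to-hierarchy lookup. All names and values are illustrative.
ACCOUNTS = {
    "acct-001": {"owner": "alice", "security_officer": "bob",
                 "environment": "production", "cost_center": "CC100"},
}
ORG_HIERARCHY = {
    "CC100": ["Board Area A", "Business Unit X", "Team 1"],
}


def enrich_scan(scan: dict, accounts: dict, hierarchy: dict, as_of: date) -> dict:
    """Return a copy of the scan with account metadata and org path attached."""
    account = accounts.get(scan["account_id"])
    if account is None:
        # Keep the record but flag it, so unknown accounts can be followed up.
        return {**scan, "enriched": False, "enriched_on": as_of.isoformat()}
    org_path = hierarchy.get(account["cost_center"], [])
    return {
        **scan,
        **account,
        "org_path": list(org_path),  # copy: the stored snapshot must not change later
        "enriched": True,
        "enriched_on": as_of.isoformat(),
    }


scan = {"account_id": "acct-001", "control": "storage-public-access", "status": "failed"}
record = enrich_scan(scan, ACCOUNTS, ORG_HIERARCHY, date(2021, 10, 1))
```

If the hierarchy lookup changes after the record is stored (say "Business Unit X" is renamed), `record["org_path"]` still holds the path as it was on the day of enrichment.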
From this storage we move the scan data to the SAP SIEM where it is associated with other data sources, such as centrally collected public cloud audit logs, and thereby drive security incident response processes where needed.
We separately provide internal analytics and data exports for internal stakeholders, organized by board area and multiple layers into the organizational hierarchy. Since all the scan alerts are associated with the account metadata, we can further segment and slice the data to gain more insights, watch for particular trends and look for overall patterns that can lead to policy sharpening, or focus campaigns for particular alert types we see across business units, or reach out to particular teams that have challenges.
Security Analytics
I spent a great deal of time in analytics earlier in my career, so it comes naturally to provide a variety of security compliance dashboards that track our progress, segment it by board area and further down the organization, and highlight emerging problems. That includes a fair number of line and bar charts, weekly delta reporting and so on, all fairly self-explanatory.
The organization, however, is not only varied in structure. It is also varied in workloads, in cloud usage, and in the sizes of its DevOps teams and other operational support teams. Public cloud usage is pervasive throughout the company: customer-facing landscapes, QA and development environments, but also internal development, labs and sandbox environments, cloud accounts used for internal or external training and education services, and demo systems.
Product and platform teams use public cloud accounts, but also Marketing, Finance, HR. Some teams use millions of cloud resources across thousands of accounts, some have many resources in a smaller number of accounts, and others may just have a couple running an internal system.
Combine that with a landscape that is constantly growing, and you end up trying to express ratios of alerts per cloud account to normalize the alerts across the organization in a way that makes them meaningful. Consider a small rise in alerts that comes with a big rise in cloud accounts and resources deployed. Does that mean we did worse? Or did we actually do better per cloud account, so that we are really dealing with a legacy problem that needs rectifying while new growth is under control? These questions are not easy to answer, and we have to find more creative ways to surface patterns that would remain hidden in simpler stacked bar charts or top-line metrics. This gets us into the area of data visualization.
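A small worked example may help with the "did we do worse or better?" question. The numbers below are made up purely for illustration, and the split into "legacy" and "new" accounts is an assumed way of segmenting, not a description of our actual reporting.

```python
def alerts_per_account(alerts: int, accounts: int) -> float:
    """Normalize an alert count by the number of cloud accounts."""
    if accounts <= 0:
        raise ValueError("accounts must be positive")
    return alerts / accounts


def overall(snapshot: dict) -> float:
    """Alerts per account across all cohorts in a snapshot."""
    total_alerts = sum(alerts for alerts, _ in snapshot.values())
    total_accounts = sum(accounts for _, accounts in snapshot.values())
    return alerts_per_account(total_alerts, total_accounts)


# Illustrative snapshots: cohort -> (open alerts, cloud accounts).
previous = {"legacy": (1000, 500)}
current = {"legacy": (1050, 500), "new": (50, 500)}

before = overall(previous)                            # 1000 alerts / 500 accounts
after = overall(current)                              # 1100 alerts / 1000 accounts
legacy_rate = alerts_per_account(*current["legacy"])  # 1050 / 500
new_rate = alerts_per_account(*current["new"])        # 50 / 500
```

In this made-up case the total alert count rises by 10%, yet alerts per account drop from 2.0 to 1.1. Split by cohort, the legacy accounts sit at 2.1 alerts per account while the new accounts sit at 0.1, which is exactly the "legacy problem, new growth under control" situation described above.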
Examples of Security Compliance Data Visualizations
I want to share some examples of the data visualizations we use to drive compliance and greater transparency through the organization. I realize this suggests we have policy violation alerts, but I believe there is greater value in sharing how we analyze them, and how they are followed up on in the organization.
The charts below have had their labels removed, but internally they contain the names of the teams and the number of outstanding alerts that need to be addressed. The number of cloud accounts for the business unit is on the x-axis, and the number of alerts per cloud account is on the y-axis. This allows us to see patterns that otherwise wouldn't be visible.
If you don't have any alerts you don't show up in these charts, and if you have only a small number of alerts compared to your cloud accounts, you may fall off the bottom of the chart - these are not really the business units we are concerned about, other than to say they are doing a good job. The higher up the y-axis you are, though, the more we will want to reach out to the business unit.
Please note that in these charts the minimum and maximum circle sizes are fixed and mapped to the minimum and maximum values - so don't make any assumptions about our compliance rate. These are intended to be actionable internally and drive attention to problem areas, rather than make us look good.
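The fixed circle sizing can be sketched as a simple min-max mapping. This is an illustration only: the `scale_sizes` function and its default sizes are hypothetical, and I am assuming here that the value driving the circle size is the number of outstanding alerts.

```python
def scale_sizes(values, min_size=20.0, max_size=400.0):
    """Map values linearly onto a fixed range of circle sizes.

    The smallest value always gets min_size and the largest max_size,
    so the absolute magnitudes cannot be read off the chart.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: draw every circle at the midpoint size.
        return [(min_size + max_size) / 2 for _ in values]
    span = hi - lo
    return [min_size + (v - lo) / span * (max_size - min_size) for v in values]


small = scale_sizes([5, 10, 15])
large = scale_sizes([5000, 10000, 15000])
```

Both calls return the same sizes, `[20.0, 210.0, 400.0]`, even though the second data set is a thousand times larger. That is why the circles say something about relative standing between business units and nothing about absolute alert counts.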
Because of the organizational hierarchy, we can drill deeper into lower levels of business units, allowing greater and greater precision in seeing which teams are responsible for which outstanding alerts and how they are performing compared to other teams.
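As a rough sketch of how the chart coordinates and the drill-down could be computed, assume each cloud account record carries an `org_path` list (from the enrichment) and an `open_alerts` count. Both field names and the `chart_points` function are hypothetical, chosen for this example only.

```python
from collections import defaultdict


def chart_points(accounts, level):
    """Aggregate accounts by the first `level` layers of their org path.

    Returns unit -> {"accounts": x value, "alerts_per_account": y value}.
    Units without any open alerts are left out, as in the charts.
    """
    totals = defaultdict(lambda: [0, 0])  # unit -> [account count, alert count]
    for acct in accounts:
        unit = tuple(acct["org_path"][:level])
        totals[unit][0] += 1
        totals[unit][1] += acct["open_alerts"]
    return {
        unit: {"accounts": count, "alerts_per_account": alerts / count}
        for unit, (count, alerts) in totals.items()
        if alerts > 0
    }


# Illustrative data only.
accounts = [
    {"org_path": ["A", "X"], "open_alerts": 4},
    {"org_path": ["A", "X"], "open_alerts": 0},
    {"org_path": ["A", "Y"], "open_alerts": 3},
    {"org_path": ["B", "Z"], "open_alerts": 0},
]

board_level = chart_points(accounts, level=1)  # one point per board area
unit_level = chart_points(accounts, level=2)   # drill down one layer
```

With this sample data, board area "A" appears as a single point at level 1 and splits into "X" and "Y" at level 2, while "B" has no open alerts and does not appear at either level.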
What it also displays is the general shape of the alert data set. In each of these charts you see a general pattern: the more cloud accounts a team has, the better it tends to perform in terms of alert compliance. Most SAP solutions are single-tenant deployed, which favors this type of visualization (multi-tenant landscapes tend to run more resources in a smaller number of cloud accounts, and since many possible policy violations may occur multiple times in a single cloud account, they come off a bit worse in a visualization like this).
Larger teams are easier to reach, have the resources to react and the ability to adjust their deployment quickly. They have 24/7 operational support and dedicated security operations. Business units with just a few accounts in a non-technical area and internal-facing do not have the same level of support structures, naturally - if unfortunately.
We see a similar pattern when we segment further by type of environment. Note that the environment can be internal or external in the case of a Production landscape. Again you can see that those with a high number of cloud accounts do proportionally better (lower on the y-axis).
What is also interesting here is that you can see one of those business units that does particularly well but hasn't quite dropped below the chart yet: the small red dot at the very bottom right.
"It takes a Village...": Organizational Support Structures
We don't do all the data pipelining, enrichment and analytics just to make pretty data visualizations and dashboards. They are not just to raise awareness, but to be directly actionable. We make the alerts directly available to the account owners, but in addition we have a weekly reporting cadence working throughout the entire organization at multiple levels.
Of course, we expect all teams to stay on top of their alerts on a continuous basis, but the SGS and Multi Cloud organizations support that further to drive transparency and accountability through the company.
- Thursday is "reporting day": we collect a weekly snapshot that is structured by organizational hierarchy and shared with a large community of internal security experts throughout the company. It is the basis for further analysis, so everyone works off the same data set. There is a set of standard dashboards, including samples as shown above, but any unexplained change or particular trend is investigated further and, if warranted, visualized
- Friday a status overview and dashboards are collected in a weekly executive status deck, which is shared with key security leaders in the company, along with a separate executive summary
- Monday we conduct a board area delegate call with representatives from the board member offices to provide the weekly status update - these are business operations representatives, not security specialists
- Tuesday a weekly Office Hours call is conducted with the wider community of security experts, giving them the same update and providing a place to ask questions and keep up to date on ongoing security programs, releases and the roadmap for new internal security tooling and services
This information exchange is also expressed in the first illustration in this article. The community of security specialists representing various teams and business units calls their development teams to account. The board area delegate calls, extended with representatives from major business units, bring in the managerial hierarchy to ensure action is indeed taken. Executive leadership receives a weekly status update, and progress reports are included in quarterly and annual internal reporting.
Continuous Improvement
The only meaningful question in security operations is, are we effective? I am glad to say that we are. We are keeping compliance alerts under control and are driving continuous improvement, with two significant policy sharpening exercises this year and two deployments of additional preventative controls. This occurred while the number of resources deployed doubled in the three quarters since the start of the year.
It is impossible to track how effective you are if you can't measure it. It is also impossible to drive improvement if you can't get the right information into the right hands, where action can be taken. Rich security analytics that combine security alerts with organizational metadata identify new problem areas that may require special attention and may lead to policy changes or new guardrails. The enrichment of security alerts, whether in CSPM or otherwise, is also critical for triage and prioritization during security incident response. Keeping the corporate hierarchy informed through executive reporting ensures transparency and accountability through the organization, and ensures that remediation is given appropriate priority.
This comes paired with support structures for the various teams in SAP, such as our Office Hours to ask questions, stay informed on security programs and get advice, and internal tooling that facilitates the security compliance process and supports DevOps teams to "shift-left" in their DevSecOps practices.
The work is never finished in security so we continue to strive for continuous improvement. Planning for 2022 is already under way.
Finally, we were delighted and honored to receive Chef's "Continuous Compliance Champion of the Year" Award during ChefConf 2021 and talked there about our journey.
ChefConf 2021 was conducted virtually and the sessions are available for replay.