All quotes that follow are from The Art of War by Sun Tzu.


Opinions are my own.


Thoughts are welcome in comments below.


If you found this useful and know folks who will too, give it a like and share on social media. Thanks!

“He who knows when he can fight and when he cannot will be victorious” - Sun Tzu, The Art of War


The underlying mathematics of machine learning models is sometimes hard even for practitioners to explain to each other. Explaining how these models work to business users is what our friends in computer programming would probably call a “non-trivial problem”. 


We’ve all ducked these curve balls in our own way. Here I share how I respond to them, with a little help from an ancient Chinese military treatise written roughly 2,500 years ago. Please throw in comments on whether you agree, disagree or would respond to your customers differently. I’m sure this is a teacup waiting for its raucous storm.



What the customer asks…


Can we use machine learning to solve this?


“In the midst of chaos, there is also opportunity”


Far too often I have witnessed ML being force-fitted to a problem because, well, FOMO. If a simple rule solves the problem, you don’t need ML. If the rule needs to be learnt and cannot be supplied upfront, then you need ML. Identify a business area that could do with help, identify a specific problem that needs solving, then ask if ML can provide an answer. Do not work through this list backwards.


I don’t have data, can you build a model anyway?


“Even the finest sword plunged into salt water will eventually rust.” 


Yes, but we have to be prepared for the fact that the results may not necessarily be aligned to your business. For a logistics company that needs to identify damaged packages on its conveyor belt, we can train with labelled images batch-downloaded from the net. With that sample data we can establish the possibility that this approach could save the human manning the conveyor belt some time. But we can’t be sure the model would work just the same on the shop floor unless we fit a camera over the conveyor belt and get the actual images, with their uniquely local flaws of lighting, angle and jitter. Only a model trained on those images can give you a true sense of whether it might eventually work. Still, for some use cases a proof of concept can be established without actual data, if very similar data is available.


I have data, can you predict?


“He who wishes to fight must first count the cost” 


If you missed it, the key word in the previous answer was “labelled” data, the Achilles’ heel of supervised learning [1]. If you wish to predict whether a service ticket is a complaint or an enquiry (for example), you need to start with tens of thousands of service tickets where you know the truth of whether they were complaints or enquiries. This “truth” in ML speak is called a “label”. If you don’t have the label, you will need to create it manually. Most times, the conversation ends here, because manual labelling can cost jaw-dropping numbers of person days (on occasion I have seen effort estimates of 300 person days), and outsourcing labelling to service providers like Amazon Mechanical Turk raises data privacy issues for most use cases. There are start-ups you can sign NDAs with for data labelling services, but they don’t come cheap, especially if you have exotic content (say, Japanese text that needs to be labelled into classes like complaint or enquiry). So sometimes, when a plantation manager wants to fly drones over crops to identify whether they are on the verge of disease, getting data is not a challenge, but getting “labelled” data is a challenge big enough to abandon the adventure before it starts.
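
To get a feel for why labelling estimates balloon, a back-of-the-envelope calculation helps. All numbers below are hypothetical and serve only to illustrate the arithmetic:

```python
# Hypothetical labelling-effort estimate; every number here is an illustrative assumption.
tickets = 50_000          # service tickets that need a manual label
seconds_per_ticket = 45   # time to read one ticket and decide complaint vs enquiry
hours_per_day = 8         # productive hours per labeller per day

person_days = tickets * seconds_per_ticket / 3600 / hours_per_day
print(f"Estimated effort: {person_days:.0f} person days")  # roughly 78 person days for a single pass
```

Double-label a portion for quality control, or work in a language that needs specialist annotators, and the number climbs quickly.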


How good will the results be?


“Thus the good fighter is able to secure himself against defeat, but cannot make certain of defeating the enemy.” 


We can never know how good the model results will be until we build and run the models. Even after baseline results have been established, we cannot comment on how much improvement subsequent tuning will bring. This response is never well received, but it is the unfortunate truth of ML adventures. While this conversation sits well with research-driven teams, it often goes south with IT teams that are used to fixed success criteria in tech projects. It takes considerable explanation to eventually land on success criteria defined around the number of iterations or sprints dedicated to improving results.


How much data is good data?


“Hence, though an obstinate fight may be made by a small force, in the end it must be captured by the larger force.” 


There is no straightforward answer to this, and the short answer is: the more the better. But if you insist on knowing a minimal sample size, research generally tilts in favour of events per variable (EPV), notably EPV >= 10 [2]. Simply put, if you use 3 predictors to predict salary (say Years of Experience, Gender and Education), you need data for at least 30 people in your model. There is extensive research on the minimum data size at which performance is reliable (some of it listed in the references below). Practically, I see no reason to plumb the bottom of the data pit if a business is serious about its ML model. The more reasonable starting point is to give as much data as you can, build the model and test the results for robustness with techniques like bootstrapping. If the model appears sufficiently robust, the discussion can march onwards.
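
As a sketch of what “test the results for robustness with bootstrapping” can look like in practice (the dataset, model and number of resamples below are my own illustrative choices, not a prescription):

```python
# Sketch: resample the test set many times and see how much the score wobbles.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # toy stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

scores = []
for i in range(200):  # 200 bootstrap resamples of the test set
    X_b, y_b = resample(X_test, y_test, random_state=i)
    scores.append(accuracy_score(y_b, model.predict(X_b)))

print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

A wide spread across resamples is an early warning that the data on hand is too thin for the conclusions being drawn from it.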


Will the model improve over time?


“He who accomplishes triumph by adjusting his strategies to the circumstance of the adversary can be divine.”


The model can be set up to change over time and adjust its strategy to the changing nature of data, just as military strategy must change to suit the foe. But whether that strategy will improve, we cannot be certain. Certain models, like those that learn user behaviour, will improve with usage, because over time the data the model is trained on becomes hyper-personal. For example, the more you use autocorrect, the more the model understands what you typically misspell, and predictions improve. On the other hand, if we are forecasting the price of raw material for a procurement manager, the data being used for training does not become more relevant over time. Such models require constant monitoring by MLOps teams to see if performance is deteriorating (aka model and data drift) and require redesign as time passes.
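
A minimal sketch of what monitoring for data drift might look like, here as a two-sample Kolmogorov-Smirnov test on a single feature. Real MLOps pipelines use dedicated tooling and more than one check, so treat the simulated data and threshold below as illustrative assumptions:

```python
# Sketch: flag drift when the prices seen in production no longer look like the training data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_prices = rng.normal(loc=100, scale=10, size=5000)   # raw-material prices at training time (simulated)
recent_prices = rng.normal(loc=115, scale=12, size=500)   # prices observed recently (simulated shift)

stat, p_value = ks_2samp(train_prices, recent_prices)
if p_value < 0.01:  # threshold chosen for illustration
    print(f"Possible data drift (KS statistic = {stat:.2f}); time to investigate or retrain.")
else:
    print("No significant drift detected.")
```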



What the customer doesn’t ask, but we need to ask


Can we fix how we judge the results?


“If he sends reinforcements everywhere, he will everywhere be weak.”


For some ML problems, there is an obvious choice of metric to judge how good the results are. For others, how you judge the results can be a tad convoluted [3]. Depending on the problem statement, a list of metrics will materialise: accuracy, precision, recall, AUC, Kappa, F1 score, and the list goes on, and we’re only talking about one type of model, classification, here. You will not have one model that simultaneously aces all its tests. The business user will have to choose the metric they care about most, and we can work towards a winning model that scores highest on that metric. If this goalpost shifts along the way, there will be re-work and inefficiencies. To help the business user choose a metric, it often helps to explain the options simply, with examples, visually. A refresher on this will probably be needed each time model results are presented.
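
To illustrate why one metric has to be picked, the sketch below scores the same (made-up) predictions against several common classification metrics; the numbers do not have to agree, so “the best model” depends on which one you decide to optimise:

```python
# Sketch: one set of predictions, several metrics, several different verdicts.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, cohen_kappa_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]                       # made-up ground truth (imbalanced)
y_pred  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]                       # made-up hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.7, 0.8, 0.9, 0.4]   # made-up predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
```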



What nobody questions, but I’d like to clarify


Does the machine think like a human does?


“To know your enemy, you must become your enemy”


Far too often, in an attempt to explain ML, we try to humanise it. We say with dramatic flair that the machine “thinks” and “predicts”. Anthropomorphising is, after all, a powerful teaching tool. It is far easier to understand a complex idea when you attribute human characteristics to it. When we visualise it like this, it does indeed look like the machine is strategising. Nonetheless, I will offer that a machine does not think. It calculates - adds and multiplies numbers at scale - and arrives at other numbers. As humans we attach interpretations to these numbers, which lead to decisions and, on occasion, strategy. Of course, it catches the fancy of a generation when we say neural networks are designed with inspiration from the human brain. I will concede there is a poetic connection here, seeing as nodes of the network are connected like the human brain’s neurons, and if a node fires, the final decision can swing, much as it does with neurons. But the similarity, the way I see it, ends there. I expect this one to stir our teacup a fair bit, so do drop your thoughts into the comments.
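
To make “it calculates, it does not think” concrete, here is the entire job of a single node in a neural network; the inputs and weights below are arbitrary numbers chosen only for illustration:

```python
# Sketch: one artificial "neuron" is a weighted sum pushed through a squashing function.
import math

inputs  = [0.5, 1.2, -0.3]    # incoming numbers (arbitrary)
weights = [0.8, -0.4, 0.6]    # numbers adjusted during training (arbitrary here)
bias    = 0.1

weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
output = 1 / (1 + math.exp(-weighted_sum))  # sigmoid activation: another calculation, not a thought

print(output)  # just a number; any meaning is attached by us afterwards
```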



In conclusion


“Build your opponent a golden bridge to retreat across.”


The key lesson I carry with me from my days in consulting is this: no matter what question you are asked, it is far better to start your response with “Yes, if” than “No, because”. Can you lift the earth? “No, because it’s not possible” vs “Yes, if you give me a lever long enough”. Some would say there is no difference in the response; after all, both mean “No”. But after all these years, I can say with conviction: there is a world of difference in how the answer is received.



Additional notes


[1] With supervised learning we know what the output was for previous inputs, and we want the machine to figure out a rule that connects the inputs with the expected output. This approach is based on the belief that history is a good indicator of what to expect from the future. For example, if we want to predict whether a deal will close or not, we train the machine with the outcomes of prior deals to identify a pattern between deal attributes and deal outcome. With unsupervised learning, we don’t know the output labels and would like to structure the data so that a label can be stuck on data points that exhibit similar behaviour. This approach is based on the belief that birds of a feather flock together. For example, if we want to segment customers by behaviour, say Loyalists, Churn risk and so on, we train the machine with attributes of a customer, allow it to cluster similar people together, observe the groups created and stick a label on them.
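
To show the difference in code rather than prose, the sketch below trains a supervised classifier (labels supplied) and an unsupervised clusterer (no labels) on toy data; all names and numbers are illustrative assumptions:

```python
# Sketch: supervised learning needs labels; unsupervised learning finds its own groups.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, random_state=0)  # toy "customer attributes" and known groups

# Supervised: hand over the known outcomes y and learn a rule from attributes to outcome.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted outcome for one record:", clf.predict(X[:1]))

# Unsupervised: no outcomes given; the algorithm clusters similar records, we name the clusters later.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assigned to the same record:", clusters[0])
```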


[2] Literature on Events per Variable




  • Concato J, Peduzzi P, Holford TR, Feinstein AR (1995) Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol 48: 1495–1501.

  • Peduzzi P, Concato J, Feinstein AR, Holford TR (1995) Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 48: 1503–1510.

  • Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR (1996) A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49: 1373–1379.

  • Vittinghoff E, McCulloch CE (2007) Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 165: 710–718.

  • Courvoisier DS, Combescure C, Agoritsas T, Gayet-Ageron A, Perneger TV (2011) Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure. J Clin Epidemiol 64: 993–1000.


[3] For some problems, there is an obvious choice of metric to see how good the results are. For example, if you are identifying invoice numbers from images of invoices, you would look at accuracy. This reads like an exam result: 8 out of 10 right, so 80% accuracy achieved. For other problems, how you judge the results can be a tad convoluted. For example, if you are building a model to catch spam mails, accuracy would be pointless. Your model might conservatively say nothing is spam, get 95% accuracy and fail at the one job it had, which was catching the 5% of spam that came in. So you change tack and ask “of all the mails I predicted as spam, how many did I get right” (aka precision) or “of all the mails that were truly spam, how many did I catch” (aka recall). As the problem context changes, different performance metrics will need to be considered.
