
The aim of this post is not specifically to shed more light on what went wrong before the launch of SCN, though I do promise to get there. In fact, I hope to do much more: inspire some thinking about designing load tests for any big, complex system. I've been through a few of these, and I guess many of you have as well.
One thing that always strikes me, either when I present my findings or when I read about others' experiences, is this aura of "OMG, SCIENCE AT WORK! HYPOTHESES, ISOLATING VARIABLES, STATISTICS... FEAR, ALL YE PRODUCT PEOPLE!". I admit it's kinda satisfying, as an engineer, to bask in that light. Look at the details, however, and an intricate layer of reasoning, switchbacks, convenient omissions and the like appears, which makes the whole thing read more like a novel (and a bad one at times). Why does that happen?
The Unknown
One major reason, I think, is the sheer number of unknowns you're facing. Even when you have the legacy of an existing system such as the old SDN, numerous questions appear. Here are but a few - some are easy, some are hard.
I could go on in this vein forever, but I guess a pattern does emerge: there's just a lot you don't know, and nobody's gonna help you. So you try to strike a balance which FEELS right - to you.
You Don't Have Enough Time - And This Will Never Change
An excuse? For sure, but also kind of a given for load testing, because of the golden rule of load tests: "by the time the system is mature and stable enough to test, it's time to deliver already". You can pathetically try to stress a work in progress, but you'd be hammered on all sides by the bugs, and couldn't compare last week's results to this week's anyway. In all probability you're also busy building that system (someone has to do it), hoping that your solution scales as planned - but you don't really know, except for some synthetic micro-benchmarks, which a lot of people won't even bother to write.
This also relates to a rant about optimization: too many people seem to think that optimization is the choice between ArrayList and LinkedList when their list is 5 items long. They don't realize that performance is usually born of architecture. If the system is well-built, then it either scales already or can be fixed to scale. For our context, this probably means that during the happy development phase, many people won't know how to code for performance or what to test anyway when it comes to this dirty issue.
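To make that concrete, here is a deliberately naive timing sketch of the kind of micro-benchmark people skip (a serious comparison would need a harness like JMH; take the numbers with a large grain of salt). At 5 elements, the ArrayList-vs-LinkedList difference disappears into the noise of everything around it - which is exactly why this is not where performance lives.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TinyListTiming {

    // Add 5 elements and scan them, repeated many times; returns elapsed nanoseconds.
    private static long timeInsertAndScan(List<Integer> list, int iterations) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            list.clear();
            for (int j = 0; j < 5; j++) {
                list.add(j);
            }
            for (int v : list) {
                checksum += v;
            }
        }
        long elapsed = System.nanoTime() - start;
        if (checksum == Long.MIN_VALUE) {
            System.out.println(checksum); // keep the JIT from discarding the loop entirely
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int iterations = 1_000_000;
        System.out.println("ArrayList:  " + timeInsertAndScan(new ArrayList<>(), iterations) + " ns");
        System.out.println("LinkedList: " + timeInsertAndScan(new LinkedList<>(), iterations) + " ns");
    }
}
```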
My & Your Tests are Static, Reality is Not
Given the first rule - the inherent lack of time - here is one thing we tend to miss, and I think we missed it here.
You think you came up with some use cases which describe "a day in the life" of your system. In reality, however, there is always change. Sounds like New Age talk? Well, you'd better believe it. We based our scenarios on users that pretty much know how to get to stuff - as many did in the latter SDN days. At launch, however, our users did not know how - for various justified reasons which have nothing to do with performance per se. And so they went to the content browser and clicked on just about every possible combination, with or without search terms, trying to FIND THAT CONTENT ALREADY. A usability issue (which needs better tooling on our side, and may fade as users grow accustomed to the current navigation) thus became a performance issue - because it happened to generate lots and lots of unique queries that are really hard to cache on any level. Granted, this was not the only feature whose usage patterns we failed to account for, but it's enough to have just one that hurts - and then it doesn't really matter that you found ten other such biggies just before launch (why just before launch? see again: "You don't have enough time", etc.).
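For illustration only - the endpoint and parameter names below are hypothetical, not our actual test code - this is the shape of the gap: a scripted scenario replays a handful of well-known URLs that any cache will happily absorb, while launch-day browsing produced near-unique query combinations that no cache layer could help with.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Random;

public class BrowsingLoadSketch {

    // Hypothetical search terms and content filters, just to generate combinations.
    private static final List<String> TERMS = List.of("abap", "hana", "workflow", "mobile", "idoc");
    private static final List<String> FILTERS = List.of("blogs", "documents", "discussions", "all");
    private static final Random RANDOM = new Random();

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // The scripted "day in the life": a few fixed URLs, trivially cacheable.
        client.send(HttpRequest.newBuilder(
                        URI.create("http://example.test/content?filter=blogs&term=abap")).build(),
                HttpResponse.BodyHandlers.discarding());

        // What launch-day users actually did: every combination of filters, paging and
        // free-text terms, each one a nearly unique query that misses every cache.
        for (int i = 0; i < 100; i++) {
            String url = String.format("http://example.test/content?filter=%s&term=%s%d&page=%d",
                    FILTERS.get(RANDOM.nextInt(FILTERS.size())),
                    TERMS.get(RANDOM.nextInt(TERMS.size())),
                    RANDOM.nextInt(1000),
                    RANDOM.nextInt(20));
            client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.discarding());
        }
    }
}
```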
You (and I) are Doing it Wrong
The final point, for today at least, is that even if you come up with a brilliant test set, you probably use your tools in a way that doesn't REALLY match the real world. One case in point: AJAX requests, such as the "More Like This" query on content pages, which brought the system to its knees on its first day live. Unless you have a grid of computers at your disposal just aching to act as your test clients, it is much more feasible to fetch only the HTML content of pages instead of running a real browser - which has this nice feature of loading and running not just the static resources but all the JavaScript as well. Instead, you look at the page, see what the "important" AJAX calls are, and mimic those directly. Miss one "important" call, as happened here, and it can blow up in your face.
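A minimal sketch of that approach, with hypothetical URLs (this is not our actual harness): the test client fetches the page HTML and then replays the AJAX calls someone spotted by hand. Any call missing from that hand-made list - a "More Like This", say - gets zero load in the test and full load on launch day.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class ContentPageScenario {

    // The "important" AJAX calls, curated by eyeballing the page. Hypothetical paths.
    private static final List<String> AJAX_CALLS = List.of(
            "/content/42/comments",
            "/content/42/ratings"
            // "/content/42/more-like-this"  <- the kind of call that gets missed
    );

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Fetch the static HTML, as a plain HTTP test client would...
        client.send(HttpRequest.newBuilder(URI.create("http://example.test/content/42")).build(),
                HttpResponse.BodyHandlers.ofString());

        // ...then replay only the AJAX calls we listed. A real browser would have
        // executed the page's JavaScript and issued all of them, missed ones included.
        for (String path : AJAX_CALLS) {
            client.send(HttpRequest.newBuilder(URI.create("http://example.test" + path)).build(),
                    HttpResponse.BodyHandlers.discarding());
        }
    }
}
```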
By the way, I don't think there's a magic bullet here. I've twice heard, in the context of Cucumber/Capybara-based automated testing, that one should use HtmlUnit instead of the default Firefox driver: you get a "real" browser core with JavaScript, without suffering the burden of a full browser - and all is super-fast and well. In both instances, when the driver was switched from HtmlUnit back to Firefox just as a demonstration for me, the tests failed immediately, at which point the response of everyone involved was: "So... anyway...".
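The best you can do is make the cross-check a habit rather than a one-off demonstration. A sketch of the idea using Selenium's Java bindings (rather than Capybara, only to keep the examples in one language; the page URL is hypothetical): run the day-to-day suite against the fast headless driver, but regularly replay a page or two through a real Firefox and fail loudly if the two disagree.

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class DriverCrossCheck {

    // Load a page with the given driver and return its title, always closing the driver.
    private static String titleOf(WebDriver driver, String url) {
        try {
            driver.get(url);
            return driver.getTitle();
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        String url = "http://example.test/content/42"; // hypothetical page under test

        // Fast headless driver used for day-to-day runs (JavaScript enabled)...
        String headless = titleOf(new HtmlUnitDriver(true), url);
        // ...occasionally cross-checked against a real browser.
        String real = titleOf(new FirefoxDriver(), url);

        if (!headless.equals(real)) {
            throw new AssertionError("HtmlUnit and Firefox disagree: " + headless + " vs " + real);
        }
    }
}
```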
Before you get all depressed, I have to say the picture is not that bad: if you work at it against all odds, you somehow manage to get to production with most of the big pain points already squashed. On a major site like SCN, where people actually care (and I appreciate that every day!), you take the flak, live with it, and fix the roadblocks as fast as you can, so that we can argue about features again (which is the best possible state, I guess). Everybody knows that with internal deployments, as messy as they are, users just have to live with it until the situation is fixed... (and so it's sometimes never fixed). For us here, that is of course not an option.
Now, let's see if the patches and fixes give all of you the experience you expect. The error and response-time numbers show me a steadily improving picture, but as we now know... well, these are just numbers.