Monday, December 1, 2014

Shorts: Ideas described in fewer words; Code Coverage, Metrics, Reading and More

Most of the time I write more 'essay'-style articles, but Isaac and I have sometimes had small ideas we wanted to discuss that didn't feel big enough to post on their own.  So I'm trying this out: a series of short ideas that might be valuable but are not too detailed.  Please feel free to comment on any of these shorts or on the idea of these smaller, less essay-style posts.  If you are really excited about a topic and ask interesting questions, I might try to follow it up with another essay-style post.

Code Coverage


Starting with a quote:
Recently my employer Rapita Systems released a tool demo in the form of a modified game of Tetris. Unlike "normal" Tetris, the goal is not to get a high score by clearing blocks, but rather to get a high code coverage score. To get the perfect score, you have to cause every part of the game's source code to execute. When a statement or a function executes during a test, we say it is "covered" by that test. - http://blog.jwhitham.org/2014/10/its-hard-to-test-software-even-simple.html

The interesting thing here is the idea of linking manual testing to code coverage.  While there are lots of different forms of coverage, and coverage has limits, I think it is an interesting way of looking at it, particularly if it were ever integrated into a larger exploratory model.  Have I at least touched all the code changes since the last build?  Am I exploring the right areas?  Does my coverage line up with the unit test coverage, and between the two, what are we missing?  A tool like this would be useful for those sorts of queries.  Granted, you wouldn't know if you had covered all the likely scenarios, much less done complete testing (which is impossible), but more knowledge in this case feels better than not knowing.  At the very least, this knowledge allows action, whereas plain code coverage from unit tests, used as a metric, often isn't acted on at all.
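
As a rough illustration of the kind of query I mean (this is my own sketch, not Rapita's tool; the file names, diff, and coverage data are all made up), here is one way you might cross-check the lines a manual session exercised against the lines that changed since the last build.  In practice the covered set would come from an instrumented build or a tool such as coverage.py or gcov.

import re

def changed_lines(unified_diff):
    """Yield (file, line_number) for every line added or modified in a unified diff."""
    current_file, line_no = None, 0
    for line in unified_diff.splitlines():
        if line.startswith('+++ b/'):
            current_file = line[6:]
        elif line.startswith('@@'):
            # Hunk header: @@ -old_start,old_len +new_start,new_len @@
            line_no = int(re.search(r'\+(\d+)', line).group(1))
        elif line.startswith('+') and not line.startswith('+++'):
            yield (current_file, line_no)
            line_no += 1
        elif not line.startswith('-'):
            line_no += 1

# Hypothetical data: lines hit while a tester explored the build...
covered = {('game/board.py', 12), ('game/board.py', 13), ('game/score.py', 40)}

# ...and the diff since the last build (normally the output of `git diff lastbuild..HEAD`).
diff = """\
+++ b/game/board.py
@@ -11,0 +12,2 @@
+    rows = clear_full_rows(grid)
+    score += bonus(rows)
+++ b/game/score.py
@@ -39,0 +40,2 @@
+    return 2 ** rows * level
+    log_score(level)
"""

missed = sorted(set(changed_lines(diff)) - covered)
for path, line in missed:
    print("changed but never exercised: %s:%d" % (path, line))

Nothing in this tells you whether the scenarios you ran were the right ones, but a list of "changed but never exercised" lines is at least something you can act on during an exploratory session.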

I wonder if anyone has done this sort of testing.  Did it work?  If you've tried this, please post a comment!

Mario’s Minus World


Do you recall playing Super Mario Brothers on the Nintendo Entertainment System ("NES")? Those of you who do will have more appreciation for this short, but I will try to make it clear to all.  Super Mario Bros. is the game that made Nintendo's well-known series of Mario games famous.  In it you play a character who has to travel through a series of 2D levels with obstacles including bricks and turtles that bite, all to save a princess.  What is interesting is that fans have cataloged a large list of bugs for a game that came out in 1985.  Even more interesting, I recall trying to recreate one of those bugs back in my childhood, perhaps the most famous bug in gaming history: the minus world bug.  The funny thing is that in some of these cases, if a tester had found these bugs and they had been fixed, the tester would have tested value out rather than adding value in, at least for most customers.  I am not saying that we as testers should ignore bugs, but rather that one man's bug can in some cases be another man's feature.

How Little We Read


I try not to talk much about the blog in a blog post (as it is rather meta) or post stats, but I do actually find them interesting.  They give me some insight into what other testers care about.  My most-read blog post was about the Software Testing World Cup, with second place going to my book review of Exploratory Software Testing.  The STWC post got roughly 750 hits and the EST review got roughly 450.  Almost everyone has heard of Lessons Learned in Software Testing (sometimes called "the blue test book").  It is a masterpiece written by Cem Kaner, James Bach and Bret Pettichord back in 2002.  I just happened upon a stat that made me sad about how little we actually read books.  According to Bret Pettichord, "Lessons Learned in Software Testing...has sold 28,000 copies (2/09)".  28,000 copies?!?  In 7 years?!?  While the following assertion is not fully accurate and perhaps not fair, that means there are roughly 30k actively involved testers who consider the context-driven approach.  That means my most-read blog post reached roughly 3% of those testers.  Yes, the years are off; yes, I don't know if those purchased books were read or if they were bought by companies and read by multiple people.  Lots of unknowns.  Still, that surprised me.  So, to the few testers I do reach: when was the last time you read a test-related book?  When are you going to go read another?  Are books dead?

Metrics: People Are Complicated

Metrics are useful little buggers.  Humans like them.  I've been listening to Alan and Brent on the podcast AB Testing, and they have some firm opinions on how important it is to measure users who don't know they are (or how they are) being measured.  I also just read Jeff Atwood's post about how little we read when we do read (see my short above).  Part of that appears to be that those who want to contribute are so excited to get involved (or, in a pessimistic view, to spew out their ideology) that they fail to actually read what was written.  In Jeff Atwood's article, he points to a page that only exists in the Internet Archive, but it had an interesting little quote.  For some context, the post was about a site meant to create a community of forum-style users, using points to encourage those users to write about various topics.

Members without any pre-existing friends on the site had little chance to earn points unless they literally campaigned for them in the comments, encouraging point whoring. Members with lots of friends on the site sat in unimpeachable positions on the scoreboards, encouraging elitism. People became stressed out that they were not earning enough points, and became frustrated because they had no direct control over their scores.
How is it that a metric, even a metric as meaningless as a score, stressed someone out?  Alan and Brent also talked about gamers buying games just to get Xbox gamer points, spending real money to earn points that don't matter.  Can that happen with more 'invisible' or 'opaque' metrics?  When I try to help my grandmother deal with Netflix over the phone, the fact that they are running 300 A/B tests gets in the way.  What she sees and what I see sometimes varies, to my frustration.  Maybe it is the A/B testing, or maybe it is just a language and training barrier (e.g., is that flat square with text in it a button, just styling, a flyout, or a drop-down?).

Worse yet, for those who measure, these A/B tests don't explain why one variant is preferred over another.  Instead, that is a story we have to develop afterwards to explain the numbers.  In fact, I just told you several stories about how metrics are misused, but those stories were at least in part told by numbers.  On more theoretical grounds, let us consider a scenario.  Is it only expert-level mobile users who liked the B variant, while tablet and desktop users prefer A?  Assuming you have enough data, you have to ask, 'does your data even show that?'  Knowing the device is easy in comparison, but which devices count as tablets?  How do you know someone is an expert?  Worse yet, what if two people share an account; which one is the expert?  Even if you provide sub-accounts (as Netflix does), not everyone uses them, and not consistently.  I'm not saying to ignore the metrics, just know that statistics are at best a proxy for the user's experience.
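
To make that concrete, here is a small, entirely made-up illustration (my own sketch, not anything Netflix or anyone else actually runs): once you slice an A/B result by device and 'expertise', each bucket may hold only a handful of users, even when the headline numbers look decisive.

from collections import defaultdict
from math import sqrt

# Hypothetical session records: (variant, device, is_expert, preferred_it)
sessions = [
    ("A", "desktop", False, True), ("A", "desktop", False, False),
    ("A", "tablet",  True,  True), ("B", "mobile",  True,  True),
    ("B", "mobile",  True,  True), ("B", "mobile",  False, False),
    ("B", "tablet",  False, True), ("A", "mobile",  True,  False),
    # ...in real life, thousands more rows, unevenly spread across buckets
]

buckets = defaultdict(lambda: {"n": 0, "liked": 0})
for variant, device, expert, liked in sessions:
    key = (variant, device, "expert" if expert else "novice")
    buckets[key]["n"] += 1
    buckets[key]["liked"] += liked

for key, b in sorted(buckets.items()):
    rate = b["liked"] / b["n"]
    # Rough 95% margin of error for a proportion; nearly meaningless when n is tiny.
    margin = 1.96 * sqrt(rate * (1 - rate) / b["n"])
    print("%s: %d/%d liked it (%.0f%% +/- %.0f%%)"
          % (key, b["liked"], b["n"], rate * 100, margin * 100))

The headline split ("experts prefer B") can look dramatic while every individual bucket is too small to distinguish preference from noise, and that is before you even get to the harder questions of what counts as a tablet or an expert.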

2 comments:

  1. Hi JCD,
    Followed the link you posted on my blog. Thanks for the shout out.

    Your blog post, again, is *very* thoughtful. However, I'd like to offer you a perhaps different perspective on metrics. You seem to be a very deep thinker and learner, but based on the content of this post, it seems you may not yet have come across the idea of Gamification. It is a specialization under a new discipline known as "behavioral economics".

    It isn't just about metrics and testing. What these teams (who enabled your examples) did was provide a visible way for people to have fun while doing something repetitive (and potentially valuable).

    Hawthorne's principle (or the Hawthorne effect, on Wikipedia) is, in essence: "people's behavior changes in accordance with how they are being measured". So if you want a behavior change, measure it and let people see the measure. But if you don't want a behavior change and simply want to understand what the behavior *is*, don't. When to do one or the other is, again, a deep topic. Check out Game Theory as well (if interested).

    Lastly, on numbers: while I certainly agree with almost all of your post, I most definitely would not agree with your closing line. Statistics and data science are much deeper topics than you really describe. While it is true that answering "why" is often a challenge in data science, answering "what" can be quite powerful. For example, Amazon knows exactly how fast their main page needs to render before they lose money due to people switching. They did A/B testing on an isolated number of real users and artificially raised render latency until they saw a noticeable change in the users' behavior. Then they observed that change: people left the site. Following the chain further, they were able to show the $$ loss that represented. Their test team does not need to argue from a philosophical or theoretical standpoint; they only need to show that the correlation is statistically significant. So while understanding cause/effect would be *most* beneficial, simply understanding stimulus/response can give you clues to improve your hypotheses and (often) iterate to a cause.

    From personal experience, learning these techniques has been the single greatest accelerator of my personal pursuit of making customers happy via quality, more than almost anything I have learned or done in the past on a traditional test team.

    Replies
    1. I have heard of Gamification; it is a fascinating concept. It really comes from behavioral conditioning: ring a bell, feed a dog, and eventually the dog will start salivating at the bell before it gets fed. Have an achievement pop up with some points and the user will try to get more achievements. Tie it to reputation or to discounts and you can further change a user’s behavior. Make the tasks feel fun and make people feel good doing them, rather than feeling like ‘work’. Some people see it as a form of manipulation and resent it, others enjoy it, and still others just ignore it. It might even tie to dopamine and things like compulsive gambling (see the recent research around gambling and Parkinson's disease). It certainly involves culture. It even verges into the territory of free will. It’s a very deep topic indeed, but my effort was to give people small areas of thought, not my more typical deep posts.

      I don’t disagree that ‘what’ can be important, and in some cases it can be enough to ignore the why. Sometimes why doesn’t matter, but that doesn’t mean it isn’t a proxy. Perhaps the linkages are important enough that the proxy nature of the measure doesn’t matter. We don’t know if the users who left the slower site threw their mice on the ground and swore never to buy from Amazon again, or came back the next day. We don’t know the details of the individual and how they felt about the software. A policy going back to the mid 90s at a large software/hardware company has caused me to dislike them and avoid their products for the past 20 years (and no, it wasn’t Microsoft). No measure of theirs could have predicted that. Even I couldn’t have predicted it. There is statistical conclusion validity, outliers, biased sampling, practical vs statistical significance, and plenty of other possible ways to get the stats wrong. It isn’t the same thing as observing a user’s usage. You can’t see the emotions and reactions; you can only see what they did. You can’t see where their eyes went, and unless you capture all mouse movement, you can’t know where they moved their mouse looking for what they want. It is a proxy for a user’s experience. To be fair and clear, many testers are just proxies for a customer. In some cases a customer’s data is the better proxy, and sometimes the tester, being flesh and blood, is the better proxy. It just depends.

      Just as an aside, it turns out that the Hawthorne effect is not completely replicable, at least according to Wikipedia: http://en.wikipedia.org/wiki/Hawthorne_effect#Interpretation_and_criticism  I didn’t know that, even though I too had heard of the effect. I don’t pretend to be an expert in this and have not researched the criticisms, but thought you might find it interesting.

      Thanks for adding some depth to the topic. I know I’m not a data scientist and don’t pretend to be one on TV. So I’d be interested in seeing any resources you have on the topic… or even blogs you might write in the future.

      - JCD
