We live in a big data world, so you may not think much of small data. Here, we look at just how big a story it can tell.
If the world of data and analytics were a popularity contest, big data would be the homecoming queen. Search for “big data” in Google, and you’ll get 142 million results in less than a second, as well as probably more than 250 articles from the past 24 hours alone. Dozens of new technologies will help you process your big data, and you’ll find countless conference talks centered on this single topic.
“Small data,” on the other hand, is the nerdy kid in the corner. Try that same Google search with “small data,” and you’ll be met with fewer than 8 million results and maybe 15 articles from the past 24 hours, if it’s been a busy day. Where’s the love for small data? I contend that small data can make as much of an impact as its more popular friend, and I have a perfect example to support my stance.
My Small Data Diversion
First, a disclaimer: I am a millennial by age, but certainly not by behavior or interests. I am not on Twitter, Instagram, TikTok, or whatever else is popular these days. I enjoy jigsaw puzzles and crochet. I do not subscribe to any streaming services other than Disney+, and I do not listen to Taylor Swift. Lastly, I absolutely cannot stand bubbly waters. My beverage of choice will always be Diet Dr. Pepper.
Bubbly water is far from my favorite drink, but its appeal, along with the astounding 16 flavors a particular leading sparkling water brand comes in, led my friends and me to get creative. In part, we wanted to answer a couple of questions: Why was there so much variety? Which flavors do people prefer most? And in part, we wanted to beat the boredom and spend some (virtual) time together.
Enter: The 2021 Sparkling Water Tasting Tournament. To answer our questions, my friends and I decided to give each participant one can of each flavor and have everyone complete an individual bracket based on their tasting preferences. We seeded the brackets by the most recently available national sales numbers.
Throughout the week, we got on Zoom calls to debate the merits of certain flavors and to watch each other’s faces as we tried some particularly egregious varieties. The group designated me as the official number-cruncher, and I was initially only tasked with aggregating the results to reach a consensus.
But I’m a curious soul, and I thought this set of small data had a bigger story to tell than an aggregate opinion. To be fair, this data barely qualifies as small; it’s tiny. I set up SQL Server Express on my desktop and created a single table to house each bracket’s results. The resulting table has a whopping 17 columns and 15 rows. It weighs in at an astounding 0.016 MB. I’m certain smaller datasets exist, but this is a gold star example of something so minuscule that any big data fan would deem it inconsequential.
However, small data’s benefit is that its size allows for higher standards of cleanliness and for more time to slice, dice and love the data to see what it is telling you. In our world, time and money are always constraints. And whenever those are constraints, small data will have value. The time and money required to extract useful information from small data are also small. It is better to spend a bit of time and money finding the right questions to ask before investing in more significant endeavors. For example, a business could host a small focus group and analyze the results before launching into a larger A/B marketing test or some similar, more expensive campaign.
What Did I Learn From This Small Set of Carbonated Data?
Initially, I had to determine a framework by which to quantify the results. None of the participants created finalized “standings,” so I didn’t have any ranked output. Instead, each round had winners and losers. To avoid introducing bias into the results, I accepted some ties rather than forcing an order. Each bracket had a clear winner and a clear second place. It then had two flavors that didn’t advance past the final four, so I labeled those as sharing third place.
Four flavors didn’t advance past the elite eight, and I labeled those as sharing fifth place. The remaining eight flavors that lost in the very first round all tied for ninth place. Therefore, in all calculations, numbers closer to one were more desirable, and numbers closer to nine were less desirable (lower is better; higher is worse).
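The tie scheme above follows a simple pattern: a flavor's rank is one more than the number of flavors that outlasted it. Here is a minimal Python sketch of that mapping. The function name `tied_rank` and its parameters are made up for illustration and are not taken from the original SQL Server setup.

```python
def tied_rank(rounds_won: int, total_rounds: int = 4) -> int:
    """Rank for a flavor that won `rounds_won` rounds of a single-elimination
    bracket with `total_rounds` rounds (4 rounds for 16 flavors)."""
    if not 0 <= rounds_won <= total_rounds:
        raise ValueError("rounds_won must be between 0 and total_rounds")
    if rounds_won == total_rounds:
        return 1  # the bracket champion
    # Every flavor that survived at least one round longer finished ahead,
    # and there are 2 ** (total_rounds - rounds_won - 1) of those.
    return 2 ** (total_rounds - rounds_won - 1) + 1


# Map "rounds won" to the shared rank used in the article.
ranks_by_rounds_won = {wins: tied_rank(wins) for wins in range(5)}
print(ranks_by_rounds_won)  # {0: 9, 1: 5, 2: 3, 3: 2, 4: 1}
```

A first-round loser gets a nine, an elite-eight loser a five, a final-four loser a three, the runner-up a two, and the champion a one.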
Now that each flavor had a numerical result from each bracket, naturally, the first thing I wanted to know was which flavor won. In this example, there were multiple ways to crown a “winner” – each with its pros and cons, and each with its part of the story to tell. Since each flavor had a numerical rank, the simple first step was to average each flavor’s results across all brackets and call the flavor with the lowest overall average the winner.
In this case, Key Lime came out on top with an average rank of 4.3. But there is another way to slice the numbers to reveal more of the details in the story. A flavor could also come out on top if the most participants chose it as their favorite. By that measure, Mango and LimonCello were each selected by three participants (20 percent) as the best flavor.
At this point, I noticed a few flavors floating to the top of the list – Key Lime, Mango and LimonCello were relatively popular. To round things out, I needed to have information about the bottom of the stack, too. Which flavors fared poorly? The worst possible average rank for any flavor would be nine, which would happen if no participant took the flavor out of the first round.
In this dataset, the two flavors that scored the worst were Pasteque with an average of 8.1 and Passionfruit with 8.3. Yikes! Even after just these three quick exercises, I could easily tell that if I had every flavor available to me (and no Diet Dr. Pepper), I would have at least some flavors I’d willingly try before others – and some you’d have to bribe me to sample.
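Both ways of crowning a winner take only a few lines of code. The sketch below uses a made-up four-flavor, three-person bracket (not the tournament data), and the names `toy_brackets`, `average_ranks` and `first_place_counts` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy brackets, NOT the tournament data: participant -> flavor -> rank.
# With only four flavors, the possible ranks are 1, 2, 3 and 3.
toy_brackets = {
    "Ana":  {"Lime": 1, "Mango": 2, "Lemon": 3, "Peach": 3},
    "Ben":  {"Lime": 2, "Mango": 1, "Lemon": 3, "Peach": 3},
    "Cara": {"Lime": 1, "Mango": 3, "Lemon": 2, "Peach": 3},
}


def average_ranks(brackets):
    """Mean rank of each flavor across all brackets (lower is better)."""
    flavors = next(iter(brackets.values())).keys()
    return {flavor: mean(b[flavor] for b in brackets.values()) for flavor in flavors}


def first_place_counts(brackets):
    """How many participants picked each flavor as their bracket winner."""
    return Counter(
        flavor
        for bracket in brackets.values()
        for flavor, rank in bracket.items()
        if rank == 1
    )


averages = average_ranks(toy_brackets)
winner_by_average = min(averages, key=averages.get)
favorites = first_place_counts(toy_brackets)
print(winner_by_average, dict(favorites))
```

In the toy data the two methods agree. In the real tournament they did not, which is exactly why it pays to look at both.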
Breaking Down My Small Data Results Even Further
The first set of results pointed me to another question. Key Lime won with an average of only 4.3, so some participants must have ranked it relatively poorly. I therefore expected quite a bit of variance in the data. But which flavors were most polarizing or controversial, and therefore potentially popular with specific audiences but derided by others? Standard deviation is a quick, widely understood measure for exactly this sort of question.
When I calculated the standard deviation for this dataset, two flavors had markedly higher standard deviations than the rest – LimonCello and Tangerine. Interestingly, LimonCello showed up earlier in the list of three potential “winners” of the tournament but also turned out to be the flavor with the most varied responses.
On the other end of the spectrum, three flavors had particularly low standard deviations – Passionfruit, Lemon, and Pasteque. The story became more evident. Pasteque and Passionfruit were both near the bottom of the average overall rankings and had low variance. While the top of the brackets remained murky, the bottom became clearer. Perhaps preferences and favorites varied, but some flavors certainly appeared more universally disliked.
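To make the "polarizing versus universally disliked" distinction concrete, here is a small sketch using Python's standard library. The ranks are invented for illustration. The article does not say whether the population or sample formula was used, so `pstdev` (population) is an assumption here.

```python
from statistics import pstdev

# Hypothetical ranks from four brackets, NOT the tournament data.
ranks_by_flavor = {
    "Polarizing": [1, 9, 1, 9],  # loved by half, hated by half
    "Disliked":   [9, 9, 9, 5],  # consistently near the bottom
    "Middling":   [3, 5, 5, 3],  # consistently mid-table
}

# Population standard deviation per flavor (assumed; `stdev` is the sample version).
spread = {flavor: pstdev(ranks) for flavor, ranks in ranks_by_flavor.items()}

# Most varied responses first.
most_polarizing = sorted(spread, key=spread.get, reverse=True)
for flavor in most_polarizing:
    print(f"{flavor}: {spread[flavor]:.2f}")
```

A high spread paired with a good average suggests a flavor with a devoted audience. A low spread paired with a poor average suggests one that almost nobody likes.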
Since the data pointed toward varied tastes, the next path I wanted to explore was determining which participants seemed to have normal (or abnormal) preferences compared to the group. This digs into the rational explanation for variances in flavor performance – human taste differs. In a group of friends, this was an entertaining exercise. After all, who wouldn’t want to know which of their group had the most eccentric taste?
In this case, the data told me something potentially even more noteworthy. For each participant, I calculated the difference between their rank for each flavor and the group’s average rank for that flavor. I then summed the absolute values of those differences to come up with what I fondly titled a “weirdness score.” The average weirdness score across all participants was 36, and all but two scores fell between 30 and 39. Those two scores were 42.1 and 54.5, and they belonged to the only two children who completed brackets. I’m hesitantly proud to say that the weirdest tastes of all belonged to my daughter, age four.
Now, a weirdness score sounds silly, but in this dataset, it tells us something important that we otherwise would have needed data science tricks, like clustering, to find out: children have different tastes than adults. No one is shocked by that, but in less than 10 minutes of simple averaging, addition and subtraction, I discovered this result without specifically looking for it. Even if I hadn’t known it already, the data would’ve led me to that part of the story.
Now that I was calculating differences, it was easy to ask a few related questions using the same basic math on different groupings of participants and results. For example, calculating differentials between each pair of participants (instead of between each participant and the average) shows which participants had similar tastes. On a small scale, this yielded some very entertaining results. At a larger scale, companies could use similar information to pair customers or even suggest products based on what similar customers also enjoyed.
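The weirdness score is just a sum of absolute differences from the group average. The sketch below implements that description on invented data. The participant names, the four-flavor bracket and the function name `weirdness_scores` are all hypothetical.

```python
from statistics import mean

# Hypothetical toy brackets, NOT the tournament data: participant -> flavor -> rank.
toy_brackets = {
    "Ana": {"A": 1, "B": 2, "C": 3, "D": 3},
    "Ben": {"A": 1, "B": 2, "C": 3, "D": 3},
    "Kid": {"A": 3, "B": 3, "C": 1, "D": 2},
}


def weirdness_scores(brackets):
    """Sum of |participant rank - group average rank| over all flavors."""
    flavors = list(next(iter(brackets.values())))
    group_average = {f: mean(b[f] for b in brackets.values()) for f in flavors}
    return {
        person: sum(abs(bracket[f] - group_average[f]) for f in flavors)
        for person, bracket in brackets.items()
    }


scores = weirdness_scores(toy_brackets)
weirdest = max(scores, key=scores.get)
print(weirdest, round(scores[weirdest], 1))
```

Note that each participant's own ranks are included in the group average. Whether the original calculation did the same is not stated, so treat that as an assumption.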
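The same arithmetic works for comparing participants with one another. This is a hedged sketch on invented data. `taste_distance` and `pairwise_distances` are hypothetical names, and summing absolute rank differences is one reasonable reading of "differentials," not necessarily the exact calculation used.

```python
from itertools import combinations

# Hypothetical toy brackets, NOT the tournament data: participant -> flavor -> rank.
toy_brackets = {
    "Ana": {"A": 1, "B": 2, "C": 3, "D": 3},
    "Ben": {"A": 1, "B": 2, "C": 3, "D": 3},
    "Kid": {"A": 3, "B": 3, "C": 1, "D": 2},
}


def taste_distance(bracket_a, bracket_b):
    """Sum of absolute rank differences across flavors; 0 means identical brackets."""
    return sum(abs(bracket_a[f] - bracket_b[f]) for f in bracket_a)


def pairwise_distances(brackets):
    """Distance for every unordered pair of participants."""
    return {
        (p, q): taste_distance(brackets[p], brackets[q])
        for p, q in combinations(brackets, 2)
    }


distances = pairwise_distances(toy_brackets)
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

The pair with the smallest distance has the most similar tastes, which is the seed of the "customers like you also enjoyed" idea.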
Comparing the Results: Small Data With a Big Story
A final way I wanted to use the idea of differentials was to compare the flavors themselves. Regardless of individual participants, did liking (or disliking) any one flavor correlate with liking or disliking any other flavor? Using this tiny data set, I turned the results into a baby version of predictive analytics. It turns out that in this small data, there were some remarkably strong relationships.
For example, anyone who liked Peach-Pear was also much more likely to enjoy LimonCello. If a participant took Peach-Pear to the final four, the average rank of LimonCello was 2.3. However, if Peach-Pear fell outside of the final four, the average for LimonCello worsened to 7.1. Perhaps this isn’t shocking, since both flavors are similarly sweet and pretty pronounced. Surprisingly, an even stronger like-like correlation existed between Coconut and Apricot – not a typical pairing based on flavor similarity. If Coconut made it to the final four, the average rank of Apricot was 2.0 – a solid second place. But if Coconut was outside of the final four, the average for Apricot worsened to 7.9.
Both of these are examples of like-like correlations, but I also searched for like-hate correlations. For example, those participants who took Razz-Cranberry to the final four soundly hated Tangerine – it had an average rank of 9.0, meaning it never made it out of the first round. But if participants didn’t put Razz-Cranberry in their final four, the average for Tangerine improved to 4.8. If you like Razz-Cranberry, I might suggest you stay away from Tangerine.
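These like-like and like-hate comparisons boil down to splitting the brackets by whether one flavor reached the final four (a rank of three or better) and averaging the other flavor's rank within each group. Here is a minimal sketch on invented data. The flavor labels "X" and "Y" and the function name `conditional_average` are hypothetical.

```python
from statistics import mean

# Hypothetical ranks for two flavors, NOT the tournament data.
toy_brackets = {
    "P1": {"X": 1, "Y": 2},
    "P2": {"X": 3, "Y": 3},
    "P3": {"X": 9, "Y": 9},
    "P4": {"X": 5, "Y": 5},
}


def conditional_average(brackets, trigger, target, final_four_cutoff=3):
    """Average rank of `target` when `trigger` made the final four, and when it did not.

    Returns (average_if_in_final_four, average_if_outside); either value is
    None when no bracket falls into that group.
    """
    inside = [b[target] for b in brackets.values() if b[trigger] <= final_four_cutoff]
    outside = [b[target] for b in brackets.values() if b[trigger] > final_four_cutoff]
    return (
        mean(inside) if inside else None,
        mean(outside) if outside else None,
    )


liked, not_liked = conditional_average(toy_brackets, trigger="X", target="Y")
print(liked, not_liked)
```

A big gap with the better number in the first slot is a like-like pairing. A big gap in the other direction is a like-hate pairing.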
I can also compare my results with the data set that generated the original seeding to see which flavors taste “better” (or worse) than their sales data would seem to predict. In this group, the biggest over-achievers were:
- Peach-Pear — Seeded dead last but came in ninth place overall
- Mango — Seeded 10th but came in second place
- Lemon — Seeded 11th but came in third place.
And the most significant relative losers were:
- Pamplemousse — Seeded first but came in a disappointing eighth place
- Pasteque — Seeded sixth but came in 15th place
- Passionfruit — Seeded seventh but came in dead last.
Discrepancies like these in any small data set shed a different light on the story. They lead to more questions to ask, and potentially to dig into further with different audiences. By following these leads, a business might discover that it is spending way too much money marketing a specific flavor that otherwise wouldn’t perform well, and see that its efforts could better support a different product.
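The over-achiever and under-achiever lists come from comparing each flavor's seed with its overall finish. The sketch below assumes the overall finish is the ordering of flavors by average rank, with alphabetical tie-breaking. How ties were actually broken is not stated, and the data and the function name `seed_vs_finish` are invented.

```python
# Hypothetical seeds and average ranks, NOT the tournament data.
toy_seeds = {"A": 1, "B": 2, "C": 3, "D": 4}
toy_average_rank = {"A": 6.0, "B": 3.0, "C": 7.5, "D": 2.0}


def seed_vs_finish(seeds, average_rank):
    """Places gained (+) or lost (-) relative to seeding.

    Finish position is the flavor's place when sorted by average rank
    (lower is better); ties are broken alphabetically, which is an assumption.
    """
    ordered = sorted(average_rank, key=lambda flavor: (average_rank[flavor], flavor))
    finish = {flavor: place for place, flavor in enumerate(ordered, start=1)}
    return {flavor: seeds[flavor] - finish[flavor] for flavor in seeds}


movement = seed_vs_finish(toy_seeds, toy_average_rank)
biggest_overachiever = max(movement, key=movement.get)
biggest_underachiever = min(movement, key=movement.get)
print(movement, biggest_overachiever, biggest_underachiever)
```

Positive numbers mark flavors that tasted better than their sales suggested, and negative numbers mark the opposite.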
Conclusion
In the end, with very low investment (sixteen 8-packs of sparkling water and less than 3 hours of data analysis), I’ve landed on a few significant insights, a lot of entertainment for participants, several potential pathways for further exploration, and a better understanding of the stories this data was trying to tell. I also learned that I would still rather drink Diet Dr. Pepper, but if the world runs out of soda, I’ll look for a can of LimonCello.
I’m an engineer by training and a technical consultant by trade – but I’m a storyteller at heart. Data isn’t useful as data – it only becomes useful when you take that data, turn it into information that becomes knowledge, and apply that knowledge to your situation until it becomes wisdom. Often, that requires your data to tell a story to unite a group of people around a shared understanding. And small data frequently has big stories to tell.
Sometimes the outcome of that story may be as simple as shared laughs among a group of friends who cannot spend as much time together as they’d like. But that outcome could also provide a deeper understanding of which beverages perform better with different subsets of the population, which might lead to a new marketing push and increased sales. Don’t be tempted to take your small data for granted. The homecoming queen is popular for a reason, but the nerdy kid in the corner might be your key to success.
As my favorite author, J.R.R. Tolkien, almost wrote: “Even the smallest [data] can change the course of the future.”