Big Data: A Warning

The Black Monolith from 2001: A Space Odyssey. Big Data before it was cool.

You’ve heard of Big Data, right?  The coming Goliath that will be the savior of everything and has a significant chance of blowing your mind?  If you work at a healthcare organization, or perhaps just run a blog about HIT, you’ve probably already gotten calls and e-mails about how this new version of analytics will change EVERYTHING.  It’s going to make us more efficient, more just, healthier, and wealthier, and help us predict the future and perform better in bed (safer at least).  However, let me show through the use of simple math that this will not be the case for the semi-immediate future.  In fact, the first thing it will do is make everything noisier.

Information Theory revolves around the concept of entropy, or how many bits of information are necessary to communicate a message between point A and point B.  Closely tied to it is Channel Capacity, or how much information can reliably travel (or how much we can expect to lose) between point A and point B.  Capacity depends on the signal-to-noise ratio, which is measured by taking the strength of the signal (our source of information) and dividing it by the noise (everything that interferes with the signal).  Stick with me.


This used to be a beautiful picture.  Too much noise and not enough signal.

Big Data is about collecting all of the data points that are feasible to acquire and then doing hardcore statistics on them.  The beauty of this is that we are no longer taking samples of what we think our population is like; we are looking at all of it.  Getting answers to the most specific of questions will no longer require an estimate, it will just be.  The downside is that unless you are coming up with your thesis or starting a new research project, humans are pretty bad at asking specific questions.  We are also particularly bad at interpreting a specific answer to a vague question.  The problem is that searching your mountains of data (They don’t call it data mining for nothin’!) for an answer to a vague question is going to return a lot of noise.  Without a comparable increase in our signal, the increase in noise is going to drastically reduce our signal-to-noise ratio, and with it our Channel Capacity, and then we’re just going to be really confused.  Now back to the simple math!

Let’s say our signal has a value of 1 and our noise has a value of 2.  Through the process of division we find that our signal-to-noise ratio is 0.5.  Now let’s say the amount of data we have to sift through increases two-fold, putting our noise value at 4.  Our signal still stays at 1, so our new ratio is halved to 0.25.  In other words, we can no longer hear that awesome guitar solo because the backing band is playing way too loud.  Diminishing signals aside, there’s another lurking problem with Big Data, which is known to scientists as “Fishing”.
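For anyone who wants to poke at the numbers, here is a minimal Python sketch of that arithmetic.  It also plugs each ratio into the standard Shannon-Hartley relation, C = B * log2(1 + S/N), to show roughly what happens to capacity.  The function names, the bandwidth of 1, and the choice to treat the toy signal and noise values as power levels are all assumptions made for illustration; none of them come from a real data set.

```python
import math


def signal_to_noise(signal, noise):
    """Signal-to-noise ratio: signal strength divided by noise strength."""
    return signal / noise


def channel_capacity(snr, bandwidth=1.0):
    """Shannon-Hartley capacity, C = B * log2(1 + S/N).

    Assumption: snr is a plain power ratio (not decibels) and the
    bandwidth of 1.0 is a placeholder, so the result is in bits per
    unit of bandwidth.
    """
    return bandwidth * math.log2(1 + snr)


# The toy values from the paragraph above.
snr_before = signal_to_noise(1, 2)  # signal 1, noise 2
snr_after = signal_to_noise(1, 4)   # same signal, twice the noise

for label, snr in (("before", snr_before), ("after", snr_after)):
    print(f"{label}: SNR = {snr:.2f}, capacity = {channel_capacity(snr):.3f} bits")
```

Under these assumptions the ratio is cut in half while capacity falls by a bit less than half, since capacity grows with the logarithm of the ratio.  Either way, more noise with the same signal means less gets through.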

Fishing, or as Marketing Researchers call it: Daily work.


What’s the age-old way of finding food in a body of water?  Stick a lure in and see what bites.  In our case, Big Data is our growing ocean.  The scientific community looks down upon fishing (the metaphorical version, not the start of a delicious salmon dish) because it goes completely against the scientific process.  Fishing is where you find answers and then formulate the questions.  No hypothesis, no method, just judging the ocean’s contents based upon what you pull out on your single fishing trip.  For instance, the NYT ran an article last year about a company called Factual that had built a massive repository of data and was letting people get in there and see what they could see.  People came up with all sorts of interesting insights, like if one were to buy a used orange car, it was much less likely to be a lemon than a car of any other color.  The problem was that “interesting” was all these conclusions were.  Nothing was actionable.  The vast majority of the models being projected onto the data were simple and vague linear relationships, also known as correlations.  In healthcare, Epidemiologists and Public Health officials have long been aware that correlation does not necessarily mean causation, and that it takes a lot of supporting factors to motivate action based on a correlation analysis.  The rally against tobacco products was prompted by a strong correlation, for example.  Those correlations were then quickly supported by studies looking at the mechanisms for cancer and other health problems.  Additional studies on the orange car finding would show that the color of the car really has nothing to do with its durability; it’s just coincidence.
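To see why fishing turns up “interesting” coincidences so reliably, here is a small Python simulation, offered as a sketch rather than anything taken from the Factual data.  It builds one random outcome and 1,000 random columns that have nothing to do with it, then counts how many columns correlate with the outcome strongly enough to look significant.  Everything here is made up for illustration: the function names, the row and column counts, and the cutoff of 0.279, which is approximately the two-tailed p < 0.05 threshold for a Pearson correlation on 50 rows.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def go_fishing(n_rows=50, n_columns=1000, r_cutoff=0.279, seed=0):
    """Count pure-noise columns that 'significantly' correlate with a noise outcome.

    Assumption: r_cutoff of 0.279 approximates the two-tailed p < 0.05
    threshold for n_rows = 50; it would need changing for other row counts.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_columns):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson_r(column, outcome)) > r_cutoff:
            hits += 1
    return hits


hits = go_fishing()
print(f"{hits} of 1000 meaningless columns look 'significant'")
```

With a 5% threshold you would expect somewhere around 50 of the 1,000 columns to clear the bar by chance alone, even though no real relationship exists.  Any one of them could be the next orange car.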

If you were to give me all of your EHR data, I could show you all sorts of interesting correlations: that your number of walk-ins is highest on Thursdays, that patients coming in with Hypertension are highly likely to also have Diabetes, and that one of your Family Practitioners strangely keeps patients with Anxiety in the waiting room longer than anyone else.  But is this actionable information?  It’s been my experience that most leaders of healthcare organizations do not know what metrics they want or need to look at outside of a few key financial ones that they’ve already been tracking.  They don’t necessarily have any new questions they are asking, but the new Big Data analytics companies that are starting up all claim to have answers.  From my discussions with some of them, they really just want to go fishing and tell you what interesting things they find, which probably won’t help you improve your organization.  It could, however, make for another funny Top 5 [whatever] article.

The idea of Big Data does have promise.  Being able to retroactively look at the entire population of something instead of just a sample is a huge benefit if the scientific method is used and a heap of analysis goes into it.  The only way to start that process, though, is with a very specific question.  Here’s my advice: If you have a specific question about how your organization is running or performing, use analytics to answer that question, but understand that the answer you’re going to get may not provide a solution.  It will most likely provide an area to look into further.  The answer can tell you how accurately you are diagnosing a disease, but it can’t tell you why the doctors missing the diagnosis are doing so.  Above all, please remember to only buy solutions to the problems you already know you have.
