Thursday 13 June 2013

Big Data - understanding what you see is more important than simply seeing it

I read an article in Nature published today titled Biology: The big challenges of big data. It was interesting in a technical computer science sort of way but, for me, it omitted what I see as the biggest challenge we face: that simply seeing Big Data is not as important as understanding it.
It is more important to understand what you see than it is to simply see it.
[Figure: stock market chart. The red lines mark the Hindenburg Omen, a pattern used to identify potentially large stock market falls. The early-2007 crash warning is clearly visible in the centre, and at the far right the credit crunch is about to hit. One was also seen this month (June 2013). Image: Ian Woodward]

Our group at EBI primarily works with what used to be called big data, what you might now call Medium Data™ - it's big but it's not Big Data big (though it is getting bigger). Confused? Good. One of the key tasks we undertake is to 'add value' to the data that is submitted to the likes of ArrayExpress and then processed into the Gene Expression Atlas or the BioSample Database. Adding value takes many forms, but primarily it's about making sure the data is internally consistent within the experiment and then trying to make it outwardly consistent with the rest of the experimental data we host. We use ontologies as part of this alignment, as well as resources like ENSEMBL, BioMart, and others. It's a Big Job (and a difficult one). For me, the Big Job we do at EBI has always been one of adding value. The EBI is not just the world's hard drive, nor should it be.
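To make that concrete, here is a minimal sketch in Python of the kind of alignment step involved: mapping the free-text sample labels submitters actually provide onto shared ontology terms. The synonym table is a toy stand-in for the real curation pipelines and ontology lookup services, though the NCBITaxon and UBERON identifiers shown are genuine.

    # A toy synonym table standing in for real ontology lookup services
    # and curation pipelines. The mappings are illustrative; the
    # NCBITaxon and UBERON identifiers themselves are real.
    SYNONYMS = {
        "homo sapiens": ("Homo sapiens", "NCBITaxon:9606"),
        "human": ("Homo sapiens", "NCBITaxon:9606"),
        "h. sapiens": ("Homo sapiens", "NCBITaxon:9606"),
        "liver": ("liver", "UBERON:0002107"),
        "hepatic tissue": ("liver", "UBERON:0002107"),
    }

    def normalise(label):
        """Map a free-text label to a (canonical term, ontology ID) pair,
        or return None to flag it for manual curation."""
        return SYNONYMS.get(label.strip().lower())

    # Three submitters, three spellings, one consistent annotation.
    for raw in ["Human", "h. sapiens", "HOMO SAPIENS"]:
        print(raw, "->", normalise(raw))

Run over three submitters' spellings of the same species, all three come back as the same term and ID - which is precisely what makes experiments comparable across an archive.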

The article describes the technical challenges of Big Data in some detail: the role of cloud computing, security, the legitimacy of data sources, analysis tools, and so on. What I missed was what I consider to be the biggest challenge in Big Data - how you actually make sense of the massive quantities of data you're faced with. Larry Hunter comes closest in the article when he says "getting the most from the data requires interpreting them in light of all the relevant prior knowledge."

Data Sharing has increasingly become a misnomer to me. The point of sharing is, presumably, so others can reproduce or reuse. However, making data available to others (sharing) is somewhat pointless if the end user can't actually use it because they can't understand it. A previous Nature article reflected on the practice of data sharing and found that reproducing results was rarely possible because of the lack of detail accompanying the data. With Big Data this problem only gets, well, Bigger.

An amusing YouTube cartoon circulated recently which perfectly captured many of the issues I think are salient. Posting USB drives to one another will of course be impossible in the Big Data world; these are the technical issues the Nature article points to. What remains the same is the problem of understanding what it is you're trying to use: how it was produced, how it is formatted, how the variables are named, and so on.

Big Data requires Big MetaData. The scope of new technologies means we can capture much more detail about many more biological entities. The Nature Reviews Genetics article by Nekrutenko highlights that "very few current studies record exact details of their computational experiments, making it difficult for others to repeat them."
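What might that look like in practice? Below is a sketch, in Python, of the sort of machine-readable provenance record Big MetaData implies: tool name and version, exact parameters, and checksums of the inputs, so that someone else could at least attempt to repeat the computation. The field names and the example alignment run are my own illustrative assumptions, not any published standard.

    import hashlib
    import json
    import platform
    import sys
    from datetime import datetime, timezone

    def sha256_of(path):
        """Checksum an input file so others can verify they hold the same data."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def provenance_record(inputs, tool, version, parameters):
        """Bundle the details a stranger would need to repeat the run."""
        return {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "platform": platform.platform(),
            "python": sys.version.split()[0],
            "tool": tool,
            "tool_version": version,
            "parameters": parameters,
            "inputs": [{"path": p, "sha256": sha256_of(p)} for p in inputs],
        }

    # A tiny dummy input so the example runs end to end.
    with open("sample_1.fastq", "w") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n")

    # Record exactly how a (hypothetical) alignment run was made.
    record = provenance_record(
        inputs=["sample_1.fastq"],
        tool="bowtie2",
        version="2.1.0",
        parameters={"mode": "--very-sensitive", "threads": 4},
    )
    print(json.dumps(record, indent=2))

Emitting a record like this alongside every result file costs almost nothing at analysis time; reconstructing the same information years later is usually impossible.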

Hindenburg Omen or not, I fear that we are entering a decade of Big Disappointment if we don't address the issue of how we describe the data we share in more formal, rich and meaningful ways, and do so sooner rather than when it is too late. It is already too late for much of the data that has been 'shared'. The irony is, of course, that this existing data may already hold many of the answers we are looking for, but those answers will never be found simply because we can't reuse the data. I'd rather have Small Data I Can Understand than Big Data I Can't. Repeating previous errors when sharing data would be the Biggest Mistake Of All.