James Malone's EBI Blog

Thursday 1 August 2013

Thinking Small: RDFApps and Killing the Killer App

At ISMB in Berlin this year, Goncalo Abecasis gave a keynote in which he outlined the driver for all of his work: to enable more biologically significant questions to be asked of the data. I sat nodding as he summed up. I can't remember the exact wording, so forgive my paraphrasing, but the message was along the lines of: just having data is not science, just having tools is not data analysis.

The next day I spoke at ISMB on some of the work Simon Jupp and I have been doing using RDF for data integration. I made a similar point, in a much less profound way: just having RDF is not data integration, and integrating data is not a panacea (or any other type of Italian cake). I wanted to make it clear that I'm not an RDF zealot and that I don't think that it solves all our data integration or sharing problems. In fact, it creates problems of its own. RDF without a clear description of the schema or ontologies used is hard to understand and, frankly, even with the ontology it can still be hard. And of course there is SPARQL, which looks a bit like SQL but isn't. And should we be exposing our core users to SQL anyway, never mind SPARQL?

Think Small. Aim high. Don't get squashed.
Image courtesy of SweetCrisis /freedigitalphotos.net

Technology aside, Goncalo and I share a common goal. The work we've been doing is to enable richer, more precise, more biologically relevant questions to be asked of our Gene Expression data by exposing more of the metadata and putting it into wider contexts as requested by users. Some of the most common queries they wish to ask involve signalling pathways, orthologs, drug targets and integration with other ontologies in the NCBO BioPortal. But with this additional richness comes the SPARQL problem.

Simon and I have spoken at length about how we see some of the RDF work we are doing and ultimately it is not about teaching all our biology-focused users SPARQL. The RDF+SPARQL layer should really be seen as an application interface, one that programmers and some bioinformaticians would code against in the same way they code against other web APIs. What I'd like to see the bioinformatics community embrace is the idea of RDFApps: focused pieces of software that solve a specific use case for the biomedical community. They don't need to be big, just useful.

We've been working on a few RDFApps that I'll blog about in the coming weeks. One such RDFApp we've been working on is an R package (which uses the great R SPARQL package) to perform querying of our Atlas RDF. The App essentially hides all of the SPARQL from the user but allows for some quite complex questions to be asked very simply. It also allows for some additional analysis such as over-representation analysis. We've also been experimenting with an App that does on-the-fly faceted browsing using Apache Solr and visualisation of the gene expression data.
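To make the "hide the SPARQL" idea concrete, here is a minimal sketch in Python rather than R. It is not the Atlas R package: the function name, the ex: prefix and every predicate in the query are hypothetical, made up purely for illustration. The point is only that the caller asks a biological question with one function call and never sees the query text.

```python
from string import Template

# Hypothetical prefix and predicate names, used only for illustration.
# They are NOT the real Atlas RDF vocabulary.
_GENES_FOR_CONDITION = Template("""\
PREFIX ex: <http://example.org/atlas#>
SELECT DISTINCT ?gene ?label
WHERE {
  ?expression ex:hasFactorValue ?factor ;
              ex:refersToGene ?gene .
  ?factor ex:label "$condition" .
  ?gene ex:label ?label .
}
LIMIT $limit
""")


def genes_for_condition_query(condition: str, limit: int = 100) -> str:
    """Build the SPARQL text for 'which genes are reported under this condition?'.

    The caller supplies a plain string; the query itself stays hidden
    inside this (hypothetical) convenience function.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Escape backslashes and double quotes so the value cannot break out
    # of the quoted string literal in the query.
    safe = condition.replace("\\", "\\\\").replace('"', '\\"')
    return _GENES_FOR_CONDITION.substitute(condition=safe, limit=limit)
```

The returned string could then be handed to whatever SPARQL client you prefer. A real RDFApp would also turn the result rows into something the user already works with, such as a data frame.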

I've become ever more convinced over the last year or so that the idea of the Semantic Web/RDF Killer App needs to die. We are in search of something that is essentially irrelevant. It is clear that there are things going on in the community that are of much value, in and out of biology. GoodRelations is such an example - it uses lightweight semantics to attach some metadata to data on the web about products which search engines like Google can use to create Google Shopping - and GoodRelations is used by over 10,000 other online stores. This is a small, focused application that has incredible power. We need to think small.

For us, thinking small means tools aimed at specific types of new analysis using specific, focused applications for the parts of biology that are of interest to a user and to a specific scientific question. RDF is a tool in this context, a tool that is clearly appropriate as it naturally fits for publishing data on the web, for integration with other data and for describing this data with ontologies. Forget Big Data, think Small Apps.

The Gene Expression Atlas RDF can be found at http://www.ebi.ac.uk/fgpt/atlasrdf

Thursday 13 June 2013

Big Data - understanding what you see is more important than simply seeing it

I read an article in Nature published today titled Biology: The big challenges of big data. It was interesting in a technical computer science sort of way but, for me, it omitted what I see as the biggest challenge we face: that simply seeing Big Data is not as important as understanding it.

It is more important to understand what you see than it is to simply see it.
The red lines are called the Hindenburg Omen, a pattern used to identify
potentially large stock market falls. The early 2007 crash warning is clearly
visible in the centre and to the far right the credit crunch is about to hit.
One was also seen this month (June 2013). Image: Ian Woodward

Our group at EBI primarily works with what used to be called big data, what you might now call Medium Data™ - it's big but it's not Big Data big (though it is getting bigger). Confused? Good. One of the key tasks we undertake is to 'add value' to the data that is submitted to the likes of ArrayExpress and then processed into the Gene Expression Atlas or the BioSample Database. Adding value takes many forms, but primarily it's about making sure the data is internally consistent within the experiment and then trying to make it outwardly consistent with the rest of the experimental data we host. We use ontologies as part of this alignment, as well as resources like ENSEMBL, BioMart, and others. It's a Big Job (and a difficult one). For me, the Big Job we do at EBI has always been one of adding value. The EBI is not just the world's hard drive and nor should it be.

The article describes the technical challenges of Big Data in some detail; the role of cloud computing, security, legitimacy of data sources, analysis tools, etc. I missed what I consider to be the biggest challenge in Big Data - the part about how you actually make sense of the massive quantities you're faced with. Larry Hunter comes closest in the article when he says "getting the most from the data requires interpreting them in light of all the relevant prior knowledge."

Data Sharing has increasingly become a misnomer to me. The point of sharing is, presumably, so others can reproduce or reuse. However, the intention of making data available to others (sharing) is somewhat redundant if the end user can't actually use it because they can't understand it. A previous Nature article reflected on the practice of data sharing and that reproducing results was rarely possible because of a lack of detail accompanying the data. With Big Data this problem only gets, well, Bigger.

An amusing YouTube cartoon circulated recently which perfectly captured many of the issues that I think are salient. The idea of having USB drives posted to one another will be impossible in the Big Data world of course; these are the technical issues the Nature article points to. What remains the same is the issue of understanding what it is you're trying to use: how it was produced, how it is formatted, how the variables are named, etc.

Big Data requires Big MetaData. The scope of new technologies means we can capture much more detail about many more biological entities. The Nature Reviews Genetics article by Nekrutenko highlights that "very few current studies record exact details of their computational experiments, making it difficult for others to repeat them."

Hindenburg Omen or not, I fear that we are entering a decade of Big Disappointment if we don't address the issues of how we describe the data we are sharing in more formal, rich and meaningful ways and do so earlier rather than when it is too late. It is already too late for much of the data that has already been 'shared'. The irony is, of course, that this existing data may already hold many of the answers we are looking for, but they will never be found simply because we can't reuse it. I'd rather have Small Data I Can Understand than Big Data I can't. Repeating previous errors when sharing data would be the Biggest Mistake Of All.

Thursday 16 May 2013

My NCBO Webinar

JJ Abrams was also on the call but I told him I was
too busy with ontologies no matter how much he begged.

I could be found speaking on a live Webinar last night which has now been made available online for anyone interested. The talk is about 40 minutes long - I do drone on a bit it seems but we had a lot to talk about. Hopefully the content is interesting. The talk covers the work Simon Jupp (primarily) and I have been doing as one of NCBO's Driving Biological Projects. I've blogged about some of the topics I spoke about, but the main components I spoke about were:

Our resources (briefly)

How we rapidly develop our ontologies

How our curators use a tool we developed called Corona to annotate data in the Gene Expression Atlas

Our Phenotator tool, developed to allow cell biologists to develop a cellular phenotype ontology that covers their data without having to understand OWL or the various reference ontology nuances.

Zooma, a knowledge base of curator ontology annotations for automatically annotating data

Our new RDF Atlas work, including web UI allowing new querying over the Atlas and other data resources we've integrated our data with, such as Reactome pathways

Our new Atlas RDF-R package (not yet public) which wraps SPARQL into nice convenience functions in R and includes an enrichment package for use against the Atlas. I'm working on a new version of this which builds on the work my student Maryam Soleimani and I did in a prototype and will try and release it as soon as possible. I'll blog when I do.

Enjoy The Science.

Monday 29 April 2013

Keeping it Agile: the secret to a fitter ontology in 4* easy** steps!

*there are probably more than 4
**it's not all that easy

I've been preachy recently in complaining about how the ontology world doesn't apply enough software engineering practices in producing ontologies. I thought it was about time I explained some of the things I thought they could do by talking specifically about the things we do here to help us. There's an expanded version of this in a paper accepted for the 2013 OWLED workshop for those attending.

1. Whatcha gonna do?

First thing we steal from software engineering is our overall methodology. I have talked a bit about this previously at ICBO 2012 where I presented on how we applied Agile Software Engineering Methods to the development of the Software Ontology. There are a few things this gives us. It helps us prioritise development. Collecting requirements is not usually a problem - there are always bucket loads. As with most projects, there is always more work than people and we need to focus on the things that are most important - which can change month to month.

The red stuff means we're doing it right (that is,
we're catching the stuff we're doing wrong early).

We use a few agile methods to help with this. Priority poker and buy-a-feature have been of particular use when engaging with users and also reasonably fun to do. It also helps keep our major stakeholders involved with the process of development, which is useful because it means there are no big surprises at the end of each sprint (i.e. cycle of development). This way everyone knows what we're gonna do and so do we.

2. Building a house with bricks on your back

One of the primary ontologies I'm currently involved with developing is the Experimental Factor Ontology. EFO is an application ontology - that is to say it is built to serve application use cases, distinct from a reference ontology, which is built as a de facto reference for a given domain. When building EFO we try to reuse as many reference ontologies as we deem suitable (I won't expand on what this means here). But needless to say, our reliance on external resources introduces a coupling - in the same way it does in software projects using libraries. I often refer to this as trying to build a house with the bricks strapped to your back; nice to know you have them close by, but they're heavy. We have some importing code we use to help us manage these imports, based on MIREOT. This still gives us issues to look out for. For example, there is much variation in annotation property names, such as those used for 'synonyms', so we need to merge these so our applications know where to find them. Where imports are not possible or suitable, we mint new EFO classes. Since multiple developers from various sites can mint these, we have built some tooling for central URI management to avoid clashes which could otherwise easily occur. URIGen is this tool - see my previous blog for more on this.
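As a rough illustration of the synonym-merging step, here is a minimal Python sketch. It is not our import code: the property names in SYNONYM_PROPERTIES and the canonical name are hypothetical stand-ins for the variety of annotation properties that different source ontologies use.

```python
# Hypothetical property names standing in for the different ways
# imported ontologies label a synonym annotation.
SYNONYM_PROPERTIES = {
    "hasExactSynonym",
    "alternative_term",
    "altLabel",
    "synonym",
}

# Hypothetical single property that downstream applications look for.
CANONICAL_SYNONYM = "synonym"


def normalise_synonyms(annotations):
    """Rewrite all synonym-like properties to one canonical property.

    `annotations` is a list of (property, value) pairs for one class.
    Order is preserved; synonyms that differ only in case or surrounding
    whitespace are collapsed to the first one seen.
    """
    merged = []
    seen = set()
    for prop, value in annotations:
        if prop in SYNONYM_PROPERTIES:
            prop = CANONICAL_SYNONYM
            value = value.strip()
            key = (prop, value.lower())
        else:
            key = (prop, value)
        if key in seen:
            continue
        seen.add(key)
        merged.append((prop, value))
    return merged
```

Whether to collapse case variants is a design choice; an application that cares about capitalisation (gene symbols, say) would want to keep them apart.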

To keep track of external changes and to produce our release notes we use our Bubastis tool which does a simple syntactic diff across two ontologies to tell you what's changed, been added and been deleted. Keeping track of what's going on externally is a complicated process and brings baggage with it. There is a discussion to be had as to when the balance of keeping track introduces an unacceptable overhead as you are effectively at the mercy of external developers. Examples of changes we've had to deal with include: upper ontology refactoring, mass URI refactoring, funding ending, general movement of classes, changes to design patterns (and axiomatisation therein) and so on. For what it's worth I think we're in a better place now than when we started building EFO five and a bit years ago, although my opinion on this will change if the new (non-backwards compatible) BFO temporal relations are adopted.

3. Test Driven Development

Another agile process we adopt is test driven development. In a continuous integration framework, it is necessary to test each commit of code to ensure that it does not break previously working components or introduce new bugs, and we treat OWL with the same respect. We have developed a series of automated tests using Bamboo that the ontology is run against after each commit, which perform checks such as: invalid namespaces; IRI fragments outside accepted conventions; duplicate labels between different classes; synonyms duplicated between classes; obsolete classes used in axiomatisation; unit tests for expected class subsumption (e.g. cancer should be subclass of disease).
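To give a flavour of what such checks look like, here is a minimal Python sketch of four of them. It is not our Bamboo test suite: the ontology is modelled as a plain dictionary rather than loaded from OWL, and the allowed namespace and fragment pattern are hypothetical conventions chosen for the example.

```python
import re
from collections import defaultdict

# Hypothetical conventions, for illustration only.
ALLOWED_NAMESPACES = ("http://example.org/efo/",)
FRAGMENT_PATTERN = re.compile(r"^[A-Z]+_\d{7}$")

# An ontology is modelled here as:
#   {class_iri: {"label": str, "parents": set_of_parent_iris}}


def invalid_namespaces(classes):
    """IRIs that do not start with an allowed namespace."""
    return sorted(iri for iri in classes if not iri.startswith(ALLOWED_NAMESPACES))


def bad_fragments(classes):
    """IRIs whose final path segment breaks the fragment convention."""
    return sorted(
        iri for iri in classes
        if not FRAGMENT_PATTERN.match(iri.rsplit("/", 1)[-1])
    )


def duplicate_labels(classes):
    """Labels (ignoring case and outer whitespace) shared by several classes."""
    by_label = defaultdict(list)
    for iri, info in classes.items():
        by_label[info["label"].strip().lower()].append(iri)
    return {label: sorted(iris) for label, iris in by_label.items() if len(iris) > 1}


def is_subclass_of(classes, child, ancestor):
    """True if `ancestor` is reachable from `child` via asserted parents.

    This walks asserted parents only; it is not a reasoner. A class
    counts as a subclass of itself.
    """
    stack, seen = [child], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(classes.get(current, {}).get("parents", ()))
    return False
```

In a continuous integration setting each function would back one test that fails the build when it returns anything non-empty (or, for the subsumption check, when an expected relationship such as cancer under disease no longer holds).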

4. Design Patterns

Another aspect is performance and the OWL DL profile we restrict to. In order to fully exploit the querying power of the ontology, we use reasoning to infer various hierarchies of interest, such as classifications of cell lines by disease and species, and we need this to happen in a time that is responsive. There are several methods we use to ensure this remains the case. The first is the use of design patterns. We restrict axiomatisation to a set of patterns that we have developed to answer our priority competency questions. The second is to disallow the addition of new object properties and characteristics on those properties. The third is to classify the ontology on every commit (and run the above test code). For those interested in reasoners, HermiT gives us the best performance for this and has done for quite some time now.

We also employ an automated release cycle to release a new version of EFO monthly, in order to best coordinate with our application needs. The release is programmatically performed using a Bamboo build plan which performs tasks such as creating the inferred version of the ontology, converting the ontology to OBO format, publishing files to the web, building the EFO website and creating URLs for classes in the EFO namespace to ensure that concepts described in EFO fully dereference.

Agility, reality, profanity

Our overall approach has improved the production quality immensely over the last few years. To quantify this with an example: over our last 3 months of work, 74% of the time our EFO continuous integration testing has passed on check in. This means that 26% of the time it has not. Although this sounds like a bad thing it's actually good to know we're catching this now before it goes for release to applications. Much of this is actually relatively minor stuff like white space in labels which we are fairly strict on but sometimes it's more serious stuff that we're glad we caught.

We've also become more dynamic in prioritising and sharing tickets meaning more important stuff gets done more quickly and by a number of people, tickets being picked off the top of the priority pile as people become available.

We still struggle with a few things and these are challenges that hit most ontology consumers I think. The biggest is balancing correctness with 'doing something'. This is a tricky brew to get right as we don't want the ontology to be wrong, but we do want to get things out and working as quickly as possible. Thinking about the metaphysical meaning of a term over a period of months does not help when you have data to annotate covering 1,000 species and 250,000 unique annotations as your target; this is the reality we face. In the same breath though, getting things very wrong doesn't provide the sort of benefits you want from using an ontology - and using an ontology adds an overhead so there should be benefits.

There is a dirty word in the ontology world that most dare not utter, but we do so here: 'compromise'. We do stuff, if it's wrong we fix it, we release early, release often and respond rapidly to required changes from users. Sound familiar?

Monday 4 February 2013

When will bio-ontologies grow up (and how will we know)?

Robert Stevens and I have published an experiment we did on evaluating levels of activity in bio-ontologies over the last decade. It felt like a decade ago since we did the work such was the delay by the journal in getting it out. Here's a summary of the full paper.

At ICBO 2011 in Buffalo, NY, Robert Stevens and I were chatting outside my hotel about which ontologies we use in our work and how one makes a choice. A few others there - I recall Melanie Courtot and Frank Gibson were also present - also had thoughts. There was a collective wisdom about ontology maturity, development and engineering in what we said and we felt it probably would change the landscape of this area of research forever. I wish I had been able to remember any of it.

Nevertheless, Robert and I went ahead and performed a bit of work looking at one aspect of ontology evaluation to see if we could glean some insights into the constitution of the various bio-ontologies in existence and see how far we had come. We limited our work to looking at what we called 'activity'. Some of the research questions we wanted to investigate were:

How frequently is an ontology updated?
What do these changes look like?
Who makes these changes?
Is there attribution for changes?
Can we see patterns (profiles) as ontologies mature?

"Ontology activity" by Aureja Jupp.

Our method for doing this was relatively simple. Firstly, find the available ontology repositories. Secondly, download the ontologies and record data about them - date, committer etc. Thirdly, perform a syntactic diff between subsequent versions looking at the number of classes added, deleted and that have had axiomatic changes (for example, have a new parent class, or a 'part of' assertion made on them). Finally, perform a bit of analysis on these results.

Activity and the Super-Ontologist

We performed the diff using a tool I had written several years ago now called Bubastis - there's also a Java library available since we did this work. The tool is fairly simple; it uses the OWL-API to read in OWL or OBO ontologies and performs a class-level set difference on axiom assertions. It also looks for newly declared named classes and similarly for named classes missing in previous versions.
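The class-level set difference described above can be sketched in a few lines of Python. This is not Bubastis itself, which is built on the OWL-API: here each ontology version is simply a dictionary from a class identifier to a set of axiom strings, and the identifiers and axiom text used with it are made up for illustration.

```python
def diff_ontologies(old, new):
    """Class-level syntactic diff between two ontology versions.

    `old` and `new` map a class identifier to the set of axioms
    asserted on that class, each axiom represented as a string.
    """
    # Classes declared in one version but not the other.
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())

    # For classes present in both, compare their axiom sets.
    changed = {}
    for cls in sorted(old.keys() & new.keys()):
        gained = new[cls] - old[cls]
        lost = old[cls] - new[cls]
        if gained or lost:
            changed[cls] = {
                "added_axioms": sorted(gained),
                "removed_axioms": sorted(lost),
            }
    return {"added": added, "deleted": deleted, "changed": changed}
```

Because the comparison is purely syntactic, a class whose identifier changes between versions shows up as one deletion plus one addition, even if nothing about its meaning has changed.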

I'm not going to go into everything here, you can read the paper for all the details, but here's a few of the interesting things we found.

1. Most activity in the community is in refining classes that already exist within an ontology. Alongside this, we also found that a lot of classes were deleted between versions, which is in contrast to the received wisdom that classes are only made obsolete and not removed. It is arguable that for the OBO ontologies this is less of a crime; when we look at the details we can see some of these deletions are caused by namespace changes between versions with the ID fragment at the end (e.g. the GO_1234567 bit) not changing. Nevertheless, this is a problem if one chooses to use OWL versions or use the full URIs for these ontologies.

2. Between 2009 and 2011 ontology activity remained fairly constant. We produced a metric by totalling all the changes and compared the two using a paired t-test which suggests that levels have not changed significantly.

3. Active ontologies tend to release often. This is perhaps not surprising to anyone familiar with software practices of releasing early and often, but was good to see. The received wisdom in software engineering is that this allows for rapid feedback from and response to users - something active ontologies have adopted.

4. Some ontologies may be dead. Dormancy may suggest the ontology is inactive or complete. The Xenopus anatomy ontology is currently inactive and is more likely to be a case of completeness rather than death. Nevertheless, monitoring when an ontology becomes moribund is almost certainly a worthwhile endeavor for efforts such as the OBO Foundry since an ontology could end up occupying an area, thereby preventing progress.

5. Lots of committers does not always lead to lots of activity. There are several factors to consider here. Firstly, a few of the projects use automated bots to release the ontology so the committer name is not a good indicator of the number of editors. Nevertheless, there are some projects which contain many different committers which have low levels of activity and vice versa. This may be suggestive of many things - that large collaborative efforts suffer from "consensus paralysis" - but it may also suggest that tracking tacit contributions is difficult. Which raises other issues, namely of appropriate credit for such contributions. We arrived at more questions than answers with this one.

6. There lives amongst us a Super-Ontologist. The committer 'girlwithglasses' has made a total of 500 ontology revisions spanning 13 different ontology projects; she is truly the Ontological Eve.
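For anyone wanting to try the kind of comparison mentioned in finding 2 on their own data, here is a minimal Python sketch of a paired t-test on per-ontology change totals for two periods. It uses only the standard library and the usual textbook formula; the function name is hypothetical, and it is not the analysis code from the paper.

```python
import math
import statistics


def paired_t_statistic(first_period, second_period):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Each list holds one total-changes value per ontology, in the same
    order, so that position i in both lists refers to the same ontology.
    """
    if len(first_period) != len(second_period):
        raise ValueError("both periods need one value per ontology")
    n = len(first_period)
    if n < 2:
        raise ValueError("need at least two paired observations")

    # Work on the per-ontology differences between the two periods.
    diffs = [b - a for a, b in zip(first_period, second_period)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

The t value is then compared against the t distribution with the returned degrees of freedom (a statistics package will do this for you) to decide whether the change between periods is significant.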

Discussions

We had many discussions when writing this work up as to what it all meant, and we've put most of this into the paper - if you're interested you'd be best off heading there. Perhaps, more than anything, the conclusion I was drawn to from all of this work was that ontology engineering is still immature. We leaned towards software engineering when we undertook this work; maturity profiles, diff analysis on code revisions, etc. but we were feeling our way towards what might be a good way of summarising this aspect of what you might call ontology quality. Software and ontologies are the same in many aspects but different in many others but I still think this is where our best hope lies in applying QC.

I've previously discussed why the lack of ontology engineering practices is a problem and I think we need more quantifiable approaches for developing and evaluating ontologies. In a workshop I attended in September one of the top desires of the biologist user community was advice on what ontologies they should use. I'm always tempted to name the ontologies I know and use. When we first started using ontologies we performed an analysis of coverage across the various ontologies to work out which would offer us most. Coverage is one metric - a simple but important one - but now there are so many ontologies which offer coverage, we need more than this to inform our decision. Our ontologies are growing up. We should probably help to point them in the right direction.