James Malone's EBI Blog: July 2012

Thursday 26 July 2012

Why choosing ontologies should not be like choosing Pepsi or Coke

I've just returned from the International Conference on Biomedical Ontology (ICBO), which is the biggest conference in the field of bio-ontology. I was invited to sit on a panel, somewhat provocatively titled 'How to deal with sectarianism in biomedical ontology', during which we discussed how we might better get along and defuse some of the issues that have plagued the community over the last decade or so.

Fish-Tree-Pepsi-Coke. I know what you're thinking, but just go with me on this and read the post.

The range of views of the panelists was interesting, in part because they were not as extreme as one might have expected, or at least as they might have been five years ago. I'll attempt to summarise the panelists' thoughts based on their initial two-minute 'pitch' slides; I've included a link to each participant's slide:

Alan Rector - Alan mentioned the need for humility and the need to understand what a given ontology is designed to do before we criticise it, as ontologies can be made for different purposes. He also mentioned the need for proper evaluation.

Chris Stoeckert - Chris stated that sectarianism is inevitable and that he had chosen his sect, which was BFO/realism. Ultimately, he said, the biggest sect wins; that sect is the OBO Foundry, which, as a community effort, we should join.

Barry Smith - Barry suggested that any ontology of any worth should be developed by an ontologist who has signed up to a 'code of ethics', which includes principles of reuse, of aggressive testing in multiple real-world applications and of 'thinking first' before adding a term or definition.

My own stance was that, in general, I don't think a sectarian approach is very useful, not only because it causes political divides within our community, but also because it alienates us from other communities who, from the outside looking in, may be less likely to engage with us. And that hurts us because, above all else, we need users more than they need us. I also think competition is fine. This is, in general, how science has worked for quite some time; if it didn't, we would never have made leaps forward by listening to the minority voices on issues such as evolution and Copernican heliocentrism.

But underlying everything I said is my desire to see ontology engineering become a first-class citizen and mature as a discipline. My job, in part, entails building ontologies for millions of data points with much diversity: 1,000 species, 30,000 experiments, 250,000 unique annotations. If people are willing to call out that I should be using ontology A instead of ontology B, then I need to know why, and this cannot be based on subjective or political opinions. I want to see the development of formal, objective metrics to determine whether or not one ontology is better than another, so that we can really measure these artifacts and have something scientific to base our judgements on.

Alan Rector also rightly points out that ontologies are built for different purposes, so we need to factor that in. As a saying often attributed to Einstein goes, "if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid." If Amazon used an ontology to power their website, it would be hard to argue that this particular fish is not a good artifact, as the Amazon application seems to work pretty well.

I've also heard many comments from certain quarters about an 'ontology crisis' wherein ontologies of poor quality are now everywhere to be seen, polluting the pool. This sort of comment is similar to those made during the software crisis of the 1960s, when software projects routinely overran; given that funding for ontologies can be hard to come by, we can ill afford to overrun. The software community reacted by developing software engineering processes and methods which, over time, helped enormously, though they did not resolve all the issues (cf. Brooks's 'No Silver Bullet'). Whatever your stance, it is hard to argue against wanting proper processes and methods for building in quality; nobody wants a blue screen of death on a plane's fly-by-wire system during a transatlantic flight. Similarly, nobody wants a medical system using an ontology to give incorrect results. We care rather less about an ontology for your photo collection.

So what do we need? Here's my list:

A formal set of engineering principles for a systematic, disciplined, quantifiable approach to the design, development, operation and maintenance of ontologies.
The use of test-driven development, in particular using sets of (if appropriate, user-collected) competency questions which an ontology guarantees to answer, with examples of those answers - think of this as similar to unit testing.
Cost-benefit analysis for adopting frameworks such as upper ontologies. This includes the cost of training for use in development, the cost to end users of understanding ontologies built using such frameworks, the benefits as measured by metrics such as those above (e.g. answering competency questions) and the risk of adoption (such as significant changes or uncertain longer-term support).
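
To make the unit-testing analogy concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy ontology, its class names and the helpers `ancestors`, `is_a` and `run_competency_questions` are hypothetical, and a real project would load an OWL file and query it with a reasoner rather than walk a dictionary.

```python
# Toy "ontology": each class maps to the set of its direct parent classes.
# All class names are illustrative assumptions, not taken from a real ontology.
TOY_ONTOLOGY = {
    "liver": {"organ"},
    "organ": {"anatomical structure"},
    "hepatocyte": {"cell"},
    "cell": {"anatomical structure"},
    "anatomical structure": set(),
}


def ancestors(term, ontology):
    """Return every class reachable from `term` by following parent links."""
    seen = set()
    stack = list(ontology.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ontology.get(parent, ()))
    return seen


def is_a(term, candidate, ontology):
    """Answer the question 'is `term` a kind of `candidate`?'."""
    return term == candidate or candidate in ancestors(term, ontology)


# Competency questions paired with the answers the ontology promises to give.
COMPETENCY_QUESTIONS = [
    ("liver", "organ", True),
    ("liver", "anatomical structure", True),
    ("hepatocyte", "organ", False),
]


def run_competency_questions(ontology, questions=COMPETENCY_QUESTIONS):
    """Return the questions whose actual answer differs from the expected one."""
    return [
        (term, candidate, expected)
        for term, candidate, expected in questions
        if is_a(term, candidate, ontology) != expected
    ]
```

An empty list from `run_competency_questions(TOY_ONTOLOGY)` means every question was answered as promised. Running the same check after each edit to the ontology gives the kind of regression safety net that unit tests give software.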

In a sentence: making public judgements on ontologies should be a formal, objective and quantifiable process and less like deciding whether you'd prefer a Pepsi or a Coke.

Incidentally, I prefer Coke.

Friday 13 July 2012

A million gene expression annotations with Zooma

ArrayExpress, based at the EBI, is one of the world's largest public repositories of transcriptomic data. One of the much-valued features of the repository is that submitted data undergoes curation, not only by computational assessment but ultimately by human experts - our curators. Their job is to ensure this data meets certain minimum quality requirements and is described in a way that is accurate (ontologies help with this) and therefore searchable in the archive.

Zooma - Like an Ooma but even faster.

This service is much valued by a lot of the community, but it is not without issues. One of the primary issues is that the volume of data submitted to ArrayExpress continues to increase. Even though it has been postulated that microarrays are dead, there is no sign that submission of these experiments is slowing; our figures show they are, on the whole, stable. On top of this, new sequencing technologies are emerging almost monthly, it would seem, and our figures also show a slow but steady increase in this sort of submission. Overall, then, this leads to a net increase in the number of submissions coming into ArrayExpress.

So business is good, but this is not without its drawbacks. The primary one is cost, and I mean this in the broadest sense: high-quality annotations are time-consuming and there is a limit to how many experiments a curator can curate. Simply put, more experiments mean more resources are required. We call this an annotation gap, i.e. the gap between the demand for high-quality data annotation (especially using ontology classes) and the amount of resources available to do such annotation.

Search the Zooma KB for an annotation. The pop-out here is showing info for the first hit for Caucasian used as a value for the category ethnicity. This pattern has been used 608 times and each annotation has its own URL.

One of the ways of reducing this annotation gap is by enabling submitters to annotate their own data more easily and in closer alignment with a common standard, in this case the ontologies we use. This reduces the effort required by curators to make sure everything is aligned within the repository. Another way is to maximise the amount of automated annotation against such ontologies that can be done. This is the job of Zooma.

Zooma is an RDF knowledge base of annotation knowledge, extracted from the expert curation performed on a subset of ArrayExpress data. This subset of data has the added advantage of being curated twice because it has also been loaded into the Gene Expression Atlas, where it has been aligned to ontology classes in EFO. This is very powerful for several reasons.

Firstly, it enables access to the curation process. This is useful because it allows a person to easily look up how a property (some textual item) has been annotated to an ontology class and therefore repeat the process - the users here are both external submitters and our own curation team (see image). This makes curation consistent and rich. An additional benefit is that it also enables computational exploitation of the curation process. Not only does Zooma capture how a textual property has been mapped to an ontology class, but it also captures corrections between annotations, for example an update to a more accurate class. What this really gives us is a big set of rules, manually created over several years, which can be applied to data automatically.
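
The idea of curation captured as reusable rules can be sketched in a few lines of Python. This is an illustration only, not Zooma's actual data model or API: the rule table, the correction table, the `EX:` identifiers and the functions `normalise`, `resolve` and `annotate` are all hypothetical.

```python
from typing import Optional

# Curated rules: (property type, property value) -> ontology class identifier.
# The identifiers are made-up placeholders, not real EFO terms.
RULES = {
    ("organism part", "liver"): "EX:0001",
    ("disease", "breast cancer"): "EX:0002",
}

# Curator corrections: an older class replaced by a more accurate one.
CORRECTIONS = {"EX:0002": "EX:0003"}


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def resolve(class_id: str) -> str:
    """Follow the chain of corrections to the most recent class."""
    seen = {class_id}
    while class_id in CORRECTIONS:
        class_id = CORRECTIONS[class_id]
        if class_id in seen:  # guard against a cyclic correction chain
            break
        seen.add(class_id)
    return class_id


def annotate(property_type: str, property_value: str) -> Optional[str]:
    """Return the ontology class for a textual property, or None if no rule matches."""
    key = (normalise(property_type), normalise(property_value))
    class_id = RULES.get(key)
    return None if class_id is None else resolve(class_id)
```

Under these assumptions, `annotate("Disease", "Breast  cancer")` matches the stored rule despite the differences in case and spacing, and then follows the recorded correction to the newer class. A property with no matching rule returns `None` and would be left for a curator.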

A second feature is that provenance is stored and used for ranking and filtering. Using the Open Annotation Model, Zooma records where an annotation 'rule' has come from, for instance whether it was asserted by a curator or inferred from the knowledge base.
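
As a rough illustration of how provenance could drive ranking and filtering, consider the Python sketch below. It does not use the Open Annotation vocabulary or reflect Zooma's real scoring; the evidence labels, their scores, the `Annotation` fields, the sample data and the `rank` function are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical evidence levels; a higher score ranks first.
EVIDENCE_SCORE = {"curator_asserted": 2, "inferred": 1}


@dataclass(frozen=True)
class Annotation:
    property_value: str
    class_id: str    # placeholder identifier, not a real ontology term
    evidence: str    # where the rule came from
    times_used: int  # how often curators have applied this pattern


# Illustrative candidates only; the counts are invented for this sketch.
CANDIDATES = [
    Annotation("liver", "EX:0001", "inferred", 40),
    Annotation("liver", "EX:0004", "curator_asserted", 5),
    Annotation("liver", "EX:0005", "curator_asserted", 12),
]


def rank(annotations, allowed_evidence=None):
    """Optionally filter by evidence type, then sort best-first.

    Annotations are ordered by evidence score, with ties broken by usage count.
    """
    if allowed_evidence is not None:
        annotations = [a for a in annotations if a.evidence in allowed_evidence]
    return sorted(
        annotations,
        key=lambda a: (EVIDENCE_SCORE.get(a.evidence, 0), a.times_used),
        reverse=True,
    )
```

With this ordering, a curator-asserted annotation outranks an inferred one even when the inferred pattern has been used more often. Passing `allowed_evidence={"curator_asserted"}` drops inferred annotations altogether. Whether evidence should dominate usage in this way is a design choice that a real system would need to evaluate.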

A third feature is that everything has a URI: every annotation, sample, assay and study has one, and they link to experiments in ArrayExpress and the BioSample Database. So this is truly linked (to) data.

Finally, this model is additive. Not only can new annotations be added by our own curators and submitters, but any annotation based on our simple abstract model can be incorporated - including a whole database dump.

Have a play with the live demo at http://wwwdev.ebi.ac.uk/fgpt/zooma
Tony Burdett will be presenting this work at ISMB Technology Track on Sunday July 15th at 15:30.
This work is supported in part by EMBL and by the DBP project with NCBO via NIH.

Monday 2 July 2012

Has the War of the Words alienated Google?

A great wordsmith once wrote:

What's in a name? that which we call a rose
By any other name would smell as sweet

This is taken from Shakespeare's play Romeo and Juliet and it captures two themes I'd like to briefly explore in this post. Firstly, that the essence of what a thing is does not rely on its name alone. Secondly, that feuding about a thing can unintentionally damage that thing.

Your search did not match any documents.
Did you mean Brazillian Shakespeare Horse Jupiter Mountain?

'And what does that have to do with Google?', you might well ask. In May 2012 Google announced the Knowledge Graph search enhancement, which they headlined as 'things not strings'. This led to much discussion in various press outlets in the last month or so about how Google were going to use this new way of indexing web pages to give you more intelligent searching, such as dealing with homonyms - words that have multiple, different meanings (e.g. tire as in car wheel, tire as in grow sleepy). This sounds great and, in fact, I think it is. But the idea is nothing new to anyone who has been working in ontology, semantic web or more specialist cases like biomedical data curation. It's one of the driving use cases of ontologies - I refer you to my blog on what an ontology does. So how does Google's new Knowledge Graph differ? I noted with particular dismay that Google's blog did not contain the words 'ontology' or 'semantic'. Various press stories which talk about this hint at it without saying it, and many proclaim this to be fundamentally new technology, with a tip of the hat to Yahoo's 2009 paper.

Credit to Google though - they have actually implemented something and they are using it, which is more than a lot of practitioners do. But the question remains - why are the words 'ontology' and 'semantic web' missing from these articles, including Google's own? An ontology by any other name is still an ontology - concepts, relationships, graph nodes, edges, types, instances, whatever you call it.

I think the answer may lie in my second theme: the War of the Words. In biomedical ontologies, a field in which I am closely involved, there is an undercurrent of strong opinions and cutting debate with the aim of building consensus. Undercurrent is probably inaccurate because it's actually highly visible - it's more like a tsunami. See the 2010 Merrill and Smith papers for a peek at some of this. Ontologies, even from within the community, divide opinion, engender indignation and entrench viewpoints, and to those on the outside this must sometimes seem, well, problematic. It's not always this way of course - there are many great things happening in these communities and sometimes they unite opinion, reduce division and bridge viewpoints. Collaborative work from many different communities continues and I have been party to several such efforts, with mixed success, but then getting everyone to agree is intrinsically hard. The worry is that, perhaps, the success stories are overshadowed by the war of words. The punch is mightier than the handshake, sadly, and perhaps this is the root of my disappointment. I've often heard the words 'if ontologies/sem web were really that good Google would be using them' from those outside these communities.

There is also a feeling from certain quarters that the Semantic Web, as it was originally cast, has failed to live up to the hype and that what I consider to be a simplified version (avoiding grand gestures) - Linked Data - is similarly floundering. I should add that this is not a one-sided argument, though: many believe it has succeeded and is still succeeding, though perhaps it needs to do more. Indeed, I personally believe that we are now in a better position than ever to exploit these technologies and I am already involved in a project here at EBI which is doing just that. I'll report on that in the future.

It would seem apparent that Google are using something akin to ontologies, and possibly Semantic Web technologies, but are unwilling to beat the drum about these overloaded and much-travelled buzzwords. These words come with baggage, high expectations and a strongly opinionated community. It may just be an accidental omission by Google, of course, and in the coming months they will begin to champion the cause and recognise the work that goes on in the ontology and semantic web communities. If they want a success story then I refer them to the Knowledge Graph that is the Gene Ontology, circa 1999.

Perhaps Google's Knowledge Graph is the killer app that everyone has been 'searching' for; the ontology is dead, long live the Knowledge Graph. Let us not, then, kill their efforts with semantics; they're just words after all and, in the end, what's in a name?