Sunday 9 December 2012

After the hype, the biology (and finally, the cancan)

I spent an enjoyable three days in Paris at this year’s SWAT4LS, meeting and talking with a lot of interesting people working in and around the semantic web community. Here are a few observations I made.

There were a lot of people.
It was my first time at this particular meeting and it struck me that there were a lot of people in attendance, somewhere around 100. I was surprised, though perhaps I should not be as the community has been steadily growing over the last few years. Still, this suggests the field is no longer just a niche activity and has many more paid-up members than ever before. OK, I accept that the fact it was in Paris might have swayed a few hearts and minds, but still.
They were cancanning in the aisles at the Moulin Rouge
when they heard about all the RDF in town. 

There was a lot of interesting work…
I was pleasantly surprised at how good a lot of the work presented was, and it spoke to much of what we have been doing over the past six months. To pick out one example, I was impressed with a talk given by Matthew Hindle on data federation in ecotoxicology using SADI, in which he outlined the way they utilise the services to start to answer what I would call proper biological questions. It was a nice example of where they had been able to pick up some of the existing approaches and resources and apply them to their data and problems successfully, and it speaks to the notion that the field has matured over the last few years. Of course I've read this claim numerous times in the past, but largely from an anecdotal point of view; this was at least evidence that, to some extent, things exist that can be used to solve problems in the life sciences. There was inevitably some work to do of course - they did not find everything in an out-of-the-box solution - but the components were there at least.

I also liked the SPARQL R package that has been recently published and that we've been using in an MSc project with one of our students. It's been very useful, and we've written a package for analysing our Gene Expression RDF data in more intuitive ways, allowing simple R-like function calls to be made over the RDF, behind which the SPARQL lies, hidden. I think this sort of tool is important because it exposes the technology to an important, existing bioinformatics community in ways that aid and do not hinder their work. We'll be releasing this tool early next year.

UniProt also presented their work on using rules within their RDF to detect inconsistencies within the annotations. I like this a lot and it's something we have also started exploring with our own RDF, such as looking for disease annotations that were made to ontology classes which were not subclasses of disease - we found a couple. This demonstrates nicely the advantage of having your data in a format that is native to the ontology: it enables one to ask meaningful ontology-type questions (subclasses of x, classes part_of y, etc.) rather than having to formulate some hack between a database query and an OWL one and then do a textual comparison.
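To make that kind of check concrete, here is a minimal sketch in Python of the query pattern just described: find annotations made to classes that are not subclasses of disease. The endpoint URL and the annotation property are invented for illustration; EFO_0000408 is EFO's disease class.

```python
# A minimal sketch of the consistency check described above: find annotations
# whose target class is not reachable from 'disease' via rdfs:subClassOf.
# The endpoint and the ex:hasDiseaseAnnotation property are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/schema/>

SELECT ?sample ?annotatedClass WHERE {
  ?sample ex:hasDiseaseAnnotation ?annotatedClass .
  # keep only classes that are NOT subclasses of EFO's disease class
  FILTER NOT EXISTS {
    ?annotatedClass rdfs:subClassOf* <http://www.ebi.ac.uk/efo/EFO_0000408> .
  }
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["sample"]["value"], "->", row["annotatedClass"]["value"])
```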

…but nothing that got me really, really excited
I didn't at any point have that epiphany that made me think "we're there". I've read this sort of hyperbole far too often in publications (numbers of triples alone do not equate to good science), and at the moment it remains just that. I do think the field is maturing and I think it is now becoming important to those working in the life sciences. Certainly here at EBI we've been working a lot on RDF representations of data over the last six months, and I see this from others too. But it doesn't underpin the basic science we do. It's not that important to our databases, our curation, our applications and, most importantly, our users. It may become so - I think it probably will - but it's not there yet, so, for now, read those aforementioned publications with skepticism.

There is much more to be done; opportunity, hard work and money
I see a lot of interesting opportunities in this area but, to my mind, there needs to be more engineering method applied if we want to see fewer bespoke solutions that live and die within the time it takes for a paper to be written, accepted and published (SPARQL endpoints in particular). I've said this many times about engineering ontologies and I think it applies equally here. We need better update models - updating RDF once it's in a triple store seems to be too onerous a task at present. And documentation on how to do this stuff is really lacking. The HCLS W3C group that we have been working with has been having a go at writing a note on representing gene expression data in RDF, but it's slow work and I still don't know if we're on the right lines. Which also suggests we need better ways of evaluating this stuff. What does it all mean once it's out there, in RDF? Can I use it, integrate with it, and how? Most importantly, is it correct? It's not just about sticking a URI on everything and making it RDF - that's too simplistic if we want to use this in more computational approaches to analysing and solving real biological problems. One of the big problems I've found in this area is convincing funding agencies that these approaches can result in discoveries in biology when the evidence for this is conceptually strong but practically weak. As is often the case, we have a scenario in which waiting for discoveries before giving more funding means the discoveries never come, because the work is never funded. If I had a single wish it would be some focused calls for this stuff across the likes of the BBSRC and EU Framework 7 with some specific biological objectives. There is hope on this front - Douglas Kell gave a plenary talk at last year's SWAT4LS, so there is some recognition of the importance of this area.

My comrade in arms Simon Jupp gave a talk recently on the RDF work we have been doing (mainly Simon) which I added a subtitle to: ‘after the hype, the biology’. The hype may be fading, the biology may be surfacing but there is still much to do.

*Congratulations to Tomasz Adamusiak on winning an iPad in the Nanopublication competition. He celebrated Tomasz-style: an evening at the Moulin Rouge.*

Tuesday 30 October 2012

Future Ontology - Five year predictions past and future

I made a number of ontology predictions five years ago to this very day. Here's a review of those and a few more for the next five years.

On 30th October 2007 I gave my first public presentation on the ontology work I'd been doing since starting work at EBI in May that same year. Today, 30th October 2012, I gave a talk during which I reflected back on things that have passed over those five years and, in order to do so, ended up looking at that old set of slides from 2007. It made for interesting reading. At the end of the talk I made some predictions about what we might do and where we might end up as a community. I thought it might be nice to share those now and to make a few more public, five year predictions. If I'm still working and the world has not ended in 2017 maybe I'll do it all again.
I also predicted that leather pants would make
a comeback. I was wrong on that one, thank God.

October 2007 - My Future Ontology Predictions


We will rely on and reuse external URIs in our work, rather than minting our own, as ontologies become more populous and stabilise.

This is certainly true of the work we do at EBI. Some ontologies, such as the Gene Ontology, have been stable for quite a while, and a couple of others have also followed a fairly stable road to persisting ontology URIs over time (including our own EFO, which follows the GO's practice). What I really wanted to see was that once a URI is minted in an ontology it persists unless there is a very good reason for it to go away. Many more bio-ontologies do this than used to and, in some ways, this is a measure of the maturity of the community.

We will add dereferenceable URIs for our data and put metadata behind them.

Partially. This happens for most ontologies now, which is a definite step in the right direction. For example, OntoBee does a lot of content negotiation for OBO ontologies, which is nice. The data side still lags behind, but we are looking into that internally now. I suppose overall the data part is less true than I would have wished, but this is a classic chicken and egg: the promise comes later, when everyone does it, but until everyone does it, it is not immediately obvious why you should. We need to be bold.
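As a quick illustration of what content negotiation means in practice, here is a sketch that asks a term URI for RDF and then for HTML. The GO PURL below is real, but whether any given service honours the Accept header is exactly the thing being tested.

```python
# Probe a term URI for content negotiation: request RDF first, then HTML,
# and compare what actually comes back.
import requests

uri = "http://purl.obolibrary.org/obo/GO_0008150"  # GO 'biological_process'
for accept in ("application/rdf+xml", "text/html"):
    resp = requests.get(uri, headers={"Accept": accept})
    print(accept, "->", resp.status_code, resp.headers.get("Content-Type"))
```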

As ontology numbers increase and overlap, mapping between them, and between data described with them, will become our biggest challenge.

I think this is true and I am still concerned by it. Where it is not true is that there is not a huge amount of data published using all of these ontologies, as I perhaps envisaged in 2007. But I still maintain that when more is published, the mapping problem will be difficult. Having said all of that, I am also unconvinced that building one ontology for each domain, by attempting to get all communities to agree to every definition of every biological concept, is the answer either. I've seen ontologies grind to a halt through analysis-paralysis over the last five years and this is also not the way to go. Sacrificing one critical problem for another is not a good solution.

Agent technology will help in our mappings and in the way we discover data.
Pretty much didn't happen - in bioinformatics anyway. But I think this was because I was overly optimistic about how much of the infrastructure would exist in this semantic web world. It is worth noting, though, that Google's agents (web crawlers) do use rich snippets markup, which includes an RDFa flavour for describing products on web pages, to help populate the 'shopping' search you see. So I was a bit off the mark, but not completely.

That biological triples would be championed by all.
I think this has definitely been wrong up until very recently, and in some ways I am guilty of being sucked in by the hype - by the promise that integrating all of our RDF data would bring. This year, though, the EBI has started an RDF Frontier group to trial this work. You can see the work Simon and I have been doing on our FGPT Atlas RDF page to see how this is progressing - well so far, I'd say. I was a bit premature on this prediction. Which leads me on nicely to...

October 2012 - My future ontology predictions for the next five years


The number of reference ontologies will level off (and some will disappear) and natural 'winners' will emerge.
I make a deliberate distinction here by saying 'reference' ontologies as I think the number of ontologies put together for applications will likely increase, but they are unlikely to be considered as references for a domain. I think funding for building these reference ontologies will fall and, sadly, some may even become moribund. But a lot of the important ones will live on and continue to develop. The natural winners - ontologies that become the de facto choice for a domain - will emerge. We will need to find ways of using these ontologies that do not necessitate building a whole new reference ontology.

Upper ontologies will play a less important role in the community.
Some might say 'about time' but I do think they've had a role to play and have helped with some things even if the approach of those involved has been, shall we say, less than endearing. But I think their domination in every discussion about whether an ontology is 'good' and how an ontology fits into an upper ontology will decline in favour of focusing on how we can use the ontologies to describe our data and do biology.

Use of ontologies and semantic web technologies in Bioinformatics will become ubiquitous.
This is ambitious but I think it should and will happen. I'm convinced, even from some of the early prototyping work we've been doing, that there is enough data out there now to warrant applications for biologists to use.

Publishers will curate literature using ontologies and make the API to these annotations public.
I think the great work that happens already in GoPubMed should happen for more ontologies and for more publishers. It's just great. For more on this I refer you to Phil Lord who wages a one-man war for more semantics in publications (amongst other improvements in the industry).

Google will endorse the semantic web.
Or they will at least admit that it's useful. They already use semantics with rich snippets. I'd like to see them support this area of web science and I think they eventually will. If they outwardly endorsed it then, who knows, certainly I think more people would use semantics when publishing their data on the web.

I'd love to hear more from those in the community who are willing to stick a stake in the ground.

Thursday 13 September 2012

Bringing Genome Wide Associations to Life

A year into the EBI's joint project with the NHGRI, the GWAS catalog and its data visualization have become interactive with help from Semantic Web technologies.

In the fall of 2010, we started an informal collaboration with the NHGRI's Catalog of Genome Wide Association Studies (Hindorff et al, 2009) with our student, Paulo Silva. In that project, he rapidly created an ontology based on our ongoing EFO efforts to describe a lot of the traits in GWAS (see the NHGRI site for more on how the catalog is put together). Paulo deployed this using the existing ArrayExpress website infrastructure as a backbone to illustrate the benefits of using an ontology for curation and searching. Later that year, Paulo and Helen Parkinson presented this work to the NHGRI.
GWAS Catalog Diagram 2011
Figure 1. GWAS Catalog as of 2011. Much skilled work put this beautiful artefact
together. "My God, it's full of stars!"

Thankfully, they liked it, and in October 2011 the EBI started a formal collaborative project with the NHGRI to improve the process of curation, based on our expertise in using ontologies for annotation and in deploying them in applications. In addition, the famous and much-cited GWAS catalog diagram (see figure 1) was also to be given an overhaul in the way it was generated. The diagram illustrates GWAS traits associated with a SNP on a band on a human chromosome, and is generated manually by a skilled graphical artist every three months. If you take a look at the diagram you can quickly see that this is no easy task. New traits can appear (almost) anywhere each month and this can mean a lot of shuffling around of coloured dots to accommodate them. In addition, each trait has a unique colour which is described by a complex key (see figure 2) - complex because there are so many traits and therefore a lot of colours required. In fact, the harder the team work to curate this data, the harder it becomes to generate the diagram. The lazy person in me thought the answer was simple - stop working so hard!
GWAS Catalog key with many different colours for the many different traits
Figure 2. The key keeps growing as more work is included. Each of
these colours is unique. My favourite is #99FFFF

But they continue to work hard regardless, and so a further issue presents itself: searching the catalog, either via the search interface or by viewing the GWAS image, is also hampered by the growing size of, and lack of structure in, the trait names. With a small list this is not really a problem, but as the catalog has grown, thanks to the curation efforts of the NHGRI, so has the number of traits.

Motivation complete, let's look at the progress so far. You can click here to see the new diagram generated by the team at EBI* (please see below for the list of contributors). The first thing to note is that it is now an interactive, dynamic diagram. You can zoom in and out, and you can mouse over traits to find out what they represent rather than having to consult the key. The default key has also been reduced down to higher-level groupings, significantly reducing the number of colours used. These groupings correspond to ontology classes in EFO which are superclasses of other traits, and are generated based on maximum coverage (i.e. the 18 classes that covered the greatest number of traits). One of the advantages of using an ontology emerges here: you can begin to aggregate data together and get a better feel for global trends.

In addition, you can also drill down to more specific traits using the filter option. Because the diagram is dynamic, you can highlight only those traits of interest to you. Try entering 'cancer' into the 'Filter' box and press enter. The ontology is being used to show those SNPs annotated to cancer or subclasses of cancer. Now try 'breast cancer' and you can see a subset of these cancer trait dots remain highlighted. What about 'breast carcinoma'? Have a go. Clear the filters (click clear filter) and then enter 'breast carcinoma' and you'll see the same results as for breast cancer. Again, this is ontology at work; the browser is using synonyms stored in the ontology to perform the same query it does for 'breast cancer'. Simple, but very useful.
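For the curious, here is a rough sketch of what a filter like this is doing behind the scenes; it is not the browser's actual code, and it assumes a local copy of EFO and EFO's alternative_term synonym property.

```python
# Match the typed filter text against class labels and synonyms, then expand
# the match to all of its subclasses, as the trait filter described above does.
from rdflib import Graph, Literal

g = Graph()
g.parse("efo.owl", format="xml")  # assumes a local copy of EFO

FILTER_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo:  <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?trait WHERE {
  # match the typed text against class labels or synonyms...
  { ?match rdfs:label ?text } UNION { ?match efo:alternative_term ?text }
  FILTER (lcase(str(?text)) = lcase(str(?filter)))
  # ...then collect the matched class and all of its subclasses
  ?trait rdfs:subClassOf* ?match .
}
"""

for row in g.query(FILTER_QUERY,
                   initBindings={"filter": Literal("breast carcinoma")}):
    print(row.trait)
```

This is why 'breast carcinoma' and 'breast cancer' return the same dots: both strings resolve to the same class before the subclass expansion happens.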

The benefit of generating the diagram programmatically is perhaps most evident in the time series view (click on the tab). A selection of diagrams from over the last seven years is shown here and all were generated with just a few clicks. This is in stark contrast to the manual, hand-crafted artefacts that went before, which took many weeks and much skill to produce. Seems almost unfair really.

The Techy Bit


So what's going on behind the scenes? The list of over 650 highly diverse traits is mapped to EFO. These traits include phenotypes (e.g. hair colour), treatment responses (e.g. response to antineoplastic agents), diseases and more. Compound traits such as 'Type 2 diabetes and gout' are separated. Links between relevant traits are also made to facilitate querying (e.g. partonomy). It's also worth mentioning that EFO reuses (i.e. imports from) a lot of existing ontologies, such as the Gene Ontology and the Human Phenotype Ontology, which also facilitates future integration.

The GWAS ontology is a small ontology used within the triple store to describe a GWAS experiment structure (not including traits). It describes the links between a trait and a SNP and the p-value of that association; where that SNP appears on a chromosomal band; and on which chromosome. Each chromosome is an SVG in which the XML source has been tagged with ontology information, e.g. these coordinates of the SVG are band x. The dots representing traits are similarly assigned the trait type (from the ontology) within the class attribute of the SVG. These bits of information are used to render each trait on the images, and allow for filtering on trait - and (in the very near future) on other properties such as the association p-value.
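To show what that tagging might look like, here is a toy example that writes a single trait dot into an SVG with its ontology class in the class attribute. The element names, coordinates and p-value are invented; EFO_0000305 is EFO's breast carcinoma class.

```python
# Build a fragment of SVG where the trait dot carries its ontology class,
# so a browser-side filter can select dots by class, as described above.
import xml.etree.ElementTree as ET

svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                 width="100", height="400")
band = ET.SubElement(svg, "g", id="chr17q21")        # a chromosomal band group
dot = ET.SubElement(band, "circle", cx="50", cy="120", r="4")
dot.set("class", "gwas-trait EFO_0000305")           # EFO 'breast carcinoma'
dot.set("data-pvalue", "2e-15")                      # invented association p-value
print(ET.tostring(svg, encoding="unicode"))
```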

DL queries are also employed as part of this visualisation, making use of the OWL describing the two ontologies (EFO and the GWAS ontology). The initial creation of the diagram uses a DL query which asks, amongst other things, which bands are on which chromosomes, which SNPs are on each band and which traits should be rendered where. Traits are selected using another DL query which, as well as location information, asks whether they satisfy a certain p-value threshold (the p-value being a datatype property) - those that fail this threshold are not rendered. In the future this will become more dynamic so a user can specify their own p-value threshold.

These queries all take a while, so they are all pre-computed when the diagram is built each release, and each SVG is then cached on disk for quick retrieval. The actual filtering queries (which will also gain auto-complete in the near future) also use a simple DL query; they ask for all the subclasses of a given trait class from EFO, and these traits are passed back in a JSON object which is used to refresh the diagram, showing SVGs of this class. In the longer term, the aim is to use disk caching of the reasoner using something like Ehcache. Currently this is not possible due to some serialization issues with the version of the OWL-API being used, but this is set to change. This will enable much more complex queries to be performed dynamically, utilising EFO, such as traits that may be the subject of a protocol of measurement, or diseases that affect a particular system. There are many possibilities.
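A cut-down sketch of that filter round trip is below. Note it walks the asserted rdfs:subClassOf hierarchy with SPARQL as a stand-in for the DL query; the real thing, as described above, uses a reasoner and so also picks up inferred subclasses.

```python
# Compute all subclasses of a requested trait class and return them as JSON
# for the diagram to highlight. EFO_0000311 is EFO's cancer class.
import json
from rdflib import Graph, URIRef

g = Graph()
g.parse("efo.owl", format="xml")  # assumes a local copy of EFO

SUBCLASS_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sub WHERE { ?sub rdfs:subClassOf* ?trait }
"""

def trait_filter(trait_uri):
    rows = g.query(SUBCLASS_QUERY, initBindings={"trait": URIRef(trait_uri)})
    return json.dumps({"trait": trait_uri,
                       "subclasses": sorted(str(row.sub) for row in rows)})

print(trait_filter("http://www.ebi.ac.uk/efo/EFO_0000311"))
```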

A side-effect of this project is that the technology is reusable for other projects, and indeed we intend to utilise some of it for our current work in rendering the Gene Expression Atlas in RDF (more on that very soon). The model is relatively simple: describe your data in some schema (an OWL ontology), work out what each ontology class looks like in SVG (the relationship between an ontology class and the image) and then render it.

This is all done using Semantic Web technologies: OWL, RDF, triple stores, ontologies, description logic reasoners. And you'd never know it looking at the website. I've always thought this is the perfect example of when Sem Web comes good - it just works and the user never knows about the guts. This is the right way round, I think; the biology should come first in this application, the technology is secondary. 99% of users won't really care that your website uses CSS or JavaScript; the same should be true here. When Semantic Web technologies hit this level of brilliant but quiet, accepted ubiquity, we will have reached our goal.

*List of contributors
The tooling for this project was developed by Danielle Welter and Tony Burdett, with input from Helen Parkinson, Jackie MacArthur and Joannella Morales at the EBI and the catalogue curation team at the NHGRI.

Tuesday 4 September 2012

Semantics of the Semantic Web Journal


I've been looking at the Semantic Web Journal, as we've a few bits of Sem Web work under way (which I'll discuss in a future blog post) that I think are interesting and probably worth publishing (eventually), and I've seen a few interesting articles published there. My friend and colleague Phil Lord has oft talked to me about/battered me into submission over the Semantics of Publishing - a slightly different take than a journal about the Semantic Web, but nevertheless relevant.

Phil's work has concentrated on adding value to publications using things like Greycite, which gives a nice way of searching webpages for metadata and presents a way of citing, for example, blog articles. This is extremely useful in a world where a journal publication is really only part of the modern online literature we read and might like to cite.

This is what I was hoping for deep below the surface of each
webpage. Beautiful, icy RDF and maybe a penguin or two.
I had hoped the Semantic Web Journal would take things a step further by providing Semantic Web style metadata behind the pages, so I had a bit of a trawl using Greycite and cURL to see what I could get back by analysing the HTML, or perhaps any RDF through content negotiation. Sadly, I got nothing back other than HTML. In fact, there were not even any meta keywords in the head of the HTML, and nothing came back from cURL requests for RDF.

So I'm left a little disheartened; not annoyed, but disappointed. I think if you're going to lead the line by publishing articles on the Semantic Web, it would be good practice to add semantics to your Semantic Web journal, probably. I suppose, in return, the onus is then on us so-called practitioners to follow the good example set. We've been doing this for a while for EFO, but I confess my own website also lacks RDF behind it, though I could be convinced to add it if there were a Linked Data framework that would add value. Ironically, this journal has some great articles on exactly these areas. The OntoBee paper talks about a service which does a lot of this stuff for ontology classes and individuals in RDF and HTML at the same time, and it all works rather well; but there are multiple other ways of doing this and, as Greycite demonstrates, they need not be complex. Just useful.

There is one other thing I should add, which I really like about this journal; an open and transparent review process, posting all peer reviews on the journal website. This is one of my biggest gripes about science as it stands. I think the review process is flawed. I think reviews should be open and transparent for journals, conferences and grants.

Thursday 26 July 2012

Why choosing ontologies should not be like choosing Pepsi or Coke

I've just returned from the International Conference on Biomedical Ontology (ICBO) which is the biggest conference in the field of bio-ontology. I was invited to sit on a panel which was somewhat provocatively titled 'How to deal with sectarianism in biomedical ontology' during which we discussed how we might better get along and defuse some of the issues that have plagued the community over the last decade or so.

Fish-Tree-Pepsi-Coke. I know what you're thinking but
just go with me on this and read the post.
The range of views of the panelists was interesting, in part because they were not as extreme as one might have expected, or at least not as extreme as they might have been five years ago. I'll attempt to summarise the panelists' thoughts based on their initial two-minute 'pitch' slide; I've included a link to each participant's slides:

Alan Rector - Alan mentioned the need for humility and the need to understand what a given ontology is designed to do before we criticise it, as ontologies can be made for different purposes. He also mentioned the need for proper evaluation.

Chris Stoeckert - Chris stated that sectarianism is inevitable and that he had chosen his sect, which was BFO/realism. Ultimately, he said, the biggest sect wins and that this is the OBO Foundry, which, as a community effort, we should join.

Barry Smith - Barry suggested that any ontology of any worth should be developed by an ontologist who has signed up to a 'code of ethics' which includes principles of reuse, aggressive testing in multiple real-world applications and of 'thinking first' before adding a term or definition.

My own stance was that, in general, I don't think a sectarian approach is very useful, not only because it causes political divides within our community, but because it also alienates us from other communities who, from the outside looking in, may be less likely to engage with us. And that hurts us because, above all else, we need users more than they need us. I also think competition is fine. This is in general how science has worked for quite some time; moreover, if it didn't, then we would never have made leaps forward by listening to the minority voices on issues such as evolution and Copernican heliocentrism.

But underlying everything I said is my desire to see ontology engineering become a first-class citizen and mature as a discipline. My job, in part, entails building ontologies for millions of data points with much diversity: 1,000 species, 30,000 experiments, 250,000 unique annotations. If people are willing to call out that I should be using ontology a instead of ontology b, then I need to know why, and this cannot be based on subjective or political opinions. I want to see the development of formal, objective metrics to determine whether or not one ontology is better than another, so that we can really measure these artifacts and have something scientific on which to base our judgements.

Alan Rector also rightly points out that ontologies are built for different purposes, so we need to factor that in. As Einstein reputedly said, "if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid." If Amazon used an ontology to power their website, it would be hard to argue that that particular fish is not a good artifact, as the Amazon application seems to work pretty well.

I've also heard many comments from certain quarters about an 'ontology crisis' wherein ontologies of poor quality are now everywhere to be seen, polluting the pool. This sort of comment is similar to those made during the software crisis of the 1960s. The software community reacted by developing software engineering processes and methods which, over time, helped enormously, though they did not resolve all the issues (cf. no silver bullet). Given that funding for ontologies can be hard to come by, we can similarly ill afford to overrun. Whatever your stance, it is hard to argue against wanting proper processes and methods for building in quality; nobody wants a blue screen of death on a plane's fly-by-wire system during a transatlantic flight. Similarly, nobody wants a medical system using an ontology to give incorrect results. An ontology for your photo collection, we care less about.

So what do we need? Here's my list;
  1. A formal set of engineering principles for a systematic, disciplined, quantifiable approach to the design, development, operation and maintenance of ontologies
  2. The use of test-driven development, in particular using sets of (if appropriate, user-collected) competency questions which an ontology guarantees to answer, with examples of those answers - think of this as similar to unit testing (see the sketch after this list)
  3. Cost-benefit analysis for adopting frameworks such as upper ontologies. This includes aspects such as the cost of training for use in development, the cost to end users in understanding ontologies built using such frameworks, cost benefits measured against metrics such as those above (e.g. answering competency questions) and the risk of adoption (such as significant changes or longer-term support).
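To make point 2 a little more concrete, here is one way a competency question might be written as a unit test; the ontology file, URIs and expected answer are all invented for illustration.

```python
# A competency question expressed as a unit test (pytest style): the question
# is a query with known expected answers, run against each ontology release.
from rdflib import Graph

g = Graph()
g.parse("my_ontology.owl", format="xml")  # hypothetical ontology under test

def ask(query):
    return {str(row[0]) for row in g.query(query)}

def test_competency_question_kinds_of_cancer():
    """Competency question: which classes are kinds of cancer?"""
    answers = ask("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?c WHERE { ?c rdfs:subClassOf* <http://example.org/onto/Cancer> }
    """)
    assert "http://example.org/onto/Leukaemia" in answers
```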
In a sentence; making public judgements on ontologies should be a formal, objective and quantifiable process, and less like deciding whether you'd prefer a Pepsi or a Coke.

Incidentally, I prefer Coke.

Friday 13 July 2012

A million gene expression annotations with Zooma

ArrayExpress, based at the EBI, is one of the world's largest public repositories of transcriptomic data. One of the much-valued features of the repository is that submitted data undergoes curation, not only by computational assessment but ultimately by manual experts - our curators. Their job is to ensure this data meets certain minimum quality requirements and is described in a way that is accurate (ontologies help with this) and therefore searchable in the archive.

Zooma - Like an Ooma but even faster.

This is a much-valued service for a lot of the community, but it is not without issues. One of the primary issues is that data submitted to ArrayExpress continues to increase. In fact, even though it has been postulated that microarrays are dead, there is no sign that submission of these experiments is slowing; our figures show they are on the whole stable. On top of this, new sequencing technologies are emerging almost monthly, it would seem, and our figures also show a slow but steady increase in this sort of submission. Overall then, this leads to a net increase in the amount of submissions coming into ArrayExpress.

So business is good, but this is not without its drawbacks. The primary one is cost, and I mean this in the broadest sense; high-quality annotations are time-consuming and there is a limit to how many experiments a curator can curate. Simply put, more experiments means more resources are required. We call this an annotation gap, i.e. the gap between high-quality data annotation (especially using ontology classes) and the amount of resources available to do such annotation.

Zooma 2 User Interface screenshot
Search the Zooma KB for an annotation. The pop-out here is showing info for
the first hit for Caucasian used as a value for the category ethnicity.
This pattern has been used 608 times and each annotation has its own URL.
One of the ways of reducing this annotation gap is to enable submitters to annotate their own data more easily and in a way more aligned to a common standard, in this case the ontologies we use. This reduces the effort required by curators to make sure everything is aligned within the repository. Another way is to maximise the amount of automated annotation against such ontologies that can be done. This is the job of Zooma.

Zooma is an RDF knowledge base of annotation knowledge, extracted from the expert curation performed on a subset of ArrayExpress data. This subset of data has the added advantage of being curated twice because it has also been loaded into the Gene Expression Atlas, where it has been aligned to ontology classes in EFO. This is very powerful for several reasons.

Firstly, it enables access to the curation process. This is useful because it allows a person to easily look up how a property (some textual item) has been annotated to an ontology class and therefore repeat the process - the users here are both external submitters and our own curation team (see image). This makes curation consistent and rich. An additional benefit is that it also enables computational exploitation of the curation process. Not only does Zooma capture how a textual property has been mapped to an ontology class, it also captures corrections between annotations, for example an update to a more accurate class. What this really gives us is a big set of rules, manually created over several years, which can be applied to data automatically.
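A much-simplified sketch of that rule idea follows: curated (text value, category) pairs map free text to ontology classes, and corrections simply supersede earlier mappings. Zooma itself stores these as RDF with full provenance; the URIs and values below are placeholders.

```python
# Curated annotations treated as reusable rules: look up past decisions, and
# let later corrections replace earlier mappings.
curated = {
    ("caucasian", "ethnicity"): "http://example.org/efo/Caucasian",
    ("cancer", "disease"): "http://example.org/efo/Neoplasm",
}

def annotate(value, category):
    """Reuse a past curator decision, if one exists, for this value/category."""
    return curated.get((value.strip().lower(), category))

# a correction: a curator remaps 'cancer' to a more accurate class, and every
# future automatic annotation picks up the new target
curated[("cancer", "disease")] = "http://example.org/efo/Cancer"

print(annotate("Caucasian", "ethnicity"))  # reuses the curated mapping
print(annotate("cancer", "disease"))       # gets the corrected class
```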

A second feature is that provenance is stored and used for ranking and filtering. Using the Open Annotation Model, the origin of an annotation 'rule' - for instance, whether it was asserted by a curator or inferred from the knowledge base - is recorded.

A third feature is that everything has a URI: every annotation, sample, assay and study, and they link to experiments in ArrayExpress and the BioSample Database. So this is truly linked (to) data.

Finally, this model is additive. Not only can new annotations be added by our own curators and submitters, but any annotation based on our simple abstract model can be incorporated - including a whole database dump.


Have a play with the live demo at http://wwwdev.ebi.ac.uk/fgpt/zooma
Tony Burdett will be presenting this work at ISMB Technology Track on Sunday July 15th at 15:30.
This work is supported in part by EMBL and by the DBP project with NCBO via NIH.

Monday 2 July 2012

Has the War of the Words alienated Google?

A great wordsmith once wrote;

What's in a name? that which we call a rose
By any other name would smell as sweet


This is taken from Shakespeare's play Romeo and Juliet and it captures two themes I'd like to briefly explore in this post. Firstly, that the essence of what a thing is does not rely on its name alone. Secondly, that feuding about a thing can unintentionally damage that thing.

Your search did not match any documents.
Did you mean Brazillian Shakespeare Horse Jupiter Mountain?
'And what does that have to do with Google?', you might well ask. In May 2012 Google announced the Knowledge Graph search enhancement, which they headlined as 'things not strings'. This led to much discussion in various press outlets over the last month or so about how Google were going to use this new way of indexing web pages to give you more intelligent searching, such as dealing with homonyms - words that have multiple, different meanings (e.g. tire, the car wheel, and tire, to grow weary). This sounds great and, in fact, I think it is. But the idea is nothing new to anyone who has been working in ontology, the semantic web or more specialist cases like biomedical data curation. It's one of the driving use cases of ontologies - I refer you to my blog post on what an ontology does. So how does Google's new Knowledge Graph differ? I noted with particular dismay that Google's blog post did not contain the words 'ontology' or 'semantic'. Various press stories which talk about this hint at it without saying it, and many proclaim this to be fundamentally new technology, with a tip of the hat to Yahoo's 2009 paper.

Credit to Google though - they have actually implemented something and they are using it, that is more than a lot of practitioners do. But the question remains - why are the words 'ontology' and 'semantic web' missing from these articles, including Google's own? An ontology by any other name is still an ontology - concepts, relationships, graph nodes, edges, types, instances, whatever you call it.

I think the answer may lie in my second theme; the War of the Words. In biomedical ontologies, a field in which I am closely involved, there is an undercurrent of strong opinions and cutting debate with the aim of building consensus. Undercurrent is probably inaccurate because it's actually highly visible - it's more like a tsunami. See the 2010 Merrill and Smith papers for a peek at some of this. Ontologies, even from within the community, divide opinion, engender indignation and entrench viewpoints, and to those on the outside this must sometimes seem, well, problematic. It's not always this way of course - there are many great things happening in these communities and sometimes they unite opinion, reduce division and bridge viewpoints. Collaborative work from many different communities continues and I have been party to several such efforts, with mixed success, but then getting everyone to agree is intrinsically hard. The worry is that, perhaps, the success stories are overshadowed by the war of words. The punch is mightier than the handshake, sadly, and perhaps this is the root of my disappointment. I've often heard the words 'if ontologies/sem web were really that good Google would be using them' from those outside these communities.

There is also a feeling in certain quarters that the Semantic Web, as it was originally cast, has failed to live up to the hype, and that what I consider to be a simplified version (avoiding grand gestures) - Linked Data - is similarly floundering. I should add this is not a one-sided argument, though, and many believe it has succeeded and is succeeding, though perhaps it needs to do more. Indeed, I personally believe that we are now in a better position than ever to exploit these technologies and I am already involved in a project here at EBI which is doing just that. I'll report on that in the future.

It would seem apparent that Google are using something akin to ontologies, and possibly Semantic Web technologies, but are unwilling to beat the drum about these overloaded and much-travelled buzzwords. These words come with baggage, high expectations and a strongly opinionated community. It may just be an omission by Google of course, accidental in nature, and in the coming months they may begin to champion the cause and recognise the work that goes on in the ontology and semantic web communities. If they want a success story then I refer them to the Knowledge Graph that is the Gene Ontology, circa 1999.

Perhaps Google's Knowledge Graph is the killer app that everyone has been 'searching' for; the ontology is dead long live the Knowledge Graph. Let us not, then, kill their efforts with semantics, they're just words after all and in the end, what's in a name?

Saturday 23 June 2012

Ontology Turing Test

Alan Turing
Wikimedia, under fair use licence
National Portrait Gallery, London

Today is Alan Turing's 100th birthday and is therefore an appropriate day to write about something AI-inspired. I attended a meeting in the US a few months ago in which an opinion was offered about computers, as fact, which prodded at the AI researcher that lives in a deep, dark cave inside me. It was a statement which spoke to over half a century's worth of AI blood, sweat and tears - a lot of tears - and I profess one that I have also questioned over the years. It was this;

computers will never write good textual definitions in ontologies.

There are many ways one could interpret such a statement as the language is, ironically, loose but I took it to mean the following;

computers will never write English definitions in an ontology that are of the same quality as a human.

My interpretation is still a little loose. In the interest of being a good scientist then, let me recast this as a research question which speaks to the beating heart of AI;

Given two textual definitions, can a person determine which is written by machine and which by a human?

This line of thought is, of course, nothing new. For those familiar with AI, Alan Turing first proposed a similar question in what famously became the Turing Test: could a human interrogator determine which of two (hidden) opponents was human (and therefore which was machine), based on the imitation game?

In 2011 I undertook some work with Robert Stevens of the University of Manchester and Richard Power, Sandra Williams and Allan Third of the Open University to see if we could automate the generation of English definitions based on the axiomatisation of ontology classes in EFO. The motivation is fairly straightforward: EFO had a lot of classes which were richly axiomatised but lacked textual definitions. Could we utilise one to inform the other? There is much to be gained from this. Writing textual definitions is laborious and time-consuming, and axiomatisation done by hand can be similarly so. If we could automate one we might reduce the cost significantly.
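The real work used proper natural language generation machinery from the Open University group; purely to illustrate the template idea, here is a toy verbaliser, where the class name, parent and restrictions would come from the OWL axioms.

```python
# Turn a class axiom of the form 'Name subClassOf Parent and (prop some Filler)'
# into a candidate English definition using simple templates.
def article(word):
    return "an" if word[0].lower() in "aeiou" else "a"

def verbalise(name, parent, restrictions):
    clauses = [f"{prop.replace('_', ' ')} {article(filler)} {filler}"
               for prop, filler in restrictions]
    return (f"{article(name).capitalize()} {name} is {article(parent)} {parent} "
            "that is " + " and is ".join(clauses) + ".")

print(verbalise("Leydig cell", "cell",
                [("located_in", "testis"), ("part_of", "endocrine system")]))
# -> A Leydig cell is a cell that is located in a testis and is part of an
#    endocrine system.
```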

So, back to our Ontology Turing Test. Simple question: can you tell the human from the machine? Here is a smattering of definitions, some machine-derived, some hand-written by humans, that I've hand-picked (to ensure fair comparison I have modified a few so that they all start 'an x is a y...'). Answers are at the bottom of the page. When you finish, you should also question the original statement - that computers will never write good textual definitions in ontologies.

  1. A Leydig cell is both something that is located in a testis, and something that is part of an endocrine system.
  2.  A planned process is a processual entity that realizes a plan which is the concretization of a plan specification.
  3.  A Metabolic Encephalopathy (disorder) is a metabolic disease and is a disorder of the brain.
  4. A LY2 cell line is all of the following: something that is bearer of a breast carcinoma, something that derives from a Homo sapiens, something that derives from an epithelial cell, and something that derives from a mammary gland. 
  5. A laboratory test is a measurement assay that has as input a patient-derived specimen, and as output a result that represents a quality of the specimen.
  6. A role is a realizable entity the manifestation of which brings about some result or end that is not essential to a continuant in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant in some kinds of natural, social or institutional contexts.

A wider question then to finish;

Can ontologies help to make machines think more like humans?

Alas, I cannot even start with an answer as I barely have the questions to test this.



Spoiler alert!
Answers for above: 1: machine, 2: human (OBI), 3: machine, 4: machine, 5: human (OGMS), 6: human (BFO), taken from the latest versions in BioPortal, correct as of 23rd June 2012.

Thursday 14 June 2012

URIGen the URI generation service

A quick post on a tool Simon Jupp has been developing in our group - URIGen. It's a small thing but very useful for anyone involved in concurrently editing ontologies or attempting to automate the minting of new URIs to avoid conflicts. We've been using this for EFO and SWO and we think there are plenty of others who could benefit. OBI in particular comes to mind because URIs created there are not confirmed until an official release; URIGen could make the URI available immediately for use. Here are the basics.

Problem
URIs are used to uniquely identify resources within an ontology. If two resources share the same URI they are considered the same thing - if they are different things they should not share the same URI. In a lot of bio-ontologies, 'semantics-free' identifiers are used when creating URIs to ensure meaningful content is not embedded within the URI (for reasons I won't go into here; that's another post). This often takes the form of a simple accession number, i.e. a number that is simply incremented each time a new class is created: GO:0000001, GO:0000002 and so on. To ensure unique resources are not accidentally allocated the same URI (when they should be different) we need a method of managing how new URIs are created (often called minting) when we hit the 'new' button in a tool like the ontology editor Protege.
The URIGen console controlling and monitoring URI creation by multiple people. Duplicate URIs can never be created. And you can also watch what people are doing. Your boss is gonna love it. (click to enlarge)

Solution
URIGen is a client-server tool which controls how URIs are generated in tools such as Protege. The tool is installed on a server which can be connected to from Protege, or via an API call, by a client (user), and it takes over the generation of new URIs when a new class or property is created. A user is given a unique API key which is required to connect to the server, ensuring a level of security. The form of the URI can be configured in URIGen, such as deciding where the numbers should start, what sort of prefixes might be used (e.g. 'GO' in our previous example) and so on. The base URI of the ontology is used to tie URIGen to a set of these preferences. So, for example, in the figure above we can see that the ontology (third column) is SWO core, which uses the preferences for that ontology, but further down the SWO version ontology adds a slightly different form of URI. You can see these differences in the leftmost column. The server is synchronised to avoid deadlock and to ensure that a URI is only ever allocated once.
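This is not URIGen's actual implementation, but the core idea in miniature looks something like this: a server-side counter per ontology base URI, guarded by a lock so that two clients minting at the same time can never be handed the same URI.

```python
# A toy URI minter: a synchronised, incrementing accession number per
# base URI/prefix pair, so duplicate URIs can never be allocated.
import threading

class UriMinter:
    def __init__(self, base, prefix, start=1, width=7):
        self.base, self.prefix, self.width = base, prefix, width
        self._next = start
        self._lock = threading.Lock()

    def mint(self):
        with self._lock:            # serialise concurrent requests
            accession = self._next
            self._next += 1
        return f"{self.base}{self.prefix}_{accession:0{self.width}d}"

minter = UriMinter("http://www.ebi.ac.uk/efo/", "EFO")
print(minter.mint())  # http://www.ebi.ac.uk/efo/EFO_0000001
print(minter.mint())  # http://www.ebi.ac.uk/efo/EFO_0000002
```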

Availability
Find the tool and documentation at http://code.google.com/p/urigen/


Wednesday 6 June 2012

The Apprentice: A Lesson in Ontology

I finally got to see the final of The Apprentice today and was interested to hear Lord Sugar describe one of the proposed business plans as requiring 'a trillion hours of software development'. The idea came from Nick Holzherr (I won't say if he won or not, for anyone who cares about such a thing) but it was a fairly simple one: when anyone visits an online recipe site and finds a recipe they wish to cook, they click a button and his software enables a trolley of the ingredients to be readily loaded into a supermarket (any supermarket) of their choice.

The Apprentice illustration from around 1882 by S. Barth
The original Sorcerer's Apprentice
creating his first App. iPhones were much
different in the nineteenth century
So this is not revolutionary, and I think some of his claims are a little overplayed, not least because all of the main supermarkets already allow you to do this with their own recipe sites - the bridge between them all is what Nick is proposing. But what piqued my interest was how one would go about writing software to do this computationally, because the problems are not dissimilar from those we face in bioinformatics: lots of data (though food is more limited in scope) and a desire to consume and integrate it in a meaningful way. And since this is a blog about ontologies and the semantic web and not reality TV, I should probably get to the point.

My immediate thought was that this is an ideal showcase for semantic web technology (if it were ubiquitous as per the original vision). In such a scenario this is incredibly easy to do (well, easier anyway). In such a world, all of the data on the web is semantically described, and ontologies and food products are just another part of this web of data. I can go to any online supermarket web service, ask for their ontology of all of their products, get it, and it tells me exactly what they have. If they are all using similar ontologies, Nick's application is trivial - I have access to the exact ingredients I need for every supermarket, and if we assume recipe sites do the same then the connection is made between them all. Very simple.

Of course that reality does not exist, and you could speculate whether or not it ever will. Instead, the problems you have to overcome are those everyone working in data integration faces - text mining to find relevant data, NLP to try to identify concepts and meaningfully map them between sources, and probably some machine learning to work out rules of interest (when people say x they really mean y, so map to y).

The advantages for external applications such as Nick's are clear, but for a supermarket the buy-in is perhaps more difficult. So why would they bother? Here are a few reasons:
  1. Better searching. Some of the supermarket searching is sophisticated and some is less so (I shan't name names) but at the very least I should be able to search for egg and get eggs, and a search for jam should probably return me conserves as well. Synonyms and a good inheritance hierarchy would help with this. 
  2. Managing classification - The Vegetarian. This is a really good example of when asserted classification gives you poor results. I searched for 'vegetarian' in one of the most widely used supermarket internet sites and I got back 70 products. 70!? They only sell 70 products that are suitable for vegetarians?? Of course the answer is no - a search for vegetable alone brings back 625, so this tells you the search is very simplistic - it's bringing back a small subset tagged as vegetarian. If we define vegetarian in an ontology as something that does not contain an ingredient derived from an animal then we are getting somewhere (see the sketch after this list). You should get all of the results automatically.
  3. Allergy checking. Filtering out products that contain certain ingredients (nuts, spices, wheat) in a simple and consistent way would be very useful for allergy sufferers and this is more than just saying 'contains nuts' in a text description in the ingredients blurb. Certain food ingredients are themselves derived from foods that someone could be allergic to, for example some curries contain curry powder which in turn contains wheat to prevent clumping. Transitive relations in the ontology would enable this.
  4. Intelligent substitution. At the moment there seems to be a simple system whereby if something is out of stock it gives me stuff based on the same word (a different make of bread, for instance). But could axioms (rules) coded in ontologies offer more? If there is no plain flour then self-raising flour would be of no use for a specific recipe, but in contrast, if it requires bacon, then gammon or ham might suffice since they come from the same part of the animal. Disjoints and explicit axioms between concepts would help with this.
  5. Consistency checking. As per the previous example, an animal-based product can't be a vegetarian product - they should be disjoint in ontology parlance.
  6. Linking to your data. This becomes much easier and apps like Nick's could be readily deployed.
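To give a flavour of points 2 to 5, here is a hedged sketch in OWL (via the owlready2 Python library) of the modelling this list suggests; all the names are invented. One caveat worth knowing: under OWL's open-world assumption, a reasoner will only classify a product as vegetarian once you also assert that its ingredient list is complete (a closure axiom), which is omitted here.

```python
# A toy food ontology: a defined 'vegetarian' class, a transitive
# derived_from property and a disjointness axiom, as per the list above.
from owlready2 import *

onto = get_ontology("http://example.org/food.owl")

with onto:
    class Food(Thing): pass
    class Animal(Thing): pass
    class contains_ingredient(Food >> Food): pass
    class derived_from(Thing >> Thing, TransitiveProperty): pass  # point 3

    class AnimalDerivedIngredient(Food):
        equivalent_to = [Food & derived_from.some(Animal)]

    class VegetarianFood(Food):
        # point 2: vegetarian = contains nothing derived from an animal
        equivalent_to = [Food &
                         Not(contains_ingredient.some(AnimalDerivedIngredient))]

    # point 5: an animal-derived product can never be vegetarian
    AllDisjoint([AnimalDerivedIngredient, VegetarianFood])
```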
Most of this is not about making money directly, but about making your results more meaningful and correct, and therefore your customer experience better. That is also our aim in serving bioinformatics data (where our products are genes and proteins, etc.) with ontologies. In the field I work in, we want to ensure that when someone searches for disease samples they don't get healthy samples, and that when they ask for cancer, they get leukaemia as well, and so on. The problems are the same, only the words change. And we don't have a trillion hours, but that's lucky because this becomes just a couple when we use these technologies properly.

Friday 1 June 2012

Common Ontology Questions #1: what is it you do again?

I've always maintained the best ideas start life as a problem framed as a question. How can we stop people from catching Polio? What is the Moon really made of? What happens if I push that red flashing button marked 'Never Push Me'? Sometimes there aren't good answers of course.

The question I'm asked most is: what is it you do again? That includes colleagues in bioinformatics, my family and the Student Loan Company, and it is, of course, a good question. And I hope that my answer demonstrates a solution to an important problem. So here goes.
The web comes in many different flavours
Image: FreeDigitalPhotos.net

The problem is really one of words and it's one that has existed for a very long time. We give names to everything: me, you, this blog, and we often reuse those same words for other things. The good thing is that when I talk to someone about you I usually put it in context, or it's obvious I mean you because I'm talking to someone who knows that I know you. And when I say you're small they also know I mean you're thin and not short (because you are of course tall); they've seen you before, so I obviously don't mean height. And similarly, when I say I met you in The Flying Pig, they know I mean the pub down the road, because that's where we both like to drink, and not some new creature that crossed a pig with wings or some such abomination.

So that's clear then. The problem is that if I sent the same information, your name and that you are small, to some other people and asked them to point me to exactly which one you are, they'd probably struggle. In the wider world, names are not uniquely given to objects. There is at least one other James Malone in this world - I know because I regularly receive emails intended for him - and there are probably thousands. But I am unique. Similarly, saying I'm small because I'm thin is also fine if you know me, but a lot of people might use that to mean small in height. So your description doesn't mean what you intended.

OK, that's trivial, but why am I actually employed, you may ask. Well, in biology, like many other sciences, we have millions of objects of interest: different animals, diseases, types of cells, you name it, and in order to make sense of the data we produce from experiments we really need to know what they're about. And a mouse is not just a mouse - though that is another blog post.

It gets worse. Humans are quite good at guessing and disambiguating because they have tacit knowledge about the world and more often than not context. Someone might guess I mean you because they know both of us, but a computer? It wouldn't have a clue no matter how many times you strike it and curse at it. Believe me.

This is where I come in. I use a method of writing this stuff down in a way that is (at least a bit) less ambiguous, and that method concerns the use of an ontology. Ontologies, ironically, have been defined in a hundred different ways, but people mostly mean the same thing: an ontology is a way of talking about the objects we are interested in in some explicit way and, in addition, describing how those objects relate to one another. So, to go back to the example, one such object is me, with my tallness and my thinness. All of these things can be considered useful in an ontology about people generally. We might capture the thing I am (human) and the things describing me (tall) as a concept, a class or a type - which all mean roughly the same thing. A human class is everything that is a human, so I'm a type or instance of that class. Relations also exist: me and tall, for example. The relationship there might be called something like has height or, more generally, has physical characteristic, i.e. James has physical characteristic tall. There, our first bit of ontology done.


Of course it's not that simple, as we have millions of things in biology, but fortunately some of these things are the same and some are closely related. Genes, for example: they might all be instances of a gene class, just as humans live under a human class. What adds complexity is how we make this amenable to a computer reading it, which is critical when you have terabytes of data (that's a lot, I believe). Fortunately, languages such as the Web Ontology Language (OWL) help us, as they provide the syntax needed to specify classes, instances and relations in a way that both you and I and our stupid computers can understand.
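For anyone who wants to see what those words translate to, here is that first bit of ontology written out using the owlready2 Python library; the URI and names are made up for the example.

```python
# A class (Human), an instance (james) and a relation
# (has_physical_characteristic), as in the paragraphs above.
from owlready2 import *

onto = get_ontology("http://example.org/people.owl")

with onto:
    class Human(Thing): pass                     # the class (type)
    class PhysicalCharacteristic(Thing): pass
    class has_physical_characteristic(Human >> PhysicalCharacteristic): pass

    tall = PhysicalCharacteristic("tall")        # an instance of a characteristic
    james = Human("james")                       # James is an instance of Human
    james.has_physical_characteristic = [tall]   # James has physical characteristic tall
```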

What we really want, in a futuristic (but somewhat unlikely) scenario, is for all of the data available on the web to be described in this way, so that computers can ask sensible questions of it and get back sensible answers, because they understand what they're looking at in the same way you and I do. This is hard (and I say unlikely because doing all of this is never-ending - though doing some of it is achievable and useful) but it's the long-term vision of the Semantic Web, and ontologies clearly play an important role here as they tell us what the data actually means - whether that house I'm buying online is really a house for me and not a cage for a rabbit. This is important to me, but to the wider bioinformatics world what's really important is that when someone says this gene is somehow linked to this sample with cancer, we know we're talking about the same type of cancer and the same gene, because this is really important. Fortunately work is well underway in this area; for example, the Gene Ontology has been producing descriptions of genes and related properties for over a decade now.

Anyway, I hope that helps to explain a bit about what I do and why. In the future I'll be writing about things we do here, thoughts and ideas I have (sometimes even good ones), problems I face and probably general rants.