Saturday 23 June 2012

Ontology Turing Test

Alan Turing. Image: Wikimedia, used under fair use licence, National Portrait Gallery, London.

Today is Alan Turing's 100th birthday and is therefore an appropriate day to write about something AI inspired. I attended a meeting in the US a few months ago in which an opinion about computers was offered as fact, which prodded at the AI researcher that lives in a deep, dark cave inside me. It was a statement which spoke to over half a century's worth of AI blood, sweat and tears - a lot of tears - and, I confess, one that I have also questioned over the years. It was this:

computers will never write good textual definitions in ontologies.

There are many ways one could interpret such a statement as the language is, ironically, loose, but I took it to mean the following:

computers will never write English definitions in an ontology that are of the same quality as a human.

My interpretation is still a little loose. In the interest of being a good scientist then, let me recast this as a research question which speaks to the beating heart of AI:

Given two textual definitions, can a person determine which is written by a machine and which by a human?

This line of thought is, of course, nothing new. For those familiar with AI, Alan Turing first proposed a similar question in what famously became the Turing Test: could a human player determine which of the two (hidden) opponents was human (and therefore which was machine) based on the imitation game?

In 2011 I undertook some work with Robert Stevens of the University of Manchester and Richard Power, Sandra Williams and Allan Third of the Open University to see if we could automate the generation of English definitions based on the axiomatisation of ontology classes in EFO. The motivation is fairly straightforward - EFO had a lot of classes which were richly axiomatised but that lacked textual definitions. Could we utilise one to inform the other? There is much to be gained from this. Writing textual definitions is laborious and time-consuming, and axiomatisation done by hand can be similarly so. If we could automate one we might reduce the cost significantly.
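
To give a flavour of the idea (and only a flavour - this is a toy sketch in Python, not the system we actually used, and the templates, property names and labels are all made up for illustration), the simplest possible approach maps each kind of OWL restriction to an English template and glues the clauses together:

# Illustrative only: map restriction properties to English templates.
RESTRICTION_TEMPLATES = {
    "located_in": "something that is located in {filler}",
    "part_of": "something that is part of {filler}",
    "derives_from": "something that derives from {filler}",
}

def verbalise(class_label, restrictions):
    """Render a class and its restrictions as a single English sentence."""
    clauses = [RESTRICTION_TEMPLATES[prop].format(filler=filler)
               for prop, filler in restrictions]
    if len(clauses) == 1:
        body = clauses[0]
    elif len(clauses) == 2:
        body = "both " + " and ".join(clauses)
    else:
        body = ("all of the following: "
                + ", ".join(clauses[:-1]) + ", and " + clauses[-1])
    return "A {} is {}.".format(class_label, body)

print(verbalise("Leydig cell",
                [("located_in", "a testis"), ("part_of", "an endocrine system")]))
# A Leydig cell is both something that is located in a testis and something
# that is part of an endocrine system.

The real work, of course, is in handling the full range of OWL class expressions and in making the output read naturally, but the template idea is the seed of it.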

So back to our Ontology Turing Test then. Simple question: can you tell the human from the machine? Here is a smattering of definitions, some machine-derived and some hand-written by humans, that I've hand-picked (to ensure a fair comparison I have modified a few so that they all start 'an x is a y...'). Answers are at the bottom of the page. When you finish you should also question the original statement - computers will never write good textual definitions in ontologies.

  1. A Leydig cell is both something that is located in a testis, and something that is part of an endocrine system.
  2.  A planned process is a processual entity that realizes a plan which is the concretization of a plan specification.
  3.  A Metabolic Encephalopathy (disorder) is a metabolic disease and is a disorder of the brain.
  4. A LY2 cell line is all of the following: something that is bearer of a breast carcinoma, something that derives from a Homo sapiens, something that derives from an epithelial cell, and something that derives from a mammary gland. 
  5. A laboratory test is a measurement assay that has as input a patient-derived specimen, and as output a result that represents a quality of the specimen.
  6. A role is a realizable entity the manifestation of which brings about some result or end that is not essential to a continuant in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant in some kinds of natural, social or institutional contexts.

A wider question then to finish:

Can ontologies help to make machines think more like humans?

Alas, I cannot even start with an answer as I barely have the questions to test this.



Spoiler alert!
Answers for the above: 1: machine, 2: human (OBI), 3: machine, 4: machine, 5: human (OGMS), 6: human (BFO). The human-written definitions are taken from the latest versions in BioPortal, correct as of 23rd June 2012.

Thursday 14 June 2012

URIGen the URI generation service

A quick post on a tool Simon Jupp has been developing in our group - URIGen. It's a small thing but very useful for anyone involved in concurrently editing ontologies or attempting to automate the minting of new URIs to avoid conflicts. We've been using this for EFO and SWO and we think there are plenty of others who could benefit. OBI in particular comes to mind because URIs created there are not confirmed until an official release; URIGen could make the URI available immediately for use. Here are the basics.

Problem
URIs are used to uniquely identify resources within an ontology. If two resources share the same URI they are considered the same thing - if they are different things they should not share the same URI. In a lot of bio-ontologies, 'semantic free' identifiers are used when creating URIs to ensure meaningful content is not embedded within the URI (for reasons I won't go into here, that's another post). This often takes the form of a simple accession number, i.e. a number that is simply incremented each time a new class is created: GO:0000001, GO:0000002 and so on. To ensure unique resources are not accidentally allocated the same URI (when they should be different) we need a method of managing what new URIs are created (often called minting) when we hit the 'new' button in an ontology editor such as Protege or in other tools.
The URIGen console controlling and monitoring URI creation by multiple people. Duplicate URIs can never be created. And you can also watch what people are doing. Your boss is gonna love it.

Solution
URIGen is a client-server tool which controls how URIs are generated when used in tools such as Protege. The tool is installed on a server which a client (user) can connect to from Protege, or via an API call, and it takes over the generation of new URIs when a new class or property is created. A user is given a unique API key which is required to connect to the server, ensuring a level of security. The form of the URI can be configured in URIGen, such as deciding where the numbers should start, what sort of prefixes might be used (e.g. 'GO' in our previous example) and so on. The base URI of the ontology is used to tie URIGen to a set of these preferences. So, for example in the figure above, we can see that the ontology (3rd column) is SWO core and this uses the preferences for that ontology, but further down the SWO version ontology adds a slightly different form of URI. You can see these differences in the leftmost column. The server is synchronised so that concurrent requests cannot clash and a URI is only ever allocated once.
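
To give a rough, conceptual idea of what the server is doing, here is a toy sketch in Python. This is not URIGen's actual code (for that, see the link below) and the class and parameter names are invented; it just shows the core idea of a synchronised counter per ontology base URI, with a configurable prefix and zero-padding:

import itertools
import threading

class UriMinter:
    """Toy, thread-safe URI minter: one counter per ontology base URI."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}   # base URI -> (prefix, padding, counter)

    def register(self, base_uri, prefix, padding=7, start=1):
        with self._lock:
            self._counters[base_uri] = (prefix, padding, itertools.count(start))

    def mint(self, base_uri):
        # The lock ensures two concurrent editors are never handed the same URI.
        with self._lock:
            prefix, padding, counter = self._counters[base_uri]
            return "{}{}_{:0{}d}".format(base_uri, prefix, next(counter), padding)

minter = UriMinter()
minter.register("http://purl.obolibrary.org/obo/", prefix="GO", padding=7)
print(minter.mint("http://purl.obolibrary.org/obo/"))  # .../obo/GO_0000001
print(minter.mint("http://purl.obolibrary.org/obo/"))  # .../obo/GO_0000002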

Availability
Find the tool and documentation at http://code.google.com/p/urigen/


Wednesday 6 June 2012

The Apprentice: A Lesson in Ontology

I finally got to see the final of The Apprentice today and was interested to hear Lord Sugar describe one of the proposed business plans as requiring 'a trillion hours of software development'. The idea came from Nick Holzherr (I won't say if he won or not, for anyone who cares about such a thing) but it was a fairly simple one; when anyone visits an online recipe site and finds a recipe they wish to cook, they click a button and his software enables a trolley of the ingredients to be readily loaded at a supermarket (any supermarket) of their choice.

The Apprentice, illustration from around 1882 by S. Barth. The original Sorcerer's Apprentice creating his first App. iPhones were much different in the nineteenth century.
So this is not revolutionary and I think some of his claims are a little overplayed, not least because all of the main supermarkets already allow you to do this with their own recipe sites - the bridge between them all is what Nick is proposing. But what piqued my interest was how one would go about writing software to do this computationally, because the problems are not dissimilar from those we face in bioinformatics: lots of data (though food is more limited in scope) and a desire to consume and integrate it in a meaningful way. And since this is a blog about ontologies and the semantic web and not reality TV, I should probably get to the point.

My immediate thought was that this is an ideal showcase for semantic web technology (if it were ubiquitous as per the original vision). In such a world this is incredibly easy to do (well, easier anyway): all of the data on the web is semantically described using ontologies, and food products are just another part of this web of data. I can go to any online supermarket web service, ask for their ontology of all of their products, and get back something that tells me exactly what they have. If they are using similar ontologies, Nick's application is trivial - I have access to the exact ingredients I need for every supermarket, and if we assume recipe sites do the same then the connection is made between them all. Very simple.
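
To make that concrete, here is a hypothetical sketch in Python using the rdflib library. The vocabulary (ex:Product, ex:matchesIngredient and so on) is entirely made up, but it shows the kind of query Nick's application could fire at a supermarket's product ontology if such a thing existed:

from rdflib import Graph

# A made-up scrap of a supermarket's product data, in Turtle.
data = """
@prefix ex: <http://example.org/shop#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:PlainFlour a ex:Product ; rdfs:label "plain flour" ; ex:matchesIngredient ex:Flour .
ex:FreeRangeEggs a ex:Product ; rdfs:label "free range eggs" ; ex:matchesIngredient ex:Egg .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Given an ingredient from a recipe, ask which products this shop can supply.
query = """
PREFIX ex: <http://example.org/shop#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
    ?product ex:matchesIngredient ex:Egg ;
             rdfs:label ?label .
}
"""
for row in g.query(query):
    print(row.label)   # -> free range eggs

If every supermarket published something like this, written against similar ontologies, the 'bridge' between recipe sites and trolleys would be little more than a handful of queries.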

Of course that reality does not exist, and you could speculate whether or not it ever will. Instead the problems you have to overcome are those everyone working in data integration faces - text mining to find relevant data, NLP to try and identify concepts and meaningfully map them between sources, and probably some machine learning to work out rules of interest (when people say x they really mean y, so map to y).

The advantages for external applications such as Nick's are clear, but for a supermarket the buy-in is perhaps more difficult. So why would they bother? Here are a few reasons:
  1. Better searching. Some of the supermarket searching is sophisticated and some is less so (I shan't name names) but at the very least I should be able to search for egg and get eggs, and a search for jam should probably return me conserves as well. Synonyms and a good inheritance hierarchy would help with this. 
  2. Managing classification - The Vegetarian. This is a really good example of where asserting classification gives you poor results. I searched for 'vegetarian' in one of the most widely used supermarket internet sites and I got back 70 products. 70!? They only sell 70 products that are suitable for vegetarians?? Of course the answer is no - a search for vegetable alone brings back 625, so this tells you the search is very simplistic - it's bringing back a small subset tagged as vegetarian. If we define vegetarian in an ontology as something that does not contain an ingredient derived from an animal then we are getting somewhere: you should get all of the results automatically (there is a small sketch of this after the list).
  3. Allergy checking. Filtering out products that contain certain ingredients (nuts, spices, wheat) in a simple and consistent way would be very useful for allergy sufferers and this is more than just saying 'contains nuts' in a text description in the ingredients blurb. Certain food ingredients are themselves derived from foods that someone could be allergic to, for example some curries contain curry powder which in turn contains wheat to prevent clumping. Transitive relations in the ontology would enable this.
  4. Intelligent substitution. At the moment there seems to be a simple system whereby if something is out of stock it gives me stuff based on the same word (a different make of bread for instance). But could axioms (rules) coded in ontologies offer more? If there is no plain flour then self-raising flour would be of no use for a specific recipe, but in contrast if a recipe requires bacon, then gammon or ham might suffice since they are from the same part of the animal. Disjoints and explicit axioms between concepts would help with this.
  5. Consistency checking. As per the previous example, an animal based product can't be a vegetarian product - they should be disjoint in ontology parlance.
  6. Linking to your data. This becomes much easier and apps like Nick's could be readily deployed.
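
As promised, here is a small sketch of what points 2, 3 and 5 might look like as an ontology. It uses the owlready2 Python library and every class and property name is invented for illustration; it's a toy model rather than a claim about how any supermarket (or Nick) should actually build it:

from owlready2 import (Thing, ObjectProperty, TransitiveProperty, Not,
                       AllDisjoint, get_ontology)

onto = get_ontology("http://example.org/food.owl")   # made-up IRI

with onto:
    class Product(Thing): pass
    class Ingredient(Thing): pass
    class AnimalDerivedIngredient(Ingredient): pass   # e.g. bacon, gelatine
    class PlantDerivedIngredient(Ingredient): pass    # e.g. wheat, tomato

    # Point 5: nothing can be both animal-derived and plant-derived.
    AllDisjoint([AnimalDerivedIngredient, PlantDerivedIngredient])

    # Point 3: 'contains ingredient' is transitive, so a curry that contains
    # curry powder, which in turn contains wheat, also contains wheat.
    class contains_ingredient(ObjectProperty, TransitiveProperty):
        pass

    # Point 2: vegetarian is a *defined* class - a product that does not
    # contain any animal-derived ingredient - rather than a manual tag.
    class VegetarianProduct(Product):
        equivalent_to = [Product & Not(contains_ingredient.some(AnimalDerivedIngredient))]

Given products whose ingredients are fully listed (and closed off - OWL assumes anything not stated might still be true), a reasoner could then classify every vegetarian product automatically, rather than relying on someone remembering to tag all of them by hand.
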
Most of this is not about making money directly, but about making your results more meaningful and correct, and therefore your customer experience better - and that is also our aim in serving bioinformatics data (where our products are genes and proteins etc.) with ontologies. In the field I work in, we want to ensure that when someone searches for disease samples they don't get healthy samples, and that when they ask for cancer they get leukaemia as well, and so on. The problems are the same, only the words change. And we don't have a trillion hours, but that's lucky because this becomes just a couple when we use these technologies properly.

Friday 1 June 2012

Common Ontology Questions #1: what is it you do again?

I've always maintained the best ideas start life as a problem framed as a question. How can we stop people from catching Polio? What is the Moon really made of? What happens if I push that red flashing button marked 'Never Push Me'? Sometimes there aren't good answers of course.

The question I'm asked most by people is what is it you do again? That includes colleagues in Bioinformatics, my Family and the Student Loan Company, and it is, of course, a good question. And I hope that my answer demonstrates a solution to an important problem. So here goes.
The web comes in many different flavours
Image: FreeDigitalPhotos.net

The problem is really one of words and it's one that has existed for a very long time. We give names to everything - me, you, this blog - and we often reuse those same words for other things. The good thing is that when I talk to someone about you I usually put it in context, or it's obvious I mean you because I'm talking to someone that knows that I know you, so it's clear. And when I say you're small they also know I mean you're thin and not short (because you are of course tall), because they've seen you before, so I obviously don't mean height. And similarly, when I say I met you in The Flying Pig, they know I mean the pub down the road because that's where we both like to drink, and that I don't mean some new creature that crossed a pig with wings or some such abomination.

So that's clear then. The problem is that if I sent the same information, your name and that you are small to some other people and asked them to point me to exactly which one you are, they'd probably struggle. In the wider world, names are not uniquely given to objects. There is at least one other James Malone in this world - I know because I regularly receive emails intended for him - and there are probably thousands. But I am unique. Similarly, saying I'm small because I'm thin is also fine if you know me, but a lot of people might use that to mean small in height. So your description doesn't mean what you intended.

OK, that's trivial, but why am I actually employed, you may ask. Well in biology, like many other sciences, we have millions of objects of interest: different animals, diseases, types of cells, you name it, and in order to make sense of the data we produce from experiments we really need to know what they're about. And a mouse is not just a mouse - though that is another blog post.

It gets worse. Humans are quite good at guessing and disambiguating because they have tacit knowledge about the world and more often than not context. Someone might guess I mean you because they know both of us, but a computer? It wouldn't have a clue no matter how many times you strike it and curse at it. Believe me.

This is where I come in. I use a method of writing this stuff down in a way that is (at least a bit) less ambiguous, and that method concerns the use of an ontology. Ontologies, ironically, have been defined in a hundred different ways, but people mostly mean the same thing: an ontology is a way of talking about the objects we are interested in in some explicit way, and in addition describing how those objects relate to one another. So to go back to the example, one such object is me, and my tallness and my thinness. All of these things can be considered useful in an ontology about people generally. We might capture the thing I am (human) and the things describing me (tall) as a concept, a class or a type - which all mean roughly the same thing. A human class is everything that is a human, so I'm an instance of that class. Relations also exist - me and tall, for example. The relationship there might be called something like has height or, more generally, has physical characteristic, i.e. James has physical characteristic tall. There, our first bit of ontology done.


Of course it's not that simple, as we have millions of things in biology, but fortunately some of these things are the same and some are closely related. Genes for example: they might all be instances of a gene class, just as humans live under a human class. What adds complexity is how we make this amenable to a computer reading it, which is critical when you have terabytes of data (that's a lot, I believe). Fortunately languages such as the Web Ontology Language (OWL) help us, as they provide the syntax needed to specify classes, instances and relations in a way that both you and I and our stupid computers can understand.
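
If you're curious what that first bit of ontology looks like when you actually write it down, here is a tiny sketch in Python using the owlready2 library; the names are just the ones from my example above, made up for illustration:

from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/people.owl")   # made-up IRI

with onto:
    class Human(Thing): pass                        # the class: everything that is a human
    class PhysicalCharacteristic(Thing): pass
    class has_physical_characteristic(ObjectProperty):
        domain = [Human]
        range = [PhysicalCharacteristic]

    tall = PhysicalCharacteristic("tall")           # an individual standing in for 'tall'
    james = Human("james")                          # I am an instance of the Human class
    james.has_physical_characteristic = [tall]      # James has physical characteristic tall

print(james.has_physical_characteristic)            # e.g. [people.tall]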

What we really want, in a futuristic (but somewhat unlikely) scenario, is all of the data available on the web to be described in this way so that computers can ask sensible questions of it and get back sensible answers, because they understand what they're looking at in the same way you and I do. This is hard (and I say unlikely because doing this is never ending - of course doing some of this is achievable and useful) but it's the long term vision of the Semantic Web, and ontologies, clearly, play an important role here as they tell us what the data actually means and whether that house I'm buying online is really a house for me and that I won't end up with a cage for a rabbit. This is important to me, but to the wider bioinformatics world, what's really important is that when someone says this gene is somehow linked to this sample with cancer, we know we're talking about the same type of cancer and the same gene. Fortunately work is well underway in this area, for example the Gene Ontology has been producing descriptions of genes and related properties for over a decade now.

Anyway, I hope that helps to explain a bit about what I do and why. In the future I'll be writing about things we do here, thoughts and ideas I have (sometimes even good ones), problems I face and probably general rants.