A year into the EBI's joint project with the NHGRI, the GWAS catalog and its data visualization have become interactive with help from Semantic Web technologies.
In the fall of 2010, together with our student Paulo Silva, we started an informal collaboration with the NHGRI's Catalog of Genome Wide Association Studies (Hindorff et al., 2009). In that project, he rapidly created an ontology based on our ongoing EFO efforts to describe many of the traits in GWAS (see the NHGRI site for more on how the catalog is put together). Paulo deployed this using the existing ArrayExpress website infrastructure as a backbone to illustrate the benefits of using an ontology for curation and searching. Later that year, Paulo and Helen Parkinson presented this work to the NHGRI.
Figure 1. GWAS Catalog as of 2011. Much skilled work put this beautiful artefact together. "My God, it's full of stars!"
Thankfully, they liked it, and in October 2011 the EBI started a formal collaborative project with the NHGRI to improve the process of curation based on our expertise in using ontologies for annotation and in deploying them in applications. In addition, the way the famous and much-cited GWAS catalog diagram (see figure 1) is generated was to be given an overhaul. The diagram illustrates GWAS traits associated with a SNP on a band on a human chromosome, and is generated manually by a skilled graphical artist every three months. If you take a look at the diagram you can quickly see that this is no easy task. New traits can appear (almost) anywhere each month and this can mean a lot of shuffling around of coloured dots to accommodate them. In addition, each trait has a unique colour which is described by a complex key (see figure 2) - complex because there are so many traits and therefore a lot of colours required. In fact, the harder the team works to curate this data, the harder it becomes to generate the diagram. The lazy person in me thought the answer was simple - stop working so hard!
Figure 2. The key keeps growing as more work is included. Each of these colours is unique. My favourite is #99FFFF
But they continue to work hard regardless, and so a further issue presents itself: searching the catalogue, either through the search interface or by viewing the GWAS image, is also hampered by the growing number of trait names and their lack of structure. With a small list this is not really a problem, but as the catalog has grown thanks to the curation efforts of the NHGRI, so has the number of traits.
Motivation complete, let's look at the progress so far. You can click here to see the new diagram generated by the team at EBI* (please see below for the list of contributors). The first thing to note is that it is now an interactive, dynamic diagram. You can zoom in and out, and you can mouse over traits to find out what they represent rather than having to look at the key. The default key has also been reduced to higher level groupings, significantly reducing the number of colours used. These groupings correspond to ontology classes in the EFO which are superclasses of other traits and are chosen for maximum coverage (i.e. the 18 classes that covered the greatest number of traits). One of the advantages of using an ontology emerges here: you can begin to aggregate data and get a better feel for global trends.
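To make the "maximum coverage" idea a little more concrete, here is a minimal sketch of how such groupings could be picked. It is not the code behind the diagram: the trait-to-superclass table, the class names and the function `top_groupings` are all made up for illustration, and it simply counts how many traits each candidate superclass covers and keeps the top few.

```python
from collections import Counter

# Hypothetical table: trait -> candidate grouping classes above it.
# These names are illustrative only, not taken from EFO.
CANDIDATE_SUPERCLASSES = {
    "breast carcinoma": ["cancer"],
    "prostate carcinoma": ["cancer"],
    "lung carcinoma": ["cancer"],
    "type 2 diabetes": ["metabolic disease"],
    "gout": ["metabolic disease"],
    "hair colour": ["pigmentation trait"],
}


def top_groupings(superclasses_by_trait, n):
    """Return the n classes that cover the greatest number of traits."""
    coverage = Counter()
    for classes in superclasses_by_trait.values():
        # set() so a class listed twice for one trait is counted once
        for cls in set(classes):
            coverage[cls] += 1
    return [cls for cls, _count in coverage.most_common(n)]


if __name__ == "__main__":
    print(top_groupings(CANDIDATE_SUPERCLASSES, 2))
```

With a real ontology the candidate classes would come from the class hierarchy rather than a hand-written table, and n would be 18 rather than 2.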
In addition, you can drill down to more specific traits using the filter option. Because the diagram is dynamic, you can highlight only those traits of interest to you. Try entering 'cancer' into the 'Filter' box and pressing Enter. The ontology is being used to show those SNPs annotated to cancer or subclasses of cancer. Now try 'breast cancer' and you can see that a subset of these cancer trait dots remains highlighted. What about 'breast carcinoma'? Have a go. Clear the filters (click clear filter) and then enter 'breast carcinoma' and you'll see the same results as for breast cancer. Again, this is ontology at work; the browser is using synonyms stored in the ontology to perform the same query it does for 'breast cancer'. Simple, but very useful.
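The filter behaviour described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the browser's implementation: the parent table, the synonym table and the SNP identifiers (`snp_A` and so on) are invented, and which label is the primary one and which the synonym is also just assumed here.

```python
# Toy ontology fragment: child class -> parent class (hypothetical, not real EFO data)
PARENT = {
    "breast carcinoma": "cancer",
    "prostate carcinoma": "cancer",
    "cancer": "disease",
    "type 2 diabetes": "disease",
}

# Synonym -> primary label (assumed direction, for illustration only)
SYNONYMS = {"breast cancer": "breast carcinoma"}

# Dots on the diagram: each one is a SNP annotated to a trait (made-up identifiers)
DOTS = [
    {"snp": "snp_A", "trait": "breast carcinoma"},
    {"snp": "snp_B", "trait": "prostate carcinoma"},
    {"snp": "snp_C", "trait": "type 2 diabetes"},
]


def resolve(term):
    """Normalise the user's text and map a synonym to its primary label."""
    term = term.strip().lower()
    return SYNONYMS.get(term, term)


def descendants(term):
    """Return the resolved term plus all of its subclasses, transitively."""
    found = {resolve(term)}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def filter_dots(dots, term):
    """Keep only the dots annotated to the term or one of its subclasses."""
    wanted = descendants(term)
    return [dot for dot in dots if dot["trait"] in wanted]


if __name__ == "__main__":
    print(filter_dots(DOTS, "cancer"))
    print(filter_dots(DOTS, "breast cancer") == filter_dots(DOTS, "breast carcinoma"))
```

The point of the sketch is that neither the subclass expansion nor the synonym lookup lives in the search box; both come from the ontology, so the same filter keeps working as traits are added.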
The benefit of generating the diagram programmatically is perhaps most evident in the time series view (click on the tab). A selection of diagrams from the last seven years is shown here and all were generated with just a few clicks. This is in stark contrast to the manual, hand-crafted artefacts that went before, which took many weeks and much skill to produce. Seems almost unfair really.
The Techy Bit
So what's going on behind the scenes? The list of over 650 highly diverse traits is mapped to EFO. These traits include phenotypes (e.g. hair colour), treatment responses (e.g. response to antineoplastic agents), diseases and more. Compound traits such as 'Type 2 diabetes and gout' are separated. Links between relevant traits (e.g. partonomy) are also made to facilitate querying. It's also worth mentioning that EFO reuses (i.e. imports from) a lot of existing ontologies such as the Gene Ontology and Human Phenotype Ontology, which also facilitates future integration.
The GWAS ontology is a small ontology used within the triple store to describe the structure of a GWAS experiment (not including traits). It describes: links between a trait and a SNP and the p-value of that association; where that SNP appears on a chromosomal band; and on which chromosome. Each chromosome is an SVG in which the XML source has been tagged with ontology information, e.g. that these coordinates of the SVG are band x. The dots representing traits are similarly assigned the trait type (from the ontology) within the class attribute of the SVG. These bits of information are used to render each trait on the images, and allow filtering on trait - and (in the very near future) on other properties such as the association p-value.
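As a rough illustration of why tagging the SVG pays off, the sketch below parses a tiny hand-written SVG and highlights dots by their class attribute. The markup, the class names and the use of opacity for highlighting are assumptions made for the example; the real diagram's markup will differ.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hand-written stand-in for one chromosome SVG (hypothetical ids and class names)
SVG_SOURCE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g id="band-17q21" class="band">
    <circle class="trait breast_carcinoma" cx="10" cy="5" r="2"/>
    <circle class="trait type_2_diabetes" cx="14" cy="5" r="2"/>
  </g>
</svg>"""


def highlight(svg_text, wanted_classes):
    """Fade every trait dot except those tagged with one of the wanted classes."""
    root = ET.fromstring(svg_text)
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        classes = set(circle.get("class", "").split())
        # A dot stays fully visible if any of its classes was asked for
        circle.set("opacity", "1.0" if classes & wanted_classes else "0.2")
    return root


if __name__ == "__main__":
    tree = highlight(SVG_SOURCE, {"breast_carcinoma"})
    print(ET.tostring(tree, encoding="unicode"))
```

Because the trait type sits in the class attribute, the filter never needs to know where a dot is drawn, only what it represents.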
DL queries are also employed as part of this visualisation, making use of the OWL describing the two ontologies (EFO and GWAS ontology). The initial creation of the diagram uses a DL query which asks, amongst other things, which bands are on which chromosomes, which SNPs are on each band and which traits should be rendered where. Traits are selected using another DL query which, as well as asking for location information, asks whether they satisfy a certain p-value threshold (the p-value is a datatype property) - those that do not satisfy it are not rendered. In the future this will become more dynamic so a user can specify their own p-value threshold.
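The real work here is done by a reasoner over OWL, but the shape of the question being asked can be mimicked over plain records. The sketch below is only that, a mimic: the association records and identifiers are invented, and it assumes that "satisfying" the threshold means having a p-value at or below a cutoff.

```python
from collections import defaultdict

# Made-up association records standing in for what the ontology describes
ASSOCIATIONS = [
    {"snp": "snp_A", "chromosome": "17", "band": "17q21", "trait": "breast carcinoma", "p_value": 1e-9},
    {"snp": "snp_B", "chromosome": "17", "band": "17q21", "trait": "type 2 diabetes", "p_value": 1e-3},
    {"snp": "snp_C", "chromosome": "2", "band": "2q35", "trait": "breast carcinoma", "p_value": 5e-8},
]


def traits_to_render(associations, cutoff):
    """Group traits by (chromosome, band), keeping associations with p <= cutoff.

    Assumption: a smaller p-value is what satisfies the threshold.
    """
    placed = defaultdict(list)
    for assoc in associations:
        if assoc["p_value"] <= cutoff:
            placed[(assoc["chromosome"], assoc["band"])].append(assoc["trait"])
    return dict(placed)


if __name__ == "__main__":
    for location, traits in traits_to_render(ASSOCIATIONS, 5e-8).items():
        print(location, traits)
```

Passing the cutoff in as a parameter is the plain-Python equivalent of the user-specified threshold mentioned above.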
These queries all take a while, so they are pre-computed when the diagram is built for each release and each SVG is then cached on disk for quick retrieval. The actual filtering queries (which will also support auto-complete in the near future) use a simple DL query too; they ask for all the subclasses of a given trait class from EFO, and these traits are passed back in a JSON object which is used to refresh the diagram, showing the SVG elements of this class. In the longer term, the aim is to use disk caching of the reasoner using something like Ehcache. Currently this is not possible due to some serialization issues with the version of the OWL-API that is being used, but this is set to change. This will enable much more complex EFO-based queries to be performed dynamically, such as finding traits that may be the subject of a measurement protocol or diseases that affect a particular system. There are many possibilities.
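The build-once, serve-from-disk pattern is simple enough to sketch. Everything named here (`render_svg`, `build_release`, `cached_svg`, `filter_response`, the file layout and the JSON field names) is hypothetical and stands in for the real, much slower pipeline.

```python
import json
import tempfile
from pathlib import Path


def render_svg(chromosome):
    """Stand-in for the slow, query-driven rendering of one chromosome."""
    return f'<svg xmlns="http://www.w3.org/2000/svg" id="chr{chromosome}"/>'


def build_release(cache_dir, chromosomes):
    """Run the expensive rendering once per release and store each SVG on disk."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for chromosome in chromosomes:
        path = cache_dir / f"chr{chromosome}.svg"
        path.write_text(render_svg(chromosome), encoding="utf-8")


def cached_svg(cache_dir, chromosome):
    """Serve a chromosome straight from the cache, with no queries involved."""
    return (cache_dir / f"chr{chromosome}.svg").read_text(encoding="utf-8")


def filter_response(trait, subclasses):
    """Package the subclasses of a trait as the JSON a page could use to refresh."""
    return json.dumps({"trait": trait, "subclasses": sorted(subclasses)})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "release"
        build_release(cache, ["1", "2", "X"])
        print(cached_svg(cache, "X"))
    print(filter_response("cancer", {"prostate carcinoma", "breast carcinoma"}))
```

Only the small filtering query runs at request time in this arrangement; the heavy lifting has already happened by the time anyone opens the page.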
A side-effect of this project is that the technology is reusable, and indeed we intend to apply some of it to our current work on rendering the Gene Expression Atlas in RDF (more on that very soon). The model is relatively simple: describe your data in some schema (an OWL ontology), work out what each ontology class looks like in SVG (the relationship between an ontology class and the image) and then render it.
*List of contributors
The tooling for this project was developed by Danielle Welter and Tony Burdett, with input from Helen Parkinson, Jackie MacArthur and Joannella Morales at the EBI and the catalogue curation team at the NHGRI.