Loraine Lab Research

Genome Informatics Systems and Visualization

Data sharing, version-tracking, and integration continue to pose problems in genomics research. The availability of new genomic data sets (such as tiling array and next-gen sequencing data) creates the need for computer systems that can combine old and new data sets in ways that expose interesting biological features. Because this new information is genome-wide and genome-centric, it requires systems that can provide access and visualization capability on a genomic scale. New visualization approaches that move beyond standard Web-browser-based models for data exploration and dissemination are becoming increasingly necessary [Loraine, 2002].

Our approach is to develop systems and design principles that meet the need for advanced visualization capability but are also practical and easy to deploy. We aim to translate advanced visualization techniques from information visualization fields into practical software applications biologists can use to view and understand their data.

We are developing a data repository and software server for genomic data visualization using the highly tractable and relatively small (around 10 times smaller than the human genome) Arabidopsis genome as a model. This project receives funding from the Arabidopsis 2010 program at NSF. As the project progresses, we are adding support for other plant species as well as species not supported in the UCSC Genome Browser system, including S. pombe, Dicty, the malarial parasite, and many others. However, our main goal is to make the software system portable so that data providers and research labs can use it to deploy their own custom data sets. Indeed, many labs and groups have already done so, including the Tim Hughes Lab at the University of Toronto, the Huntsman Cancer Institute at the University of Utah, and others.

We maintain and develop the Integrated Genome Browser (IGB), an advanced desktop visualization tool from the open source Genoviz Project. IGB implements a number of advanced visualization features inspired by research in information visualization fields, as well as a few others invented specifically for genomics. These include animated semantic zooming, a visualization technique many advanced tools implement, as well as edge matching and one-dimensional zooming, techniques that are designed specifically for genome browsing. To find out more about these ideas and try IGB for yourself, visit the IGB Web site.

Screen capture of Integrated Genome Browser

In this image, you see three Arabidopsis expression tiling array data sets corresponding to cold (GSM243694), high salt (GSM243703), and drought (GSM243707) treatments assayed using the Affymetrix AtTile1R tiling array platform. The data were loaded in simple graph format, where probe intensities are shown as vertical bars. The graphs were then configured via the Graph Adjuster tab's Graph Thresholding option to display a bar underneath groups of consecutive probes with intensity values above a certain threshold. Note how the bars seem to correspond to known exons in the gene models displayed in the TAIR9 track, which presents reference gene annotations for Arabidopsis. At the top of the display are short-read Illumina RNA-Seq data (75 bases per read) from plants undergoing severe drought stress. Note how the graph thresholding suggests that this gene contains a previously undiscovered five prime exon. However, the RNA-Seq data contain no reads that support this idea. This example illustrates some of the ambiguities that can arise from using data from high-throughput methods such as tiling arrays and Illumina sequencing. These data sets are enormous and noisy, so purely through chance we will observe at least some genes adjacent to what tiling arrays seem to suggest are unannotated exons. If the RNA-Seq data supported the idea that this region of high probe intensity is indeed an exon, then we would be much more likely to believe the conclusion, because the odds of two entirely different expression measurement technologies giving the same spurious result are very small.
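
The thresholding idea described in the caption (drawing a bar under groups of consecutive probes whose intensities exceed a threshold) can be sketched in a few lines of Python. This is an illustrative sketch only, not IGB's implementation: the function name `threshold_runs`, the `min_probes` parameter, and the probe positions and intensities are all hypothetical and were made up for the example.

```python
from typing import List, Sequence, Tuple


def threshold_runs(
    positions: Sequence[int],
    intensities: Sequence[float],
    threshold: float,
    min_probes: int = 2,
) -> List[Tuple[int, int]]:
    """Return (first_position, last_position) for each run of consecutive
    probes whose intensity is above `threshold`.

    `min_probes` (a hypothetical parameter) drops runs shorter than the
    given number of probes, so isolated noisy probes do not get a bar.
    """
    if len(positions) != len(intensities):
        raise ValueError("positions and intensities must have equal length")

    runs: List[Tuple[int, int]] = []
    run_start = None  # index of the first probe in the current run, if any

    for i, value in enumerate(intensities):
        if value > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run ended at the previous probe; keep it if long enough.
            if i - run_start >= min_probes:
                runs.append((positions[run_start], positions[i - 1]))
            run_start = None

    # Close a run that extends to the last probe.
    if run_start is not None and len(intensities) - run_start >= min_probes:
        runs.append((positions[run_start], positions[-1]))

    return runs


# Made-up probe positions and intensities, for illustration only.
positions = [100, 135, 170, 205, 240, 275, 310, 345]
intensities = [1.2, 8.5, 9.1, 7.8, 0.9, 6.5, 1.1, 7.2]

print(threshold_runs(positions, intensities, threshold=5.0))
```

With these made-up values, only the three adjacent probes at positions 135 to 205 form a run long enough to be reported. The two isolated high-intensity probes are dropped, which is the kind of noise filtering that requiring consecutive probes is meant to provide. As the caption's example shows, a run that passes this filter can still be spurious.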