Dimension Reduction Methods in the Study of the Genetics of Gene Expression
Stephanie Santorico, Department of Mathematical & Statistical Sciences, University of Colorado, Denver
Monday, February 15, 2010
4:00 p.m., Weber 223
Combining multiple types of genomic information, such as genetic marker data and gene expression data, has become a powerful strategy for understanding the genetic basis of complex traits. One challenge of this approach is how to study the genetics of a vast set of highly interrelated measures that likely represent a much smaller set of truly meaningful variables. In this paper, three linear dimension reduction methods, principal components analysis (PCA), partial least squares (PLS), and non-negative matrix factorization (NMF), were reviewed and applied to a large-scale gene expression data set containing 45,101 expression phenotypes from a sample of 84 M16xICR F2 mice. For each basis vector derived from the three methods, the transcripts with the top 10% of weights were selected and tested for functional enrichment based on Gene Ontology biological process annotation. Linkage tests were performed on the components derived from each dimension reduction method, and the identified expression quantitative trait loci (eQTLs) were compared with the results of an analysis that did not use dimension reduction. At a false discovery rate of 0.05, we discovered 18 functional enrichments for the 20 PCA basis vectors, 3 for the 2 PLS basis vectors, and 41 for the 20 NMF basis vectors. Two biological functions, biotin biosynthetic process and cobalamin biosynthetic process, were enriched under all three methods. One significant linkage, with a LOD score of 4.349, was detected using the NMF components. The results demonstrate that all three methods can effectively reduce dimensionality and uncover underlying biological functions. The linkage analysis results suggest that appropriate pre-screening of the original gene expression data set is needed to exclude non-genetic variation in expression.
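As a rough illustration of the workflow described in the abstract, the sketch below derives PCA and NMF basis vectors from an expression matrix and picks the transcripts with the largest weights on each basis vector. It is not the speaker's implementation. The data are synthetic random numbers, the function names (`pca_basis`, `nmf_basis`, `top_weight_indices`) are hypothetical, ranking by absolute weight is an assumption, and the NMF uses standard multiplicative updates, which may differ from the algorithm used in the study. PLS is omitted because it also requires a response matrix.

```python
import numpy as np

def pca_basis(X, n_components):
    """PCA via SVD of the column-centered matrix X (samples x transcripts)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # per-sample components
    loadings = Vt[:n_components]                       # per-transcript weights
    return scores, loadings

def nmf_basis(X, n_components, n_iter=200, seed=0, eps=1e-9):
    """NMF (X ~ W @ H) by multiplicative updates; X must be non-negative."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, n_components)) + eps
    H = rng.random((n_components, p)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def top_weight_indices(weights, fraction=0.10):
    """Per basis vector (row), indices of transcripts with the largest
    absolute weights. Using absolute values is an assumption."""
    k = max(1, int(round(fraction * weights.shape[1])))
    order = np.argsort(-np.abs(weights), axis=1)
    return order[:, :k]

# Synthetic stand-in: 84 samples, 500 transcripts (the study had 45,101).
rng = np.random.default_rng(0)
X = rng.random((84, 500))

scores, loadings = pca_basis(X, 20)
W, H = nmf_basis(X, 20)
pca_top = top_weight_indices(loadings)   # candidates for enrichment testing
nmf_top = top_weight_indices(H)
```

In such a setup, the per-sample components (`scores` for PCA, `W` for NMF) would be the quantities carried into linkage testing, while the selected transcript indices would feed the Gene Ontology enrichment tests.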