News
Publication Alert: Investigating population stratification and admixture using eigenanalysis of dense genotypes
Wednesday, April 6, 2011
The CRGGH is interested in defining population structure using genome-wide genotype data. Several techniques to define population structure have been developed over the past decade. Defining population structure in mixtures of ancestrally homogeneous populations is easier than defining population structure in admixed populations. A fundamental challenge is determining how many dimensions of the data should be retained. In an Advance Online Publication of Heredity, I have revisited an algorithm first described in the psychometrics literature in 1976 and found that it outperforms the current standard.
The abstract is printed below and can also be found here:
- Principal components analysis of genetic data is used to avoid inflation in type I error rates in association testing due to population stratification by covariate adjustment using the top eigenvectors and to estimate cluster or group membership independent of self-reported or ethnic identities. Eigendecomposition transforms correlated variables into an equal number of uncorrelated variables. Numerous stopping rules have been developed to identify which principal components should be retained. Recent developments in random matrix theory have led to a formal hypothesis test of the top eigenvalue, providing another way to achieve dimension reduction. In this study, I compare Velicer's minimum average partial test to a test on the basis of Tracy-Widom distribution as implemented in EIGENSOFT, the most widely used implementation of principal components analysis in genome-wide association analysis. By computer simulation of vicariance on the basis of coalescent theory, EIGENSOFT systematically overestimates the number of significant principal components. Furthermore, this overestimation is larger for samples of admixed individuals than for samples of unadmixed individuals. Overestimating the number of significant principal components can potentially lead to a loss of power in association testing by adjusting for unnecessary covariates and may lead to incorrect inferences about group differentiation. Velicer's minimum average partial test is shown to have both smaller bias and smaller variance, often with a mean squared error of 0, in estimating the number of principal components to retain. Velicer's minimum average partial test is implemented in R code and is suitable for genome-wide genotype data with or without population labels.
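To make the idea behind Velicer's minimum average partial (MAP) test concrete, here is a minimal Python sketch. It is not the R implementation mentioned in the abstract, and it may differ from that implementation in its details. The test removes the top m principal components from a correlation matrix, computes the average squared partial correlation that remains, and retains the m at which that average is smallest. The function name `velicer_map`, the choice to standardize each marker and correlate individuals rather than markers, and the toy two-population data (simple binomial draws, not a coalescent simulation) are all assumptions made for illustration.

```python
import numpy as np


def velicer_map(genotypes):
    """Sketch of Velicer's minimum average partial (MAP) test.

    genotypes: (individuals x markers) array of allele counts (0, 1, 2).
    Returns (m, f), where f[k] is the average squared partial correlation
    after removing the top k components and m is the index minimizing f.
    """
    G = np.asarray(genotypes, dtype=float)

    # Assumption: standardize each marker and drop monomorphic markers.
    sd = G.std(axis=0)
    keep = sd > 0
    Z = (G[:, keep] - G[:, keep].mean(axis=0)) / sd[keep]

    # Assumption: work with the correlation matrix among individuals.
    K = Z @ Z.T / Z.shape[1]
    scale = np.sqrt(np.diag(K))
    R = K / np.outer(scale, scale)
    p = R.shape[0]

    # Eigendecomposition, sorted so the largest eigenvalue comes first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    loadings = eigvecs[:, order] * np.sqrt(eigvals)

    off_diag = ~np.eye(p, dtype=bool)
    f = np.empty(p - 1)
    f[0] = np.mean(R[off_diag] ** 2)  # no components removed
    for m in range(1, p - 1):
        A = loadings[:, :m]
        C = R - A @ A.T  # residual after removing the top m components
        d = np.sqrt(np.clip(np.diag(C), 1e-12, None))
        partial = C / np.outer(d, d)  # partial correlation matrix
        f[m] = np.mean(partial[off_diag] ** 2)
    return int(np.argmin(f)), f


# Toy data (hypothetical): two unadmixed populations of 20 individuals each,
# with allele frequencies shifted in opposite directions at every marker.
rng = np.random.default_rng(0)
n_markers = 2000
base = rng.uniform(0.2, 0.8, size=n_markers)
pop1 = rng.binomial(2, base + 0.1, size=(20, n_markers))
pop2 = rng.binomial(2, base - 0.1, size=(20, n_markers))
genotypes = np.vstack([pop1, pop2])

n_retained, avg_sq_partials = velicer_map(genotypes)
print("components retained:", n_retained)
```

The sketch uses no population labels: the number of retained components comes only from the minimum of the average squared partial correlations. In this toy setting, a single axis of variation separates the two populations.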