10 November 2022
Association studies have been used for a few decades in plant and animal breeding. In a genome-wide association study (GWAS), the whole genome is scanned for genomic variants associated with a particular trait. These methods rely on the correlation that exists between a genetic marker and a phenotype across a collection of germplasm. The idea is to detect specific markers that can be used to identify a differential response in the group of genotypes evaluated. These markers are key to breeding tools such as Marker Assisted Selection (MAS), which can deliver important genetic gains by focusing on only a small set of molecular markers in a large pool of individuals.
The simplest approach to find the significant markers is based on a t-test that evaluates each marker individually. However, phenotypic data has a complex structure requiring, for example, the inclusion of design factors. At present, most GWAS applications use linear mixed models (LMMs) to fit complex models controlling for different data aspects and therefore providing more accurate results. Below, we introduce the standard model and procedures in GWAS used for plant, animal, aquaculture, and even human genetics to detect significant markers in a wide range of response variables.
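As a rough illustration of the single-marker approach described above, the sketch below regresses a phenotype on the allele count (0, 1, 2) of each marker and reports the slope and its t-statistic. The data, the function name `marker_t_statistic`, and the two-marker layout are hypothetical, and the sketch deliberately ignores design factors, population structure, and relatedness, which is exactly what the LMM approach is meant to address.

```python
import math

def marker_t_statistic(genotypes, phenotypes):
    """Slope and t-statistic from a simple regression of phenotype on allele count."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    # Residual sum of squares and standard error of the slope (n - 2 d.f.).
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, phenotypes))
    std_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_error

# Hypothetical data: 8 individuals (rows) by 2 markers (columns), coded 0/1/2.
markers = [
    [0, 2],
    [0, 1],
    [1, 0],
    [1, 2],
    [1, 1],
    [2, 0],
    [2, 2],
    [2, 1],
]
phenotype = [10.1, 9.8, 11.2, 10.9, 11.0, 12.1, 11.8, 12.3]

results = []
for j in range(len(markers[0])):
    column = [row[j] for row in markers]
    slope, t_stat = marker_t_statistic(column, phenotype)
    results.append((j, slope, t_stat))
    print(f"marker {j}: slope = {slope:.3f}, t = {t_stat:.2f}")
```

Each marker is tested on its own, so any structure in the data that is not in the model ends up in the residual or, worse, in the marker slope.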
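The general model used for GWAS is defined below:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{m}_i \beta_i + \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{Q}\mathbf{v} + \mathbf{W}\mathbf{a} + \mathbf{e}$$

where
$\mathbf{y}$ is the vector of phenotypic observations,
$\mu$ is the overall mean,
$\beta_i$ is the slope associated with the $i$th marker,
$\mathbf{b}$ is the vector of fixed design effects,
$\mathbf{u}$ is the vector of random design effects (e.g., blocks, pens), with $\mathbf{u} \sim N(\mathbf{0}, \sigma^2_u \mathbf{I})$,
$\mathbf{v}$ is the vector of fixed group or population effects,
$\mathbf{a}$ is the vector of random additive effects (i.e., breeding values), with $\mathbf{a} \sim N(\mathbf{0}, \sigma^2_a \mathbf{G})$,
$\mathbf{e}$ is the vector of random residual effects, with $\mathbf{e} \sim N(\mathbf{0}, \sigma^2_e \mathbf{I})$,
$\mathbf{m}_i$ is the vector of values (0, 1, 2) for the $i$th marker extracted from the marker matrix $\mathbf{M}$.
The matrices $\mathbf{X}$, $\mathbf{Z}$ and $\mathbf{W}$ are incidence matrices, $\mathbf{1}$ is a vector of ones, $\mathbf{Q}$ is a matrix of vectors describing the population structure, and finally $\mathbf{G}$ is the genomic relationship matrix (GRM) derived from the markers.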
As noted above, this model has several fixed and random effects that play an important role in GWAS models. We will describe each of these terms below.
Marker Effects. Estimating marker effects is the main objective of a GWAS analysis. The estimated slope of the $i$th marker ($\beta_i$) is the quantity of interest, and the data for the associated explanatory variable come from the $i$th column of the marker matrix $\mathbf{M}$ (i.e., $\mathbf{m}_i$).
Design Model Terms. These include both fixed effects (i.e., $\mathbf{b}$) and random effects (i.e., $\mathbf{u}$) associated with the structure of the data. In plant breeding, replicates and incomplete blocks are often included here, and in animal and aquaculture studies we often have herd-year-season or pen effects. Covariates can also be considered.
Population Structure. The presence of genetic structure in the population needs to be controlled for. This is usually done by taking the first few eigenvectors of the marker data matrix $\mathbf{M}$, or of the genomic relationship matrix $\mathbf{G}$, to form the matrix of coefficients $\mathbf{Q}$ that is incorporated in the model as a fixed term. The decision on the number of dimensions to include is often supported by a scree plot.
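The construction of the population structure columns can be sketched as follows. To keep the example dependency-free it uses power iteration with deflation, which is a stand-in for the library eigen-solver one would normally call; the relationship matrix `grm`, the function name `leading_eigenpairs`, and the choice of two dimensions are all hypothetical. The proportions in `scree` are the values a scree plot would display.

```python
import math

def leading_eigenpairs(matrix, k, iterations=200):
    """Return the k largest (eigenvalue, eigenvector) pairs of a symmetric
    positive semi-definite matrix using power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]  # copy so the input is not modified
    pairs = []
    for _ in range(k):
        vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
        for _ in range(iterations):
            product = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in product))
            vec = [x / norm for x in product]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        value = sum(
            vec[i] * sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)
        )
        pairs.append((value, vec))
        # Deflate: remove the component just found before extracting the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

# Hypothetical relationship matrix: four individuals from two subpopulations.
grm = [
    [1.0, 0.8, -0.9, -0.9],
    [0.8, 1.0, -0.9, -0.9],
    [-0.9, -0.9, 1.0, 0.8],
    [-0.9, -0.9, 0.8, 1.0],
]

pairs = leading_eigenpairs(grm, k=2)
total_variance = sum(grm[i][i] for i in range(len(grm)))  # trace = sum of eigenvalues
scree = [value / total_variance for value, _ in pairs]
# One row per individual, one column per retained eigenvector.
q_matrix = [[vec[i] for _, vec in pairs] for i in range(len(grm))]

print("proportion of variance per dimension:", [round(p, 3) for p in scree])
```

In this toy matrix the first eigenvector takes opposite signs for the two subpopulations, which is the kind of separation these columns are meant to capture.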
Family Structure. This is another potential source of bias, which is controlled by incorporating the genomic relationship matrix $\mathbf{G}$ into the model. This matrix is part of the distributional assumptions for the additive effects $\mathbf{a}$. It is obtained directly from the marker data matrix $\mathbf{M}$ using one of several available expressions, the most commonly used being the one proposed by VanRaden (2008).
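A minimal sketch of the widely used VanRaden (2008) expression is given below: each marker column is centred by twice its allele frequency, and the cross-product is scaled by $2\sum_j p_j(1-p_j)$. The marker data and the function name `vanraden_grm` are hypothetical, allele frequencies are estimated from the sample itself, and the sketch assumes no missing genotypes and no monomorphic-only data.

```python
def vanraden_grm(markers):
    """Genomic relationship matrix from a 0/1/2 marker matrix (rows = individuals)."""
    n = len(markers)
    m = len(markers[0])
    # Frequency of the counted allele at each marker, estimated from the sample.
    freqs = [sum(row[j] for row in markers) / (2 * n) for j in range(m)]
    # Centre each marker column by twice its allele frequency.
    centred = [[row[j] - 2 * freqs[j] for j in range(m)] for row in markers]
    # Scaling factor: 2 * sum of p(1 - p) over markers.
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / scale for k in range(n)]
        for i in range(n)
    ]

# Hypothetical marker matrix: 4 individuals by 5 markers.
markers = [
    [0, 1, 2, 1, 0],
    [1, 1, 2, 0, 0],
    [2, 0, 1, 1, 1],
    [1, 2, 0, 2, 1],
]

grm = vanraden_grm(markers)
for row in grm:
    print([round(value, 3) for value in row])
```

Because the frequencies come from the same individuals, every centred column sums to zero, so each row of the resulting matrix also sums to zero.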
Residual Term. Typically, this is assumed to be independent and identically distributed, but depending on the complexity of the LMM specified, this can have many different error structures (e.g., heteroscedastic errors).
There are many variants of the above general GWAS model. It is not uncommon to drop some of the terms, or to change the parameterization of the marker matrix $\mathbf{M}$, which at present is coded for each marker in an additive way (0, 1, and 2) but can instead be coded as a dominance matrix (0, 1, 0), or with each marker considered as a factor for non-additive modelling.
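The alternative marker codings can be sketched as simple recodings of an additive column. The function names and the genotype labels (`"AA"`, `"AB"`, `"BB"`) are hypothetical and only serve to illustrate the idea.

```python
def to_dominance(additive_column):
    """Recode additive counts (0, 1, 2) as dominance indicators (0, 1, 0)."""
    return [1 if g == 1 else 0 for g in additive_column]

def to_factor(additive_column):
    """Recode additive counts as genotype classes, one level per genotype."""
    labels = {0: "AA", 1: "AB", 2: "BB"}  # hypothetical labels
    return [labels[g] for g in additive_column]

additive = [0, 1, 2, 1, 0, 2]  # hypothetical marker column
dominance = to_dominance(additive)
genotype_class = to_factor(additive)

print(dominance)
print(genotype_class)
```

With the factor coding, a model estimates a separate effect for each genotype class instead of a single slope.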
At VSNi we have developed a free R library that provides a complete tool for implementing Genome-Wide Association Studies using the full modelling flexibility available in ASReml-R. This library, called ASRgwas, assists with preparing data and matrices, and verifies that they are adequate to perform GWAS. In addition, it has a set of complementary functions for post-GWAS analyses that help with the interpretation and use of the output information, and with producing graphical outputs.
The main tasks considered within ASRgwas are:
ASRgwas is designed to allow for any number of fixed and/or random structures and for heterogeneous error variances. It also handles raw and replicated data. In addition, it accepts missing values in the marker information, avoiding the need to implement marker imputation. We have also extended the GWAS analytical options by allowing the evaluation of phenotypic responses that follow a Binomial distribution, using the framework of generalized linear mixed models (GLMMs).
We invite you to explore and evaluate ASRgwas by downloading it from the VSNi website or directly from GitHub.