Introduction

The likelihood-based procedure (Lander & Botstein, 1989) and asymptotically equivalent regression methods (Haley & Knott, 1992) are now widely used for the statistical analysis of data sets concerning organisms for which dense maps of molecular markers are available, with the aim of localizing on these maps the loci affecting important quantitative traits (quantitative trait loci, QTLs). Advanced methods allowing for multiple QTLs situated on different chromosomes have been developed (Jansen & Stam, 1994; Zeng, 1994). All these methods are, however, univariate, and their usage in real multitrait experiments consists of analysing each trait independently and combining somehow the obtained results, taking (in some cases) the correlations between traits vaguely into account.

For this reason, some multivariate methodologies of QTL localization have been recently proposed. One of them is based on the generalization of the maximum likelihood procedure for mixture models (Jiang & Zeng, 1995; Korol et al., 1995, 1998). Other propositions utilize canonical transformation of the traits into canonical variates to which the univariate techniques are then applied (Weller et al., 1996; Mangin et al., 1998). The results given by Mangin et al. (1998) assure that if the correlations among traits can be reliably estimated, such an approach is asymptotically correct and, under certain assumptions concerning the effects of a QTL on individual traits, more powerful.

All the above methods have, however, limitations. The results of Mangin et al. (1998) are limited to the case of the pleiotropic QTLs, that is, of loci affecting all analysed traits simultaneously. The mixture models allow for closely linked, nonpleiotropic QTLs, but cannot be applied without relevant computational procedures for likelihood maximization, and therefore their usage is limited to simple experimental situations allowed for by the specialized software available in the public domain.

The purpose of this paper is to describe some methods of localization of multiple QTLs based on the multivariate model of multiple regression. In other words, we explore the possibility of a natural generalization of the regression methods used widely for the univariate case. It turns out that the formulated model and the methods following from it allow evaluation of descriptive indices pertaining to the QTLs postulated by the model and to the observed traits, and also formal testing of hypotheses concerning the existence of QTLs and their pleiotropy. Similarly to the univariate case, the regression model allows the analysis to be adjusted easily to various experimental situations. The closed-form formulae are used to calculate estimates of QTL effects and values of the test statistic, which eliminates iterative procedures required by the approach of Jiang & Zeng (1995).

The model is described and the method used to evaluate the contributions of the QTLs and of the traits to the rejection of the general no-QTL hypothesis is given; it is based on a particular use of the canonical analysis as described by Lejeune & Caliński (2000). An approach to testing hypotheses about the pleiotropy, formulated in terms of the extended linear hypotheses considered by Mudholkar et al. (1974) is presented. The mapping approach, being a generalized version of the composite interval mapping of Zeng (1994) and similar to the one used by Sari-Gorla et al. (1997), is detailed. A practical example of the proposed analysis, performed for the data collected in an experiment with maize recombinant inbred (RI) lines and extensively analysed with univariate methods by Frova et al. (1999) and Sari-Gorla et al. (1999), is given. The paper is completed by a discussion.

It should be noted that the method is described for the situation in which no phenotypic data and no marker data are missing. This corresponds to the assumption that proper preliminary analyses are performed prior to the multivariate analysis; these concern univariate analyses of variance followed by imputation of missing phenotypic observations, and correction of marker data as described by Martinez & Curnow (1994). Also, in the description of the method it is assumed that the experiment is designed in complete randomized blocks, the assumption being implied by the analysed real data set.

A model for multivariate QTL mapping

The n × v data matrix Y of n observations for v traits can be modelled by a multivariate extension of the model used for univariate QTL localization, of the form

where μ is the v × 1 vector of v means for the different traits, XB is the known n × b design matrix for b blocks, B is the b × v matrix of block parameters, XM is the known n × m matrix of corrected marker observations (having values from −1 to 1), D is the m × v matrix of marker parameters (regression coefficients), XQ is the initially unknown matrix of QTL genotypes at s positions p1, p2, …, ps on the chromosome, Q is the s × v matrix of additive genetic parameters of s putative QTLs and E is the n × v matrix of experimental unit errors. All the parameters are unknown and have to be estimated under the usual assumptions concerning random errors, such as that of the same covariance matrix for each experimental unit and of no correlations between different experimental units. Model (1) is a standard multivariate model used for multiple regression of observed traits on marker covariates and on QTL genotypes, with additional block nuisance parameters. It is suitable for the experimental situation of RI lines as considered here, but can be extended to a model that includes dominance or nonallelic interaction effects, if needed for other population types.

Although QTL genotypes are not known, columns of the matrix XQ can be calculated on the basis of known genotypes at markers flanking each putative QTL, using the genetic model of recombinations in RI lines (Knapp et al., 1990; Sari-Gorla et al., 1997). The formulae given below assume “full interference” or, in other words, the coefficient of coincidence equal to zero. Their application leads to an “approximate discrete” (Jansen, 1997) model of QTL effects, similar to the one widely used in the univariate QTL localization methods.

If any 1 ≤ ts QTLs are assumed to be in the same marker interval, they are considered as being linked. Denoting by r0 the recombination fraction per meiosis (r.f.) between the left flanking marker and first QTL, by ri, i=1, 2, …, t − 1, r.f. between the ith and (i+1)th QTL, and by rt r.f. between the last QTL and the right flanking marker, we get the column of XQ corresponding to the ith QTL as

where Ri = 2ri/(1+2ri), i=0, 1, …, t, R=∑ti=0Ri and xL, xR are n × 1 vectors of known genotypes at the markers flanking the group of t QTLs from the left and from the right, respectively.

Formula (2) applies also to the case t=1, i.e. to the case of one isolated QTL assumed in a marker interval, in which case it reduces to the usual form (see, e.g., Sari-Gorla et al., 1997). Therefore, eqn (2) can be used to calculate columns of XQ corresponding to any sequence of linked and unlinked QTLs.

In consequence, the element qij of Q represents the effect of the ith QTL for the jth trait. The hypothesis

can be tested (as a general linear hypothesis) by, e.g., the Lawley–Hotelling T20 statistic:

where mE denotes the number of degrees of freedom for error [=n − 1 − (b − 1) − ms], and E, H denote the sum of squares and products (SSP) matrices corresponding to by errors and the deviation from the hypothesis H, respectively.

In the above formulation, a problem occurs if some of the putative QTLs are assumed to be in the same marker interval. In such a situation, the corresponding columns of XQ span the same space for any chosen positions of the QTLs. Therefore, the test of H is not sensitive to these positions, or, in other words, conveys no information on the most probable positions of the loci controlling the traits.

Moreover, a problem would occur if a QTL were assumed to be located at the position of one of the markers constituting XM. Such situations can be avoided by omitting from XM the markers flanking any of the putative QTLs or even omitting markers lying on the chromosome under consideration, a practice suggested in the case of univariate interval mapping (Jansen, 1994; Zeng, 1994).

Canonical analysis

Instead of hypothesis (3), a more general linear hypothesis is usually of interest in the analysis. It can be written as

with the s × n matrix Ω obtained as

where Ξ=[1′, B′, D′]′ is the (1+b+m+s) × v matrix of the model parameters specified in model (1), C is a s × (1+b+m+s) matrix extracting Q from Ξ, M is a v × u matrix of rank u defining linear functions of columns of Q (in our application, M is mainly used for making a choice of the columns of Q). Hypothesis (4) can also be tested by a proper Lawley–Hotelling statistic, and all component hypotheses corresponding to the rows of Q and subsets of columns of M can be verified simultaneously by one of the known tests. Also, the canonical variate analysis, as described by Lejeune & Caliński (2000), can be used to identify component hypotheses responsible for the rejection of (4).

The canonical variate analysis for hypothesis (4) consists of calculating the standardized form of Ω^, the BLUE of Ω. A natural standardization is given by

where X=[1 : XB : XM : XQ], C=[0 : 0 : 0 : Is] and S=m−1EYQEY, with QE=InX(XX)X′. It can be seen that

with T02 being the statistic for testing the general hypothesis (4). Moreover, the squared (i,j)th element of Ω^std, ω^2ij, provides a measure of the contribution of the (i, j)th element of Ω to T20, and so of the responsibility for the rejection of (4). Consequently, the row and column sums of elements of the matrix [ω^2ij] provide a similar measure of responsibility [for the rejection of (4)] of individual QTLs and linear parametric functions defined by columns of M, respectively.

Extended linear hypotheses

Model (1) can also be used for recognizing pleiotropic and nonpleiotropic QTLs, the problem treated by the maximum likelihood approach by Jiang & Zeng (1995).

Consider a special, simple case of s=2 (two QTLs at positions p1, p2) and v=2 (two traits). Then

with qij denoting the additive effect of the ith QTL on the jth trait.

The case of two nonpleiotropic QTLs, at p1 for the first trait and at p2 for the second trait, corresponds to the situation in which

So, we are interested in testing the hypotheses of the form

involving just diagonal or just off-diagonal elements of Q. Such hypotheses cannot be written as in (4), even when using a more general version of Ω involving a matrix defining functions of the rows of Q. Instead, a more general approach known from the literature must be used.

In connection with the general linear hypothesis of the type (4), tested by T20=mEtrace(E−1H), where

an extended linear hypothesis has been defined by Mudholkar et al. (1974) as

for all matrices G of order s × u from a set Γ. In a special case, when Γ is defined as the span

where G1, G2,…,GK are s × u linearly independent matrices and λ1, λ2,…,λK are real numbers, the hypothesis HΓ is called the extended linear hypothesis of order K. The test statistic proper for an extended linear hypothesis of order 1, defined by a matrix GG1, is of the form

where R = C(XX)C′. The calculated value of t(G) can be compared with the critical value Cα, where C2α=T20,α/mE, with T20,α being the appropriate upper α point of the distribution of T02 (obtainable from a table, e.g. in Seber, 1984, table D15, or by an approximation proposed by McKeon, 1974). The test of HΓ of order K>1 can be accomplished using the test statistic

where τ = [τ1, τ2, …, τK]′, with τk=trace(GkΩ^), and T = [tkk], with tkk=trace(GkRGkE), for k, k′=1, 2,…,K. The calculated value of t2(Γ) is to be compared with C2α.

Returning to the special case described above, to decide on the type of two QTLs for two traits we can test the two hypotheses (5) written as

where

as two extended linear hypotheses of order 2. Rejection of H1 and not rejection of H2 indicates the existence of two nonpleiotropic QTLs, the first having effect on the first trait and the second having effect on the second trait only, whereas the reverse situation points to a reverse effect of the two QTLs.

In the general case of s and v greater than 2, the choice of hypotheses to be tested in order to reject or confirm pleiotropy or nonpleiotropy of (a subset of) QTLs can be more difficult. To some extent, it can be directed by results of the canonical analysis described in Section 3, after possible re-ordering of QTLs and traits suggested by the dominant elements of the matrix Ω^std.

Interval mapping

The model described in this paper and all additional methods that follow from it can be used for interval QTL mapping in the way similar to that used in the application of the univariate regression model to the QTL mapping for one trait. The practical algorithm may be described as follows.

Selection of marker co-variates for the model

Selection of markers defining the matrix XM in model (1) can be performed analogously to the forward selection procedure. We start from the model similar to (1) but involving only the general and block parameters. For each marker we perform the multivariate regression, and select the marker which gives the maximum value of the test statistic for regression (e.g. Rao’s, 1973, section 8c.5, approximation to Wilks’s test statistic can be used). This marker is included in the model and the selection is repeated until nonsignificant changes in the value of the test statistic are observed (at a low, of about 0.001, significance level, to avoid inclusion of markers correlated with the phenotypic traits by pure chance).

Interval mapping of one QTL (s=1)

The linkage group (chromosome) is scanned with a given step [cM], and model (1) is used to find positions suspected of bearing a QTL, that is, positions giving large values of the T20 statistic for hypothesis H : Q=0 [hypothesis (3) with M=I]. At each position p, the estimate

is computed, giving the effects of the putative QTL on all traits. Also, the elements of the 1 × v matrix [ω^2ij] are evaluated, giving the contributions of the traits to the total T20.

Note that the formal test for the existence of a QTL on the chromosome can be accomplished by treating hypotheses connected with all positions on this chromosome as a family, and using the sequentially rejective Bonferroni testing procedure (SRTP) as described by Sari-Gorla et al. (1997) after Hochberg & Tamhane (1987, pp. 57–58).

Scanning for two QTLs (s=2)

Any chromosome region suspected of bearing a QTL affecting more than one trait is scanned with the model (1) for s = 2, that is, assuming two QTLs at positions p1, p2 in that region. For each (p1, p2) the estimate

is obtained. The contribution of the parameters {qij} to the total T20 can be represented by elements of the matrix [ω^2ij]. The contribution of two QTLs can be separated by calculating the two row sums of [ω^2ij], and the contribution of v traits by calculating all v column sums of [ω^2ij]. To declare the existence of two QTLs, the values of T20 calculated for the whole region can be subjected to SRTP as mentioned above.

At this point, the matrix M of a proper form may be used to restrict the tested hypotheses to those concerning the variables which have the dominant contribution to the rejection of the no-QTL hypothesis, both for s = 1 and for s = 2. In this way the dimensionality of the problem can be reduced and the decision about pleiotropy of the QTLs can be made on the basis of a relatively simple testing procedure.

Comparison between the fit of model (1) for s = 1 and s = 2 is not simple, because optimal positions of one QTL and of two QTLs do not lead, in general, to models which are nested one within the other. The two positions p1, p2 giving the maximum value of T20 for s = 2 may be, and usually are, different from the position p giving the maximum for s = 1. Therefore, any formal test of the two-QTL model against the one-QTL model would involve a function of two statistics, each being a maximum of T20 over a region (similarly as in Jiang & Zeng, 1995), distributions of which can only be approximated by some simulation studies (see, e.g., Rebaï et al., 1994). In our opinion the decision concerning the use of the two-QTL model should be made taking into account not only the relative value of the two statistics, but also the practical meaning of the distance between the positions of the two hypothesized QTLs and their location in relation to markers.

Pleiotropy

If the model involving two QTLs at some p1, p2 is accepted, the hypotheses concerning pleiotropy can be tested. Theoretically such hypotheses can be formulated for any number of traits. Practically, the most clear inference can be made for two traits with dominant contribution to the rejection of the general hypothesis, as has been exemplified before.

Example

Data analysed in this section come from an experiment with maize recombinant inbred lines obtained from the cross B73 × H99, performed to study their drought tolerance. Details of the experiment and results of the data analysis performed by univariate QTL localization techniques have been described by Frova et al. (1999) and Sari-Gorla et al. (1999).

In this example we take into consideration nine phenotypic traits, observed for 142 RI lines in two replications (blocks) and transformed into drought tolerance indices by taking (as observations) the ratios of means from plots subjected to water stress to means from plots receiving normal irrigation. The traits are as follows.

1. Five yield components: ear length (EL), ear weight (EW), kernel weight per ear (KW), kernel number per ear (KN) and fifty kernel weight (50 KW).

2. Flowering components: male flowering time (MFT), female flowering time (FFT) and anthesis-silking interval (ASI=FFT − MFT).

3. Plant height (PH).

The marker data available consist of observations of 153 molecular markers [RFLP, microsatellites (SSR) and AFLP] located in 12 linkage groups assigned to 10 maize chromosomes (the map obtained by the MAPMARKER/EXP computer program). In the analysis below we take into consideration a subset of 144 markers selected in such a way that no two markers are located closer to each other than 1 cM.

Prior to the multivariate analysis, ANOVA was performed for each trait independently, and the fitted values obtained from it were imputed in place of (a few — less than 1%) missing phenotypic observations. About 11% of the marker observations were missing and the method of Martinez & Curnow (1994) was used to correct them. Because of some large marker intervals, the correction could not be completed for four RI lines, which were removed. Such preliminary analysis gave a complete data set for 138 lines replicated in two blocks and described by values of nine traits and by (corrected) observations of genotypes at 144 markers.

Selection of marker covariates identified nine markers providing good explanation of the total variability of nine traits. One of these markers, p20052, is located at 74.6 cM in the linkage group identified as the ninth maize chromosome. From now on we restrict our attention to this chromosome.

Figure 1 summarizes results of interval mapping on chromosome 9 using the multivariate model postulating one QTL. In Fig. 1(a), a clear peak in the values of the T20 statistic occurs at 75 cM, very close to p20052. The largest contribution to this peak comes from traits KW, 50 KW and KN. The SRTP with the chromosome-wise significance level of 0.05 declares as significant T20 values at 74 and 75 cM. Simultaneous testing of the contribution of the traits at 75 cM with α=0.05 (with the critical value equal to 7.84, as approximated by the Bonferroni inequality) suggests a highly significant value for KW; the simultaneous test at α=0.1 (critical value 6.55) gives significance also to the contributions of 50 KW and KN. The estimates of QTL additive effects, shown in Fig. 1(b), are positive for all these traits, indicating that the parental line H99 is the one which contributes alleles increasing the phenotypic values.

Figure 1
figure 1

Results of scanning chromosome 9 with the model of one QTL affecting nine traits. (a) Profiles of T02 values; continuous line corresponds to the total T20 statistic, numbered lines to the contribution of the traits: 1, EL; 2, EW; 3, KW; 4, KN; 5, 50 KW; 6, MFT; 7, FFT; 8, ASI; 9, PH. (b) Profiles of the estimates of QTL additive effects for all traits.

So far, the analysis suggests the existence of a QTL at marker p20052 for KW. Its importance for 50 KW and KN is not quite clear. Let us restrict attention to these two traits and continue the analysis with a bivariate model. By repeating the procedure of marker selection and mapping with a properly defined matrix M in (4), we obtain then the maximum value of T20=21.34 at 74 cM, with the contributions of KN and 50 KW equal to 15.17 and 6.17, respectively. Because of the lower number of traits, the SRTP now declares significance of the total T20 for positions 67–77 cM, and both contributions are significant while testing simultaneously at α=0.05. Scanning the region with the bivariate model declaring two QTLs gives the maximum value of T20=23.38 (P < 0.001) at positions (72, 75). Although the increase of the T20 value in comparison to the one-QTL model is not very large, the two-QTL model may be interesting, especially because it declares significant positions in two different marker intervals. Contributions of QTLs and traits are given in Table 1. They clearly indicate importance of the QTL at 72 cM for 50 KW and importance of the QTL at 75 cM for KN, which suggests two nonpleiotropic QTLs.

Table 1 Values of the contribution to T20 for individual QTLs and traits

The nonpleiotropy suggested by Table 1 can be further checked by testing parametric functions as described. At the position (72, 75) the test of H1 defined in (7) gives the value mEt2(Γ)=1.08, which should be compared with the critical value T20;0.05=9.65. We conclude that the influence of the first QTL on KN and the influence of the second QTL on 50 KW are negligible. Although the hypothesis H2 is also not rejected (mEt2(Γ)=3.58), the nonpleiotropy seems to be more likely than the reverse.

Discussion

The aim of the paper was to show the way along which a person interested in using the multivariate analogue of the regression model, widely applied to QTL localization, may proceed. The main arguments for using the multivariate model are the same here as in any statistical analysis of experimental data: taking into account correlations among the analysed traits and drawing inferences on various traits simultaneously. It seems quite natural, at least, to start the analysis within a multitrait set up, as in practical applications the observed traits cannot be assumed to be uncorrelated. Note that only the multivariate approach can properly tackle the problem of recognizing pleiotropic and nonpleiotropic QTLs, which is important for the perspective of marker-assisted selection.

The presented approach to finding loci affecting quantitative traits relies on the model of multivariate multiple regression. By addition of the special use of the canonical analysis and of the tests of extended linear hypotheses, the method can be used to verify simultaneously several conjectures concerning the type of influence of the putative QTLs on the studied traits and also to evaluate the role of the individual traits. All inference is made within one general model, using well-founded simultaneous testing procedures. Properties of these procedures are important theoretically and also very convenient from the practical point of view.

A theoretical advantage over the univariate approach, not always realized, lies in the fact that analysing many phenotypic traits and many markers by univariate methods may result in determining a great number of positions on the chromosome with significant values of the test statistic, that is, result in an excessive number of declared QTLs. In this approach, the probability of determining a false QTL by sheer chance as a result of the number of traits taken into account is not controlled. The multivariate model takes care of this type of error. The price for it is usually the large critical value for the test statistic, and a decrease of power of the significance test used to detect QTLs.

The practical convenience of the method consists in the relative simplicity of the algorithm, which can be programmed in any of the high-level statistical languages (we used GENSTAT 5 Release 4.1; GENSTAT 5 COMMITTEE, 1996). For a multitrait data set, it conveniently allows assessment of the effect of addition/deletion of traits by specifying the proper matrix postmultiplying the matrix of QTL parameters in the expression of the general hypothesis.

Taking as the basis of the analysis the multivariate linear model, we also accept the usual assumption concerning the homogeneity of the covariance matrix of the traits over QTL and marker groups. This assumption has been considered as very restrictive in the QTL mapping analyses by Korol et al. (1998). We agree that its violation may have an effect on the power of the presented method, especially because it uses different forms of the T20 statistic, which was found to be not quite robust against heteroscedasticity in MANOVA by Ito (1980) (see also Seber, 1984, section 9.2.3). Here we deal with multivariate regression and it is not quite clear if the same conditions apply. Alternative multivariate tests with better properties are available and can be used to test the general no-QTL hypothesis if needed. This, however, makes the rest of the calculations described difficult to accomplish, and the exact decomposition of the general test statistic into meaningful components impossible.

The model and implied methods have been presented in a general way, for any number of QTLs and traits. It is obvious that a fair thorough inference would be difficult in the case of many traits. Also, because of the high density of the marker maps used, hypothesizing a large number of QTLs in one marker interval would not be sensible. Therefore, the method should be used rather to reduce the dimensionality of the problem in order to focus on the most important traits and a few putative QTLs, as shown in the example, and is particularly suitable for this purpose. It is possible that the bivariate version of model (1) will prove to be the most useful in practice.

A question may be raised about applicability of the usual significance levels for tests carried out in the analysis of the data set reduced to only a selection of traits. In our approach any reduction of the model is made through hypothesis testing, in the sense that a hypothesis with a smaller number of traits is tested only if a relevant hypothesis with a larger number of traits has been rejected. Such a procedure allows one to control the significance level for testing the hypothesis under the reduced model. However, if the reduction of the number of traits involves some other model modifications the nominal significance level has to be considered as approximate.

Although the method presented has no inherent ability to deal with missing phenotypic and genotypic observations, we believe that preliminary correction steps taken with the data set considered here before mapping did not distort the conclusions. Insertion of a few missing phenotypic observations did not result in a serious understatement of the true variability of the analysed traits. The effect of correcting the missing marker observations is not exactly known, but the idea of using expected genotypes in place of missing values is consistent with the whole approach to the linear regression model, in which genotypic observations at putative QTLs are not known and are all replaced by calculated values. Finally, markers located very closely to each other, less than the mapping step used, did not provide additional information about possible QTLs, and could be safely removed.