Kruskal–Wallis Statistic to Identify Associations Between Markers and Traits

Author:

Heather L. Merk, The Ohio State University

The Kruskal–Wallis statistic, H, can be used in a plant breeding context to identify associations between molecular markers and traits, particularly when the population structure is unbalanced and the trait values can be ranked. This module provides an overview of the Kruskal–Wallis statistic and an example calculation of H for a single marker.

Introduction

The Kruskal–Wallis statistic, H, can be used in a plant breeding context to identify associations between molecular markers and traits, particularly when the population structure is unbalanced. The Kruskal–Wallis test is a nonparametric alternative to one-way ANOVA, which assumes that trait values within each group are normally distributed and that the group variances are equal. Although the Kruskal–Wallis test does not require these assumptions, it does assume that the samples from each group (each genotype in this case) are independent and come from distributions with the same shape. Because the test uses ranks rather than raw data, it can be applied to most data that can be ordered.

For example, the Kruskal–Wallis test was used in the Marker-Assisted Selection for Disease Resistance in Tomato Tutorial to test whether or not each molecular marker genotyped in an inbred backcross (IBC) population was associated with bacterial spot resistance (ranked according to the presence or absence of the hypersensitive disease response).

Calculating H

When calculating H for single marker analysis, the null hypothesis is that the median phenotypic value is the same for every genotype. In the bacterial spot example, the null hypothesis is that there is no difference in disease response between the heterozygous individuals and the homozygous individuals.

The procedure for calculating H is based on Kruskal and Wallis (1952) and Clewer and Scarisbrick (2001).

First, the individuals should be ranked from lowest to highest phenotypic value, regardless of group (genotype). If individuals are tied, all those individuals should receive their average rank (e.g., if the third, fourth, and fifth lowest individuals were tied, they would each receive a rank of 4).
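
The ranking step can be sketched in a few lines of Python. This is only an illustration: the function name `average_ranks` and the sample scores are hypothetical and do not come from the tutorial data.

```python
def average_ranks(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    # Indices of the observations, ordered from lowest to highest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) hold ranks start+1..end+1.
        mean_rank = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


# Hypothetical phenotypic scores: the 3rd, 4th, and 5th lowest values are tied.
scores = [1, 2, 3, 3, 3, 6]
print(average_ranks(scores))  # [1.0, 2.0, 4.0, 4.0, 4.0, 6.0]
```

The three tied individuals each receive rank 4, matching the example in the text.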

Second, H is calculated using the following equation:

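$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$

where k is the total number of groups (genotypes), N is the total number of individuals across all groups, n_i is the number of individuals in group i, and R_i is the sum of the ranks for group i.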
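
As a rough illustration of the formula above, the following Python sketch computes H from ranks that have already been assigned across the whole population. The function name `kruskal_wallis_h`, the genotype labels, and the ranks are hypothetical. The sketch uses the formula exactly as written, without any adjustment for ties, so statistical software that applies a tie correction may report a slightly different value when many individuals are tied.

```python
def kruskal_wallis_h(ranks_by_group):
    """Compute H from a mapping of group label -> list of ranks in that group."""
    # N: total number of individuals across all groups.
    n_total = sum(len(ranks) for ranks in ranks_by_group.values())
    # Sum over groups of (rank sum squared) / (group size).
    weighted = sum(sum(ranks) ** 2 / len(ranks) for ranks in ranks_by_group.values())
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical example: 10 individuals, two marker genotypes, no ties.
ranks_by_genotype = {
    "homozygous": [1, 2, 3, 4, 6],      # rank sum 16
    "heterozygous": [5, 7, 8, 9, 10],   # rank sum 39
}
h = kruskal_wallis_h(ranks_by_genotype)
print(round(h, 4))  # 5.7709
```

Here H = (12/110)(16²/5 + 39²/5) - 33, which is approximately 5.77.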

Provided that every group has at least five individuals, H approximately follows a chi-square distribution with k-1 degrees of freedom. Thus, the calculated H is compared with the chi-square distribution to decide whether to reject the null hypothesis that phenotypic values do not differ between genotypes.
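
In place of a printed chi-square table, the upper-tail probability can be computed directly. The sketch below uses only the Python standard library; the function name `chi_square_sf` is hypothetical, it is limited to whole-number degrees of freedom, and the 0.05 threshold in the usage lines is an assumed significance level, not one prescribed by the text.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{j=0}^{df/2 - 1} y^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= y / j
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{j=1}^{(df-1)/2} y^(j - 1/2) / Gamma(j + 1/2)
    term = math.sqrt(y) / math.gamma(1.5)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (j + 0.5)
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total


# Hypothetical usage: H = 5.77 with k = 2 genotypes, so df = k - 1 = 1.
p_value = chi_square_sf(5.77, 1)
print(p_value < 0.05)  # True: reject the null hypothesis at the assumed 0.05 level
```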

SAS Code to Calculate H

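The NPAR1WAY procedure is used to perform the Kruskal–Wallis test in SAS. The CLASS statement names the variable holding the marker genotypes, and the VAR statement names the variable holding the phenotypic value or rank. Specifying the WILCOXON option in the PROC NPAR1WAY statement requests the analysis of Wilcoxon scores, which includes the Kruskal–Wallis test.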

References Cited

  • Clewer, A. G., and D. H. Scarisbrick. 2001. Practical statistics and experimental design for plant and crop science. John Wiley & Sons, Ltd, NY.
  • Kruskal, W. H., and W. A. Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47: 583–621. (Available online at: http://www.jstor.org/stable/2280779) (verified 12 May 2012).

Funding Statement

Development of this page was supported in part by the National Institute of Food and Agriculture (NIFA) Solanaceae Coordinated Agricultural Project, agreement 2009-85606-05673, administered by Michigan State University. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the United States Department of Agriculture.