Fast Semi-Parallel Linear and Logistic Regression for Genome-Wide Association Studies


Karolina Sikorska, Department of Biostatistics, Erasmus Medical Centre in Rotterdam

This tutorial demonstrates semi-parallel computing and SNP data reorganization in the statistical program R. Karolina Sikorska describes techniques for speeding up genome-wide association studies (GWAS) and making genome-wide association scans possible on a notebook computer using matrix operations and matrix-oriented binary files. The video is from a webinar recorded September 12, 2013.

Part 1: Semi-Parallel Linear Regression

Part one explains GWA analysis in a loop using the lm and lsfit functions, and semi-parallel computation of linear regression with covariates. It also explains how to handle missing phenotype and SNP data.
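The semi-parallel idea for linear regression can be sketched in a few lines of R. The example below is a minimal illustration on simulated data, not the presenter's code: the variable names (S, X, y_res, S_res) and the simulated genotypes and covariate are assumptions made for the example. It projects the covariates out of the phenotype and out of all SNP columns at once, then obtains every SNP effect, standard error and p-value with column-wise matrix operations instead of a loop over lm. It assumes complete data; missing phenotypes or genotypes would need separate handling.

```r
set.seed(1)
n <- 200                                    # individuals (assumed for the example)
m <- 50                                     # SNPs (assumed for the example)
S <- matrix(rbinom(n * m, 2, 0.3), n, m)    # simulated genotypes coded 0/1/2
age <- rnorm(n)                             # one simulated covariate
y <- 0.5 * S[, 1] + 0.3 * age + rnorm(n)    # phenotype; only SNP 1 has an effect
X <- cbind(1, age)                          # intercept plus covariate
k <- ncol(X)

# Project the covariates out of the phenotype and out of all SNPs in one go
P <- solve(crossprod(X), t(X))              # (X'X)^{-1} X', a k x n matrix
y_res <- as.vector(y - X %*% (P %*% y))
S_res <- S - X %*% (P %*% S)

# One simple regression per SNP, computed column-wise without a loop
sxx   <- colSums(S_res^2)
beta  <- as.vector(crossprod(S_res, y_res)) / sxx
rss   <- sum(y_res^2) - beta^2 * sxx        # residual sum of squares per SNP
df    <- n - k - 1
se    <- sqrt(rss / df / sxx)
tstat <- beta / se
pval  <- 2 * pt(-abs(tstat), df = df)

# Reference fit for the first SNP using lm
fit <- summary(lm(y ~ age + S[, 1]))$coefficients
```

For any single SNP, beta, se and pval should agree with the corresponding row of the lm summary (here the third row of fit), which is a convenient sanity check before running a full scan.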

Part 2: Semi-Parallel Logistic Regression

Part two explains semi-parallel logistic regression in R based on iteratively reweighted least squares (equivalent to glm), with and without covariates.
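The iteratively reweighted least squares (IRLS) idea can be illustrated with a short R sketch. This is a simplified version under stated assumptions, not the webinar code: it fits the covariate-only logistic model once with glm.fit, then takes a single weighted least squares step for all SNPs simultaneously. A single step only approximates the fully iterated glm estimates (it behaves like a score-type test), and whether the presenter's code iterates further is not stated here. All data are simulated and the variable names are hypothetical.

```r
set.seed(2)
n <- 500                                    # individuals (assumed for the example)
m <- 40                                     # SNPs (assumed for the example)
S <- matrix(rbinom(n * m, 2, 0.3), n, m)    # simulated genotypes coded 0/1/2
age <- rnorm(n)                             # one simulated covariate
y <- rbinom(n, 1, plogis(-0.5 + 0.4 * S[, 1] + 0.3 * age))
X <- cbind(1, age)                          # intercept plus covariate

# Covariate-only logistic fit: done once, shared by all SNPs
fit0 <- glm.fit(X, y, family = binomial())
p <- fit0$fitted.values
w <- p * (1 - p)                            # IRLS weights at the null fit
z_res <- (y - p) / w                        # working response minus null linear predictor

# Weighted projection of the covariates out of every SNP column
XtW <- t(X * w)                             # X'W, a k x n matrix
S_res <- S - X %*% solve(XtW %*% X, XtW %*% S)

# One weighted least squares step per SNP, computed column-wise
swx   <- colSums(w * S_res^2)
beta  <- as.vector(crossprod(S_res, w * z_res)) / swx
se    <- 1 / sqrt(swx)                      # dispersion is 1 for the binomial family
zstat <- beta / se
pval  <- 2 * pnorm(-abs(zstat))
```

SNPs that look interesting in such a one-step scan can be refitted individually with glm to obtain fully converged estimates.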

Part 3: Efficient Data Access

Part three explains how to convert the SNP matrix from a text file to an array-oriented binary file using the ncdf and ff packages. Array-oriented binary files allow efficient access to blocks of SNPs (columns), as opposed to reading the data one individual (row) at a time.
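The benefit of a column-oriented binary layout can be shown with base R alone. The sketch below is a stand-in for what the ncdf and ff packages provide and does not use their APIs: it writes a simulated genotype matrix SNP after SNP with writeBin, then reads back an arbitrary block of SNPs with seek and readBin without touching the rest of the file. The helper read_snp_block, the one-byte genotype encoding and the temporary file are assumptions made for the example, and missing genotypes are not handled.

```r
set.seed(3)
n <- 100                                    # individuals (assumed for the example)
m <- 1000                                   # SNPs (assumed for the example)
S <- matrix(as.integer(rbinom(n * m, 2, 0.3)), n, m)  # genotypes coded 0/1/2

# Write the matrix column by column (R stores matrices column-major),
# using one byte per genotype
path <- tempfile(fileext = ".bin")
con <- file(path, "wb")
writeBin(as.vector(S), con, size = 1L)
close(con)

# Hypothetical helper: read `count` consecutive SNPs starting at SNP `first`
read_snp_block <- function(path, n, first, count) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = (first - 1) * n)        # byte offset of the first requested SNP
  vals <- readBin(con, what = "integer", n = n * count, size = 1L)
  matrix(vals, nrow = n, ncol = count)
}

# Only SNPs 201 to 250 are read from disk
block <- read_snp_block(path, n, first = 201, count = 50)
```

Because each SNP occupies a contiguous run of bytes, a block of SNPs can be fetched with one seek and one read, which is the access pattern a SNP-by-SNP or block-by-block scan needs.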

Full Recording

Download R and Individual R Packages

R Packages Specific to this Tutorial

ncdf: Interface to Unidata netCDF data files
ff: memory-efficient storage of large data on disk and fast access functions

R Codes Available

About the Presenter

Karolina Sikorska received a Master’s degree in Mathematics from the Gdansk University of Technology, Poland, with a specialization in financial mathematics. In 2009 she started her PhD project in the Department of Biostatistics, Erasmus Medical Centre in Rotterdam. Her research concerns fast computations in genome-wide association studies. Her work focuses on developing new methodology and algorithms that significantly speed up computations in GWAS for simple models, such as linear and logistic regression, as well as for mixed models used to analyze longitudinal data. She is also interested in improving tools for efficient data access in the GWAS framework.


Related Publications

Sikorska, K., Lesaffre, E., Groenen, P. F., & Eilers, P. H. (2013). GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. BMC Bioinformatics, 14(1), 166.

Sikorska, K., Rivadeneira, F., Groenen, P. J., Hofman, A., Uitterlinden, A. G., Eilers, P. H., & Lesaffre, E. (2013). Fast linear mixed model computations for genome‐wide association studies with longitudinal data. Statistics in Medicine, 32(1), 165-180.


Funding Statement

Development of this resource was supported in part by the National Institute of Food and Agriculture (NIFA) Solanaceae Coordinated Agricultural Project, Dry Bean Root Health East Africa, and the Erasmus Medical Center. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the United States Department of Agriculture.


Slides.pdf (570.88 KB)
