Review Article | Volume 28, Issue 1, P145-166, March 2008

Data Mining in Genomics

      This article reviews important emerging statistical concepts, data mining techniques, and applications recently developed and used for genomic data analysis. First, general background and some critical issues in genomic data mining are summarized. A novel concept of statistical significance is then described: the so-called “false discovery rate” (the rate of false positives among all positive findings), which has been suggested for controlling the large number of false positives that arise in large-scale screening analyses of biological data. Two recent statistical testing methods are then introduced: significance analysis of microarrays and the local pooled error test. Statistical modeling in genomic data analysis is then presented, including analysis of variance and heterogeneous error modeling approaches that have been suggested for analyzing microarray data obtained from multiple experimental or biological conditions. Two sections then describe data exploration and discovery tools broadly termed unsupervised learning and supervised learning. The former include several multivariate statistical methods for investigating coexpression patterns of multiple genes, and the latter are classification methods for discovering genomic biomarker signatures that predict important subclasses of human diseases. The last section briefly summarizes various genomic data mining approaches in biomedical pathway analysis and in the prediction of patient outcome or chemotherapeutic response.
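      To make the false discovery rate concrete, the following is a minimal sketch in Python. The abstract does not say which procedure the article uses to control the rate, so the Benjamini-Hochberg step-up procedure is assumed here purely as a well-known illustration. The function names, the `alpha` level, and the example p-values are hypothetical and do not come from the article.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up procedure.

    Assumption: BH is used only as an illustration; the article may rely on
    a different FDR-controlling method.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def false_discovery_proportion(rejected, is_null):
    """Share of false positives among all positive findings (0.0 if none)."""
    positives = sum(rejected)
    if positives == 0:
        return 0.0
    false_positives = sum(1 for r, n in zip(rejected, is_null) if r and n)
    return false_positives / positives


if __name__ == "__main__":
    # Hypothetical p-values for ten genes.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, alpha=0.05))
```

      The second function computes the quantity the abstract describes, the fraction of false positives among all positive findings, for a single experiment in which the truth is known (for example, in a simulation). The false discovery rate is the expected value of that fraction over repeated experiments.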

      Journal: Clinics in Laboratory Medicine