Genomics and Big Data-Part 1

Gerry Higgins | OSEHRA Blog | January 23, 2013

This is the introductory part of a long report that I completed in response to a 'big data' study being performed by MITRE for the U.S. Army. It will be released in phases, with some text redacted. Here is the Executive Summary. See the link below for the first part of the PDF.


This draft report addresses data quality issues associated with the analysis of a massive dataset of whole human genome sequences. The dataset described in this report may be the world's largest collection of whole human genome sequences. It includes raw reads, aligned, and assembled sequences generated by 2nd generation sequencing, as well as annotated variant types. Annotation, one of the data quality issues, is an ongoing process, as the pace of discovery in clinical genomics is extremely rapid.

As of early 2013, regulations and guidelines have been issued by the College of American Pathologists for how to use next generation sequencing (NGS) in a laboratory certified under the Clinical Laboratory Improvement Amendments (CLIA) by the Centers for Medicare and Medicaid Services (CMS). The CLIA-certified laboratory is the only setting in which such sequences may be generated for clinical applications. Although the recent regulations are informative, the rate of technology acceleration and accompanying discovery in biology and medicine is so fast that no regulatory or professional organization can hope to keep abreast of such changes using current, highly bureaucratized processes.

There are many quality concerns associated with this massive dataset. These include differences in quality between the NGS platforms that contributed source data, inherent problems associated with short-read, 2nd generation sequencing, and issues related to clinical interpretation. It should be understood that there is continuing controversy in genomics concerning metadata standards, mutation nomenclature, and quality control metrics, which partly reflects the divide between researchers in the life sciences and those who perform clinical genomic sequencing.
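To make the quality-metric discussion concrete: one of the most basic per-base quality measures shared across NGS platforms is the Phred score embedded in FASTQ output. The report's methodology is not reproduced here; the following is only a minimal illustrative sketch, assuming the standard Phred+33 ASCII encoding used by modern Illumina instruments, of how an encoded quality string maps to base-calling error probabilities.

```python
# Illustrative sketch (not from the report): decoding per-base quality
# from a FASTQ quality string. Under Phred+33 encoding, each ASCII
# character encodes a quality score Q = ord(char) - 33, and the
# probability that the base call is wrong is 10^(-Q/10).

def phred_to_error_prob(qual_string, offset=33):
    """Convert a FASTQ quality string to per-base error probabilities."""
    return [10 ** (-(ord(c) - offset) / 10) for c in qual_string]

# 'I' encodes Q40, i.e. a 1-in-10,000 chance of a miscalled base;
# '#' encodes Q2, i.e. roughly a 63% chance of error.
probs = phred_to_error_prob("II##")
```

Differences in how platforms calibrate these scores are one source of the cross-platform quality variation noted above.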

Following a short introductory section, answers to the 'Big Data Quality' questionnaire begin on page 8 of this draft report. It is recommended that the reader pay special attention to the 'Flowchart' and related text that begins on page 30, as this is the process that was used for analysis of this specific dataset. Sample preparation using enzymatic amplification always introduces error, as does alignment against the human reference sequence from the National Center for Biotechnology Information (NCBI), National Institutes of Health, which is based largely on whole genome sequences from Caucasians. There are many additional data quality problems, and yet NGS technology is so powerful that it has led to the greatest number of biomedical advances in the history of medicine.

This report is the first phase of a process designed to resolve continuing questions about data quality, variant interpretation, and methodology. The next few years will usher in more accurate sequencing that will decrease the need for bioinformatic pre-processing as it is used in NGS now, and will increase the need for signal analysis. This shift will transform bioinformatics, and will equal the Coulter counter and the polymerase chain reaction in its impact on biomedicine.