Data Analysis
Among the high-throughput technologies operated by the Cologne Center for Genomics (CCG) are gene expression and genotyping microarray facilities (Affymetrix, Illumina) as well as in-house deep sequencing platforms. These devices produce very large amounts of data: the storage demands of deep sequencing of genomic DNA or whole transcriptomes range from hundreds of gigabytes to several terabytes. Thus, alongside the great research opportunities they open up, these technologies pose major challenges for data analysis, both in the availability and use of hardware resources and in the need for efficient computation with tailor-made scripts or compiled code. To meet these challenges, a bioinformatics facility has been established at the CCG, offering services and consulting for computational data analysis, with particular expertise in biostatistics.
Deep Sequencing
The in-house technical and bioinformatic tools for analyzing deep sequencing data include a powerful infrastructure, currently being built up at the CCG, for semi-automatic processing and filtering of the large volumes of sequence data. In addition, the CCG bioinformatics group operates high-throughput implementations of classical bioinformatic tasks such as alignment and assembly.
High-throughput alignment of DNA reads to reference sequences supports applications ranging from the analysis of single nucleotide polymorphisms (SNPs), genomic mutations, and copy number alterations to large-scale structural variations in cancer genomes. In whole-transcriptome sequencing, alignments are used to build high-resolution expression profiles and to investigate novel isoforms of known genes. The CCG also provides data analysis support for more specialized approaches such as sequencing of target-enriched libraries, miRNAs, ChIP-Seq experiments, and multiplex libraries.
The construction of de novo whole-genome assemblies comprises the assembly itself, i.e. rebuilding the genomic DNA sequence from millions of short reads, as well as the genomic annotation and the prediction of genetic structure for the novel organisms being sequenced.
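The idea of rebuilding a sequence from overlapping short reads can be illustrated with a toy greedy overlap merge. This is only a minimal sketch for intuition, not the pipeline used at the CCG: the function names, the `min_overlap` parameter, and the example reads are hypothetical, and the sketch ignores sequencing errors, reverse complements, and repeats, all of which real assemblers must handle.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of contigs with the largest overlap until none remains."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical error-free reads drawn from the sequence ATGCGTACGTTAGC
reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

The pairwise comparison is quadratic in the number of contigs, so this approach is only suitable for illustration; with millions of reads, assemblers rely on indexed overlap or graph-based methods instead.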
To organize the large amounts of data produced in deep sequencing, the CCG bioinformatics group cooperates closely with the Regional Computing Center Cologne (RRZK), which provides the storage and computational facilities needed to handle these data. Details about the High-Performance Computing (HPC) facilities at the RRZK are available at "HPC and SuGI Cluster at the University of Cologne".
Hybridization
Alongside the new generation of nucleic acid sequencing, microarrays are still a widely used tool in genomic research. Although in many studies the data volume remains manageable on standard desktop computers, the analysis requires appropriate scripts to make computations reproducible and to filter the essential information out of very large collections of genomic markers. The CCG has acquired expertise in processing microarray data from all platforms provided by the manufacturers Affymetrix and Illumina. The methods include the necessary background correction, normalization, and summarization of the markers interrogated on the respective arrays. For SNP arrays, genotype calling and the calculation of copy number variations with subsequent segmentation are also required. The identification of statistically significant copy number alterations is an additional step.
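As an illustration of the normalization step, the following minimal sketch implements quantile normalization, one widely used way of making intensity distributions comparable across arrays. It is an assumption that this particular method is relevant here; the text does not name the normalization procedures applied at the CCG. The function name and the example intensities are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equally long intensity lists (one list per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank, so all arrays end up with the same distribution.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all arrays must interrogate the same number of markers")

    # Mean of the i-th smallest value across all arrays
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)]

    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # marker indices, ascending by value
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three markers each
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization, every array contains the same set of values, and only the assignment of values to markers differs between arrays.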
Among the statistical methods applied in the evaluation are state-of-the-art quality assessment, descriptive statistics, multiple inference, statistical learning, and hierarchical clustering. Genomic annotation can be carried out by high-throughput database access. Finally, the essential information from an experiment can be extracted by filtering methods tailored to the individual study.
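When thousands of markers are tested at once, the resulting p-values need to be adjusted for multiple testing. The sketch below shows the Benjamini-Hochberg procedure for controlling the false discovery rate, chosen here only as a common example; the text does not state which correction is used at the CCG, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending by p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four markers
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Markers whose adjusted p-value falls below a chosen threshold, for example 0.05, would then be reported as significant at that false discovery rate.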
Contact
For detailed questions on data analysis for high-throughput genomic experiments at the CCG, please contact
Michael Nothnagel michael.nothnagel@uni-koeln.de
or
Susanne Motameny s.motameny@uni-koeln.de.