Big Data Analysis

The analysis of large amounts of data has gained significant importance in business and research in recent years. Technological advances make it possible to generate and collect ever-larger amounts of data in virtually every field.
 

In the life sciences in particular, the latest high-throughput methods, such as DNA sequencing, generate large amounts of biological data. It is now possible to sequence a human genome in less than a day. These amounts of data are a technological challenge on the one hand, but on the other hand an opportunity to obtain valuable information.
Traditional bioinformatics applications are often designed to process data locally, but local computers do not always have the required power. RISC Software GmbH therefore analyzed how high-performance computing, grid computing and cloud computing can be used for everything from efficient processing to the interactive visualization of sequence data.

Application: Dot plot
To evaluate the different technology approaches, the generation of dot plots through the efficient comparison of large DNA sequences was selected, since it is a particularly common task in bioinformatics.

To generate a dot plot, two DNA or protein sequences are compared with each other to find highly similar sequence segments. These matches are highlighted in a two-dimensional coordinate system in which each axis represents one of the two sequences. Once both sequences have been compared completely and all segments that reach a certain similarity have been marked, the resulting image is easy to interpret.
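The comparison described above can be illustrated with a minimal Python sketch. It is not the implementation used in the project: it assumes that "similar" means an exact match of fixed-length windows, and the names `dot_plot`, `render` and the `window` parameter are hypothetical. A real tool would typically use a similarity threshold instead of exact matching.

```python
from collections import defaultdict


def dot_plot(seq_x, seq_y, window=3):
    """Return all (i, j) where the windows starting at seq_x[i] and
    seq_y[j] are identical (assumption: exact match, fixed window)."""
    # Index every window of seq_y by its content.
    index = defaultdict(list)
    for j in range(len(seq_y) - window + 1):
        index[seq_y[j:j + window]].append(j)

    # Look up every window of seq_x in that index.
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in index.get(seq_x[i:i + window], ()):
            dots.add((i, j))
    return dots


def render(dots, width, height):
    """Draw the dots as text: x runs left to right, y top to bottom."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(width))
        for j in range(height)
    )


# Example usage with two short, made-up sequences.
example = dot_plot("ACGTACGT", "TACGTA", window=3)
print(render(example, width=6, height=4))
```

Diagonal runs of dots in the output correspond to stretches that the two sequences share.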
With scaling, a dot plot is well suited both to giving an overview of the comparison of arbitrarily large DNA sequences and to viewing a specific sequence section in detail.
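One way to picture this scaling is sketched below in Python. It assumes the matches are available as a set of (x, y) positions; the functions `scale_dots` and `zoom` are hypothetical names chosen for illustration, not part of the original application.

```python
def scale_dots(dots, factor):
    """Overview: merge dots into coarser cells of size factor x factor
    and count how many matches fall into each cell."""
    cells = {}
    for i, j in dots:
        cell = (i // factor, j // factor)
        cells[cell] = cells.get(cell, 0) + 1
    return cells


def zoom(dots, x_range, y_range):
    """Detail view: keep only dots inside the half-open ranges and
    shift them so the selected region starts at (0, 0)."""
    x0, x1 = x_range
    y0, y1 = y_range
    return {
        (i - x0, j - y0)
        for i, j in dots
        if x0 <= i < x1 and y0 <= j < y1
    }


# Example usage with a small, made-up set of match positions.
matches = {(0, 1), (1, 2), (2, 3), (3, 0), (4, 1), (5, 2)}
overview = scale_dots(matches, factor=3)
detail = zoom(matches, x_range=(3, 6), y_range=(0, 3))
```

The per-cell counts in the overview could, for instance, be mapped to pixel intensity, while the zoomed set can be drawn at full resolution.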

Distributed computing infrastructures
The sequence comparison was implemented with the open source framework Hadoop, which is designed for the distributed processing of large data sets and is therefore particularly well suited to the selected task.

For this purpose, so-called MapReduce jobs are automatically distributed across a cluster according to the master-worker principle. To make this distribution as effective as possible, the data to be processed is stored in a distributed manner in the Hadoop Distributed File System (HDFS).
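How a sequence comparison might be phrased as a MapReduce job can be sketched in plain Python. This is a local simulation of the map, shuffle and reduce phases, not Hadoop code and not the job design used in the project. It assumes exact matching of fixed-length windows; `WINDOW`, `split`, `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical names.

```python
from collections import defaultdict
from itertools import product

WINDOW = 3  # assumed window length for exact matching


def split(seq, chunk_size):
    """Cut a sequence into chunks that overlap by WINDOW - 1 characters,
    so every window belongs to exactly one chunk."""
    for start in range(0, len(seq), chunk_size):
        yield start, seq[start:start + chunk_size + WINDOW - 1]


def map_phase(seq_id, offset, chunk):
    """Mapper: emit (window, (sequence id, absolute position))."""
    for k in range(len(chunk) - WINDOW + 1):
        yield chunk[k:k + WINDOW], (seq_id, offset + k)


def shuffle(pairs):
    """Group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: pair every X position with every Y position of a window."""
    xs = [pos for sid, pos in values if sid == "X"]
    ys = [pos for sid, pos in values if sid == "Y"]
    yield from product(xs, ys)


def run_job(seq_x, seq_y, chunk_size):
    """Run all phases sequentially; a cluster would run them in parallel."""
    pairs = []
    for sid, seq in (("X", seq_x), ("Y", seq_y)):
        for offset, chunk in split(seq, chunk_size):
            pairs.extend(map_phase(sid, offset, chunk))
    dots = set()
    for key, values in shuffle(pairs).items():
        dots.update(reduce_phase(key, values))
    return dots
```

Because each chunk is mapped independently, the chunks could be processed by different workers; the result does not depend on the chunk size chosen.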

Since the results of sequence comparisons resemble dictionaries containing millions of entries, the open source database HBase was chosen for storing them. Built on HDFS, HBase makes it possible to store large unstructured or semi-structured data sets in the form of tables. HBase tables can be both created and processed efficiently on a Hadoop cluster.
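The table layout used in the project is not described here. As a purely hypothetical illustration, the Python sketch below shows one way match positions could be turned into row keys. HBase keeps rows sorted by row key in byte order, so zero-padded keys keep numeric and lexicographic order aligned and allow a region of the plot to be read as one contiguous key range. The sorted list only stands in for a table; no HBase client is used, and `row_key` and `scan` are assumed names.

```python
from bisect import bisect_left


def row_key(x, y, digits=10):
    """Encode a match position as a zero-padded, sortable key."""
    return f"{x:0{digits}d}:{y:0{digits}d}"


def scan(sorted_keys, start_x, stop_x, digits=10):
    """Return all keys with start_x <= x < stop_x from a sorted key list,
    mimicking a range scan over sorted row keys."""
    lo = bisect_left(sorted_keys, row_key(start_x, 0, digits))
    hi = bisect_left(sorted_keys, row_key(stop_x, 0, digits))
    return sorted_keys[lo:hi]


# Example usage with a small, made-up set of match positions.
matches = {(0, 1), (1, 2), (2, 3), (3, 0), (4, 1), (5, 2)}
table = sorted(row_key(x, y) for x, y in matches)
region = scan(table, start_x=3, stop_x=6)
```

Without the padding, a key for position 10 would sort before a key for position 9, and range scans would return the wrong rows.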

Visualization
Finally, an interactive visualization application was designed that presents the results of sequence comparisons as a dot plot. In the final application, a web service interface makes it possible to outsource memory- and processor-intensive operations, so that the application can also be used on devices with little memory and computing capacity, such as smartphones and tablets.

This application, which served for technology evaluation and for the integration of future services, was developed within the framework of the Austrian Grid Development Center, which is sponsored by the Federal Ministry for Science and Research and the State of Upper Austria.
 
The project results subsequently served as the basis for Mr. SymBioMath, a project sponsored by the European Commission that will focus on the adoption of cloud computing and high-performance computing in bioinformatics until 2017.
 
Photo: iStockphoto



COPYRIGHT © 2016 by
RISC Software GmbH
Softwarepark 35
4232 Hagenberg
Austria