Big Data Analysis
The analysis of large amounts of data has gained significant importance in business and research in recent years. Technological advances make it possible to generate and collect ever-larger amounts of data in all areas.
Particularly in the life sciences, the latest high-throughput methods, such as DNA sequencing, generate large amounts of biological data. It is now possible to sequence a human genome in less than a day. These large amounts of data pose a technological challenge on the one hand, but on the other hand offer an opportunity to obtain valuable information.
Traditional bioinformatics applications are often designed for local data processing, but local computers do not always have the required power. RISC Software GmbH has therefore analyzed the possibilities of using high-performance computing, grid computing and cloud computing for the whole chain, from efficient processing to the interactive visualization of sequence data.
To evaluate the different technological approaches, the generation of dot plots by efficiently comparing large DNA sequences was selected, since it is a particularly common application in bioinformatics.
To generate a dot plot, two DNA or protein sequences are compared with each other to find highly similar sequence segments. These matches are highlighted in a two-dimensional coordinate system in which each axis carries one of the two sequences being compared. Once both sequences have been compared completely and all segments that reach a certain similarity have been marked, the resulting image is easy to interpret visually.
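The comparison described above can be illustrated with a minimal sketch. It assumes a simple sliding-window comparison in which a position is marked when enough characters in the two windows agree. The names `dot_plot` and `render` and the parameters `window` and `min_matches` are hypothetical and are not taken from the RISC implementation, whose similarity measure is not described here.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) positions where the window starting at
    seq_a[i] and the window starting at seq_b[j] agree in at least
    min_matches of their `window` characters."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.add((i, j))
    return hits


def render(hits, len_a, len_b, window=3):
    """Draw the hits as text: one row per window start in the first
    sequence, one column per window start in the second sequence."""
    rows = []
    for i in range(len_a - window + 1):
        rows.append(
            "".join(
                "*" if (i, j) in hits else "."
                for j in range(len_b - window + 1)
            )
        )
    return "\n".join(rows)


if __name__ == "__main__":
    a = b = "ACGTTGCA"
    print(render(dot_plot(a, b), len(a), len(b)))
```

Comparing a sequence with itself produces the main diagonal. This brute-force version takes time proportional to the product of the two sequence lengths, which is one reason why large sequences call for distributed processing.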
Thanks to scaling, a dot plot is well suited both for getting an overview of a comparison of arbitrarily large DNA sequences and for viewing a specific sequence section in detail.
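One way to picture the two uses of scaling is the following sketch. It assumes that the matches are available as a set of `(i, j)` positions, one index per sequence. The functions `scale_hits` and `zoom` are hypothetical helpers for illustration, not part of the described application.

```python
from collections import Counter


def scale_hits(hits, cell_size):
    """Overview: aggregate (i, j) hits into square cells of
    cell_size x cell_size positions and count the hits per cell."""
    if cell_size < 1:
        raise ValueError("cell_size must be at least 1")
    return Counter((i // cell_size, j // cell_size) for i, j in hits)


def zoom(hits, i_range, j_range):
    """Detail view: keep only hits inside the half-open ranges
    [i_start, i_stop) and [j_start, j_stop)."""
    i_start, i_stop = i_range
    j_start, j_stop = j_range
    return {
        (i, j)
        for i, j in hits
        if i_start <= i < i_stop and j_start <= j < j_stop
    }


if __name__ == "__main__":
    hits = {(0, 0), (1, 1), (2, 2), (5, 1)}
    print(scale_hits(hits, cell_size=2))
    print(zoom(hits, (0, 2), (0, 2)))
```

A larger `cell_size` gives a coarser overview with fewer cells to draw, while `zoom` restricts the plot to a section that can then be shown at full resolution.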
Distributed computing infrastructures
The sequence comparison was implemented using the open source framework Hadoop, which is designed for the distributed processing of large data sets and is therefore particularly well suited to the selected task.
For this purpose, so-called MapReduce jobs are automatically distributed across a cluster according to the master-worker principle. To make this distribution as effective as possible, the data to be processed is stored in a distributed manner in the Hadoop Distributed File System (HDFS).
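The MapReduce idea can be sketched in a single process, without Hadoop. The following example is an illustration under assumptions, not the project's actual job: it assumes exact matching of fixed-length windows, and all names (`WINDOW`, `split`, `map_task`, `run_job`) are hypothetical. Each pair of sequence chunks is an independent task that a worker could process; the emitted key-value pairs are then grouped by key and reduced.

```python
from collections import defaultdict
from itertools import product

WINDOW = 3  # hypothetical window length used for exact matching


def split(seq, chunk_len, window=WINDOW):
    """Yield (offset, chunk) pairs. Chunks overlap by window - 1
    characters so that every window start belongs to exactly one chunk
    and no window crossing a chunk boundary is lost."""
    for start in range(0, max(len(seq) - window + 1, 0), chunk_len):
        yield start, seq[start:start + chunk_len + window - 1]


def map_task(task):
    """Map step: compare one chunk pair and emit (key, value) pairs.
    The key is the offset of the first chunk, the value is a hit in
    global sequence coordinates."""
    (off_a, chunk_a), (off_b, chunk_b) = task
    for i in range(len(chunk_a) - WINDOW + 1):
        for j in range(len(chunk_b) - WINDOW + 1):
            if chunk_a[i:i + WINDOW] == chunk_b[j:j + WINDOW]:
                yield off_a, (off_a + i, off_b + j)


def run_job(seq_a, seq_b, chunk_len):
    """Run all map tasks, group the output by key (shuffle) and sort
    the hits of each key (reduce)."""
    tasks = product(split(seq_a, chunk_len), split(seq_b, chunk_len))
    grouped = defaultdict(list)
    for task in tasks:
        for key, value in map_task(task):
            grouped[key].append(value)
    return {key: sorted(values) for key, values in grouped.items()}


if __name__ == "__main__":
    for key, hits in sorted(run_job("ACGTACGT", "ACGTACGT", 3).items()):
        print(key, hits)
```

Because the map tasks do not depend on each other, a framework such as Hadoop can hand them to different workers; here they simply run one after another in a loop.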
Since the results of sequence comparisons are comparable to dictionaries containing millions of entries, the open source database HBase was chosen for storing the results. Built on top of HDFS, HBase can store large unstructured or semi-structured data sets in the form of tables. HBase tables can be both created and processed efficiently on a Hadoop cluster.
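The dictionary analogy can be made concrete with a small in-memory stand-in. HBase keeps rows sorted by row key, which makes range scans cheap; the sketch below imitates only that property and does not use the HBase API. The key layout in `row_key`, the class `SortedTable` and the column name `score` are assumptions for illustration; the schema actually used in the project is not described here.

```python
import bisect


def row_key(i, j, width=10):
    """Build a zero-padded key so that lexicographic order of the keys
    matches numeric order of the (i, j) positions."""
    return f"{i:0{width}d}:{j:0{width}d}"


class SortedTable:
    """In-memory stand-in for a table whose rows are sorted by key."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> dictionary of column values

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = columns

    def scan(self, start_key, stop_key):
        """Yield all rows with start_key <= key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


if __name__ == "__main__":
    table = SortedTable()
    for i, j in [(5, 1), (0, 0), (12, 3)]:
        table.put(row_key(i, j), {"score": 3})
    for key, columns in table.scan(row_key(0, 0), row_key(10, 0)):
        print(key, columns)
```

With such a key layout, all matches that fall into a given section of the first sequence sit next to each other, so a viewer could fetch one section of the dot plot with a single range scan.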
Finally, an interactive visualization application was designed that presents the results of the sequence comparisons as a dot plot. In the final application, a web service interface allows memory- and processor-intensive operations to be outsourced, so that the application can also be used on devices with little memory and computing capacity, such as smartphones and tablets.