{"id":3230,"date":"2023-05-24T08:14:43","date_gmt":"2023-05-24T06:14:43","guid":{"rendered":"https:\/\/www.risc-software.at\/fachbeitrag-methoden-und-werkzeuge-fuer-die-datenaufbereitung-im-big-data-bereich\/"},"modified":"2024-11-06T17:35:28","modified_gmt":"2024-11-06T16:35:28","slug":"technical-article-methods-and-tools-for-data-preparation-in-the-big-data-area","status":"publish","type":"publication","link":"https:\/\/www.risc-software.at\/en\/technicalarticles\/technical-article-methods-and-tools-for-data-preparation-in-the-big-data-area\/","title":{"rendered":"Methods and tools for data preparation in the big data area"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">by DI Paul Heinzlreiter<\/h3>\n\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<p class=\"wp-block-paragraph\">In recent years, the role of big data in numerous economic sectors such as the manufacturing industry, logistics or trade has become increasingly important. Using a wide variety of sensor systems, large amounts of data are collected that can subsequently be used to optimize machines or business processes. Methods from the fields of artificial intelligence, machine learning or statistics are often used here.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, all these methods require a larger <a href=\"http:\/\/ris.w4.at\/en\/technical-article-data-quality\" target=\"_blank\" rel=\"noreferrer noopener\">quantity of high-quality and valid data<\/a> as a basis. In this context, <a href=\"http:\/\/ris.w4.at\/en\/technical-article-data-engineering-the-solid-basis-for-effective-data-utilization\" target=\"_blank\" rel=\"noreferrer noopener\">data engineering<\/a> is used to collect the raw data, cleanse it and merge it into an integrated database. 
While a <a href=\"http:\/\/ris.w4.at\/en\/technical-article-data-engineering-the-solid-basis-for-effective-data-utilization\" target=\"_blank\" rel=\"noreferrer noopener\">previous article<\/a> (the magazine INSIGHT #1) highlighted the general role and goals of data engineering, this article will focus on methods and proven tools as well as provide an exemplary insight into the algorithmic implementation of data engineering tasks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><br><br><\/p>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div class=\"wp-block-media-text has-media-on-the-right is-stacked-on-mobile is-vertically-aligned-center\"><div class=\"wp-block-media-text__content\">\n<p class=\"wp-block-paragraph\"><strong>Table of contents<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data stream and batch processing\n<ul class=\"wp-block-list\">\n<li>Data stream processing: Apache NiFi<\/li>\n\n\n\n<li>Batch processing: Apache Hadoop<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Hadoop File System (HDFS)<\/li>\n\n\n\n<li>Map-Reduce-Framework (YARN)<\/li>\n\n\n\n<li>Batch and data stream processing: Apache Spark<\/li>\n\n\n\n<li>Application example: Processing of industrial sensor and log data<\/li>\n\n\n\n<li>Application example: Cleaning sensor data and storing it in an SQL database<\/li>\n\n\n\n<li>Choosing the right tools for big data engineering<\/li>\n\n\n\n<li>Sources<\/li>\n\n\n\n<li>Author<\/li>\n<\/ul>\n<\/div><figure class=\"wp-block-media-text__media\"><img decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-157192883-1024x768.jpg\" alt=\"Data\" class=\"wp-image-3126 size-full\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-157192883-1024x768.jpg 1024w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-157192883-300x225.jpg 300w, 
https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-157192883-768x576.jpg 768w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-157192883-1536x1152.jpg 1536w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-157192883.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<h3 class=\"wp-block-heading\">Data stream and batch processing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If, for example, industrial sensor data is collected over time, large amounts of data do not accumulate per unit of time (e.g. every few seconds), but over months and years the stored data volumes often increase into the terabyte range. 
Data of this order of magnitude can essentially be processed using two different paradigms, illustrated here by converting the data type of a table column:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Batch processing:<\/strong><br>Here, all rows in a table are processed in parallel to convert one column.<\/li>\n\n\n\n<li><strong>Data stream processing (data streaming):<\/strong><br>Here, the rows of the table are read sequentially and the column conversion is performed per row.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The main difference between the two approaches is that in data streaming, the necessary data transformations &#8211; such as converting data fields to other data types &#8211; are performed directly on each record as it arrives, whereas in batch processing, the data is first collected and the transformations are then applied to the data as a whole.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Which approach is chosen depends on the data transformation requirements:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If the transformation can be performed locally on the data currently being queried or received, the data streaming approach is often preferable, since it is usually a simple, local operation that can also be processed more quickly because of the smaller input. A typical application of data streaming is the direct conversion of sensor readings that arrive spread over time, since each reading can then be converted and stored individually.<\/li>\n\n\n\n<li>However, if the data transformation requires input from the entire body of data already stored, or if all data is already available, a batch approach is more suitable. 
Parallel processing of the data is also often easier to implement here, as this is directly supported by frameworks such as Apache Hadoop (through the Map-Reduce approach) or Apache Spark.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">In general, the data obtained should be stored once in raw format in order to not lose any data that could still be needed as a basis for future analyses. Further processing of data stored in this way can then be done by batch processing or data streaming. In the second case, a data stream is generated again from the stored data by continuous reading. Conversely, a data stream can be stored continuously and thus serve as a starting point for batch processing.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-style-rounded\"><img decoding=\"async\" width=\"377\" height=\"402\" sizes=\"(max-width: 377px) 100vw, 377px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-batch-stream-01_EN.png\" alt=\"data engineering batch stream\" class=\"wp-image-3112\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-batch-stream-01_EN.png 377w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-batch-stream-01_EN-281x300.png 281w\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<h3 class=\"wp-block-heading\">Data stream processing: Apache NiFi<\/h3>\n\n\n\n<p 
class=\"wp-block-paragraph\"><a href=\"https:\/\/nifi.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">NiFi<\/a> represents a tool for data stream processing, which makes it possible to connect data transformations in a graphical, web-based user interface to form a continuous data pipeline through which the source data flows and is transformed step by step. The strengths of Apache NiFi lie in the wide range of modules already available, which enable, for example, the reading and storing of numerous data formats. Due to the open source character of NiFi and the object-oriented structure of its modules, it is easy to develop your own modules and integrate them into data pipelines. Furthermore, NiFi also addresses issues such as the automated handling of different processing speeds of the modules.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-style-rounded\"><img decoding=\"async\" width=\"376\" height=\"158\" sizes=\"(max-width: 376px) 100vw, 376px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-apache-nifi-logo-01.png\" alt=\"apache nifi\" class=\"wp-image-3114\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-apache-nifi-logo-01.png 376w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-apache-nifi-logo-01-300x126.png 300w\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<h3 
class=\"wp-block-heading\">Batch processing: Apache Hadoop<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/hadoop.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop<\/a> is a software framework based on the fundamental principle of parallel data processing in a cluster environment. Within the distributed processing, each cluster computer takes over the processing of the data locally available there, which above all saves communication effort during the calculations. Hadoop distinguishes here between controller and responder services in the cluster, whereby the responder services take over the processing of the locally available data, while the controller services are responsible for the coordination of the cluster. Parts of the algorithms implemented in Hadoop were developed by Google and the concepts published in research papers, such as the <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1165389.945450\" target=\"_blank\" rel=\"noreferrer noopener\">Google File System<\/a>, <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1327452.1327492\" target=\"_blank\" rel=\"noreferrer noopener\">Map-Reduce<\/a> and <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1365815.1365816\" target=\"_blank\" rel=\"noreferrer noopener\">Google Bigtable<\/a>. At Google, these solutions are used to operate the global search infrastructure, while the Hadoop project is an open-source implementation of these concepts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At its core, a Hadoop system consists of a usually Linux-based cluster running the Hadoop File System (HDFS) and YARN as an implementation of the Map-Reduce algorithm. 
A Hadoop cluster with the HDFS and YARN services provides a solid technological basis for a wide variety of Big Data services such as BigTable databases like HBase &#8211; see below &#8211; or <a href=\"http:\/\/ris.w4.at\/en\/technical-article-graphdatabases-2\" target=\"_blank\" rel=\"noreferrer noopener\">graph databases<\/a> like JanusGraph, for example.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-style-rounded\"><img decoding=\"async\" width=\"800\" height=\"240\" sizes=\"(max-width: 800px) 100vw, 800px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Hadoop_logo.png\" alt=\"apache hadoop\" class=\"wp-image-3116\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Hadoop_logo.png 800w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Hadoop_logo-300x90.png 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Hadoop_logo-768x230.png 768w\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading\">Hadoop File System (HDFS)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">HDFS is an open source implementation of the Google Filesystem. Like other Hadoop subsystems, it consists of controller and responder components, in the case of HDFS Namenodes (controllers) and Datanodes (responders). While a Namenode stores where on the cluster the data for individual files is stored, the Datanodes handle the storage of the data blocks. Basically, HDFS is optimized for large files, the block size for storage is usually 128 megabytes. 
A file can consist of many individual blocks, and the data blocks are replicated across multiple cluster nodes for redundancy and performance reasons. The access semantics of HDFS differ from the usual POSIX semantics: data can only be appended to HDFS files, and existing content cannot be edited in place. To create a new version of a file, the file must be replaced. This can be done efficiently even for large files using the Map-Reduce algorithm described below.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-rounded\"><img decoding=\"async\" width=\"945\" height=\"495\" sizes=\"(max-width: 945px) 100vw, 945px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-hdfs-01.png\" alt=\"hadoop file system\" class=\"wp-image-3118\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-hdfs-01.png 945w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-hdfs-01-300x157.png 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-hdfs-01-768x402.png 768w\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In a Hadoop system, text files stored in CSV format, for example, can be processed with Map-Reduce jobs; the Hadoop framework automatically distributes the sub-jobs across the cluster according to the distribution of the HDFS file blocks. In addition to plain text files, structured binary formats such as <a href=\"https:\/\/orc.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">ORC<\/a>, <a href=\"https:\/\/parquet.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Parquet<\/a> or <a href=\"https:\/\/avro.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Avro<\/a> can also be processed directly by Hadoop. 
In addition, specific splitter classes for Map-Reduce can be implemented for new formats.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Furthermore, a Map-Reduce job can also consist of a map stage alone, for example to add new columns to a CSV file.<\/p>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading\">Map-Reduce Framework (YARN)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Based on the data distribution in HDFS shown above, a data-parallel batch job can now be executed by the YARN service, with each responder node processing the locally available data blocks. Conceptually, the execution follows the Map-Reduce algorithm. A classic application example for the Map-Reduce algorithm is counting words in text documents. Here, the Map step emits a set of pairs of the form (word, count) for each line. In the Shuffle step, these pairs are grouped by word, since the word serves as the key. In the final Reduce step, the counts for each word are summed. An example run could proceed as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The input text is divided into individual text lines. (Splitting)<\/li>\n\n\n\n<li>The Map step, which is executed in parallel for each line individually, creates a pair of the word and the number 1 for each word in the line. (Mapping)<\/li>\n\n\n\n<li>The pairs are sorted by word and combined into one list per word. (Shuffling)<\/li>\n\n\n\n<li>For each word, its total frequency in the text is determined by adding up the numbers in its list. (Reducing)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">While the Map step and the Reduce step must each be implemented by the developer, the global Shuffle step is handled automatically by the Map-Reduce framework. 
In practice, the implementation of the Map and Reduce steps requires, for example, the object-oriented overwriting of one Map and one Reduce method each, whose interfaces are already specified. This allows the focus to be placed on the transformation of a pair of values, while the framework subsequently takes care of the scaled execution on the cluster.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-style-rounded\"><img decoding=\"async\" width=\"1024\" height=\"365\" sizes=\"(max-width: 1024px) 100vw, 1024px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-map-reduce-01-1024x365.png\" alt=\"map-reduce framework\" class=\"wp-image-3120\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-map-reduce-01-1024x365.png 1024w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-map-reduce-01-300x107.png 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-map-reduce-01-768x273.png 768w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-map-reduce-01-1536x547.png 1536w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-map-reduce-01-2048x729.png 2048w\" \/><\/figure>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<h3 class=\"wp-block-heading\">Batch and data stream processing: Apache Spark<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" 
rel=\"noreferrer noopener\">Spark<\/a> is a flexible data processing layer that can be built on top of various infrastructures, such as Hadoop, and can be used for various data engineering and data science tasks. As a general data processing framework, Spark can perform data preprocessing tasks as well as machine learning tasks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For example, Apache Spark can be installed on an existing Hadoop cluster, where it directly accesses the stored data and processes it in parallel. One approach to this is the Map-Reduce algorithm mentioned above, although Spark can also apply other flexible methods such as data filtering. Spark stores intermediate results as resilient distributed datasets (RDDs) in main memory, which avoids the slow, repeated disk accesses that are common with classic databases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Key features of Spark include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Parallel batch processing, for example using the Map-Reduce algorithm.<\/li>\n\n\n\n<li>Support for SQL queries on data stored in arbitrary locations (e.g. in HDFS). 
To do this, you only need to interactively create a table that defines the data schema to be used and references the underlying data.<\/li>\n\n\n\n<li>Based on sequential processing of multiple RDDs, data stream processing can be performed.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Just like an underlying Hadoop cluster, a Spark installation can be made fit for processing larger amounts of data by a simple hardware upgrade.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-style-rounded\"><img decoding=\"async\" width=\"512\" height=\"266\" sizes=\"(max-width: 512px) 100vw, 512px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Spark_logo.png\" alt=\"apache spark\" class=\"wp-image-3122\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Spark_logo.png 512w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Apache_Spark_logo-300x156.png 300w\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading\">Application example: Processing of industrial sensor and log data<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">As part of the VPA4.0 research project, a data pipeline was set up for the pre-processing of production sensor data. This represents a good example of linking streaming and batch processing. Apache NiFi was used as a streaming solution to transmit the data directly from the project partner over the Internet in encrypted form before storing it locally on the Hadoop cluster. 
Further data processing was then performed in parallel on the Hadoop cluster using Spark and included the following steps:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Unpacking the received data archives and removing unneeded files<\/li>\n\n\n\n<li>Preparing the data and storing it as CSV files in HDFS<\/li>\n\n\n\n<li>Creating virtual tables based on the CSV files to enable further processing with SQL<\/li>\n\n\n\n<li>Filtering the data and storing it in the optimized Parquet format for interactive SQL queries<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading\">Application example: Cleaning sensor data and storing it in an SQL database<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This example uses sensor data collected on a heat engine. In the following excerpt, the column power_dynamo contains negative values that were caused by a measurement inaccuracy. 
Rows with such values should now be filtered out as erroneous and the cleaned data stored in a database.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-rounded\"><img decoding=\"async\" width=\"852\" height=\"272\" sizes=\"(max-width: 852px) 100vw, 852px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.34.jpeg\" alt=\"\" class=\"wp-image-3128\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.34.jpeg 852w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.34-300x96.jpeg 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.34-768x245.jpeg 768w\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Implementation in Spark:<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">In Spark, the data can be read in as a first step, converted to the correct data types and stored in a correctly typed dataframe. This represents a Spark standard data structure in which data is held in main memory. 
This can be implemented with a command in an interactive pyspark shell, which uses Python as the implementation language:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-style-rounded\"><img decoding=\"async\" width=\"1024\" height=\"714\" sizes=\"(max-width: 1024px) 100vw, 1024px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.58-1024x714.jpeg\" alt=\"\" class=\"wp-image-3130\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.58-1024x714.jpeg 1024w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.58-300x209.jpeg 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.58-768x535.jpeg 768w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.58-369x257.jpeg 369w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.53.58.jpeg 1076w\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">With the following command the data can be stored directly in the SQL table EngineData:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-rounded\"><img decoding=\"async\" width=\"852\" height=\"199\" sizes=\"(max-width: 852px) 100vw, 852px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.14.jpeg\" alt=\"\" class=\"wp-image-3132\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.14.jpeg 852w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.14-300x70.jpeg 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.14-768x179.jpeg 768w\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To filter out rows with incorrect values, queries can now be used based on the SQL table.<br>In this case negative 
power_dynamo values are filtered out:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-rounded\"><img decoding=\"async\" width=\"550\" height=\"21\" sizes=\"(max-width: 550px) 100vw, 550px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.26.jpeg\" alt=\"\" class=\"wp-image-3134\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.26.jpeg 550w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.26-300x11.jpeg 300w\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">After cleaning, a dataframe can be saved again as a CSV file. Including the repartition function ensures that the result is written to a single file, even if the dataframe was previously split into several partitions, which can happen as a result of parallel processing steps.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-rounded\"><img decoding=\"async\" width=\"776\" height=\"23\" sizes=\"(max-width: 776px) 100vw, 776px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.39.jpeg\" alt=\"\" class=\"wp-image-3136\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.39.jpeg 776w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.39-300x9.jpeg 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.39-768x23.jpeg 768w\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">As an alternative, the dataframe can also be stored in a database via the JDBC API.<br>The following command saves the data in a SQLite database, for example:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized is-style-rounded\"><img decoding=\"async\" sizes=\"(max-width: 639px) 100vw, 639px\" 
src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.49.jpeg\" alt=\"\" class=\"wp-image-3138\" width=\"639\" height=\"22\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.49.jpeg 639w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/WhatsApp-Image-2023-06-20-at-14.54.49-300x10.jpeg 300w\" \/><\/figure>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<h4 class=\"wp-block-heading\">Implementation in NiFi:<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The example shown here again shows the reading of the CSV file with the heat engine data and its storage in a SQLite database. Here the CSV file is read in using the GetFile processor and converted into NiFi flowfiles. These are fed into a PutDatabaseRecord processor, which is configured to parse the CSV file correctly and access the database. Just like connecting the individual modules, their configuration is done interactively in the NiFi web interface.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The final PutFile processor is used to catch and store error conditions, such as incorrectly formatted lines in the input file. 
This allows error conditions to be easily traced in the saved text file.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-style-rounded\"><img decoding=\"async\" width=\"803\" height=\"859\" sizes=\"(max-width: 803px) 100vw, 803px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Screenshot_NiFi-Flowfiles.png\" alt=\"NiFi Flowfile\" class=\"wp-image-3124\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Screenshot_NiFi-Flowfiles.png 803w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Screenshot_NiFi-Flowfiles-280x300.png 280w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/2022-06-13-fachbeitrag-data-engineering-3-Screenshot_NiFi-Flowfiles-768x822.png 768w\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<h3 class=\"wp-block-heading\">Choosing the right tools for big data engineering<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">As can be seen from the application examples shown above (data transfer from a CSV file to an SQL database), different paths often lead to the same goal in the field of data engineering. 
Which methods to use often depends on the customer's specific requirements and system environment:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For example, if a Hadoop cluster is already in use or planned, it can be integrated from the outset when designing a solution.<\/li>\n\n\n\n<li>Public cloud offerings such as Amazon Web Services (AWS), in turn, provide alternatives to the open source solutions described above. They primarily simplify operation, but can also lead to vendor lock-in.<\/li>\n\n\n\n<li>Other criteria for a technology decision are scalability requirements and the planned integration of additional tools.<\/li>\n\n\n\n<li>Last but not least, open source solutions often offer cost advantages, as there are no licensing costs even for highly scalable solutions.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><em>With more than ten years of expertise in data engineering, RISC Software GmbH is a reliable consulting and implementation partner, regardless of the area of application.<\/em><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-large is-style-rounded\"><img decoding=\"async\" width=\"1024\" height=\"576\" sizes=\"(max-width: 1024px) 100vw, 1024px\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1-1024x576.jpg\" alt=\"Data analysis\" class=\"wp-image-1970\" srcset=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1-1024x576.jpg 1024w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1-300x169.jpg 300w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1-768x432.jpg 768w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1-1536x864.jpg 1536w, https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1.jpg 
1920w\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading\">Sources<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Bill Chambers, Matei Zaharia: Spark: The Definitive Guide, O\u2019Reilly Media, Inc., February 2018, ISBN: 9781491912218<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Tom White: Hadoop: The Definitive Guide, O\u2019Reilly Media, Inc., June 2009, ISBN: 9780596521974<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications, April 2015, ISBN: 9781617290343<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Martin Kleppmann: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O\u2019Reilly Media, Inc., March 2017, ISBN: 9781491903063<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">V. 
Naresh Kumar, Prashant Shindgikar: Modern Big Data Processing with Hadoop: Expert techniques for architecting end-to-end Big Data solutions to get valuable insights, Packt Publishing, March 2018, ISBN 978-1787122765<\/p>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignfull is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-7387b849 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<h3 class=\"wp-block-heading\">Author<\/h3>\n\n\n<div class=\"contact-person\">\n      <picture>\n      \n      \n      \n      \n      <img decoding=\"async\" data-aos=\"fade-zoom-in\"\n           data-aos-offset=\"0\" class=\"w-full\" width=\"212\" height=\"293\"\n           src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/pheinzlr1-removebg-preview.png\"\n           alt=\"\">\n    <\/picture>\n    \n\n<h5 class=\"wp-block-heading\">DI Paul Heinzlreiter<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">Senior Data Engineer<\/p>\n\n  <\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n<div class=\"wp-block-group-container alignfull \">\n<div class=\"wp-block-group alignwide is-layout-constrained wp-block-group-is-layout-constrained\"><div class=\"posts-slider-block\" data-aos=\"fade-up\" data-aos-offset=\"0\" data-aos-anchor-placement=\"top-bottom\">\n        <section class=\"splide posts-slider\" aria-label=\"Gallery Slides\">\n            <div class=\"splide__arrows\">\n                <button class=\"splide__arrow splide__arrow--prev\">\n                    <span class=\"sr-only\">Previous<\/span>\n                    <img decoding=\"async\" loading=\"lazy\" width=\"25\" height=\"21\" src=\"https:\/\/www.risc-software.at\/app\/themes\/risc-theme-main\/public\/images\/icon-arrow.35d2ec.svg\"\n                         alt=\"Previous\">\n                <\/button>\n                <button class=\"splide__arrow splide__arrow--next\">\n                    <span class=\"sr-only\">Next<\/span>\n                    <img decoding=\"async\" loading=\"lazy\" width=\"25\" height=\"21\" src=\"https:\/\/www.risc-software.at\/app\/themes\/risc-theme-main\/public\/images\/icon-arrow.35d2ec.svg\"\n                         alt=\"Next\">\n                <\/button>\n   
         <\/div>\n            <div class=\"inner\">\n                <div class=\"splide__track\">\n                    <div class=\"splide__list\">\n\n                                                    <a href=\"https:\/\/www.risc-software.at\/en\/technicalarticles\/technical-article-graphdatabases-1\/\" class=\"splide__slide blog-post-teaser mb-1 lg:mb-3\">\n                                <div class=\"blog-image\">\n                                                                                                                                <picture>\n                                                                                        <img decoding=\"async\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1194430859-1-360x214.jpg\"\n                                                 alt=\"Data Understanders: Leveraging enterprise data through intelligent Graph Databases\">\n                                        <\/picture>\n                                                                    <\/div>\n                                <div class=\"blog-content px-2 py-3 xl:px-4 xl:py-5\">\n                                    <h3>Data Understanders: Leveraging enterprise data through intelligent Graph Databases<\/h3>\n                                    <div class=\"blog-post-excerpt mt-2\">\n                                        Graph databases enable intuitive mapping of many real-world scenarios such as industrial manufacturing, traffic data analysis or IT infrastructure monitoring. 
This allows data not only to be stored more efficiently, but also to be used much more effectively.\n                                    <\/div>\n                                    <span class=\"inline-block mt-2 more\">learn more <span class=\"ml-1 icon-more\"><\/span><\/span>\n\n                                <\/div>\n                            <\/a>\n                                                    <a href=\"https:\/\/www.risc-software.at\/en\/technicalarticles\/technical-article-data-engineering-the-solid-basis-for-effective-data-utilization\/\" class=\"splide__slide blog-post-teaser mb-1 lg:mb-3\">\n                                <div class=\"blog-image\">\n                                                                                                                                <picture>\n                                                                                        <img decoding=\"async\" src=\"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-966899060-1-360x214.jpg\"\n                                                 alt=\"Data Engineering \u2013 the solid basis for effective data utilization\">\n                                        <\/picture>\n                                                                    <\/div>\n                                <div class=\"blog-content px-2 py-3 xl:px-4 xl:py-5\">\n                                    <h3>Data Engineering \u2013 the solid basis for effective data utilization<\/h3>\n                                    <div class=\"blog-post-excerpt mt-2\">\n                                        Data engineering integrates data from a wide variety of sources and makes them effectively usable. 
This makes it a prerequisite for effective data science, machine learning and artificial intelligence, especially in the big data area.\n                                    <\/div>\n                                    <span class=\"inline-block mt-2 more\">learn more <span class=\"ml-1 icon-more\"><\/span><\/span>\n\n                                <\/div>\n                            <\/a>\n                                            <\/div>\n                <\/div>\n            <\/div>\n        <\/section>\n    <\/div>\n<\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>The role of big data has grown considerably in importance in numerous economic sectors. Large amounts of data are collected that can be used for optimization. Methods from the fields of artificial intelligence, machine learning or statistics are often used here.<\/p>\n","protected":false},"featured_media":1971,"template":"","publication-category":[50,72],"class_list":["post-3230","publication","type-publication","status-publish","has-post-thumbnail","hentry","publication-category-data-science-and-a-i","publication-category-industrie-4-0"],"acf":[],"portrait_thumb_url":"https:\/\/www.risc-software.at\/app\/uploads\/2023\/06\/iStock-1216520813-1-360x214.jpg","_links":{"self":[{"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/publication\/3230","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/publication"}],"about":[{"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/types\/publication"}],"version-history":[{"count":18,"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/publication\/3230\/revisions"}],"predecessor-version":[{"id":5006,"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/publication\/3230\/revisions\/5006"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/media\/1971"}],"wp
:attachment":[{"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/media?parent=3230"}],"wp:term":[{"taxonomy":"publication-category","embeddable":true,"href":"https:\/\/www.risc-software.at\/en\/wp-json\/wp\/v2\/publication-category?post=3230"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}