Mastering the (industrial) Data
How improved manufacturing is created from industrial and production process data.
by DI Paul Heinzlreiter and Dr. Roxana Holom
It was known long before buzzwords such as “Industry 4.0” and “Internet of Things” that essential data can be obtained from manufacturing processes: sensors that record movement, heat or pressure, machines that report damage, or fully automated warehouses that document utilisation and throughput times with their scanners. The first step towards a smart factory, which in the best case optimises itself and saves resources, is to collect, record and store this data. In the second step, the data must be linked. Because data from different sources comes in different formats (some is continuous, some only records a state at a certain point in time), and because of the sheer volume of data, this linking is not trivial. It requires a framework that handles the processing and linking automatically, and it requires specialists who build such a framework and combine data know-how with domain knowledge: data scientists.
In the BOOST 4.0 project, data scientists from RISC Software GmbH developed such a framework to map a smart factory. Together with industry partners, solutions were developed based on practical examples. Finally, the data must be analysed, interpreted and used to derive actions for the future. This requires close cooperation between domain experts and data scientists: the experts contribute their knowledge and experience right from the start, both when the data is linked and when it is interpreted.
Table of contents
- Merging Data: Big Data Architecture using the RISC Software Data Analytics Framework
- Prepare data: Spark
- Analysing data: Comparison of high-frequency time series
- Authors
Merging Data: Big Data Architecture using the RISC Software Data Analytics Framework
The Big Data architecture of the RISC Data Analytics Framework combines Big Data technologies with semantic approaches in order to process and store large amounts of data from different sources. The data itself and the relationships between the data sets are stored separately, which keeps the system generic, reusable and yet responsive. In this way, both the obligatory check of the data for correctness (data validation) and continuous analyses are ensured. Apache Spark, a parallel in-memory data processing framework, transfers the data from its raw format into standardised storage, which data scientists then use as the basis for further analyses.
The data is processed in a cluster. Data from the different sources is transferred into the cluster's data storage. This transfer into a central store includes interpreting and transforming the data to bring it into a uniform format. At the syntax level, the structure of this format is defined by reading in and processing the associated metadata. In this way, the data is validated, cleaned, transformed and stored in Spark tables. With Spark, initial data exploration can be performed directly on the cluster to gain first insights into the data. As a further data preparation step, the data can also be aggregated, so that complex analysis tasks can be performed locally outside the Big Data cluster. An example of this is the comparison of high-frequency time series described below.
To represent the relationships between the different data sources (i.e. Linked Data) and to achieve a more powerful metadata representation, the data scientists of RISC Software GmbH rely on the schema language SALAD (Semantic Annotations for Linked Avro Data), which enriches Apache Avro as a structural data storage format with semantic meta-information. This in turn enables automated interpretation of the data. Furthermore, a semantic set of rules supports the pre-processing, structural validation and verification of linked documents. This closes the gap between record-oriented data modelling (supported by Apache Avro) and the Semantic Web (added through SALAD).
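The metadata-driven validation and transformation step can be illustrated with a small sketch in plain Python. It is not the framework's code: the column names, the target types and the rule of dropping invalid records are assumptions made for this example, and in the framework itself this work is done with Spark on the cluster.

```python
from datetime import datetime

# Hypothetical metadata for one data source: column name -> converter.
# In the framework this information is read from the source's metadata;
# the names and types here are invented for illustration.
METADATA = {
    "machine_id": str,
    "timestamp": datetime.fromisoformat,
    "pressure_bar": float,
}


def to_uniform_record(raw, metadata):
    """Validate one raw record and convert it to the uniform format.

    Returns the converted record, or None if a field is missing or cannot
    be converted (the record is then dropped as part of cleaning).
    """
    record = {}
    for name, convert in metadata.items():
        value = raw.get(name)
        if value is None or str(value).strip() == "":
            return None  # validation: required field is missing
        try:
            record[name] = convert(str(value).strip())  # transformation
        except ValueError:
            return None  # validation: value does not match the type
    return record


def prepare(raw_records, metadata):
    """Keep only the records that pass validation, in uniform format."""
    converted = (to_uniform_record(r, metadata) for r in raw_records)
    return [r for r in converted if r is not None]


raw_records = [
    {"machine_id": "M1", "timestamp": "2020-05-04T08:00:00", "pressure_bar": " 4.2 "},
    {"machine_id": "M1", "timestamp": "2020-05-04T08:00:01", "pressure_bar": "n/a"},
    {"machine_id": "M2", "timestamp": "", "pressure_bar": "3.9"},
]
clean = prepare(raw_records, METADATA)
print(len(clean), clean[0]["pressure_bar"])
```

Only the first record survives here: the second has a non-numeric pressure value and the third has no timestamp. Because the rules live in the metadata and not in the code, the same functions can be reused for another data source by supplying a different metadata description.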
Info
Project BOOST 4.0
The European project “Big Data Value Spaces for COmpetitiveness of European COnnected Smart FacTories 4.0 (BOOST 4.0)” (Horizon 2020) addresses the need to develop large-scale experimentation and demonstration of data-driven “connected smart” factories 4.0. https://boost40.eu/
SALAD (Semantic Annotations for Linked Avro Data)
SALAD (Semantic Annotations for Linked Avro Data) is a schema language for describing structured Linked Data in JSON or YAML (a recursive acronym for “YAML Ain’t Markup Language”). https://github.com/common-workflow-language/schema_salad
Apache Spark
Apache Spark is a framework for cluster computing that is publicly available under an open source licence. https://spark.apache.org/
Hadoop cluster
A Hadoop cluster is a computer cluster used to store and analyse large amounts of unstructured data in a distributed computing environment. https://hadoop.apache.org/
The video shows the architecture of the RISC Software Data Analytics Framework.
Prepare data: Spark
Data preparation is generally of great importance in this environment, as the data from customers and project partners is usually delivered in the raw form in which it was generated by the industrial machines. First of all, the various raw forms of the data have to be converted into a uniform format. This is the basis for further analyses. The reusability of the subsequent data analysis steps across a wide range of use cases is another important goal.
Both the data preparation and the subsequent analysis are carried out on a Hadoop cluster, which enables efficient processing of data in the terabyte range but also determines the target formats of the data preparation. The cluster provides data-parallel, horizontally scalable processing based on open source software and commodity hardware. This keeps deployment costs low and makes the solution particularly attractive for SMEs.
The goal is to extract and consolidate the relevant tables and files from the individual source databases so that the Hadoop framework can process them efficiently. The result of the data preprocessing is a set of CSV files that conform to the desired schema and can be queried directly via Spark SQL once a table schema has been defined. With Spark SQL, CSV-based tables can also be converted to another storage format, for example compressed and partitioned Parquet tables. Using SQL as the query language allows data scientists to aggregate and combine the data easily.
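As a rough sketch of what this can look like in Spark SQL, the following Python snippet builds three statements: one that declares a table schema over existing CSV files, one that copies the table into compressed, partitioned Parquet, and one aggregation query. The table names, column names, the path and the choice of Snappy compression are assumptions for illustration. The statements are built as plain strings so that they can be inspected without a cluster; `run_all` expects an already existing `SparkSession` and is not called here.

```python
def csv_table_ddl(table, columns, path):
    """Spark SQL statement that exposes existing CSV files as a table."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"USING csv OPTIONS (path = '{path}', header = 'true')"
    )


def parquet_copy_ddl(target, source, partition_by):
    """Spark SQL statement that copies a table into partitioned Parquet."""
    return (
        f"CREATE TABLE {target} "
        f"USING parquet OPTIONS (compression = 'snappy') "
        f"PARTITIONED BY ({partition_by}) "
        f"AS SELECT * FROM {source}"
    )


def run_all(spark, statements):
    """Run the statements on a SparkSession; return the last result."""
    result = None
    for statement in statements:
        result = spark.sql(statement)
    return result


# Hypothetical schema and location of the prepared CSV files.
columns = [("machine_id", "STRING"), ("ts", "TIMESTAMP"), ("pressure_bar", "DOUBLE")]
statements = [
    csv_table_ddl("measurements_csv", columns, "/data/prepared/measurements"),
    parquet_copy_ddl("measurements_parquet", "measurements_csv", "machine_id"),
    "SELECT machine_id, COUNT(*) AS n_samples, AVG(pressure_bar) AS mean_pressure "
    "FROM measurements_parquet GROUP BY machine_id",
]
for statement in statements:
    print(statement)
```

On a cluster, `run_all(spark, statements)` would create both tables and return the aggregated result as a DataFrame. Such an aggregate is typically small enough to be analysed locally.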
The video shows the use of the distributed processing framework Spark in a typical data engineering task, the conversion and preparation of industrial data.
Analysing data: Comparison of high-frequency time series
Time series are data produced by continuous series of measurements, for example during the running time of a production machine. Comparing two machines that are identical in construction but do not run in sync can lead to important conclusions. This situation occurs in production: two machines are started at the same time, for example, but after only a few components have been produced their cycles have drifted apart. A method from time series analysis called Dynamic Time Warping (DTW) makes it possible to determine where exactly in the programme the lead or delay occurs and how large it is. This information cannot be read from a comparison of the raw time series, especially if the offset between the time series keeps changing. A similarity matrix, in which the time series are compared pairwise, gives a quick overview of all time series under consideration and helps to spot potentially problematic ones (errors, outliers). Each entry of the similarity matrix is a single number that measures the distance between two time series in a way that is relevant for the respective use case.
The data considered in the video are high-frequency time series (sampling interval approx. 6 ms) that can hardly be told apart with the naked eye. With the help of Dynamic Time Warping (DTW), a common method for comparing curves, it was possible to determine where exactly in the programme the lead or delay occurs and how large it is. The similarity matrix shown at the end of the video gives a quick overview of all time series considered and helps to identify potentially problematic ones (errors, outliers), which are then compared in detail in the next step using DTW path analysis (shown earlier in the video).
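The idea behind DTW and the similarity matrix can be shown with a minimal sketch in plain Python. This is a textbook formulation with the absolute difference as local cost, not the implementation used in the project, and the three short series are invented toy data.

```python
def dtw(a, b):
    """Return (distance, path) of the classic DTW alignment of a and b.

    The local cost is the absolute difference. The path is a list of
    index pairs (i, j), meaning that a[i] is matched with b[j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def distance_matrix(series):
    """Symmetric matrix of pairwise DTW distances."""
    k = len(series)
    matrix = [[0.0] * k for _ in range(k)]
    for p in range(k):
        for q in range(p + 1, k):
            distance, _ = dtw(series[p], series[q])
            matrix[p][q] = matrix[q][p] = distance
    return matrix


reference = [0, 0, 1, 2, 1, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 2, 1, 0]  # same shape, two samples later
faulty = [0, 0, 1, 5, 1, 0, 0, 0]   # same timing, abnormal peak

distance, path = dtw(reference, delayed)
# Offset j - i at the peak of the reference (index 3): positive = delay.
peak_lag = [j - i for i, j in path if i == 3]
print(distance, peak_lag)
for row in distance_matrix([reference, delayed, faulty]):
    print(row)
```

In this toy data the delayed series has a DTW distance of 0 to the reference, and the warping path shows a delay of two samples at the peak. The faulty series stands out in the matrix with a distance of 3 to both other series. This simple version needs time and memory proportional to the product of the two series lengths, so very long high-frequency series usually require an optimised implementation or a restriction of the allowed warping.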
Authors
DI Paul Heinzlreiter
Senior Data Engineer
Dr. Roxana-Maria Holom, MSc
Data Science Project Manager & Researcher