One of the central goals of data engineering is the preparation of data sets according to the requirements of the user or the subsequent process steps. The use of data can range from model training in the field of machine learning to improved internal company reporting based on an integrated database.
Ensuring sufficient data quality is central in all cases. While the various fundamental aspects of data quality and their importance for companies were examined in an earlier article, this article presents examples of data quality problems from practice and discusses possible solutions.
Table of contents
Data formats
Causes of errors in structured text data
Categories of data errors in structured data
Sample data set
Methods of data troubleshooting
Algorithmic data error recovery
Data loss prevention
Role of data quality in project planning
Author
Data formats
A data error always represents a deviation from a target value. This means that possible data errors are strongly dependent on the type and format of the available data. Essentially, a distinction must be made here between structured and unstructured data. Unstructured data – especially text data – usually do not follow a schema, which means that a data error can only be detected by machine in rare cases.
In text files, a typical example is incorrect localisation of floating point values due to inconsistent use of the decimal separator:
1.23
6.4532
7,564
-0.2
In this example, the wrong decimal separator is used in the third line: the comma, which is common in German-speaking countries. Which separator is the correct one is usually determined from external additional information or from the majority of values within the given data.
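A minimal sketch of such a majority decision in Python, assuming that dot and comma only ever appear as decimal separators and never as thousands separators (the input values are taken from the example above):

lines = ["1.23", "6.4532", "7,564", "-0.2"]

# Majority vote: which decimal separator occurs in more values?
dot_count = sum("." in line for line in lines)
comma_count = sum("," in line for line in lines)
decimal_separator = "." if dot_count >= comma_count else ","
print("majority separator:", decimal_separator)

# Normalise all values to the dot notation that float() expects.
values = [float(line.replace(",", ".")) for line in lines]
print(values)  # [1.23, 6.4532, 7.564, -0.2]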
Unstructured text data
Possible data errors in text files:
Undefined or deviating character sets: A character set describes the mapping of characters (a, b, ä, €, …) to their binary representation in memory. If this is not correctly defined or is unknown, this leads to incorrect representation and processing of special characters such as German umlauts.
Encoding of line breaks: A line break is represented differently by the operating systems Microsoft Windows, Apple macOS and GNU/Linux (see the sketch after this list):
In Windows, two characters are used for this purpose: a sequence of Carriage Return (ASCII code 13) and Line Feed (ASCII code 10).
In classic Mac OS (up to Mac OS 9), only Carriage Return was used; modern macOS uses Line Feed.
In GNU/Linux only Line Feed is used.
Different localisation of the data such as German and English decimal separators
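Character-set and line-break problems like these can often be detected directly on the raw bytes. A minimal sketch in Python (the file name is illustrative):

with open("delivery.txt", "rb") as f:
    raw = f.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")   # UTF-8 byte order mark present?
crlf = raw.count(b"\r\n")                   # Windows
cr_only = raw.count(b"\r") - crlf           # classic Mac OS
lf_only = raw.count(b"\n") - crlf           # GNU/Linux, modern macOS

print(f"BOM: {has_bom}, CRLF: {crlf}, CR only: {cr_only}, LF only: {lf_only}")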
Binary formats and structured text files
In contrast, structured data is based on a schema that contains the data format, the structure of the data, as well as the data types and value ranges of the data values contained. Data schemas can be explicit or implicit depending on the data format and describe, for example, tabular data per column:
Data type
Whether cells may contain null values
Validity range for numerical values
Format for string values (e.g. date and timestamp)
In any case, a data schema makes it possible to validate the content of a data set and to check for errors. Because it is easier to check automatically, this article focuses on structured data, such as that which occurs in an industrial environment. From the point of view of data validation, structured data can be divided into two broad classes, which differ in whether the format itself already provides the data schema:
Binary data formats that provide the schema in their metadata: Examples are storage formats of commercial programs such as Microsoft Excel, image files, standardised binary protocols such as OPC UA or Protobuf, but also open big data formats such as Apache Parquet. Another very typical class of storage solutions in this category are relational databases such as Microsoft SQL Server, PostgreSQL or MySQL.
Structured text files without schema information in the data format:
Comma separated values (CSV)
XML files
JSON files
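Regardless of whether the schema is provided by the data format or has to be defined separately, it can be used to validate data records automatically, as mentioned above. A minimal sketch in Python, with hypothetical column names, types and value ranges:

from datetime import datetime

# Hypothetical per-column schema: data type, nullability and value range.
SCHEMA = {
    "timestamp":    {"type": datetime, "nullable": False},
    "power_dynamo": {"type": float, "nullable": False},
    "water_level":  {"type": float, "nullable": True, "min": 0.0, "max": 2.0},
}

def validate_record(record: dict) -> list[str]:
    """Return a list of schema violations for a single record."""
    violations = []
    for column, rule in SCHEMA.items():
        value = record.get(column)
        if value is None:
            if not rule["nullable"]:
                violations.append(f"{column}: value is missing")
            continue
        if not isinstance(value, rule["type"]):
            violations.append(f"{column}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            violations.append(f"{column}: below valid range")
        if "max" in rule and value > rule["max"]:
            violations.append(f"{column}: above valid range")
    return violations

print(validate_record({"timestamp": datetime(2022, 3, 14, 9, 0), "water_level": 3.5}))
# -> ['power_dynamo: value is missing', 'water_level: above valid range']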
Causes of errors in structured text data
While XML or JSON files rarely contain syntactic errors, as they are usually generated programmatically, data errors occur more frequently in CSV files, as these are often maintained manually (e.g. in Microsoft Excel). Typical causes of data format inconsistencies in CSV files are the lack of an explicit format specification and errors that creep in when the format is carried over manually from previous lines. Typical examples (illustrated in the fragment after this list) are:
Inconsistent use of inverted commas for string fields
Different localisation (e.g. dot or comma as decimal separator)
Empty columns and different numbers of columns per row
Different string representations of timestamps, date and time fields
Numerical values enclosed in inverted commas
Fluctuating precision of numeric entries, from integer to double
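A constructed CSV fragment (column names and values invented for illustration) that shows several of these inconsistencies at once: inconsistent quoting of string fields, a different timestamp format and a comma as decimal separator in the second data row, a quoted number and an empty column in the third, and a missing column with an integer instead of a double in the fourth:

id;timestamp;temperature;comment
1;2023-05-04 12:00:00;21.5;"ok"
2;04.05.2023 12:01;21,7;ok
3;2023-05-04T12:02:00;"22";
4;2023-05-04 12:03:00;22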
Deviations in the data representation can occur not only due to human errors during manual data entry, but also due to process changes during automated data generation. Especially with CSV and JSON files, it is often difficult to determine the type of a data entry, in particular if the source data is not consistently filled. The same error categories can occur here as with manual data transfer.
Categories of data errors in structured data
Depending on the type of structured data, different categories of data errors can occur:
Violation of the data syntax
This error category takes on a special role compared to the following ones, as it can normally only occur in structured text files, since binary files are almost without exception generated algorithmically, and are thus normally syntactically correct.
A syntax error occurs when the text file does not follow the specified syntax of the required file format. Examples of this are:
Missing closing tags in XML or HTML files
Wrong number of columns in a CSV file
Wrong format of a date or time stamp in a text file
Missing, excess or incorrect inverted commas
Incorrect localisation such as comma instead of dot as decimal separator
Wrong data types
This error occurs if a field to be validated has an incorrect data type. Typical examples are:
Text in a field where numeric values are expected.
Specification of a less strict data type in binary formats, for example a text field in which a floating point value is semantically expected.
Missing data
Data schemas often allow data fields to be marked as optional, so data fields can remain empty even in structured binary formats. However, such fields may still be required by the application semantics built on top of them, for example fields that are to be used as foreign keys for linking tables. When data is read in from a structured text format, missing data occurs much more often than with binary data. A classic example of this is a missing column in a CSV file.
Missing meta information
A typical example of missing meta-information is the specification of a time without a time zone. Storing local time without identifying the time zone can even lead to data loss, depending on the type of storage, because it results in duplicated time stamps for different points in time when switching from summer to winter time. Another example of missing meta-information is a data type specification if it cannot be clearly derived from the data element.
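A short sketch of this ambiguity in Python, using the tz database entry Europe/Berlin and the 2023 switch from summer to winter time:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")
# On 29 October 2023 the clocks were set back from 03:00 CEST to 02:00 CET,
# so the local wall-clock time 02:30 occurred twice.
first = datetime(2023, 10, 29, 2, 30, fold=0, tzinfo=berlin)
second = datetime(2023, 10, 29, 2, 30, fold=1, tzinfo=berlin)
print(first.astimezone(timezone.utc))   # 2023-10-29 00:30:00+00:00
print(second.astimezone(timezone.utc))  # 2023-10-29 01:30:00+00:00

Stored as naive local time, both of these distinct points in time would end up with the identical timestamp 2023-10-29 02:30:00.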
An example of an ambiguous data type specification is the following representation in a CSV file:
... ;10.3352; ...
Such a field is usually interpreted and stored as a floating point value. However, there are different data types for single or double precision. Which data type should be chosen depends on which value ranges (e.g. minimum and maximum values) are contained in the total amount of data. If all data is already available, this can be derived programmatically. If, however, the data is only delivered gradually, it is safer to decide on the data type with the larger value range. The disadvantage here is, of course, the double memory requirement for the data field.
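A minimal sketch of such a programmatic decision, assuming that the complete data is already available and that seven significant decimal digits are taken as the practical limit of single precision:

def fits_single_precision(values):
    # Rough heuristic: float32 covers about 7 significant decimal digits
    # and absolute values up to roughly 3.4e38.
    for text in values:
        digits = text.lstrip("+-").replace(".", "").lstrip("0")
        if len(digits) > 7:
            return False
        if abs(float(text)) > 3.4e38:
            return False
    return True

print(fits_single_precision(["10.3352", "7.1"]))      # True: single precision suffices
print(fits_single_precision(["10.33521234", "7.1"]))  # False: double precision needed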
Violation of the semantic value range
These errors describe values that lie outside their range of validity, although they are valid values for their data type. An example of this is an outdoor temperature of over 100 °C in Central Europe.
Wrong order
This error category describes the storage of data in the wrong order. This can be, for example, a time series of sensor values that has not been stored sorted in ascending order by timestamp. As long as the timestamp is available for each value, such a data set can still be read in correctly, but often timestamps are not explicitly stored to save storage space if the sensor values were determined by regular sampling. In such a case, the start time and the time interval between two measuring points are sufficient for determining all timestamps – if the storage sequence is correct.
Another example where the order of storage is critical for semantics is the sequential storage of measurement data points in a text file, which are arranged on a regular grid, and whose positioning on the grid results implicitly from start position and step size along the axes of the coordinate system.
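A minimal sketch of this reconstruction, with an invented start time, sampling interval and value list; it only yields correct timestamps if the values are stored in sampling order:

from datetime import datetime, timedelta

start = datetime(2024, 1, 1, 0, 0, 0)      # start time, e.g. from the file header
interval = timedelta(seconds=10)           # sampling interval
values = [20.1, 20.3, 20.2, 20.5]          # assumed to be stored in sampling order

timestamps = [start + i * interval for i in range(len(values))]
for ts, value in zip(timestamps, values):
    print(ts.isoformat(), value)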
Format changes for continuously supplied data
In practice, this is one of the biggest data quality problems. If data are delivered in a consistent format, the processing of the data can be adapted to this format and also absorb certain recurring fluctuations in data quality. However, when there is an abrupt change in the format of the delivered data, data processing usually needs to be adapted. Typically, this involves data fields that are dropped or added, changes in data types or in the data format. From a data quality perspective, there is basically no distinction between data that is delivered in a block and has a non-uniform format and data that is delivered over time as a data stream and changes format over time. The difference for the data recipient, however, is that if data errors already exist, one can adjust the data import to them right away, whereas changes often happen unexpectedly with continuous data delivery.
Sample data set
Typical structured data from the industrial environment are time series of sensor data. As an example, consider an excerpt from the time series of measurement data of a heat engine. These data were taken directly from the operation of the machine via sensors and written to a CSV file by a programme running on a Raspberry Pi mini-computer, and they can be seen as quite representative of industrial data in terms of data quality.
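A shortened, illustrative excerpt (the values are invented; the column names follow the description below):

timestamp;power_dynamo;water_level
2022-03-14 09:00:00;12.4;0.82
2022-03-14 09:00:10;-0.7;0.81
2022-03-14 09:00:20;13.1;
2022-03-14 09:00:30;12.9;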
In this CSV file, some of the data errors shown above are evident:
The timestamp in the first column does not include a time zone
Negative values are shown in the power_dynamo column
In the last two lines shown, the value for water_level is missing
Methods of data troubleshooting
An obvious – and good – approach to resolving data quality deficiencies is to request an improved version of the data from the data provider. However, this approach is often not feasible in practice. For example, if faulty sensor data has been recorded in a production line because a sensor is defective, it is often not possible, or at least very cost-intensive, to repeat the data recording. While the real monetary value of the collected data is often not assessable for the project participants at the beginning, the directly incurred costs for a repetition of a measurement – e.g. due to a production interruption – can be quantified very quickly. Furthermore, the replacement of a defective sensor can also lead to a considerable delay of the planned data analyses due to the often necessary involvement of external companies. A typical scenario here is the collection of sufficient training data for machine learning models, which can often take months. Here, a delay of possibly several weeks due to the replacement of a sensor can jeopardise the entire project schedule without the real benefit of such an intervention being clear in advance.
For these reasons, algorithmic handling of the data error is often the most favourable solution overall.
Algorithmic data error recovery
Unfortunately, there are no generally applicable methods for always bringing source data into the desired target format. In general, however, the comprehensive availability of meta information or the use of a structured binary format for the source data greatly reduces the effort required for data validation: the desired type of a data field is already known and the data can therefore not be stored with the wrong type. Thus, in the case of binary data, the most common error is missing data, which occurs when the underlying schema has changed or when a data field has been specified as optional although it is needed for the application logic. Overall, data errors in structured binary data can usually be traced back to errors in the data schema or to an unplanned change to it.
If structured text data are used as data sources, additional classes of possible errors are added – as described above.
Explicit schema information in text data
In the case of structured text files, however, schema information can be included by convention, such as in the header line of a CSV file, which can contain the names of the columns. As an extension of this common convention, the header line can additionally carry data type information to ensure that the correct target data type is used:
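An illustrative header line for the sample data set above (the column names come from the example, the type annotations are invented for illustration):

timestamp:java.time.LocalDateTime;power_dynamo:java.lang.Double;water_level:java.lang.Double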
In the above example, the corresponding Java data types are specified; which data types are available for storage naturally depends on the target data storage system. Normally, however, it is sufficient to define the data types unambiguously for one database or programming language, because the conversion for other target systems can then be carried out automatically.
Automated data error recovery
How can one react to data errors in the context of an automated process? If data is to be stored as part of an Extract – Transform – Load (ETL) process using a specific schema and a data set does not meet the requirements of the data schema, the simplest method is to discard that data set. This may be appropriate for some use cases – such as large datasets for AI model training – but in general the goal is to transform a dataset so that it can be stored in the intended schema. The following methods can be used to do this automatically:
Schema evolution or optional type fields: Schema evolution describes the possibility of versioning a schema, whereby data stored with an earlier version of the data schema remains processable with the new schema. A schema evolution can include adding, removing and type conversion of data fields. A good tool for this is optional type fields, which make it possible, for example, to add new fields and still process existing old data correctly. Optional data fields are also a good way to store empty data fields correctly without having to discard the entire record.
Implicit type conversion: If a source data type can be converted automatically to the target data type without loss of accuracy, this can be done within the ETL process, for example from a 32-bit integer to a 64-bit integer or from single to double precision floating point values.
Data interpolation for missing values in time series: This is an obvious operation, but whether it is permissible depends strongly on the intended use of the data (a minimal sketch follows this list).
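A minimal sketch of such an interpolation using pandas, with an invented sensor series; whether linear interpolation is acceptable has to be decided per use case:

import pandas as pd

# Invented 10-second sensor series with one missing value.
series = pd.Series(
    [20.1, 20.3, None, 20.6],
    index=pd.date_range("2024-01-01 00:00:00", periods=4, freq="10s"),
)

filled = series.interpolate(method="time")  # linear interpolation over time
print(filled)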
If data errors cannot be corrected automatically, it is a good idea to keep the raw data and send a notification, for example, so that the error can be investigated and the ETL process completed – if necessary after manual correction. If it is not a one-off error, the ETL process is usually adjusted in the course of investigating and correcting the problem in order to eliminate the error in the future. This applies in particular to errors resulting from a change in the data source format.
Data loss prevention
If the source data is delivered continuously via a streaming process, it is particularly important to save the raw data before it goes into further processing. This prevents the data from being lost if processing fails at a later point in the ETL process. After an error has been corrected or the ETL process has been adjusted following a format change, the stored raw data can be reprocessed. The raw data usually consumes more storage space than the later structured and compressed storage, especially if it is delivered as text files. It is therefore often advisable to delete successfully processed raw data after validation. To save storage space, raw data can of course also be compressed using standard algorithms, which leads to significant savings, especially with text data.
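A minimal sketch of compressing a raw text delivery before archiving it (the file names are invented; gzip is one of the standard algorithms mentioned above):

import gzip
import shutil

with open("raw/2022-03-14_sensors.csv", "rb") as src:
    with gzip.open("raw/2022-03-14_sensors.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)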
Role of data quality in project planning
Before the start of a data science or data engineering project, it is often difficult for all those involved to estimate the quality of the data to be included. This is often because the data has already been collected over a certain period of time but has not yet been used in operations, for example because it is not yet available in sufficient quantity.
Furthermore, raising data quality to a level required for the project goals often represents a considerable share of the project effort, which is difficult to estimate without knowledge of the data or its quality. To address this problem, it is possible, for example, to include a pre-project phase to jointly clarify the initial situation, or to choose an agile approach that enables a step-by-step joint procedure with flexible definition of milestones.
With its expertise in the field of data engineering built up over more than ten years, RISC Software GmbH represents a reliable consulting and implementation partner, regardless of the area of application.