The path to a customized data platform
by DI Paul Heinzlreiter
How can companies use their data more efficiently? Whether it is sensor data to improve quality or business data to support better planning, the path to the ideal data platform is complex. What does the process look like?
- How can internal company data be made more usable?
- What is a suitable solution for the application?
- The technology selection process
- Where is the finished solution operated?
- The role of open source frameworks
- Consideration of changing requirements
How can internal company data be made more usable?
There are a wide variety of scenarios in which improving a company's own data management is a good idea. One example is the use of sensor data from in-house production to improve product quality, reduce waste or save energy. Another is the integration of business data from various systems to enable better planning of the company's own business processes.
The basis of such a project is work in the area of data engineering, which, depending on the specific project objective, can include the introduction of a new data platform or simply the adaptation and optimization of data input and data models.
What is a suitable solution for the application?
A common lesson from such projects is that the optimal solution approach must be developed individually for the issue at hand. Different technologies such as application-specific databases, graph or time series databases and scalable data processing frameworks are often combined into an integrated overall solution. Technologies such as containerization and Kubernetes are often used to make the solution portable and scalable.
Through the development, setup and operation of various server and data storage systems, RISC Software GmbH has acquired solid expertise over many years in developing exactly the right solution for specific customer requirements. These requirements can cover a wide range of dimensions such as reliability, data sovereignty or the existing use of a technology. To collect requirements effectively and select suitable technologies based on them, RISC Software GmbH has developed a structured process, particularly in the area of data engineering.

The technology selection process
The basis for a well-founded selection of the technologies to be used is a survey of the customer's requirements. In the area of data engineering, these include the volume and arrival rate of the input data, and therefore how quickly the amount of data in the system grows. It is also crucial how the data is to be used, for example whether data queries are fixed or can be formulated dynamically by the user. The expected response times are also key: are answers expected in real time, or can a response time of a few seconds be tolerated? Another question is how up to date the data in the system must be: does all data have to be available within seconds, or is it sufficient if new data only becomes available the next day?
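As an illustration, the following sketch shows how these requirement dimensions could be captured in a small data structure so that candidate technologies can be compared against them. The field names and example values are purely illustrative assumptions, not a fixed schema used in such projects.

```python
from dataclasses import dataclass


@dataclass
class DataPlatformRequirements:
    """Requirement dimensions collected before the technology selection.

    All field names and example values are illustrative assumptions.
    """
    ingest_rate_records_per_s: int    # how fast new data arrives
    data_growth_gb_per_day: float     # how quickly the stored volume grows
    queries_fixed: bool               # fixed queries vs. ad-hoc user queries
    max_query_latency_s: float        # tolerated response time per query
    max_data_staleness_s: float       # how up to date results must be


# Example: a sensor-data scenario with ad-hoc queries and near-real-time needs
example = DataPlatformRequirements(
    ingest_rate_records_per_s=5_000,
    data_growth_gb_per_day=20.0,
    queries_fixed=False,
    max_query_latency_s=2.0,
    max_data_staleness_s=60.0,
)
print(example)
```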

The first step is to limit the range of possible technological solutions by eliminating configurations that are out of the question for non-technical reasons. One example of this could be high license costs for commercial software components. For the further selection of the remaining solutions, test runs are carried out to determine which solutions best meet the customer’s requirements. In the field of data engineering, a representative subset of the planned data volume is often used for this purpose. This can be provided by the customer or generated specifically for this purpose.

However, the planned data is often not yet available at the start of a project, when the fundamental technology decision has to be made, because data collection often only begins with the project itself. This is the case in an industrial context, for example, when new sensor systems are installed whose data is to form the basis for the project. In this case, generated data is of central importance for the performance evaluation of the various systems in the technology evaluation phase. The customer usually provides a very small sample data set, from which key characteristics of the input data, such as the format, can be derived. Based on this, a larger amount of data can then be generated that is still representative of the data expected during the project. This can be done by simply copying the data, or by modifying timestamps or measured values and introducing a certain random component. The factual accuracy of the data is of secondary importance at this stage; what matters is the performance characteristics of the data processing.
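As a minimal sketch of this kind of test data generation, the following Python snippet inflates a small customer sample into a larger, structurally representative data set by replicating rows, shifting timestamps and adding a small random component to the measured values. The CSV layout, column names and file names are assumptions made purely for illustration.

```python
import csv
import random
from datetime import datetime, timedelta

SAMPLE_FILE = "sample.csv"       # small sample provided by the customer
OUTPUT_FILE = "generated.csv"    # larger data set for performance tests
COPIES = 1_000                   # how many times the sample is replicated

# Assumed sample layout: ISO timestamp, sensor_id, numeric value
with open(SAMPLE_FILE, newline="") as f:
    sample = list(csv.DictReader(f))

with open(OUTPUT_FILE, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "sensor_id", "value"])
    writer.writeheader()
    for i in range(COPIES):
        for row in sample:
            ts = datetime.fromisoformat(row["timestamp"])
            writer.writerow({
                # shift timestamps so each copy covers a later time window
                "timestamp": (ts + timedelta(minutes=i)).isoformat(),
                "sensor_id": row["sensor_id"],
                # add a small random component; factual accuracy is secondary
                "value": float(row["value"]) * random.uniform(0.95, 1.05),
            })
```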

The type of expected queries is also central to the technology analysis and is often not finalized until later in the project. At the beginning of the project, however, the customer usually has an idea of what kind of questions should be answered with the help of the data, and these can be used for the system design. Exemplary queries can also be developed together with the customer; their speed can then be evaluated using the test data set. Once all essential requirements are known, the technology decision can be made based on the results of the test runs and previous experience. The focus here is on technologies that have previously delivered good results with similar data volumes and access patterns.
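The following minimal sketch illustrates how such exemplary queries might be timed against the generated test data. SQLite is used here purely as a stand-in for whichever candidate system is under evaluation; the table layout and the queries themselves are illustrative assumptions.

```python
import sqlite3
import time

# Assumed test database with a table: measurements(timestamp, sensor_id, value)
conn = sqlite3.connect("testdata.db")

EXEMPLARY_QUERIES = {
    "hourly average per sensor": """
        SELECT sensor_id, strftime('%Y-%m-%d %H', timestamp) AS hour,
               AVG(value)
        FROM measurements
        GROUP BY sensor_id, hour
    """,
    "out-of-range readings": """
        SELECT COUNT(*) FROM measurements WHERE value > 100.0
    """,
}

# Time each exemplary query against the generated test data set
for name, query in EXEMPLARY_QUERIES.items():
    start = time.perf_counter()
    conn.execute(query).fetchall()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f} s")
```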
Often, customer requirements cannot be covered by a single technology. In this case, the system design combines various systems, such as databases, data processing frameworks and a caching layer, in order to be able to access the data quickly.
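The following sketch illustrates the idea of such a caching layer in front of a slower backing store, reduced here to a simple in-process read-through cache. The backing store, the cache implementation and the time-to-live value are illustrative assumptions rather than a concrete system design.

```python
import time

CACHE_TTL_S = 30.0
_cache: dict[str, tuple[float, object]] = {}


def query_backing_store(query: str):
    """Placeholder for the slow path, e.g. a database or processing framework."""
    time.sleep(0.5)  # simulate an expensive query
    return f"result of {query!r}"


def cached_query(query: str):
    now = time.monotonic()
    hit = _cache.get(query)
    if hit is not None and now - hit[0] < CACHE_TTL_S:
        return hit[1]                      # fast path: serve from the cache
    result = query_backing_store(query)    # slow path: ask the backing store
    _cache[query] = (now, result)
    return result


print(cached_query("SELECT AVG(value) FROM measurements"))  # slow, fills cache
print(cached_query("SELECT AVG(value) FROM measurements"))  # fast, from cache
```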
Where is the finished solution operated?
In addition to the technological selection of the required components, the choice of where the finished solution will be operated is often an important element in the decision-making process. This decision is usually made in parallel with the technology decisions, as the two decision-making processes influence each other. In contrast to the technology decisions, the choice of operating location is driven more by non-technical criteria such as costs or legal framework conditions. An example of such a framework condition would be the selection of an infrastructure operator that is subject to European legislation such as the General Data Protection Regulation.
Various types of server infrastructures are used today for the operation of IT services. Depending on the type of operation, a distinction can be made between on-premise, dedicated hosting in a data center and cloud computing:
- Cloud computing: Computing resources are rented dynamically from a cloud operator as required and can usually be billed by the minute. The major cloud providers such as Amazon AWS, Google GCP and Microsoft Azure offer numerous services that simplify operations, for example firewalls, load balancers and various database services. Such standard services are required by many online applications and can be integrated into your own solutions with a certain amount of configuration effort. However, it is also possible to use virtualized computers with full administration access in the cloud as Infrastructure as a Service (IaaS).
- Dedicated hosting: The hardware is located in a remote data center and is rented from the operator there; monthly termination is often possible. Further configuration, starting from a basic operating system installation, is carried out by the customer, who has full control over the server but is also confronted with issues such as possible hardware failures, which are abstracted away by virtualization in cloud computing. Operation in a dedicated data center, however, ensures a stable and redundant power supply and network connection, and the hardware on offer is usually high-quality, stable and designed for continuous operation.
- On-premise: With an on-premise installation, you operate the server hardware within your own organization. You are free to choose the hardware to be used, but you also have to manage issues such as power supply, network connection and cooling, as well as the associated costs, yourself.
In contrast to cloud computing, with dedicated hosting and on-premise operation, the required services are installed in an operating system that is not preconfigured. This means more effort during setup, but this can often be compensated for by lower operating costs.
The role of open source frameworks
When selecting technology, a particular focus is placed on the use of open source components, which can be used flexibly without additional license costs and regardless of the type of service hosting. With a suitable selection and configuration of the software stack, the offerings of the various cloud providers can usually also be replicated with open source software if other reasons, such as legal requirements or data sovereignty considerations, rule out the use of these services. This means that every customer can be offered the best technical solution for their task, regardless of their framework conditions.
By using open source software, it is possible to draw on technologically mature solutions and integrate them cost-effectively into complete solutions. Furthermore, thanks to their broad adoption, open source solutions often provide ready-made ways of interacting with other open source tools, such as interfaces, connectors or data import and export modules. This is particularly useful in the field of data engineering, where connecting data across heterogeneous systems is a central task.
Consideration of changing requirements
In addition to meeting the initial requirements, the right choice of technology can also significantly improve the longevity and flexibility of a system. This relates above all to openness to new requirements. One example of this is the use of container solutions, which provide a flexible execution environment for a wide variety of use cases. Another example is interoperable technologies that build on the same technology stack, such as Hadoop: the Hadoop Distributed File System (HDFS) can be combined with the parallel execution framework MapReduce and the NoSQL database HBase.
In such a case, the same execution platform can easily be extended with additional services, as these are built on the same foundation. Another way to maintain a high degree of flexibility is to use widely used formats and interfaces. A good example of this is the ability to access a data management system with SQL queries. This makes it possible to replace the underlying system more quickly and flexibly, or to allow a new client application to access the database quickly. Another example is the use of a message broker such as Apache Kafka, for which a large number of connectors to other systems are available and which can therefore be used as a flexible tool for connecting different systems.
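As a minimal sketch, the following snippet shows how Kafka can act as such an intermediary between a producing and a consuming system, here using the kafka-python client. The broker address, topic name and message layout are illustrative assumptions.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"          # assumed broker address
TOPIC = "sensor-measurements"      # assumed topic name

# Producer side: e.g. a service that forwards new sensor readings
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)
producer.send(TOPIC, {"sensor_id": "s1", "value": 42.0})
producer.flush()

# Consumer side: e.g. a downstream database loader or analytics job
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```

Because both sides only agree on the topic and the message format, either system can later be replaced or additional consumers added without changing the other side.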
RISC Software GmbH is happy to support you in the implementation or further development of your data solutions. The analytical approach described here enables us to respond specifically to your requirements.
Contact person
Author
DI Paul Heinzlreiter
Senior Data Engineer