
Abra CaTabRa: Automatically analyse and validate data and use it to train machine learning models

by Sophie Kaltenleithner, MSc

Data is now collected in almost all areas of life, be it the products purchased when shopping online, exercise and nutrition information in fitness apps, or machine data in production processes. A frequent goal is to make automatic predictions: Which target groups should my product be suggested to? How much weight can I expect to lose if I go for a run every day? When do I have to replace the wear parts of my machines to keep downtime as short as possible?

Complex analytical activities and technical expertise are required to make such predictions possible. This effort cannot always be invested in projects. CaTabRa provides a remedy: it is an open-source tool that automates steps in the analysis of tabular data and the development of predictive models. It is suitable for domain experts without technical know-how as well as for data scientists who want to extract information from their data efficiently. Statistical analysis, training of machine learning models, explanation of model decisions and validation of input data can all be done with little effort.


Table of contents

  • Application example from medicine: Covid-19 detection in blood tests
  • Data analysis and training
  • Model explanation – Explainable AI
  • Invalid Input Detection – Out-of-Distribution Detection
  • Conclusion – Caution is needed when predicting Covid-19
  • Advantages through the use of CaTabRa
  • Sources
  • Author

Application example from medicine: Covid-19 detection in blood tests

Can Covid-19 be diagnosed on the basis of values from a standard blood test? Researchers from JKU, KUK MedCampus III and RISC Software GmbH investigated this question in 2022 [1]. The aim was to detect Covid-19 infections from routine laboratory tests so that a large number of patients could be screened quickly and without additional effort. The following sections demonstrate how similar results can be obtained using CaTabRa alone.

CaTabRa works on tabular data. Rows are individual samples and columns are their characteristics (features). In our example, the samples are patients and the features are their blood values. In addition, the target value must be defined. This can be a numerical value (regression) or – as here – a categorical one (classification): “infected” and “not infected”.
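To make the table layout concrete, here is a minimal sketch in plain Python. The column names and values are invented for illustration and are not taken from the Covid-19 data set, and `split_features_target` is a hypothetical helper, not part of CaTabRa.

```python
# Hypothetical toy table: each row is one patient, each column one feature.
# Column names and values are invented for illustration only.
rows = [
    {"age": 71, "crp": 48.0, "leukocytes": 5.1, "infected": True},
    {"age": 34, "crp": 2.1, "leukocytes": 7.4, "infected": False},
    {"age": 58, "crp": None, "leukocytes": 6.0, "infected": False},  # missing value
]

TARGET = "infected"


def split_features_target(table, target):
    """Separate the feature columns from the target column."""
    features = [{k: v for k, v in row.items() if k != target} for row in table]
    labels = [row[target] for row in table]
    return features, labels


X, y = split_features_target(rows, TARGET)

# A boolean (categorical) target means classification, a numeric one regression.
task = "classification" if all(isinstance(v, bool) for v in y) else "regression"
print(len(X), "samples,", len(X[0]), "features, task:", task)
```

The important point is only that the target column is named explicitly and kept apart from the features.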

A typical workflow consists of the following four steps, each of which can be invoked from the command line:

1. Analyze

Creates statistics and trains predictive models.

2. Evaluate

Evaluates the models on a test data set to check their quality.

3. Explain

Generates explanations for model decisions in the form of feature importance scores.

4. Apply

Makes predictions for new samples by applying the previously trained models.


Figure 1: The typical workflow when using CaTabRa consists of the four steps Analyze, Evaluate, Explain and Apply.

Data analysis and training

In the first step, Analyze generates descriptive statistics that give a better overview of the data. These are calculated per feature and, depending on the data type, include the number of entries in the data set, extreme values, mean values and correlations with other columns. The tables in the figure below show examples for selected features of the Covid-19 data set.


Figure 2: Exemplary extract from the Covid-19 data.
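The kind of per-feature statistics described above can be sketched with the Python standard library alone. This only illustrates the idea and is not CaTabRa's implementation. The `describe` helper and the example values are assumptions made for this sketch.

```python
import statistics


def describe(column):
    """Summarise one numeric feature; None marks a missing entry."""
    present = [v for v in column if v is not None]
    return {
        "count": len(present),
        "missing": len(column) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


# Invented values, not taken from the Covid-19 data set.
table = {
    "age": [71, 34, 58, 45, 66],
    "crp": [48.0, 2.1, None, 12.5, 30.2],
}

summary = {name: describe(values) for name, values in table.items()}
for name, stats in summary.items():
    print(name, stats)
```

Counting missing entries separately is useful here because, as discussed later in the article, the absence of a measurement can itself carry information.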

In the second step, a model is trained that predicts the defined target value, in this case whether a Covid-19 infection is present. The quality of machine learning models depends heavily on the algorithms used and their configurations, which cannot readily be determined in advance. CaTabRa therefore uses state-of-the-art AutoML (“Automated Machine Learning”) methods, which apply optimisation procedures to approach the best configuration step by step, quickly and without much manual effort.
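The core idea of searching for a good configuration automatically can be illustrated with a deliberately simplified stand-in: a random search over the neighbourhood size of a small k-nearest-neighbours classifier, scored on held-out data. Real AutoML methods, including those used by CaTabRa, are considerably more sophisticated. The data, the search space and the trial budget below are all assumptions chosen for this sketch.

```python
import random


def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query))
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, valid, k):
    """Fraction of validation samples that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


rng = random.Random(0)
# Synthetic two-class data: class 1 is shifted by +2 in both features.
data = [
    ((rng.gauss(2 * c, 1), rng.gauss(2 * c, 1)), c)
    for c in (0, 1)
    for _ in range(60)
]
rng.shuffle(data)
train, valid = data[:80], data[80:]

search_space = [1, 3, 5, 7, 9, 11, 15]  # odd values avoid tied votes
best_k, best_score = None, -1.0
for _ in range(5):                      # fixed budget of five trials
    k = rng.choice(search_space)        # sample a candidate configuration
    score = accuracy(train, valid, k)   # score it on held-out data
    if score > best_score:
        best_k, best_score = k, score

print("best k:", best_k, "validation accuracy:", round(best_score, 3))
```

Scoring each candidate on data that was not used for fitting is what keeps the search from simply rewarding configurations that memorise the training set.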

After completion of the training, the quality of the model can be checked via the Evaluate functionality. Detailed performance reports and corresponding visualisations are generated. The evaluation is carried out using a part of the Covid-19 data set that was not used for training. This allows us to estimate how well the model can handle new data. The figure below shows examples of the graphs obtained. These visualise certain quality metrics depending on the model predictions. The obtained ROC-AUC value – a quality measure for classification problems – is comparable to the one in the original publication.


Figure 3: Example graphics for model evaluation generated by CaTabRa. Left: ROC curve; Right: Metric values as a function of the decision threshold.
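The two kinds of plots can be reproduced in miniature. The sketch below computes ROC-AUC as the probability that a randomly chosen infected sample receives a higher score than a randomly chosen non-infected one, and reports sensitivity and specificity for several decision thresholds. The labels and scores are invented, and the function names are hypothetical rather than part of CaTabRa.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at(labels, scores, threshold):
    """Sensitivity and specificity when predicting 'infected' for score >= threshold."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(labels, scores))
    fn = sum(y == 1 and s < threshold for y, s in zip(labels, scores))
    tn = sum(y == 0 and s < threshold for y, s in zip(labels, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(labels, scores))
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


# Invented test-set labels (1 = infected) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

auc = roc_auc(y_true, y_score)
print("ROC-AUC:", auc)
for t in (0.25, 0.5, 0.75):
    print(t, metrics_at(y_true, y_score, t))
```

For these toy values, 15 of the 16 positive/negative pairs are ranked correctly, so the ROC-AUC is 0.9375. Raising the threshold trades sensitivity for specificity, which is the relationship the right-hand plot visualises.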

Model explanation – Explainable AI

Decisions made by black-box machine learning models are difficult for humans to understand. In medicine in particular, however, it is important not to trust the models blindly. Data often contain an unintentional bias that causes the models to draw incorrect conclusions. If, for example, more men than women happened to have Covid-19 in the training data, the patients' gender could influence the decision too strongly.

CaTabRa therefore allows the importance of individual features to be determined with the help of the Explain functionality. SHAP is used for this by default. You can read about how this method works in detail in the technical article Explainable AI. In addition to the calculations themselves, meaningful visualisations are created automatically.

The figure below shows the feature importance scores for the five most important features of the Covid-19 data. Each point corresponds to a sample, with the colour representing the feature value (blue: low, red: high). The position on the x-axis shows how strongly, and in which direction, a feature affects the result for a particular sample. For example, the absence of glomerular filtration rate measurements (“MISSING_GFR”; GFR is a parameter that primarily measures kidney function) tends to indicate a Covid-19 infection, and high age also seems to be an indicator in the data set used. Overall, however, the prediction model draws on many different features rather than basing its decision on only a few.


Figure 4: Feature Importance Plot for the Covid data based on SHAP values.
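SHAP builds on Shapley values from cooperative game theory. For a handful of features these can be computed exactly by enumerating all feature subsets, which the following sketch does. It does not use the SHAP library and does not reflect how CaTabRa calls it. The risk function, the feature values and the baseline are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(sample)

    def value(coalition):
        mixed = [sample[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


# Hypothetical risk score with an interaction between the first two features.
def predict(x):
    age, crp, gfr_missing = x
    return 0.01 * age + 0.02 * crp + 0.3 * gfr_missing + 0.001 * age * crp


sample = [70, 40, 1]
baseline = [50, 5, 0]  # assumed reference values, e.g. training-set averages
phi = shapley_values(predict, sample, baseline)
print([round(p, 4) for p in phi])
```

The values sum to the difference between the prediction for the sample and the prediction for the baseline, which is what makes them readable as per-feature contributions in a plot like the one above. Exact enumeration grows exponentially with the number of features, so practical tools rely on approximations.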

Invalid Input Detection – Out-of-Distribution Detection

Machine learning models generally assume that new data to be predicted follow the distribution of the original training data. In reality, however, there are often so-called “domain shifts”, i.e. changes in the distribution of the data. There can be many reasons for this: the training data set was not representative enough, measurement methods have changed, characteristics change over time, etc. In any case, model decisions are no longer trustworthy in such situations. CaTabRa therefore trains out-of-distribution (OOD) detectors to check how much a certain input differs from the training data. They are applied automatically when Apply (the prediction functionality) is called. This way, users know when they should question model predictions.

With the Covid-19 data, it was found that models trained only on data from the beginning of the pandemic made worse predictions at later points in time. This could be because the virus had spread more widely in society, but also because new mutations had occurred. If only data from the first ten months of the pandemic are used for model training and the generated OOD detectors are then applied to data that also include months eleven and twelve, changed distributions can be detected in 81 of a total of 95 continuous features.
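One simple way to check a single continuous feature for a changed distribution is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distribution functions of the training data and the newer data. The sketch below is an assumption about how such a check could look. It does not describe the detectors CaTabRa actually trains, and both the synthetic data and the cut-off are chosen purely for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, new):
    """Largest gap between the two empirical cumulative distribution functions."""
    ref, cur = sorted(reference), sorted(new)
    gap = 0.0
    for v in ref + cur:
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


rng = random.Random(42)
# Synthetic features: "stable" keeps its distribution, "shifted" drifts upwards.
train = {
    "stable": [rng.gauss(0, 1) for _ in range(500)],
    "shifted": [rng.gauss(0, 1) for _ in range(500)],
}
later = {
    "stable": [rng.gauss(0, 1) for _ in range(500)],
    "shifted": [rng.gauss(1.5, 1) for _ in range(500)],
}

THRESHOLD = 0.2  # illustrative cut-off, not a calibrated significance level
drift = {name: ks_statistic(train[name], later[name]) for name in train}
flagged = [name for name, d in drift.items() if d > THRESHOLD]
print(drift, flagged)
```

Applied feature by feature, a check of this kind yields a count of features whose distribution has changed, comparable in spirit to the "81 of 95" figure reported above.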

Conclusion – Caution is needed when predicting Covid-19

The results of the original publication show, and the use of CaTabRa confirms, that blood tests are a relatively good indicator of whether a person has Covid-19. However, this only holds as long as the characteristics of the virus and its spread do not change too much. As most people will have noticed, this assumption does not hold in reality: rapid spread of the virus, lockdowns and mutations can all lead to a change in distribution. It is therefore advisable to continuously check the quality of machine learning models on current data and to retrain them if necessary.

Advantages through the use of CaTabRa

  • It makes the evaluation of data easier and more efficient: you can quickly gain insight into the data to determine, for example, whether the use of machine learning methods makes sense.
  • It creates appealing visualisations that can be used directly in publications.
  • Unlike similar cloud solutions, no sensitive data needs to be uploaded; everything happens locally.
  • The focus is on flexibility: CaTabRa can easily be extended so that the process can be adapted with custom methods. In addition, a variety of configurations is offered out of the box.
  • CaTabRa is also a Python library that provides the individual features, as well as methods for data preparation, via programming interfaces.

Those who want to try CaTabRa can find it on GitHub.

Sources

[1] T. Roland et al., ‘Domain Shifts in Machine Learning Based Covid-19 Diagnosis From Blood Tests’, J Med Syst, vol. 46, no. 5, p. 23, Mar. 2022, doi: 10.1007/s10916-022-01807-1.

Author

Sophie Kaltenleithner, MSc

Researcher & Developer