
From text chaos to knowledge gain with Natural Language Understanding (NLU)

How Natural Language Understanding helps your company optimise existing processes

by Sandra Wartner, MSc

Many companies are increasingly moving towards digitalisation and automation. In the process, enormous amounts of unstructured data accumulate continuously; their scope and complexity deter the stakeholders concerned from evaluating them, or the potential in the existing data is not even recognised in the first place. Whether fault messages from production processes need to be analysed, doctors’ letters filed in a structured manner or products suggested automatically, Natural Language Understanding (NLU) offers a broad spectrum of industry-specific and cross-industry applications.

Table of contents

  • In the beginning there is the mountain of data… and now what?
  • How do I teach the AI system what to do?
  • Current trends and challenges
  • Conclusion
  • Sources
  • Author

Language is omnipresent, and we encounter it in many different facets in everyday life as well as in our professional environment – written, spoken and communicated by humans in different languages, but also analysed, processed and synthesised by machines. Natural Language Processing (NLP) enables computers to process and generate natural language automatically and to act as an interface between humans and machines (for more details on NLP, see [1]). As an application area of artificial intelligence (AI), NLP is used wherever monotonous or frequently recurring text-processing tasks are to be automated, optimised and integrated into a higher-level workflow. In this way, errors can be minimised in various areas, processes can be (partially) automated and savings can be achieved, for example through reduced personnel costs.

RISC Software GmbH supports its customers with its many years of practical experience when it comes to the development of individually tailored, AI-supported solutions, including in the area of Natural Language Understanding (NLU), a sub-area of Natural Language Processing.

Natural Language Understanding (NLU) focuses on extracting information from written text and thus on building an understanding of the text with regard to a specific aspect. Syntax (grammatical structure) and semantics (the meaning of words) both play an important role. Examples include:

  • Information extraction, e.g. recognising persons, places or other keywords in texts (Named Entity Recognition (NER)),
    • Use case “Newsadoo”: “Newsadoo – All the news about your interests” – allows users to access news articles from numerous sources and offers relevant news personalised according to interests. In the background, NLP is used to transform unstructured text data into structured, analysable content.
    • Use case “FLOWgoesS2T”: Voice messages on current traffic events are converted into written texts, in which important information such as roads, locations, driving directions and events is then automatically recognised and stored in a structured manner using NLP. This supports the editors in processing transmitted voice messages so that traffic-relevant events can be identified quickly.
  • Classification of text into predefined categories
    • Use case “ACT4”: In an expansion stage of the existing platform solution ACT4 of Compliance 2b GmbH, RISC Software GmbH is developing a trustworthy AI component together with the company that, on the one hand, supports whistleblowers in submitting reports and, on the other hand, enables the responsible officers to process the reports more efficiently and with fewer errors. The system will automatically derive information (e.g. the category or the roles of the persons involved) from the free-text input and compare it with already structurally recorded data in the form of a plausibility check.
  • Sentiment and opinion analysis (sentiment analysis)
    • Use case “Intelligent Twitter Analysis”: Are positive emotions in tweets about listed companies related to their share price development? Sentiment analysis can be used to assess the sentiment of a text (positive, negative, etc.) and to evaluate how much information is actually contained between the lines (a minimal sketch follows this list).
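
To illustrate how little code a first experiment requires, here is a minimal sentiment-analysis sketch using the Hugging Face transformers library; the default model and the example texts are illustrative placeholders, not the setup used in the projects above.

```python
# Minimal sentiment-analysis sketch: classify short texts as positive/negative.
# The default model and the example texts are illustrative placeholders.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # downloads a default pretrained model

tweets = [
    "Great quarterly results, the team really delivered!",
    "Another outage today, support was no help at all.",
]

for tweet in tweets:
    result = sentiment(tweet)[0]
    print(f"{result['label']:8s} ({result['score']:.2f})  {tweet}")
```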

In the beginning there is the mountain of data… and now what?

The first steps are almost always the hardest. The following (certainly not exhaustive) checklist provides an overview of the most relevant questions every project team should clarify before concretely planning or implementing NLU systems, or AI systems in general.

Is the problem formulated sufficiently well?
  • What requirements must the AI system meet in order to be used beneficially in operations?
  • Are the expected results clearly defined?
  • Do all stakeholders have the same expectations?
Is the type of problem to be solved known or clearly delimited (e.g. classification of words or documents, sentiment analysis)?
  • If not, can the problem be split into several clearly delineable sub-problems?
Can I solve the problem with the existing data?
  • If not, are there ways to obtain such data, e.g. by using data from other or public sources, or by collecting your own?
Is the data quality sufficiently “good”?
  • Data quality results from the interaction of different criteria that depend on the use case (see [2]).
  • If the data quality is not sufficient – what measures can be taken to improve it? Is it possible to establish a robust data strategy in the company in the long term?
Is a ground truth (correctly annotated examples) available?
  • If not, can it be created? Are the resources and the technical or domain-specific know-how available to annotate it?
How do I assess whether a solution works “well enough”? How can I “measure” errors?
  • On the one hand, metrics are needed for the accuracy of the models themselves, and on the other hand, evaluation strategies are needed to determine whether and what added value is generated by the use of the solution, e.g. a certain percentage increase in one or more of the company’s KPIs.

Are there already solutions to similar problems, or does the project have a high degree of innovation? How risk-tolerant is my organisation?
  • If the degree of innovation is high and there are many risk factors, funding opportunities can be used to implement the project nevertheless, with reduced risk (see [3]).
  • If the risk factors are (still) unknown or unclear, a feasibility study can help to assess them (see [4]).
How can I create a trustworthy AI system?
  • Which areas are relevant for my use case, e.g. comprehensibility, fairness, technical robustness (see [5])?
  • Can I use methods from the field of Explainable AI to check my black box (see [6])?

How do I teach the AI system what to do?

To move from raw data to a successfully implemented NLU component, a number of steps are necessary. The concrete measures vary from one project to the next, but the basic procedure follows the scheme shown in Figure 1.

[Figure 1: The basic procedure from raw data to a productive NLU component – data basis, data preparation, language model, fine-tuning, evaluation, productive use]

Data basis

The existing raw data can be in many different formats, e.g. as text fields in databases, contents of web pages, text files or text in images or scans. If texts are contained in (complex-)structured PDFs or web pages, relevant content can be extracted with some effort. For scans of documents, the Optical Character Recognition (OCR) method is used, which recognises texts in a two-dimensional image and stores them with their position for further processing. OCR systems already achieve very good results for images with structured, typewritten texts (e.g. scans or photos of analogue documents), but this step is often a challenge for photos (e.g. of street signs) or handwritten texts. Audio files can also be transcribed into written text using Speech-To-Text technologies. Depending on the quality of the recording, language and dialect, this can also involve considerable effort until the texts are available in sufficiently good quality for further processing.
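
As a sketch of the OCR step, the following uses the open-source Tesseract engine via the pytesseract wrapper; a local Tesseract installation is assumed and “scan.png” is a placeholder file name.

```python
# Minimal OCR sketch: extract text and word positions from a scanned page.
# Assumes a local Tesseract installation; "scan.png" is a placeholder.
from PIL import Image
import pytesseract

image = Image.open("scan.png")

# Plain text of the whole page
print(pytesseract.image_to_string(image))

# Word-level bounding boxes for structured downstream processing
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, left, top in zip(data["text"], data["left"], data["top"]):
    if word.strip():
        print(f"{word} @ ({left}, {top})")
```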

Data preparation

Next, the texts must be prepared for further processing. Depending on the application, this step involves, for example, removing certain punctuation marks and/or excess spaces, or converting texts to lower case. Although some information is lost as a result, this makes both manual and machine processing of the texts much easier. Another essential step is tokenising the texts: since computers cannot “calculate” with words, texts are split into units (words or subwords), each unit is assigned a unique number, and all texts are converted into this uniform number scheme.
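
A minimal sketch of such a preparation step is shown below, combining a few illustrative cleaning rules with subword tokenisation from the Hugging Face transformers library; the cleaning rules and model name are assumptions for illustration.

```python
# Minimal preparation sketch: simple cleaning plus subword tokenisation.
# The cleaning rules and model name are illustrative; real projects tune both.
import re
from transformers import AutoTokenizer

def clean(text: str) -> str:
    text = text.lower()                       # normalise case (loses information!)
    text = re.sub(r"[^\w\s.,]", " ", text)    # drop unusual punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse excess whitespace

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

prepared = clean("Fault report:   Conveyor 3 has stopped!!")
encoding = tokenizer(prepared)

print(prepared)
print(encoding["input_ids"])  # the uniform number scheme
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```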

Language models

Modern, deep-learning-based language models are pretrained in a self-supervised way on extensive text collections such as BookCorpus. A very common approach is so-called masked language modelling, where random parts of sentences (e.g. individual words) are blanked out and the model tries to fill the gap as closely as possible to the original text. For the model to build up a good understanding of natural language structures, millions of examples and many iterations of this guessing game are necessary. Since this process is very resource-intensive (high computing power and costs), such models are usually pretrained by large organisations such as Google or Facebook and – thankfully – made publicly available to other developers.
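
The following minimal sketch makes this guessing game concrete with a publicly available pretrained model; the example sentence is made up.

```python
# Minimal masked-language-modelling sketch: the model fills a blanked-out word.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The delivery was damaged during [MASK]."):
    print(f"{prediction['token_str']:12s} {prediction['score']:.3f}")
```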

Finetuning

Using the principle of so-called transfer learning, pretrained models can now use their language understanding to learn concrete tasks (such as NER, text classification or sentiment analysis) from much smaller amounts of data. Depending on the complexity of the task, hundreds to thousands of annotated examples are necessary for this fine-tuning.
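
A compressed fine-tuning sketch using the Hugging Face Trainer API might look as follows; the dataset (IMDB as a stand-in for your own annotated ground truth), label count and hyperparameters are placeholders.

```python
# Compressed fine-tuning sketch: adapt a pretrained model to text classification.
# Dataset (IMDB as stand-in), label count and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pretrained body, fresh classifier head

dataset = load_dataset("imdb")          # stand-in for your annotated ground truth
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
    tokenizer=tokenizer,                # enables dynamic padding per batch
)
trainer.train()
trainer.save_model("finetuned")         # reused in the inference sketch below
```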

Evaluation

The quality of these models is then quantitatively evaluated on held-out test or validation data. Depending on the task and goal, different metrics are used; it may therefore be necessary to evaluate and compare models on the basis of several metrics.
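
A minimal evaluation sketch with scikit-learn; the prediction and label arrays are made-up placeholders for real validation data.

```python
# Minimal evaluation sketch: compare model predictions against held-out labels.
# The arrays are made-up placeholders for real validation data.
from sklearn.metrics import accuracy_score, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels of the test set
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions on the same texts

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
# Per-class precision, recall and F1 often matter more than accuracy alone,
# especially on imbalanced data.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```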

Productive use

The models’ predictions on new data (inference) deliver results in the same structure as the sample data and can thus be integrated into the company’s workflow.
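
A minimal inference sketch, assuming the placeholder output directory “finetuned” from the fine-tuning sketch above:

```python
# Minimal inference sketch: apply the fine-tuned model to new, unseen texts.
# "finetuned" is the placeholder directory saved in the fine-tuning sketch.
from transformers import pipeline

classifier = pipeline("text-classification", model="finetuned")

new_texts = [
    "The spare part arrived on time and fits perfectly.",
    "Third complaint this week, still no response.",
]
for text, result in zip(new_texts, classifier(new_texts)):
    print(f"{result['label']:8s} ({result['score']:.2f})  {text}")
```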

Current trends and challenges: When AIs learn to write, draw and communicate like humans

In recent years, almost everything in the NLU field has revolved around so-called Transformer models, a special architecture of artificial neural networks that is particularly well suited to text data (see also [7]). Google’s Language Model for Dialogue Applications – LaMDA for short – has attracted particular attention in recent months (see [8]). This model is trained to behave as humanly as possible in dialogue, an ability it has already demonstrated in several “interviews” (see [9]). The DALL-E models developed by OpenAI (see [10]) can (among other things) generate images that match an input text. They are based on the GPT-3 architecture (see [11]), which had previously impressed with its ability to generate new texts of previously unattained quality. A simplified model based on DALL-E is publicly available at craiyon.com: what is a fun gimmick for the average internet user also has numerous productive applications.

The biggest challenge in using these new models in innovative research projects is the data available for the task at hand. Successful fine-tuning of a pretrained model to a new task requires appropriate data to show the model what to do, and this data must be available in sufficient quantity and meet the specified data quality criteria. Furthermore, the selection of the pretrained model is itself a challenge: to achieve the best results, a literature review and the testing and evaluation of different models are essential.

Keeping track of all these exciting innovations is not always easy. At the same time, the latest trends should not be the only consideration. Some tasks can also be solved with older methods or, in combination, with well-crafted rule-based systems, which are sometimes more efficient to run and make model decisions traceable out of the box. For a first prototype, it is therefore often worthwhile to try long-established methods first.
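
As a contrast to the deep-learning pipeline above, here is a tiny rule-based sketch: two regular expressions extract roads and driving directions from a traffic message, and every decision is directly traceable to a pattern. The patterns are illustrative and far simpler than a production rule set.

```python
# Tiny rule-based sketch: extract roads and driving directions from a traffic
# message with regular expressions. Patterns are illustrative, not production rules.
import re

ROAD = re.compile(r"\b[AS]\d{1,3}\b")  # e.g. A1, S36 (Austrian road numbering)
DIRECTION = re.compile(r"\bdirection\s+([A-ZÄÖÜ][\w-]+)")

message = "Accident on the A1 direction Salzburg, right lane blocked."
print(ROAD.findall(message))       # ['A1']
print(DIRECTION.findall(message))  # ['Salzburg']
```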


Conclusion

Human language is amazingly complex and versatile. NLU solutions are getting better and better at understanding and interpreting linguistically conveyed content, and the pace of progress is increasingly impressive. The number of publicly available models grows almost daily, and at the same time it is becoming apparent how diversely they can already be used. With increasing digitalisation and a growing number of routine processes, there is still a lot of untapped potential in companies’ unstructured text data for taking processes and products to the next level with NLP solutions. If you are interested in using such technologies in your company, we would be happy to support you in planning and implementing NLP projects (https://www.risc-software.at/annalyze-nlp/).

Sources

[1] Wartner, Sandra (2021): „OK Google: What is Natural Language Processing?” – How machines read, decode and understand human language (ris.w4.at/en/technical-article-natural-language-processing-1/)

[2] Wartner, Sandra (2021): Data quality: From information flow to information content – why clean data (quality) management pays off (ris.w4.at/en/technical-article-data-quality/)

[3] Hochleitner, Christina (2021): Förderungen mit laufender Einreichmöglichkeit [Funding opportunities with rolling submission deadlines] (https://www.risc-software.at/foerderungen-mit-laufender-einreichmoeglichkeit/)

[4] Wartner, Sandra (2021): Why even a good idea needs a feasibility study (ris.w4.at/en/technical-article-why-even-a-good-idea-needs-a-feasibility-study/)

[5] Wartner, Sandra (2021): Trust in Artificial Intelligence – how we create and use trustworthy AI systems (ris.w4.at/en/technical-article-trust-in-artificial-intelligence/)

[6] Jaeger, Anna-Sophie (2022): Explainable Artificial Intelligence (XAI) – How machine learning predictions become interpretable (ris.w4.at/en/technical-article-explainable-artificial-intelligence/)

[7] Wartner, Sandra (2022): Transformer models conquer Natural Language Processing (ris.w4.at/en/technical-article-transformer-models-conquer-natural-language-processing/)

[8] Thoppilan, Romal, et al. “LaMDA: Language models for dialog applications.” arXiv preprint arXiv:2201.08239 (2022).

[9] Lemoine, Blake (2022): “Is LaMDA Sentient? – an Interview” (https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917)

[10] OpenAI (2022): https://openai.com/dall-e-2/

[11] Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.

Author

Sandra Wartner, MSc
Data Scientist