Artificial intelligence in research

Systems working with so-called ‘artificial intelligence’ are used, developed and trained in research in various contexts. Their potential to support, accelerate and sometimes even revolutionise the research process is enormous. However, it is important to assess realistically both the technical limitations and the potential of AI, and to comply with regulatory requirements. On this page, we would like to provide you with some guidance on this topic.

Using AI in the preparation and review of DFG grant applications

The DFG generally permits the use of AI in the preparation and review of funding applications, but defines conditions and requirements.

Basic concepts

Current forms of so-called ‘artificial intelligence’ are based on mathematical and statistical methods for pattern recognition in large amounts of data. During the training process, AI systems are ‘fed’ with relevant training data that they are supposed to classify. Information from this data, and above all relationships between pieces of information, is nowadays usually stored in the form of deep neural networks. In these networks, data nodes (neurons) are arranged in hierarchical layers and linked both to other nodes in the same layer and to ones in layers above and below. Newly fed-in data is compared with the ‘knowledge’ thus stored and can in turn be fed into the knowledge store itself. The successful solving of tasks by the AI, for example the correct classification of newly fed-in data, reinforces the weighting of the applied solutions, while failures weaken them. In this way, the AI ‘learns’ and will use successful approaches preferentially in the future.
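The weighting mechanism described above can be illustrated with a minimal sketch. It assumes a single artificial neuron (a perceptron) and an invented toy task, the logical AND function; the function names, learning rate and number of passes are illustrative choices. Real deep neural networks have many layers and are trained with gradient-based methods, which this sketch does not show.

```python
# Toy perceptron: one artificial neuron that learns the logical AND function.
# Data set, learning rate and epoch count are illustrative assumptions.

def predict(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs exceeds zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=20, learning_rate=1.0):
    """Adjust weights and bias whenever the neuron classifies a sample wrongly."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # 0 when correct
            # Wrong answers shift the weights; correct answers leave them unchanged.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(predictions)
```

After a few passes over the four examples, the weights stop changing because every example is classified correctly. This is the sense in which the neuron has 'learned' the task.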

  • Machine Learning

    Explanation by Fraunhofer Institute for Cognitive Systems: „In machine learning, algorithms learn to independently carry out a task through repetition. The machine orients itself to a predefined quality criteria and the content in the data. Unlike with conventional algorithms, a solution path is not modeled. Instead, the computer learns to recognize the structure of the data on its own. Robots for example, can learn how to grab specific objects in order to transport them from point A to point B. The only instructions they receive are where to retrieve the objects and where to transport them. The robot learns how to grab the object through repeated trial and error and feedback from successful attempts.“ Source: https://www.iks.fraunhofer.de/en/topics/artificial-intelligence.html 

  • Neural networks

    Explanation by Fraunhofer Institute for Cognitive Systems: „A sub-area of machine learning is neural networks. These learning algorithms are inspired by the nerve cell connections in the human brain, which processes information via neurons and synapses. Similarly, artificial neural networks are made of multiple series of data nodes that are networked together with weighted connections. The neural network is trained by repeatedly feeding it with data. Through this repetition, the neural network learns to more precisely classify the data each time. It does this by repeatedly adjusting the weighting for the individual connections between the neuron layers. The model, which is created through the learning iterations, can then be applied to data that the artificial intelligence still has not become familiar with during training. Neural networks with concealed neuron layers that are not directly coupled to the input or output layer are called »deep neural networks«. Deep neural networks can have hundreds of thousands or millions of neuron layers. Through so-called »deep learning«, computers are thus capable of solving increasingly complex problems.“ Source: https://www.iks.fraunhofer.de/en/topics/artificial-intelligence.html 

  • AI model and AI system

    Explanation by the Gesellschaft für Datenschutz: „An AI model is the mathematical or algorithmic core of an AI system. It is a trained statistical function or neural network that identifies patterns and makes predictions through data-based machine learning.“ [...] Source: translated from https://gesellschaft-datenschutz.de/ki-modell-ki-system/

    An AI system includes not only the AI model but further components, such as hardware, software and interfaces for data input and output.

  • Large Language Models (LLM)

    Large language models (LLMs) are AI models that specialise in understanding and generating language. LLMs assign words, syllables and punctuation marks to so-called tokens, which are represented internally by a numerical ID. Using training data, the models learn how often certain tokens follow other tokens. Based on this ‘knowledge,’ texts can be returned that are grammatically correct but do not necessarily make sense in terms of content. Through further training, LLMs can learn to recognise logical and contextual connections in longer passages of text and even generate corresponding texts themselves. Among other things, LLMs can be used as user interfaces for AI systems that work internally with knowledge graphs or other forms of databases that actually represent real information.

  • Generative AI

    Explanation by the Federal Office for Information Security: „Generative AI models are a subclass of general-purpose AI models. They are trained using large amounts of data and thereby identify patterns in existing data. As a result, they can generate new content such as text, images, music and videos that also follow these patterns. Due to their high degree of generalisation, such models can be used in a variety of applications that traditionally require creativity and human understanding. These include applications for translating or classifying texts, processing large amounts of data, and generating or improving visual representations.“ Source: translated from https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/KI/Generative_KI-Modelle.html
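The item on large language models above describes tokens with numerical IDs and models that learn how often certain tokens follow others. A minimal sketch of that idea follows. It assumes whitespace-separated words as tokens and a tiny invented corpus; real LLMs use subword tokenisers and neural networks rather than a simple frequency table, so this only illustrates the principle.

```python
from collections import Counter, defaultdict

# Invented example text; real models are trained on vastly larger corpora.
corpus = "the cat sat on the mat . the cat ate the fish ."

tokens = corpus.split()

# Assign each distinct token a numerical ID, in order of first appearance.
token_ids = {}
for token in tokens:
    token_ids.setdefault(token, len(token_ids))

# Count how often each token is directly followed by each other token.
follow_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    follow_counts[current][following] += 1


def most_likely_next(token):
    """Return the token that most often followed `token` in the corpus."""
    return follow_counts[token].most_common(1)[0][0]


print(token_ids)
print(most_likely_next("the"))
```

Chaining such predictions yields word sequences that look locally plausible but need not make sense as a whole, which matches the limitation described above for models without further training.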

Application areas of AI systems in research

AI systems are used in research at various levels. At the user level, generative AI systems based on large language models are increasingly being used to create texts and graphics, for example. AI tools can also be used to transcribe and annotate audio and image recordings or to recognise handwriting. In an increasing number of disciplines, however, AI systems are also being trained and developed for specific purposes in order to evaluate very large amounts of data and address very specific research questions. And in computer science, research into AI systems themselves is being advanced in order to increase their performance and reliability, also with regard to regulatory requirements (e.g. the EU's AI Regulation) and technological innovations (e.g. quantum computers).

  • Design of texts, tables, graphics and presentations

    In academia, generative AI systems are especially helpful for tasks that are not directly part of the research itself, but rather involve project preparation or the communication of results. For example, generative text AI may suggest wording for proposals, website texts, or academic publications. Other AI systems specialise in generating graphics and entire presentations and can help to visualise research data and results in an appealing and understandable way. However, AI-generated content should only ever serve as inspiration, as it is almost never free of errors. Thorough review and revision are therefore essential. If AI-generated content is used (almost) unchanged, it must be labelled as AI-generated.

    Using generative AI in grant applications to the DFG

    The DFG generally permits the use of AI in the elaboration of funding applications, provided that it is made transparent and does not contravene the principles of good scientific practice.

  • (Literature) research

    There are now AI systems, such as TIB's ORKG Ask, that are optimised for searching scientific publications. They usually use a large language model as a user interface, which, in turn, taps into a substantial knowledge database. The outputs therefore contain specific source references and are much more reliable than those provided by pure text AI. While the latter's answers may sound convincing, they can be made up or at least be inaccurate. However, even AI tools for scientific literature research do not deliver complete or 100% reliable results. As with all AI systems, the quality of the output data depends heavily on the quality, quantity and representativeness of the training data. A new improvement approach involves the integration of retrieval-augmented generation frameworks, i.e. pre-processed collections of knowledge and documents from which the tool then generates its answers. For example, for the tool Chat AI the GWDG has developed the RAG framework Arcana, which can be integrated into the AI model Meta Llama.

  • Processing of research data

    Raw data often needs to be processed for further analysis. In many cases, this process can be handled at least partially by AI tools. Examples include transcribing and annotating audio and video recordings, as well as handwritten documents.

  • Analyses of research data

    The great strength of many AI applications is their ability to recognise and interpret patterns in large amounts of data. In medicine, AI is increasingly being used to analyse images in order to detect abnormalities, for example in cancer diagnosis. Using a similar principle, satellite imagery is processed in the geosciences to classify and map vegetation, land use patterns and infrastructure. In the field of autonomous driving, AI systems process large amounts of sensor data in real time and make decisions about how to control the vehicle. In climate research, too, simulations and models have long since reached a level of complexity that would be impossible to manage without the use of AI, considering the available computing power. The list could go on.

  • Development of AI models and AI systems

    When developing AI systems, the focus may be on the training and improvement of an existing AI model. For example, the GPT-4 model could serve as the basis for a new application designed to recognise certain dialects in voice recordings. To achieve this, the model would be specifically trained with corresponding data. On the other hand, computer science also involves basic research into AI itself, which usually focuses on making AI safer, more reliable and more powerful, or expanding its scope of application. The result can be new or improved AI models or even completely new architectures, for example to optimise AI models for use with quantum computers.
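The retrieval step of the retrieval-augmented generation approach mentioned under literature research can be sketched as follows. This is a toy under stated assumptions: the documents are invented, relevance is measured by plain word overlap rather than the semantic search that real frameworks typically use, and the prompt is only assembled, not sent to any model. It does not describe how Arcana or Chat AI are actually implemented.

```python
import re

# Invented mini knowledge base; a real system would index full publications.
DOCUMENTS = {
    "doc-1": "Satellite imagery is classified to map vegetation and land use.",
    "doc-2": "Speech recordings are transcribed and annotated automatically.",
    "doc-3": "Image analysis supports the detection of tumours in cancer diagnosis.",
}


def words(text):
    """Lower-case a text and return the set of words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by the number of words they share with the question."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc_id: len(question_words & words(documents[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Place the retrieved passages, with their IDs, in front of the question."""
    sources = retrieve(question, documents, top_k)
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in sources)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


question = "How is satellite imagery used to map vegetation?"
prompt = build_prompt(question, DOCUMENTS)
print(prompt)
```

Because the passages carry their IDs into the prompt, the answer can refer back to specific sources. The answer is still only as complete and reliable as the underlying document collection.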

Limitations and risks of AI applications

Even though AI now enables many things that would have been beyond the realm of possibility just a few years ago, AI is not a panacea and is not necessarily always the best choice for processing a specific task. AI is excellent at handling very large amounts of data and recognising patterns that might otherwise have gone unnoticed. However, AI outputs are estimates rather than exact results, as the systems work with probabilities. With generative systems in particular, the same input data will tend to produce similar, but not identical, outputs. The quality of these outputs depends heavily on how the AI system has been trained. Before using an AI system, you should therefore critically assess its risks and limitations.

  • Poor quality, quantity or representativeness of training data

    AI produces good results in particular when there are many suitable examples in the training data for a given input. For example, if an AI model has been trained with many photos of traffic signs, it will recognise such traffic signs relatively reliably in newly fed-in photos. However, the same model will fail if it is suddenly asked to recognise animals that did not appear in the training data, or only rarely. Even the large, well-known AI models, such as GPT, Llama or Gemini, which are designed for the most general applicability possible, cannot solve all tasks equally well. The first versions in particular were trained with large amounts of data from the internet that had not been quality-checked and contained many errors or at least questionable information. Those errors are also reproduced time and again in the outputs. When asked very specific questions from areas for which only a small amount of training data was available, the AI often completely ‘invents’ answers (hallucinations). And then there is the problem that the internet is not representative of the real world. For example, there is a disproportionate number of images of young, good-looking, light-skinned people in Western clothing. If you instruct the AI to create an image of ‘a human being,’ the result will probably reflect this. In order to use such AI models to address specific research questions, it will therefore generally be necessary to first train them (further) in a targeted manner using sufficient data that is relevant to the topic, quality-checked and representative.

  • Reproducibility is not achievable

    While a classic algorithm constitutes a mathematical function that always calculates exactly the same result for the same input data, AI works with frequencies and probabilities. Generative systems usually draw their outputs at random according to these probabilities, and the probability values themselves change whenever a model is retrained or updated. Therefore, even with identical data input, such as a specific text prompt, an AI system will typically produce results that are more or less similar to each other, but rarely exactly the same. Strict reproducibility, as it is often required in a scientific context, can therefore generally not be guaranteed.

  • Sources and solution paths are often not comprehensible

    Most AI systems currently available on the market are largely considered to be ‘black boxes,’ meaning that it is not possible to understand in detail how they arrive at a particular result. No sources are cited (or incorrect sources are cited), nor are the solution paths made transparent. This is problematic as well, especially in a scientific context. This is why there are now a number of research projects focused on developing AI systems that can reliably and comprehensibly document their internal processes.

  • Lack of topicality

    Once the training of an AI model is complete, no new information is added to the knowledge base. If, for example, a chatbot was completed in 2023 and is asked about current events in 2025, the answers will necessarily be incorrect or outdated. This problem can only be circumvented if the AI system is continuously trained with new data during use, for example via a connection to the internet. However, if these current data are not quality-checked, the quality of the output data may also deteriorate. There is also a risk that potentially sensitive input data may spread uncontrolled.

  • Legal risks and ethical issues

    Both the mere use of AI systems and their training and development can raise a number of legal issues. This is particularly the case when personal or copyright-protected data is used for training, or when the AI application may have implications for safety and health. For more details, please refer to the following section on legal aspects. In addition, ethical issues arise from the fact that AI applications are particularly energy- and resource-intensive. Furthermore, the basic training of most commercial models has often relied on people from developing countries who were employed at low wages and under questionable conditions (‘click work’). The Weizenbaum Institute has published a discussion paper on this topic.
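The reproducibility item above rests on the fact that generative systems draw their output from a probability distribution. The following minimal sketch assumes an invented distribution over three possible next tokens; the names and values are purely illustrative. It shows that sampling without a fixed random seed can vary from run to run, whereas a fixed seed or always choosing the most probable token gives repeatable results.

```python
import random

# Invented probabilities for the token that follows some prompt.
NEXT_TOKEN_PROBS = {"results": 0.5, "findings": 0.3, "data": 0.2}


def sample_tokens(rng, n=5):
    """Draw n tokens at random according to NEXT_TOKEN_PROBS."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=n)


def greedy_token():
    """Always pick the most probable token (deterministic decoding)."""
    return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)


# Two runs with the same seed produce the same sequence.
run_a = sample_tokens(random.Random(42))
run_b = sample_tokens(random.Random(42))

# An unseeded generator will usually produce a different sequence each time.
run_c = sample_tokens(random.Random())

print(run_a == run_b, greedy_token(), run_c)
```

Whether this kind of control is available in practice depends on the system. It generally requires access to the seed and an unchanged model, which is more likely with a locally run model than with a hosted service.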

Legal aspects

Training AI models requires a huge amount of data, which only in very rare cases originates exclusively from the developers themselves. When data stems from other sources, however, the question of the legality of data processing quickly arises. This applies both to usage and exploitation rights and, as soon as personal data is involved, to data protection and personal rights. For security-related applications, e.g., in areas such as health, transportation, or critical infrastructure, stricter documentation requirements and technical security measures apply. However, anyone who simply uses AI systems also bears legal responsibility for ensuring that neither the input of data into the system nor the use of the output data infringes on the rights of third parties. In the European Union, the “Artificial Intelligence Act” governs the rights and obligations involved in the development, provision, and use of AI systems.

  • The Artificial Intelligence Act of the European Union

    With the AI Act, adopted in 2024, the European Union created the world's first legal regulation of artificial intelligence. It entered into force in August 2024 and becomes applicable in several stages between February 2025 and August 2027 (in certain cases of existing systems with transition periods until the end of 2030). It is intended to protect citizens from risks while also enabling the responsible and legally compliant development and use of AI systems. Similar to the General Data Protection Regulation, it is directly applicable law throughout the EU. The regulation distinguishes between different roles, each of which is associated with specific obligations. In the context of research, two roles are particularly relevant:

    • Provider (developer/manufacturer): Develops an AI system or a general-purpose AI model and places it on the market or makes it available.
    • Operator/user (e.g., companies and institutions, but not private end users): Uses an AI system under its own responsibility without significantly adapting or further training it. An operator who significantly modifies or improves an AI system, or even just operates it under a new name, is legally treated as a provider.

    The AI Act also distinguishes between risk classes of AI systems. While there are no special requirements for systems with minimal risk (e.g. AI-supported video games and spam filters), systems with unacceptable risk (e.g. AI systems for manipulating or illegally monitoring people) are completely prohibited. In between are applications with low and high risk. For low-risk applications, such as chatbots, providers and operators only need to make it transparent that AI is being used. Operators must also ensure that all persons who handle the AI system are adequately trained. There are additional obligations for high-risk AI systems. Examples include AI tools that affect the safety of products and critical infrastructure, or that process sensitive personal data. In such cases, the following applies:

    Providers must:

    • Develop a risk management system
    • Ensure an adequate quality (accuracy, representativeness) of training, validation and test data
    • Document the structure and functioning of the AI system and provide this technical documentation as proof of legal compliance
    • Design AI systems in such a way that they can automatically record security-related events and significant changes
    • Provide instructions for the legally compliant usage of AI systems for distributors/importers
    • Enable human oversight of the system by users
    • Design the system in such a way that it is sufficiently accurate, robust in operation and secured against cyber attacks
    • Develop a quality management system

    Operators must:

    • Take technical measures for secure operation, in accordance with the provider's instructions, including the automatic creation of log files
    • Ensure that all persons who handle the AI system are adequately informed and trained
    • Monitor the operation of the system and report observed vulnerabilities and malfunctions to the developer
    • Ensure that all provisions of the GDPR are complied with if personal data is processed

  • AI and copyrights

    The extent to which copyright-protected works may be used for training AI models without the explicit consent of the rights holders continues to be the subject of legal disputes. Even the limitation provision in § 60d of the German Copyright Act (UrhG), which permits text and data mining for non-commercial scientific purposes, does not grant carte blanche for unlimited use of protected content from the internet. Initial rulings on this matter (in particular Hamburg Regional Court, 27 September 2024, ref.: 310 O 227/23) currently only refer to the compilation of training data sets, but not to the use (and, where applicable, commercial exploitation) of AI models created with them. If you want to train your own AI models with data sets from external sources, try to obtain information about whether the data has been compiled lawfully and is free of third-party rights. If you collect data yourself through web crawling, make sure that the crawlers employed are able to recognise machine-readable usage restrictions on websites and comply with them.

    If you use generative AI systems to create text or graphics, for example, be aware that the output may sometimes be very similar to specific content from the training data, especially if the training data available for a specific task were limited. AI-generated works are generally not eligible for copyright protection, regardless of whether they meet the necessary threshold of originality, as they were not created by a natural person (i.e. a human being). However, if AI-generated content resembles protected works too closely, this may constitute a copyright infringement with legal consequences if the content is published. Legal responsibility lies with those who deploy an AI tool and utilise the results. If you download AI-generated content from the internet, please note that you must comply with the terms of use of the website operators, even if the content itself is in the public domain. For example, a platform may offer AI-generated images for download free of charge, but stipulate in its terms of use that the platform must be cited as the source whenever the images are utilised.

  • AI and data privacy/personal rights

    The GDPR and other data protection laws obviously also apply to AI applications. Several commercial AI models currently available have been trained with data that also contains personal information. In many cases, the legality of this data processing is controversial at best, as there was often no valid consent from the data subjects or any other legal basis. As with copyright-protected content, the problem is that the output data of an AI system can closely resemble specific training data, for example when generated images resemble real people. If you use generative AI, be very careful with generated photorealistic images, as it is difficult to assess how closely they resemble a real person. In the case of text output, a request such as ‘Create a CV for person XY’ could lead to results that reveal non-public information about a real person or contain false information.

    Under no circumstances should you enter sensitive personal data into an AI system, unless it is a closed, self-developed system under your sole and complete control. You cannot control whether and how this input is ultimately integrated into the AI model and may then be accessed by third parties. This risk cannot be eliminated completely, even if the developers have explicitly attempted to prevent this behaviour by means of programming and other technical measures. LUIS-operated LUHKI and Chat AI from GWDG (if an OpenAI model is selected) also send input prompts to Microsoft/OpenAI servers. The principle that legal responsibility lies (also) with those who use an AI tool and utilise the results also applies with regard to data protection.

    If you train AI models with personal data yourself, make sure to only use data for which the data subjects have given their valid consent to processing for the purpose of AI training, or for which there is another suitable legal basis. Comply with your information and documentation obligations under the GDPR and take appropriate technical and organisational protective measures. A data protection impact assessment may be necessary in some circumstances. If you process personal data using an AI system from an external provider, conclude a data processing agreement obliging the provider to comply with all GDPR requirements.
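Regarding the web-crawling advice in the copyright item above: one widely used machine-readable usage restriction is a website's robots.txt file. The following minimal sketch uses Python's standard urllib.robotparser module. The robots.txt content, the crawler name ExampleResearchBot and the URLs are invented for illustration. robots.txt is only one of several ways a website may declare restrictions, so passing this check does not by itself establish that the use is lawful.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content. A real crawler would fetch the file from the
# website, for example with parser.set_url(...) followed by parser.read().
ROBOTS_TXT = """\
User-agent: ExampleResearchBot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="ExampleResearchBot"):
    """Return True if the parsed robots.txt allows user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


allowed = may_fetch("https://example.org/articles/1")
blocked = may_fetch("https://example.org/private/data.csv")
other_bot = may_fetch("https://example.org/articles/1", user_agent="SomeOtherBot")
print(allowed, blocked, other_bot)
```

A crawler would call such a check before every request and skip any URL for which it returns False.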
