Research

IDEAS develops and applies data-driven methods to urgent challenges in environmental and life sciences—often involving large, heterogeneous datasets and high uncertainty.

Two application pillars

Life Sciences & Health — From biological data science to health-related applications, IDEAS projects address complex systems where data-driven models can accelerate discovery and decision-making.
Environmental Sciences — IDEAS tackles questions shaped by spatio-temporal dynamics, changing conditions, and non-i.i.d. data—where robustness and credible inference are critical.

Research topics

Hybrid Learning & Physics-Based AI — Integrating process-based models with ML.
Unstructured Data Mining & Harmonization — Using text, images, omics, citizen science.
Explainability & Credibility — Interpretable, trustworthy models; links to causal inference.
Robustness in Non-I.I.D. Data — Reliable learning under dataset shift and spatio-temporal correlation.
Generative AI / Foundation Models with Domain Knowledge — Generative models grounded in domain expertise.

Cross-cutting priorities

IDEAS emphasizes uncertainty quantification and improved digital processes in research data management to produce AI-ready FAIR data—essential for trust in scientific predictions.

Current research projects

Climate Disasters

Understanding the drivers of climate disaster impacts

Supervising PIs:
Prof. Dr. Jakob Zscheischler (UFZ) & Prof. Dr. Miguel Mahecha (Leipzig University)

Additional PIs:
Jun.-Prof. Dr. Marlene Kretschmer (Leipzig University)

Disciplines: climate science, environmental sciences, data science

Motivation and research questions:
Climate disasters regularly cause huge human and economic impacts. To improve current and future climate risk assessments, it is crucial to understand how factors such as climate hazard intensity, exposure, vulnerability, and environmental background conditions contribute to the observed impacts. Disentangling these contributions is a persistent challenge due to the limited availability of high-quality data on impacts and socio-economic conditions as well as the multitude of correlated and confounding drivers. Recently, a range of new datasets has become available, offering new opportunities to address this challenge. For instance, new datasets on when and where disasters occurred allow a more detailed assessment of hazard intensity or socio-economic conditions. Novel high-resolution socio-economic datasets and remote sensing-based datasets allow for an improved assessment of exposure and vulnerability. Yet key challenges remain. For instance, disaster impact data contains strong sampling biases. In commonly used national disaster databases, events (and therefore impacts) are only recorded once they exceed certain triggers. Furthermore, certain types of impacts are not well recorded in some countries, creating systematic data gaps that are difficult to identify or correct. Finally, the spatio-temporal nature of climate and environmental conditions violates the i.i.d. assumption of many machine-learning approaches, challenging the robustness and predictive capacity of trained models outside their training domain.

This project will develop robust data science approaches tailored to the data at hand to improve our understanding of the drivers of climate-related disaster impacts. In particular, interpretable machine learning approaches will be used to exploit available datasets to identify key drivers and driver interactions that contribute to disaster occurrence and impacts. Furthermore, robust cross-validation approaches tailored to the data structure will be applied to ensure trust in the findings.
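As an illustration of what cross-validation "tailored to the data structure" can mean for spatially correlated records, the following minimal sketch assigns whole latitude/longitude grid blocks to folds, so that neighbouring events never end up on both sides of a train/test split. The function names, the block size, and the example coordinates are hypothetical and chosen only for illustration; this is not the project's actual validation scheme.

```python
from collections import defaultdict


def spatial_block_folds(coords, block_size_deg, n_folds):
    """Group sample indices into folds by the lat/lon grid block they fall in."""
    blocks = defaultdict(list)
    for idx, (lat, lon) in enumerate(coords):
        # Samples in the same grid cell share a block key.
        key = (int(lat // block_size_deg), int(lon // block_size_deg))
        blocks[key].append(idx)

    # Deal whole blocks out to folds round-robin; a block is never split.
    folds = [[] for _ in range(n_folds)]
    for i, key in enumerate(sorted(blocks)):
        folds[i % n_folds].extend(blocks[key])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(test)


# Hypothetical event locations (lat, lon): three clusters of nearby events.
events = [(10.1, 20.2), (10.4, 20.9), (15.3, 20.1),
          (15.9, 20.5), (30.0, 40.0), (30.5, 40.5)]
folds = spatial_block_folds(events, block_size_deg=5, n_folds=3)
for train, test in train_test_splits(folds):
    print("train:", train, "test:", test)
```

Compared with a random split, holding out entire blocks gives a more conservative estimate of how a model performs outside its training domain, because spatially autocorrelated samples cannot leak between training and test sets.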

Keywords: climate disasters, hazards, vulnerability, non-i.i.d. data, domain shift

Decoding Protein Darkmatter

Decoding functions in the microbial dark matter: Towards protein classification and design through large language models

Supervising PIs:
Dr. Ulisses Nunes da Rocha (UFZ) & Prof. Dr. Peter Stadler (Leipzig University)

Additional PIs:
Prof. Dr. Jana Schor (UFZ)

Disciplines: Data Science, Bioinformatics, Microbial Ecology

Motivation and research questions:
This PhD project explores how modern machine-learning models can learn meaningful representations from biological sequences at very large scale. Using protein language models and extensive metagenomic datasets, the project investigates how evolutionary and ecological diversity affect generalization, robustness, and interpretability when models are applied to data that differ strongly from their training distributions. The work sits at the intersection of representation learning, large-scale data analysis, and computational biology, offering hands-on experience with foundation models, interdisciplinary research, and biologically relevant applications such as enzyme discovery and sustainability-related biotechnology.
The doctoral researcher will work with large-scale metagenomic datasets and state-of-the-art protein language models, supported by the UFZ’s EVE high-performance cluster and the ScaDS.AI data-science ecosystem. Their tasks will be to:

Develop and evaluate protein language models to study representation learning from biological sequence data.
Analyze large-scale metagenomic datasets spanning diverse evolutionary lineages and ecological environments.
Investigate model generalization, robustness, and uncertainty when applied to previously unseen protein families.
Compare model architectures and training strategies across ecologically distinct datasets.
Collaborate with experimental partners to validate computational predictions of protein function.
Participate in interdisciplinary training, scientific workshops, and international research events within IDEAS.

Keywords: Protein Language Models, Representation Learning, Explainable AI, Robustness under Distribution Shift, Hybrid Data–Theory Modeling, Microbial Functional Dark Matter

Estimating LLM Biodiversity

Biodiversity Estimation as a Lens into LLM Knowledge Content

Supervising PIs:
Prof. Justin Calabrese (CASUS) & Prof. Simon Razniewski (TU Dresden)

Disciplines: AI foundations, statistical ecology

Motivation and research questions:
Foundation models, in particular large language models (LLMs), have significantly advanced AI. A major contributor to their success is internalized knowledge, which is still poorly understood in quantitative terms. LLMs memorize significant amounts of factual knowledge, but there is no reliable quantification of the extent of this knowledge: known lower bounds (100 M facts) and naïve estimates of upper bounds (40 B facts) for frontier models like GPT-4 differ by orders of magnitude. Exhaustively probing LLMs is infeasible for both computational and monetary reasons.
In this project, we explore alternative approaches inspired by the study of biodiversity in ecology. We hypothesize that internalized knowledge in LLMs (hereafter “knowledge diversity”) can be viewed analogously to biodiversity in ecological communities. Ecology has decades of experience in developing both theories to explain biodiversity, and statistical approaches to quantify it from limited samples. In particular, named entities in LLMs can, under some circumstances, be considered analogous to individuals within a species. Furthermore, LLM characteristics that correlate with increased knowledge diversity, including number of model parameters, size of the training dataset, and the total amount of compute time can also be mapped onto ecological concepts that correlate with increased biodiversity such as number of resource types, size of the species pool, and amount of successional time, respectively.
Quantifying biodiversity in ecological communities typically involves estimating the total number of species (i.e., species richness) and the abundance of each species from a limited set of samples. Communities can then be characterized, compared, and ranked in terms of their species richness and patterns of relative species abundance. A myriad of richness and abundance estimators exist in the ecological literature, with each making different assumptions and being tailored to different types of data. Limited samples of named entities memorized by an LLM can be readily obtained, which, together with the above-described analogies, suggests the possibility of leveraging existing biodiversity estimation techniques to quantify knowledge diversity in LLMs. However, there is currently no work that explores which biodiversity estimators are most suitable, which estimator assumptions are most plausible for LLMs, how LLMs should be sampled optimally to maximize compatibility with biodiversity estimators, or which existing biodiversity estimators are computationally efficient enough to handle the large samples that can be extracted from LLMs.
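To make the idea of estimating richness from a limited sample concrete, here is a minimal sketch of one classical estimator from the ecological literature, the bias-corrected Chao1 estimator, which extrapolates the number of unseen species from the counts of species observed exactly once (f1) and exactly twice (f2). Treating each distinct entity string as a "species" and each sampled mention as an "individual" is an assumption made purely for illustration, as is the toy sample; which estimator and which mapping actually suit LLMs is exactly the open question the project raises.

```python
from collections import Counter


def chao1(sample):
    """Bias-corrected Chao1 richness estimate from a list of observed labels.

    S_chao1 = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)),
    where f1 and f2 are the numbers of labels seen exactly once and twice.
    """
    counts = Counter(sample)
    s_obs = len(counts)  # number of distinct labels observed
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample of entity names elicited from a model.
sample = ["Leipzig", "Leipzig", "Dresden", "Dresden",
          "Halle", "Jena", "Erfurt", "Leipzig"]
print(chao1(sample))  # 5 observed, f1 = 3, f2 = 1 -> 5 + 6/4 = 6.5
```

Many singletons relative to doubletons signal that much of the community is still unseen, so the estimate rises above the observed count; when no singletons remain, it equals the observed richness.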
Computer science frequently supplies theory and techniques that accelerate discovery in domain sciences like ecology. In this project, however, we look to a domain science to provide inspiration for quantifying the knowledge diversity of LLMs, which is a frontier problem in computer science. This approach could, for the first time, enable reliable estimates of the factual knowledge seen and memorized by LLMs, and therefore advance our understanding of the potentials and limitations of these models. For ecology, it could provide a stress test for estimation techniques on very large datasets, lead to improvements in the computational algorithms underpinning biodiversity estimators, and emphasize the wider relevance of statistical ecology beyond the core conservation science domain. This work therefore has the potential to significantly advance both computer science and ecology.

Keywords: Foundation models, Large language models (LLMs), biodiversity estimation

Agentic Cancer Care

An agentic AI approach to reduce overdiagnosis, overtreatment and unnecessary disease monitoring in prostate cancer

Supervising PIs:
Dr. Michael Bussmann (HZDR) & Prof. Dr. Gerik Scheuermann (Leipzig University) & Prof. Dr. DMSc. Michael Borre (Aarhus University Hospital, Denmark)

Additional PIs:
Dr. Johannes Thestrup Aksglæde (Aarhus University Hospital, Denmark)

Disciplines: Oncology, Urology, Multimodal AI + HPC in oncology

Motivation and research questions:
Overdiagnosis, overtreatment, and unnecessary disease monitoring are major challenges in prostate cancer and can be avoided for hundreds of thousands of men in Europe alone, while the disease incidence will rise dramatically towards 2040. Many men on Active Surveillance (AS) undergo unnecessary repeated imaging, biopsies, and clinical visits despite indolent disease. When treated surgically, patients experience unnecessary, severe long-term side effects from extended pelvic lymph node dissection based on simple nomograms developed on small cohorts with little external validation. Patients in medical treatment undergo expensive, intensive imaging without any clear benefit or data-driven guidelines. While registries can provide some insight, rich information on disease behavior is severely lacking. This information is embedded in unstructured electronic health record (EHR) text, radiological reports, raw scans (prostate MRI, PSMA PET/CT, CT-Bone), and longitudinal treatment trajectories, but it is not exploited systematically at scale in decision-making.
This project proposes to utilize and develop a local, large, multimodal AI agent and foundation model framework to read both unstructured EHR text and raw imaging data (prostate MRI and PSMA PET/CT) and construct a unified, computable, longitudinal disease trajectory, both retrospectively and in real time, for each patient evaluated for suspected or confirmed prostate cancer in all hospitals in Denmark. This trajectory will be harmonized to EAU UroEvidenceHUB (UEH)-aligned, structured datasets hosted at Helmholtz, enabling evaluation between Danish and European cohorts without sharing patient-level data and thus providing solid evidence to guide treatment.
We will then use the model-derived variables from these trajectories to address three clinically distinct but related overtreatment harms. Our core research questions are:

Can multimodal agents and foundation models, deployed locally on supercomputers, that ingest unstructured electronic health record (EHR) text and raw imaging data (including prostate MRI and PSMA PET/CT) reliably construct a unified, computable, longitudinal patient trajectory for men undergoing evaluation for suspected prostate cancer, with data quality and completeness comparable to expert manual curation, while ensuring harmonization with EAU-aligned structured datasets hosted at Helmholtz?
Can variables derived from these trajectories:
1. identify men on Active Surveillance who can safely avoid or discontinue further follow-up without an increased risk of clinically significant progression, and
2. drive development of an intelligent Briganti-like tool that helps avoid severe long-term side effects of surgery and radiotherapy by accurately predicting adverse pathological features and lymph-node involvement, based on integrated EHR data, prostate MRI, and PSMA PET/CT, and
3. predict which men with hormone-sensitive prostate cancer (HSPC) receiving systemic therapy (ADT ± ARPI ± chemotherapy) are unlikely to benefit from routine imaging (CT, bone scan, PSMA PET/CT), enabling personalized imaging de-escalation and development of adaptive, AI-based imaging strategies during systemic treatment, and
4. be trained, recalibrated, and externally validated in Helmholtz supercomputing infrastructures, where Helmholtz contributes EAU-aligned structured datasets?

Expected results:
Once implemented, the innovation is expected to change how cancer treatment is monitored, to be potentially scalable across other cancers, and to guide both treatment and data-driven decision-making at the European level through UEH.

Keywords: Prostate cancer treatment monitoring, LLM agents, foundation models, continuous model monitoring

Inequalities Climate Discourse

Inequalities in political attention to climate change through computational text analysis

Supervising PIs:
Dr. Mariana Madruga de Brito (UFZ) & Prof. Manuel Burghardt (Leipzig University)

Additional PIs:
Dr. Taís Maria Nunes Carvalho (UFZ) & Dr. Andreas Niekler (Leipzig University)

Disciplines: Environmental Sociology, Computer Science, Computational Social Sciences

Motivation and research question:
Political responses to climate change are shaped by inequalities in how climate disasters affecting different societal groups receive attention. While disasters in rich or geopolitically central regions attract political attention, equally severe disasters in poorer or marginalized countries often remain ignored. In this context, the political visibility (or lack thereof) influences diplomatic agendas, humanitarian priorities, and the allocation of aid and adaptation finance.
Evidence of these attention dynamics is embedded in large volumes of political text, including UN General Debate speeches (1970–2024) and national parliamentary debates (Germany, UK, USA). Yet these corpora are far too large to analyze manually. Computational and data-science methods are therefore essential for systematically investigating how climate change and its related disasters enter political discourse across different countries. Advanced text-mining techniques enable the identification of which hazards and disasters are mentioned, how they are framed, and how attention shifts over time.
To analyze how climate change is discussed and which related disasters are mentioned, this project will combine text-embedding-based clustering, classification models, and topic modeling. Specifically, we will (i) identify key frames and narratives discussed in political texts (e.g. humanitarian, security, development, responsibility), (ii) measure how different countries and societal groups are considered or not in the political discourse, and (iii) quantify topic attention over time. Furthermore, by linking these patterns in political discourse with real disaster impact data (EM-DAT, DesInventar) and development-finance datasets (OECD DAC, World Bank, national donor reports), we will assess how political attention matches disaster severity and examine the role of political debates in shaping aid allocation.
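Step (iii), quantifying topic attention over time, can be pictured with a deliberately simple sketch: the share of speeches per year that mention at least one term from a hazard vocabulary. Keyword matching is used here only as a stand-in for the embedding-based clustering and classification models named above; the function name, the keyword list, and the toy speeches are all hypothetical.

```python
import re
from collections import defaultdict


def attention_share(speeches, keywords):
    """Return {year: fraction of speeches mentioning at least one keyword}."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    totals = defaultdict(int)  # speeches per year
    hits = defaultdict(int)    # speeches per year with a keyword match
    for year, text in speeches:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical (year, speech text) pairs standing in for a debate corpus.
speeches = [
    (1990, "Trade and security dominate our agenda."),
    (1990, "The drought hit our farmers hard."),
    (2020, "Floods and drought are worsening."),
    (2020, "Climate change drives flood risk."),
]
shares = attention_share(speeches, ["drought", "flood", "floods"])
print(shares)
```

A yearly share of this kind is the sort of quantity that could then be set against disaster impact records for the same years and countries to check whether attention tracks severity.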
The project advances Environmental Sociology research by generating the first international, long-term analysis of political attention to climate change and related disasters. It will reveal which hazards and regions are consistently underrepresented in diplomatic and parliamentary debates. The project also contributes to Computational Social Science by developing innovative pipelines to address challenges in dealing with unstructured political texts.

Keywords: Text Mining, Large Language Models, Machine Learning, Discourse Analysis, Climate Policy

EXACT

Explainable Graph-based AI for Credible Toxicity Prediction

Supervising PIs:
Prof. Dr. Jana Schor (UFZ) & Prof. Dr. Peter F. Stadler (Leipzig University)

Additional PIs:
Jun.-Prof. Dr. Julia Westermayr (Leipzig University)

Disciplines: Computer Science; Computational Toxicology; Graph Theory; Environmental Sciences.

Project focus:
Chemical pollution is a key driver of the triple planetary crisis, yet the use of AI-based toxicity prediction in regulatory and environmental contexts is still limited by a lack of transparency and trust. This PhD project aims to develop explainable and uncertainty-aware graph-based AI models for chemical toxicity prediction. By combining graph theory, machine learning, and computational toxicology, the project seeks to make AI predictions interpretable and credible for real-world environmental decision-making.

Central research question:

How can graph-based AI models for chemical toxicity be designed, trained, and explained such that predictions from chemical structure become both accurate and credibly interpretable for use in computational and regulatory toxicology?

Key tasks and responsibilities

Curate and harmonize public chemical toxicity datasets (e.g. Tox21, ECOTOX, NORMAN SusDat)
Design and train graph neural network models for toxicity prediction
Develop methods for explainability and uncertainty quantification in graph-based models
Link model explanations to chemically meaningful substructures and toxicological concepts
Benchmark models and explanations using real-world toxicological case studies
Publish results in interdisciplinary journals and present them at international conferences

Keywords: Graph neural networks, Explainable AI, Computational Toxicology, Uncertainty quantification

TrustSeg

Trustworthy AI for Clinical Image Segmentation

Supervising PIs:
Prof. Dr. Steffen Löck (HZDR) & Prof. Dr. Stefanie Speidel (TU Dresden)

Disciplines: Computer vision, deep learning, translational cancer research (radiotherapy, surgery), medical physics, human-AI interaction, trustworthy and explainable AI

Motivation and research question:
Artificial intelligence is increasingly used to support clinical decisions in radiotherapy and surgery, yet current segmentation models can fail silently in unfamiliar or safety-critical situations. This PhD project addresses this challenge by developing uncertainty-aware and trustworthy segmentation methods that make model confidence explicit and help clinicians make safer, better-informed decisions.

The central research question is:

How can we design and integrate uncertainty-aware segmentation models that provide reliable confidence estimates and improve safety and decision-making in radiotherapy planning and surgical workflows?

Key Tasks and Responsibilities

Develop deep learning–based segmentation models with integrated uncertainty quantification
Design and evaluate methods to visualize and communicate uncertainty in clinical workflows
Apply and validate the developed approaches in radiotherapy planning and surgical imaging/video scenarios
Collaborate closely with clinicians, medical physicists, and computer scientists in an interdisciplinary environment
Publish research results in peer-reviewed journals and at international conferences
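
One common way to make segmentation confidence explicit, shown here only as a minimal sketch, is to average the foreground probabilities of several model runs (for example an ensemble) and compute the per-pixel predictive entropy of the averaged prediction; pixels with high entropy can then be flagged for clinician review. The function names, the review threshold, and the tiny 2×2 probability maps are hypothetical, and the project may well use different uncertainty measures.

```python
import math


def mean_foreground_prob(prob_maps):
    """Average per-pixel foreground probabilities over several model runs."""
    n = len(prob_maps)
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    return [[sum(m[r][c] for m in prob_maps) / n for c in range(cols)]
            for r in range(rows)]


def binary_entropy(p):
    """Entropy in bits of a foreground/background prediction with P(fg) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no entropy
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def uncertainty_map(prob_maps):
    """Per-pixel predictive entropy of the averaged prediction."""
    return [[binary_entropy(p) for p in row]
            for row in mean_foreground_prob(prob_maps)]


def flag_for_review(prob_maps, threshold=0.8):
    """Return (row, col) positions whose entropy reaches the threshold."""
    umap = uncertainty_map(prob_maps)
    return [(r, c) for r, row in enumerate(umap)
            for c, h in enumerate(row) if h >= threshold]


# Two hypothetical model runs on a 2x2 image; they disagree in the right column.
runs = [
    [[1.0, 0.9], [0.0, 0.2]],
    [[1.0, 0.1], [0.0, 0.8]],
]
print(flag_for_review(runs))
```

Pixels where the runs agree receive zero entropy, while pixels where they disagree approach the maximum of one bit, which is what makes the map usable as a visual overlay in a clinical workflow.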

Keywords: Segmentation, uncertainty quantification, spatiotemporal modelling, radiotherapy planning, surgical navigation, clinical decision support, uncertainty visualization, trust and usability
