Datasets

This page provides an overview of datasets developed and curated as part of my research. The collection reflects work on spoken dialogue, voice assistants, health-related speech, and multimodal interaction.

Some datasets are publicly documented and partially accessible, while others are available only under restricted access due to ethical, legal, or privacy-related constraints.

GerParlDia-MM

A multimodal corpus of German parliamentary speeches (1949–2025), designed for longitudinal research on voice, language, and rhetorical change across decades.

Focus: Voice, longitudinal age shift, political speeches
Modalities: Audio, video, transcripts, metadata
Access: By request
More information

Queer Waves

A dataset related to voice, identity, and queer perspectives in speech communication research. The dataset supports research on inclusive speech technologies and sociophonetic or sociotechnical questions.

Focus: Voice, identity, diversity, and inclusion
Modalities: Audio, transcripts, metadata
Access: By request
More information

CCC – Common Cold Corpus

A speech corpus designed for the analysis of cold-related voice changes and health-related speech phenomena. It supports work on speech under physiological variation and health-aware speech processing.

Focus: Cold speech, health-related vocal variation
Modalities: Audio, transcripts, metadata
Access: Restricted / research use
More information

RBC – Restaurant Booking Corpus

A corpus for spoken dialogue research in the context of restaurant booking scenarios. It supports work on task-oriented dialogue, spoken language understanding, and conversational system evaluation.

Focus: Task-oriented spoken dialogue
Modalities: Speech, transcripts, dialogue annotations
Access: Depends on data components

VACC – Voice Assistant Conversations Corpus

A dataset capturing voice assistant interactions in more naturalistic or real-world settings. It is particularly relevant for studying spontaneous use, conversational patterns, and device-directed speech in everyday environments.

Focus: In-the-wild voice assistant interaction
Modalities: Audio, transcripts, contextual annotations
Access: Restricted / research use

VAWC – Voice Assistant Conversations in the Wild

Focus: In-the-wild voice assistant interaction
Modalities: Audio, transcripts, contextual annotations
Access: Restricted / research use

iGF-Corpus – Integrated Health and Fitness Corpus

A corpus developed in the context of health, fitness, and speech-related behavioral or physiological data. It supports multimodal analyses at the intersection of speech, activity, and health-related signals.

Focus: Health, fitness, multimodal behavior
Modalities: Multimodal
Access: Restricted / research use

LMC – Last Minute Corpus

A corpus related to spontaneous, time-constrained, or dynamically produced speech. It can support analyses of urgency, spontaneity, and speech behavior in less scripted interaction scenarios.

Focus: Spontaneous or last-minute speech production
Modalities: Audio, video, biomarkers (partly), transcripts
Access: Restricted / research use

Notes on Access and Reuse

Please note that not all datasets can be publicly redistributed in full. In several cases, access to audio, transcripts, or annotations is restricted due to consent conditions, privacy considerations, or ethical constraints.

Where possible, this website provides dataset descriptions, references to related publications, links to code or project pages, and information on how to request access.

If you are interested in a specific dataset, please refer to the corresponding link.