Datasets
This page gives an overview of the datasets developed and curated as part of my research. The collection covers work on spoken dialogue, voice assistants, health-related speech, and multimodal interaction.
Some datasets are publicly documented and partially accessible, while others are available only under restricted access due to ethical, legal, or privacy-related constraints.
GerParlDia-MM
A multimodal corpus of German parliamentary speeches (1949-2025), designed for longitudinal research on voice, language, and rhetorical change across decades.
- Focus: Voice, longitudinal age shift, political speeches
- Modalities: Audio, Video, Transcripts, Metadata
- Access: by request
- More information
Queer Waves
A dataset on voice, identity, and queer perspectives in speech communication research. It supports research on inclusive speech technologies and on sociophonetic or sociotechnical questions.
- Focus: Voice, identity, diversity, and inclusion
- Modalities: Audio, Transcripts, Metadata
- Access: by request
- More information
CCC – Common Cold Corpus
A speech corpus designed for the analysis of cold-related voice changes and health-related speech phenomena. It supports work on speech under physiological variation and health-aware speech processing.
- Focus: Cold speech, health-related vocal variation
- Modalities: Audio, transcripts, metadata
- Access: Restricted / research use
- More information
RBC – Restaurant Booking Corpus
A corpus for spoken dialogue research in the context of restaurant booking scenarios. It supports work on task-oriented dialogue, spoken language understanding, and conversational system evaluation.
- Focus: Task-oriented spoken dialogue
- Modalities: Speech, transcripts, dialogue annotations
- Access: Depends on data components
VACC – Voice Assistant Conversations Corpus
A dataset capturing voice assistant interactions in more naturalistic or real-world settings. It is particularly relevant for studying spontaneous use, conversational patterns, and device-directed speech in everyday environments.
- Focus: In-the-wild voice assistant interaction
- Modalities: Audio, transcripts, contextual annotations
- Access: Restricted / research use
VAWC – Voice Assistant Conversations in the Wild
A dataset capturing voice assistant interactions in more naturalistic or real-world settings. It is particularly relevant for studying spontaneous use, conversational patterns, and device-directed speech in everyday environments.
- Focus: In-the-wild voice assistant interaction
- Modalities: Audio, transcripts, contextual annotations
- Access: Restricted / research use
iGF-Corpus – Integrated Health and Fitness Corpus
A corpus developed in the context of health, fitness, and speech-related behavioral or physiological data. It supports multimodal analyses at the intersection of speech, activity, and health-related signals.
- Focus: Health, fitness, multimodal behavior
- Modalities: Multimodal
- Access: Restricted / research use
LMC – Last Minute Corpus
A corpus related to spontaneous, time-constrained, or dynamically produced speech. It can support analyses of urgency, spontaneity, and speech behavior in less scripted interaction scenarios.
- Focus: Spontaneous or last-minute speech production
- Modalities: Audio, video, biomarkers (partly), transcripts
- Access: Restricted / research use
Notes on Access and Reuse
Please note that not all datasets can be publicly redistributed in full. In several cases, access to audio, transcripts, or annotations is restricted due to consent conditions, privacy considerations, or ethical constraints.
Where possible, this website provides:
- dataset descriptions,
- references to related publications,
- links to code or project pages,
- and information on how to request access.
If you are interested in a specific dataset, please refer to the corresponding link.