Voice Activity Detection Considerations in a Dialog Agent for Dysarthric Speakers


Conversational dialog technology is increasingly recognized as a useful means of automating remote patient monitoring and diagnostics for dysarthric speakers at scale. However, the characteristics of dysarthric speech pose challenges for several speech processing components of such systems. This paper focuses on the voice activity detection (VAD) component of a cloud-based multimodal dialog agent for monitoring patients with Amyotrophic Lateral Sclerosis (ALS). We describe our baseline VAD setup and its configurable parameters, and analyze its online performance against human gold annotations on dialogs collected from ALS patients. We further inspect the differences in system performance between patients and healthy controls across multiple speech tasks to better understand the constraints of the system. Using simulation experiments, we find parameter settings that minimize the NIST detection cost function (DCF) for VAD, thereby improving system performance and user experience.
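
As a rough illustration of the tuning objective, the sketch below computes a frame-level DCF and picks the parameter value that minimizes it. It assumes the weighting used in the NIST OpenSAT speech activity detection evaluations (0.75 on the miss rate, 0.25 on the false-alarm rate); the weights used in this work may differ. The function names, the toy energy-threshold detector, and the candidate values are hypothetical and stand in for the system's actual configurable parameters.

```python
from typing import Sequence


def vad_dcf(
    reference: Sequence[bool],
    hypothesis: Sequence[bool],
    miss_weight: float = 0.75,  # assumed weight, as in NIST OpenSAT SAD
    fa_weight: float = 0.25,    # assumed weight, as in NIST OpenSAT SAD
) -> float:
    """Frame-level detection cost: weighted sum of miss and false-alarm rates."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    speech = sum(1 for r in reference if r)
    nonspeech = len(reference) - speech
    missed = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if not r and h)

    # Guard against empty classes so the rates are always defined.
    p_miss = missed / speech if speech else 0.0
    p_fa = false_alarms / nonspeech if nonspeech else 0.0
    return miss_weight * p_miss + fa_weight * p_fa


def best_threshold(
    energies: Sequence[float],
    reference: Sequence[bool],
    candidates: Sequence[float],
) -> float:
    """Return the candidate threshold with the lowest DCF (hypothetical toy VAD)."""
    return min(
        candidates,
        key=lambda t: vad_dcf(reference, [e >= t for e in energies]),
    )


if __name__ == "__main__":
    reference = [True, True, True, True, False, False, False, False]
    energies = [0.9, 0.8, 0.4, 0.7, 0.1, 0.3, 0.2, 0.5]
    print(best_threshold(energies, reference, [0.2, 0.4, 0.6]))
```

Weighting misses more heavily than false alarms reflects the cost of cutting off a speaker mid-utterance, which is a particular concern when speech is slow or contains long pauses.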

In Proceedings of the 12th International Workshop on Spoken Dialog Systems (IWSDS)