Enhancing analysis of diadochokinetic speech using deep neural networks

Yael Segal-Feldman, Kasia Hitczenko, Matthew Goldrick, Adam Buchwald, Angela Roberts, Joseph Keshet

Research output: Contribution to journal › Article › peer-review

Abstract

Diadochokinetic speech tasks (DDK) involve the repetitive production of consonant-vowel syllables. These tasks are useful for detecting impairments, differential diagnosis, and monitoring progress in speech-motor impairments. However, manual analysis of these tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models that operate on the raw waveform for the automatic segmentation of stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a feature extractor module, replacing conventional signal processing features. In this context, diverse deep learning architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for the extraction process. A decoder model uses the derived embeddings to identify frame types. For the decoder, the paper studies diverse architectures, ranging from linear layers and LSTMs to CNNs and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson's Disease. The results reveal that an LSTM model performs better than all other models on both datasets and is comparable to trained human annotators.
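The sketch below illustrates the encoder-decoder idea described in the abstract: a convolutional encoder over the raw waveform produces frame-level embeddings, and an LSTM decoder labels each frame. It is not the authors' code; all layer sizes, the three-class label set (e.g., silence / stop / vowel), and the implied frame rate are illustrative assumptions.

```python
# Minimal sketch of a raw-waveform encoder + LSTM decoder for frame labeling.
# Hyperparameters and the 3-class label set are assumptions, not paper values.
import torch
import torch.nn as nn


class DDKSegmenter(nn.Module):
    def __init__(self, n_classes: int = 3, hidden: int = 256):
        super().__init__()
        # CNN encoder: strided 1-D convolutions map the raw waveform
        # to a sequence of frame-level embeddings.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=4, stride=2), nn.ReLU(),
        )
        # LSTM decoder: reads the embedding sequence and emits a label per frame.
        self.decoder = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) raw audio
        feats = self.encoder(waveform.unsqueeze(1))   # (batch, 256, frames)
        feats = feats.transpose(1, 2)                 # (batch, frames, 256)
        out, _ = self.decoder(feats)                  # (batch, frames, 2*hidden)
        return self.classifier(out)                   # per-frame class logits


# Usage: frame-wise logits for one second of 16 kHz audio.
model = DDKSegmenter()
logits = model(torch.randn(1, 16000))
print(logits.shape)  # (1, n_frames, 3)
```

In the paper's setup the CNN encoder could be swapped for a pretrained self-supervised model such as HuBERT, and the LSTM decoder for a linear, CNN, or transformer head; predicted frame labels are then converted into segment boundaries, durations, and speech rate.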

Original language: English
Article number: 101715
Journal: Computer Speech and Language
Volume: 90
State: Published - Mar 2025

Keywords

  • DDK
  • Deep neural networks
  • Diadochokinetic speech
  • Parkinson's Disease
  • Voice onset time
  • Vowel duration

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Human-Computer Interaction
