TY - JOUR
T1 - Enhancing analysis of diadochokinetic speech using deep neural networks
AU - Segal-Feldman, Yael
AU - Hitczenko, Kasia
AU - Goldrick, Matthew
AU - Buchwald, Adam
AU - Roberts, Angela
AU - Keshet, Joseph
N1 - Publisher Copyright:
© 2024
PY - 2025/3
Y1 - 2025/3
AB - Diadochokinetic (DDK) speech tasks involve the repetitive production of consonant-vowel syllables. These tasks are useful for detecting impairment, supporting differential diagnosis, and monitoring progress in speech-motor disorders. However, manual analysis of these tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models that operate on the raw waveform to automatically segment stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a feature extraction module, replacing conventional signal-processing features; architectures applied for this extraction include convolutional neural networks (CNNs) and large self-supervised models such as HuBERT. A decoder model then uses the derived embeddings to classify each frame. The paper accordingly studies diverse decoder architectures, ranging from linear layers to LSTMs, CNNs, and transformers. These architectures are assessed on their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and on an unseen dataset of older individuals with Parkinson's Disease. The results show that an LSTM model outperforms all other models on both datasets and is comparable to trained human annotators.
KW - DDK
KW - Deep neural networks
KW - Diadochokinetic speech
KW - Parkinson's Disease
KW - Voice onset time
KW - Vowel duration
UR - http://www.scopus.com/inward/record.url?scp=85203266819&partnerID=8YFLogxK
DO - 10.1016/j.csl.2024.101715
M3 - Article
AN - SCOPUS:85203266819
SN - 0885-2308
VL - 90
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101715
ER -