VOICE AND SPEECH BASED MACHINE LEARNING APPROACH FOR EARLY DETECTION OF PARKINSON’S DISEASE.

HOSSAIN, MOHAMMAD AMRAN
2026-03-19

Abstract

Background: Parkinson’s disease (PD) is a progressive neurodegenerative disorder characterized by motor and non-motor symptoms that significantly affect quality of life. Speech and voice impairments are among the earliest indicators of PD, often emerging during the prodromal phase before overt motor dysfunction becomes apparent. These vocal alterations, which affect pitch, articulation, intensity, and fluency, represent potential non-invasive biomarkers for early detection. Despite significant advances in medical technology, early and reliable diagnosis of PD remains challenging because symptoms begin subtly and overlap with normal aging or other neurological conditions. Speech analysis powered by machine learning (ML) and deep learning (DL) offers a promising avenue for developing cost-effective, scalable, and language-independent diagnostic tools. The objective of this study is to evaluate the potential of ML and DL approaches for identifying PD speech characteristics across multilingual datasets, and to develop scalable, language-neutral diagnostic tools for early detection and continuous monitoring of PD patients.

Methods: This thesis employed a mixed-methods approach that integrated a systematic literature review with comprehensive experimental analyses. The systematic review was conducted in accordance with the PRISMA guidelines and covered literature published over the past decade. It identified current advancements, methodological limitations, and research gaps in speech-based PD detection, and it informed the experimental design and the feature selection strategies used during analysis. Experimental evaluations were conducted using multilingual speech datasets in English (MDVR-KCL), Italian (IPVS), and Turkish, encompassing both reading and conversational speech tasks. Acoustic, prosodic, and cepstral features, including fundamental frequency (F0), jitter, shimmer, harmonics-to-noise ratio (HNR), Mel-frequency cepstral coefficients (MFCCs), and gammatone cepstral coefficients (GTCCs), were extracted using signal processing tools such as Praat and Librosa. Speaker diarization and spectro-temporal representations were employed to capture naturalistic speech patterns. Several supervised ML algorithms were trained, optimized, and evaluated using standard performance metrics. Additionally, a two-dimensional convolutional neural network (CNN-2D) and a residual network (ResNet-18) were implemented for deep learning-based feature learning from spectrogram representations.

Results: The experimental findings demonstrated that both ML and DL approaches can effectively capture speech-related biomarkers indicative of PD across multiple languages and speech tasks. For the Turkish dataset, AdaBoost combined with a feature selection technique achieved the best overall performance, reaching an accuracy of 85.09%, precision of 0.92, sensitivity of 0.90, and an F1-score of 0.91, which highlights the importance of targeted feature selection in improving diagnostic accuracy. In the reading task of the English dataset (MDVR-KCL), the SVM model attained an accuracy of 95.45%, sensitivity of 94.62%, specificity of 95.97%, F1-score of 94.12%, ROC-AUC of 0.98, and MCC of 0.90 using the optimal feature combination of acoustic, prosodic, and GTCC features. In the conversational speech task, the XGB model achieved the highest accuracy of 83.7%, with sensitivity of 76.3%, specificity of 88.9%, and an F1-score of 79.5%, demonstrating strong generalization on spontaneous speech. On the Italian dataset, the SVM model integrating acoustic and GTCC features achieved the strongest results across nearly all metrics: a test accuracy of 94.68%, sensitivity of 94.37%, specificity of 95.04%, precision of 95.71%, F1-score of 96.04%, ROC-AUC of 0.98, and MCC of 0.89. These outcomes underscore the efficacy of feature-level fusion in enhancing model robustness and generalizability across different speaking tasks and recording conditions. In the cross-linguistic evaluation, models trained on English data and tested on Italian data (and vice versa) showed varying levels of performance, revealing the linguistic sensitivity of PD-related features. The XGB model achieved the highest accuracy when trained on English and evaluated on Italian data (accuracy: 81.13%, sensitivity: 78.68%, precision: 88.40%, F1-score: 83.34%, ROC-AUC: 0.88). When the model was trained on Italian and tested on English, performance declined, particularly in sensitivity (accuracy: 75.01%, sensitivity: 57.44%, precision: 94.26%, F1-score: 65.36%, ROC-AUC: 0.73). These results indicate that while cross-linguistic generalization is feasible, linguistic and phonetic variability poses inherent challenges that require further exploration. For the deep learning approaches, both CNN-2D and ResNet-18 were evaluated on spectrogram inputs. CNN-2D achieved strong performance (accuracy: 0.95, precision: 0.99, recall: 0.95, F1-score: 0.97, ROC-AUC: 0.99), while ResNet-18 demonstrated marginally superior results (accuracy: 0.97, precision: 1.00, recall: 0.96, F1-score: 0.98, ROC-AUC: 1.00), highlighting the potential of deeper architectures in capturing complex spectro-temporal dependencies in PD speech.

Conclusion: This study demonstrates that speech carries reliable, quantifiable biomarkers of PD that can be effectively identified using machine learning. The proposed frameworks achieve high accuracy across multiple languages, speech tasks, and datasets, supporting the development of scalable and non-invasive diagnostic tools. The outcomes advance understanding of speech-based biomarkers for PD and support future implementation of automated voice analysis as a viable tool for early detection, disease monitoring, and telemedicine-based management of neurodegenerative disorders.
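
For readers less familiar with the metrics reported in the abstract (accuracy, sensitivity, specificity, precision, F1-score, and MCC), the following is a minimal sketch of how they are conventionally derived from binary confusion-matrix counts. It is not the thesis code: the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: PD speakers classified correctly/incorrectly (hypothetical labels).
    tn/fp: healthy controls classified correctly/incorrectly.
    """
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)   # recall, true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)     # positive predictive value
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient, which uses all four cells.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not taken from the thesis).
example = binary_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because F1 depends only on precision and sensitivity, while MCC also accounts for true negatives, the two can diverge on imbalanced data, which is one reason for reporting both.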
19-mar-2026
Computer Science and Mathematics
Parkinson’s disease (PD); Machine learning (ML); Voice; Speech; Artificial Intelligence (AI).
AMENTA, Francesco

Use this identifier to cite or link to this document: https://hdl.handle.net/11581/501031