TY - JOUR
T1 - Modal and nonmodal voice quality classification using acoustic and electroglottographic features
AU - Borsky, Michal
AU - Mehta, Daryush D.
AU - Van Stan, Jarrad H.
AU - Gudnason, Jon
N1 - Funding Information: Manuscript received December 15, 2016; revised August 3, 2017 and September 18, 2017; accepted September 28, 2017. Date of current version November 27, 2017. This work was supported in part by The Icelandic Centre for Research (RANNIS) under the project Model-Based Speech Production Analysis and Voice Quality Assessment, under Grant 152705-051 and in part by the Voice Health Institute and the National Institutes of Health National Institute on Deafness and Other Communication Disorders under Grants R21 DC011588, R33 DC011588, and P50 DC015446. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Dean J. Krusienski. (Corresponding author: Michal Borsky.) M. Borsky and J. Gudnason are with the School of Science and Engineering, Haskolinn i Reykjavik, Reykjavik 110, Iceland (e-mail: [email protected]; [email protected]). Publisher Copyright: © 2017 IEEE.
PY - 2017/12
Y1 - 2017/12
AB - The goal of this study was to investigate the performance of different feature types for voice quality classification using multiple classifiers. The study compared the COVAREP feature set (which included glottal source features, frequency-warped cepstrum, and harmonic model features) against the mel-frequency cepstral coefficients (MFCCs) computed from the acoustic voice signal, acoustic-based glottal inverse filtered (GIF) waveform, and electroglottographic (EGG) waveform. Our hypothesis was that MFCCs can capture the perceived voice quality from any of these three voice signals. Experiments were carried out on recordings from 28 participants with normal vocal status who were prompted to sustain vowels with modal and nonmodal voice qualities. Recordings were rated by an expert listener using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), and the ratings were transformed into a dichotomous label (presence or absence) for the prompted voice qualities of modal voice, breathiness, strain, and roughness. The classification was done using support vector machines, random forests, deep neural networks, and Gaussian mixture model classifiers, which were built as speaker independent using a leave-one-speaker-out strategy. The best classification accuracy of 79.97% was achieved for the full COVAREP set. The harmonic model features were the best performing subset, with 78.47% accuracy, and the static+dynamic MFCCs scored at 74.52%. A closer analysis showed that MFCC and dynamic MFCC features were able to classify modal, breathy, and strained voice quality dimensions from the acoustic and GIF waveforms. Reduced classification performance was exhibited by the EGG waveform.
KW - Acoustics
KW - COVAREP
KW - Consensus auditory-perceptual evaluation of voice
KW - Electroglottograph
KW - Glottal inverse filtering
KW - Mel-frequency cepstral coefficients
KW - Modal voice
KW - Non-modal voice
KW - Voice quality assessment
UR - https://www.scopus.com/pages/publications/85053341388
U2 - 10.1109/TASLP.2017.2759002
DO - 10.1109/TASLP.2017.2759002
M3 - Article
SN - 2329-9290
VL - 25
SP - 2281
EP - 2291
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
IS - 12
M1 - 8114356
ER -