Abstract
The goal of this study was to investigate the performance of different feature types for voice quality classification using multiple classifiers. The study compared the COVAREP feature set, which included glottal source features, frequency warped cepstrum, and harmonic model features, against the mel-frequency cepstral coefficients (MFCCs) computed from the acoustic voice signal, acoustic-based glottal inverse filtered (GIF) waveform, and electroglottographic (EGG) waveform. Our hypothesis was that MFCCs can capture the perceived voice quality from any of these three voice signals. Experiments were carried out on recordings from 28 participants with normal vocal status who were prompted to sustain vowels with modal and nonmodal voice qualities. Recordings were rated by an expert listener using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), and the ratings were transformed into a dichotomous label (presence or absence) for the prompted voice qualities of modal voice, breathiness, strain, and roughness. The classification was done using support vector machines, random forests, deep neural networks, and Gaussian mixture model classifiers, which were built as speaker independent using a leave-one-speaker-out strategy. The best classification accuracy of 79.97% was achieved for the full COVAREP set. The harmonic model features were the best performing subset, with 78.47% accuracy, and the static+dynamic MFCCs scored at 74.52%. A closer analysis showed that MFCC and dynamic MFCC features were able to classify modal, breathy, and strained voice quality dimensions from the acoustic and GIF waveforms. Reduced classification performance was exhibited by the EGG waveform.
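The leave-one-speaker-out strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy one-dimensional features, and the nearest-centroid classifier are all assumptions chosen only to keep the example self-contained. The point it shows is that every recording from the held-out speaker is excluded from training, which is what makes the evaluation speaker independent.

```python
from collections import defaultdict


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_idx, test_idx), one fold per speaker."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    for spk in sorted(by_speaker):
        test_idx = by_speaker[spk]
        train_idx = [i for i, s in enumerate(speaker_ids) if s != spk]
        yield spk, train_idx, test_idx


def fit_centroids(xs, ys):
    """Hypothetical stand-in classifier: mean feature value per label."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in zip(xs, ys):
        sums[y][0] += x
        sums[y][1] += 1
    return {y: total / n for y, (total, n) in sums.items()}


def predict_centroid(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def loso_accuracy(features, labels, speaker_ids, fit, predict):
    """Pool correct predictions over all held-out speakers."""
    correct = 0
    for _, train_idx, test_idx in leave_one_speaker_out(speaker_ids):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in test_idx:
            correct += predict(model, features[i]) == labels[i]
    return correct / len(labels)


# Toy data (invented for illustration): three speakers, two recordings each.
# Label 1 marks presence of a voice quality, label 0 its absence.
speakers = ["A", "A", "B", "B", "C", "C"]
features = [0.1, 0.9, 0.2, 0.8, 0.0, 1.0]
labels = [0, 1, 0, 1, 0, 1]

folds = list(leave_one_speaker_out(speakers))
accuracy = loso_accuracy(features, labels, speakers,
                         fit_centroids, predict_centroid)
print(len(folds), accuracy)
```

With 28 participants, as in the study, the same loop would produce 28 folds; any of the classifiers named in the abstract could be substituted for the toy `fit` and `predict` pair.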
| Original language | English |
|---|---|
| Article number | 8114356 |
| Pages (from-to) | 2281-2291 |
| Number of pages | 11 |
| Journal | IEEE/ACM Transactions on Audio Speech and Language Processing |
| Volume | 25 |
| Issue number | 12 |
| DOIs | |
| Publication status | Published - Dec 2017 |
Bibliographical note
Funding Information: Manuscript received December 15, 2016; revised August 3, 2017 and September 18, 2017; accepted September 28, 2017. Date of current version November 27, 2017. This work was supported in part by The Icelandic Centre for Research (RANNIS) under the project Model-Based Speech Production Analysis and Voice Quality Assessment, under Grant 152705-051 and in part by the Voice Health Institute and the National Institutes of Health National Institute on Deafness and Other Communication Disorders under Grants R21 DC011588, R33 DC011588, and P50 DC015446. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Dean J. Krusienski. (Corresponding author: Michal Borsky.) M. Borsky and J. Gudnason are with the School of Science and Engineering, Haskolinn i Reykjavik, Reykjavik 110, Iceland. Publisher Copyright: © 2017 IEEE.
Other keywords
- Acoustics
- COVAREP
- Consensus auditory-perceptual evaluation of voice
- Electroglottograph
- Glottal inverse filtering
- Mel-frequency cepstral coefficients
- Modal voice
- Non-modal voice
- Voice quality assessment
Fingerprint
Dive into the research topics of 'Modal and nonmodal voice quality classification using acoustic and electroglottographic features'. Together they form a unique fingerprint.