Swarthmore College Department of Computer Science

A Virtual Vocal Tract: A Novel Approach to Articulatory Speech Synthesis

Martina Costagliola, Tessa Jones, and Adam Lammert

Speech synthesis is the artificial production of human speech. Speech synthesizers provide a natural interface between humans and computers. State-of-the-art synthesizers rely on pre-recorded speech, making them accurate but inflexible. Articulatory synthesizers (i.e., virtual vocal tracts) produce speech the way humans do, and are flexible, controllable, and reconfigurable. Human speech results from the vibrating vocal folds (the "source") and the dynamic shape of the vocal tract (the "filter"). Lammert (2014) developed a model of vocal tract (VT) shape comprising only five concatenated tubes. This five-region model was based on real-time MRI (rtMRI) data and represents VT shapes with ~85% accuracy. The purpose of this project is to validate the five-region model by building an articulatory speech synthesizer capable of producing all English phonemes.

We targeted both articulatory and acoustic properties of each English phoneme. The primary articulatory property is the shape of the VT during the phoneme's production. We estimated these shapes from rtMRI data and other sources (Narayanan et al., 2014). In terms of acoustic properties, each phoneme has characteristic formant frequencies ("formants"), which are peaks in the speech spectrum and correspond to resonances of the VT. The formants of many phonemes are well known, but we also used the speech analysis tool Praat (Boersma, 2001) to calculate formants. Our own program uses both articulatory and acoustic properties to calculate VT shapes for each phoneme. It systematically searches through variations of each of the five constriction degrees, calculates the resultant formants, and selects the constriction values whose formants best match the targets.
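The search described above can be illustrated with a minimal sketch. It is not the program used in this project: it assumes a lossless concatenated-tube model with a closed glottis and an ideal open end at the lips, matches formants only (the articulatory shape estimates are not used here), and scores candidates by squared formant error. The function names, the region lengths, the candidate areas, and the speed of sound are all hypothetical choices made for illustration.

```python
import math
from itertools import product

SPEED_OF_SOUND = 35000.0  # cm/s; assumed value for warm, moist air


def chain_d(areas, lengths, freq):
    """D element of the lossless chain matrix from glottis to lips.

    Each section has the matrix [[cos, j*z*sin], [j*sin/z, cos]], so the
    product keeps the form [[a, j*b], [j*c, d]] with real a, b, c, d.
    """
    k = 2.0 * math.pi * freq / SPEED_OF_SOUND
    a, b, c, d = 1.0, 0.0, 0.0, 1.0  # identity matrix
    for area, length in zip(areas, lengths):
        cs, sn = math.cos(k * length), math.sin(k * length)
        z = 1.0 / area  # impedance up to a constant factor that cancels in d
        a2, b2, c2, d2 = cs, z * sn, sn / z, cs
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def formants(areas, lengths, n=3, f_max=5000.0, step=10.0):
    """First n resonances in Hz, found as zeros of D(f) on a frequency grid."""
    found = []
    f_prev, d_prev = step, chain_d(areas, lengths, step)
    for i in range(2, int(f_max / step) + 1):
        f_cur = i * step
        d_cur = chain_d(areas, lengths, f_cur)
        if d_cur == 0.0:
            found.append(f_cur)
        elif d_prev * d_cur < 0.0:
            # A sign change brackets a zero; refine by linear interpolation.
            found.append(f_prev + step * d_prev / (d_prev - d_cur))
        if len(found) == n:
            break
        f_prev, d_prev = f_cur, d_cur
    return found


def fit_shape(target, lengths, candidates):
    """Exhaustively try every combination of candidate areas per region."""
    best_areas, best_err = None, float("inf")
    for areas in product(candidates, repeat=len(lengths)):
        est = formants(areas, lengths, n=len(target))
        if len(est) < len(target):
            continue  # too few resonances below f_max to compare
        err = sum((e - t) ** 2 for e, t in zip(est, target))
        if err < best_err:
            best_areas, best_err = areas, err
    return best_areas, best_err
```

As a sanity check, `formants([3.0] * 5, [3.5] * 5)` models a uniform 17.5 cm tube and returns values close to 500, 1500, and 2500 Hz, the quarter-wavelength resonances of such a tube. Because this lossless model depends only on area ratios, several area combinations can produce identical formants, which is one reason to constrain the search with shape estimates as well.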

By targeting these properties with the five-region model, we successfully produced a full set of English vowels and diphthongs, most approximants, and certain stop consonants. We obtained our best results by starting from shape estimates and adding variation based on formant frequencies. Furthermore, we learned that synthesis is more realistic with context-dependent, nonlinear transitions between phoneme shapes than with linear transitions. In future work, we hope to synthesize the remainder of the English phonemes and further validate the five-region model of articulatory synthesis.
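The difference between linear and nonlinear transitions can be sketched as follows. The abstract does not specify which nonlinear function was used, so the raised-cosine easing below is an assumption chosen only because it starts and ends smoothly; the context dependence mentioned above is not modeled, and all names are hypothetical.

```python
import math


def linear_weight(t):
    """Constant-rate blend from 0 to 1."""
    return t


def cosine_weight(t):
    """Raised-cosine blend: slow start, fast middle, slow end (assumed form)."""
    return 0.5 * (1.0 - math.cos(math.pi * t))


def transition(shape_a, shape_b, steps, weight=cosine_weight):
    """Return steps + 1 intermediate shapes from shape_a to shape_b.

    Each shape is a list of five constriction degrees, one per region.
    """
    frames = []
    for i in range(steps + 1):
        w = weight(i / steps)
        frames.append([a + w * (b - a) for a, b in zip(shape_a, shape_b)])
    return frames
```

Passing `weight=linear_weight` gives the linear baseline, so the two transition styles can be compared on the same pair of phoneme shapes.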

Literature cited:
Boersma, P. (2001). Praat, a system for doing phonetics by computer.
Glot International 5:9/10, 341-345.

Lammert, A. & Narayanan, S. (2014). Development of a parametric basis for
vocal tract area function representation from a large speech production database.
J. Acoust. Soc. Am. 135, 2198.

Lammert, A., Goldstein, L. & Narayanan, S. (in preparation). Development
of a minimal regional model of the vocal tract from a large speech production database.
For submission to J. Acoust. Soc. Am.

Narayanan, S., Toutios, A., Ramanarayanan, V., Lammert, A., Kim, J., Nayak, K.,
Kim, Y.-C., Zhu, Y., Bresch, E., Goldstein, L., Byrd, D., Katsamanis, A.
& Proctor, M. (2014). Realtime magnetic resonance imaging and electromagnetic
articulography database for speech production research (TC).
J. Acoust. Soc. Am. 136(3), 1307-1311.