Speaking with Computer Generated Voices
While creating Yet Another Choose Your Own Adventure I took a detour into the world of computer-generated voices. As Chris Maury noted in his post, the quality of algorithmically generated voices is improving and I wanted to learn more about the state of the industry.
Pop quiz! Think of a computer-generated voice you heard recently. Apple’s Siri? Portal’s GLaDOS? The Mac OS `say` command?
While GLaDOS was voiced by an actual human, `say` and Siri both have their roots in a company called Nuance, which merged with Kurzweil’s ScanSoft in 2005. Apple reportedly licenses Nuance's technology rather than owning the company, and the OS X voice named Samantha sounds like the same voice used for Siri. You can hear more sample voices from Nuance, and the general quality is quite good. I didn’t want to pick Siri as the Choose Your Own Adventure voice because she has become a bit too cliché. Commercial applications were either prohibitively expensive or lacked the proper API for a weekend hack, so I dug into some of the research communities around text-to-speech (TTS) to find open-source solutions.
TTS Software
Many universities have released open-source software applications to generate speech. The Centre for Speech Technology Research (CSTR) at the University of Edinburgh has Festival (demo), and Carnegie Mellon University has Flite (Festival Lite) and Festvox. Festival and Flite perform the actual TTS, while Festvox aims to make building new synthetic voices more systematic and better documented. The Festvox authors say anyone can make a voice by recording their own speech, although I expect this would take a good deal of time. CMU continues to run the Blizzard Challenge, where competitors take a released speech database, build a synthetic voice from the data, and synthesize a prescribed set of test sentences that are evaluated through listening tests.
To install Flite and Festival:
- Download the library (Flite | Festival)
- Unzip the file
- Run sh do_build
- To use Flite follow this guide. To use Festival follow this guide (most commands are LISP-y).
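Once Flite is built, calling it from a script mostly comes down to assembling a command line. Here is a minimal Python sketch, assuming the `flite` binary is on your PATH and accepts the `-t` (text), `-o` (output WAV), and `-voice` options described in the Flite documentation. The helper names `flite_command` and `synthesize` are hypothetical, made up for illustration.

```python
import shutil
import subprocess


def flite_command(text, wav_path, voice=None):
    """Build the argument list for one Flite synthesis call."""
    cmd = ["flite"]
    if voice:
        # e.g. "slt"; which voices are available depends on how Flite was built
        cmd += ["-voice", voice]
    cmd += ["-t", text, "-o", wav_path]
    return cmd


def synthesize(text, wav_path, voice=None):
    """Run Flite if it is installed; return False when it is not found."""
    if shutil.which("flite") is None:
        return False  # assumption: the binary is named "flite" and is on PATH
    subprocess.run(flite_command(text, wav_path, voice), check=True)
    return True
```

Calling `synthesize("You enter the cave.", "cave.wav", voice="slt")` should write a WAV file when Flite is available and simply return `False` when it is not.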
While the software is interesting, the real magic of TTS is in the voices. A voice is built from recordings of a single individual, in some cases as much as thousands of hours of speech. To get an idea of what different voices sound like, listen to this voice named “SLT”, an American female, reading an excerpt from The Mystery of Chimney Rock.
The SLT voice (voice_cmu_us_slt_arctic_hts) was developed at CMU based on the ARCTIC library from the Laboratory of Artificial Intelligence, Department of Cybernetics at the University of West Bohemia in Pilsen, Czech Republic.
Here’s another voice, a Scottish male, reading the same excerpt.
Although interesting, these voices (many from the early 2000s) sound robotic and outdated compared to their commercial counterparts. I wondered if there had been new releases in recent years and reached out to Dr. Korin Richmond at the Centre for Speech Technology Research at Edinburgh, who pointed me to the wonderful homepage of Junichi Yamagishi at the University of Nagoya. He has a fantastic collection of software including HTS, the Hidden Markov Model-based Speech Synthesis System developed at NITECH (Nagoya Institute of Technology) in Japan. I was able to get the CSTR HTS Voice Library, version 0.99, which is available for research purposes only, by going to this page and requesting permission from Dr. Yamagishi.
I found HTS 0.99 to be a little more promising. Here’s Nick, a British English male.
Nick is good but a little drab and I thought I could do better still. I wrote back to Dr. Richmond about the possibility of downloading Nick’s successors (voices created in 2010 and 2011) but was told they are unavailable to the public due to licensing restrictions.
A bit distraught, I returned to where this story began: OS X’s `say` command. If you have OS X 10.7, you can download dozens of high-quality (and even multilingual) voices for free. The speech quality is fantastic and was by far the best of the voices I sampled. While not open source, the application is free to use if you have OS X, and I ended up settling on Serena as my favorite of the bunch.
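For a weekend hack, the `say` command is easy to drive from a script. The sketch below assumes OS X's `say` with its `-v` (voice) and `-o` (output file) options, and that the Serena voice has already been downloaded. `say_command` and `narrate` are hypothetical helper names, not part of the game's actual code.

```python
import shutil
import subprocess


def say_command(text, voice="Serena", out_path=None):
    """Build the argument list for one call to OS X's say."""
    cmd = ["say", "-v", voice]
    if out_path:
        # write audio to a file (e.g. an .aiff) instead of playing it aloud
        cmd += ["-o", out_path]
    cmd.append(text)
    return cmd


def narrate(passages, voice="Serena"):
    """Speak each passage in order; return how many were spoken."""
    if shutil.which("say") is None:
        return 0  # not on OS X, or say is not on PATH
    for passage in passages:
        subprocess.run(say_command(passage, voice), check=True)
    return len(passages)
```

With this in place, `narrate(["You wake up in a dark room.", "Do you open the door?"])` would read the two passages aloud on a Mac and do nothing elsewhere.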
The future is certainly looking bright for speech synthesis and I’ll (im)patiently await the release of new software and new voices. You can play the Choose Your Own Adventure game at adventure.gleitzman.com.