The Blue Lagoon


“The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser”

This phrase is stuck in my head. Not because I’m a huge fan of poorly rated movies, but because it’s a test phrase that Google likes to use to show off its speech synthesis algorithms. Speech synthesis has kinda been stuck in my head too.

Tacotron Yum

Recently, researchers at Google published a paper describing a text-to-speech generation model called “Tacotron”. It uses deep learning to learn how to generate audio based on input text. Besides catching my attention due to the delicious-sounding title, the paper intrigued me because of the problems that arise when trying to synthesize speech from text. Current speech synthesis models in production rely primarily on concatenation of pre-recorded words, with some smoothing to make the words flow together. The problem with this method is that word duration and intonation are not fully taken into consideration, causing the synthesized audio to sound robotic and unnatural.
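To make the concatenative approach concrete, here’s a minimal sketch of the idea: join pre-recorded word clips and smooth the seams with a short crossfade. (The function name, sample rate, and fade length are my own illustrative choices, not from any production system.)

```python
import numpy as np

def crossfade_concat(clips, sr=16000, fade_ms=20):
    """Naively join pre-recorded word clips with a short linear crossfade.

    `clips` is a list of 1-D float arrays, one per word, each longer than
    the fade. The crossfade is the "smoothing" step: it hides clicks at
    the joins, but does nothing about duration or intonation, which is
    why concatenative output still sounds robotic.
    """
    fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    out = clips[0].astype(float)
    for clip in clips[1:]:
        clip = clip.astype(float)
        # Overlap the tail of the running output with the head of the
        # next clip, fading one out while fading the other in.
        out[-fade:] = out[-fade:] * (1 - ramp) + clip[:fade] * ramp
        out = np.concatenate([out, clip[fade:]])
    return out
```

Even with the crossfade, every word keeps the pitch and pacing it was recorded with, so a question and a statement built from the same clips sound identical.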

Synthesizing speech is a non-trivial problem, mainly because there is a lot of interpolation involved. Raw text does not provide many clues about tone, inflection, and expressiveness. The inflection in asking a question, such as “It’s your birthday today?”, is significantly different from that in a statement such as “It’s your birthday today!” In addition, individual voices obviously differ a lot, based on gender, nationality, etc. It’s hard to teach a computer to generalize the important parts.

Tacotron takes a different approach from the current concatenative methods: it uses an “end-to-end” approach, wherein it learns from text/speech pairs how to directly generate the raw spectrogram for an input text. This allows it to capture features such as natural speech rhythm, stress, and intonation. The strength of a deep learning model is that it can naturally incorporate features that may otherwise go overlooked. Because it learns from recordings, for example, the generated audio also includes mouth sounds and breathing that make the speech sound more human.
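The model’s training target is a magnitude spectrogram computed from the paired recording. Here’s a sketch of how that target representation is produced from raw audio; the frame length and hop size here are common illustrative values, not necessarily the paper’s exact settings.

```python
import numpy as np

def magnitude_spectrogram(audio, frame_len=1024, hop=256):
    """Compute a magnitude spectrogram: the kind of representation an
    end-to-end model like Tacotron learns to predict from text.

    Slices the signal into overlapping frames, applies a Hann window,
    and takes the magnitude of each frame's real FFT. The result is a
    (n_frames, frame_len // 2 + 1) array of non-negative values.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([
        audio[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))
```

A spectrogram discards phase, so turning a predicted spectrogram back into a waveform needs a reconstruction step (the paper uses Griffin-Lim), but it’s a much more learnable target than raw samples.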

Speech synthesis is an incredibly relevant application of computer science, which is why I found the topic so interesting. Text to speech could be used to automatically generate audiobooks, create dialogue procedurally, and provide accurate verbal translations. Personal assistant applications that use a conversational interface would require natural speech synthesis for a more immersive user experience.

Additional Reading

You can read the Tacotron paper here (arxiv 1703.10135).

You can read about another one of Google’s speech synthesis projects, WaveNet, here (website) or here (paper; arxiv 1609.03499).