Adobe wants to create the Photoshop of voiceovers. At Adobe MAX 2016 last week, Zeyu Jin showed off a project called VoCo that analyzes a recording of speech and then learns how to reproduce the voice of the speaker. Using some sort of machine learning, the software allows the user to alter the words that the speaker says in the recording. For example, in Zeyu’s demo, he pulled in an audio recording of a speaker saying “And I kissed my dogs and my wife” and swapped the words “dogs” and “wife” so that the speaker then said “And I kissed my wife and my dogs”. The modification sounded remarkably real, though that is not so surprising given how simple the change was.
He then blew away the audience by completely changing the wording of the message. He replaced the word “wife” with the name “Jordan” by modifying the text, and to everybody’s surprise, the software flawlessly rendered the name “Jordan” in the speech, and it sounded remarkably like the real speaker. It’s unclear whether the “Jordan” snippet was automatically pulled from another section of the 20-minute audio recording that the program was working with, or whether the software really did generate the audio.
Zeyu took it a step further, though, and completely surpassed everybody’s expectations. He changed the text to “And I kissed Jordan three times”. After thinking for about two seconds, the software rendered a voice that realistically wove the newly generated words into the existing audio recording. Again, who knows whether the words “three times” were pulled from another section of the recording, but Zeyu claims that the software learns the voice of the human speaker and really does generate the artificial voice.
One thing is for sure, though: the crowd went absolutely wild when they heard the completely realistic generated voice. It sounded indistinguishable from the original speaker, and the generated audio was seamlessly woven into the original audio.
The double-edged sword
With a heated election around the corner, some have been worried about the chance of voice fraud. Imagine if this technology were used to “speak” in somebody’s voice without their knowledge. What’s stopping a presidential candidate from altering what their opponent said in a speech and using it against them? What’s stopping a hacker from using this technology to learn how your mother speaks, calling your phone, and impersonating your mother saying “Honey, I can’t remember my email password, can you remind me what it is?”
Zeyu addressed this after his demo, saying, “Don’t worry.” He says they’ve been working on ways to prevent this kind of scenario.
So what is this project for, anyway? Zeyu says people working on audiobooks or podcasts could easily correct mistakes in recordings. The editor would simply change the text to alter the matching fragment of audio, which would save the speaker the time of re-recording that section.
It’s very exciting to see machine learning come this far, but it may be tough, if not impossible, to prevent misuse of this technology.