AI Creates Facial Animation From Audio | Two Minute Papers #185 | Summary and Q&A

247.8K views
September 4, 2017
by Two Minute Papers

TL;DR

Researchers have developed a convolutional neural network (CNN) that generates high-quality facial animation from speech input and can incorporate emotional states; combined with WaveNet, which synthesizes speech from written text, the pipeline lets digital characters speak without actors or motion capture.


Key Insights

  • The CNN can generate realistic facial animations from speech input.
  • Emotional states can be incorporated into the animations, adding expressiveness.
  • WaveNet can synthesize audio from written text, allowing digital characters to speak.
  • Combining the CNN with WaveNet creates a convenient pipeline for producing voiceovers and animations without the need for actors or motion capture.
  • The technique outperforms previous methods in realism and naturalness.
  • A user study showed consistently favorable results across different cases and languages.
  • The CNN was trained on a small amount of data per actor, showcasing its ability to generalize.

Transcript

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This work is about creating facial animation from speech in real time. This means that after recording the audio footage of us speaking, we give it to a learning algorithm, which creates a high-quality animation depicting our digital characters uttering these words. This lea...

Questions & Answers

Q: How does the CNN create facial animations from speech?

The CNN is trained on a small amount of footage of actors speaking. It learns to generate facial animations based on the audio input, and can generalize this knowledge to create animations for a variety of expressions and words.
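To make this concrete, below is a minimal PyTorch sketch of an audio-to-animation network of this kind. It is not the paper's actual architecture: the input feature layout, layer sizes, and the NUM_VERTICES mesh resolution are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioToFaceNet(nn.Module):
    """Minimal sketch of an audio-to-facial-animation CNN.

    Hypothetical shapes: the input is a short window of audio features
    (here 64 time steps x 32 coefficients per step) and the output is
    a 3D offset for each of NUM_VERTICES face-mesh vertices.
    """
    NUM_VERTICES = 5000  # hypothetical mesh resolution

    def __init__(self):
        super().__init__()
        # 1-D convolutions over time collapse the audio window into a
        # single feature vector describing the current mouth shape.
        self.encoder = nn.Sequential(
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # A fully connected head regresses per-vertex 3D offsets.
        self.head = nn.Linear(128, self.NUM_VERTICES * 3)

    def forward(self, audio_features):
        # audio_features: (batch, 32 coefficients, 64 time steps)
        z = self.encoder(audio_features).squeeze(-1)  # (batch, 128)
        offsets = self.head(z)                        # (batch, NUM_VERTICES * 3)
        return offsets.view(-1, self.NUM_VERTICES, 3)

net = AudioToFaceNet()
window = torch.randn(1, 32, 64)  # one dummy window of audio features
print(net(window).shape)         # torch.Size([1, 5000, 3])
```

The design point is that convolving over a short audio window gives the network enough temporal context to disambiguate similar-sounding phonemes, rather than animating each instant in isolation.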

Q: Can the animations incorporate emotional states?

Yes, the researchers can specify an emotional state that the character should express while speaking. This adds a new dimension of expressiveness to the generated animations.
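One simple way to realize such conditioning (the paper's exact mechanism may differ) is to learn an embedding per emotion label and concatenate it with the audio features before regressing the pose. Everything below, including the emotion names and dimensions, is a hypothetical sketch.

```python
import torch
import torch.nn as nn

class EmotionConditionedHead(nn.Module):
    """Sketch of conditioning the animation on an emotional state.

    A learned embedding per emotion label (labels are hypothetical) is
    concatenated with the audio feature vector, so the same audio can
    be rendered neutrally, happily, angrily, and so on.
    """
    EMOTIONS = ["neutral", "happy", "angry", "sad"]  # hypothetical labels

    def __init__(self, audio_dim=128, emotion_dim=16, out_dim=15000):
        super().__init__()
        self.emotion_embedding = nn.Embedding(len(self.EMOTIONS), emotion_dim)
        self.head = nn.Linear(audio_dim + emotion_dim, out_dim)

    def forward(self, audio_vec, emotion_id):
        e = self.emotion_embedding(emotion_id)  # (batch, emotion_dim)
        return self.head(torch.cat([audio_vec, e], dim=-1))

head = EmotionConditionedHead()
audio_vec = torch.randn(1, 128)
angry = torch.tensor([2])            # index of "angry" in EMOTIONS
print(head(audio_vec, angry).shape)  # torch.Size([1, 15000])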

Q: How does WaveNet contribute to the process?

WaveNet is used to synthesize audio from written text, creating a believable human voice. This audio can then be combined with the facial animations, allowing the digital character to speak the written text.
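WaveNet's defining ingredient is a stack of dilated causal 1-D convolutions over raw audio, with the dilation doubling per layer so the receptive field grows exponentially. The toy below illustrates only that core idea; the real model also uses gated activations, residual/skip connections, and a categorical output over quantized samples.

```python
import torch
import torch.nn as nn

class TinyWaveNet(nn.Module):
    """Toy illustration of WaveNet's dilated causal convolution stack.

    Each layer doubles its dilation, so every predicted sample can
    depend on an exponentially long window of past audio.
    """
    def __init__(self, channels=32, layers=6):
        super().__init__()
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        self.dilated = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        )
        self.output = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (batch, 1, time) raw waveform
        h = self.input(x)
        for conv in self.dilated:
            # Left-pad so the convolution stays causal: each output
            # sample depends only on past samples, never future ones.
            pad = conv.dilation[0]  # (kernel_size - 1) * dilation
            h = torch.relu(conv(nn.functional.pad(h, (pad, 0))))
        return self.output(h)

net = TinyWaveNet()
wave = torch.randn(1, 1, 1024)
print(net(wave).shape)  # torch.Size([1, 1, 1024])
```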

Q: How did the researchers validate the effectiveness of their technique?

The researchers conducted a user study where participants were shown videos created using the old technique and the new technique, without knowing which was which. The vast majority of participants judged the videos created using the new technique to be more natural.
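Blind A/B preference studies like this are typically analyzed with a simple binomial test. The counts below are hypothetical, purely to show the arithmetic; the video does not report exact numbers.

```python
from math import comb

def preference_p_value(prefer_new, total):
    """One-sided exact binomial test: the probability of at least
    `prefer_new` votes for the new method out of `total` blind A/B
    comparisons if raters actually had no preference (p = 0.5)."""
    return sum(comb(total, k) for k in range(prefer_new, total + 1)) / 2 ** total

# Hypothetical counts, for illustration only: 83 of 100 raters
# preferred clips produced by the new technique.
print(preference_p_value(83, 100))  # tiny (far below 0.05): significant
```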

Summary & Key Takeaways

  • Researchers have created a CNN that can generate realistic facial animations based on speech input, using as little as 3 to 5 minutes of training data per actor.

  • The CNN can also incorporate emotional states into the animations, allowing for more expressive performances.

  • By combining this technique with DeepMind's WaveNet, the researchers can synthesize realistic human voices and make digital characters speak written text (a minimal glue sketch follows below).
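Putting the pieces together, the pipeline runs text through a WaveNet-style TTS model and feeds the resulting audio to the animation network. The sketch below uses hypothetical stand-in stubs (`synthesize_speech`, `animate_face`); neither name nor shape comes from the paper.

```python
import torch

# Hypothetical stubs standing in for the two trained models sketched
# earlier: the WaveNet-style TTS and the audio-to-animation CNN.
def synthesize_speech(text: str) -> torch.Tensor:
    return torch.randn(1, 1, 16000 * len(text) // 10)  # fake 16 kHz waveform

def animate_face(wave: torch.Tensor, emotion: str) -> torch.Tensor:
    frames = wave.shape[-1] // 533          # ~30 fps at 16 kHz
    return torch.zeros(frames, 5000, 3)     # fake per-frame vertex tracks

def speak(text: str, emotion: str = "neutral") -> torch.Tensor:
    """End-to-end: written text in, per-frame face-mesh vertices out."""
    return animate_face(synthesize_speech(text), emotion)

print(speak("Hello, Fellow Scholars!").shape)  # torch.Size([69, 5000, 3])
```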
