This AI Makes "Audio Deepfakes"! | Summary and Q&A

617.7K views • April 8, 2020

TL;DR

Deepfake technology has advanced to the point where it can convincingly animate video footage using synthesized audio, achieved through techniques like Tacotron 2 and Neural Voice Puppetry.

Key Insights

😥 Deepfake technology has advanced to the point where it can accurately animate video footage using synthesized audio.
👂 Tacotron 2 is an AI-based voice cloning technique that can synthesize new sentences in a person's voice using a 5-second sound sample.
🙊 Neural Voice Puppetry takes synthesized audio, such as the output of Tacotron 2, and animates video footage so that the target subject appears to be speaking it.
🎯 The deepfake techniques showcased in the video achieve higher quality than previous methods and generalize to multiple target subjects.
🏃 The neural rendering part of the process runs in real time.
🎯 The combination of multiple existing techniques enables joint video and audio synthesis for a target subject.
🫵 Viewers are encouraged to try out the deepfake tool themselves and share their results.

Transcript

Dear Fellow Scholars, this is Two Minute Papers with this guy's name that is impossible to pronounce. My name is Dr. Károly Zsolnai-Fehér, and indeed, it seems that pronouncing my name requires some advanced technology. So what was this? I promise to tell you in a moment, but to understand what happened here, first, let’s have a look at this deepfa...

Questions & Answers

Q: How does deepfake technology animate video footage using synthesized audio?

The video combines several techniques. An earlier deepfake method transfers mouth, head, and eye movements to a chosen target subject using just one photograph, and Tacotron 2 synthesizes new sentences in a person's voice from a short sound sample. Neural Voice Puppetry then applies the gestures and motions to an intermediate 3D model, which a neural renderer adapts to the target subject.
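The data flow described in this answer (text to synthesized audio, audio to per-frame motion of an intermediate model, then rendering for a target subject) can be illustrated with a toy sketch. It is not the method from the video: every name below (`synthesize_speech`, `audio_to_expressions`, `render_frames`, `mouth_open`) is hypothetical, and each stage is a trivial stand-in for what would be a trained neural network in the real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioClip:
    samples: List[float]
    sample_rate: int


@dataclass
class FaceFrame:
    # Toy stand-in for the expression parameters of an intermediate 3D model.
    mouth_open: float


def synthesize_speech(text: str, sample_rate: int = 16000) -> AudioClip:
    """Hypothetical text-to-speech stage: 0.1 s of placeholder signal per character."""
    samples_per_char = sample_rate // 10
    samples: List[float] = []
    for ch in text:
        level = 0.0 if ch == " " else 1.0  # silence for spaces, "sound" otherwise
        samples.extend([level] * samples_per_char)
    return AudioClip(samples, sample_rate)


def audio_to_expressions(clip: AudioClip, fps: int = 25) -> List[FaceFrame]:
    """Hypothetical audio-to-motion stage: one coefficient per video frame.

    The coefficient is the mean absolute amplitude of the audio window
    that falls inside that frame.
    """
    hop = clip.sample_rate // fps
    frames: List[FaceFrame] = []
    for start in range(0, len(clip.samples), hop):
        window = clip.samples[start:start + hop]
        frames.append(FaceFrame(sum(abs(s) for s in window) / len(window)))
    return frames


def render_frames(frames: List[FaceFrame], subject: str) -> List[str]:
    """Hypothetical rendering stage: emits a text description instead of pixels."""
    return [f"{subject}: mouth_open={f.mouth_open:.2f}" for f in frames]


if __name__ == "__main__":
    clip = synthesize_speech("a b")
    for line in render_frames(audio_to_expressions(clip), "target"):
        print(line)
```

The point of the sketch is only the staging: the audio is the single input that drives the motion, and the target subject enters at the last step, which is consistent with the claim that the approach generalizes to multiple target subjects.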

Q: Can deepfake technology synthesize sounds and consonants not heard in the original voice sample?

Yes. According to the video, Tacotron 2 can synthesize sounds and consonants that were not heard in the original voice sample. Rather than copying these sounds from the sample, the model infers how the speaker would produce them.

Q: How does Neural Voice Puppetry improve upon previous techniques?

Neural Voice Puppetry takes the synthesized audio from Tacotron 2 and animates video footage so that the target subject appears to be speaking it. It improves upon previous methods by achieving more realistic results and closer synchronization between the audio and the video.

Q: Can anyone try out these deepfake techniques for themselves?

Yes, viewers are encouraged to try out these deepfake techniques themselves. The link to the tool is provided in the video description, and users can leave comments with their results.

Summary & Key Takeaways

Deepfake techniques can now transfer realistic mouth, head, and eye movements to a chosen target subject using just one photograph.
Tacotron 2 is an AI-based voice cloning technique that can synthesize new sentences in a person's voice using just a 5-second sound sample.
Neural Voice Puppetry takes synthesized audio, such as the output of Tacotron 2, and animates video footage so that the target subject appears to be speaking it.