This AI Sings | Two Minute Papers #230 | Summary and Q&A

43.8K views

February 22, 2018

TL;DR

This paper introduces an AI vocoder that generates realistic singing from MIDI and lyrics inputs, with fast generation times and modest training data requirements.

Key Insights

⌛ The AI vocoder synthesizes singing from MIDI and lyrics inputs by separating the pitch and timbre components of the voice. It generates audio quickly and needs only a modest amount of training data.
🛩️ The algorithm uses a modified WaveNet architecture with 2-by-1 dilated convolutions, whose receptive field grows exponentially while the parameter count stays low, which makes training on small datasets feasible.
💯 Mean opinion scores demonstrate that the AI vocoder outperforms previous methods in creating realistic singing.
🎹 MIDI inputs can be created easily with a MIDI master keyboard or a digital audio workstation, which makes the synthesis process more accessible.

Transcript

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. This work is about building an AI vocoder that is able to synthesize believable singing from MIDI and lyrics as inputs. But first, what is a vocoder? It works kinda like this. Fellow Scholars who are fans of Jean-Michel Jarre's music are likely very familiar with this effect...

Questions & Answers

Q: How does the AI vocoder generate singing from MIDI and lyrics inputs?

The AI vocoder separates the pitch and timbre components of the voice, using MIDI data to determine the pitch and lyrics text to generate the words. It then synthesizes the singing by combining these elements.
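
To make the pitch side of this concrete, here is a minimal sketch of how MIDI note numbers can be turned into a target pitch contour. It is not the paper's method: it only applies the standard equal-temperament conversion (A4 = MIDI note 69 = 440 Hz). The `melody` data, the function names, and the frame rate of 200 frames per second are assumptions made up for illustration.

```python
def midi_to_hz(note: int, a4_hz: float = 440.0) -> float:
    """Convert a MIDI note number to a frequency in Hz (equal temperament)."""
    # MIDI note 69 is A4; each semitone multiplies the frequency by 2^(1/12).
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def pitch_track(notes, frame_rate_hz: int = 200):
    """Expand (note, duration, syllable) tuples into one pitch value per frame.

    frame_rate_hz is an assumed analysis rate, not a value from the source.
    """
    track = []
    for note, duration_s, _syllable in notes:
        num_frames = round(duration_s * frame_rate_hz)
        track.extend([midi_to_hz(note)] * num_frames)
    return track


# Hypothetical melody: (MIDI note, duration in seconds, lyric syllable).
melody = [(60, 0.5, "la"), (62, 0.5, "la"), (64, 1.0, "laa")]

contour = pitch_track(melody)
print(f"{len(contour)} frames, first pitch {contour[0]:.2f} Hz")
```

A flat, stepwise contour like this would sound robotic on its own. Natural singing adds vibrato and smooth transitions between notes, which is the kind of detail a learned model is meant to supply.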

Q: What are the advantages of using the AI vocoder over other methods?

The AI vocoder offers faster generation times, approximately 10-15 times real-time. Additionally, it requires a modest amount of training data, making it feasible to train on smaller datasets.

Q: How does the modified WaveNet architecture contribute to the AI vocoder?

The AI vocoder uses a modified WaveNet architecture with 2-by-1 dilated convolutions. This allows for an exponential growth in the receptive field of the model, while keeping the parameter count low.
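
The trade-off described above can be checked with a little arithmetic. The sketch below assumes the usual WaveNet convention of doubling the dilation at every layer (1, 2, 4, ...) and a hypothetical width of 64 channels; neither value is stated in the source. It counts only the weights of the 2-by-1 convolutions and ignores gating, residual connections, and biases.

```python
def receptive_field(dilations, kernel_size: int = 2) -> int:
    """Number of input steps seen by one output of a stack of dilated convolutions."""
    # Each layer widens the receptive field by (kernel_size - 1) * dilation.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def conv_weight_count(num_layers: int, channels: int, kernel_size: int = 2) -> int:
    """Weights in the dilated convolutions alone (no biases, no 1x1 layers)."""
    return num_layers * kernel_size * channels * channels


for num_layers in (1, 2, 4, 8):
    dilations = [2 ** i for i in range(num_layers)]  # assumed doubling schedule
    print(
        f"layers={num_layers} "
        f"receptive_field={receptive_field(dilations)} "
        f"weights={conv_weight_count(num_layers, channels=64)}"
    )
```

Under these assumptions, going from 4 to 8 layers doubles the weight count (32768 to 65536) while the receptive field grows from 16 to 256 steps. Without dilation, the same 8 layers would see only 9 steps.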

Q: How does the AI vocoder compare to other methods in terms of creating authentic singing?

Mean opinion scores place the AI vocoder between previous works and reference recordings of real singing: listeners rate it as more genuine than earlier methods, though still below a human singer.
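
For readers unfamiliar with the metric, a mean opinion score is simply the average of listener ratings. The sketch below assumes the common 1 to 5 rating scale, which the source does not specify. The ratings are placeholder numbers invented for illustration, not results from the paper.

```python
from statistics import mean


def mean_opinion_score(ratings, low: int = 1, high: int = 5) -> float:
    """Average listener ratings after checking they lie on the assumed scale."""
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside {low}..{high}")
    return mean(ratings)


# Placeholder ratings for two hypothetical systems (not data from the paper).
placeholder_ratings = {
    "system_a": [3, 4, 3, 2, 3],
    "system_b": [4, 4, 5, 3, 4],
}

for name, ratings in placeholder_ratings.items():
    print(f"{name}: MOS = {mean_opinion_score(ratings):.2f}")
```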

Summary & Key Takeaways

The AI vocoder can synthesize singing from MIDI and lyrics inputs, separating pitch and timbre components to generate waveforms.
The algorithm uses a modified WaveNet architecture with 2-by-1 dilated convolutions, which keeps the parameter count low and makes training on small datasets feasible.
Mean opinion scores indicate that the new method outperforms previous works in creating authentic human-like singing.