AI Voices and TTS Systems

Over the past two decades, text-to-speech (TTS) technology has undergone a rapid evolution, moving from early phoneme-based systems to more advanced neural TTS technology. This development has had a significant impact on the voiceover and dubbing industry, and has raised questions about the future of these professions in the face of increasingly sophisticated AI-generated speech.

Early Phoneme-Based TTS Systems

Phoneme-based TTS systems were the first generation of TTS technology. These systems worked by breaking down a written text into individual phonemes, or speech sounds, and then stitching them together to create a synthesized voice. While these systems were able to produce understandable speech, they often lacked naturalness and were limited in their ability to convey emotion or nuance.
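To make the idea concrete, here is a minimal sketch, in Python, of what such concatenative, phoneme-based synthesis amounts to. The phoneme inventory, the tiny word-to-phoneme lexicon, and the per-phoneme audio files are illustrative assumptions rather than any real system's data.

```python
# A minimal sketch of concatenative, phoneme-based synthesis.
# The file layout, phoneme inventory, and tiny lexicon below are
# illustrative assumptions, not part of any real TTS product.
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16_000

# Hypothetical lexicon mapping words to phoneme sequences (ARPAbet-style).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def load_phoneme_clip(phoneme: str) -> np.ndarray:
    """Load a pre-recorded clip for one phoneme (assumed file layout)."""
    rate, audio = wavfile.read(f"phonemes/{phoneme}.wav")
    assert rate == SAMPLE_RATE
    return audio

def synthesize(text: str) -> np.ndarray:
    """Naively stitch phoneme recordings together, word by word."""
    pieces = []
    for word in text.lower().split():
        for phoneme in LEXICON[word]:
            pieces.append(load_phoneme_clip(phoneme))
        # Short silence between words.
        pieces.append(np.zeros(SAMPLE_RATE // 10, dtype=np.int16))
    return np.concatenate(pieces)

if __name__ == "__main__":
    wavfile.write("output.wav", SAMPLE_RATE, synthesize("hello world"))
```

Because each phoneme clip is recorded in isolation and simply glued to the next, the joins between sounds are audible, which is exactly why these systems sounded robotic.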

One of the major drawbacks of phoneme-based TTS systems was the robotic-sounding speech they produced. This made them unsuitable for many applications, including creative work such as voiceovers, dubbing, and narration. As a result, human voice actors remained the primary choice for many years.

THE ARABIC VOICE™ studios produced the phoneme pool for one such early TTS system on behalf of an Egypt-based technology corporation, using one of our major male voice talents, Ahmed Ragab, under the linguistic and artistic supervision of Ahmed AlQotb. Later, in 2018, AlQotb joined the production of the TTS system for the voice assistant of a Fortune 500 corporation in Beirut as the voice coach.

Neural TTS Technology

In recent years, however, TTS technology has undergone a major shift, thanks to the development of neural TTS systems. These systems use artificial neural networks, which are designed to mimic the structure and function of the human brain, to create more natural-sounding speech.

Neural TTS systems work by training the network on a large dataset of human speech samples, allowing it to learn the patterns and nuances of natural speech. As a result, the system can produce speech that not only sounds more natural but can also convey a wider range of emotion and nuance.
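The following toy PyTorch sketch illustrates that training idea: a small network learns to map text tokens to acoustic features (mel-spectrogram frames) by minimizing the difference from recorded examples. The random tensors stand in for a real speech corpus, and production systems (for example Tacotron-style models) are vastly larger, so this is only a conceptual illustration.

```python
# A toy sketch of the neural TTS idea: a network learns to map text
# tokens to acoustic features (mel-spectrogram frames) from paired
# examples. The random "dataset" is only a stand-in to make the loop run.
import torch
import torch.nn as nn

VOCAB_SIZE, N_MELS, FRAMES = 40, 80, 100

class TinyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 128)
        self.encoder = nn.GRU(128, 128, batch_first=True)
        self.to_mel = nn.Linear(128, N_MELS)

    def forward(self, tokens):
        x = self.embed(tokens)            # (batch, chars, 128)
        x, _ = self.encoder(x)            # contextual character features
        x = x.mean(dim=1, keepdim=True)   # crude stand-in for attention
        return self.to_mel(x.expand(-1, FRAMES, -1))  # (batch, frames, mels)

model = TinyTTS()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for (text, recorded-speech-features) pairs from a speech corpus.
tokens = torch.randint(0, VOCAB_SIZE, (8, 20))
target_mels = torch.randn(8, FRAMES, N_MELS)

for step in range(100):
    predicted = model(tokens)
    loss = nn.functional.l1_loss(predicted, target_mels)  # match recordings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key difference from the phoneme-stitching approach is that nothing here is hand-glued: the network itself learns how sounds flow into one another from the recordings it is trained on.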

At this stage, companies like Microsoft began developing what has been called a voice font rather than a phoneme pool. The term refers to a voice fingerprint that a neural TTS model can use to reproduce human-like speech and emotion. This is the technology you currently find on AI voice providers’ platforms such as Revoicer, Speechify, or Murf, each of which deploys it as a cloud service of Microsoft Azure.
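As a rough illustration of how such platforms consume these cloud voices, below is a minimal sketch using the Azure Speech SDK for Python. The subscription key, region, output file name, and the particular neural voice name are placeholders or assumptions; consult Azure's current voice catalog before relying on them.

```python
# A minimal sketch of calling an Azure neural voice directly.
# "YOUR_KEY", "YOUR_REGION", and the voice name are placeholders /
# assumptions; check Azure's current voice catalog before relying on them.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# Pick a prebuilt neural "voice font"; an Arabic (Egypt) voice as an example.
speech_config.speech_synthesis_voice_name = "ar-EG-SalmaNeural"

audio_config = speechsdk.audio.AudioOutputConfig(filename="greeting.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

result = synthesizer.speak_text_async("Welcome to our support line.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesized speech written to greeting.wav")
```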

Impact on the Voiceover and Dubbing Industry

The development of neural TTS technology has raised concerns about the future of the voiceover and dubbing industry. While it’s clear that AI-generated speech can never fully replace the artistry and creativity of human voice actors, it’s also true that the technology is rapidly advancing and becoming more sophisticated.

For example, in certain contexts such as news broadcasts or informational videos, AI-generated speech may be sufficient to convey information accurately and quickly (for those who are not willing to pay a professional voiceover artist). However, for more creative and expressive work, such as cartoons, video games, and advertising, human voice actors are still in high demand and likely to remain so for the foreseeable future.

Moreover, according to some views, the use of AI-generated speech may actually create more work for voice actors. For example, AI-generated audiobooks or podcasts may increase demand for high-quality voice performances, as listeners seek out content that stands out from the automated voices they encounter elsewhere.

Conclusion

In conclusion, the evolution of text-to-speech technology has been significant over the past two decades. The shift from phoneme-based TTS systems to neural TTS technology has resulted in a more natural-sounding and expressive form of speech synthesis, raising questions about the future of the voiceover and dubbing industry.

While it’s clear that AI-generated speech can never fully replace the creativity and artistry of human voice actors, the technology is advancing quickly and is likely to have an impact on the industry in the years to come. However, it’s also possible that the use of AI-generated speech may actually create new opportunities for voice actors, leading to an increase in demand for high-quality voice performances.


Author

Ahmed AlQotb,

Arabic voice professional, acting and casting director, founder and strategist of THE ARABIC VOICE, Inc., ArabicIVR.com, and the El Hakawaaty educational center, the voice of:

Apple customer support IVR system in the Middle East
Western Union telephone support IVR system in the Middle East
Internal training e-learning systems of the Intercontinental Hotel Group
Interpol e-learning curriculum produced for the International IP Crime Investigators College
On-board audio information for BMW
“Shadow of the Tomb Raider” game, main villain character: “Dr. Dominguez”
Google AdWords customer support IVR system in the Middle East
The Arabic audio guide of the Olympic Museum

