AI Voices and TTS Systems

Over the past two decades, text-to-speech (TTS) technology has undergone a rapid evolution, moving from early phoneme-based systems to more advanced neural TTS technology. This development has had a significant impact on the voiceover and dubbing industry, and has raised questions about the future of these professions in the face of increasingly sophisticated AI-generated speech.

Early Phoneme-Based TTS Systems

Phoneme-based TTS systems were the first generation of TTS technology. These systems worked by breaking down a written text into individual phonemes, or speech sounds, and then stitching them together to create a synthesized voice. While these systems were able to produce understandable speech, they often lacked naturalness and were limited in their ability to convey emotion or nuance.
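The stitching idea described above can be sketched in a few lines of Python. This is a toy illustration only: the lexicon and the "audio clips" (short lists of samples) are made up for the example, whereas real concatenative systems drew on large recorded inventories of phonemes or diphones.

```python
# Minimal sketch of concatenative, phoneme-based synthesis.
# The lexicon and clip data below are invented for illustration.

# Hypothetical grapheme-to-phoneme lexicon (ARPAbet-style symbols)
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Each phoneme maps to a prerecorded clip; lists of sample values
# stand in for real audio buffers here.
PHONEME_CLIPS = {
    "HH": [0.1, 0.2], "AH": [0.3, 0.4], "L": [0.5],
    "OW": [0.6, 0.7], "W": [0.2, 0.1], "ER": [0.4], "D": [0.8],
}

def synthesize(text):
    """Look up each word's phonemes and stitch their clips together."""
    samples = []
    for word in text.lower().split():
        for phoneme in LEXICON[word]:
            samples.extend(PHONEME_CLIPS[phoneme])
    return samples

audio = synthesize("hello world")
```

The robotic quality of early systems came precisely from this hard concatenation: clips recorded in isolation were joined with no smoothing of pitch, timing, or emotion across the seams.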

One of the major drawbacks of phoneme-based TTS systems was the robotic-sounding speech they produced. This made them unsuitable for many applications, including creative work such as voiceovers, dubbing, and narration. As a result, human voice actors remained the primary choice for many years.

THE ARABIC VOICE™ studios produced the phoneme pool for one such early TTS system, built for an Egypt-based technology corporation, using one of our major male voice talents, Ahmed Ragab, under the linguistic and artistic supervision of Ahmed AlQotb. Later, in 2018, AlQotb joined the production of a TTS system for the voice assistant of a Fortune 500 corporation in Beirut, serving as voice coach.

Neural TTS Technology

In recent years, however, TTS technology has undergone a major shift, thanks to the development of neural TTS systems. These systems use artificial neural networks, which are designed to mimic the structure and function of the human brain, to create more natural-sounding speech.

Neural TTS systems work by training the network on a large dataset of human speech samples, allowing it to learn the patterns and nuances of natural speech. As a result, the system can produce speech that is not only more natural-sounding but also capable of conveying a wider range of emotion and nuance.
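The shift is from stitching fixed recordings to learning a model from data. The toy sketch below captures only the statistical idea: it "trains" by averaging acoustic features per phoneme from a made-up dataset, whereas a real neural TTS system trains a deep network to predict full spectrogram frames from text, including prosody and emotion.

```python
# Toy illustration of the data-driven idea behind neural TTS.
# "Training" here is just averaging per-phoneme features from
# example recordings -- a stand-in for fitting a neural network.

from collections import defaultdict

# Hypothetical dataset: (phoneme, acoustic feature) pairs
# extracted from many recorded utterances.
DATASET = [
    ("AH", 0.30), ("AH", 0.34), ("AH", 0.32),
    ("OW", 0.60), ("OW", 0.64),
]

def train(dataset):
    """Learn one acoustic value per phoneme by averaging examples."""
    sums = defaultdict(lambda: [0.0, 0])
    for phoneme, feature in dataset:
        sums[phoneme][0] += feature
        sums[phoneme][1] += 1
    return {p: total / count for p, (total, count) in sums.items()}

model = train(DATASET)
```

Because the output is generated from learned patterns rather than glued-together clips, the model can interpolate smoothly between sounds, which is where the gain in naturalness comes from.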

At this stage, companies like Microsoft began developing what has been called a "voice font" rather than a phoneme pool. The term refers to a voice fingerprint that a neural TTS model can use to reproduce human-like speech and emotion. This is the technology you currently find on AI voice providers' platforms such as Revoicer, Speechify, and Murf, each of which deploys Microsoft Azure's cloud speech service under the hood.
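In practice, a request to such a cloud service selects a neural voice font and a speaking style through SSML markup, and the platform renders the audio server-side. A minimal sketch of building that markup is shown below; the voice name and style are examples of the kind of values these catalogs use, not a guaranteed listing, and a real request would also need an API key and an SDK or HTTP call.

```python
# Sketch of the SSML a neural TTS cloud service typically consumes:
# the <voice> element picks a "voice font", and an express-as style
# requests an emotional delivery. Voice/style names are examples.

def build_ssml(text, voice="en-US-JennyNeural", style="cheerful"):
    """Wrap plain text in SSML selecting a neural voice and style."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
        '</voice></speak>'
    )

ssml = build_ssml("Welcome to our studio.")
```

The provider platforms mentioned above essentially wrap this kind of request in a friendlier interface, which is why many of them offer the same underlying voice catalog.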

Impact on the Voiceover and Dubbing Industry

The development of neural TTS technology has raised concerns about the future of the voiceover and dubbing industry. While it’s clear that AI-generated speech can never fully replace the artistry and creativity of human voice actors, it’s also true that the technology is rapidly advancing and becoming more sophisticated.

For example, in certain contexts such as news broadcasts or informational videos, AI-generated speech may be sufficient to convey information accurately and quickly, particularly for clients unwilling to pay for a professional voiceover artist. However, for more creative and expressive work, such as cartoons, video games, and advertising, human voice actors are still in high demand and likely to remain so for the foreseeable future.

Moreover, some argue that the use of AI-generated speech may actually create more work for voice actors. For example, the spread of AI-generated audiobooks and podcasts may increase demand for high-quality voice performances, as listeners seek out content that stands out from the automated voices they encounter elsewhere.


In conclusion, the evolution of text-to-speech technology has been significant over the past two decades. The shift from phoneme-based TTS systems to neural TTS technology has resulted in a more natural-sounding and expressive form of speech synthesis, raising questions about the future of the voiceover and dubbing industry.

While it’s clear that AI-generated speech can never fully replace the creativity and artistry of human voice actors, the technology is advancing quickly and is likely to have an impact on the industry in the years to come. However, it’s also possible that the use of AI-generated speech may actually create new opportunities for voice actors, leading to an increase in demand for high-quality voice performances.




Arabic voice professional, Acting and Casting Director, Founder and Strategist of THE ARABIC VOICE, Inc., and El Hakawaaty educational hub, the voice of:

Interpol e-learning curriculum produced for the International IP Crime Investigators College
“Shadow of the Tomb Raider” game, main villain character: “Dr. Dominguez”
