Amazon, the American technology company, has built an artificial intelligence system that converts text into synthesized speech. The researchers describe it as the largest of its kind to date, with details published on arXiv, a repository for scientific papers.
The model, named Big Adaptive Streamable TTS with Emergent abilities, or BASE TTS, comprises 980 million parameters and was trained on 100,000 hours of collected speech samples, most of them in English. It demonstrates the ability to learn nuanced pronunciation, rhythm, and intonation from large-scale data, enabling natural-sounding speech generation. The work also showcases the model’s capacity to handle multilingual pronunciation, citing phrases like “adios, amigo” as examples of accurate articulation across languages. The research notes that these pronunciation cues help BASE TTS render non-English phrases with appropriate stress and cadence. (Source: arXiv)
In testing with comparatively smaller datasets, BASE TTS displayed competence in handling compound nouns, conveying emotion through voice modulation, and using punctuation to shape intonation. The system can render questions by emphasizing the right words and applying appropriate prosody, illustrating how punctuation and context drive expressive speech in synthetic voices. (Source: arXiv)
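To make the punctuation-to-prosody idea concrete, here is a deliberately simplified sketch, not Amazon's BASE TTS pipeline: a hypothetical front-end helper that maps a sentence's terminal punctuation to a coarse intonation label, the kind of cue a synthesizer could use to choose a pitch contour.

```python
# Illustrative toy only -- not the BASE TTS implementation. It shows the
# general principle that punctuation carries prosodic information a TTS
# front end can exploit before any audio is generated.

def prosody_cue(sentence: str) -> str:
    """Return a coarse intonation label based on terminal punctuation."""
    stripped = sentence.rstrip()
    if stripped.endswith("?"):
        return "rising"      # questions typically end with rising pitch
    if stripped.endswith("!"):
        return "emphatic"    # exclamations get extra stress and energy
    return "falling"         # plain statements end with falling pitch

print(prosody_cue("Are you coming?"))   # -> rising
print(prosody_cue("Watch out!"))        # -> emphatic
print(prosody_cue("It works."))         # -> falling
```

A real system learns far richer, context-dependent contours from data rather than hand-written rules, but the sketch captures why a question mark alone is enough to change how a synthetic voice delivers a sentence.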
Amazon envisions BASE TTS being used in educational contexts as a learning tool, enabling personalized audio content, listening practice, and language exposure for students. The intent is to provide a scalable, natural-sounding voice that can support diverse learning scenarios and accessibility needs. (Source: arXiv)
The field has seen parallel developments, including Apple’s recent foray into AI-assisted animation. The juxtaposition of these efforts highlights a broader trend toward integrating AI-generated speech and motion to enhance digital content and learning experiences. (Source: arXiv)