Alibaba's Tongyi Lab Unveils Fun-CosyVoice3.5 and Fun-AudioGen-VD: FreeStyle Voice Generation with a Single Sentence
Alibaba's Tongyi Lab has launched two voice models, Fun-CosyVoice3.5 and Fun-AudioGen-VD, that let users generate and control speech and audio scenes through natural language instructions. The models offer fine-grained control over vocal expression, multi-language support, and immersive soundscape creation.
Alibaba's Tongyi Lab has expanded its generative audio lineup with two new models, Fun-CosyVoice3.5 and Fun-AudioGen-VD. Announced on March 2, 2026, the models introduce a "FreeStyle" instruction paradigm, in which users generate and control speech and complex audio scenes with plain natural language commands.
A New Era of Instructable TTS
The core innovation is a move away from preset parameters: users describe the desired output directly. For instance, instructions such as "make the tone more firm," "lower the pitch and slow down the speed a bit," or "add a bit of emotional fluctuation" are meant to produce the corresponding vocal delivery.
Fun-CosyVoice3.5: Precision in Multi-Language Speech
This model focuses on multi-language voice cloning and fine-grained expressive control. Key upgrades include:
- Expanded Language Support: Adds Thai, Indonesian, Portuguese, and Vietnamese to its repertoire.
- Enhanced Accuracy: Claims industry-leading results on two objective metrics, word error rate (WER) and speaker similarity (SpkSim), across 13 languages. The mispronunciation rate for rare characters falls from 15.2% to 5.3%, a relative reduction of about 65%.
- Improved Stability: Delivers more stable performance on complex and long-form text.
- Better Sound Quality: Utilizes reinforcement learning for a more natural, layered auditory experience.
- Faster Performance: Halving the tokenizer frame rate cuts first-packet latency by 35%, enabling quicker responses in real-time interactions.
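The two objective metrics cited in the list are standard in speech synthesis evaluation. WER is usually computed as the word-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the reference length, and speaker similarity is commonly taken as the cosine similarity between two speaker embeddings. The sketch below illustrates both in plain Python, along with the relative reduction implied by the reported rare-character figures. The function names and inputs are illustrative assumptions; the announcement does not describe Tongyi Lab's actual evaluation pipeline.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate dropped, relative to its starting value."""
    return (before - after) / before
```

With the reported figures, `relative_reduction(15.2, 5.3)` comes to roughly 0.65, which is where a relative drop of about 65% in rare-character mispronunciations comes from.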
Fun-AudioGen-VD: Crafting Voices and Their Worlds
This model is designed for holistic sound design and immersive scene generation. It goes beyond generating a voice to create the entire auditory environment.
Voice Attribute Control: Users can specify:
- Basic Traits: Gender, age, accent, pitch, speed.
- Voice Quality: Raspy, clear, deep, magnetic.
- Emotional Expression: Anger, sadness, excitement, determination.
- Role Simulation: Customer service agent, veteran, child, AI, broadcaster.
- Complex Psychology: Nuanced states such as "calm on the surface but trembling inside."
Scene and Environment Generation: The model can generate the soundscape where the voice exists:
- Background Ambiance: City noise, café chatter, battlefield sounds.
- Spatial Acoustics: Simulates reverb for spaces like cathedrals, metal cells, or underwater.
- Device Filters: Mimics the sound quality of old radios, walkie-talkies, or breathing masks.
- Dynamic Interactions: Supports real-time effects like intermittent wind noise, changing echoes, or hoarseness.
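Because the model is driven by free-form text, an application that collects these attributes from a form or a config file still has to turn them into an instruction. The following is a minimal sketch of one way to do that. The `VoiceSpec` and `SceneSpec` classes, their field names, and the wording of the generated text are all hypothetical; none of them come from a Tongyi Lab or Model Studio API, and the phrasing the model actually responds to best is not specified in the announcement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceSpec:
    # Hypothetical fields mirroring the voice attributes listed above.
    gender: Optional[str] = None
    age: Optional[str] = None
    quality: Optional[str] = None
    emotion: Optional[str] = None
    role: Optional[str] = None


@dataclass
class SceneSpec:
    # Hypothetical fields mirroring the scene attributes listed above.
    ambiance: Optional[str] = None
    space: Optional[str] = None
    device: Optional[str] = None


def build_instruction(voice: VoiceSpec, scene: SceneSpec) -> str:
    """Join whichever attributes are set into one plain-English instruction."""
    voice_parts = [
        part
        for part in (voice.age, voice.gender, voice.quality, voice.emotion, voice.role)
        if part
    ]
    scene_parts = []
    if scene.ambiance:
        scene_parts.append(f"{scene.ambiance} in the background")
    if scene.space:
        scene_parts.append(f"reverb of a {scene.space}")
    if scene.device:
        scene_parts.append(f"heard through a {scene.device}")

    sentences = []
    if voice_parts:
        sentences.append("Voice: " + ", ".join(voice_parts) + ".")
    if scene_parts:
        sentences.append("Scene: " + ", ".join(scene_parts) + ".")
    return " ".join(sentences)


if __name__ == "__main__":
    voice = VoiceSpec(gender="male", age="middle-aged", quality="raspy",
                      emotion="determined", role="veteran")
    scene = SceneSpec(ambiance="battlefield sounds", space="metal cell",
                      device="walkie-talkie")
    print(build_instruction(voice, scene))
```

Keeping the attributes structured and rendering them to text only at the last step makes it easy to change the wording later without touching the code that collects user input.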
Availability
Both models are accessible via API. Developers and users can explore the capabilities through Alibaba Cloud's Model Studio platform.
This release marks a shift towards more intuitive and creative audio generation, where the only limit is the user's descriptive language.