Alibaba's Qwen3.5-LiveTranslate Achieves 2.8-Second Latency for Real-Time Speech Translation
Alibaba's Qwen3.5-LiveTranslate-Flash reduces end-to-end average word latency to 2.8 seconds, supports 60 input and 29 output languages, and introduces real-time voice cloning and a dynamic hotword engine for cross-border live streaming, meetings, and similar scenarios.
Alibaba’s Tongyi Lab has unveiled Qwen3.5-LiveTranslate-Flash, a real-time speech translation model designed to overcome the persistent challenges of latency, language coverage, and unnatural voice synthesis in simultaneous interpretation.
Building on its predecessor, the new model cuts end-to-end average word latency to just 2.8 seconds, while expanding input language support from 18 to 60 languages and speech output from 10 to 29 languages. It also introduces real-time voice cloning, so that translated speech retains the speaker’s original tone and emotional expression.
Key Technical Advances
Readable Unit Strategy: Qwen3.5-LiveTranslate-Flash employs a novel “Readable Unit” real-time translation technique that balances latency with semantic coherence. Compared with the previous Qwen3-LiveTranslate-Flash, first-word latency drops by 3.45 seconds and average word latency by 1.88 seconds, with minimal quality loss.
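To make the two latency figures concrete, here is a minimal Python sketch of how first-word and average word latency could be computed from per-word timestamps. The one-to-one pairing of spoken and emitted words is a simplifying assumption (real translation rarely aligns word for word), the function names are hypothetical, and the announcement does not specify how Alibaba measures these metrics.

```python
from statistics import mean
from typing import Sequence


def first_word_latency(spoken_at: Sequence[float], emitted_at: Sequence[float]) -> float:
    """Seconds between the first spoken word and the first emitted word."""
    if not spoken_at or not emitted_at:
        raise ValueError("both timestamp lists must be non-empty")
    return emitted_at[0] - spoken_at[0]


def average_word_latency(spoken_at: Sequence[float], emitted_at: Sequence[float]) -> float:
    """Mean delay per word, assuming word i of the output pairs with word i of the input."""
    if not spoken_at or len(spoken_at) != len(emitted_at):
        raise ValueError("timestamp lists must be non-empty and of equal length")
    return mean(e - s for s, e in zip(spoken_at, emitted_at))


# Figures quoted in the article. If the 2.8 s and 1.88 s numbers refer to the
# same metric (an assumption), the predecessor's average word latency follows.
NEW_AVERAGE_LATENCY_S = 2.8
AVERAGE_LATENCY_REDUCTION_S = 1.88
implied_previous_average_s = round(NEW_AVERAGE_LATENCY_S + AVERAGE_LATENCY_REDUCTION_S, 2)
```

Under that assumption, the predecessor's implied average word latency would be about 4.68 seconds.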
Dynamic Voice Cloning: The model automatically captures and replicates the speaker’s voice characteristics during interpretation. This maintains identity consistency across languages, making it ideal for scenarios where multiple speakers, such as hosts and guests, must be clearly distinguished.
Hotword Engine: An integrated dynamic hotword engine supports up to 1,000 custom entries, prioritizing accurate recognition and translation of names, places, brands, and industry-specific jargon. This significantly reduces errors in technical, medical, legal, and financial contexts.
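As a rough illustration of what a capped custom-term list involves, the Python sketch below keeps a glossary of up to 1,000 entries and enforces preferred translations on text. Everything here is hypothetical: the class name, its methods, and the post-hoc replacement approach are assumptions for illustration only. The actual engine presumably biases recognition and translation inside the model, and its interface is not described in the announcement.

```python
import re

MAX_HOTWORDS = 1000  # entry cap quoted in the article


class HotwordGlossary:
    """Hypothetical glossary mapping source terms to preferred translations."""

    def __init__(self, max_entries: int = MAX_HOTWORDS) -> None:
        self.max_entries = max_entries
        self._entries: dict[str, str] = {}

    def add(self, source_term: str, target_term: str) -> None:
        key = source_term.casefold()
        # Updating an existing term is allowed; only new terms count against the cap.
        if key not in self._entries and len(self._entries) >= self.max_entries:
            raise ValueError(f"glossary is limited to {self.max_entries} entries")
        self._entries[key] = target_term

    def apply(self, text: str) -> str:
        """Replace every known term, trying longer terms first so they win over prefixes."""
        if not self._entries:
            return text
        terms = sorted(self._entries, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
        return pattern.sub(
            lambda m: self._entries.get(m.group(0).casefold(), m.group(0)), text
        )
```

Sorting terms by length before building the pattern means a multi-word name such as "Tongyi Lab" is matched as a whole rather than being split by a shorter entry like "Tongyi".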
Proven Performance in Real-World Scenarios
- Cross-Border Meetings & Travel: Handles multi-language mixing and accents, seamlessly segmenting and translating successive speakers.
- Live Commerce & Content Export: Achieves high accuracy for numbers, prices, and product specifications, ensuring smooth streaming and sales.
- Cultural & Historical Content: Correctly translates Classical Chinese text while preserving cultural nuances.
- Visual Disambiguation: Leverages multimodal understanding to resolve ambiguous terms by incorporating visual context.
Benchmark Results
On public benchmarks such as FLEURS and CoVoST2, Qwen3.5-LiveTranslate-Flash outperforms both the previous Qwen3 generation and current mainstream speech large models in translation accuracy and language coverage.
Availability
A demo of the model is available at omni.qwen.ai/live-translate, with a detailed blog post at qwen.ai/blog. Alibaba Cloud Bailian integration is coming soon.
Qwen3.5-LiveTranslate-Flash is built on the Qwen3.5-Omni Thinker-Talker architecture, employing chunk-wise streaming input and real-time voice cloning to deliver a new generation of speech-to-speech infrastructure for global collaboration, content export, and smart hardware.
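Chunk-wise streaming input means audio is fed to the model in short segments rather than as one complete recording. The Python sketch below shows only the generic splitting step. The 200 ms chunk size, the 16 kHz sample rate in the usage lines, and the function name are illustrative assumptions and are not taken from the announcement.

```python
from typing import Iterator, Sequence


def iter_chunks(
    samples: Sequence[int], sample_rate: int, chunk_ms: int = 200
) -> Iterator[Sequence[int]]:
    """Yield consecutive fixed-duration slices of a mono sample buffer.

    The final slice may be shorter than the others if the buffer length
    is not a multiple of the chunk length.
    """
    chunk_len = sample_rate * chunk_ms // 1000  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("sample_rate and chunk_ms must give at least one sample per chunk")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# Usage: 1.0625 s of silence at an assumed 16 kHz, split into 200 ms chunks.
audio = [0] * 17000
chunk_sizes = [len(chunk) for chunk in iter_chunks(audio, sample_rate=16000)]
```

In a streaming pipeline, each chunk would be passed downstream as soon as it is available, which is what allows translation to begin before the speaker has finished a sentence.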