Alibaba's Qwen3.5-LiveTranslate Achieves 2.8-Second Latency for Real-Time Speech Translation

Alibaba’s Tongyi Lab has unveiled Qwen3.5-LiveTranslate-Flash, a real-time speech translation model designed to overcome the persistent challenges of latency, language coverage, and unnatural voice synthesis in simultaneous interpretation.

Building on its predecessor, the new model cuts end-to-end average word latency to 2.8 seconds, while expanding input language support from 18 to 60 languages and output speech support from 10 to 29 languages. It also introduces real-time voice cloning, so that translated speech retains the speaker’s original tone and emotional expression.

Key Technical Advances

Readable Unit Strategy: Qwen3.5-LiveTranslate-Flash employs a novel “Readable Unit” real-time translation technique that balances latency with semantic coherence. Compared with the previous Qwen3-LiveTranslate-Flash, first-word latency drops by 3.45 seconds and average word latency by 1.88 seconds, with minimal quality loss.
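The announcement does not say how a Readable Unit is delimited. Purely as an illustration of the latency-versus-coherence trade-off, the Python sketch below buffers incoming words and releases them as soon as a boundary is reached. The boundary rule (trailing punctuation or a word cap), the function name `segment_stream`, and the `max_words` parameter are all assumptions made for this example, not Qwen's method.

```python
from typing import Iterable, Iterator, List

# Assumed boundary markers; the real Readable Unit criteria are not published here.
BOUNDARY_CHARS = (",", ".", "?", "!", ";", ":")


def segment_stream(words: Iterable[str], max_words: int = 6) -> Iterator[str]:
    """Group a stream of words into small units that can be translated early.

    A unit is flushed when a word ends with punctuation (a likely semantic
    boundary) or when the buffer reaches max_words (a latency cap).
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if word.endswith(BOUNDARY_CHARS) or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever remains when the stream ends
        yield " ".join(buffer)


if __name__ == "__main__":
    text = "Hello everyone, welcome to the show today and thanks for joining us."
    for unit in segment_stream(text.split(), max_words=4):
        print(unit)
```

A smaller `max_words` hands text to the translator sooner but with less context per unit; a larger value does the opposite. That is the trade-off the paragraph above describes.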

Dynamic Voice Cloning: The model automatically captures and replicates the speaker’s voice characteristics during interpretation. This maintains identity consistency across languages, making it ideal for scenarios where multiple speakers, such as hosts and guests, must be clearly distinguished.

Hotword Engine: An integrated dynamic hotword engine supports up to 1,000 custom entries, prioritizing accurate recognition and translation of names, places, brands, and industry-specific jargon. This significantly reduces errors in technical, medical, legal, and financial contexts.
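To make the 1,000-entry limit concrete, here is a minimal Python sketch of a client-side glossary that enforces the same cap before entries are sent to a translation service. The class `HotwordRegistry` and its methods are hypothetical names chosen for this example; the article does not describe the actual hotword API.

```python
from typing import Dict, Optional


class HotwordRegistry:
    """Hypothetical glossary of source terms and their preferred translations."""

    MAX_ENTRIES = 1000  # cap quoted in the article

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    @staticmethod
    def _normalize(term: str) -> str:
        # Case-insensitive matching is an assumption of this sketch.
        return term.strip().casefold()

    def add(self, source: str, target: str) -> None:
        key = self._normalize(source)
        if not key:
            raise ValueError("source term must not be empty")
        # Updating an existing entry is allowed even when the registry is full.
        if key not in self._entries and len(self._entries) >= self.MAX_ENTRIES:
            raise ValueError(f"hotword limit of {self.MAX_ENTRIES} reached")
        self._entries[key] = target

    def lookup(self, term: str) -> Optional[str]:
        return self._entries.get(self._normalize(term))

    def __len__(self) -> int:
        return len(self._entries)


if __name__ == "__main__":
    registry = HotwordRegistry()
    registry.add("Tongyi Lab", "通义实验室")
    print(registry.lookup("tongyi lab"))
```

Rejecting the 1,001st entry locally, rather than silently dropping it, lets a glossary maintainer decide which terms matter most.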

Proven Performance in Real-World Scenarios

Cross-Border Meetings & Travel: Handles multi-language mixing and accents, seamlessly segmenting and translating successive speakers.
Live Commerce & Content Export: Achieves high accuracy for numbers, prices, and product specifications, ensuring smooth streaming and sales.
Cultural & Historical Content: Correctly translates Classical Chinese text while preserving cultural nuances.
Visual Disambiguation: Leverages multimodal understanding to resolve ambiguous terms by incorporating visual context.

Benchmark Results

On public benchmarks such as FLEURS and CoVoST2, Qwen3.5-LiveTranslate-Flash outperforms both the previous Qwen3 generation and current mainstream speech large models in translation accuracy and language coverage.

Availability

A demo of the model is available at omni.qwen.ai/live-translate, with a detailed blog post at qwen.ai/blog. Integration with Alibaba Cloud Bailian is coming soon.

Qwen3.5-LiveTranslate-Flash is built on the Qwen3.5-Omni Thinker-Talker architecture, employing chunk-wise streaming input and real-time voice cloning to deliver a new generation of speech-to-speech infrastructure for global collaboration, content export, and smart hardware.
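As a rough picture of what chunk-wise streaming input means on the client side, the Python sketch below slices a buffer of audio samples into fixed-duration chunks that could be sent one at a time. The 16 kHz sample rate, the 200 ms chunk length, and the function name `iter_chunks` are assumptions for illustration; the article does not state the chunk size or format the model uses.

```python
from typing import Iterator, Sequence


def iter_chunks(
    samples: Sequence[int],
    sample_rate: int = 16_000,  # assumed sample rate in Hz
    chunk_ms: int = 200,        # assumed chunk duration in milliseconds
) -> Iterator[Sequence[int]]:
    """Yield consecutive fixed-duration slices of an audio sample buffer.

    The final chunk may be shorter than the others.
    """
    chunk_size = sample_rate * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("sample_rate and chunk_ms must give at least one sample per chunk")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


if __name__ == "__main__":
    half_second_of_silence = [0] * 8_000
    for index, chunk in enumerate(iter_chunks(half_second_of_silence)):
        print(index, len(chunk))
```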
