NetEase Youdao Open-Sources Confucius 4 Multimodal and TTS Models
NetEase Youdao has fully open-sourced the core dual engines of its Confucius 4 large model: a 27B-parameter multimodal model that achieves state-of-the-art (SOTA) results among similarly sized models on visual math problems, and a text-to-speech (TTS) model that supports cross-lingual emotion cloning across 14 languages. Both models are now available for free download and commercial use.
NetEase Youdao announced today that it is open-sourcing the core dual engines of its "Confucius" (Ziyue) large model version 4.0 — the multimodal model and the Text-to-Speech (TTS) model — to developers worldwide. The move allows anyone to download, deploy, and build upon these models for free.
Multimodal Model (27B Parameters)
The open-source multimodal model is designed for education scenarios and achieves state-of-the-art (SOTA) performance among models of similar size on visual math reasoning tasks. It excels at complex chart-based and high-difficulty visual math problems. On text-only Chinese math reasoning, it reaches an accuracy of 81.4%.
A key innovation is a refined chain-of-thought reconstruction scheme. By leveraging a large corpus of high-quality, streamlined reasoning samples, the model compresses the output length of its reasoning chain by 43.2%. This means fewer tokens, shorter inference paths, and faster answers — directly lowering inference costs for developers and enterprises.
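To make the effect of the 43.2% figure concrete, the following is a rough back-of-the-envelope sketch rather than anything published by NetEase Youdao. It assumes that cost scales linearly with the number of reasoning tokens generated; the function name `estimate_savings`, the baseline token count, and the price per 1,000 tokens are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Reduction in reasoning-chain output length reported in the article.
COMPRESSION = 0.432


@dataclass
class ChainEstimate:
    baseline_tokens: int    # reasoning tokens before compression (assumed)
    compressed_tokens: int  # reasoning tokens after compression
    tokens_saved: int       # difference between the two
    cost_saved: float       # saving under the assumed linear price


def estimate_savings(baseline_tokens: int,
                     price_per_1k_tokens: float,
                     compression: float = COMPRESSION) -> ChainEstimate:
    """Estimate tokens and cost saved for one answer.

    Assumes cost is proportional to output tokens, which is a
    simplification: real serving cost also depends on prompt length,
    batching, and hardware.
    """
    if not 0.0 <= compression < 1.0:
        raise ValueError("compression must be in [0, 1)")
    compressed = round(baseline_tokens * (1.0 - compression))
    saved = baseline_tokens - compressed
    return ChainEstimate(
        baseline_tokens=baseline_tokens,
        compressed_tokens=compressed,
        tokens_saved=saved,
        cost_saved=saved / 1000 * price_per_1k_tokens,
    )


if __name__ == "__main__":
    # Hypothetical inputs: a 1,200-token reasoning chain at 0.002 per 1k tokens.
    print(estimate_savings(baseline_tokens=1200, price_per_1k_tokens=0.002))
```

Under these assumptions the saving per answer is small in absolute terms, but it applies to every request, so it grows in proportion to traffic.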
The team also fine-tuned the model specifically for real homework, exam, and student Q&A scenarios in Chinese education, targeting the difficulties that arise in authentic learning environments.
TTS Model with Cross-Lingual Emotion Cloning
The open-source speech synthesis model supports cross-lingual voice emotion cloning and transfer. Users can upload a short Chinese audio clip, and the model will clone the speaker's timbre and speak fluently in English, Korean, Vietnamese, and other languages — without any "Chinese accent." Emotions are also precisely transferred: a sentence spoken in anger will be synthesized in the target language with the same angry tone.
Key specs:
- 3 seconds: Zero-shot cloning of the original voice from any uploaded audio sample.
- 97%: Accuracy on cloning tasks, with voice similarity exceeding 85%.
- 14 languages: Supports Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese.
Open-Source Links
- Multimodal model: huggingface.co/netease-youdao/Confucius4
- TTS model: github.com/netease-youdao/Confucius4-TTS
Both models are available under open-source licenses, enabling developers to integrate cutting-edge educational AI into their own applications.
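For developers who want to fetch the multimodal weights programmatically, a minimal sketch follows. It assumes the Hugging Face link above resolves to a public model repository and that the `huggingface_hub` package is installed; the helper names `repo_id_from_link` and `download_model` and the local directory are hypothetical, and the article does not describe an official download procedure.

```python
from urllib.parse import urlparse


def repo_id_from_link(link: str) -> str:
    """Turn 'huggingface.co/<org>/<name>' into the '<org>/<name>' repo id."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(link if "://" in link else "https://" + link)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "huggingface.co" or len(parts) < 2:
        raise ValueError(f"not a Hugging Face model link: {link!r}")
    return "/".join(parts[:2])


def download_model(link: str, local_dir: str) -> str:
    """Download every file in the repository and return the local path.

    Requires network access and `pip install huggingface_hub`.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_from_link(link), local_dir=local_dir)


if __name__ == "__main__":
    link = "huggingface.co/netease-youdao/Confucius4"
    print(repo_id_from_link(link))
    # Uncomment to download; a 27B-parameter model needs substantial disk space.
    # download_model(link, local_dir="./confucius4")
```

The TTS model is hosted on GitHub, so it would be obtained by cloning that repository rather than through this helper.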