NetEase Youdao Open-Sources Confucius 4 Multimodal and TTS Models

NetEase Youdao announced today that it is open-sourcing the two core engines of version 4.0 of its "Confucius" (Ziyue) large model, namely the multimodal model and the text-to-speech (TTS) model, to developers worldwide. Anyone can download, deploy, and build on these models for free.

Multimodal Model (27B Parameters)

The open-source multimodal model is designed for education scenarios and achieves state-of-the-art (SOTA) performance among models of similar size on visual math reasoning tasks. It is particularly strong on complex chart-based and high-difficulty visual math problems. On text-only Chinese math reasoning, it reaches an accuracy of 81.4%.

A key innovation is a refined chain-of-thought reconstruction scheme. By training on a large corpus of high-quality, streamlined reasoning samples, the model shortens the output of its reasoning chain by 43.2%. That means fewer tokens, shorter inference paths, and faster answers, which directly lowers inference costs for developers and enterprises.
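To make the cost implication concrete, here is a rough back-of-the-envelope sketch in Python. Only the 43.2% reduction comes from the announcement. The baseline token count, the per-token price, and the assumption that cost scales linearly with output tokens are hypothetical and used purely for illustration.

```python
def compressed_tokens(baseline_tokens: int, reduction: float = 0.432) -> int:
    """Estimate reasoning-chain length after compression.

    `reduction` defaults to the 43.2% figure quoted in the announcement.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(baseline_tokens * (1.0 - reduction))


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost of generating `tokens` output tokens, assuming linear pricing."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    baseline = 1000   # hypothetical reasoning-chain length, in tokens
    price = 2.0       # hypothetical price per million output tokens
    shorter = compressed_tokens(baseline)
    print(f"tokens: {baseline} -> {shorter}")
    print(f"cost per answer: {output_cost(baseline, price):.6f} -> "
          f"{output_cost(shorter, price):.6f}")
```

Under these assumptions, a 1,000-token reasoning chain shrinks to roughly 568 tokens, and the per-answer output cost falls by the same proportion. Actual savings depend on how a given deployment bills and batches inference.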

The team also fine-tuned the model specifically for real student homework, exams, and question-asking scenarios encountered in Chinese education, addressing complex pain points in authentic learning environments.

TTS Model with Cross-Lingual Emotion Cloning

The open-source speech synthesis model supports cross-lingual voice emotion cloning and transfer. Users can upload a short Chinese audio clip, and the model will clone the speaker's timbre and speak fluently in English, Korean, Vietnamese, and other languages, with no trace of a "Chinese accent." Emotions are also transferred: a sentence spoken in anger will be synthesized in the target language with the same angry tone.

Key specs:

3 seconds: Zero-shot original voice cloning after uploading any audio sample.
97%: Accuracy on cloning tasks, with voice similarity exceeding 85%.
14 languages: Supports Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese.

Open-Source Links

Multimodal model: huggingface.co/netease-youdao/Confucius4
TTS model: github.com/netease-youdao/Confucius4-TTS

Both models are available under open-source licenses, enabling developers to integrate cutting-edge educational AI into their own applications.
