JD.com Open-Sources JoyAI-Echo Framework for Long-Form Video Generation

On June 3, JD.com announced the open-source release of JoyAI-Echo, a framework for long-form audio and video generation. The company claims the release places it in the "global first tier" of long-form video AI and says it targets three persistent industry problems: character inconsistency, voice instability, and slow generation speeds.

At the core of JoyAI-Echo is a dedicated memory bank that continuously stores and retrieves character appearance features and speaker voice timbres during multi-shot generation. According to the company, this allows the system to maintain high consistency in character identity, visual likeness, and voice tone for videos up to five minutes long, preventing the common issue of characters morphing mid-scene.
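The store-and-retrieve idea can be illustrated with a toy sketch. This is not JoyAI-Echo's implementation: the class names, the plain-list embeddings, and the cosine-similarity lookup are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CharacterMemory:
    appearance: list  # hypothetical visual identity embedding
    timbre: list      # hypothetical speaker voice embedding


@dataclass
class MemoryBank:
    """Hypothetical per-character store shared across shots."""
    entries: dict = field(default_factory=dict)

    def store(self, name, appearance, timbre):
        self.entries[name] = CharacterMemory(list(appearance), list(timbre))

    def retrieve(self, name):
        return self.entries[name]

    def identify(self, appearance):
        """Return the stored character whose appearance is closest to the query."""
        return max(
            self.entries,
            key=lambda n: cosine(self.entries[n].appearance, appearance),
        )


bank = MemoryBank()
bank.store("lead", [0.9, 0.1, 0.0], [0.2, 0.8])
bank.store("rival", [0.0, 0.2, 0.9], [0.7, 0.3])

# A later shot looks up the closest known character and reuses its voice features.
who = bank.identify([0.8, 0.2, 0.1])
print(who, bank.retrieve(who).timbre)
```

In a sketch like this, each new shot conditions on the retrieved features rather than regenerating the character from scratch, which is the general mechanism the company describes for keeping identity and voice stable.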

The framework employs a memory-driven post-training pipeline that combines Supervised Fine-Tuning (SFT), cross-modal Reinforcement Learning from Human Feedback (RLHF), and Distribution Matching Distillation (DMD). JD.com says DMD alone delivers roughly a 7.5x inference speedup.

JoyAI-Echo also introduces a Director Agent, an assistant that parses natural-language requests and automatically breaks content down into scripts, characters, scenes, and shots. Complementing this is a "conversational editing" feature, which lets users modify specific elements without re-rendering entire video sequences.
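One way to picture the scene-and-shot breakdown and why it permits partial re-rendering is the sketch below. The data model is hypothetical: `Shot`, `Scene`, `Project`, and `edit_shot` are illustrative names, not part of the JoyAI-Echo codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    shot_id: str
    description: str
    characters: list
    rendered: bool = False


@dataclass
class Scene:
    title: str
    shots: list = field(default_factory=list)


@dataclass
class Project:
    scenes: list = field(default_factory=list)

    def all_shots(self):
        return [shot for scene in self.scenes for shot in scene.shots]

    def edit_shot(self, shot_id, new_description):
        """Update one shot and mark only that shot as needing a re-render."""
        for shot in self.all_shots():
            if shot.shot_id == shot_id:
                shot.description = new_description
                shot.rendered = False
                return shot
        raise KeyError(shot_id)

    def pending(self):
        """IDs of shots that still need rendering."""
        return [s.shot_id for s in self.all_shots() if not s.rendered]


project = Project(scenes=[
    Scene("Opening", [
        Shot("s1-1", "Lead walks into the cafe", ["lead"]),
        Shot("s1-2", "Rival looks up from a table", ["rival"]),
    ]),
    Scene("Confrontation", [
        Shot("s2-1", "The two argue by the window", ["lead", "rival"]),
    ]),
])

# Pretend the whole project has been rendered once.
for shot in project.all_shots():
    shot.rendered = True

# A conversational edit touches a single shot; the others stay rendered.
project.edit_shot("s1-2", "Rival looks up, wearing a red coat")
print(project.pending())
```

Because the edit is scoped to a shot rather than the whole timeline, only that shot is queued for regeneration in this sketch.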

For output quality, the framework includes a dedicated real-time super-resolution module. It supports single-step upscaling from a base resolution of 736×1280 to either 1152×1920 or 1472×2560.
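The per-axis scale factors implied by these figures can be checked with a short calculation. Taking the numbers exactly as reported, the larger target is a uniform 2x, while the smaller one works out to about 1.57x on the first dimension and 1.5x on the second. The helper name `scale_factors` is just for illustration.

```python
from fractions import Fraction

BASE = (736, 1280)
TARGETS = [(1152, 1920), (1472, 2560)]


def scale_factors(base, target):
    """Exact per-axis ratios between a target and a base resolution."""
    return Fraction(target[0], base[0]), Fraction(target[1], base[1])


for target in TARGETS:
    first, second = scale_factors(BASE, target)
    kind = "uniform" if first == second else "non-uniform"
    print(f"{target[0]}x{target[1]}: {float(first):.3f}x, {float(second):.3f}x ({kind})")
```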

The project is available on GitHub and via the official project page:

GitHub: https://github.com/jd-opensource/JoyAI-Echo
Project Page: https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/
