SenseTime Open-Sources SenseNova U1 Model for Consistent Multi-Page Image-Text Generation

On June 12, SenseTime announced the open-source release of the latest addition to its SenseNova U1 family: the U1-8B-MoT-Interleaved model. Designed specifically for interleaved image-text generation scenarios, this release aims to solve the persistent pain points of character inconsistency and style drift found in traditional multimodal models.

The new model is optimized for continuous content creation, such as picture books, storybooks, multi-page presentations, and graphic tutorials. According to the announcement, the core upgrades focus on four main areas:

Narrative and Character Consistency: The model improves narrative coherence over long generation runs. It follows the storyline closely, and characters remain visually consistent from the first page to the last.
Enhanced Text-Image Alignment: Through specialized training, the model improves the semantic alignment between image content and text descriptions. Generated visuals now more accurately depict complex scenes, dynamic actions, and spatial relationships described in the text.
Improved Visual Quality: Frequent problem areas such as human anatomy, text rendering, and page layout have been targeted for optimization, resulting in a noticeable reduction in visual artifacts.
New Multi-Page PPT Generation: For the first time, the model can generate multi-page slides automatically. It extracts key points from the input content and handles layout design and text rendering on its own.

The model weights are now available on Hugging Face: SenseNova-U1-8B-MoT-Interleaved.
