📚 ArXiv Daily Digest

Computer Vision 2605.11061

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, et al. (25 authors)

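Key contribution: Proposes HiDream-O1-Image, a natively unified image generative foundation model that uses a pixel-space Diffusion Transformer for end-to-end in-context visual generation. It removes the dependence on separate text encoders and external VAEs, and at 8B parameters it matches or surpasses larger models such as the 27B Qwen-Image.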

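Method: The model uses a pixel-level Unified Transformer (UiT) architecture that maps raw image pixels, text tokens, and task-specific conditions into a single shared token space, structurally unifying the multimodal inputs. Under this native encoding paradigm, diverse generation and editing tasks (text-to-image generation, instruction-based editing, subject-driven personalization) are treated as one consistent in-context reasoning process, with no separate VAE or pre-trained text encoder.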
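
The idea of a single shared token space can be illustrated with a toy sketch. This is not the authors' implementation: the modality tags, the `patchify` and `build_sequence` helpers, the grayscale image, and the patch size are all assumptions made for illustration. The sketch only shows text tokens and raw pixel patches ending up in one sequence that a single Transformer could consume.

```python
from typing import List, Tuple

# Hypothetical modality tags; the report does not say how tokens are typed.
TEXT, PIXEL = "text", "pixel"

def patchify(image: List[List[float]], patch: int) -> List[List[float]]:
    """Split an H x W grayscale image into flattened patch x patch blocks (row-major)."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches

def build_sequence(text_ids: List[int], image: List[List[float]],
                   patch: int) -> List[Tuple[str, object]]:
    """Concatenate text tokens and raw pixel patches into one tagged sequence."""
    seq: List[Tuple[str, object]] = [(TEXT, t) for t in text_ids]
    seq += [(PIXEL, p) for p in patchify(image, patch)]
    return seq

# Toy 4x4 image and three made-up text token ids.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
sequence = build_sequence([101, 7, 42], image, patch=2)
print(len(sequence))  # 3 text tokens + 4 pixel patches = 7
```

In a real model each entry would be projected to a common embedding width before entering the Transformer; that projection is omitted here.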

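Key findings: Experiments show that HiDream-O1-Image performs strongly across a range of generation tasks, and the 8B version matches or surpasses larger models such as the 27B Qwen-Image. The architecture was also scaled to over 200B parameters (HiDream-O1-Image-Pro). The authors report that this version sets new state-of-the-art results on multiple benchmarks, which they present as evidence that the natively unified architecture scales.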

Original abstract

The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within an Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version HiDream-O1-Image-Pro (200B+) unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.

Computer Vision 2605.10730

Qwen-Image-2.0 Technical Report

Bing Zhao, Chenfei Wu, Deqing Li, Hao Meng, Jiahao Li, et al. (75 authors)

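Key contribution: Proposes Qwen-Image-2.0, a multimodal foundation model that unifies high-fidelity image generation and precise editing, with marked progress on ultra-long text rendering, multilingual typography, high-resolution photorealistic generation, and complex instruction following.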

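Method: The model couples Qwen3-VL, used as the condition encoder, with a Multimodal Diffusion Transformer for joint condition-target modeling. Large-scale data curation and a customized multi-stage training pipeline strengthen multimodal understanding while preserving flexible generation and editing. The model accepts instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics.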

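Key findings: Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, with notable gains in multilingual text fidelity, typography quality, photorealistic detail, texture and lighting coherence, and reliable following of complex prompts.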

Original abstract

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.

Computer Vision 2605.12500

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, et al. (58 authors)

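Key contribution: Proposes SenseNova-U1, a natively unified multimodal paradigm that removes the architectural split between understanding and generation found in conventional vision-language models, lets the two evolve together in a single model, and reaches top-tier performance across a range of tasks.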

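Method: Built on NEO-unify, the work releases two natively unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. The design treats understanding and generation as synergistic views of a single underlying process, and the report details the model design, data preprocessing, pre-training, post-training, and inference strategies. The models support any-to-image (X2I) synthesis and interleaved vision-language generation.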

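Key findings: The models rival top-tier understanding-only VLMs on text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. They also deliver strong semantic consistency and visual fidelity, and perform well on complex text-rich infographic generation and interleaved vision-language generation. Preliminary evidence suggests they also perform strongly in vision-language-action (VLA) and world model (WM) scenarios.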

Original abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

Computer Vision 2605.12495

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao

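Key contribution: Proposes AlphaGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs), improving multimodal generation without an additional cold-start stage. It also introduces the Decompositional Verifiable Reward (DVReward), which provides stable, interpretable supervision for complex multimodal generation tasks.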

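Method: AlphaGRPO optimizes UMMs with reinforcement learning based on GRPO. To supply supervision for real-world multimodal generation, DVReward uses an LLM to decompose a complex user request into atomic, verifiable semantic and quality questions, which a general MLLM then evaluates one by one to produce a reliable, interpretable reward signal. With no additional cold-start stage, the approach draws out two capabilities: Reasoning Text-to-Image Generation, where the model infers implicit user intent, and Self-Reflective Refinement, where it diagnoses and corrects misalignments in its own outputs.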
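
A rough sketch of how a decomposed reward could feed a group-relative advantage follows. It rests on several assumptions not stated in the summary: the reward is taken to be the unweighted fraction of atomic checks that pass, the questions are written by hand rather than produced by an LLM, the `judge` callable is a mock stand-in for the MLLM evaluator, and the advantage uses the population standard deviation. The names `dv_reward` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List

def dv_reward(questions: List[str], judge: Callable[[str], bool]) -> float:
    """Fraction of atomic yes/no checks that the judge marks as satisfied."""
    if not questions:
        return 0.0
    return sum(1.0 for q in questions if judge(q)) / len(questions)

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hand-written atomic checks for the prompt "a red cube left of a blue sphere".
questions = [
    "Is there a cube?",
    "Is the cube red?",
    "Is there a sphere?",
    "Is the cube left of the sphere?",
]

# Each sampled output is mocked as the set of checks it passes.
samples = [set(questions), set(questions[:2]), set()]
rewards = [dv_reward(questions, lambda q, s=s: q in s) for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)  # [1.0, 0.5, 0.0]
```

Outputs that pass more checks than the group average receive a positive advantage and the rest a negative one, so no learned value function is needed.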

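Key findings: AlphaGRPO yields robust improvements on the multimodal generation benchmarks GenEval, TIIF-Bench, DPG-Bench, and WISE, and also achieves significant gains on GEdit editing tasks without any training on editing. The authors take these results as validation that self-reflective reinforcement learning can use the model's inherent understanding to guide high-fidelity generation.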

Original abstract

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/

Computer Vision 2605.12309

G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

Junxian Li, Kai Liu, Zizhong Ding, Zhixin Wang, Zhikai Chen, et al. (7 authors)

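Key contribution: Proposes G$^2$TR, a training-free, plug-and-play visual token reduction framework that cuts the visual tokens and prefill computation of separate-encoder unified multimodal models by 1.94x while maintaining reasoning accuracy and image editing quality.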

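Method: G$^2$TR uses a task-agnostic signal from the generation branch to score visual tokens, estimating each token's importance from its consistency with the VAE latent. It then performs balanced token selection and merges redundant tokens into the retained representative tokens to reduce information loss. The method is applied only after the understanding encoding stage, which keeps it compatible with existing UMM inference pipelines.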
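
The select-then-merge step can be sketched as below. This is an illustration under assumptions, not the paper's algorithm: importance scores are supplied by hand in place of the VAE-latent consistency signal, "balanced" selection is interpreted as keeping the top-scoring tokens within each fixed-size window, and each dropped token is averaged into its most cosine-similar retained token. The function name `reduce_tokens` and its parameters are hypothetical.

```python
import math
from typing import Dict, List

Vec = List[float]

def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def reduce_tokens(tokens: List[Vec], scores: List[float],
                  window: int, keep_per_window: int) -> List[Vec]:
    """Keep the top-scoring tokens in each window, then merge the rest into them."""
    kept: List[int] = []
    for start in range(0, len(tokens), window):
        idx = list(range(start, min(start + window, len(tokens))))
        idx.sort(key=lambda i: scores[i], reverse=True)
        kept.extend(sorted(idx[:keep_per_window]))
    kept_set = set(kept)
    # Each retained token starts as its own group; dropped tokens join the nearest one.
    groups: Dict[int, List[Vec]] = {k: [tokens[k]] for k in kept}
    for i, tok in enumerate(tokens):
        if i in kept_set:
            continue
        nearest = max(kept, key=lambda k: cosine(tok, tokens[k]))
        groups[nearest].append(tok)
    # Represent each group by the element-wise mean of its members.
    return [[sum(col) / len(groups[k]) for col in zip(*groups[k])] for k in kept]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.2, 0.1, 0.8]  # hand-picked stand-in for the importance signal
reduced = reduce_tokens(tokens, scores, window=2, keep_per_window=1)
print(len(tokens) / len(reduced))  # 2.0
```

Merging rather than discarding means the information in dropped tokens still contributes to the retained representatives, which is the stated motivation for the merge step.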

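Key findings: On image understanding and editing benchmarks, G$^2$TR reduces visual tokens and prefill computation by 1.94x while maintaining reasoning accuracy and editing quality, and it outperforms baselines on almost all benchmarks.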

Original abstract

The development of separate-encoder Unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity and so on, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
