📚 ArXiv Daily Digest

Computer Vision 2604.15309

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao et al. (15 authors)

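Core contribution: Proposes MM-WebAgent, a hierarchical agentic framework for generating stylistically consistent, globally coherent multimodal webpages, and introduces a benchmark for multimodal webpage generation together with a multi-level evaluation protocol.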

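Method: MM-WebAgent coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. It jointly optimizes the global layout, the local multimodal content, and their integration, so the resulting webpages are coherent and visually consistent.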

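Key findings: Experiments show that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. The multi-level evaluation protocol supports systematic assessment of the generated webpages.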

Original abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

📄 arXiv 📥 PDF

Computer Vision 2604.14605

Towards Design Compositing

Abhinav Mahajan, Abhikhya Tripathy, Sudeeksha Reddy Pala, Vaibhav Methi, K J Joseph et al. (6 authors)

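Core contribution: Proposes GIST, a training-free, identity-preserving image compositor. It addresses a gap in existing components-to-design pipelines, which struggle to produce harmonious compositions when the input elements are visually mismatched.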

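Method: GIST sits between layout prediction and typography generation. It applies identity-preserving stylization and compositing so that input elements from disparate sources become visually harmonious, and it can be plugged into existing components-to-design or design-refining pipelines without modification.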

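Key findings: Integrated into two substantially different existing methods, LaDeCo and Design-o-meter, GIST significantly improves visual harmony and aesthetic quality over naive pasting, as judged by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference.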

Original abstract

Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: abhinav-mahajan10.github.io/GIST/.

📄 arXiv 📥 PDF

Computer Vision 2604.15311

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng

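Core contribution: Proposes LeapAlign, which shortens the long generation trajectory to two steps. This reduces the computational cost of preference alignment by direct gradient backpropagation and lets reward gradients reach early generation steps, which determine the global structure of the final image.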

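Method: LeapAlign designs two consecutive leaps, each skipping multiple ODE sampling steps and predicting a future latent in a single step, which compresses the long trajectory into two steps. Randomizing the start and end timesteps of the leaps allows efficient and stable updates at any generation step. Shortened trajectories that are more consistent with the long generation path receive higher training weights, and gradient terms with large magnitude are down-weighted rather than removed entirely, which improves gradient stability.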

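Key findings: When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, with better image quality and image-text alignment, and it reduces computational cost.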

Original abstract

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
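The two-leap idea described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the scalar `velocity` field stands in for the learned flow model, a leap is taken to be a single linear extrapolation, the consistency weight is an exponential of the endpoint distance, and large gradient terms are rescaled to a threshold. None of these choices is confirmed by the paper.

```python
import math
import random


def velocity(x, t):
    """Toy velocity field standing in for the learned flow model (assumption)."""
    return -x


def euler_path(x, t_start, t_end, n_steps):
    """Many-step ODE sampling: the long reference trajectory."""
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def leap(x, t_from, t_to):
    """One leap: skip the intermediate steps with a single extrapolation."""
    return x + (t_to - t_from) * velocity(x, t_from)


def two_leap_trajectory(x, t_start, t_mid, t_end):
    """Two consecutive leaps replace the long trajectory."""
    x_mid = leap(x, t_start, t_mid)
    return leap(x_mid, t_mid, t_end)


def sample_leap_times(rng):
    """Randomize where the leaps start and end (three sorted timesteps)."""
    return sorted(rng.random() for _ in range(3))


def consistency_weight(x_short, x_long, temperature=1.0):
    """Hypothetical weight: larger when the short path ends near the long path."""
    return math.exp(-abs(x_short - x_long) / temperature)


def soft_downweight(grad, threshold=1.0):
    """Shrink a large gradient term to the threshold instead of dropping it."""
    magnitude = abs(grad)
    if magnitude <= threshold:
        return grad
    return grad * (threshold / magnitude)


rng = random.Random(0)
t0, t1, t2 = sample_leap_times(rng)
x_short = two_leap_trajectory(1.0, t0, t1, t2)
x_long = euler_path(1.0, t0, t2, n_steps=100)
weight = consistency_weight(x_short, x_long)
```

In a real fine-tuning loop the reward gradient would be backpropagated through only the two leaps, and `weight` would scale the loss for that sampled trajectory.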

📄 arXiv 📥 PDF

Computer Vision 2604.14910

Reward-Aware Trajectory Shaping for Few-step Visual Generation

Rui Li, Bingyu Li, Yuanzhi Liang, HuangHai Bin, Chi Zhang et al. (6 authors)

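Core contribution: Proposes Reward-Aware Trajectory Shaping (RATS), a lightweight framework that adds preference alignment to few-step generation, so that the student can keep improving under reward guidance and potentially surpass the teacher it distills from.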

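Method: Teacher and student latent trajectories are aligned at key denoising stages through horizon matching, and a reward-aware gate adaptively regulates the strength of teacher guidance based on the relative reward of teacher and student. Trajectory shaping is strengthened when the teacher achieves higher reward and relaxed when the student matches or surpasses the teacher. RATS combines trajectory distillation, reward-aware gating, and preference alignment to transfer preference-relevant knowledge.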

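Key findings: Experiments show that RATS substantially improves the efficiency-quality trade-off in few-step visual generation and significantly narrows the gap between few-step students and stronger multi-step generators, without additional test-time computational overhead.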

Original abstract

Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
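The gating behaviour described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the gate is taken to be a clamped `tanh` of the reward gap, horizon matching is reduced to a mean squared distance between latents at matched stages, and `reward_loss` is a placeholder for whatever preference-alignment objective is used.

```python
import math


def reward_gate(r_teacher, r_student, temperature=1.0):
    """Hypothetical gate in [0, 1): grows with the teacher's reward advantage
    and is zero once the student matches or surpasses the teacher."""
    gap = (r_teacher - r_student) / temperature
    return max(0.0, math.tanh(gap))


def trajectory_loss(student_latents, teacher_latents):
    """Mean squared distance between latents at matched key denoising stages."""
    total, count = 0.0, 0
    for s_stage, t_stage in zip(student_latents, teacher_latents):
        for s, t in zip(s_stage, t_stage):
            total += (s - t) ** 2
            count += 1
    return total / count


def shaped_loss(student_latents, teacher_latents, r_teacher, r_student, reward_loss):
    """Gate the teacher-imitation term and always keep the reward term."""
    gate = reward_gate(r_teacher, r_student)
    return gate * trajectory_loss(student_latents, teacher_latents) + reward_loss


# Two matched stages, each with a two-dimensional toy latent.
student = [[0.0, 0.0], [0.5, 0.5]]
teacher = [[1.0, 1.0], [0.5, 0.5]]
loss_teacher_better = shaped_loss(student, teacher, 0.9, 0.2, reward_loss=0.1)
loss_student_better = shaped_loss(student, teacher, 0.2, 0.9, reward_loss=0.1)
```

With this gate, the imitation term vanishes as soon as the student's reward reaches the teacher's, leaving only the reward-driven term. A softer gate, such as a sigmoid, would relax the constraint gradually instead.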

📄 arXiv 📥 PDF

Computer Vision 2604.14591

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

Amir El-Ghoussani, Marc Hölle, Gustavo Carneiro, Vasileios Belagiannis

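Core contribution: Proposes Masked Logit Nudging for prompt-guided image editing in visual autoregressive models. The method modifies the source image according to the target prompt while preserving regions unrelated to the requested edit.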

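Method: The method uses the source image token maps to add a guidance step that aligns the model's predictions under the target prompt with those token maps. It converts the fixed source encodings into logits using the VAR encoding and nudges the model's predicted logits toward these targets along a semantic trajectory defined by the source and target prompts. Edits are applied only within spatial masks derived from cross-attention differences between the source and edited prompts, and a refinement step corrects quantization errors and improves reconstruction quality.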

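Key findings: The method achieves the best image editing performance on the PIE benchmark at 512px and 1024px. It also delivers more faithful reconstructions than previous methods on COCO (512px) and OpenImages (1024px). Overall it outperforms VAR-related approaches and is comparable to or better than diffusion models while being much faster.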

Original abstract

We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.
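The masking and nudging steps in this abstract can be pictured with a simplified sketch. It makes several assumptions that the abstract does not confirm: attention maps are flat lists with one value per token position, source logits are a one-hot style encoding of the source token, and nudging is a linear interpolation applied at full strength outside the edit mask and not at all inside it. The semantic trajectory between prompts and the refinement step are not modeled, and all function names are hypothetical.

```python
def edit_mask(attn_source, attn_target, threshold=0.2):
    """Mark positions whose cross-attention changes between the two prompts."""
    return [abs(t - s) > threshold for s, t in zip(attn_source, attn_target)]


def source_logits(token_id, vocab_size, peak=10.0):
    """One-hot style logits for a fixed source token (assumed encoding)."""
    return [peak if k == token_id else 0.0 for k in range(vocab_size)]


def nudge(pred_logits, src_logits, strength):
    """Move predicted logits toward the source logits by `strength` in [0, 1]."""
    return [p + strength * (s - p) for p, s in zip(pred_logits, src_logits)]


def masked_logit_nudging(pred, source_tokens, mask):
    """Keep target-prompt predictions inside the mask, pull to source outside."""
    out = []
    for logits, token_id, editable in zip(pred, source_tokens, mask):
        strength = 0.0 if editable else 1.0
        out.append(nudge(logits, source_logits(token_id, len(logits)), strength))
    return out


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Two token positions, vocabulary of three codes; the model prefers code 1.
pred = [[0.0, 5.0, 1.0], [0.0, 5.0, 1.0]]
source_tokens = [2, 2]
mask = edit_mask(attn_source=[0.1, 0.1], attn_target=[0.1, 0.9])
tokens = [argmax(row) for row in masked_logit_nudging(pred, source_tokens, mask)]
```

Here the first position lies outside the mask and falls back to the source code, while the second lies inside the mask and keeps the prediction made under the target prompt.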

📄 arXiv 📥 PDF