SeeThrough3D: Occlusion-Aware 3D Control in Text-to-Image Generation
Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
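Core contribution: The paper proposes SeeThrough3D, a model that explicitly models inter-object occlusion in 3D layout-conditioned generation, an aspect the authors identify as fundamental yet overlooked. It synthesizes partially occluded objects with depth-consistent geometry and scale, and provides precise camera viewpoint control.
Method: The method introduces an occlusion-aware 3D scene representation (OSCR) in which objects are depicted as translucent 3D boxes placed in a virtual environment and rendered from the specified camera viewpoint; the transparency encodes the occluded regions. A pretrained flow-based text-to-image generation model is then conditioned on visual tokens derived from this rendered 3D representation. In addition, masked self-attention binds each object bounding box to its corresponding text description, which prevents the attributes of different objects from mixing.
Key findings: Experiments show that SeeThrough3D generates images that follow the input 3D layout, with realistic occlusions and consistent camera control. Trained on a synthetic multi-object dataset with strong inter-object occlusions, the method generalizes to unseen object categories and avoids attribute mixing in multi-object generation.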
Original abstract:
We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from a desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
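The masked self-attention binding described above can be illustrated with a small sketch. This is not the authors' implementation: the token layout, the `build_binding_mask` helper, and the masking rule (tokens tied to different objects may not attend to each other, while shared image tokens are unrestricted) are all assumptions chosen only to show the general idea of an attention mask that keeps each box paired with its own description.

```python
def build_binding_mask(tokens):
    """Return a boolean matrix; mask[q][k] is True when query q may attend to key k.

    `tokens` is a list of (kind, object_id) pairs. object_id is None for tokens
    that belong to no single object (for example image tokens).
    """
    size = len(tokens)
    mask = [[True] * size for _ in range(size)]
    for q, (_, q_obj) in enumerate(tokens):
        for k, (_, k_obj) in enumerate(tokens):
            # Assumed rule: block attention only between tokens of different objects.
            if q_obj is not None and k_obj is not None and q_obj != k_obj:
                mask[q][k] = False
    return mask


# Hypothetical token sequence for a two-object scene.
tokens = [
    ("text", 0), ("text", 1),  # per-object descriptions, e.g. "red mug", "blue book"
    ("box", 0), ("box", 1),    # visual tokens from each object's rendered box
    ("image", None),           # shared image token, not tied to one object
]
mask = build_binding_mask(tokens)

for (kind, obj), row in zip(tokens, mask):
    print(f"{kind:>5} {obj!s:>4}:", "".join("1" if allowed else "." for allowed in row))
```

In a real attention layer such a mask would be applied to the attention logits before the softmax, so that a description of one object cannot leak its attributes into another object's box tokens.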
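SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation
Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, et al. (12 authors)
Core contribution: The paper proposes an end-to-end framework that reframes structured 3D scene generation as a global correlation assignment problem. By introducing an entropy-regularized optimal transport objective inside the denoising loop, it addresses the difficulty existing methods have in organizing parts into distinct instances in open-world scenes.
Method: A debiased clustering probe first reveals that the main cause of failure is the lack of structural constraints inside the model. Scene generation is then formulated as a global correlation assignment problem, and an entropy-regularized optimal transport objective is constructed and solved within the denoising loop of a compositional DiT model. The resulting transport plan gates cross-attention to enforce exclusive, one-to-one routing between image patches and part-level 3D latents, while the competitive nature of the transport, regularized by an edge-based cost, encourages similar patches to group into coherent objects.
Key findings: Experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. The optimal transport constraints prevent part entanglement and fragmentation, yielding structurally plausible 3D scenes.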
Original abstract:
We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.
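To make the entropic optimal transport step more concrete, here is a minimal sketch that uses Sinkhorn iterations, a standard solver for entropy-regularized OT. The abstract does not say which solver SceneTransporter uses, so the solver choice, the toy cost matrix, the uniform marginals, and the way the plan is turned into per-patch gates are all assumptions made for illustration.

```python
import math


def sinkhorn(cost, row_mass, col_mass, epsilon=0.1, iterations=200):
    """Approximate the entropic OT plan between rows (patches) and columns (latents)."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost gives a high affinity.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy setup: 4 image patches, 2 part-level latents (values are made up).
cost = [
    [0.0, 1.0],
    [0.1, 0.9],
    [1.0, 0.0],
    [0.9, 0.1],
]
row_mass = [0.25] * 4  # each patch carries equal mass
col_mass = [0.5, 0.5]  # each latent must receive half of the total mass

plan = sinkhorn(cost, row_mass, col_mass)

# Assumed gating scheme: normalize each patch's row so its gates sum to 1.
gates = [[plan[i][j] / row_mass[i] for j in range(2)] for i in range(4)]
for i, row in enumerate(gates):
    print(f"patch {i}: " + ", ".join(f"{g:.3f}" for g in row))
```

Because every latent must receive a fixed share of the mass, patches compete for latents, which is the property the abstract credits with grouping similar patches and discouraging fragmentation.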
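Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, et al. (9 authors)
Core contribution: The paper proposes DPCache, a training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. Dynamic programming selects an optimal sequence of key timesteps, which substantially speeds up inference while maintaining, and in some cases improving, generation quality.
Method: DPCache first uses a small calibration set to build a Path-Aware Cost Tensor that quantifies the path-dependent error of skipping timesteps, conditioned on the preceding key timestep. It then applies dynamic programming to globally select the sequence of key timesteps that minimizes the total path cost while preserving the fidelity of the denoising trajectory. At inference time, the model performs full computation only at these key timesteps and predicts intermediate outputs efficiently from cached features.
Key findings: Extensive experiments on DiT, FLUX, and HunyuanVideo show that DPCache achieves strong acceleration with minimal quality loss. On FLUX, it outperforms prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpasses the full-step baseline by +0.028 ImageReward at 3.54× speedup, supporting the effectiveness of the path-aware global scheduling framework.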
Original abstract:
Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
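The key-timestep selection can be sketched as a small dynamic program. This is a hedged illustration, not the released DPCache code: the function names, the requirement that the first and last timesteps are always key timesteps, and the toy cost function standing in for the calibrated Path-Aware Cost Tensor are assumptions. The only property taken from the abstract is that the cost of a jump depends on the preceding key timestep, which is why the DP state tracks the two most recent keys.

```python
def path_cost(cost, keys):
    """Total cost of a key-timestep path; the first jump has no earlier key."""
    total = cost(keys[0], keys[0], keys[1])
    for prev, cur, nxt in zip(keys, keys[1:], keys[2:]):
        total += cost(prev, cur, nxt)
    return total


def plan_key_timesteps(cost, num_steps, num_keys):
    """Pick `num_keys` key timesteps (including 0 and num_steps - 1) with minimal cost.

    cost(p, i, j) is the error of jumping from key i to key j when the key
    before i was p.
    """
    if not 2 <= num_keys <= num_steps:
        raise ValueError("need 2 <= num_keys <= num_steps")
    last = num_steps - 1
    # State (i, j): the two most recent keys. Value: (best total cost, best path).
    layer = {(0, j): (cost(0, 0, j), [0, j]) for j in range(1, num_steps)}
    for _ in range(num_keys - 2):
        next_layer = {}
        for (i, j), (total, path) in layer.items():
            for k in range(j + 1, num_steps):
                candidate = total + cost(i, j, k)
                if (j, k) not in next_layer or candidate < next_layer[(j, k)][0]:
                    next_layer[(j, k)] = (candidate, path + [k])
        layer = next_layer
    finals = [value for (_, j), value in layer.items() if j == last]
    return min(finals, key=lambda value: value[0])


def toy_cost(p, i, j):
    """Made-up cost: skipping more steps hurts, more so after a long previous jump."""
    skipped = j - i - 1
    return skipped ** 2 * (1.0 + 0.1 * (i - p))


total, keys = plan_key_timesteps(toy_cost, num_steps=10, num_keys=4)
print("key timesteps:", keys, "total cost:", round(total, 3))
```

A fixed or greedy schedule would decide each jump locally; the DP instead compares complete paths, which is the distinction the abstract draws between locally adaptive and global scheduling.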
DiffBMP: Differentiable Rendering with Bitmap Primitives
Seongmin Hong, Junghun James Kim, Daehyeop Kim, Insoo Chung, Se Young Chun
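Core contribution: The paper proposes DiffBMP, a scalable and efficient differentiable rendering engine for collections of bitmap images, addressing the limitation that traditional differentiable renderers are constrained to vector graphics.
Method: The method builds a highly parallelized rendering pipeline with a custom CUDA implementation for gradient computation. Soft rasterization is achieved via Gaussian blur and is combined with techniques such as structure-aware initialization and a noisy canvas to ease optimization. Specialized losses and heuristics are also designed for videos and spatially constrained images.
Key findings: Experiments show that DiffBMP can optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives in under 1 minute on a consumer GPU. The system supports exporting compositions to a layered file format and integrates into practical creative workflows; the framework is publicly available as an easy-to-hack Python package.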
Original abstract:
We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.
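The role of Gaussian blur in soft rasterization can be seen in a one-dimensional toy example. This sketch is unrelated to DiffBMP's CUDA pipeline; the helper names and the 1D setting are assumptions used only to show why blurring helps: a hard coverage mask jumps from 0 to 1 at the primitive's edge, so small shifts in position change almost nothing, whereas a blurred mask falls off smoothly and gives an optimizer a usable signal.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(signal, kernel):
    """Convolve a 1D signal with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += weight * signal[j]
        blurred.append(acc)
    return blurred


def hard_mask(width, start, length):
    """Binary coverage of a primitive occupying [start, start + length)."""
    return [1.0 if start <= x < start + length else 0.0 for x in range(width)]


hard = hard_mask(width=32, start=12, length=8)
soft = blur_1d(hard, gaussian_kernel(sigma=1.5, radius=4))

# Compare coverage around the left edge of the primitive (pixels 9..14).
print("hard:", [round(v, 2) for v in hard[9:15]])
print("soft:", [round(v, 2) for v in soft[9:15]])
```

The blurred mask keeps the same total coverage but spreads the edge over several pixels, so moving the primitive slightly changes pixel values gradually rather than in a single step.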
Instruction-Based Image Editing: Planning, Reasoning, and Generation
Liya Ji, Chenyang Qi, Qifeng Chen
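Core contribution: The paper proposes a new multi-modality model that uses multi-modality chain-of-thought prompts to bridge understanding and generation, improving instruction-based image editing in complex cases.
Method: The method decomposes the instruction editing task into three steps: 1) chain-of-thought planning, in which a large language model reasons about suitable sub-prompts given the instruction and the capabilities of the editing network; 2) editing region reasoning, for which an instruction-based editing region generation network is trained with a multi-modal large language model; and 3) editing, performed by a hint-guided instruction-based editing network built on a large-scale text-to-image diffusion model that accepts the region hints for generation.
Key findings: Extensive experiments show that the method has competitive editing abilities on complex real-world images and can handle editing instructions that require both scene understanding and precise generation.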
Original abstract:
Editing images via instructions provides a natural way to generate interactive content, but it is challenging because it places higher demands on both scene understanding and generation. Prior work chains large language models, object segmentation models, and editing models for this task. However, these understanding models each provide only a single-modality ability, which restricts editing quality. We aim to bridge understanding and generation with a new multi-modality model that gives instruction-based image editing models the intelligence needed for more complex cases. To achieve this goal, we decompose the instruction editing task using multi-modality chain-of-thought prompts into three stages: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For CoT planning, the large language model reasons about the appropriate sub-prompts given the instruction and the abilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large-scale text-to-image diffusion model, that accepts the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.