📚 ArXiv Daily Digest

Daily Paper Picks

📅 2026-03-05

5 papers total | Natural Language Processing: 1 | Computer Vision: 3 | Machine Learning: 1

Machine Learning 2603.03973
Relevance 85/100

Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction

Soochul Park, Yeon Ju Lee

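Core contribution: Proposes Dual-Solver, a generalized ODE solver whose learnable parameters tune diffusion-model sampling at low numbers of function evaluations (NFEs) while preserving second-order local accuracy.
Method: Dual-Solver keeps the standard predictor-corrector structure and adds learnable parameters that continuously (i) interpolate among prediction types, (ii) select the integration domain, and (iii) adjust the residual terms. The parameters are learned with a classification-based objective using a frozen pretrained classifier (e.g., MobileNet or CLIP), without retraining the diffusion model itself.
Key findings: On ImageNet class-conditional generation (DiT, GM-DiT) and text-to-image generation (SANA, PixArt-α), Dual-Solver improves FID and CLIP scores in the low-NFE regime (3 ≤ NFE ≤ 9) across backbones, which suggests that sampling cost can be reduced without giving up generation quality.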
Original abstract:

Diffusion models achieve state-of-the-art image quality. However, sampling is costly at inference time because it requires a large number of function evaluations (NFEs). To reduce NFEs, classical ODE numerical methods have been adopted. Yet, the choice of prediction type and integration domain leads to different sampling behaviors. To address these issues, we introduce Dual-Solver, which generalizes multistep samplers through learnable parameters that continuously (i) interpolate among prediction types, (ii) select the integration domain, and (iii) adjust the residual terms. It retains the standard predictor-corrector structure while preserving second-order local accuracy. These parameters are learned via a classification-based objective using a frozen pretrained classifier (e.g., MobileNet or CLIP). For ImageNet class-conditional generation (DiT, GM-DiT) and text-to-image generation (SANA, PixArt-α), Dual-Solver improves FID and CLIP scores in the low-NFE regime (3 ≤ NFE ≤ 9) across backbones.
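For readers unfamiliar with the "prediction types" mentioned above, the following is a minimal sketch of how the three common diffusion parameterizations (data, noise, and velocity prediction) convert into one another under the standard forward relation x_t = alpha * x0 + sigma * eps. It assumes a variance-preserving schedule (alpha^2 + sigma^2 = 1) and scalar inputs for readability. It does not reproduce Dual-Solver's learnable interpolation, which the abstract does not specify, and all function names are illustrative.

```python
# Illustrative only: standard conversions between diffusion prediction types.
# Assumes alpha**2 + sigma**2 == 1 and scalar "images" for readability.
# This is NOT Dual-Solver's update rule; all names here are hypothetical.


def noisy_sample(x0, eps, alpha, sigma):
    """Forward relation: x_t = alpha * x0 + sigma * eps."""
    return alpha * x0 + sigma * eps


def x0_from_eps(x_t, eps, alpha, sigma):
    """Data prediction recovered from a noise prediction."""
    return (x_t - sigma * eps) / alpha


def v_from_x0_eps(x0, eps, alpha, sigma):
    """Velocity prediction: v = alpha * eps - sigma * x0."""
    return alpha * eps - sigma * x0


def x0_eps_from_v(x_t, v, alpha, sigma):
    """Data and noise predictions recovered from a velocity prediction."""
    return alpha * x_t - sigma * v, sigma * x_t + alpha * v


alpha, sigma = 0.8, 0.6  # satisfies alpha**2 + sigma**2 == 1
x0, eps = 2.0, -1.0
x_t = noisy_sample(x0, eps, alpha, sigma)
v = v_from_x0_eps(x0, eps, alpha, sigma)
x0_rec, eps_rec = x0_eps_from_v(x_t, v, alpha, sigma)
print(x_t, v, x0_rec, eps_rec)
```

Because each parameterization carries the same information, a solver can be written in terms of any of them; the choice changes the numerical behavior of the discretized update, which is the degree of freedom the abstract says Dual-Solver learns.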

Computer Vision 2603.03792
Relevance 85/100

TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration

Haowei Zhu, Tingxuan Huang, Xing Wang, Tianyu Zhao, Jiexi Wang et al. (10 authors)

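Core contribution: Proposes a training-free, probe-driven token-adaptive predictor framework that selects a predictor for each token at every sampling step, speeding up diffusion inference while preserving generation quality.
Method: A single full evaluation of the model's first layer serves as a low-cost probe. The probe is used to compute proxy losses for a compact family of candidate predictors (mainly Taylor expansions of varying order and horizon), and each token is assigned the predictor with the smallest proxy error. This per-token "probe-then-select" strategy requires no additional training and is compatible with various predictor designs.
Key findings: Across multiple diffusion architectures and generation tasks, TAP delivers large speedups at negligible overhead, with little or no loss in perceptual quality. It improves the accuracy-efficiency frontier over fixed global predictors and caching-only baselines.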
Original abstract:

Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token "probe-then-select" strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy-efficiency frontier compared to fixed global predictors and caching-only baselines.
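The "probe-then-select" idea can be illustrated with a toy example. The sketch below treats each token as a single scalar, uses finite-difference Taylor-style extrapolation of order 0, 1, or 2 as the candidate predictors, and takes absolute error on the probe as the proxy loss. All of these are simplifying assumptions made for illustration, and every name is hypothetical; it is not TAP's implementation.

```python
# Illustrative only: per-token "probe-then-select" with Taylor-style predictors.
# Scalar tokens and an absolute-error proxy loss are simplifying assumptions.


def taylor_predict(history, order):
    """Extrapolate one step ahead from equally spaced past values (latest last)."""
    if order == 0:  # reuse the last value, i.e. plain caching
        return history[-1]
    if order == 1:  # linear extrapolation from the last two values
        return 2 * history[-1] - history[-2]
    if order == 2:  # quadratic extrapolation from the last three values
        return 3 * history[-1] - 3 * history[-2] + history[-3]
    raise ValueError("unsupported order")


def select_orders(probe_history, probe_now, orders=(0, 1, 2)):
    """For each token, pick the order whose probe forecast has the smallest error."""
    chosen = []
    for past, actual in zip(probe_history, probe_now):
        best = min(orders, key=lambda k: abs(taylor_predict(past, k) - actual))
        chosen.append(best)  # ties resolve to the lowest order
    return chosen


def predict_features(feature_history, chosen):
    """Forecast each token's deep feature with the predictor chosen for it."""
    return [taylor_predict(h, k) for h, k in zip(feature_history, chosen)]


# Cheap probe values (stand-in for first-layer outputs) for three tokens.
probe_history = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]]
probe_now = [1.0, 3.0, 9.0]
chosen = select_orders(probe_history, probe_now)

# Expensive deep features whose next value is forecast instead of recomputed.
feature_history = [[5.0, 5.0, 5.0], [2.0, 4.0, 6.0], [1.0, 4.0, 9.0]]
predicted = predict_features(feature_history, chosen)
print(chosen, predicted)
```

In this toy setup the static token falls back to caching, the linearly drifting token gets first-order extrapolation, and the accelerating token gets second-order extrapolation, which is the kind of heterogeneous temporal behavior the abstract says a single global predictor cannot exploit.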

Natural Language Processing 2603.03714
Relevance 85/100

Order Is Not Layout: Order-to-Space Bias in Image Generation

Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li et al. (6 authors)

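Core contribution: Identifies a systematic bias in modern image generation models: the order in which entities are mentioned in the text spuriously determines spatial layout and entity-role binding. The authors name this Order-to-Space Bias (OTS).
Method: The study first quantifies the bias with OTS-Bench, a benchmark that isolates order effects using paired prompts differing only in entity order and evaluates models along two dimensions, homogenization and correctness. It then analyzes the causes of the bias experimentally and explores two mitigation strategies: targeted fine-tuning and intervention during the early stages of layout formation.
Key findings: OTS is widespread across modern image generation models. The evidence indicates that it is primarily data-driven and already appears during the early stages of layout formation. Both targeted fine-tuning and early-stage intervention substantially reduce the bias while preserving generation quality.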
Original abstract:

We study a systematic bias in modern image generation models: the mention order of entities in text spuriously determines spatial layout and entity-role binding. We term this phenomenon Order-to-Space Bias (OTS) and show that it arises in both text-to-image and image-to-image generation, often overriding grounded cues and causing incorrect layouts or swapped assignments. To quantify OTS, we introduce OTS-Bench, which isolates order effects with paired prompts differing only in entity order and evaluates models along two dimensions: homogenization and correctness. Experiments show that OTS is widespread in modern image generation models, and provide evidence that it is primarily data-driven and manifests during the early stages of layout formation. Motivated by this insight, we show that both targeted fine-tuning and early-stage intervention strategies can substantially reduce OTS, while preserving generation quality.
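To make "paired prompts differing only in entity order" concrete, here is a minimal sketch that builds such pairs from a list of entities. The prompt template, the entity list, and the function name are hypothetical; OTS-Bench's actual prompts and its homogenization and correctness metrics are not described in the abstract and are not reproduced here.

```python
# Illustrative only: build prompt pairs that differ solely in entity order.
# The template and entity list are hypothetical, not taken from OTS-Bench.
from itertools import combinations


def order_swapped_pairs(entities, template="a {first} and a {second}"):
    """Return (forward, swapped) prompt pairs for every unordered entity pair."""
    pairs = []
    for a, b in combinations(entities, 2):
        forward = template.format(first=a, second=b)
        swapped = template.format(first=b, second=a)
        pairs.append((forward, swapped))
    return pairs


pairs = order_swapped_pairs(["cat", "dog", "robot"])
for forward, swapped in pairs:
    print(forward, "|", swapped)
```

Since the two prompts in each pair describe the same scene, any systematic difference in where the entities end up in the generated images can be attributed to mention order alone.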

Computer Vision 2603.03692
Relevance 85/100

Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance

Inho Kong, Sojin Lee, Youngjoon Hong, Hyunwoo J. Kim

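Core contribution: Proposes ERK-Guid, a guidance method that turns solver-induced local truncation error (LTE) into a guidance signal, detecting and exploiting stiff regions of the diffusion ODE trajectory to stabilize sampling and improve generation quality.
Method: The method rests on a key observation: in stiff regions, where the ODE trajectory changes sharply, the LTE aligns with the dominant eigenvector. ERK-Guid uses embedded Runge-Kutta methods, comparing solver outputs of different orders to estimate LTE and stiffness online, and applies the error signal as a guidance term that corrects the sampling process, reducing error and stabilizing the trajectory.
Key findings: On synthetic datasets and on ImageNet, ERK-Guid consistently outperforms state-of-the-art methods such as Classifier-Free Guidance and Autoguidance, handling stiff regions effectively and improving sampling stability and sample fidelity.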
Original abstract:

Classifier-Free Guidance (CFG) has established the foundation for guidance mechanisms in diffusion models, showing that well-designed guidance proxies significantly improve conditional generation and sample quality. Autoguidance (AG) has extended this idea, but it relies on an auxiliary network and leaves solver-induced errors unaddressed. In stiff regions, the ODE trajectory changes sharply, where local truncation error (LTE) becomes a critical factor that deteriorates sample quality. Our key observation is that these errors align with the dominant eigenvector, motivating us to leverage the solver-induced error as a guidance signal. We propose Embedded Runge-Kutta Guidance (ERK-Guid), which exploits detected stiffness to reduce LTE and stabilize sampling. We theoretically and empirically analyze stiffness and eigenvector estimators with solver errors to motivate the design of ERK-Guid. Our experiments on both synthetic datasets and the popular benchmark dataset, ImageNet, demonstrate that ERK-Guid consistently outperforms state-of-the-art methods. Code is available at https://github.com/mlvlab/ERK-Guid.
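As background on how an embedded Runge-Kutta pair exposes local error, the sketch below uses the classic Heun-Euler pair: a first-order Euler step and a second-order Heun step share the same function evaluations, and their difference estimates the local truncation error of the lower-order step. The test problem (linear decay with a mild and a stiff rate) and all names are assumptions chosen for illustration. This is a generic numerical-analysis example, not ERK-Guid's guidance rule.

```python
# Illustrative only: Heun-Euler embedded pair as a local error estimator.
# Generic textbook scheme on a toy ODE; NOT the ERK-Guid algorithm.


def heun_euler_step(f, t, y, h):
    """Take one step; return the 2nd-order result and a local error estimate."""
    k1 = f(t, y)
    y_euler = y + h * k1              # 1st-order (Euler) solution
    k2 = f(t + h, y_euler)
    y_heun = y + 0.5 * h * (k1 + k2)  # 2nd-order (Heun) solution, reuses k1
    return y_heun, abs(y_heun - y_euler)


def decay(rate):
    """Right-hand side of dy/dt = -rate * y; a larger rate is stiffer."""
    return lambda t, y: -rate * y


h = 0.1
_, err_mild = heun_euler_step(decay(1.0), 0.0, 1.0, h)
_, err_stiff = heun_euler_step(decay(50.0), 0.0, 1.0, h)
print(err_mild, err_stiff)
```

With the same step size, the error estimate for the stiff rate is several thousand times larger than for the mild one, so the estimate doubles as a cheap stiffness indicator. The abstract describes ERK-Guid as building on this kind of solver-induced error signal.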

Computer Vision 2603.03657
Relevance 85/100

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Zhiqiang Sheng, Xumeng Han, Zhiwei Zhang, Zenghui Xiong, Yifan Ding et al. (9 authors)

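Core contribution: Introduces InEdit-Bench, the first benchmark dedicated to evaluating how well image editing models reason over intermediate logical pathways, together with fine-grained assessment criteria for systematically measuring current models' limitations on dynamic, multi-step reasoning tasks.
Method: InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. The assessment criteria measure the logical coherence and visual naturalness of the generated pathways and the model's fidelity to specified path constraints. The benchmark is then used to evaluate 14 representative image editing models.
Key findings: The 14 models evaluated show significant and widespread shortcomings on complex tasks that require dynamic reasoning and the generation of multi-step intermediate pathways. This exposes a key gap in the procedural and causal visual understanding of current multimodal generative models and underlines the need for more dynamic, reason-aware models.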
Original abstract:

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine-grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit-Bench to catalyze research and steer development towards more dynamic, reason-aware, and intelligent multimodal generative models.