📚 ArXiv Daily Digest

Daily Paper Picks

📅 2026-03-10

5 papers in total | Computer Vision: 5

Computer Vision 2603.07276
Relevance 85/100

Variational Flow Maps: Make Some Noise for One-Step Conditional Generation

Abbas Mammadov, So Takao, Bohan Chen, Ricardo Baptista, Morteza Mardani et al. (7 authors)

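Key contribution: Proposes the Variational Flow Maps (VFM) framework, which shifts conditional generation from "guiding a sampling path" to "learning the proper initial noise", producing high-quality conditional samples from complex data posteriors in a single step or a few steps.
Method: A noise adapter model is trained to output a noise distribution given an observation. After that noise is mapped to data space by the flow map, the resulting samples respect both the observation and the data prior. A principled variational objective jointly trains the noise adapter and the flow map to improve noise-data alignment.
Key findings: Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single step or a few steps. On ImageNet, VFM attains competitive fidelity while accelerating sampling by orders of magnitude compared with iterative diffusion/flow models.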

Flow maps enable high-quality image generation in a single forward pass. However, unlike iterative diffusion models, their lack of an explicit sampling trajectory impedes incorporating external constraints for conditional generation and solving inverse problems. We put forth Variational Flow Maps, a framework for conditional sampling that shifts the perspective of conditioning from "guiding a sampling path", to that of "learning the proper initial noise". Specifically, given an observation, we seek to learn a noise adapter model that outputs a noise distribution, so that after mapping to the data space via flow map, the samples respect the observation and data prior. To this end, we develop a principled variational objective that jointly trains the noise adapter and the flow map, improving noise-data alignment, such that sampling from complex data posterior is achieved with a simple adapter. Experiments on various inverse problems show that VFMs produce well-calibrated conditional samples in a single (or few) steps. For ImageNet, VFM attains competitive fidelity while accelerating the sampling by orders of magnitude compared to alternative iterative diffusion/flow models. Code is available at https://github.com/abbasmammadov/VFM
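
To make the "learn the proper initial noise" idea concrete, here is a minimal sketch in a toy linear-Gaussian setting. Everything in it is an assumption chosen for illustration: the affine `flow_map`, the constants `A` and `B`, and the closed-form `noise_adapter`. In the paper both components are learned models trained jointly with a variational objective, which this sketch does not implement.

```python
import math
import random

A, B = 2.0, 1.0  # hypothetical parameters of the toy flow map


def flow_map(z):
    """Toy one-step flow map: sends noise z ~ N(0, 1) to data x ~ N(B, A^2)."""
    return A * z + B


def noise_adapter(y, sigma):
    """Mean and std of p(z | y) when y = flow_map(z) + sigma * eps, eps ~ N(0, 1).

    Closed form for this linear-Gaussian toy; a learned network would replace it.
    """
    denom = sigma**2 + A**2
    return A * (y - B) / denom, math.sqrt(sigma**2 / denom)


def sample_conditional(y, sigma, n, rng):
    """Draw adapted noise, then map it to data space in a single step."""
    mean, std = noise_adapter(y, sigma)
    return [flow_map(rng.gauss(mean, std)) for _ in range(n)]


rng = random.Random(0)
samples = sample_conditional(y=3.0, sigma=0.5, n=20000, rng=rng)
posterior_mean = sum(samples) / len(samples)
```

The conditioning happens entirely in the choice of noise distribution; the flow map itself is applied once and never sees the observation.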

Computer Vision 2603.07244
Relevance 85/100

PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

Xin-Sheng Chen, Jiayu Zhu, Pei-lin Li, Hanzheng Wang, Shuojin Yang et al. (6 authors)

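Key contribution: Proposes PresentBench, a fine-grained, rubric-based benchmark for evaluating automated slide generation in real-world scenarios, addressing the coarse granularity and reliance on holistic judgments of existing evaluations.
Method: The benchmark contains 238 evaluation instances, each supplemented with the background materials required for slide creation. The authors manually design an average of 54.1 checklist items per instance, each formulated as a binary question, enabling fine-grained, instance-specific evaluation of generated slide decks.
Key findings: Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods and aligns significantly more strongly with human preferences. The benchmark also reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in the field.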

Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
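
A checklist of binary questions can be turned into a score in several ways, and the abstract does not say which one PresentBench uses. The sketch below assumes the simplest option: an instance score is the fraction of items answered "yes", and the benchmark score is the unweighted mean over instances. The function names are hypothetical.

```python
from statistics import mean


def instance_score(answers):
    """Fraction of binary checklist items answered 'yes' (True) for one slide deck."""
    if not answers:
        raise ValueError("an instance needs at least one checklist item")
    return sum(answers) / len(answers)


def benchmark_score(all_answers):
    """Unweighted mean of instance scores (an assumed aggregation rule)."""
    return mean(instance_score(answers) for answers in all_answers)


# Two hypothetical instances with different checklist lengths.
deck_a = [True, True, False, True]
deck_b = [True, False]
overall = benchmark_score([deck_a, deck_b])
```

Averaging per instance first keeps an instance with a long checklist from dominating the overall score.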

Computer Vision 2603.07240
Relevance 85/100

FabricGen: Microstructure-Aware Woven Fabric Generation

Yingjie Tang, Di Luo, Zixiong Wang, Xiaoli Ling, Jian Yang et al. (6 authors)

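Key contribution: Proposes an end-to-end framework that generates high-quality woven fabric materials conforming to weaving rules from textual descriptions. Decoupling macro-scale textures from micro-scale weaving structure significantly improves the detail and realism of the generated fabrics.
Method: Generation is split into two parts. (1) For macro-scale textures, pre-trained diffusion models are fine-tuned on a collected dataset of microstructure-free fabrics. (2) For micro-scale weaving patterns, an enhanced procedural geometric model synthesizes yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by WeavingLLM, a specialized large language model that is fine-tuned on an annotated dataset of formatted weaving drafts and prompt-tuned with domain-specific fabric expertise.
Key findings: Compared with prior generative models, the framework produces fabric materials with richer detail and greater realism. It generates diverse micro-scale weaving patterns that follow weaving principles from text prompts, and these are combined with the macro-scale texture for final fabric rendering.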

Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.
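
For readers unfamiliar with weaving drafts, the sketch below shows how a draft determines an interlacing pattern. It uses the conventional threading / tie-up / treadling representation from handweaving, which is an assumption here: the abstract does not describe the draft format that WeavingLLM emits. The function name `drawdown` is illustrative.

```python
def drawdown(threading, tieup, treadling):
    """Expand a weaving draft into a binary interlacing pattern.

    threading[e] : shaft that carries warp end e
    tieup[t]     : set of shafts lifted when treadle t is pressed
    treadling[p] : treadle pressed for weft pick p
    Returns one row per pick; 1 means the warp end lies over the weft.
    """
    return [
        [1 if threading[end] in tieup[treadling[pick]] else 0
         for end in range(len(threading))]
        for pick in range(len(treadling))
    ]


# A 2/2 twill on four shafts: each treadle lifts two adjacent shafts.
threading = [0, 1, 2, 3]
tieup = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]
treadling = [0, 1, 2, 3]
pattern = drawdown(threading, tieup, treadling)
```

The diagonal shift from row to row is what gives twill its characteristic rib; a procedural yarn-level model would take a pattern like this as input.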

Computer Vision 2603.07236
Relevance 85/100

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

Tencent HY Team

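Key contribution: Proposes HY-WU (Weight Unleashing), a framework that shifts adaptation pressure away from overwriting a single shared parameter point and onto functional neural memory, yielding instance-specific operators without test-time optimization.
Method: The framework takes a memory-first approach to adaptation and implements functional (operator-level) memory as a neural module. The module acts as a generator that synthesizes weight updates on the fly from the instance condition, producing an operator specific to each instance and avoiding the single-point-in-parameter-space limitation of the static weight paradigm.
Key findings: By replacing repeated overwriting of shared weights with memory-based generation of instance-specific operators, HY-WU aims to mitigate the degradation, interference, and overspecialization that arise in continual learning and personalization, offering a more flexible architecture for heterogeneous and continually evolving deployments.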

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.
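
The core mechanism, a generator that turns an instance condition into a weight update, can be sketched in a few lines. This is a toy under stated assumptions: the generator is a single linear map (`gen`), the update is a full dense matrix, and all names are hypothetical. The abstract does not specify HY-WU's actual generator architecture or update parameterization.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def generate_update(gen, cond, rows, cols):
    """Hypothetical linear generator: flat update = gen @ cond, reshaped to rows x cols."""
    flat = matvec(gen, cond)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def apply_instance_operator(base, gen, cond, x):
    """Apply (base + update(cond)) to x; the shared base weights are never overwritten."""
    delta = generate_update(gen, cond, len(base), len(base[0]))
    weights = [[b + d for b, d in zip(b_row, d_row)]
               for b_row, d_row in zip(base, delta)]
    return matvec(weights, x)


base = [[1.0, 0.0], [0.0, 1.0]]  # shared weights (identity)
gen = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # maps 2-d condition to 4 entries
out_a = apply_instance_operator(base, gen, [0.5, -1.0], [2.0, 3.0])
out_b = apply_instance_operator(base, gen, [0.0, 0.0], [2.0, 3.0])
```

Two different conditions produce two different operators from the same stored parameters, with no gradient step at inference time.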

Computer Vision 2603.07119
Relevance 85/100

TIQA: Human-Aligned Text Quality Assessment in Generated Images

Kirill Koltsov, Aleksandr Gushchin, Dmitriy Vatolin, Anastasia Antsiferova

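Key contribution: Introduces the Text-in-Image Quality Assessment (TIQA) task, releases two human-annotated datasets (TIQA-Crops and TIQA-Images), and proposes ANTIQA, a lightweight method that improves agreement with human scores.
Method: ANTIQA is a lightweight text quality assessment model with biases designed for the characteristics of rendered text. It predicts quality scores for cropped text regions. Rather than relying on OCR correctness or the judgment of general-purpose vision-language models, it directly learns human perceptual ratings of text fidelity.
Key findings: Measured by PLCC, ANTIQA improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least about 0.05 on TIQA-Crops and about 0.08 on TIQA-Images. In a downstream task, selecting the best of 5 generations with ANTIQA improves human-rated text quality by 14% on average, demonstrating practical value for filtering and reranking in generation pipelines.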

Text rendering remains a persistent failure mode of modern text-to-image models (T2I), yet existing evaluations rely on OCR correctness or VLM-based judging procedures that are poorly aligned with perceptual text artifacts. We introduce Text-in-Image Quality Assessment (TIQA), a task that predicts a scalar quality score that matches human judgments of rendered-text fidelity within cropped text regions. We release two MOS-labeled datasets: TIQA-Crops (10k text crops) and TIQA-Images (1,500 images), spanning 20+ T2I models, including proprietary ones. We also propose ANTIQA, a lightweight method with text-specific biases, and show that it improves correlation with human scores over OCR confidence, VLM judges, and generic NR-IQA metrics by at least $\sim0.05$ on TIQA-Crops and $\sim0.08$ on TIQA-Images, as measured by PLCC. Finally, we show that TIQA models are valuable in downstream tasks: for example, selecting the best-of-5 generations with ANTIQA improves human-rated text quality by $+14\%$ on average, demonstrating practical value for filtering and reranking in generation pipelines.
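
The two quantities reported above, PLCC and best-of-N selection, are simple to compute once a scorer exists. The sketch below implements the Pearson linear correlation coefficient directly and uses a lookup table as a stand-in scorer; the table and its values are invented for illustration and are not ANTIQA outputs.

```python
import math


def plcc(predicted, human):
    """Pearson linear correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_p = sum(predicted) / len(predicted)
    mean_h = sum(human) / len(human)
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    norm_h = math.sqrt(sum((h - mean_h) ** 2 for h in human))
    if norm_p == 0 or norm_h == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (norm_p * norm_h)


def best_of_n(candidates, scorer):
    """Return the candidate with the highest predicted quality score."""
    return max(candidates, key=scorer)


# Hypothetical predicted scores for three generations of the same prompt.
toy_scores = {"blurry": 0.2, "crisp": 0.9, "warped": 0.4}
chosen = best_of_n(list(toy_scores), toy_scores.get)
```

In a generation pipeline, `scorer` would be the quality model applied to each candidate image, and `plcc` would be evaluated against mean opinion scores on a held-out set.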