📚 ArXiv Daily Digest

Computer Vision 2602.23235

Relevance 85/100

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao

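Core contribution: Proposes GUIPruner, a framework that resolves the temporal mismatch and spatial topology conflict in existing compression methods, achieving for the first time efficient, training-free compression of visual inputs for high-resolution GUI navigation agents and substantially improving efficiency while preserving performance.

Method: The framework has two core components: 1) Temporal-Adaptive Resolution (TAR), which uses a decay mechanism to dynamically adjust the encoding resolution of the historical trajectory so that it matches the agent's "fading memory" attention pattern; 2) Stratified Structure-aware Pruning (SSP), which prioritizes key spatial information such as interactive foregrounds and semantic anchors while preserving global layout integrity, avoiding coordinate grounding errors.

Key findings: Experiments on multiple benchmarks show that GUIPruner consistently achieves state-of-the-art performance and effectively prevents the performance collapse of large models under high compression. On Qwen2-VL-2B, the method retains over 94% of the original performance while delivering a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency, enabling real-time, high-precision GUI navigation with minimal resource consumption.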

Original abstract

Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
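
The decay-based resizing behind TAR can be pictured with a small sketch. The abstract does not give the decay function, so the geometric schedule below, the names `history_scales` and `resized_shape`, and the `decay`, `min_scale`, and `patch` parameters are all illustrative assumptions rather than the authors' implementation.

```python
def history_scales(num_frames, decay=0.7, min_scale=0.25):
    """Return one resize factor per screenshot, oldest first.

    Assumed schedule (not from the paper): the current frame keeps full
    resolution (1.0) and each step further into the past multiplies the
    factor by `decay`, floored at `min_scale`.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    scales = []
    for age in range(num_frames - 1, -1, -1):  # age 0 is the current frame
        scales.append(max(min_scale, decay ** age))
    return scales


def resized_shape(height, width, scale, patch=28):
    """Scale a frame and snap each side down to a multiple of `patch`.

    `patch` is a hypothetical grid size; snapping keeps the resized frame
    aligned to whole patches, with at least one patch per side.
    """
    new_h = max(patch, int(height * scale) // patch * patch)
    new_w = max(patch, int(width * scale) // patch * patch)
    return new_h, new_w


if __name__ == "__main__":
    # Three-step trajectory: two history frames plus the current screenshot.
    for scale in history_scales(3, decay=0.5):
        print(scale, resized_shape(1080, 1920, scale))
```

Under this assumed schedule, older screenshots contribute fewer visual tokens while the current screenshot is encoded at full resolution.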

Computer Vision 2602.23191

Relevance 85/100

Uni-Animator: Towards Unified Visual Colorization

Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng

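Core contribution: Proposes a unified Diffusion Transformer-based framework for image and video sketch colorization, addressing the shortcomings of existing methods in cross-task unification, color transfer precision, detail preservation, and temporal consistency.

Method: The method adopts a Diffusion Transformer architecture. It enhances visual references through instance patch embedding for precise color alignment and fusion; uses a physical detail reinforcement mechanism to capture and retain high-frequency texture details; and introduces sketch-based dynamic RoPE encoding that adaptively models motion-aware spatio-temporal dependencies to improve temporal consistency.

Key findings: Experiments show that Uni-Animator matches task-specific methods on both image and video sketch colorization while providing unified cross-domain capability, with high detail fidelity and robust temporal consistency; in particular, it effectively reduces motion artifacts in large-motion scenes.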

Original abstract

We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

Graphics 2602.23010

Relevance 85/100

HELMLAB: An Analytical, Data-Driven Color Space for Perceptual Distance in UI Design Systems

Gorkem Yildiz

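Core contribution: Presents HELMLAB, a 72-parameter analytical color space intended to give UI design systems a more accurate perceptual color-distance metric; it significantly outperforms the CIEDE2000 standard.

Method: Through learned matrices, per-channel power compression, Fourier hue correction, and an embedded Helmholtz-Kohlrausch lightness adjustment, the method maps CIE XYZ to a perceptually organized Lab representation. A post-pipeline neutral correction ensures that achromatic colors map accurately, and a rigid rotation improves hue-angle alignment without affecting the distance metric.

Key findings: On the COMBVD dataset (3,813 color pairs), HELMLAB achieves a STRESS of 23.22, a 20.4% reduction from CIEDE2000 (29.18). Cross-dataset validation shows competitive generalization, and the transform is invertible with round-trip errors below 10^-14. The space also ships with practical utilities for gamut mapping, design-token export, and dark/light mode adaptation.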

Original abstract

We present HELMLAB, a 72-parameter analytical color space for UI design systems. The forward transform maps CIE XYZ to a perceptually-organized Lab representation through learned matrices, per-channel power compression, Fourier hue correction, and embedded Helmholtz-Kohlrausch lightness adjustment. A post-pipeline neutral correction guarantees that achromatic colors map to a=b=0 (chroma < 10^-6), and a rigid rotation of the chromatic plane improves hue-angle alignment without affecting the distance metric, which is invariant under isometries. On the COMBVD dataset (3,813 color pairs), HELMLAB achieves a STRESS of 23.22, a 20.4% reduction from CIEDE2000 (29.18). Cross-validation on He et al. 2022 and MacAdam 1974 shows competitive cross-dataset performance. The transform is invertible with round-trip errors below 10^-14. Gamut mapping, design-token export, and dark/light mode adaptation utilities are included for use in web and mobile design systems.
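
For readers unfamiliar with the STRESS figure, here is a minimal sketch assuming the standard definition used in color-difference evaluation (lower is better, 0 means predicted differences are exactly proportional to visual ones). The digest does not state the formula, and the function names are illustrative. The second helper checks the quoted 20.4% reduction.

```python
import math


def stress(predicted, visual):
    """STRESS index between predicted and visual color differences.

    Assumed standard form:
        F = sum(dE^2) / sum(dE * dV)
        STRESS = 100 * sqrt(sum((dE - F*dV)^2) / sum((F*dV)^2))
    """
    if len(predicted) != len(visual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    sum_ee = sum(e * e for e in predicted)
    sum_ev = sum(e * v for e, v in zip(predicted, visual))
    f = sum_ee / sum_ev  # scaling factor that aligns the two scales
    num = sum((e - f * v) ** 2 for e, v in zip(predicted, visual))
    den = sum((f * v) ** 2 for v in visual)
    return 100.0 * math.sqrt(num / den)


def relative_reduction(baseline, new):
    """Percentage reduction of `new` relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline


if __name__ == "__main__":
    # Figures quoted in the abstract: CIEDE2000 29.18, HELMLAB 23.22.
    print(round(relative_reduction(29.18, 23.22), 1))
```

The arithmetic is (29.18 - 23.22) / 29.18 ≈ 0.204, which agrees with the 20.4% stated in the abstract.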

Computer Vision 2602.22948

Relevance 85/100

ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li et al. (7 authors)

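Core contribution: Proposes an acceleration framework for visual autoregressive models based on tri-dimensional (token, layer, scale) entropy-aware semantic analysis and sparsity optimization. It departs fundamentally from traditional heuristic skipping strategies and significantly improves efficiency while preserving generation quality.

Method: First, attention entropy is used to analyze the model's parameter dynamics under different token granularities, semantic scopes, and generation scales; next, sparsity patterns are identified along the token, layer, and scale dimensions; finally, fine-grained optimization strategies tailored to these patterns deliver targeted acceleration.

Key findings: Experiments on Infinity-2B and Infinity-8B show that ToProVAR achieves up to 3.4x generation speedup with minimal quality loss, outperforming prior methods such as FastVAR and SkipVAR in both efficiency and quality and effectively addressing the late-stage efficiency bottleneck of visual autoregressive models.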

Original abstract

Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
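
As background for the attention-entropy signal mentioned above, a minimal sketch of the Shannon entropy of one attention row follows. How ToProVAR turns this quantity into token, layer, and scale decisions is not described in the abstract, so `low_entropy_rows` and its `threshold` are hypothetical illustrations only.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights.

    The row is renormalized first, so unnormalized non-negative scores
    are accepted. Uniform attention over n keys gives log(n); attention
    concentrated on a single key gives 0.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy


def low_entropy_rows(attn, threshold):
    """Hypothetical helper: indices of query rows whose entropy is below `threshold`."""
    return [i for i, row in enumerate(attn) if attention_entropy(row) < threshold]


if __name__ == "__main__":
    attn = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
    print([round(attention_entropy(row), 4) for row in attn])
    print(low_entropy_rows(attn, threshold=0.5))
```

Low entropy indicates sharply focused attention and high entropy indicates diffuse attention; which of the two signals redundancy depends on the method and is not assumed here.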

Computer Vision 2602.22809

Relevance 85/100

PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue

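Core contribution: Proposes an agent system that edits photos autonomously, using explicit aesthetic planning to cast editing as a long-horizon decision-making problem, and introduces an aesthetic evaluation benchmark for real-world scenarios.

Method: The system formulates autonomous image editing as a long-horizon decision-making problem. It first reasons about the user's aesthetic intent, then plans multi-step editing actions via tree search, and finally executes them iteratively in a closed loop with memory and visual feedback, without requiring step-by-step instructions from the user.

Key findings: Extensive experiments show that PhotoAgent consistently improves both instruction adherence and visual quality over baseline methods. A test set of 1,017 photos, built to support evaluation, is also used to systematically assess its autonomous photo editing performance.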

Original abstract

With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.
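
The idea of planning multi-step edits via tree search can be illustrated with a toy example. Everything below is an assumption made for illustration: the two-number image state, the `ACTIONS` table, and the distance-based `score` merely stand in for real editing tools and for a learned aesthetic reward, and the exhaustive depth-limited search is not claimed to be PhotoAgent's algorithm.

```python
# Hypothetical edit actions on a toy (brightness, contrast) state.
ACTIONS = {
    "brighten": lambda s: (s[0] + 10, s[1]),
    "darken": lambda s: (s[0] - 10, s[1]),
    "more_contrast": lambda s: (s[0], s[1] + 10),
}


def score(state, target):
    """Toy stand-in for an aesthetic reward: closer to `target` is better."""
    return -(abs(state[0] - target[0]) + abs(state[1] - target[1]))


def plan_edits(state, target, depth):
    """Depth-limited tree search over edit sequences.

    Returns (best_score, best_plan), where best_plan is the list of action
    names whose end state scores highest. Stopping early is allowed, so
    the plan may be shorter than `depth`.
    """
    best = (score(state, target), [])
    if depth == 0:
        return best
    for name, fn in ACTIONS.items():
        child_score, child_plan = plan_edits(fn(state), target, depth - 1)
        if child_score > best[0]:  # strict: keep the first-found plan on ties
            best = (child_score, [name] + child_plan)
    return best


if __name__ == "__main__":
    print(plan_edits((40, 50), (60, 60), depth=3))
```

A practical system would prune the tree and score real images, since exhaustive search grows exponentially with depth.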
