📚 ArXiv Daily Digest

Computer Vision 2601.18698

Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge

Xiao Liu, Jiawei Zhang

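Core contribution: Proposes a systematic evaluation framework (GAP) and a benchmark dataset (GEOATTRACTION-500) for assessing how fairly and accurately text-to-video models encode visual knowledge of different geographic regions.

Method: The study designs the Geo-Attraction Landmark Probing (GAP) framework and builds GEOATTRACTION-500, a benchmark of 500 attractions worldwide spanning varied regions and popularity levels. Complementary metrics (global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments) separate overall video quality from attraction-specific knowledge; all metrics are validated against human evaluation.

Key findings: An evaluation of Sora 2, a state-of-the-art text-to-video model, finds that, contrary to the common assumption of strong geographic bias, the model encodes geographically grounded visual knowledge at a relatively uniform level across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. This suggests that current models express global visual knowledge more evenly than expected.

Original abstract: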

Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.

Computer Vision 2601.18543

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu et al. (9 authors)

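Core contribution: Proposes GenAgent, an agentic multimodal model that unifies visual understanding and generation by decoupling the two through an agentic framework. Autonomous multi-turn interaction and reflection iteratively refine the output and markedly improve the base image generator's performance.

Method: GenAgent uses an agentic framework in which the multimodal model handles visual understanding and image generation models serve as invokable tools. At its core, the agent interacts autonomously over multiple turns through multimodal chains of thought covering reasoning, tool invocation, judgment, and reflection. Training has two stages: first, supervised fine-tuning on high-quality tool-invocation and reflection data; then end-to-end agentic reinforcement learning that combines pointwise rewards on final image quality with pairwise rewards on reflection accuracy, with trajectory resampling to strengthen multi-turn exploration.

Key findings: GenAgent improves the base generator (FLUX.1-dev) by +23.6% on GenEval++ and +14% on WISE. The framework shows three key properties: (1) generalization across generators of varying capability; (2) test-time scaling, with performance improving consistently as interaction rounds increase; and (3) task-adaptive reasoning that adjusts automatically to different tasks.

Original abstract: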

We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at https://github.com/deep-kaixun/GenAgent.
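
The generate, judge, reflect cycle described in this abstract can be sketched as a small loop. This is only an illustration of the control flow, not GenAgent's implementation: `refine_image`, `generate`, `judge`, and the way the critique is folded into the next prompt are all hypothetical names and choices introduced here.

```python
from typing import Callable, List, Tuple


def refine_image(
    prompt: str,
    generate: Callable[[str], str],                 # hypothetical tool: prompt -> image handle
    judge: Callable[[str, str], Tuple[bool, str]],  # hypothetical: (prompt, image) -> (accepted, critique)
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Call the generator, judge the result, and reflect until accepted or out of rounds."""
    current_prompt = prompt
    history: List[str] = []
    image = ""
    for _ in range(max_rounds):
        image = generate(current_prompt)        # tool invocation
        accepted, critique = judge(prompt, image)  # judged against the original request
        history.append(critique)
        if accepted:
            break
        # Reflection step (assumed form): fold the critique into the next tool call.
        current_prompt = f"{prompt}. Fix: {critique}"
    return image, history
```

Raising `max_rounds` corresponds loosely to the test-time scaling the abstract mentions: more interaction rounds give the agent more chances to correct the image.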

Computer Vision 2601.18346

Q-Bench-Portrait: Benchmarking Multimodal Large Language Models on Portrait Image Quality Perception

Sijing Wu, Yunhao Li, Zicheng Zhang, Qi Jia, Xinyue Li et al. (8 authors)

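Core contribution: Proposes Q-Bench-Portrait, the first holistic benchmark designed specifically for portrait image quality perception, and systematically evaluates 25 open-source and closed-source multimodal large language models (MLLMs) on it.

Method: The study builds a dataset of 2,765 image-question-answer triplets covering diverse portrait sources: natural, synthetically distorted, AI-generated, artistic, and computer graphics images. Questions span three quality dimensions (technical distortions, AIGC-specific distortions, and aesthetics) and several formats (single-choice, multiple-choice, true/false, and open-ended) at both global and local levels. On this basis, 20 open-source and 5 closed-source MLLMs are evaluated.

Key findings: Current MLLMs show some competence in portrait image perception, but their performance remains limited and imprecise, with a clear gap relative to human judgments. The benchmark exposes weaknesses of existing models in this specialized image domain and provides a basis for improving the portrait perception of both general-purpose and domain-specific MLLMs.

Original abstract: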

Recent advances in multimodal large language models (MLLMs) have demonstrated impressive performance on existing low-level vision benchmarks, which primarily focus on generic images. However, their capabilities to perceive and assess portrait images, a domain characterized by distinct structural and perceptual properties, remain largely underexplored. To this end, we introduce Q-Bench-Portrait, the first holistic benchmark specifically designed for portrait image quality perception, comprising 2,765 image-question-answer triplets and featuring (1) diverse portrait image sources, including natural, synthetic distortion, AI-generated, artistic, and computer graphics images; (2) comprehensive quality dimensions, covering technical distortions, AIGC-specific distortions, and aesthetics; and (3) a range of question formats, including single-choice, multiple-choice, true/false, and open-ended questions, at both global and local levels. Based on Q-Bench-Portrait, we evaluate 20 open-source and 5 closed-source MLLMs, revealing that although current models demonstrate some competence in portrait image perception, their performance remains limited and imprecise, with a clear gap relative to human judgments. We hope that the proposed benchmark will foster further research into enhancing the portrait image perception capabilities of both general-purpose and domain-specific MLLMs.

Multimedia 2601.18321

Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning

Zhixian Zhao, Wenjie Tian, Xiaohai Tian, Jun Zhang, Lei Xie

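Core contribution: Proposes the SABER-LLM framework, which builds SABER, a large-scale fine-grained emotion reasoning dataset, and introduces a structured evidence decomposition paradigm to mitigate unimodal dominance and hallucination in multimodal large language models under complex scenarios.

Method: First, the authors construct SABER, a large-scale emotion reasoning dataset of 600K video clips annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, they propose a structured evidence decomposition paradigm that enforces a "perceive-then-reason" separation between evidence extraction and reasoning. In addition, consistency-aware direct preference optimization explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions, strengthening perception of complex scenes.

Key findings: Experiments on EMER, EmoBench-M, and SABER-Test show that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics.

Original abstract: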

Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance, which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenarios). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a "perceive-then-reason" separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available at https://github.com/zxzhao0/SABER-LLM.
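
For readers unfamiliar with direct preference optimization, the sketch below computes the standard DPO loss for one preference pair. It is the generic objective only; the abstract does not specify how the consistency-aware variant modifies it, so nothing here should be read as the paper's loss. The function name and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(
    logp_chosen: float,        # policy log-probability of the preferred response
    logp_rejected: float,      # policy log-probability of the dispreferred response
    ref_logp_chosen: float,    # same quantities under the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,         # assumed temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference model and lowers the dispreferred one's.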

Human-Computer Interaction 2601.18785

Design Techniques for LLM-Powered Interactive Storytelling: A Case Study of the Dramamancer System

Tiffany Wang, Yuqian Sun, Yi Wang, Melissa Roemmele, John Joon Young Chung et al. (6 authors)

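Core contribution: Presents a new paradigm in which a large language model (LLM) transforms author-created story schemas into player-driven narrative experiences, and uses the Dramamancer system to show how authorial intent and player agency can be balanced.

Method: Using Dramamancer as a case study, the work describes an LLM-based architecture. The system takes a story outline designed in advance by the author (the story schema) and uses an LLM to generate, in real time, narrative content that responds to player choices, preserving the overall story structure while giving players substantial interactive freedom.

Key findings: The paper outlines key design techniques and evaluation considerations for such systems, suggesting that LLMs can bridge predefined narrative frameworks and dynamic player interaction and offering a new technical path and design approach for interactive storytelling.

Original abstract: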

The rise of Large Language Models (LLMs) has enabled a new paradigm for bridging authorial intent and player agency in interactive narrative. We consider this paradigm through the example of Dramamancer, a system that uses an LLM to transform author-created story schemas into player-driven playthroughs. This extended abstract outlines some design techniques and evaluation considerations associated with this system.

Natural Language Processing 2601.18512

Using Large Language Models to Construct Virtual Top Managers: A Method for Organizational Research

Antonio Garzon-Vico, Krithika Sharon Komalapati, Arsalan Shahid, Jan Rosier

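Core contribution: Proposes a methodological framework that uses large language models to build virtual personas of real top managers, offering a credible and complementary tool for organizational research when direct access to executives is limited.

Method: Drawing on real CEO communications and Moral Foundations Theory, the method constructs LLM-based participants that simulate the decision-making of individual leaders. Across three phases, the study assesses the construct validity, reliability, and behavioral fidelity of these virtual CEOs by benchmarking them against human participants.

Key findings: Theoretically scaffolded personas approximate the moral judgments observed in human samples, suggesting that LLM-based personas can serve as a credible and complementary tool for organizational research in settings where executives cannot be reached directly.

Original abstract: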

This study introduces a methodological framework that uses large language models to create virtual personas of real top managers. Drawing on real CEO communications and Moral Foundations Theory, we construct LLM-based participants that simulate the decision-making of individual leaders. Across three phases, we assess construct validity, reliability, and behavioral fidelity by benchmarking these virtual CEOs against human participants. Our results indicate that theoretically scaffolded personas approximate the moral judgements observed in human samples, suggesting that LLM-based personas can serve as credible and complementary tools for organizational research in contexts where direct access to executives is limited. We conclude by outlining implications for future research using LLM-based personas in organizational settings.

Natural Language Processing 2601.18486

Demographic Probing of Large Language Models Lacks Construct Validity

Manuel Tonneau, Neil K. R. Seghal, Niyati Malhotra, Victor Orozco-Olvera, Ana María Muñoz Boudet et al. (8 authors)

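Core contribution: Shows that widely used demographic probing rests on a flawed assumption, namely that a single demographic cue (such as a name or dialect) is an interchangeable stand-in for the same demographically conditioned behavior. In practice the approach lacks construct validity, making assessments of model behavior unstable and unreliable.

Method: Focusing on race and gender in a U.S. context, the study tests how different demographic cues (names, dialect, and so on) affect LLM behavior in realistic advice-seeking interactions. It compares the behavioral changes induced by different cues intended to represent the same demographic group and analyzes the sources of the differences between cues.

Key findings: (1) Cues representing the same demographic group induce only partially overlapping changes in model behavior, while differentiation between groups within a given cue is weak and uneven. (2) Estimated disparities are unstable, with magnitude and direction varying across cues. (3) The inconsistency arises partly from variation in how strongly cues encode demographic attributes and from linguistic confounders that independently shape model behavior. Single-cue probing therefore does not yield a single, stable characterization of how LLMs condition on demographic information.

Original abstract: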

Demographic probing is widely used to study how large language models (LLMs) adapt their behavior to signaled demographic attributes. This approach typically uses a single demographic cue in isolation (e.g., a name or dialect) as a signal for group membership, implicitly assuming strong construct validity: that such cues are interchangeable operationalizations of the same underlying, demographically conditioned behavior. We test this assumption in realistic advice-seeking interactions, focusing on race and gender in a U.S. context. We find that cues intended to represent the same demographic group induce only partially overlapping changes in model behavior, while differentiation between groups within a given cue is weak and uneven. Consequently, estimated disparities are unstable, with both magnitude and direction varying across cues. We further show that these inconsistencies partly arise from variation in how strongly cues encode demographic attributes and from linguistic confounders that independently shape model behavior. Together, our findings suggest that demographic probing lacks construct validity: it does not yield a single, stable characterization of how LLMs condition on demographic information, which may reflect a misspecified or fragmented construct. We conclude by recommending the use of multiple, ecologically valid cues and explicit control of confounders to support more defensible claims about demographic effects in LLMs.
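
One way to picture the instability the abstract describes is to estimate a between-group gap separately for each cue and then check whether the gaps agree in direction. The sketch below is a hypothetical illustration, not the authors' analysis: the function names, the nested-dictionary layout (cue, then group, then a list of outcome scores), and the use of a simple difference of means are all assumptions.

```python
from statistics import mean
from typing import Dict, List


def disparity_by_cue(
    scores: Dict[str, Dict[str, List[float]]],  # cue -> group -> outcome scores
    group_a: str,
    group_b: str,
) -> Dict[str, float]:
    """Mean outcome gap (group_a minus group_b), computed separately for each cue."""
    return {
        cue: mean(groups[group_a]) - mean(groups[group_b])
        for cue, groups in scores.items()
    }


def direction_is_stable(disparities: Dict[str, float]) -> bool:
    """True when every non-zero per-cue gap has the same sign."""
    signs = {gap > 0 for gap in disparities.values() if gap != 0}
    return len(signs) <= 1
```

If a name cue yields a positive gap and a dialect cue a negative one for the same pair of groups, `direction_is_stable` returns `False`, which is the kind of cue-dependent reversal that the abstract reports and that a single-cue study would miss.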

Artificial Intelligence 2601.18631

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu et al. (7 authors)

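Core contribution: Proposes a family of multimodal models that learn tool use as a general reasoning skill, allowing them to infer tool utility from task context and intermediate outcomes, coordinate multiple tools, and generalize to unseen tools.

Method: The approach has three parts: (1) a scalable data curation pipeline that exposes models to long-horizon, multi-step tool interactions; (2) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (3) an adaptive learning mechanism that dynamically regulates tool usage.

Key findings: AdaReasoner autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency to task demands without being explicitly trained to do so. It achieves state-of-the-art performance on several challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.

Original abstract: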

When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
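
Tool-GRPO is described only as optimizing tool selection and sequencing from end-task success. As background, the sketch below shows the group-relative advantage used by generic GRPO: several rollouts of the same task are scored, and each reward is standardized against its own group. How Tool-GRPO adapts this is not stated in the abstract; the function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each rollout's reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division defined when all rollouts earn the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary task-success rewards, rollouts whose tool sequence solved the task receive positive advantages and failed ones negative, so no separate value network is needed to rank them.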

Natural Language Processing 2601.18572

One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization

Franziska Weeber, Vera Neplenbroek, Jan Batzner, Sebastian Padó

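Core contribution: Shows the limits of relying on a single sociodemographic cue (such as a name or an explicit attribute mention) to study personalization and bias in large language models, and recommends evaluating with multiple externally valid cues.

Method: The study compares six commonly used persona cues (such as names, pronouns, and explicit descriptions) across seven open and proprietary LLMs on four writing and advice tasks, focusing on how different cues affect model responses in order to assess robustness and external validity.

Key findings: Although cues are highly correlated overall, they produce substantial variance in responses across personas. Conclusions drawn from a single cue may be unreliable, so future personalization research should evaluate multiple externally valid cues.

Original abstract: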

Personalization of LLMs by sociodemographic subgroup often improves user experience, but can also introduce or amplify biases and unfair outcomes across groups. Prior work has employed so-called personas, sociodemographic user attributes conveyed to a model, to study bias in LLMs by relying on a single cue to prompt a persona, such as user names or explicit attribute mentions. This disregards LLM sensitivity to prompt variations (robustness) and the rarity of some cues in real interactions (external validity). We compare six commonly used persona cues across seven open and proprietary LLMs on four writing and advice tasks. While cues are overall highly correlated, they produce substantial variance in responses across personas. We therefore caution against claims from a single persona cue and recommend that future personalization research evaluate multiple externally valid cues.

Machine Learning 2601.18615

Geometry-Free Conditional Diffusion Modeling for Solving the Inverse Electrocardiography Problem

Ramiro Valdes Jara, Adam Meyers

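Core contribution: Proposes a geometry-free, purely data-driven conditional diffusion framework for the inverse problem of electrocardiography. It produces multiple plausible probabilistic reconstructions rather than a single deterministic estimate, capturing the non-unique and underdetermined nature of the problem.

Method: A conditional diffusion model learns a probabilistic mapping from noisy body surface signals to heart surface electric potentials. The approach exploits the generative nature of diffusion models and is entirely data-driven, with no need to build patient-specific geometric meshes. Through the reverse diffusion process, the framework samples heart potential distributions conditioned on the body surface measurements.

Key findings: On a real ECGI dataset, the proposed diffusion approach achieves better reconstruction accuracy than strong deterministic baselines, including a convolutional neural network, a long short-term memory network, and a transformer-based model, pointing to diffusion models as a robust tool for noninvasive cardiac electrophysiology imaging.

Original abstract: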

This paper proposes a data-driven model for solving the inverse problem of electrocardiography, the mathematical problem that forms the basis of electrocardiographic imaging (ECGI). We present a conditional diffusion framework that learns a probabilistic mapping from noisy body surface signals to heart surface electric potentials. The proposed approach leverages the generative nature of diffusion models to capture the non-unique and underdetermined nature of the ECGI inverse problem, enabling probabilistic sampling of multiple reconstructions rather than a single deterministic estimate. Unlike traditional methods, the proposed framework is geometry-free and purely data-driven, alleviating the need for patient-specific mesh construction. We evaluate the method on a real ECGI dataset and compare it against strong deterministic baselines, including a convolutional neural network, long short-term memory network, and transformer-based model. The results demonstrate that the proposed diffusion approach achieves improved reconstruction accuracy, highlighting the potential of diffusion models as a robust tool for noninvasive cardiac electrophysiology imaging.
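
As background on the diffusion side, the sketch below implements the forward noising step of a generic DDPM with a linear beta schedule, the process a denoiser is trained to reverse. Whether the paper uses this schedule or parameterization is not stated in the abstract, so treat every name and default here as an assumption. In a conditional setup like the one described, the denoiser would additionally receive the body surface signals as input; that part is not shown.

```python
import math
import random
from typing import List


def alpha_bar_schedule(
    num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed defaults)."""
    alpha_bars: List[float] = []
    product = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def noise_sample(x0: List[float], alpha_bar: float, rng: random.Random) -> List[float]:
    """Draw x_t from N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I)."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [scale * value + sigma * rng.gauss(0.0, 1.0) for value in x0]
```

Because sampling starts from fresh noise each time, running the learned reverse process repeatedly on the same measurements yields different reconstructions, which is how a diffusion model can expose the non-uniqueness of the inverse problem.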

Computer Vision 2601.18585

GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization

Chenxi Liu, Selena Ling, Alec Jacobson

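Core contribution: Proposes GimmBO, an interactive system that uses Preferential Bayesian Optimization (PBO) to help users efficiently explore the space of weighted merges of diffusion-model adapters, addressing the difficulty of exploring a high-dimensional space with manual slider tuning.

Method: The method uses a two-stage Bayesian optimization backend. It builds an efficient search strategy from the sparsity and constrained weight ranges observed in real-world usage, and it steers the optimization with interactive preference feedback: users choose among generated images, gradually converging on adapter weights that match their intent.

Key findings: In evaluations with simulated users and in a user study, GimmBO shows improved convergence and high success rates, with consistent gains over BO and line-search baselines. The framework is also flexible, supporting several extensions, and lowers the difficulty of exploring the high-dimensional adapter-merging space.

Original abstract: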

Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged with weights, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult, even when the candidate set is limited to 20-30 adapters. We propose GimmBO to support interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.
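
The design space being searched can be illustrated with a toy weighted merge, assuming each adapter contributes an additive delta to the shared base weights. This is a simplified sketch, not GimmBO's code: `merge_adapters`, the flat-list representation, and the adapter names in the note below are hypothetical, and the Bayesian optimization itself is not shown.

```python
from typing import Dict, List


def merge_adapters(
    base: List[float],
    deltas: Dict[str, List[float]],  # adapter name -> weight delta relative to base
    weights: Dict[str, float],       # adapter name -> merge weight (one slider each)
) -> List[float]:
    """Return base + sum_i weights[i] * deltas[i], skipping adapters with zero weight."""
    merged = list(base)
    for name, weight in weights.items():
        if weight == 0.0:
            continue  # sparse merges leave most adapters switched off
        for index, delta in enumerate(deltas[name]):
            merged[index] += weight * delta
    return merged
```

Each entry of `weights` is one dimension of the search space, so a candidate set of 20 to 30 adapters gives a 20- to 30-dimensional problem. That is why manual sliders scale poorly and why the sparsity the authors observe in real usage is useful to an optimizer.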

Computer Vision 2601.18493

DisasterInsight: A Multimodal Benchmark for Function-Aware and Grounded Disaster Assessment

Sara Tehrani, Yonghao Xu, Leif Haglund, Amanda Berg, Michael Felsberg

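Core contribution: Proposes DisasterInsight, a multimodal benchmark built around realistic disaster analysis tasks for evaluating the functional understanding and instruction robustness of vision-language models in disaster scenarios, and DI-Chat, a domain-adapted baseline obtained through parameter-efficient fine-tuning.

Method: The xBD dataset is restructured into approximately 112K building-centered instances that support instruction-diverse evaluation across multiple tasks: building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. DI-Chat is obtained by fine-tuning existing vision-language model backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA).

Key findings: Existing generic and remote-sensing vision-language models show substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level classification, disaster-type classification, and report generation quality, while building-function classification remains challenging for all models. The benchmark provides a unified platform for studying grounded multimodal reasoning in disaster imagery.

Original abstract: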

Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.
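
For context on the fine-tuning technique named here, the sketch below computes the effective weight of a LoRA-adapted layer, W + (alpha / r) * B A, using plain nested lists. It illustrates the general LoRA formulation only; the ranks, scaling, and target layers used for DI-Chat are not given in the abstract, and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    rank = len(a)
    scale = alpha / rank
    d_out, d_in = len(w), len(w[0])
    return [
        [
            w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
            for j in range(d_in)
        ]
        for i in range(d_out)
    ]
```

Only `a` and `b` are trained, so the number of trainable parameters is r * (d_in + d_out) rather than d_in * d_out, which is what makes the adaptation parameter-efficient.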

Artificial Intelligence 2601.18353

Can Good Writing Be Generative? Expert-Level AI Writing Emerges through Fine-Tuning on High-Quality Books

Tuhin Chakrabarty, Paramveer S. Dhillon

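Core contribution: Shows through a behavioral experiment that large language models (LLMs) fine-tuned on an author's complete works can produce text that expert judges prefer over emulations written by human experts, challenging the long-held assumption that creative writing is a uniquely human domain.

Method: In a behavioral experiment, 28 MFA-trained writers competed against three LLMs in emulating the styles of 50 critically acclaimed authors. Two conditions were compared: in-context prompting and fine-tuning on the authors' complete works. 28 expert judges and 131 lay judges then made blind pairwise comparisons.

Key findings: Under in-context prompting, expert judges preferred human writing in 82.7% of cases; after fine-tuning on the authors' works, this reversed to a 62% preference for AI writing. Lay judges consistently preferred AI writing. Debrief interviews showed that expert writers' preference for AI writing triggered an identity crisis, eroded their aesthetic confidence, and led them to reconsider what constitutes "good writing."

Original abstract: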

Creative writing has long been considered a uniquely human endeavor, requiring voice and style that machines could not replicate. This assumption is challenged by Generative AI that can emulate thousands of author styles in seconds with negligible marginal labor. To understand this better, we conducted a behavioral experiment where 28 MFA writers (experts) competed against three LLMs in emulating 50 critically acclaimed authors. Based on blind pairwise comparisons by 28 expert judges and 131 lay judges, we find that experts preferred human writing in 82.7% of cases under the in-context prompting condition, but this reversed to a 62% preference for AI after fine-tuning on authors' complete works. Lay judges, however, consistently preferred AI writing. Debrief interviews with expert writers revealed that their preference for AI writing triggered an identity crisis, eroding aesthetic confidence and prompting them to question what constitutes "good writing." These findings challenge discourse about AI's creative limitations and raise fundamental questions about the future of creative labor.

Digital Libraries 2601.18271

Designing large language model prompts to extract scores from messy text: A shared dataset and challenge

Mike Thelwall

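Core contribution: Introduces a shared dataset and challenge for evaluating how accurately large language models can extract numerical scores from unstructured text, with the aim of advancing prompt design and improving understanding of LLM capabilities for complex numerical tasks.

Method: The dataset contains 1446 short texts, each describing a research quality score on the UK scale of 1* to 4*, in messy form. The task is to design an effective LLM prompt that makes the model output only a number (or -1, the missing value code) and deduce the correct score from the text. The paper provides a simple example prompt as a baseline.

Key findings: The initial solution (a simple prompt) reaches 72.6% accuracy on the dataset, setting a clear baseline for later work. The challenge lies not only in getting the LLM to parse scores correctly but also in ensuring that its output strictly follows the required format (a number only) and in handling texts with missing or oddly formatted scores.

Original abstract: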

In some areas of computing, natural language processing and information science, progress is made by sharing datasets and challenging the community to design the best algorithm for an associated task. This article introduces a shared dataset of 1446 short texts, each of which describes a research quality score on the UK scale of 1* to 4*. This is a messy collection, with some texts not containing scores and others including invalid scores or strange formats. With this dataset there is also a description of what constitutes a valid score and a "gold standard" of the correct scores for these texts (including missing values). The challenge is to design a prompt for Large Language Models (LLMs) to extract the scores from these texts as accurately as possible. The format for the response should be a number and no other text so there are two aspects to the challenge: ensuring that the LLM returns only a number, and instructing it to deduce the correct number for the text. As part of this, the LLM prompt needs to explain when to return the missing value code, -1, instead of a number when the text does not clearly contain one. The article also provides an example of a simple prompt. The purpose of the challenge is twofold: to get an effective solution to this problem, and to increase understanding of prompt design and LLM capabilities for complex numerical tasks. The initial solution suggested has an accuracy of 72.6%, so the challenge is to beat this.
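
The two halves of the challenge, getting a bare number back and scoring it against the gold standard, can be sketched as a strict response parser plus an accuracy function. This is an illustrative harness, not the article's evaluation code. It assumes only the whole scores 1 to 4 count as valid; the article's own definition of a valid score may be broader.

```python
import re
from typing import List

MISSING = -1                 # missing-value code named in the abstract
VALID_SCORES = {1, 2, 3, 4}  # assumption: whole scores on the 1* to 4* scale


def parse_llm_response(response: str) -> int:
    """Accept a response only if it is a bare valid integer; otherwise return MISSING."""
    text = response.strip()
    if re.fullmatch(r"-?[0-9]+", text) is None:
        return MISSING  # extra words, stars, or decimals break the required format
    value = int(text)
    return value if value in VALID_SCORES else MISSING


def accuracy(predicted: List[int], gold: List[int]) -> float:
    """Fraction of texts where the parsed score equals the gold score (MISSING included)."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Treating any malformed reply as `MISSING` keeps format failures visible in the accuracy figure instead of silently repairing them, which matters when the prompt itself is the thing being evaluated.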

Human-Computer Interaction 2601.18759

UI Remix: Supporting UI Design Through Interactive Example Retrieval and Remixing

Junling Wang, Hongyi Lan, Xiaotian Su, Mustafa Doga Dogan, April Yi Wang

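Core contribution: Proposes UI Remix, a system that helps non-expert designers create mobile UIs through an interactive workflow combining retrieval, selection, and adaptation of examples at global and local levels, and that uses source transparency cues to strengthen users' trust in design decisions.

Method: Built on a multimodal retrieval-augmented generation (MMRAG) model, the system supports iterative search, selection, and adaptation of design examples at both the whole-interface and component levels. It also presents source transparency cues such as ratings, download counts, and developer information to help users assess and trust the examples they select.

Key findings: In an empirical study with 24 end users, UI Remix significantly improved participants' ability to achieve their design goals, facilitated effective design iteration, and encouraged exploration of alternative designs. Participants reported that source transparency cues increased their confidence in adapting examples.

Original abstract: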

Designing user interfaces (UIs) is a critical step when launching products, building portfolios, or personalizing projects, yet end users without design expertise often struggle to articulate their intent and to trust design choices. Existing example-based tools either promote broad exploration, which can cause overwhelm and design drift, or require adapting a single example, risking design fixation. We present UI Remix, an interactive system that supports mobile UI design through an example-driven design workflow. Powered by a multimodal retrieval-augmented generation (MMRAG) model, UI Remix enables iterative search, selection, and adaptation of examples at both the global (whole interface) and local (component) level. To foster trust, it presents source transparency cues such as ratings, download counts, and developer information. In an empirical study with 24 end users, UI Remix significantly improved participants' ability to achieve their design goals, facilitated effective iteration, and encouraged exploration of alternative designs. Participants also reported that source transparency cues enhanced their confidence in adapting examples. Our findings suggest new directions for AI-assisted, example-driven systems that empower end users to design with greater control, trust, and openness to exploration.
