Many situations require the simultaneous processing of auditory and visual information; however, stimuli presented to one sensory modality can sometimes interfere with processing in a second sensory ...
[ICCV 2025][Object Detection][Visual Prompt] This paper proposes ModPrompt, an encoder-decoder-based visual prompting strategy that adapts vision-language object detectors (e.g., ...
Abstract: This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video ...
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on the modality-aligned setting, i.e., the audio and visual modalities are both assumed ...
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that match a given textual query only partially. This task is inherently challenging due to the modality gap between text ...
Medical Visual Question Answering (Med-VQA) aims to combine medical image understanding with clinical language reasoning, enabling automatic answering of natural language questions grounded on medical ...