Accounting for Focus Ambiguity in Visual Questions

University of Texas at Austin · University of Colorado Boulder

Focus Ambiguity

Overview

No published work on visual question answering (VQA) accounts for ambiguity about where in the image the content described in the question is located. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds every plausible image region a question could refer to when arriving at valid answers. We then analyze our dataset and compare it to existing datasets to reveal its unique properties. Finally, we benchmark modern models on two novel tasks related to acknowledging focus ambiguity: recognizing whether a visual question has focus ambiguity, and locating all plausible focus regions within the image. Results show that the dataset is challenging for modern models.
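To make the two tasks concrete, below is a minimal Python sketch of what a dataset entry and a simple region-matching check could look like. The field names (`question`, `focus_regions`, `has_focus_ambiguity`), the example content, and the IoU threshold are illustrative assumptions, not the dataset's actual schema or the paper's evaluation protocol.

```python
# Hypothetical sketch; field names, values, and threshold are illustrative only.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical entry: one question with two plausible focus regions.
example = {
    "image": "kitchen.jpg",              # placeholder file name
    "question": "What color is the mug?",
    "has_focus_ambiguity": True,         # Task 1: binary ambiguity label
    "focus_regions": [                   # Task 2: all plausible regions (boxes)
        (40, 60, 120, 140),              # e.g., mug on the counter
        (300, 80, 370, 150),             # e.g., mug on the shelf
    ],
}

def region_is_plausible(predicted_box, entry, threshold=0.5):
    """Task 2 check: does a predicted region match any plausible focus region?"""
    return any(iou(predicted_box, gt) >= threshold for gt in entry["focus_regions"])

print(region_is_plausible((45, 65, 118, 138), example))  # True
print(region_is_plausible((0, 0, 30, 30), example))      # False
```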

Contact

For any questions, comments, or feedback, please contact Chongyan Chen at chongyanchen_hci@utexas.edu.