Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms

IEEE International Conference on Robotics and Automation (ICRA 2025)
Australian Institute for Machine Learning (AIML), University of Adelaide, Australia
chun-jung.lin, sourav.garg, tat-jun.chin, feras.dayoub@adelaide.edu.au

Abstract

We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundation model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. To learn correspondences and mis-correspondences between an image pair effectively for the change detection task, we propose to a) “freeze” the backbone in order to retain the generality of dense foundation features, and b) employ “full-image” cross-attention to better handle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate that our method generalizes better than existing state-of-the-art approaches: it is robust to photometric and geometric variations and adapts better when fine-tuned to new environments. Detailed ablation studies further validate the contribution of each component of our architecture. Our source code is available at: https://github.com/ChadLin9596/Robust-SceneChange-Detection.

BibTeX

@misc{lin2024robustscenechangedetection,
  title={Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms},
  author={Chun-Jung Lin and Sourav Garg and Tat-Jun Chin and Feras Dayoub},
  year={2024},
  eprint={2409.16850},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.16850},
}

Fig. 1: Change detection on unaligned images: we approach the change detection problem with a cross-attention module, which makes detection robust on unaligned scenes.

Fig. 2: Architecture: An overview of the proposed change detection architecture, in which the backbone is kept frozen to achieve better overall generalization. F0 and F1 are the dense features extracted from the t0 and t1 images, respectively.
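
To make the comparison between F0 and F1 concrete, the following is a minimal pure-Python sketch of single-head cross-attention in which every token of F0 attends to every token of F1, that is, over the full image rather than a local window. It is an illustration under stated assumptions, not the authors' implementation: the names `softmax` and `cross_attention` are hypothetical, the learned query/key/value projections and multiple heads of a real module are omitted, and the features are plain lists of floats rather than DINOv2 outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(f0, f1):
    """Attend from every token in f0 (queries) to every token in f1 (keys/values).

    f0: list of N0 token vectors, each a list of `dim` floats.
    f1: list of N1 token vectors with the same `dim`.
    Returns N0 vectors, each a weighted average of the f1 tokens.
    Assumption: identity projections, single head (illustrative only).
    """
    dim = len(f0[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    attended = []
    for query in f0:
        # Similarity of this f0 token to ALL f1 tokens ("full-image").
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in f1]
        weights = softmax(scores)
        # Weighted sum of the f1 tokens, one output channel at a time.
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, f1)) for d in range(dim)]
        )
    return attended


# Toy usage: two 2-D tokens per image (hypothetical values).
toy_f0 = [[1.0, 0.0], [0.0, 1.0]]
toy_f1 = [[0.0, 1.0], [1.0, 0.0]]
toy_out = cross_attention(toy_f0, toy_f1)
print(len(toy_out), len(toy_out[0]))
```

Because each query scores all tokens of the other image, a region that has shifted under a viewpoint change can still find its counterpart, which is the intuition behind preferring full-image attention for unaligned pairs.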

TABLE I: Change detection datasets: we list the number of image pairs, the number of scenes/sources, and the environment of each dataset considered. “imgs” and “env.” abbreviate “images” and “environment”, respectively.

TABLE II: Aligned and Unaligned Test sets: the definition and number of image pairs of each test set.

TABLE IV: Different Viewpoint Augmentation: we report F1-scores on the VL-CMU-CD dataset after training with the unaligned dataset.
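
For readers unfamiliar with the metric used in these tables, the sketch below computes the F1-score between a predicted binary change mask and a ground-truth mask. It is a minimal illustration under assumptions: the function name `f1_score` is hypothetical, masks are nested lists of 0/1 values, and returning 0.0 when there are no positive pixels at all is a convention chosen here. How the paper aggregates scores (per image or over a whole test set) is not stated in this page, so the sketch covers a single mask pair only.

```python
def f1_score(pred, target):
    """F1-score of the 'changed' class for two binary masks of equal shape.

    pred, target: lists of rows, each row a list of 0/1 values.
    F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # change predicted and present
            elif p and not t:
                fp += 1  # change predicted but absent
            elif t and not p:
                fn += 1  # change present but missed
    denom = 2 * tp + fp + fn
    # Assumed convention: no positives anywhere gives 0.0.
    return 2 * tp / denom if denom else 0.0


# Toy usage with hypothetical 2x2 masks.
toy_pred = [[1, 1], [0, 0]]
toy_target = [[1, 0], [1, 0]]
print(f1_score(toy_pred, toy_target))
```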

Fig. 4: Qualitative Results: rows 2 and 5 show results from the “Aligned” test set of VL-CMU-CD; the other rows are from “Diff-2”. The first scene compares the same t0 image with a sequence of t1 images, while the second compares a sequence of t0 images with the same t1 image.

TABLE V: F1-score after fine-tuning on PSCD: we report F1-scores on the aligned and unaligned test sets of VL-CMU-CD and PSCD, so that adaptation ability can be compared with Tab. III.

TABLE VI: F1-score of Different Feature Comparators: we compare the results obtained after replacing our cross-attention modules with the feature comparators used by the baselines.

TABLE VII: Choice of Architecture: we compare different backbones combined with different cross-attention compositions to motivate our choice of the DINOv2 backbone and two cross-attention modules.
