Please use this identifier to cite or link to this item:
http://hdl.handle.net/10174/41416
| Title: | Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA |
| Authors: | Lamar-Leon, Javier; Nogueira, Vitor; Salgueiro, Pedro; Quaresma, Paulo |
| Editors: | Pan, Jiayi; Li, Xinghua |
| Keywords: | Image Captioning; Remote Sensing; LLM; LoRA |
| Issue Date: | 4-Jan-2026 |
| Publisher: | Remote Sensing, MDPI |
| Citation: | León, Javier Lamar, Vitor Nogueira, Pedro Salgueiro, and Paulo Quaresma. 2026. "Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA" Remote Sensing 18, no. 1: 166. https://doi.org/10.3390/rs18010166 |
| Abstract: | Describing land cover changes from multi-temporal remote sensing imagery requires capturing both visual transformations and their semantic meaning in natural language. Existing methods often struggle to balance visual accuracy with descriptive coherence. We propose MVLT-LoRA-CC (Multi-modal Vision Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes paired temporal images through patch embeddings and transformer blocks, aligning visual and textual representations via a multi-modal adapter. To improve efficiency and avoid unnecessary parameter growth, LoRA modules are inserted only into the attention projection layers and cross-modal adapter blocks rather than being applied uniformly to all linear layers. This targeted design preserves general linguistic knowledge while enabling effective adaptation to remote sensing change description. To assess performance, we introduce the Complementary Consistency Score (CCS) framework, which evaluates both descriptive fidelity for change instances and classification accuracy for no-change cases. Experiments on the LEVIR-CC test set demonstrate that MVLT-LoRA-CC generates semantically accurate captions, surpassing prior methods in both descriptive richness and temporal change recognition. The approach establishes a scalable solution for multi-modal land cover change description in remote sensing applications. |
| URI: | http://hdl.handle.net/10174/41416 |
| Type: | article |
| Appears in Collections: | INF - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.