Please use this identifier to cite or link to this item: http://hdl.handle.net/10174/41416

Title: Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA
Authors: Lamar-Leon, Javier
Nogueira, Vitor
Salgueiro, Pedro
Quaresma, Paulo
Editors: Pan, Jiayi
Li, Xinghua
Keywords: Image Captioning
Remote Sensing
LLM
LoRA
Issue Date: 4-Jan-2026
Publisher: Remote Sensing MDPI
Citation: León, Javier Lamar, Vitor Nogueira, Pedro Salgueiro, and Paulo Quaresma. 2026. "Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA" Remote Sensing 18, no. 1: 166. https://doi.org/10.3390/rs18010166
Abstract: Describing land cover changes from multi-temporal remote sensing imagery requires capturing both visual transformations and their semantic meaning in natural language. Existing methods often struggle to balance visual accuracy with descriptive coherence. We propose MVLT-LoRA-CC (Multi-modal Vision Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes paired temporal images through patch embeddings and transformer blocks, aligning visual and textual representations via a multi-modal adapter. To improve efficiency and avoid unnecessary parameter growth, LoRA modules are selectively inserted only into the attention projection layers and cross-modal adapter blocks rather than being uniformly applied to all linear layers. This targeted design preserves general linguistic knowledge while enabling effective adaptation to remote sensing change description. To assess performance, we introduce the Complementary Consistency Score (CCS) framework, which evaluates both descriptive fidelity for change instances and classification accuracy for no-change cases. Experiments on the LEVIR-CC test set demonstrate that MVLT-LoRA-CC generates semantically accurate captions, surpassing prior methods in both descriptive richness and temporal change recognition. The approach establishes a scalable solution for multi-modal land cover change description in remote sensing applications.
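Illustrative note (not part of the published record): the selective LoRA placement described in the abstract can be sketched as a name-based filter over a model's linear layers. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The layer names, shapes, rank, and the regular expression are all hypothetical. It relies only on the standard LoRA parameterization, in which a frozen weight W of shape (out, in) is augmented by a low-rank product B·A with B of shape (out, r) and A of shape (r, in), so each adapted layer adds r × (out + in) trainable parameters.

```python
import re

# Hypothetical layer inventory: name -> (out_features, in_features).
# Names and shapes are illustrative assumptions, not taken from the paper.
LAYERS = {
    "vit.block0.attn.q_proj": (768, 768),
    "vit.block0.attn.k_proj": (768, 768),
    "vit.block0.attn.v_proj": (768, 768),
    "vit.block0.mlp.fc1": (3072, 768),
    "vit.block0.mlp.fc2": (768, 3072),
    "adapter.block0.proj": (4096, 768),
    "llm.layer0.attn.q_proj": (4096, 4096),
    "llm.layer0.attn.v_proj": (4096, 4096),
    "llm.layer0.mlp.up_proj": (11008, 4096),
}

# Assumed naming pattern for "attention projection layers and
# cross-modal adapter blocks"; a real model may name these differently.
TARGET = re.compile(r"(\.attn\.[qkvo]_proj$)|(^adapter\.)")


def lora_params(out_features, in_features, rank):
    """Trainable parameters added by one LoRA pair: B (out x r) plus A (r x in)."""
    return rank * (out_features + in_features)


def select_targets(layers, pattern=TARGET):
    """Return the layer names that would receive a LoRA module."""
    return [name for name in layers if pattern.search(name)]


def added_params(layers, names, rank):
    """Total trainable parameters added when LoRA is attached to `names`."""
    return sum(lora_params(*layers[name], rank) for name in names)


if __name__ == "__main__":
    rank = 8  # hypothetical rank, chosen only for illustration
    selective = select_targets(LAYERS)
    uniform = list(LAYERS)  # baseline: LoRA on every linear layer
    print("selective targets:", selective)
    print("selective added params:", added_params(LAYERS, selective, rank))
    print("uniform added params:  ", added_params(LAYERS, uniform, rank))
```

In this toy inventory the MLP layers are skipped, so the selective configuration adds fewer trainable parameters than attaching LoRA to every linear layer. The size of that gap in the actual model depends on its real architecture and rank, which this record does not specify.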
URI: http://hdl.handle.net/10174/41416
Type: article
Appears in Collections: INF - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica

Files in This Item:

File: Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA - remotesensing-18-00166.pdf
Size: 8.39 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
