Repositório Digital de Publicações Científicas: Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA



Repositório Digital de Publicações Científicas da Universidade de Évora

Please use this identifier to cite or link to this item: http://hdl.handle.net/10174/41416

Title:	Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA
Authors:	Lamar-Leon, Javier; Nogueira, Vitor; Salgueiro, Pedro; Quaresma, Paulo
Editors:	Pan, Jiayi; Li, Xinghua
Keywords:	Image Captioning; Remote Sensing; LLM; LoRA
Issue Date:	4-Jan-2026
Publisher:	Remote Sensing MDPI
Citation:	León, Javier Lamar, Vitor Nogueira, Pedro Salgueiro, and Paulo Quaresma. 2026. "Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA" Remote Sensing 18, no. 1: 166. https://doi.org/10.3390/rs18010166
Abstract:	Describing land cover changes from multi-temporal remote sensing imagery requires capturing both visual transformations and their semantic meaning in natural language. Existing methods often struggle to balance visual accuracy with descriptive coherence. We propose MVLT-LoRA-CC (Multi-modal Vision Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes paired temporal images through patch embeddings and transformer blocks, aligning visual and textual representations via a multi-modal adapter. To improve efficiency and avoid unnecessary parameter growth, LoRA modules are inserted only into the attention projection layers and cross-modal adapter blocks rather than being applied uniformly to all linear layers. This targeted design preserves general linguistic knowledge while enabling effective adaptation to remote sensing change description. To assess performance, we introduce the Complementary Consistency Score (CCS) framework, which evaluates both descriptive fidelity for change instances and classification accuracy for no-change cases. Experiments on the LEVIR-CC test set demonstrate that MVLT-LoRA-CC generates semantically accurate captions, surpassing prior methods in both descriptive richness and temporal change recognition. The approach establishes a scalable solution for multi-modal land cover change description in remote sensing applications.
URI:	http://hdl.handle.net/10174/41416
Type:	article
Appears in Collections:	INF - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica
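
Illustrative note: the abstract states that LoRA modules are inserted only into the attention projection layers and cross-modal adapter blocks. The sketch below shows, in plain Python, what such selective insertion could look like. It is not the authors' implementation: the layer names (`q_proj`, `v_proj`, `adapter`, `mlp.fc1`), the rank, the scaling factor `alpha`, and the helper `add_lora` are all assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class Linear:
    """A frozen pretrained linear layer (bias omitted for brevity)."""

    def __init__(self, d_in, d_out, rng):
        self.weight = [[rng.gauss(0.0, 0.1) for _ in range(d_in)]
                       for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.weight, x)


class LoRALinear:
    """Wraps a frozen layer W and adds a low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base, rank, alpha, rng):
        d_out, d_in = len(base.weight), len(base.weight[0])
        self.base = base
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the wrapped layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        update = matvec(self.B, matvec(self.A, x))
        return [y + self.scale * u for y, u in zip(self.base(x), update)]


# Hypothetical name fragments marking attention projections and adapter blocks.
TARGET_KEYWORDS = ("q_proj", "k_proj", "v_proj", "o_proj", "adapter")


def add_lora(layers, rank=4, alpha=8.0, seed=0):
    """Wrap only the targeted layers; every other layer is returned unchanged."""
    rng = random.Random(seed)
    return {
        name: LoRALinear(layer, rank, alpha, rng)
        if any(key in name for key in TARGET_KEYWORDS) else layer
        for name, layer in layers.items()
    }


rng = random.Random(0)
layers = {
    "block0.attn.q_proj": Linear(8, 8, rng),
    "block0.attn.v_proj": Linear(8, 8, rng),
    "block0.mlp.fc1": Linear(8, 16, rng),   # not targeted, stays frozen as-is
    "adapter.proj": Linear(8, 8, rng),
}
wrapped = add_lora(layers)
```

With rank 4, each wrapped 8 x 8 layer gains 2 x 4 x 8 = 64 trainable values, the same number as its 64 frozen weights; the savings only become significant at the much larger layer widths of an actual LLM, where the rank is far smaller than the layer dimension.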

Files in This Item:

File	Description	Size	Format
Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA - remotesensing-18-00166.pdf		8.39 MB	Adobe PDF

Serviços de Ciência e Cooperação - Universidade de Évora