Please use this identifier to cite or link to this item: http://hdl.handle.net/10174/39862

Title: Enhancing Large Language Models for Underrepresented Varieties: Pretraining Strategies in the Galician-Portuguese Diasystem
Authors: Rodríguez, Pablo
Gamallo, Pablo
Santos, Daniel
Sotelo, Susana
Paniagua, Silvia
Pichel, José
Salgueiro, Pedro
Nogueira, Vítor
Quaresma, Paulo
Garcia, Marcos
Barro, Senén
Keywords: Large Language Models
Continual Pretraining
European Portuguese
Galician
Issue Date: Oct-2025
Citation: Rodríguez, P., Gamallo, P., Santos, D., Sotelo, S., Paniagua, S., Pichel, J. R., Salgueiro, P., Nogueira, V., Quaresma, P., Garcia, M., & Barro, S. (2025). Enhancing Large Language Models for Underrepresented Varieties: Pretraining Strategies in the Galician-Portuguese Diasystem. Journal of the Brazilian Computer Society, 31(1), 1050–1063. https://doi.org/10.5753/jbcs.2025.5766
Abstract: This study presents a systematic exploration of strategies for pretraining generative Large Language Models (LLMs) within the Galician-Portuguese diasystem, by focusing on two underrepresented varieties of this diasystem, namely European Portuguese and Galician. We investigate the impact of combining versus separating linguistic varieties during continued pretraining, the trade-offs between large-scale noisy data and smaller high-quality corpora, and the potential gains from incorporating instruction-based data during the training phase instead of in post-training (e.g., instruction tuning). Our findings show that the inclusion of language varieties in training enhances both task-solving performance and linguistic quality in text generation, especially when leveraging curated linguistic resources. By integrating technical experimentation with sociolinguistic insight, this work underscores the importance of equitable and context-aware LLM development in multilingual and minority-language settings.
URI: https://journals-sol.sbc.org.br/index.php/jbcs/article/view/5766
http://hdl.handle.net/10174/39862
Type: article
Appears in Collections: VISTALab - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica

Files in This Item:

File: 5766-Article Text-32717-2-10-20251014.pdf
Size: 463.26 kB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
