Please use this identifier to cite or link to this item:
http://hdl.handle.net/10174/39862
| Title: | Enhancing Large Language Models for Underrepresented Varieties: Pretraining Strategies in the Galician-Portuguese Diasystem |
| Authors: | Rodríguez, Pablo; Gamallo, Pablo; Santos, Daniel; Sotelo, Susana; Paniagua, Silvia; Pichel, José; Salgueiro, Pedro; Nogueira, Vítor; Quaresma, Paulo; Garcia, Marcos; Barro, Senén |
| Keywords: | Large Language Models; Continual Pretraining; European Portuguese; Galician |
| Issue Date: | Oct-2025 |
| Citation: | Rodríguez, P., Gamallo, P., Santos, D., Sotelo, S., Paniagua, S., Pichel, J. R., Salgueiro, P., Nogueira, V., Quaresma, P., Garcia, M., & Barro, S. (2025). Enhancing Large Language Models for Underrepresented Varieties: Pretraining Strategies in the Galician-Portuguese Diasystem. Journal of the Brazilian Computer Society, 31(1), 1050–1063. https://doi.org/10.5753/jbcs.2025.5766 |
| Abstract: | This study presents a systematic exploration of strategies for pretraining generative Large Language Models (LLMs) within the Galician-Portuguese diasystem, by focusing on two underrepresented varieties of this diasystem, namely European Portuguese and Galician. We investigate the impact of combining versus separating linguistic varieties during continued pretraining, the trade-offs between large-scale noisy data and smaller high-quality corpora, and the potential gains from incorporating instruction-based data during the training phase instead of in post-training (e.g., instruction tuning). Our findings show that the inclusion of language varieties in training enhances both task-solving performance and linguistic quality in text generation, especially when leveraging curated linguistic resources. By integrating technical experimentation with sociolinguistic insight, this work underscores the importance of equitable and context-aware LLM development in multilingual and minority-language settings. |
| URI: | https://journals-sol.sbc.org.br/index.php/jbcs/article/view/5766 http://hdl.handle.net/10174/39862 |
| Type: | article |
| Appears in Collections: | VISTALab - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.