Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context

Michael Ginn, Alexis Palmer


Abstract
Generalization is of particular importance in resource-constrained settings, where the available training data may represent only a small fraction of the distribution of possible texts. We investigate the ability of morpheme labeling models to generalize by evaluating their performance on unseen genres of text, and we experiment with strategies for closing the gap between performance on in-distribution and out-of-distribution data. Specifically, we use weight decay optimization, output denoising, and iterative pseudo-labeling, and achieve a 2% improvement on a test set containing texts from unseen genres. All experiments are performed using texts written in the Mayan language Uspanteko.
Anthology ID:
2023.genbench-1.7
Volume:
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
Month:
December
Year:
2023
Address:
Singapore
Editors:
Dieuwke Hupkes, Verna Dankers, Khuyagbaatar Batsuren, Koustuv Sinha, Amirhossein Kazemnejad, Christos Christodoulopoulos, Ryan Cotterell, Elia Bruni
Venues:
GenBench | WS
Publisher:
Association for Computational Linguistics
Pages:
89–98
URL:
https://aclanthology.org/2023.genbench-1.7
DOI:
10.18653/v1/2023.genbench-1.7
Cite (ACL):
Michael Ginn and Alexis Palmer. 2023. Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context. In Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, pages 89–98, Singapore. Association for Computational Linguistics.
Cite (Informal):
Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context (Ginn & Palmer, GenBench-WS 2023)
PDF:
https://aclanthology.org/2023.genbench-1.7.pdf
Video:
https://aclanthology.org/2023.genbench-1.7.mp4