@inproceedings{golchin-etal-2023-mask,
title = "Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords",
author = "Golchin, Shahriar and
Surdeanu, Mihai and
Tavabi, Nazgol and
Kiapour, Ata",
editor = "Can, Burcu and
Mozes, Maximilian and
Cahyawijaya, Samuel and
Saphra, Naomi and
Kassner, Nora and
Ravfogel, Shauli and
Ravichander, Abhilasha and
Zhao, Chen and
Augenstein, Isabelle and
Rogers, Anna and
Cho, Kyunghyun and
Grefenstette, Edward and
Voita, Lena",
booktitle = "Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.repl4nlp-1.2",
doi = "10.18653/v1/2023.repl4nlp-1.2",
pages = "13--21",
abstract = "We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15{\%} of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="golchin-etal-2023-mask">
<titleInfo>
<title>Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords</title>
</titleInfo>
<name type="personal">
<namePart type="given">Shahriar</namePart>
<namePart type="family">Golchin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mihai</namePart>
<namePart type="family">Surdeanu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nazgol</namePart>
<namePart type="family">Tavabi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ata</namePart>
<namePart type="family">Kiapour</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2023-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Burcu</namePart>
<namePart type="family">Can</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maximilian</namePart>
<namePart type="family">Mozes</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Samuel</namePart>
<namePart type="family">Cahyawijaya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Naomi</namePart>
<namePart type="family">Saphra</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nora</namePart>
<namePart type="family">Kassner</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shauli</namePart>
<namePart type="family">Ravfogel</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Abhilasha</namePart>
<namePart type="family">Ravichander</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chen</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Isabelle</namePart>
<namePart type="family">Augenstein</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Anna</namePart>
<namePart type="family">Rogers</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kyunghyun</namePart>
<namePart type="family">Cho</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Edward</namePart>
<namePart type="family">Grefenstette</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lena</namePart>
<namePart type="family">Voita</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Toronto, Canada</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).</abstract>
<identifier type="citekey">golchin-etal-2023-mask</identifier>
<identifier type="doi">10.18653/v1/2023.repl4nlp-1.2</identifier>
<location>
<url>https://aclanthology.org/2023.repl4nlp-1.2</url>
</location>
<part>
<date>2023-07</date>
<extent unit="page">
<start>13</start>
<end>21</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
%A Golchin, Shahriar
%A Surdeanu, Mihai
%A Tavabi, Nazgol
%A Kiapour, Ata
%Y Can, Burcu
%Y Mozes, Maximilian
%Y Cahyawijaya, Samuel
%Y Saphra, Naomi
%Y Kassner, Nora
%Y Ravfogel, Shauli
%Y Ravichander, Abhilasha
%Y Zhao, Chen
%Y Augenstein, Isabelle
%Y Rogers, Anna
%Y Cho, Kyunghyun
%Y Grefenstette, Edward
%Y Voita, Lena
%S Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
%D 2023
%8 July
%I Association for Computational Linguistics
%C Toronto, Canada
%F golchin-etal-2023-mask
%X We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
%R 10.18653/v1/2023.repl4nlp-1.2
%U https://aclanthology.org/2023.repl4nlp-1.2
%U https://doi.org/10.18653/v1/2023.repl4nlp-1.2
%P 13-21
Markdown (Informal)
[Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords](https://aclanthology.org/2023.repl4nlp-1.2) (Golchin et al., RepL4NLP 2023)
ACL
Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, and Ata Kiapour. 2023. Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 13–21, Toronto, Canada. Association for Computational Linguistics.
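
For readers who want a concrete picture of the method the abstract describes, below is a minimal Python sketch of keyword-based masking: extract in-domain keywords with KeyBERT, then mask those occurrences instead of masking tokens at random. The example sentence, the choice of `bert-large-uncased`, and the word-level masking heuristic are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of keyword-based masking in the spirit of the paper.
# Assumes the `keybert` and `transformers` packages are installed;
# illustrative only, not the authors' code.
from keybert import KeyBERT
from transformers import AutoTokenizer

doc = (
    "The patient presented with acute myocardial infarction and underwent "
    "percutaneous coronary intervention."
)

# 1) Identify single-word in-domain keywords (KeyBERT lowercases by default).
kw_model = KeyBERT()
keywords = {
    kw
    for kw, _score in kw_model.extract_keywords(
        doc, keyphrase_ngram_range=(1, 1), top_n=5
    )
}

# 2) Mask keyword occurrences at the word level, then tokenize for MLM.
# Note: a word that splits into several wordpieces is collapsed to a single
# [MASK] here; a full pre-training pipeline would mask every wordpiece of the
# keyword and keep the original token ids as the MLM labels.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
masked_words = [
    tokenizer.mask_token if w.lower().strip(".,") in keywords else w
    for w in doc.split()
]
inputs = tokenizer(" ".join(masked_words), return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))
```

The masked sequences produced this way would then feed a standard masked-language-modeling objective during the in-domain pre-training stage, before task-specific fine-tuning.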