Samantha Wray


2024

LexiVault: A Repository for Psycholinguistic Lexicons of Lesser-studied Languages
Hind Saddiki | Samantha Wray | Daisy Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents LexiVault, an open-source web tool with annotated lexicons and rich retrieval capabilities primarily developed for, but not restricted to, the support of psycholinguistic research with key measures to design stimuli for low-resource languages. Psycholinguistic research relies on human responses to carefully crafted stimuli for a better understanding of the mechanisms by which we learn, store and process language. Stimuli design captures specific language properties such as frequency, morphological complexity, or stem likelihood in a part of speech, typically derived from a corpus that is representative of the average speaker’s linguistic experience. These measures are more readily available for well-resourced languages, whereas efforts for lesser-studied languages come with substantial overhead for the researcher to build corpora and calculate these measures from scratch. This stumbling block widens the gap, further skewing our modeling of the mental architecture of linguistic processing towards a small, over-represented set of the world’s languages. To lessen this burden, we designed LexiVault to be user-friendly and to accommodate incremental growth of new and existing low-resource language lexicons in the system through moderated community contributions, while abstracting programming complexity to foster more interest from the psycholinguistics community in exploring low-resource languages.

2018

Development of Natural Language Processing Tools for Cook Islands Māori
Rolando Coto Solano | Sally Akevai Nicholas | Samantha Wray
Proceedings of the Australasian Language Technology Association Workshop 2018

This paper presents three ongoing projects for NLP in Cook Islands Māori: Untrained Forced Alignment (approx. 9% error when detecting the center of words), speech-to-text (37% WER in the best trained models) and POS tagging (92% accuracy for the best performing model). Included as part of these projects are new resources filling in a gap in Australasian languages, including gold standard POS-tagged written corpora, transcribed speech corpora, and time-aligned corpora down to the level of phonemes. These are part of efforts to accelerate the documentation of Cook Islands Māori and to increase its vitality amongst its users.

Classification of Closely Related Sub-dialects of Arabic Using Support-Vector Machines
Samantha Wray
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2015

Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription
Samantha Wray | Hamdy Mubarak | Ahmed Ali
Proceedings of the Second Workshop on Arabic Natural Language Processing