SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction

Syed Mostofa Monsur, Shariar Kabir, Sakib Chowdhury


Abstract
End-to-end Document Key Information Extraction models require substantial compute and labeled data to perform well on real datasets. This is particularly challenging for low-resource languages like Bangla, where domain-specific multimodal document datasets are scarce. In this paper, we introduce SynthNID, a system that generates domain-specific document image data for training OCR-less end-to-end Key Information Extraction systems. We show that the generated data improves the performance of the extraction model on real datasets and that the system can be easily extended to generate other types of scanned documents for a wide range of document understanding tasks. The code for generating synthetic data is available at https://github.com/dv66/synthnid.
Anthology ID:
2023.banglalp-1.13
Volume:
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
Month:
December
Year:
2023
Address:
Singapore
Editors:
Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Farig Sadeque, Ruhul Amin
Venue:
BanglaLP
Publisher:
Association for Computational Linguistics
Pages:
117–123
URL:
https://aclanthology.org/2023.banglalp-1.13
DOI:
10.18653/v1/2023.banglalp-1.13
Cite (ACL):
Syed Mostofa Monsur, Shariar Kabir, and Sakib Chowdhury. 2023. SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), pages 117–123, Singapore. Association for Computational Linguistics.
Cite (Informal):
SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction (Monsur et al., BanglaLP 2023)
PDF:
https://aclanthology.org/2023.banglalp-1.13.pdf
Video:
https://aclanthology.org/2023.banglalp-1.13.mp4