Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format

Julian Risch; Philipp Schmidt; Ralf Krestel

doi:10.18653/v1/2021.woah-1.17

Abstract

With the rise of research on toxic comment classification, more and more annotated datasets have been released. The wide variety of the task (different languages, different labeling processes and schemes) has led to a large number of heterogeneous datasets that can be used for training and testing in very specific settings. Despite recent efforts to create web pages that provide an overview, most publications still use only a single dataset. The datasets are not stored in one central database, they come in many different data formats, and it is difficult to interpret their class labels and to reuse these labels in other projects. To overcome these issues, we present a collection of more than thirty datasets in the form of a software tool that automates downloading and processing of the data and presents them in a unified data format that also offers a mapping of compatible class labels. Another advantage of the tool is that it gives an overview of the properties of available datasets, such as different languages, platforms, and class labels, to make it easier to select suitable training and test data.

Anthology ID: 2021.woah-1.17
Volume: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Month: August
Year: 2021
Address: Online
Editors: Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, Zeerak Waseem
Venue: WOAH
Publisher: Association for Computational Linguistics
Pages: 157–163
URL: https://aclanthology.org/2021.woah-1.17
DOI: 10.18653/v1/2021.woah-1.17
Cite (ACL): Julian Risch, Philipp Schmidt, and Ralf Krestel. 2021. Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 157–163, Online. Association for Computational Linguistics.
Cite (Informal): Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format (Risch et al., WOAH 2021)
PDF: https://aclanthology.org/2021.woah-1.17.pdf
Video: https://aclanthology.org/2021.woah-1.17.mp4
Code: julian-risch/toxic-comment-collection
Data: Hate Speech
