Publication : Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers

Article

Journal :

International Journal of Computers and Applications

ISSN : 1206-212X

Publisher :

Taylor & Francis

Information

Date : May 2023

Volume : 45 Issue : 5

Pages : 391-402

Details

Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers

Mohamed Raouf Kanfoud • Abdelkrim Bouramoul

DOI : 10.1080/1206212X.2023.2218236

The Web has become one of the most important data sources, and the content shared is most often multilingual, as users belong to different cultures and speak different languages. A multilingual document is not suitable for readers who need content in only one language. Furthermore, dividing a multilingual document into monolingual documents lets researchers extract only the text of the desired language for tasks such as model training or testing. Cleaning and dividing raw content manually, however, is challenging. This paper presents an automatic approach to dividing a multilingual document and reassembling it into monolingual documents by examining three existing state-of-the-art tools for Language Identification (LI). We prepared corpora with different heterogeneity characteristics for the evaluation and measured their code-switching patterns using three code-switching metrics. The proposed approach reached a best accuracy of 99% for long segments (long text) and 90% for mixed segments. In addition, a strong negative correlation was found between the I-Index and accuracy, with Pearson’s r = −0.998.
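As an illustration only, the divide-and-reassemble idea in the abstract can be sketched in a few lines of Python. Everything here is an assumption rather than the authors' method: `identify_language` is a hypothetical stopword lookup standing in for a pre-trained language identifier, segmentation is a fixed sentence split rather than the dynamic window named in the keywords, and `i_index` uses one common formulation of a switching rate (the fraction of adjacent units whose language labels differ), applied here to segment labels.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a pre-trained language identifier:
# a tiny stopword lookup covering English and French only.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in", "a"},
    "fr": {"le", "la", "est", "et", "les", "des", "un", "une"},
}


def identify_language(segment: str) -> str:
    """Return the language whose stopwords occur most often, or 'und'."""
    tokens = re.findall(r"\w+", segment.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "und"


def split_into_segments(document: str) -> list[str]:
    """Split after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def to_monolingual(document: str) -> dict[str, str]:
    """Group segments by predicted language and rejoin each group."""
    grouped = defaultdict(list)
    for segment in split_into_segments(document):
        grouped[identify_language(segment)].append(segment)
    return {lang: " ".join(segments) for lang, segments in grouped.items()}


def i_index(labels: list[str]) -> float:
    """Fraction of adjacent label pairs whose languages differ."""
    if len(labels) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)


if __name__ == "__main__":
    text = "The cat is in the garden. Le chat est dans le jardin. The dog is happy."
    labels = [identify_language(s) for s in split_into_segments(text)]
    print(to_monolingual(text))
    print(labels, i_index(labels))
```

In practice the stopword lookup would be replaced by one of the language identification tools the paper examines; the abstract does not name them, so none is assumed here.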

Keywords :

Multilingual documents, Monolingual documents, Code-switching, Language identification, Dynamic window-based

Citation key :

misc-lab-418

DOI :

10.1080/1206212X.2023.2218236

Link :

https://www.tandfonline.com/doi/full/10.1080/1206212X.2023.2218236

Full text

ACM :

M. R. Kanfoud and A. Bouramoul. 2023. Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers. International Journal of Computers and Applications, 45, 5 (May 2023), Taylor & Francis, 391-402. DOI: https://doi.org/10.1080/1206212X.2023.2218236.

APA :

Kanfoud, M. R., & Bouramoul, A. (2023, May). Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers. International Journal of Computers and Applications, 45(5), Taylor & Francis, 391-402. DOI: https://doi.org/10.1080/1206212X.2023.2218236

IEEE :

M. R. Kanfoud and A. Bouramoul, "Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers," International Journal of Computers and Applications, vol. 45, no. 5, Taylor & Francis, pp. 391-402, May 2023. DOI: https://doi.org/10.1080/1206212X.2023.2218236.

BibTeX :

@article{misc-lab-418,
author = {Kanfoud, Mohamed Raouf and Bouramoul, Abdelkrim},
title = {Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers},
journal = {International Journal of Computers and Applications},
volume = {45},
number = {5},
issn = {1206-212X},
pages = {391--402},
publisher = {Taylor \& Francis},
year = {2023},
month = {May},
doi = {10.1080/1206212X.2023.2218236},
url = {https://www.tandfonline.com/doi/full/10.1080/1206212X.2023.2218236},
keywords = {Multilingual documents, monolingual documents, code-switching, language identification, dynamic window-based}
}

RIS :

TY  - JOUR
TI  - Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers
AU  - Kanfoud, M. R.
AU  - Bouramoul, A.
PY  - 2023
SN  - 1206-212X
JO  - International Journal of Computers and Applications
VL  - 45
IS  - 5
SP  - 391
EP  - 402
PB  - Taylor & Francis
AB  - The Web has become one of the most important data sources, and the content shared is most often multilingual, as users belong to different cultures and speak different languages. Multilingual content (document) is not suitable for many people who only need content in one language. Furthermore, dividing a multilingual document into monolingual documents helps researchers extract only the text of the desired language to use in different tasks such as training or model testing. Therefore, it is challenging to clean and divide the raw content manually. This paper presents an automatic approach to dividing a multilingual document and reassembling it into monolingual documents by examining three existing state-of-the-art tools for Language Identification (LI). We prepared different corpora with different heterogeneity characteristics for the evaluation and evaluated their code-switching pattern using three different code-switching metrics. The proposed approach reached 99% as the best accuracy result for the long segment (long text) and 90% for the mixed segment. In addition, a good correlation was found between the I-Index and accuracy with Pearson’s r = −0.998.
KW  - Multilingual documents
KW  - monolingual documents
KW  - code-switching
KW  - language identification
KW  - dynamic window-based
DO  - 10.1080/1206212X.2023.2218236
UR  - https://www.tandfonline.com/doi/full/10.1080/1206212X.2023.2218236
ID  - misc-lab-418
ER  -