Standard

Initial step of specialized corpora building: cleaning procedures. / Yakubson, Vera; Zakharov, Victor.

Nordsci 2020. 2020. pp. 152-162 (Nordsci Conference on Social Sciences; Vol. 3, No. 1).

Research output: Publications in books, reports, collections, conference proceedings; article in conference proceedings; research; peer-reviewed

Harvard

Yakubson, V & Zakharov, V 2020, Initial step of specialized corpora building: cleaning procedures. in Nordsci 2020. Nordsci Conference on Social Sciences, no. 1, vol. 3, pp. 152-162, Nordsci Conference on Social Sciences, Sofia, Bulgaria, 27/08/20.

APA

Yakubson, V., & Zakharov, V. (2020). Initial step of specialized corpora building: cleaning procedures. In Nordsci 2020 (pp. 152-162). (Nordsci Conference on Social Sciences; Vol. 3, No. 1).

Vancouver

Yakubson V, Zakharov V. Initial step of specialized corpora building: cleaning procedures. In Nordsci 2020. 2020. p. 152-162. (Nordsci Conference on Social Sciences; 1).

Author

Yakubson, Vera; Zakharov, Victor. / Initial step of specialized corpora building: cleaning procedures. Nordsci 2020. 2020. pp. 152-162 (Nordsci Conference on Social Sciences; 1).

BibTeX

@inproceedings{68d415d1996348d99cbdc5ccece1d8a0,
title = "Initial step of specialized corpora building: cleaning procedures",
abstract = "This paper deals with the specialized corpora building, specifically academic language corpus in the biotechnology field. Being a part of larger research devoted to creation and usage of specialized parallel corpus, this piece aims to analyze the initial step of corpus building. Our main research question was what procedures we need to implement to the texts before using them to develop the corpus. Analysis of previous research showed the significant quantity of papers devoted to corpora creation, including academic specialized corpora. Different sides of the process were analyzed in these researches, including the types of texts used, the principles of crawling, the recommended length of texts etc. As to the text processing for the needs of corpora creation, only the linguistic annotation issues were examined earlier. At the same time, the preliminary cleaning of texts before their usage in corpora may have significant influence on the corpus quality and its utility for the linguistic research. In this paper, we considered three small corpora derived from the same set of academic texts in the biotechnology field: “raw” corpus without any preliminary cleaning and two corpora with different level of cleaning. Using different Sketch Engine tools, we analyzed these corpora from the position of their future users, predominantly as sources for academic wordlists and specialized multi-word units. The conducted research showed very little difference between two cleaned corpora, meaning that only basic cleaning procedures such as removal of reference lists can be useful in corpora design. At the same time, we found a significant difference between raw and cleaned corpora and argue that this difference can affect the quality of wordlists and multi-word terms extraction, therefore these cleaning procedures are meaningful. 
The main limitation of the study is that all texts were taken from the unique source, so the conclusions could be affected by this specific journal{\textquoteright}s peculiarities. Therefore, the future work should be the verification of results on different text collections.",
author = "Vera Yakubson and Victor Zakharov",
year = "2020",
language = "English",
series = "Nordsci Conference on Social Sciences",
number = "1",
pages = "152--162",
booktitle = "Nordsci 2020",
note = "Nordsci Conference on Social Sciences ; Conference date: 27-08-2020 Through 28-08-2020",

}

RIS

TY - GEN

T1 - Initial step of specialized corpora building: cleaning procedures

AU - Yakubson, Vera

AU - Zakharov, Victor

PY - 2020

Y1 - 2020

AB - This paper deals with the specialized corpora building, specifically academic language corpus in the biotechnology field. Being a part of larger research devoted to creation and usage of specialized parallel corpus, this piece aims to analyze the initial step of corpus building. Our main research question was what procedures we need to implement to the texts before using them to develop the corpus. Analysis of previous research showed the significant quantity of papers devoted to corpora creation, including academic specialized corpora. Different sides of the process were analyzed in these researches, including the types of texts used, the principles of crawling, the recommended length of texts etc. As to the text processing for the needs of corpora creation, only the linguistic annotation issues were examined earlier. At the same time, the preliminary cleaning of texts before their usage in corpora may have significant influence on the corpus quality and its utility for the linguistic research. In this paper, we considered three small corpora derived from the same set of academic texts in the biotechnology field: “raw” corpus without any preliminary cleaning and two corpora with different level of cleaning. Using different Sketch Engine tools, we analyzed these corpora from the position of their future users, predominantly as sources for academic wordlists and specialized multi-word units. The conducted research showed very little difference between two cleaned corpora, meaning that only basic cleaning procedures such as removal of reference lists can be useful in corpora design. At the same time, we found a significant difference between raw and cleaned corpora and argue that this difference can affect the quality of wordlists and multi-word terms extraction, therefore these cleaning procedures are meaningful. 
The main limitation of the study is that all texts were taken from the unique source, so the conclusions could be affected by this specific journal’s peculiarities. Therefore, the future work should be the verification of results on different text collections.

UR - https://www.sciencegate.app/app/document/download/10.32008/nordsci2020/b1/v3/16

M3 - Conference contribution

T3 - Nordsci Conference on Social Sciences

SP - 152

EP - 162

BT - Nordsci 2020

T2 - Nordsci Conference on Social Sciences

Y2 - 27 August 2020 through 28 August 2020

ER -

ID: 92087573