DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION › Research at St Petersburg University

Standard

DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION. / Kanteev, L.D.; Kostyukov, Yu.O.; Luciv, D.V.; Koznov, D.V.; Smirnov, M.N.

In: Труды института системного программирования РАН, Vol. 29, No. 4, 2017, pp. 303-314.

Research output: Scientific publications in periodicals › Article › Peer-reviewed

Harvard

Kanteev, LD, Kostyukov, YO, Luciv, DV, Koznov, DV & Smirnov, MN 2017, 'DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION', Труды института системного программирования РАН, vol. 29, no. 4, pp. 303-314.

APA

Kanteev, L. D., Kostyukov, Y. O., Luciv, D. V., Koznov, D. V., & Smirnov, M. N. (2017). DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION. Труды института системного программирования РАН, 29(4), 303-314.

Vancouver

Kanteev LD, Kostyukov YO, Luciv DV, Koznov DV, Smirnov MN. DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION. Труды института системного программирования РАН. 2017;29(4):303-314.

Author

Kanteev, L.D.; Kostyukov, Yu.O.; Luciv, D.V.; Koznov, D.V.; Smirnov, M.N. / DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION. In: Труды института системного программирования РАН. 2017; Vol. 29, No. 4. pp. 303-314.

BibTeX

@article{c16dcf5b82b8447a85cad3d77e76e19b,

title = "DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION",

abstract = "Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by the fact that duplicate information is frequently near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches to dealing with duplicates in software documentation, but most of them use software clone detection techniques, which makes efficient near duplicate detection difficult: source code algorithms ignore document structure and produce a lot of false positives. In this paper, we present an algorithm that detects near duplicates in software documentation using a natural language processing technique called the N-gram model. The algorithm has a considerable limitation: it only detects single sentences as near duplicates. However, it is very simple and may be easily improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows 39% false positives among the automatically detected duplicates. The algorithm demonstrates reasonable performance: documents of 0.8-3 MB are processed in 5-22 min.",

author = "L.D. Kanteev and Yu.O. Kostyukov and D.V. Luciv and D.V. Koznov and M.N. Smirnov",

year = "2017",

language = "English",

volume = "29",

pages = "303--314",

journal = "Труды института системного программирования РАН",

issn = "2079-8156",

publisher = "Институт системного программирования им. В.П.Иванникова РАН",

number = "4",

}

RIS

TY - JOUR

T1 - DISCOVERING NEAR DUPLICATE TEXT IN SOFTWARE DOCUMENTATION

AU - Kanteev, L.D.

AU - Kostyukov, Yu.O.

AU - Luciv, D.V.

AU - Koznov, D.V.

AU - Smirnov, M.N.

PY - 2017

Y1 - 2017

N2 - Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by the fact that duplicate information is frequently near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches to dealing with duplicates in software documentation, but most of them use software clone detection techniques, which makes efficient near duplicate detection difficult: source code algorithms ignore document structure and produce a lot of false positives. In this paper, we present an algorithm that detects near duplicates in software documentation using a natural language processing technique called the N-gram model. The algorithm has a considerable limitation: it only detects single sentences as near duplicates. However, it is very simple and may be easily improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows 39% false positives among the automatically detected duplicates. The algorithm demonstrates reasonable performance: documents of 0.8-3 MB are processed in 5-22 min.

AB - Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by the fact that duplicate information is frequently near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches to dealing with duplicates in software documentation, but most of them use software clone detection techniques, which makes efficient near duplicate detection difficult: source code algorithms ignore document structure and produce a lot of false positives. In this paper, we present an algorithm that detects near duplicates in software documentation using a natural language processing technique called the N-gram model. The algorithm has a considerable limitation: it only detects single sentences as near duplicates. However, it is very simple and may be easily improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows 39% false positives among the automatically detected duplicates. The algorithm demonstrates reasonable performance: documents of 0.8-3 MB are processed in 5-22 min.

UR - https://elibrary.ru/item.asp?id=29968661

M3 - Article

VL - 29

SP - 303

EP - 314

JO - Труды института системного программирования РАН

JF - Труды института системного программирования РАН

SN - 2079-8156

IS - 4

ER -

ID: 35261220