Standard

Sosed: A tool for finding similar software projects. / Bogomolov, Egor; Golubev, Yaroslav; Lobanov, Artyom; Kovalenko, Vladimir; Bryksin, Timofey.

Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020. Institute of Electrical and Electronics Engineers Inc., 2020. p. 1316-1320 9286041 (Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020).

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Research > peer-review

Harvard

Bogomolov, E, Golubev, Y, Lobanov, A, Kovalenko, V & Bryksin, T 2020, Sosed: A tool for finding similar software projects. in Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, 9286041, Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Institute of Electrical and Electronics Engineers Inc., pp. 1316-1320, 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Virtual, Melbourne, Australia, 22/09/20. https://doi.org/10.1145/3324884.3415291

APA

Bogomolov, E., Golubev, Y., Lobanov, A., Kovalenko, V., & Bryksin, T. (2020). Sosed: A tool for finding similar software projects. In Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 (pp. 1316-1320). [9286041] (Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3324884.3415291

Vancouver

Bogomolov E, Golubev Y, Lobanov A, Kovalenko V, Bryksin T. Sosed: A tool for finding similar software projects. In Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020. Institute of Electrical and Electronics Engineers Inc. 2020. p. 1316-1320. 9286041. (Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020). https://doi.org/10.1145/3324884.3415291

Author

Bogomolov, Egor ; Golubev, Yaroslav ; Lobanov, Artyom ; Kovalenko, Vladimir ; Bryksin, Timofey. / Sosed: A tool for finding similar software projects. Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020. Institute of Electrical and Electronics Engineers Inc., 2020. pp. 1316-1320 (Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020).

BibTeX

@inproceedings{5ab6e14305de4932890d9db75d74668e,
title = "Sosed: A tool for finding similar software projects",
abstract = "In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in 16 most popular programming languages, computes cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.",
keywords = "machine learning, similar repositories, topic modeling, word embeddings",
author = "Egor Bogomolov and Yaroslav Golubev and Artyom Lobanov and Vladimir Kovalenko and Timofey Bryksin",
note = "Publisher Copyright: {\textcopyright} 2020 ACM. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.; 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 ; Conference date: 22-09-2020 Through 25-09-2020",
year = "2020",
month = sep,
doi = "10.1145/3324884.3415291",
language = "English",
series = "Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1316--1320",
booktitle = "Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020",
address = "United States",

}

RIS

TY - GEN

T1 - Sosed: A tool for finding similar software projects

AU - Bogomolov, Egor

AU - Golubev, Yaroslav

AU - Lobanov, Artyom

AU - Kovalenko, Vladimir

AU - Bryksin, Timofey

N1 - Publisher Copyright: © 2020 ACM. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.

PY - 2020/9

Y1 - 2020/9

N2 - In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in 16 most popular programming languages, computes cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.

AB - In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in 16 most popular programming languages, computes cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.

KW - machine learning

KW - similar repositories

KW - topic modeling

KW - word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85099188230&partnerID=8YFLogxK

U2 - 10.1145/3324884.3415291

DO - 10.1145/3324884.3415291

M3 - Conference contribution

AN - SCOPUS:85099188230

T3 - Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020

SP - 1316

EP - 1320

BT - Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020

Y2 - 22 September 2020 through 25 September 2020

ER -

ID: 73688884