Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Sosed: A tool for finding similar software projects. / Bogomolov, Egor; Golubev, Yaroslav; Lobanov, Artyom; Kovalenko, Vladimir; Bryksin, Timofey.
Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020. Institute of Electrical and Electronics Engineers Inc., 2020. p. 1316-1320 9286041 (Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020).
TY - GEN
T1 - Sosed: A tool for finding similar software projects
AU - Bogomolov, Egor
AU - Golubev, Yaroslav
AU - Lobanov, Artyom
AU - Kovalenko, Vladimir
AU - Bryksin, Timofey
N1 - Publisher Copyright: © 2020 ACM. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2020/9
Y1 - 2020/9
N2 - In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in 16 most popular programming languages, computes cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
AB - In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in 16 most popular programming languages, computes cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
KW - machine learning
KW - similar repositories
KW - topic modeling
KW - word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85099188230&partnerID=8YFLogxK
U2 - 10.1145/3324884.3415291
DO - 10.1145/3324884.3415291
M3 - Conference contribution
AN - SCOPUS:85099188230
T3 - Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
SP - 1316
EP - 1320
BT - Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
Y2 - 22 September 2020 through 25 September 2020
ER -
ID: 73688884