  • Egor Bogomolov
  • Yaroslav Golubev
  • Artyom Lobanov
  • Vladimir Kovalenko
  • Timofey Bryksin

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in the 16 most popular programming languages, computes the cluster distribution, and finds the projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
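
The comparison step described in the abstract can be illustrated with a minimal sketch. This is not Sosed's implementation: the subtoken-to-cluster mapping, the project names, and the helper functions below are hypothetical, and cosine similarity is assumed as the closeness measure because the abstract does not name one.

```python
import math
from collections import Counter

# Hypothetical subtoken -> cluster mapping. In the tool described above,
# clusters come from clustering fastText subtoken embeddings.
SUBTOKEN_TO_CLUSTER = {
    "http": 0, "request": 0, "socket": 0,      # networking topic
    "tensor": 1, "gradient": 1, "layer": 1,    # machine learning topic
    "button": 2, "widget": 2, "click": 2,      # GUI topic
}
NUM_CLUSTERS = 3


def cluster_distribution(subtokens):
    """Return the normalized share of each cluster among known subtokens."""
    counts = Counter(
        SUBTOKEN_TO_CLUSTER[t] for t in subtokens if t in SUBTOKEN_TO_CLUSTER
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * NUM_CLUSTERS
    return [counts[c] / total for c in range(NUM_CLUSTERS)]


def cosine_similarity(p, q):
    """Cosine similarity of two vectors (assumed measure, see lead-in)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(query_subtokens, search_base, top_k=1):
    """Rank search-base projects by similarity of cluster distributions."""
    query = cluster_distribution(query_subtokens)
    scored = [
        (cosine_similarity(query, cluster_distribution(tokens)), name)
        for name, tokens in search_base.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]


# Toy search base with made-up projects and subtokens.
search_base = {
    "web-server": ["http", "request", "socket", "request"],
    "ml-toolkit": ["tensor", "gradient", "layer", "tensor"],
    "gui-app": ["button", "widget", "click", "http"],
}
result = most_similar(["layer", "gradient", "request"], search_base)
print(result)
```

In this toy setup, a query project dominated by machine-learning subtokens is matched to the search-base project whose cluster distribution leans the same way.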

Original language: English
Title of host publication: Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1316-1320
Number of pages: 5
ISBN (Electronic): 9781450367684
State: Published - Sep 2020
Event: 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 - Virtual, Melbourne, Australia
Duration: 22 Sep 2020 to 25 Sep 2020

Publication series

Name: Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020

Conference

Conference: 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
Country/Territory: Australia
City: Virtual, Melbourne
Period: 22/09/20 to 25/09/20

    Scopus subject areas

  • Artificial Intelligence
  • Software
  • Safety, Risk, Reliability and Quality

    Research areas

  • machine learning, similar repositories, topic modeling, word embeddings

ID: 73688884