
  • Egor Bogomolov
  • Yaroslav Golubev
  • Artyom Lobanov
  • Vladimir Kovalenko
  • Timofey Bryksin

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute embeddings of subtokens in a dense space for 120,000 GitHub projects in 200 languages. We then cluster the embeddings to identify groups of semantically similar subtokens that reflect topics in source code. A dataset of 9 million GitHub projects serves as the reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in the 16 most popular programming languages, computes the cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the subtoken clusters with short descriptions so that Sosed produces interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language subtoken extractor is available separately at https://github.com/JetBrains-Research/buckwheat/.
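The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not Sosed's actual code: the function names, the toy subtoken-to-cluster mapping, the three-project search base, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration, since the abstract does not name the measure used.

```python
from collections import Counter
from math import sqrt


def cluster_distribution(subtokens, subtoken_to_cluster, n_clusters):
    """Return the share of a project's subtokens that falls into each cluster.

    Subtokens missing from the (hypothetical) mapping are ignored.
    """
    counts = Counter(
        subtoken_to_cluster[t] for t in subtokens if t in subtoken_to_cluster
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(n_clusters)]


def cosine_similarity(p, q):
    """Cosine similarity of two distributions (an assumed closeness measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(query, search_base, top_k=3):
    """Rank projects in the search base by similarity to the query distribution."""
    ranked = sorted(
        search_base.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy data, invented for this example: 3 clusters (networking, math, GUI).
TOY_CLUSTERS = {
    "http": 0, "request": 0, "socket": 0,
    "matrix": 1, "tensor": 1, "gradient": 1,
    "button": 2, "widget": 2,
}
TOY_SEARCH_BASE = {
    "web_server": [0.8, 0.1, 0.1],
    "ml_lib": [0.1, 0.8, 0.1],
    "gui_app": [0.1, 0.1, 0.8],
}

if __name__ == "__main__":
    query = cluster_distribution(
        ["http", "request", "matrix", "unknown"], TOY_CLUSTERS, 3
    )
    print(most_similar(query, TOY_SEARCH_BASE, top_k=2))
```

In this toy setup, a project whose subtokens are mostly networking-related ranks closest to the networking-heavy entry in the search base. Because each cluster carries a human-readable label, the clusters that dominate both distributions could also be reported as an explanation of the match.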

Original language: English
Title of host publication: Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1316-1320
Number of pages: 5
ISBN (electronic): 9781450367684
Publication status: Published - Sep 2020
Event: 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 - Virtual, Melbourne, Australia
Duration: 22 Sep 2020 to 25 Sep 2020

Publication series

Name: Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020

Conference

Conference: 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020
Country/Territory: Australia
City: Virtual, Melbourne
Period: 22/09/20 to 25/09/20

    Scopus subject areas

  • Artificial Intelligence
  • Software
  • Safety, Risk, Reliability and Quality

ID: 73688884