Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in the 16 most popular programming languages, computes the cluster distribution, and finds projects with the closest distribution in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
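The comparison step described in the abstract can be illustrated with a minimal sketch. It is not Sosed's implementation: the subtoken-to-cluster mapping, the toy search base, the function names, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration, since the abstract does not name the measure it uses.

```python
import math
from collections import Counter

# Hypothetical subtoken -> cluster id mapping. In Sosed this role is played by
# clusters of fastText subtoken embeddings, which are not reproduced here.
SUBTOKEN_TO_CLUSTER = {
    "http": 0, "request": 0, "socket": 0,
    "matrix": 1, "tensor": 1, "gradient": 1,
    "button": 2, "widget": 2, "click": 2,
}
NUM_CLUSTERS = 3


def cluster_distribution(subtokens):
    """Return the normalized histogram of cluster ids for a list of subtokens."""
    counts = Counter(
        SUBTOKEN_TO_CLUSTER[t] for t in subtokens if t in SUBTOKEN_TO_CLUSTER
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * NUM_CLUSTERS
    return [counts[c] / total for c in range(NUM_CLUSTERS)]


def cosine_similarity(p, q):
    """Cosine similarity of two equal-length vectors (assumed closeness measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(query_subtokens, search_base, top_k=1):
    """Rank projects in the search base by similarity to the query project."""
    query = cluster_distribution(query_subtokens)
    scored = [
        (name, cosine_similarity(query, cluster_distribution(tokens)))
        for name, tokens in search_base.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy search base: project name -> subtokens extracted from its source code.
search_base = {
    "web-server": ["http", "request", "socket", "request"],
    "ml-lib": ["matrix", "tensor", "gradient", "matrix"],
    "gui-kit": ["button", "widget", "click", "http"],
}

result = most_similar(["tensor", "gradient", "http"], search_base, top_k=2)
print(result)
```

In this toy setup the query project is dominated by the second cluster, so the project whose subtokens fall mostly in that cluster ranks first. A divergence between distributions could be swapped in for cosine similarity without changing the rest of the sketch.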
Original language | English |
---|---|
Title of host publication | Proceedings - 2020 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1316-1320 |
Number of pages | 5 |
ISBN (Electronic) | 9781450367684 |
State | Published - Sep 2020 |
Event | 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 - Virtual, Melbourne, Australia. Duration: 22 Sep 2020 → 25 Sep 2020 |
Conference | 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020 |
---|---|
Country/Territory | Australia |
City | Virtual, Melbourne |
Period | 22/09/20 → 25/09/20 |