DOI

  • Egor Bogomolov
  • Vladimir Kovalenko
  • Yurii Rebryk
  • Alberto Bacchelli
  • Timofey Bryksin

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.

Original languageEnglish
Title of host publicationESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Subtitle of host publicationProceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
EditorsDiomidis Spinellis
PublisherAssociation for Computing Machinery
Pages932-944
Number of pages13
ISBN (Electronic)9781450385626
DOIs
StatePublished - 20 Aug 2021
Event29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021 - Virtual, Online, Greece
Duration: 23 Aug 202128 Aug 2021

Conference

Conference29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021
Country/TerritoryGreece
CityVirtual, Online
Period23/08/2128/08/21

    Scopus subject areas

  • Artificial Intelligence
  • Software

    Research areas

  • Copyrights, Machine learning, Methods of data collection, Security, Software maintenance, Software process

ID: 87612481