Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Authorship attribution of source code: A language-agnostic approach and applicability in software engineering. / Bogomolov, Egor; Kovalenko, Vladimir; Rebryk, Yurii; Bacchelli, Alberto; Bryksin, Timofey.
ESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ed. / Diomidis Spinellis. Association for Computing Machinery, 2021. p. 932-944.
TY - GEN
T1 - Authorship attribution of source code: A language-agnostic approach and applicability in software engineering
T2 - 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021
AU - Bogomolov, Egor
AU - Kovalenko, Vladimir
AU - Rebryk, Yurii
AU - Bacchelli, Alberto
AU - Bryksin, Timofey
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/8/20
Y1 - 2021/8/20
AB - Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.
KW - Copyrights
KW - Machine learning
KW - Methods of data collection
KW - Security
KW - Software maintenance
KW - Software process
UR - http://www.scopus.com/inward/record.url?scp=85116219662&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/0989a8f9-3975-32a6-9fda-236b1d4f1142/
U2 - 10.1145/3468264.3468606
DO - 10.1145/3468264.3468606
M3 - Conference contribution
AN - SCOPUS:85116219662
SP - 932
EP - 944
BT - ESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
A2 - Spinellis, Diomidis
PB - Association for Computing Machinery
Y2 - 23 August 2021 through 28 August 2021
ER -