Standard

Natural language processing metrics efficiency for evaluating a generated code: facing the challenge. / Fedrushkov, D.V.; Kovalchuk, S.V.; Aliev, A.A.

In: Scientific and Technical Journal of Information Technologies, Mechanics and Optics, Vol. 26, No. 1, 25.02.2026, p. 135-144.

Research output: Contribution to journal › Article › peer-review

Harvard

Fedrushkov, DV, Kovalchuk, SV & Aliev, AA 2026, 'Natural language processing metrics efficiency for evaluating a generated code: facing the challenge', Scientific and Technical Journal of Information Technologies, Mechanics and Optics, vol. 26, no. 1, pp. 135-144. https://doi.org/10.17586/2226-1494-2026-26-1-135-144

APA

Fedrushkov, D. V., Kovalchuk, S. V., & Aliev, A. A. (2026). Natural language processing metrics efficiency for evaluating a generated code: facing the challenge. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 26(1), 135-144. https://doi.org/10.17586/2226-1494-2026-26-1-135-144

Vancouver

Fedrushkov DV, Kovalchuk SV, Aliev AA. Natural language processing metrics efficiency for evaluating a generated code: facing the challenge. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2026 Feb 25;26(1):135-144. https://doi.org/10.17586/2226-1494-2026-26-1-135-144

Author

Fedrushkov, D.V. ; Kovalchuk, S.V. ; Aliev, A.A. / Natural language processing metrics efficiency for evaluating a generated code: facing the challenge. In: Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2026 ; Vol. 26, No. 1. pp. 135-144.

BibTeX

@article{0dcab828e00e488286c34fc6b9d07447,
title = "Natural language processing metrics efficiency for evaluating a generated code: facing the challenge",
abstract = "The evaluation of Large Language Models (LLMs) for code generation tasks presents unique challenges, because conventional Natural Language Processing (NLP) methods might be not applicable for assessing the code. Traditional text similarity metrics may fail to capture the functional correctness of generated code. This study investigates the effectiveness of various evaluation metrics by comparing a LLM-generated code with the mutated versions of the original code snippets. Using state-of-the-art models and benchmarks, the generated and the mutated codes were evaluated using some widely used NLP metrics, including code-oriented CodeBLEU and Ruby, and the neural network-based BERTScore and CodeBERTScore. Results demonstrated that text-oriented metrics tend to have inferior relevance in assessing programming tasks, particularly when functional accuracy is crucial. Code-specific and neural metrics show higher correlation with test pass rates, although their limitations highlight the need for a further refinement. The findings underscore the importance of developing functionality-aware evaluation methods for LLM-driven code generation. This research suggests insights into metrics selection to assess the quality of AI-generated code. {\textcopyright} Fedrushkov D.V., Kovalchuk S.V., Aliev A.A., 2026.",
keywords = "code generation, evaluation, large language models, metrics, natural language processing, source code mutations",
author = "D.V. Fedrushkov and S.V. Kovalchuk and A.A. Aliev",
note = "Export Date: 16 March 2026; Cited By: 0; Correspondence Address: S.V. Kovalchuk; ITMO University, Saint Petersburg, 197101, Russian Federation; email: kovalchuk@itmo.ru",
year = "2026",
month = feb,
day = "25",
doi = "10.17586/2226-1494-2026-26-1-135-144",
language = "English",
volume = "26",
pages = "135--144",
journal = "Scientific and Technical Journal of Information Technologies, Mechanics and Optics",
issn = "2226-1494",
publisher = "ITMO University",
number = "1",

}

RIS

TY - JOUR

T1 - Natural language processing metrics efficiency for evaluating a generated code: facing the challenge

AU - Fedrushkov, D.V.

AU - Kovalchuk, S.V.

AU - Aliev, A.A.

N1 - Export Date: 16 March 2026; Cited By: 0; Correspondence Address: S.V. Kovalchuk; ITMO University, Saint Petersburg, 197101, Russian Federation; email: kovalchuk@itmo.ru

PY - 2026/2/25

Y1 - 2026/2/25

N2 - The evaluation of Large Language Models (LLMs) for code generation tasks presents unique challenges, because conventional Natural Language Processing (NLP) methods might be not applicable for assessing the code. Traditional text similarity metrics may fail to capture the functional correctness of generated code. This study investigates the effectiveness of various evaluation metrics by comparing a LLM-generated code with the mutated versions of the original code snippets. Using state-of-the-art models and benchmarks, the generated and the mutated codes were evaluated using some widely used NLP metrics, including code-oriented CodeBLEU and Ruby, and the neural network-based BERTScore and CodeBERTScore. Results demonstrated that text-oriented metrics tend to have inferior relevance in assessing programming tasks, particularly when functional accuracy is crucial. Code-specific and neural metrics show higher correlation with test pass rates, although their limitations highlight the need for a further refinement. The findings underscore the importance of developing functionality-aware evaluation methods for LLM-driven code generation. This research suggests insights into metrics selection to assess the quality of AI-generated code. © Fedrushkov D.V., Kovalchuk S.V., Aliev A.A., 2026.

AB - The evaluation of Large Language Models (LLMs) for code generation tasks presents unique challenges, because conventional Natural Language Processing (NLP) methods might be not applicable for assessing the code. Traditional text similarity metrics may fail to capture the functional correctness of generated code. This study investigates the effectiveness of various evaluation metrics by comparing a LLM-generated code with the mutated versions of the original code snippets. Using state-of-the-art models and benchmarks, the generated and the mutated codes were evaluated using some widely used NLP metrics, including code-oriented CodeBLEU and Ruby, and the neural network-based BERTScore and CodeBERTScore. Results demonstrated that text-oriented metrics tend to have inferior relevance in assessing programming tasks, particularly when functional accuracy is crucial. Code-specific and neural metrics show higher correlation with test pass rates, although their limitations highlight the need for a further refinement. The findings underscore the importance of developing functionality-aware evaluation methods for LLM-driven code generation. This research suggests insights into metrics selection to assess the quality of AI-generated code. © Fedrushkov D.V., Kovalchuk S.V., Aliev A.A., 2026.

KW - code generation

KW - evaluation

KW - large language models

KW - metrics

KW - natural language processing

KW - source code mutations

UR - https://www.mendeley.com/catalogue/d64d704a-73ce-312b-8d6a-499cda9a53f2/

U2 - 10.17586/2226-1494-2026-26-1-135-144

DO - 10.17586/2226-1494-2026-26-1-135-144

M3 - Article

VL - 26

SP - 135

EP - 144

JO - Scientific and Technical Journal of Information Technologies, Mechanics and Optics

JF - Scientific and Technical Journal of Information Technologies, Mechanics and Optics

SN - 2226-1494

IS - 1

ER -

ID: 150628680