Research output: Contribution to journal › Article › peer-review
Natural language processing metrics efficiency for evaluating a generated code: facing the challenge. / Fedrushkov, D.V.; Kovalchuk, S.V.; Aliev, A.A.
In: Scientific and Technical Journal of Information Technologies, Mechanics and Optics, Vol. 26, No. 1, 25.02.2026, p. 135-144.
TY - JOUR
T1 - Natural language processing metrics efficiency for evaluating a generated code: facing the challenge
AU - Fedrushkov, D.V.
AU - Kovalchuk, S.V.
AU - Aliev, A.A.
N1 - Export Date: 16 March 2026; Cited By: 0; Correspondence Address: S.V. Kovalchuk; ITMO University, Saint Petersburg, 197101, Russian Federation; email: kovalchuk@itmo.ru
PY - 2026/2/25
Y1 - 2026/2/25
AB - The evaluation of Large Language Models (LLMs) for code generation tasks presents unique challenges, because conventional Natural Language Processing (NLP) methods might not be applicable to assessing code. Traditional text similarity metrics may fail to capture the functional correctness of generated code. This study investigates the effectiveness of various evaluation metrics by comparing LLM-generated code with mutated versions of the original code snippets. Using state-of-the-art models and benchmarks, the generated and mutated code was evaluated with several widely used NLP metrics, including the code-oriented CodeBLEU and Ruby and the neural network-based BERTScore and CodeBERTScore. The results show that text-oriented metrics tend to be less relevant for assessing programming tasks, particularly when functional accuracy is crucial. Code-specific and neural metrics show higher correlation with test pass rates, although their limitations highlight the need for further refinement. The findings underscore the importance of developing functionality-aware evaluation methods for LLM-driven code generation. This research offers insights into selecting metrics to assess the quality of AI-generated code. © Fedrushkov D.V., Kovalchuk S.V., Aliev A.A., 2026.
KW - code generation
KW - evaluation
KW - large language models
KW - metrics
KW - natural language processing
KW - source code mutations
UR - https://www.mendeley.com/catalogue/d64d704a-73ce-312b-8d6a-499cda9a53f2/
U2 - 10.17586/2226-1494-2026-26-1-135-144
DO - 10.17586/2226-1494-2026-26-1-135-144
M3 - Article
VL - 26
SP - 135
EP - 144
JO - Scientific and Technical Journal of Information Technologies, Mechanics and Optics
JF - Scientific and Technical Journal of Information Technologies, Mechanics and Optics
SN - 2226-1494
IS - 1
ER -
ID: 150628680