The evaluation of Large Language Models (LLMs) for code generation tasks presents unique challenges, because conventional Natural Language Processing (NLP) methods may not be applicable to assessing code. Traditional text similarity metrics may fail to capture the functional correctness of generated code. This study investigates the effectiveness of various evaluation metrics by comparing LLM-generated code with mutated versions of the original code snippets. Using state-of-the-art models and benchmarks, the generated and mutated code was evaluated with several widely used NLP metrics, including the code-oriented CodeBLEU and Ruby, and the neural network-based BERTScore and CodeBERTScore. Results demonstrate that text-oriented metrics have inferior relevance for assessing programming tasks, particularly when functional accuracy is crucial. Code-specific and neural metrics show a higher correlation with test pass rates, although their limitations highlight the need for further refinement. The findings underscore the importance of developing functionality-aware evaluation methods for LLM-driven code generation. This research offers insights into metric selection for assessing the quality of AI-generated code. © Fedrushkov D.V., Kovalchuk S.V., Aliev A.A., 2026.