The evaluation of Large Language Models (LLMs) for code generation tasks presents unique challenges, because conventional Natural Language Processing (NLP) methods may not be applicable to assessing code. Traditional text similarity metrics may fail to capture the functional correctness of generated code. This study investigates the effectiveness of various evaluation metrics by comparing LLM-generated code with mutated versions of the original code snippets. Using state-of-the-art models and benchmarks, the generated and mutated code was evaluated with several widely used NLP metrics, including the code-oriented CodeBLEU and Ruby, and the neural network-based BERTScore and CodeBERTScore. Results demonstrate that text-oriented metrics have inferior relevance for assessing programming tasks, particularly when functional accuracy is crucial. Code-specific and neural metrics show a higher correlation with test pass rates, although their limitations highlight the need for further refinement. The findings underscore the importance of developing functionality-aware evaluation methods for LLM-driven code generation. This research offers insights into metric selection for assessing the quality of AI-generated code. © Fedrushkov D.V., Kovalchuk S.V., Aliev A.A., 2026.