Research output: Contribution to journal › Article › peer-review
The Advantages of Human Evaluation of Sociomedical Question Answering Systems. / Фирсанова, Виктория Игоревна.
In: International Journal of Open Information Technologies, Vol. 9, No. 12, 12.2021, p. 53-59.
TY - JOUR
T1 - The Advantages of Human Evaluation of Sociomedical Question Answering Systems
AU - Фирсанова, Виктория Игоревна
PY - 2021/12
Y1 - 2021/12
N2 - The paper presents a study on question answering systems evaluation. The purpose of the study is to determine if human evaluation is indeed necessary to qualitatively measure the performance of a sociomedical dialogue system. The study is based on the data from several natural language processing experiments conducted with a question answering dataset for inclusion of people with autism spectrum disorder and state-of-the-art models with the Transformer architecture. The study describes model-centric experiments on generative and extractive question answering and data-centric experiments on dataset tuning. The purpose of both model- and data-centric approaches is to reach the highest F1-Score. Although F1-Score and Exact Match are well-known automated evaluation metrics for question answering, their reliability in measuring the performance of sociomedical systems, in which outputs should be not only consistent but also psychologically safe, is questionable. Considering this idea, the author of the paper experimented with human evaluation of a dialogue system for inclusion developed in the previous phase of the work. The result of the study is the analysis of the advantages and disadvantages of automated and human approaches to evaluate conversational artificial intelligence systems, in which the psychological safety of a user is essential.
AB - The paper presents a study on question answering systems evaluation. The purpose of the study is to determine if human evaluation is indeed necessary to qualitatively measure the performance of a sociomedical dialogue system. The study is based on the data from several natural language processing experiments conducted with a question answering dataset for inclusion of people with autism spectrum disorder and state-of-the-art models with the Transformer architecture. The study describes model-centric experiments on generative and extractive question answering and data-centric experiments on dataset tuning. The purpose of both model- and data-centric approaches is to reach the highest F1-Score. Although F1-Score and Exact Match are well-known automated evaluation metrics for question answering, their reliability in measuring the performance of sociomedical systems, in which outputs should be not only consistent but also psychologically safe, is questionable. Considering this idea, the author of the paper experimented with human evaluation of a dialogue system for inclusion developed in the previous phase of the work. The result of the study is the analysis of the advantages and disadvantages of automated and human approaches to evaluate conversational artificial intelligence systems, in which the psychological safety of a user is essential.
M3 - Article
VL - 9
SP - 53
EP - 59
JO - International Journal of Open Information Technologies
JF - International Journal of Open Information Technologies
SN - 2307-8162
IS - 12
ER -
ID: 91850122
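Note on the automated metrics named in the abstract: Exact Match and token-level F1 are the standard SQuAD-style scores for extractive question answering. The sketch below illustrates how such scores are typically computed; the normalization steps, function names, and example strings are illustrative assumptions, not the exact procedure used in the paper, and the final comment mirrors the abstract's point that lexical overlap alone cannot capture whether an answer is psychologically safe.

# Minimal sketch (assumed SQuAD-style scoring, not the paper's exact code).
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # High lexical overlap yields a high F1 even when wording differs,
    # which is exactly the gap that human evaluation is meant to cover.
    pred = "Loud noises can cause sensory overload."
    ref = "Sensory overload can be caused by loud noises."
    print(exact_match(pred, ref), round(f1_score(pred, ref), 2))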