Research output: Contribution to journal › Article › peer-review
Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri. / Bakaev, Maxim; Gorovaia, Svetlana; Mitrofanova, Olga.
In: Big Data and Cognitive Computing, Vol. 9, No. 2, 46, 18.02.2025.
TY - JOUR
T1 - Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri
AU - Bakaev, Maxim
AU - Gorovaia, Svetlana
AU - Mitrofanova, Olga
PY - 2025/2/18
Y1 - 2025/2/18
N2 - Previously, it was suggested that the “persona-driven” approach can contribute to producing sufficiently diverse synthetic training data for Large Language Models (LLMs), which are about to run out of real natural-language texts. In our paper, we explore whether personas evoked from LLMs through HCI-style descriptions could indeed imitate human-like differences in authorship. To this end, we ran an associative experiment with 50 human participants and four artificial personas evoked from two popular LLM-based services: GPT-4(o) and YandexGPT Pro. For each of the five stimulus words selected from university websites’ homepages, we asked both groups of subjects to come up with 10 short associations (in Russian). We then used cosine similarity and Mahalanobis distance to measure the distance between the association lists produced by different humans and personas. While the similarity differed significantly between individual human associators and between gender and age groups, neither was the case for the different personas evoked from ChatGPT or YandexGPT. Our findings suggest that the LLM-based services so far fall short of imitating the associative thesauri of different authors. The outcome of our study might be of interest to computational linguists, as well as AI/ML scientists and prompt engineers.
AB - Previously, it was suggested that the “persona-driven” approach can contribute to producing sufficiently diverse synthetic training data for Large Language Models (LLMs), which are about to run out of real natural-language texts. In our paper, we explore whether personas evoked from LLMs through HCI-style descriptions could indeed imitate human-like differences in authorship. To this end, we ran an associative experiment with 50 human participants and four artificial personas evoked from two popular LLM-based services: GPT-4(o) and YandexGPT Pro. For each of the five stimulus words selected from university websites’ homepages, we asked both groups of subjects to come up with 10 short associations (in Russian). We then used cosine similarity and Mahalanobis distance to measure the distance between the association lists produced by different humans and personas. While the similarity differed significantly between individual human associators and between gender and age groups, neither was the case for the different personas evoked from ChatGPT or YandexGPT. Our findings suggest that the LLM-based services so far fall short of imitating the associative thesauri of different authors. The outcome of our study might be of interest to computational linguists, as well as AI/ML scientists and prompt engineers.
KW - ChatGPT
KW - YandexGPT
KW - data augmentation
KW - language models
KW - machine learning
KW - semantic similarity
KW - text authorship
UR - https://www.mendeley.com/catalogue/a9822ff9-da6b-3d0f-9fb8-cbc7ebfa1740/
U2 - 10.3390/bdcc9020046
DO - 10.3390/bdcc9020046
M3 - Article
VL - 9
JO - Big Data and Cognitive Computing
JF - Big Data and Cognitive Computing
SN - 2504-2289
IS - 2
M1 - 46
ER -
ID: 132344285
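
The abstract above mentions comparing association lists with cosine similarity and Mahalanobis distance. The Python sketch below only illustrates those two measures under assumptions not stated in the record: association lists are represented as simple bag-of-words count vectors over a shared vocabulary, the word lists are invented, and the covariance matrix is regularized to stay invertible. It is not the authors' actual pipeline.

```python
# Minimal sketch (not the authors' method): cosine similarity and Mahalanobis
# distance between association lists, using hypothetical word lists and a
# bag-of-words representation assumed for illustration only.
from collections import Counter

import numpy as np
from scipy.spatial.distance import cosine, mahalanobis


def bow_vector(associations, vocabulary):
    """Count-based vector of an association list over a shared vocabulary."""
    counts = Counter(associations)
    return np.array([counts[w] for w in vocabulary], dtype=float)


# Hypothetical association lists for one stimulus word.
person_a = ["student", "lecture", "exam", "campus", "library"]
person_b = ["campus", "degree", "lecture", "professor", "exam"]

vocabulary = sorted(set(person_a) | set(person_b))
vec_a = bow_vector(person_a, vocabulary)
vec_b = bow_vector(person_b, vocabulary)

# Cosine similarity = 1 - cosine distance.
cos_sim = 1.0 - cosine(vec_a, vec_b)
print(f"cosine similarity: {cos_sim:.3f}")

# Mahalanobis distance needs an inverse covariance matrix estimated from a
# sample of vectors; here a tiny made-up sample is used and the covariance is
# regularized so it remains invertible.
sample = np.stack([
    vec_a,
    vec_b,
    bow_vector(["exam", "library", "degree", "campus", "student"], vocabulary),
])
cov = np.cov(sample, rowvar=False) + 1e-3 * np.eye(len(vocabulary))
inv_cov = np.linalg.inv(cov)
print(f"Mahalanobis distance: {mahalanobis(vec_a, vec_b, inv_cov):.3f}")
```

In practice, the choice of representation (counts, TF-IDF, or embeddings) and of the sample used to estimate the covariance matrix strongly affects both measures; the sketch fixes these choices only to keep the example self-contained.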