Research output: Contribution to journal › Article › peer-review
Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri. / Bakaev, Maxim; Gorovaia, Svetlana; Mitrofanova, Olga.
In: Big Data and Cognitive Computing, Vol. 9, No. 2, 46, 18.02.2025.
TY - JOUR
T1 - Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri
AU - Bakaev, Maxim
AU - Gorovaia, Svetlana
AU - Mitrofanova, Olga
PY - 2025/2/18
Y1 - 2025/2/18
N2 - Previously, it was suggested that the “persona-driven” approach can contribute to producing sufficiently diverse synthetic training data for Large Language Models (LLMs), which are about to run out of real natural-language texts. In our paper, we explore whether personas evoked from LLMs through HCI-style descriptions could indeed imitate human-like differences in authorship. To this end, we ran an associative experiment with 50 human participants and four artificial personas evoked from two popular LLM-based services: GPT-4(o) and YandexGPT Pro. For each of the five stimulus words selected from university websites’ homepages, we asked both groups of subjects to come up with 10 short associations (in Russian). We then used cosine similarity and Mahalanobis distance to measure the distance between the association lists produced by different humans and personas. While the similarity differed significantly between individual human associators and between gender and age groups, neither was the case for the different personas evoked from ChatGPT or YandexGPT. Our findings suggest that the LLM-based services so far fall short of imitating the associative thesauri of different authors. The outcome of our study might be of interest to computational linguists, as well as AI/ML scientists and prompt engineers.
AB - Previously, it was suggested that the “persona-driven” approach can contribute to producing sufficiently diverse synthetic training data for Large Language Models (LLMs), which are about to run out of real natural-language texts. In our paper, we explore whether personas evoked from LLMs through HCI-style descriptions could indeed imitate human-like differences in authorship. To this end, we ran an associative experiment with 50 human participants and four artificial personas evoked from two popular LLM-based services: GPT-4(o) and YandexGPT Pro. For each of the five stimulus words selected from university websites’ homepages, we asked both groups of subjects to come up with 10 short associations (in Russian). We then used cosine similarity and Mahalanobis distance to measure the distance between the association lists produced by different humans and personas. While the similarity differed significantly between individual human associators and between gender and age groups, neither was the case for the different personas evoked from ChatGPT or YandexGPT. Our findings suggest that the LLM-based services so far fall short of imitating the associative thesauri of different authors. The outcome of our study might be of interest to computational linguists, as well as AI/ML scientists and prompt engineers.
KW - ChatGPT
KW - YandexGPT
KW - data augmentation
KW - language models
KW - machine learning
KW - semantic similarity
KW - text authorship
UR - https://www.mendeley.com/catalogue/a9822ff9-da6b-3d0f-9fb8-cbc7ebfa1740/
U2 - 10.3390/bdcc9020046
DO - 10.3390/bdcc9020046
M3 - Article
VL - 9
JO - Big Data and Cognitive Computing
JF - Big Data and Cognitive Computing
SN - 2504-2289
IS - 2
M1 - 46
ER -
ID: 132344285
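
The abstract above mentions comparing association lists with cosine similarity and Mahalanobis distance. The Python sketch below only illustrates those two measures under assumptions not stated in the record: association lists are represented as simple bag-of-words count vectors over a shared vocabulary, the word lists are invented, and the covariance matrix is regularized to stay invertible. It is not the authors' actual pipeline.

```python
# Minimal sketch (not the authors' method): cosine similarity and Mahalanobis
# distance between association lists, using hypothetical word lists and a
# bag-of-words representation assumed for illustration only.
from collections import Counter

import numpy as np
from scipy.spatial.distance import cosine, mahalanobis


def bow_vector(associations, vocabulary):
    """Count-based vector of an association list over a shared vocabulary."""
    counts = Counter(associations)
    return np.array([counts[w] for w in vocabulary], dtype=float)


# Hypothetical association lists for one stimulus word.
person_a = ["student", "lecture", "exam", "campus", "library"]
person_b = ["campus", "degree", "lecture", "professor", "exam"]

vocabulary = sorted(set(person_a) | set(person_b))
vec_a = bow_vector(person_a, vocabulary)
vec_b = bow_vector(person_b, vocabulary)

# Cosine similarity = 1 - cosine distance.
cos_sim = 1.0 - cosine(vec_a, vec_b)
print(f"cosine similarity: {cos_sim:.3f}")

# Mahalanobis distance needs an inverse covariance matrix estimated from a
# sample of vectors; here a tiny made-up sample is used and the covariance is
# regularized so it remains invertible.
sample = np.stack([
    vec_a,
    vec_b,
    bow_vector(["exam", "library", "degree", "campus", "student"], vocabulary),
])
cov = np.cov(sample, rowvar=False) + 1e-3 * np.eye(len(vocabulary))
inv_cov = np.linalg.inv(cov)
print(f"Mahalanobis distance: {mahalanobis(vec_a, vec_b, inv_cov):.3f}")
```

In practice, the choice of representation (counts, TF-IDF, or embeddings) and of the sample used to estimate the covariance matrix strongly affects both measures; the sketch fixes these choices only to keep the example self-contained.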