Research output: Contribution to journal › Article › peer-review
A graph-based approach to closed-domain natural language generation. / Firsanova, V.I.
In: Research Result. Theoretical and Applied Linguistics, Vol. 10, No. 3, 30.09.2024, p. 135-167.
TY - JOUR
T1 - A graph-based approach to closed-domain natural language generation
AU - Firsanova, V.I.
N1 - Export Date: 18 November 2024
PY - 2024/9/30
Y1 - 2024/9/30
AB - Graph-based Natural Language Processing (NLP) methods have advanced significantly in recent years with the development of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). LLMs are sophisticated models that perform numerous NLP tasks by interpreting users' natural language instructions, called prompts. However, their industrial use raises ethical concerns, including the generation of false information (hallucinations), high risk of data breaches, and plagiarism. The paper introduces a novel NLP architecture, Graph-Based Block-to-Block Generation (G3BG), which combines state-of-the-art deep learning techniques: attention mechanisms, distributional semantics, graph-based information retrieval, and decentralized networks. The model encodes user prompts to mitigate data breach risk, retrieves relevant information from a graph knowledge base, and forms a conditioning block for an LLM-based conditional language model, performing a new, more secure type of RAG. The model is oriented toward closed-domain, small-scale settings and performs strongly on low-resource NLP tasks, which makes it well suited for industrial use. The research also presents a novel graph-based dataset, comprising private data features for encoding and closed-domain textual information for retrieval, which is used to train and evaluate the G3BG model. The model allows a 100x reduction in training dataset volume while achieving a perplexity of ~6.51 on the language generation task and an F1 score of ~90.3 on the information retrieval task, comparable to most state-of-the-art language models. The experimental results demonstrate the effectiveness of the proposed method and contribute to algorithmic approaches to LLM risk mitigation. © 2024 Belgorod State National Research University. All rights reserved.
KW - Closed-Domain Systems
KW - Data Encoding
KW - Decentralized Networks
KW - Distributional Semantics
KW - Generative Artificial Intelligence
KW - Language Generation
KW - Language Understanding
KW - Large Language Models
UR - https://www.mendeley.com/catalogue/297f77e0-6e1d-31cf-9c3c-627147e4513b/
U2 - 10.18413/2313-8912-2024-10-3-0-7
DO - 10.18413/2313-8912-2024-10-3-0-7
M3 - Article
VL - 10
SP - 135
EP - 167
JO - Research Result. Theoretical and Applied Linguistics
JF - Research Result. Theoretical and Applied Linguistics
SN - 2313-8912
IS - 3
ER -
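For orientation, the following is a minimal, hypothetical Python sketch of the three pipeline stages the abstract names: prompt encoding, retrieval from a graph knowledge base, and formation of a conditioning block for a language model. Every identifier, the hash-based encoding stand-in, and the toy graph are illustrative assumptions; this is not the paper's G3BG implementation.

# Hypothetical sketch of the pipeline described in the abstract:
# (1) encode the user prompt, (2) retrieve facts from a graph
# knowledge base, (3) assemble a conditioning "block" that would be
# passed to a language model. All names and data are assumptions.

import hashlib

# Toy graph knowledge base: node -> list of (relation, neighbor) edges.
KNOWLEDGE_GRAPH = {
    "retrieval augmented generation": [
        ("abbreviated_as", "RAG"),
        ("grounds", "model output in retrieved facts"),
    ],
    "hallucination": [("means", "generation of false information")],
}

def encode_prompt(prompt: str) -> str:
    """Stand-in for the privacy-preserving prompt encoding step."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]

def retrieve(prompt: str) -> list:
    """Keyword-match graph nodes in the prompt; verbalize their edges."""
    facts = []
    for node, edges in KNOWLEDGE_GRAPH.items():
        if node in prompt.lower():
            facts.extend(f"{node} {rel} {obj}" for rel, obj in edges)
    return facts

def build_block(prompt: str) -> str:
    """Assemble the conditioning block handed to the generator."""
    facts = retrieve(prompt)
    return "\n".join(["[CONTEXT]", *facts, "[PROMPT]", encode_prompt(prompt)])

if __name__ == "__main__":
    print(build_block("How does retrieval augmented generation reduce hallucination risk?"))

In a real system the keyword matcher would be replaced by graph-based retrieval over distributional-semantic representations, and the hash by a learned encoder; the sketch only shows how the encoded prompt and retrieved facts could be composed into a single block for conditional generation.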