Deduction under Perturbed Evidence: Probing Student Simulation (Knowledge Tracing) Capabilities of Large Language Models

Shashank Sonkar; Richard G. Baraniuk

Research output: Contribution to journal › Conference article › peer-review

Abstract

We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired from student simulation models a.k.a. knowledge tracing models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information. The prompts and dataset are available at https://github.com/luffycodes/gpt-knowledge-tracing.

Original language	English (US)
Pages (from-to)	26-33
Number of pages	8
Journal	CEUR Workshop Proceedings
Volume	3487
State	Published - 2023
Event	1st Annual Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation, AIEDLLM 2023 - Tokyo, Japan
Duration	Jul 7 2023 → …

Keywords

GPT
Knowledge Tracing
Large Language Models
Reasoning
Student Simulation Models

ASJC Scopus subject areas

Computer Science(all)

Cite this

@article{393e0ebd60bf4261ad494a44d56e1810,

title = "Deduction under Perturbed Evidence: Probing Student Simulation (Knowledge Tracing) Capabilities of Large Language Models",

abstract = "We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired from student simulation models a.k.a. knowledge tracing models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information. The prompts and dataset are available at https://github.com/luffycodes/gpt-knowledge-tracing.",

keywords = "GPT, Knowledge Tracing, Large Language Models, Reasoning, Student Simulation Models",

author = "Shashank Sonkar and Baraniuk, {Richard G.}",

note = "Funding Information: This work was supported by NSF grants 1842378, ONR grant N0014-20-1-2534, AFOSR grant FA9550-22-1-0060, and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047. Publisher Copyright: {\textcopyright} 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).; 1st Annual Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation, AIEDLLM 2023; Conference date: 07-07-2023",

year = "2023",

language = "English (US)",

volume = "3487",

pages = "26--33",

journal = "CEUR Workshop Proceedings",

issn = "1613-0073",

publisher = "CEUR-WS",

}

TY - JOUR

T1 - Deduction under Perturbed Evidence

T2 - 1st Annual Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation, AIEDLLM 2023

AU - Sonkar, Shashank

AU - Baraniuk, Richard G.

N1 - Funding Information: This work was supported by NSF grants 1842378, ONR grant N0014-20-1-2534, AFOSR grant FA9550-22-1-0060, and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047. Publisher Copyright: © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

PY - 2023

Y1 - 2023

N2 - We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired from student simulation models a.k.a. knowledge tracing models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information. The prompts and dataset are available at https://github.com/luffycodes/gpt-knowledge-tracing.

AB - We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must reason over manipulated or falsified evidence present in their prompts, which can result in false conclusions that are valid only under the manipulated evidence. Our goal with DUPE is to determine whether LLMs can arrive at these false conclusions and identify whether the dominant factor influencing the deduction process is the encoded data in the parameters or the manipulated evidence in the prompts. To evaluate the DUPE capabilities of LLMs, we create a DUPEd version of the StrategyQA dataset, where facts are manipulated to reverse the answer to the question. Our findings show that even the most advanced GPT models struggle to reason on manipulated facts - showcasing poor DUPE skills - with accuracy dropping by 45% compared to the original dataset. We also investigate prompt settings inspired from student simulation models a.k.a. knowledge tracing models, which mitigate the accuracy drop to some extent. Our findings have practical implications for understanding the performance of LLMs in real-world applications such as student simulation models that involve reasoning over inaccurate information. The prompts and dataset are available at https://github.com/luffycodes/gpt-knowledge-tracing.

KW - GPT

KW - Knowledge Tracing

KW - Large Language Models

KW - Reasoning

KW - Student Simulation Models

UR - http://www.scopus.com/inward/record.url?scp=85174190260&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85174190260&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85174190260

SN - 1613-0073

VL - 3487

SP - 26

EP - 33

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

Y2 - 7 July 2023

ER -
