A Case Study Using Large Language Models to Generate Metadata for Math Questions

Katie Bainbridge, Candace Walkington, Armon Ibrahim, Iris Zhong, Debshila Basu Mallick, Julianna Washington, Rich Baraniuk

Research output: Contribution to journal › Conference article › peer-review

Abstract

Creating labels for assessment items, such as the concept used, difficulty, or vocabulary used, can improve the quality and depth of research insights and help target the right kinds of questions to students depending on their needs. However, traditional processes for metadata tagging are resource intensive in terms of labor, time, and cost, and the resulting metadata quickly become outdated with any changes to the question content. Given thoughtful prompts, Large Language Models (LLMs) like GPT-3.5 and GPT-4 can efficiently automate the generation of assessment metadata, helping to scale the process to larger volumes of questions and to accommodate updates to question content that would otherwise be tedious to reanalyze. With a human subject matter expert in the loop, recall and precision were analyzed for LLM-generated tags for two metadata variables: problem context and math vocabulary. We conclude that LLMs like GPT-3.5 and GPT-4 are highly reliable at generating assessment metadata, and we make actionable recommendations for others intending to apply the technology to their own assessment items.
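
The abstract does not reproduce the authors' prompts or scoring code, but the workflow it describes can be illustrated with a minimal sketch: prompt an LLM for the two metadata variables (problem context and math vocabulary), then compare its tags against a subject matter expert's labels using precision and recall. The OpenAI Python client usage, prompt wording, tag schema, and example data below are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (not the authors' pipeline): ask an LLM for two metadata
# tags per math question, then score the tags against a human subject matter
# expert's labels with precision and recall.
# Assumes the OpenAI Python client (v1+) with OPENAI_API_KEY set; the prompt
# wording, tag schema, and example data are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are tagging a math assessment item with metadata.\n"
    "Return only JSON with two keys:\n"
    '  "problem_context": a short phrase for the real-world context (or "none"),\n'
    '  "math_vocabulary": a list of math terms that appear in the item.\n\n'
    "Question: {question}"
)

def tag_question(question: str, model: str = "gpt-4") -> dict:
    """Ask the LLM for metadata tags and parse its reply (assumes bare JSON)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def precision_recall(predicted: set, expert: set) -> tuple:
    """Precision = |pred ∩ expert| / |pred|; recall = |pred ∩ expert| / |expert|."""
    true_positives = len(predicted & expert)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expert) if expert else 0.0
    return precision, recall

if __name__ == "__main__":
    question = ("A recipe calls for 3/4 cup of sugar. "
                "How much sugar is needed for half the recipe?")
    expert_vocab = {"fraction", "multiplication"}  # hypothetical expert labels
    tags = tag_question(question)
    llm_vocab = {term.lower() for term in tags.get("math_vocabulary", [])}
    p, r = precision_recall(llm_vocab, expert_vocab)
    print(f"context: {tags.get('problem_context')}  precision: {p:.2f}  recall: {r:.2f}")
```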

Original language: English (US)
Pages (from-to): 34-42
Number of pages: 9
Journal: CEUR Workshop Proceedings
Volume: 3487
State: Published - 2023
Event: 1st Annual Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation, AIEDLLM 2023 - Tokyo, Japan
Duration: Jul 7 2023 → …

Keywords

  • Assessments
  • Human-in-the-loop
  • Large Language Models
  • Metadata

ASJC Scopus subject areas

  • Computer Science (all)
