Reliable Intelligence: Evaluating GenAI’s Role in High-Impact Decision Making
This talk investigates how much trust and confidence can be placed in GenAI-based models for high-impact decision-making processes such as quantitative reliability reasoning. Using synthetic datasets generated from classic reliability models, we vary multiple experimental factors, including sample size, prompt type, and AI model (Claude, DeepSeek, Gemini-pro, ChatGPT, and xAI), to see how each factor affects the reliability estimates obtained from predictive reliability models. The confidence levels and robustness of each AI model’s performance are assessed through a factorial experimental design framework. Notably, the estimates of the unknown parameters of the Weibull reliability model do not converge as the sample size increases, in contrast to what is commonly observed in traditional statistics-based inference. We also observe that ChatGPT and DeepSeek perform best in reliability estimation among the AI models tested, and that both predict reliability robustly regardless of the type of prompt used. The findings shed light on the reliability of, and confidence in, using GenAI models for inferential reliability prediction and reasoning.
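For readers unfamiliar with the classical baseline against which the GenAI estimates are compared, the sketch below generates synthetic Weibull lifetimes and fits them by maximum likelihood at increasing sample sizes. It is an illustration only, not the experimental code behind the talk: the true shape (2.0) and scale (100.0), the sample sizes, the seed, and the helper names `weibull_mle`, `reliability`, and `convergence_demo` are all assumptions chosen for the example, and the data are assumed complete (no censoring).

```python
import math
import random


def weibull_mle(samples, lo=0.01, hi=20.0, tol=1e-8):
    """Return (shape, scale) maximum-likelihood estimates for complete data."""
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # Profile-likelihood equation for the shape k; its root is the MLE.
        powered = [x ** k for x in samples]
        weighted = sum(p * l for p, l in zip(powered, logs))
        return weighted / sum(powered) - 1.0 / k - mean_log

    # score(k) increases with k, so plain bisection finds the root
    # (assuming the MLE lies inside [lo, hi]).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = (sum(x ** shape for x in samples) / len(samples)) ** (1.0 / shape)
    return shape, scale


def reliability(t, shape, scale):
    """Weibull reliability (survival) function R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def convergence_demo(true_shape=2.0, true_scale=100.0,
                     sizes=(20, 200, 2000, 20000), seed=0):
    """Fit synthetic Weibull samples of growing size; return (n, shape, scale) rows."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        # random.weibullvariate takes (scale, shape), in that order.
        data = [rng.weibullvariate(true_scale, true_shape) for _ in range(n)]
        shape, scale = weibull_mle(data)
        rows.append((n, shape, scale))
    return rows


if __name__ == "__main__":
    for n, shape, scale in convergence_demo():
        r50 = reliability(50.0, shape, scale)
        print(f"n={n:>6}  shape={shape:.3f}  scale={scale:.2f}  R(50)={r50:.4f}")
```

Under these assumptions the fitted shape and scale should settle near the true values as n grows, which is the convergence behavior the abstract reports as absent from the GenAI-produced estimates.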