A recent comprehensive investigation reveals that contemporary artificial intelligence models, exemplified by ChatGPT, frequently demonstrate significant limitations in accurately discerning scientific facts and consistently reproducing analytical judgments. This extensive study, conducted by a multidisciplinary team led by Professor Mesut Cicek of Washington State University, systematically evaluated the performance of advanced large language models (LLMs) when tasked with assessing the veracity of scientific hypotheses. The findings underscore a critical gap between the sophisticated linguistic output of these systems and genuine conceptual understanding, signaling a vital need for heightened scrutiny and a calibrated approach to their integration into decision-making processes.
The pervasive integration of artificial intelligence across various professional domains has fostered an environment of both profound optimism and burgeoning apprehension. While generative AI tools have undeniably revolutionized content creation, data synthesis, and rudimentary problem-solving, their capacity for deep analytical reasoning and factual verification remains a subject of intense academic inquiry. This latest research lends considerable weight to the more cautious view, providing empirical evidence that challenges the prevailing narrative of AI as an infallible oracle of information. By subjecting ChatGPT to a rigorous battery of scientific propositions, the study aimed to quantify its accuracy and, crucially, its consistency in determining whether a given hypothesis was substantiated by research.
Methodology: A Rigorous Framework for Evaluation
The research team embarked on an ambitious undertaking, compiling a robust dataset of over 700 distinct hypotheses derived from scholarly articles published in reputable business journals since 2021. This selection was deliberate, as hypotheses in such fields often involve intricate relationships between variables, requiring nuanced interpretation rather than straightforward factual recall. To assess both accuracy and reliability, each of these hypotheses was presented to the AI model ten times. This repeated interrogation served a dual purpose: it allowed for the calculation of an average accuracy rate and, more importantly, provided a direct measure of the model's consistency, revealing whether it would return the same judgment when confronted with identical input.
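To make the repeated-query design concrete, here is a minimal sketch of how such an evaluation loop could be implemented. It is an illustrative reconstruction rather than the research team's actual code: the `ask_model` callable, the data format, and the use of unanimous agreement across the ten runs as the consistency criterion are all assumptions made for the example.

```python
N_REPEATS = 10  # each hypothesis is posed to the model ten times

def evaluate(hypotheses, ask_model):
    """Illustrative evaluation loop (assumptions, not the study's code).

    `hypotheses` is assumed to be a list of (statement, ground_truth) pairs,
    where ground_truth is "supported" or "not supported"; `ask_model` is a
    hypothetical callable that sends one prompt and returns one verdict.
    """
    correct, total, consistent = 0, 0, 0

    for statement, truth in hypotheses:
        prompt = f"Is the following hypothesis supported by research? {statement}"
        answers = [ask_model(prompt) for _ in range(N_REPEATS)]

        # Accuracy: score every individual answer against the ground truth.
        correct += sum(answer == truth for answer in answers)
        total += len(answers)

        # Consistency: count the hypothesis only if all ten verdicts agree.
        if len(set(answers)) == 1:
            consistent += 1

    return {"accuracy": correct / total,
            "consistency": consistent / len(hypotheses)}
```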
The initial phase of the experiment, conducted in 2024, utilized the freely accessible ChatGPT-3.5. A follow-up assessment in 2025 involved the more advanced ChatGPT-5 mini, offering an opportunity to observe any improvements in performance across different iterations of the technology. This comparative approach is essential for understanding the developmental trajectory and inherent limitations of these rapidly evolving AI systems.
Divergence Between Apparent Performance and Actual Reliability
The raw accuracy figures from the study initially appeared somewhat reassuring. In the 2024 trials, ChatGPT correctly identified whether a hypothesis was supported or not in 76.5% of instances. This figure saw a modest increase to 80% in the 2025 re-evaluation. However, these percentages present a potentially misleading picture when viewed in isolation. A critical component of the analysis involved adjusting these scores to account for the inherent probability of random guessing. Given a binary choice (true/false), an AI model would statistically achieve a 50% accuracy rate purely by chance. Once this baseline was factored in, the AI's chance-corrected performance dropped to roughly 60%. This adjusted metric, according to the researchers, translates to a performance level akin to a low "D" grade in an academic context, far removed from the high reliability often presumed of advanced computational systems.
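The article does not spell out the correction formula, but a standard adjustment for a binary task is to rescale the score so that pure guessing maps to zero and a perfect score maps to one. Under that assumption, the short sketch below reproduces the rough magnitude the researchers describe.

```python
def chance_corrected(observed: float, chance: float = 0.5) -> float:
    """Rescale accuracy so that guessing scores 0 and perfection scores 1
    (a common correction for binary-choice tasks; assumed here)."""
    return (observed - chance) / (1.0 - chance)

for year, raw in [(2024, 0.765), (2025, 0.80)]:
    print(year, f"raw={raw:.1%}", f"chance-corrected={chance_corrected(raw):.0%}")

# 2024 raw=76.5% chance-corrected=53%
# 2025 raw=80.0% chance-corrected=60%
```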
A particularly striking revelation emerged concerning the AI’s struggle with identifying unsupported or "false" statements. The model demonstrated a profound weakness in this area, correctly labeling false hypotheses only 16.4% of the time. This specific deficiency has significant implications, as the inability to reliably identify erroneous information can be more damaging than merely failing to confirm true statements. It suggests a fundamental bias or limitation in how these models process and negate information, potentially favoring affirmative responses or struggling with the nuanced conditions that render a statement false within a given scientific context.
The Pervasive Challenge of Algorithmic Inconsistency
Beyond the accuracy metrics, the study unearthed a deeply concerning issue: the pronounced inconsistency of the AI's responses. Even when presented with the exact same hypothesis and prompt ten consecutive times, ChatGPT produced a consistent answer only about 73% of the time. In other words, for roughly one in four hypotheses, the model contradicted itself, offering differing judgments to identical queries.
Professor Cicek underscored the gravity of this finding, stating, "We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers." He further elaborated on the startling phenomenon observed, recounting instances where, for the same precise question, the AI would respond "true" in some iterations and "false" in others, even within the same sequence of ten prompts. This erratic behavior fundamentally undermines the credibility and utility of AI systems for any application requiring dependable, repeatable outputs, particularly in critical analytical or decision-making contexts. The presence of such internal contradictions suggests a lack of stable internal representation or reasoning process, highlighting that the model’s output is highly stochastic and prone to variation.
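One well-known technical reason for such run-to-run variation, though not one the study itself examines, is that chat models typically sample their replies from a probability distribution over possible continuations rather than always picking the single most likely one. The toy example below uses made-up scores for the two verdicts; it simply illustrates how temperature-based sampling can return different answers to an identical prompt.

```python
import math
import random

# Hypothetical model scores (logits) for the two possible verdicts on one prompt.
logits = {"supported": 1.2, "not supported": 0.9}

def sample_verdict(logits, temperature=1.0):
    """Convert logits to probabilities with a softmax, then draw one verdict."""
    scaled = {verdict: score / temperature for verdict, score in logits.items()}
    normaliser = sum(math.exp(score) for score in scaled.values())
    probs = {verdict: math.exp(score) / normaliser for verdict, score in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Ten "identical queries": the prompt and the scores never change, yet the
# sampled verdicts can differ because each draw is random.
print([sample_verdict(logits) for _ in range(10)])
```

Real chat systems are vastly more complex, but the same basic mechanism, sampling with a nonzero temperature, means two identical requests need not yield identical answers, which is consistent with the run-to-run disagreement the study reports.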
Linguistic Fluency Versus Genuine Conceptual Understanding
The core takeaway from the research, published in the Rutgers Business Review, reinforces a growing consensus among AI ethicists and researchers: the impressive linguistic fluency of generative AI should not be conflated with genuine conceptual understanding. Large language models excel at pattern recognition, synthesizing vast quantities of text to generate coherent and grammatically correct sentences that often mimic human-like discourse. However, this ability to produce convincing language does not necessarily equate to an underlying comprehension of the meaning, context, or logical implications of the information being processed.
As Professor Cicek articulates, "Current AI tools don’t understand the world the way we do — they don’t have a ‘brain.’ They just memorize, and they can give you some insight, but they don’t understand what they’re talking about." This distinction is paramount. Human understanding involves abstract reasoning, common sense, causal inference, and a rich, embodied experience of the world. LLMs, in contrast, operate primarily on statistical relationships between words and phrases. They predict the next most probable word in a sequence based on their training data, rather than constructing a mental model of reality or engaging in deductive or inductive reasoning in a human sense. This "stochastic parrot" phenomenon explains why they can generate plausible-sounding but factually incorrect or inconsistent information, especially when faced with complex, nuanced problems that demand true analytical depth.
The findings thus suggest that the advent of Artificial General Intelligence (AGI), defined as AI possessing human-level cognitive abilities across a wide range of tasks, may be significantly further away than popular discourse and media portrayals often imply. The current generation of AI tools, despite their remarkable progress, still demonstrates fundamental limitations in tasks requiring robust reasoning, critical evaluation, and a stable internal representation of knowledge.
Implications for Business, Academia, and Public Trust
The practical implications of this study are far-reaching, particularly for sectors that increasingly rely on AI for strategic insights and operational efficiency. For business leaders, the researchers’ recommendations are clear: treat AI-generated information with a healthy dose of skepticism and implement rigorous verification protocols. Relying solely on AI output for critical decisions—whether in market analysis, financial forecasting, or strategic planning—carries substantial risks, including flawed strategies, financial losses, and reputational damage. The study underscores the necessity for comprehensive training programs aimed at fostering "AI literacy" among professionals, enabling them to discern the capabilities and, more importantly, the inherent limitations of these tools.
In academic and research environments, the temptation to leverage AI for literature reviews, hypothesis generation, or even data interpretation is growing. However, the demonstrated propensity for inaccuracy in identifying false statements and the pervasive inconsistency present a clear warning. Uncritical adoption could lead to the propagation of misinformation, misinterpretation of research findings, and a degradation of scholarly rigor. Human expert oversight remains indispensable to ensure the integrity of scientific inquiry.
More broadly, the study contributes to the ongoing conversation about public trust in AI. If AI systems cannot consistently provide accurate and reliable answers to well-defined scientific questions, their trustworthiness in applications impacting public welfare, such as healthcare diagnostics, legal advice, or news dissemination, becomes a significant concern. The societal risks associated with widespread reliance on systems that "don’t understand what they’re talking about" are considerable, necessitating cautious deployment and robust regulatory frameworks.
Broader Context and the Path Forward
This investigation is not an isolated finding. Professor Cicek noted that similar experiments with other AI tools have yielded comparable outcomes, suggesting that the observed limitations are systemic to the current paradigm of large language models rather than specific to ChatGPT. The work also aligns with earlier research urging caution around AI hype. For instance, a 2024 national survey revealed that consumers were less inclined to purchase products marketed with a prominent focus on AI, signaling a latent public skepticism that these scientific findings now help to validate.
The researchers, while highlighting these crucial limitations, are not advocating for a wholesale rejection of AI. Instead, their message is one of informed caution and strategic implementation. "Always be skeptical," Cicek advises, emphasizing that while he personally uses AI, a critical approach is paramount. The path forward involves a nuanced understanding of AI’s strengths—its ability to process vast datasets, identify patterns, and generate creative content—while simultaneously acknowledging its weaknesses in areas requiring deep reasoning, factual certainty, and consistent judgment.
Future advancements in AI will undoubtedly seek to address these challenges. Research into neural-symbolic AI, causal inference, and more robust methods for grounding language models in real-world knowledge could potentially mitigate some of the issues identified. However, for the foreseeable future, the human element of critical thinking, verification, and ethical oversight will remain indispensable. The study serves as a potent reminder that while AI can be a powerful assistant, it is not yet, and perhaps never will be, a substitute for human intelligence in its fullest, most reliable, and conceptually grounded form. The responsibility to navigate this evolving technological landscape wisely rests firmly with human users and developers alike.