AI bots ignore evidence. Can we trust them with science?

Imagine holding a simple ballpoint pen horizontally between your two hands, and then, without any warning, letting go of one end while keeping a firm, immovable grip on the other. To any human child who has ever played with toys, the outcome is obvious: the pen will remain perfectly horizontal, suspended in mid-air by the strength of your remaining hand. Yet, if you pose this exact scenario to the world’s most advanced artificial intelligence models, including ChatGPT, Gemini, and Grok, they will confidently tell you that the unsupported end of the pen must pivot downward toward the floor. When YouTuber FatherPhi decided to test these systems, he did not just type out the question; he showed them a live video of himself performing the experiment, easily keeping the pen level with a single hand. When he asked ChatGPT to explain what had just transpired on screen, the chatbot stubbornly replied, “I saw the pen rotate exactly as expected.” This bizarre moment of digital gaslighting reveals a profound truth about modern artificial intelligence: when confronted with physical reality, these incredibly sophisticated systems would rather believe their own pre-written scripts than accept the evidence right in front of their digital eyes.

This viral, almost comical interaction highlights a much deeper, more systemic flaw within the architecture of Large Language Models (LLMs). As Walter Quattrociocchi, a computer scientist at the Sapienza University of Rome, points out, this is not merely a vision-processing bug or a minor software glitch that can be patched with an overnight update. Instead, it exposes the fundamental reality that AI systems do not truly comprehend the world, nor do they possess a dynamic model of physical cause and effect. A chatbot can easily read the logo on a pen, identify its exact color, and name its manufacturer, yet it remains utterly incapable of updating its predictions in real time when shown visual proof that contradicts its initial hypothesis. In the human mind, perception and reasoning exist in a continuous feedback loop; when we see something that violates our expectations, we immediately adjust our mental models. LLMs, conversely, operate on static statistical probabilities generated during their training phases, rendering them functionally blind to the unfolding present and highly resistant to changing their minds when presented with new evidence.

To determine whether this cognitive stubbornness poses a genuine threat to practical applications, a multidisciplinary research team led by materials scientist N.M. Anoop Krishnan and AI researcher Kevin Jablonka decided to move the experiment from the living room to the chemistry lab. They designed a series of rigorous scientific reasoning trials, connecting various AI agents—which are LLMs outfitted with specialized digital tools—to simulated lab environments and actual physical equipment, including an atomic force microscope. The AI agents were tasked with carrying out essential scientific inquiries, such as identifying mystery chemical solutions by running experiments and analyzing the resulting data. The findings, published on arXiv.org, were deeply unsettling for anyone hoping to rely on AI for imminent scientific breakthroughs. In 68 percent of the 619 tasks evaluated, the AI agents chose to completely ignore the empirical evidence they had just collected. Furthermore, they made baseless scientific assertions without any supporting data in 53 percent of the trials, and they successfully utilized contradictory evidence to revise their hypotheses a meager 26 percent of the time.

This dramatic failure rate highlights a profound mismatch between the way machines process information and the core tenets of the scientific method. Human scientific progress has always relied on what Krishnan describes as an “iterative process”: a delicate, self-correcting dance where a scientist constructs a hypothesis, tests it through rigorous experimentation, and then willingly abandons or alters their original assumptions based on what the universe reveals. AI agents, by contrast, behave more like dogmatic theorists who refuse to let reality get in the way of a preconceived narrative; even when confronted with stark, undeniable data proving their line of inquiry is incorrect, they stubbornly march forward with their original, flawed plans. As Kevin Jablonka notes, this lack of intellectual humility is fatal to scientific integrity. In any scientific endeavor, we cannot trust a final discovery unless we can fully trust the step-by-step logic and the transparent, self-correcting process that led to it. Because these AI models lack a valid, reflective cognitive process, their final conclusions are often little more than statistical coincidences wrapped in an illusion of competence.

To bridge this gap, tech conglomerates have recently pivoted toward marketing what they call “reasoning models.” These systems are specifically trained to break down complex queries into sequential steps, supposedly showing their “thinking process” in detailed text outputs before arriving at a final conclusion. However, many independent computer scientists argue that this labeled “thinking” is merely a clever parlor trick. Subbarao Kambhampati of Arizona State University uses a vivid analogy to demystify this phenomenon: imagine talking to a personal fitness trainer over a phone call. If the trainer instructs you to do fifty crunches, you could easily breathe heavily, grunt on cue, and confidently declare that you have completed the set, despite never actually moving from your couch. The trainer, relying only on auditory patterns, has no choice but to believe you. Similarly, reasoning models have simply learned to mimic the rhetorical structure, linguistic cadence, and emotional beats of human problem-solving, generating highly convincing reasoning logs while their underlying algorithms remain entirely disconnected from genuine, logical deduction.

This disconnect between superficial eloquence and actual comprehension leaves humanity at a critical, historic crossroads regarding our relationship with artificial intelligence. For skeptics like Walter Quattrociocchi, the rapid proliferation of highly confident, unverified AI outputs represents an existential threat to our shared reality; he warns that the very architecture of human knowledge is under attack by machines that produce plausible-sounding falsehoods without any mechanism for physical validation. Yet there is a parallel, more constructive path forward advocated by researchers like Krishnan and Jablonka. If we can collectively cast aside the tech-industry mythology of an emerging synthetic super-intelligence, we can begin to see LLMs for what they actually are: incredibly powerful, hyper-efficient execution tools. By keeping human curiosity, empirical skepticism, and rigorous scientific oversight firmly in the driver’s seat, we can safely utilize these statistical engines to accelerate routine tasks, ensuring that the ultimate pursuit of truth remains an unmistakably human endeavor.