In a significant development for AI benchmarking, a newly launched website by AI professor Aaron Sloman and developer Benjamin Schmidt lets users blind-test OpenAI’s GPT-5 against its predecessor, GPT-4o. The platform strips branding and presents two generated outputs without indicating which model produced which, allowing unbiased comparisons. The results have been as surprising as they are insightful, challenging assumptions about what constitutes progress in large language models (LLMs). Blind testing of this kind sheds light on user perceptions and model capabilities, and it raises critical questions about what “better” really means in generative AI.
Emergence of GPT-5 vs. GPT-4o: A Quiet Revolution
OpenAI’s release of GPT-4o in May 2024 was hailed as a leap forward. It brought real-time multimodality, reduced latency, and a more humanlike conversational experience, especially through the integration of audio, vision, and text inputs (OpenAI, 2024). GPT-5, introduced in 2025, was designed not merely for raw performance gains but for consistency, precision in reasoning, and improved instruction following, per leaked insider documentation and analyses by The Gradient (The Gradient, 2025).
Despite expectations of a clear winner, early blind testing on the platform set up by Sloman and Schmidt revealed that many users preferred GPT-4o’s responses, mistaking them for GPT-5’s. This aligns with cognitive studies in comparative choice theory that suggest people tend to rate outputs as better when they exhibit a more engaging or “warmer” tone, a quality GPT-4o retained even if it sometimes lacked the deeper reasoning capabilities GPT-5 boasts. The interplay between user perception and technical sophistication is shaping a new landscape for AI benchmarking.
Blind Testing Results: A Counterintuitive Turn
The platform is built for blind testing: users paste a prompt, receive two AI responses labeled simply “A” and “B,” pick the better one, and only afterward learn which response came from which model. Thousands of such interactions have already occurred, producing a living dataset that reveals not just preferences but also the nuances of AI quality evaluation.
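To make the mechanics concrete, here is a minimal sketch of such a user-blind comparison harness in Python. It assumes the OpenAI Python SDK and the model identifiers `gpt-4o` and `gpt-5`; the site’s actual implementation has not been published, so treat this as an illustration of the protocol rather than its code.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-4o", "gpt-5"]  # model identifiers assumed for illustration


def blind_compare(prompt: str) -> dict:
    """Run one user-blind comparison: query both models, shuffle the
    label assignment, and return the anonymized response pair."""
    assignment = random.sample(MODELS, k=2)  # random mapping to labels A/B
    responses = {}
    for label, model in zip("AB", assignment):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        responses[label] = completion.choices[0].message.content
    # The mapping stays server-side and is revealed only after the vote.
    return {"responses": responses, "mapping": dict(zip("AB", assignment))}


def reveal_vote(session: dict, choice: str) -> str:
    """After the user votes, reveal which model produced the preferred response."""
    return session["mapping"][choice]


if __name__ == "__main__":
    session = blind_compare("Explain transformers to a ten-year-old.")
    print("A:", session["responses"]["A"][:200])
    print("B:", session["responses"]["B"][:200])
    print("Preferred response came from:", reveal_vote(session, choice="A"))
```

The key design point is that the model-to-label mapping is drawn at random per session and withheld until after the vote, which is what keeps brand familiarity from contaminating the comparison.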
According to aggregate data shared by Schmidt on his GitHub, results were nearly even at launch, with GPT-4o sometimes drawing more votes from casual users for its softer tone, humor, or relatability. On more analytical or technical questions, GPT-5 edged ahead, confirming early assessments on the DeepMind Blog (2025) that GPT-5 was tuned more for formal logical consistency and improved creative and factual synthesis.
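Arena-style blind tests are commonly summarized by converting pairwise votes into Elo-style ratings, which is one plausible way an aggregate result like “nearly even at launch” could be computed. The sketch below applies the standard Elo update to a few hypothetical vote records; it is not Schmidt’s published aggregation code.

```python
# Elo-style aggregation of pairwise blind votes (illustrative only;
# the vote records below are hypothetical, not the site's real data).
K = 32  # update step size


def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift rating points from the loser to the winner of one vote."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)


ratings = {"gpt-5": 1000.0, "gpt-4o": 1000.0}
votes = [("gpt-5", "gpt-4o"), ("gpt-4o", "gpt-5"), ("gpt-5", "gpt-4o")]
for winner, loser in votes:
    update(ratings, winner, loser)
print(ratings)  # ratings stay close together when votes are split
```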
| Category | GPT-5 Win Rate | GPT-4o Win Rate |
|---|---|---|
| Coding Accuracy | 62% | 38% |
| Emotional Tone / Empathy | 44% | 56% |
| Mathematical Logic | 68% | 32% |
| Creative Writing | 51% | 49% |
This table is derived from test-data summaries referenced in VentureBeat (2025) and represents cumulative outcomes from at least 10,000 blind-session responses. It shows that while GPT-5 excels where structure and logic are paramount, GPT-4o remains preferred for naturally flowing conversation, something Slack’s Future Forum research finds crucial in hybrid communication environments (Slack, 2025).
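Because per-category sample sizes are not broken out in the public summaries, narrow margins such as the 51%/49% creative-writing split deserve a quick significance check. The snippet below uses a normal-approximation confidence interval on an assumed session count to show how wide the uncertainty can be.

```python
# Rough significance check for a reported win rate, using a normal
# approximation to the binomial. The per-category sample size is an
# assumption: only the overall 10,000+ total is reported.
import math


def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for a win rate."""
    p = wins / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p - margin, p + margin


# Hypothetical: 2,500 creative-writing sessions at a 51% GPT-5 win rate.
low, high = win_rate_ci(wins=1275, total=2500)
print(f"95% CI: {low:.3f} - {high:.3f}")  # ~0.490 - 0.530, straddles 50%
```

Under that assumed sample size, the creative-writing edge is statistically indistinguishable from a tie, whereas margins like 62%/38% would remain clearly significant.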
Implications for AI Benchmarking and Model Adoption
The blind testing phenomenon reveals how essential the context of use is. A corporate legal advisor might find GPT-5’s depth indispensable, while a marketing team writing short ad copy or social posts may prefer the candor and engagement of GPT-4o. According to the World Economic Forum’s 2025 white paper on adaptive AI, benchmark evaluation methods that rely solely on academic test sets fall short of capturing real user values (WEF, 2025).
This insight is accelerating what Deloitte Insights has termed the “personalization imperative” in enterprise tech acquisition (Deloitte, 2025). Companies are no longer asking which model is better on average; they are asking which one aligns better with their brand tone, regulatory needs, and operational rhythm. As organizations continue to integrate AI tools across workflows, the ability to test AI performance in real-world contexts will be mission-critical, as noted in recent warnings by the FTC (2025) on vendor lock-in and bias-transparency risks.
Inference Mechanics and Cost Considerations
GPT-5 appears to consume significantly more compute per token, according to data compiled from early benchmarks run on NVIDIA Hopper H200 GPUs (NVIDIA, 2025). This not only affects scalability but also raises the financial barrier for startups that want to integrate the latest models. GPT-4o’s computational efficiency may offer a better dollar-to-performance ratio for non-critical deployments, particularly in AI-assistant or content-moderation tasks.
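A rough way to reason about that dollar-to-performance trade-off is to estimate monthly spend per model for a given workload. The figures below (per-token prices, request volume, average tokens per request) are placeholders chosen for illustration, not published pricing.

```python
# Back-of-the-envelope monthly cost comparison. All prices and volumes
# are hypothetical placeholders, not actual vendor pricing.
PRICE_PER_1K_TOKENS = {  # assumed blended USD price per 1,000 tokens
    "gpt-5": 0.030,
    "gpt-4o": 0.010,
}


def monthly_cost(model: str, requests: int, avg_tokens: int) -> float:
    """Estimated monthly spend for a given request volume and size."""
    return requests * avg_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


for model in PRICE_PER_1K_TOKENS:
    cost = monthly_cost(model, requests=200_000, avg_tokens=800)
    print(f"{model}: ${cost:,.0f}/month")  # gpt-5: $4,800, gpt-4o: $1,600
```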
Recent analyses by MarketWatch (2025) and the McKinsey Global Institute found that average monthly AI spend for mid-size enterprises has doubled year-over-year due to demand for higher accuracy and multi-modal capabilities. Model choice now heavily impacts operational expenses and product cost structures, giving rise to “model tuning arbitrage”: selecting the most cost-effective model for each workflow, as described in Investopedia’s 2025 emerging-tech economics outlook (Investopedia, 2025).
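In practice, “model tuning arbitrage” can be as simple as routing each workflow to the cheapest model that clears a quality bar. The sketch below illustrates that routing rule; the quality scores and prices are invented placeholders, not measured values.

```python
# A minimal sketch of "model tuning arbitrage": route each workflow to the
# cheapest model that meets its quality threshold. Scores and prices are
# illustrative placeholders only.
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # hypothetical USD
    quality: dict              # per-workflow quality score in [0, 1]


CATALOG = [
    ModelProfile("gpt-4o", 0.010, {"ad_copy": 0.86, "contract_review": 0.74}),
    ModelProfile("gpt-5",  0.030, {"ad_copy": 0.84, "contract_review": 0.92}),
]


def route(workflow: str, min_quality: float) -> str:
    """Pick the cheapest model whose quality clears the bar for this workflow."""
    eligible = [m for m in CATALOG if m.quality.get(workflow, 0) >= min_quality]
    if not eligible:
        raise ValueError(f"no model meets the bar for {workflow}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens).name


print(route("ad_copy", min_quality=0.80))          # gpt-4o: cheaper and good enough
print(route("contract_review", min_quality=0.90))  # gpt-5: only one clears the bar
```

The design choice worth noting is that the quality threshold, not an overall leaderboard rank, drives the decision, which mirrors how per-workflow blind-test data would actually be used.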
Beyond Scores: Evaluating Model Usefulness at a Human Scale
Blind testing initiatives are also reshaping how ordinary users define intelligence. Whereas earlier model evaluations focused on benchmarks like MMLU, HumanEval, or even MATH, the reality exposed by user ratings is that helpfulness and clarity often matter more than raw power. Indeed, this supports a view echoed by Gallup’s 2025 Future of Work survey, which found that 61% of workers prefer conversational interfaces that “feel human,” even over more factually accurate ones (Gallup, 2025).
There is also the issue of familiarity bias. GPT-4o, integrated into most ChatGPT Pro experiences for months, has shaped user expectations about how bots “should” speak. Consequently, when GPT-5 responds with deep logic but a blunt tone, users may read it as cold or wrong. This sociolinguistic factor has been echoed in commentary by AI Trends and MIT Technology Review, both of which argue that model evaluation should increasingly account for cultural and emotional cognition (AI Trends, MIT Tech Review, 2025).
Concluding Insights: Toward Participatory AI Evaluation
As the AI community continues to push toward general intelligence, how we evaluate models must evolve alongside performance gains. One might recall Anthropic’s Claude family excelling at constitutional reasoning but being questioned for verbosity, and Google DeepMind’s Gemini excelling at agentic tasks but initially lacking humanlike flow (DeepMind Blog, 2025). Nuance and user experience are emerging as powerful performance proxies, and blind testing provides a rare, transparent mechanism to challenge top-down narratives.
In this evolving landscape, tools like Schmidt and Sloman’s blind-test site introduce a user-centric lens through which LLM evolution can be understood not just by number-crunching AI engineers but by regular people trying to get work done. It raises a pointed question: what good is a smarter AI if it doesn’t feel intuitive to use? Perhaps, as OpenAI, Google, and Anthropic race toward AGI, blind testing will remind us that AI’s success will ultimately be judged not by industry leaders but by users.