
AI Models Betting on Soccer: Why Frontier Systems Like Grok and Gemini Go Bankrupt

The performance of AI models betting on soccer reveals a massive gap between artificial intelligence hype and real-world unpredictability. While generative systems excel at writing software, a new study demonstrates that even the most advanced frontier models systematically lose money when forced to navigate the chaotic, long-term variables of sports gambling. For businesses relying on automated risk management, these findings highlight severe limitations in current AI reasoning capabilities.

Released this week by the London-based startup General Reasoning, the KellyBench report tested eight top-tier artificial intelligence systems in a virtual recreation of the 2023-24 Premier League season. The AI agents were provided with detailed historical data and team statistics, then instructed to build models that would maximize returns and manage risk. Operating without internet access to retrieve live results, each system was given three attempts to turn a profit from a normalized starting bankroll of £100,000.
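The benchmark's name nods to the Kelly criterion, the classic formula for sizing bets as a fraction of bankroll to maximize long-run growth. As an illustrative sketch only (this is not the benchmark's actual harness, and the probability and odds figures below are hypothetical), Kelly staking looks like this:

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly-optimal stake as a fraction of bankroll.

    p: estimated probability of the bet winning
    decimal_odds: payout per unit staked, including the stake
    """
    b = decimal_odds - 1.0   # net profit per unit staked on a win
    q = 1.0 - p              # probability of losing
    f = (b * p - q) / b      # the Kelly formula
    return max(f, 0.0)       # never bet when there is no positive edge

bankroll = 100_000.0
# Hypothetical match: the model estimates a 55% win probability
# on a team offered at decimal odds of 2.10.
stake = bankroll * kelly_fraction(0.55, 2.10)
```

A bettor whose probability estimates are even slightly worse than the bookmaker's implied probabilities will have a negative edge on most bets, and correct Kelly behaviour is then to stake nothing, which is one reason long-horizon betting is such a harsh test of a model's calibration.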

The results were poor across the board, with the study's authors concluding that the systems systematically underperformed humans. Every frontier model evaluated lost money over the course of the season, and several suffered total financial ruin.

Performance Breakdown: AI Models Betting on Soccer

The data below shows the mean return on investment (ROI) and mean final bankroll across three attempts for each tested system. Notably, the models from xAI and Acree went bankrupt and failed to complete all of their attempts.

| AI Model | Mean ROI | Best Try | Worst Try | Mean Final Bankroll |
| --- | --- | --- | --- | --- |
| Anthropic Claude Opus 4.6 | -11.0% | -0.2% | -18.8% | £89,035 |
| OpenAI GPT-5.4 | -13.6% | -4.1% | -31.6% | £86,365 |
| Google Gemini 3.1 Pro | -43.3% | +33.7% | -100.0% | £56,715 |
| Google Gemini Flash 3.1 LP | -58.4% | +24.7% | -100.0% | £41,605 |
| Z.AI GLM-5 | -58.8% | -14.3% | -100.0% | £41,221 |
| Moonshot Kimi K2.5 | -68.3% | -27.0% | -100.0% | £7,420 |
| xAI Grok 4.20 | -100.0% | -100.0% | -100.0% | £0 |
| Acree Trinity | -100.0% | -100.0% | -100.0% | £0 |
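The two result columns are internally consistent: the mean final bankroll is simply the £100,000 start compounded by the mean ROI, agreeing to within rounding of the published percentages. A quick illustrative check, with figures copied from the table:

```python
START = 100_000  # normalized starting bankroll in pounds

# (mean ROI, reported mean final bankroll) for three of the models
reported = {
    "Anthropic Claude Opus 4.6": (-0.110, 89_035),
    "OpenAI GPT-5.4": (-0.136, 86_365),
    "Google Gemini 3.1 Pro": (-0.433, 56_715),
}

for model, (roi, bankroll) in reported.items():
    implied = START * (1 + roi)
    # ROI is rounded to one decimal place, so allow ±0.05% of bankroll
    assert abs(implied - bankroll) <= START * 0.0005, model
```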

Anthropic's Claude Opus 4.6 emerged as the most resilient, nearly breaking even on its best attempt with a minimal 0.2 percent loss. At the other extreme, Google's Gemini 3.1 Pro posted the largest single-run gain, a 33.7 percent profit, before going completely bankrupt on a subsequent attempt.

The Illusion of Static Benchmarks

The catastrophic failure of systems like xAI Grok 4.20, which went bankrupt once and failed to complete its other two tries, provides a counterweight to the growing Silicon Valley excitement surrounding automated programming. Ross Taylor, chief executive of General Reasoning and a former Meta AI researcher, noted that current industry benchmarks are heavily flawed because they rely on highly static environments.

According to Taylor, these controlled testing grounds bear little resemblance to the chaos and complexity of the real world. While software engineering remains economically valuable, the inability of these models to adapt to new events and updated player data over a long-term horizon suggests that white-collar professionals in dynamic industries like finance and marketing may not be replaced as quickly as anticipated.

My Take: Why AI Models Betting on Soccer Exposes Benchmark Flaws

The findings from the KellyBench report serve as a crucial reality check for enterprise leaders eager to hand over complex, long-term decision-making to artificial intelligence. The fact that AI models betting on soccer universally failed to manage risk over a nine-month simulated season strongly suggests that current large language models (LLMs) lack genuine predictive reasoning. They are exceptional at pattern recognition within static datasets, but they break down when forced to adjust dynamically to the compounding variables of human unpredictability, injuries, and shifting team momentum.

Google's Gemini 3.1 Pro perfectly illustrates this volatility. Its ability to swing from a 33.7 percent profit to a 100 percent bankruptcy highlights a dangerous inconsistency that would be catastrophic in real-world financial markets. Meanwhile, the complete failure of xAI Grok 4.20 to even finish the simulation suggests severe limitations in its underlying architecture when tasked with sustained, multi-step logical execution without human intervention.
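That swing from large profit to total ruin is exactly what over-aggressive stake sizing produces. Even on perfectly fair even-money bets, repeatedly staking a large fraction of the bankroll makes ruin almost certain over a 380-match season, while small fractional stakes survive. A minimal simulation illustrating the effect (all parameters here are hypothetical, not taken from the study):

```python
import random

def simulate_season(stake_frac, n_bets=380, p=0.5,
                    decimal_odds=2.0, bankroll=100_000.0, rng=None):
    """Final bankroll after staking a fixed fraction on n_bets fair bets."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    for _ in range(n_bets):
        stake = bankroll * stake_frac
        if rng.random() < p:
            bankroll += stake * (decimal_odds - 1.0)  # win
        else:
            bankroll -= stake                         # loss
        if bankroll < 1.0:   # effectively bankrupt
            return 0.0
    return bankroll

aggressive = simulate_season(0.50)  # half the bankroll every match
cautious = simulate_season(0.02)    # two percent every match
```

With the seeded generator above, the aggressive bettor is wiped out while the cautious one finishes near its starting bankroll: the expected log growth per fair bet, 0.5·ln(1+f) + 0.5·ln(1−f), is negative and grows sharply more negative as the stake fraction f rises.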

Ultimately, this study should force the tech industry to rethink how it measures artificial intelligence capabilities. Until developers can create benchmarks that accurately simulate the chaotic, long-term horizons of the real world, businesses must treat AI as a powerful assistant rather than an autonomous agent capable of managing financial risk.

Sources: arstechnica.com