The Supreme Court of Colombia has ignited a fierce debate over legal technology after rejecting a cassation appeal on the grounds that it was written by artificial intelligence, only for the court's own ruling to be flagged as AI-generated by the very same software. The episode underscores how unreliable current AI detection tools remain in high-stakes professional environments.
The controversy began when the court dismissed a lawyer's filing, citing a "well-founded suspicion" that the text was not drafted by a human legal professional. To justify this dismissal, the court utilized the Winston AI detection tool. According to the court's statement, the analysis indicated that the lawyer's document contained "only 7% human content," leading the judges to conclude it was produced using generative AI and was therefore inadmissible.
The "Double Standard" and Technical Backlash
The ruling immediately faced scrutiny from the legal community. Attorney Emmanuel Alessio Velasquez ran the text of the court's own decision, identified as Auto AP760/2026, through the exact same Winston AI software cited by the judges. Velasquez revealed on X (formerly Twitter) that the tool flagged the court's ruling as containing "93% AI-generated text."
Velasquez argued that if the judicial decision condemning the use of AI scores such a high percentage on the same detector, the "methodological fragility" of using these tools as legal evidence becomes undeniable. This revelation went viral, prompting other lawyers to test the reliability of these detectors against historical documents.
Criminal defense lawyer Andres F. Arango G submitted a court filing from 2019, years before the widespread availability of modern Large Language Models (LLMs), and the software still claimed it was 95% AI-generated. Similarly, Nicolas Buelvas tested his 2020 undergraduate thesis, which returned a result of 100% AI. These false positives suggest that the tools may be flagging formal legal prose, which is naturally structured and repetitive, rather than actual machine-generated content.
Why AI Detectors Fail on Legal Texts
The technical failure here stems from how AI detectors operate. These tools analyze statistical patterns such as sentence length, vocabulary predictability, and "burstiness" (the variation in sentence structure). Legal and academic writing often lacks high burstiness because it prioritizes precision, formal structure, and specific terminology, traits that overlap significantly with how AI models are trained to generate text.
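To make the "burstiness" idea concrete, here is a minimal, illustrative sketch in Python. It is not Winston AI's actual algorithm; the metric (coefficient of variation of sentence lengths) and the sample texts are assumptions chosen purely for demonstration.

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths.

    Low values = uniformly sized sentences, the flat rhythm that
    detectors tend to read as machine-generated.
    """
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Hypothetical formal legal prose: uniform, clause-heavy sentences.
legal = (
    "The appellant filed the motion within the statutory term. "
    "The respondent answered the motion within the statutory term. "
    "The court reviewed the motion pursuant to the applicable rule."
)

# Hypothetical casual prose: sentence lengths vary wildly.
casual = (
    "I read the ruling. Honestly? It surprised me, because the reasoning "
    "leaned on a tool nobody had validated for legal Spanish. Wild."
)

print(f"legal burstiness:  {burstiness(legal):.2f}")   # low  -> reads as "AI-like"
print(f"casual burstiness: {burstiness(casual):.2f}")  # high -> reads as "human-like"
```

Formal legal drafting scores low on this kind of measure by design, which is exactly why a detector built around such statistics can misfire on it.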
Independent tests on the court's verdict showed inconsistent results depending on the sample size. When GPTZero scanned only the opening words of the court's text, it returned a 100% AI result. However, when a longer version including the factual background was processed, the result reversed to 100% human. This volatility shows that such tools are currently too unreliable to serve as the sole basis for denying access to justice.
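The same toy metric illustrates why sample size can swing the verdict: a statistic estimated from a handful of sentences is inherently noisy. Continuing the sketch above (reusing `burstiness()`, `legal`, and `casual`), with a cutoff that is entirely arbitrary:

```python
# Continuing the sketch above (burstiness(), legal, casual already defined).
# Score progressively longer prefixes of the same hypothetical document and
# watch the classification flip as more text becomes available.
document = legal + " " + casual

for n_words in (15, 28, 60):
    prefix = " ".join(document.split()[:n_words])
    score = burstiness(prefix)
    # 0.4 is an arbitrary cutoff chosen only for this demonstration.
    verdict = "AI-like" if score < 0.4 else "human-like"
    print(f"first {n_words:>2} words: burstiness={score:.2f} -> {verdict}")
```

With these toy inputs, the short, purely formal prefixes fall below the cutoff ("AI-like") while the full mixed text lands well above it ("human-like"): the same reversal GPTZero showed on the court's verdict, because the statistic is dominated by whichever style the sample happens to capture.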
Frequently Asked Questions
Which AI detector did the Colombian court use?
The court explicitly cited Winston AI in its ruling to justify rejecting the lawyer's appeal.
Why do human-written legal documents get flagged as AI?
Legal documents use highly structured, formal, and repetitive language. AI detectors often mistake this low "perplexity" and lack of variation for machine-generated text.
My Take
This incident serves as a critical wake-up call for the integration of legal tech. The fact that a 2019 document, written before the ChatGPT boom, scored 95% on an AI detector proves that these tools are fundamentally flawed when applied to technical or formal writing. Relying on probabilistic software like Winston AI or GPTZero to make binary judicial decisions denies due process based on a "black box" algorithm that cannot distinguish between a robotic writing style and an actual robot. Until these tools can transparently demonstrate their accuracy on domain-specific writing, they should be treated as advisory at best, not evidentiary.