Do AI detectors actually work?

AI-generated content is everywhere: on social media, in classrooms, in cover letters, and even in published books. In response, a wave of “AI detectors” appeared, each promising to separate human writing from machine output.

But do AI detectors work?

We ran a simple test to find out, and the results were worse than expected.

Summary

  • Detection tools have overcorrected: after being criticized for low accuracy in 2022, they now err toward false positives. Even Shakespeare has been flagged, and OpenAI discontinued its own detector due to low reliability.
  • Most AI detectors are unreliable and often label human-written content as AI, sometimes with higher confidence than actual AI-generated text.
  • Several tools gave identical results, suggesting they use the same backend despite different branding.

The experiment

We looked up what people actually search for when they want to detect AI-written content. For each keyword, we picked the first non-duplicate organic result.

Then we tested two pieces of text in every tool:

  • AI-generated text: Created using a single, ultra-simple prompt in ChatGPT: “Generate me an article about Nikola Tesla and AI.”
  • Non-AI (Human-written) text: An original article published on our website, covering the connection between Nikola Tesla and AI.

We wanted to see how well these tools could tell the difference.

The results

The table below shows how each tool rated the two texts. The numbers represent the percentage likelihood that the entire text was generated by AI, according to each tool.

| Search keyword | Tool | AI text | Non-AI text |
|---|---|---|---|
| AI detector | QuillBot | 100% AI | 69% AI |
| AI detection tool | GPTZero | 100% AI | 80% AI |
| Detect AI writing | Scribbr | 100% AI | 69% AI |
| Free AI detector | NoteGPT | 74.39% AI | 86.89% AI |
| AI content detector | CopyLeaks | 100% AI | 100% AI |
| ChatGPT detector | ZeroGPT | 74.39% AI | 86.89% AI |
| Detect if text is written by AI | Grammarly | 18% AI | 8% AI |
| Check if essay is AI generated | SciSpace | 76% AI | 69% AI |
| AI detection tool | Undetectable | 88% AI | 72% AI |

As you can see, most tools labeled the AI-generated text with high certainty, often 100 percent. But many of them also gave high AI probabilities to a fully human-written article.

That’s a major red flag.
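To put numbers on that red flag, here is a small sanity check over the scores from the table above. The 50 percent cutoff for “more likely AI than not” is our own choice, not something any of the tools define:

```python
# Scores from the table above: (AI-generated text, human-written text), in "% AI".
scores = {
    "QuillBot": (100.0, 69.0),   "GPTZero": (100.0, 80.0),
    "Scribbr": (100.0, 69.0),    "NoteGPT": (74.39, 86.89),
    "CopyLeaks": (100.0, 100.0), "ZeroGPT": (74.39, 86.89),
    "Grammarly": (18.0, 8.0),    "SciSpace": (76.0, 69.0),
    "Undetectable": (88.0, 72.0),
}

# Tools that rated the fully human-written article as more likely AI than not.
# The 50% cutoff is our own, arbitrary threshold.
false_positives = [tool for tool, (_, human) in scores.items() if human > 50]
print(f"{len(false_positives)}/{len(scores)} tools flagged the human article as AI:")
print(", ".join(false_positives))

# Tools that were MORE confident about the human article than the ChatGPT one.
inverted = [tool for tool, (ai, human) in scores.items() if human > ai]
print("More confident about the human text than the AI text:", ", ".join(inverted))
```

Eight of the nine tools cross that line, and two of them (NoteGPT and ZeroGPT) were actually more confident about the human article than about the ChatGPT one.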

Identical results, shared backend?

Some tools returned exactly the same results for both texts. NoteGPT and ZeroGPT each gave 74.39 percent AI for the ChatGPT article and 86.89 percent AI for the human-written one. QuillBot and Scribbr also gave identical AI scores: 100 and 69 percent.

This strongly suggests that many of these tools are powered by the same underlying detection engine or API. The branding might change, but the core algorithm stays the same.
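For illustration, here is a quick sketch of how to spot those overlaps programmatically, reusing the same score data and grouping tools by the exact pair of results they returned:

```python
from collections import defaultdict

# (AI-text score, human-text score) per tool, as reported in the table above.
scores = {
    "QuillBot": (100.0, 69.0),   "GPTZero": (100.0, 80.0),
    "Scribbr": (100.0, 69.0),    "NoteGPT": (74.39, 86.89),
    "CopyLeaks": (100.0, 100.0), "ZeroGPT": (74.39, 86.89),
    "Grammarly": (18.0, 8.0),    "SciSpace": (76.0, 69.0),
    "Undetectable": (88.0, 72.0),
}

# Tools that return exactly the same pair of scores likely share a backend.
by_result = defaultdict(list)
for tool, pair in scores.items():
    by_result[pair].append(tool)

for pair, tools in by_result.items():
    if len(tools) > 1:
        print(f"{pair}: {', '.join(tools)}")
```

Running this surfaces the two clusters above: QuillBot with Scribbr, and NoteGPT with ZeroGPT.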

Even Undetectable, which claims to combine multiple detectors for improved accuracy, gave results that still aligned closely with the rest.

Overcorrecting into false positives

When these tools first appeared in late 2022, they were often criticized for failing to catch even blatant ChatGPT writing. Now, it seems they’ve flipped the switch and started flagging almost everything as AI, human-written content included.

That might seem like an improvement.

It’s not. It’s overcorrection.

Worse, it creates dangerous false positives. In our test, a well-researched human-written article was marked as AI by most tools.

And this isn’t just happening to blog posts.

Even Shakespeare’s writing has been flagged as AI by modern detectors. If 16th-century verse is “too polished to be human,” what chance does anyone else have?

OpenAI couldn’t build one either

In early 2023, OpenAI released its own AI text classifier to help detect content generated by ChatGPT. Just a few months later, they quietly shut it down, stating it had “low accuracy” and wasn’t reliable for real-world use.

This was the company that built ChatGPT itself.

They couldn’t reliably detect their own model’s output. But somehow, all of the other companies think they can. 🥸

Nothing has changed in two years

A 2023 Stanford HAI study showed that AI detectors are not only inaccurate but biased, frequently flagging writing by non-native English speakers as AI-generated. At the time, researchers warned these tools could reinforce discrimination and cause harm in education and job applications.

Nearly two years later, our results suggest nothing has changed.

The detectors are still unreliable. The biases are still there. And the confidence with which people use them has only grown.

So, do AI detectors actually work?

No. AI detectors don’t work in any trustworthy or consistent way.

💡 Results can be rough signals but should never be used as proof. A detector flagging 80% AI says very little without proper context, version history, or an understanding of the writer.

In high-stakes environments like academia, publishing, or hiring, these tools are dangerous. They look scientific, but behave like guesswork.


AI-generated writing is evolving. So is human writing. The line between the two will only get blurrier. Detectors that claim to know where that line is often end up drawing it in the wrong place.

If you ever hear someone say “This looks like AI, I ran it through a detector”, do them a favor and send them this article.

Let the facts speak for themselves.

By Vladan Ćetojević

Founder @ Adapto
