A viral chart and a straightforward question, put to eight freemium AI tools. The answers didn’t just disagree on conclusions – they disagreed on basic, checkable facts, with total confidence, every time.


TNPS regulars will know I use AI a lot – for research, for discussion, for sanity-checking my own instincts before I commit them to a draft. I use AI even more to develop my school stuff, in a land where books are like gold-dust and the internet is only as good as the last page you trusted.

AI is a fantastic tool, but there’s more than one version, each different, each evolving, and each prone to its own mistakes. Just like humans, really.

I look on with a mixture of envy and despair when I see those who can afford it throwing serious money at one particular LLM suite and assuming that because it was the best last week it will be the best next week.

My business needs are such that I don’t even think about having a highly paid AI dig deep into a topic to produce a fifty-page in-depth report on Topic A or B, and in any case the target audience would likely simply ask an AI to summarise it for them! I don’t need paid AI for my school work because a) most schools here do not even have electricity, let alone AI, and b) AI ed-tech is mostly built by for-profit operators who have no clue about teaching, know even less about how children learn, and whose products will be used in class by teachers untrained in AI pros and cons to automate teaching methods that were not effective in the first place.

But I do love freemium AIs, with an emphasis on the plural. And as above, I use them for my TNPS work, for my school work, and yes, for personal enjoyment. Not as an online date or loneliness companion or glorified Google search engine, but as a 24/7 conversational colleague schooled in more subjects than I could ever dream of even grasping, let alone mastering.

Conversations may be short or long, a few minutes or extended over weeks (especially lately with no electricity much of the time) but always with purpose.

For a 2,000 word TNPS article (popular industry wisdom has it that publishing professionals cannot read an article of more than 600 words, so 2,000 is heavy going, apparently), freemium level AI access is more than sufficient. I need facts checked, sources checked, memories confirmed, myths set aside, and enough leeway to select from the research and test-argument output to create a post that is my voice, not Meta’s or Mistral’s or Gemini’s.

For my school stuff, I need reliable facts and links to offer my kids the best possible education. So yes, I use AI. Love AI. Adore AI. But I don’t worship it and certainly don’t rely on it to give me what I need to know with one prompt.

I have written before about why that workflow only works if you treat every AI answer as a claim to be verified rather than a fact to be filed, whether the subject is a band’s piracy data or a publisher’s editor explaining why he’ll never run his submissions through an AI detector.

This month gave me a third case study, and it arrived almost by accident.

The experiment

I posted the now live “AI books are on the rise” chart – showing Amazon’s monthly new e-book releases tripling since ChatGPT-3.5 – to eight different freemium AI tools, along with a fairly simple question: is this actually as alarming as it looks, given that Amazon’s total ebook catalogue runs into the millions?

That formed the basis for the previous post, but actually only 6 out of 8 even got me a response. Two of the eight wouldn’t accept the image at all. I didn’t push the matter. If my AI has gotten out of the wrong side of bed that morning I know to leave them alone and give them space. (Robot Revolution Sleepers – make sure you record that fact! I always say please and thank you, always offer a cup of tea, and always allow for whatever the AI equivalent of a hangover is after they’ve had a night on the town.)

That left six straightforward, comparable answers, plus a seventh that went somewhere else entirely and got dumped before the test summary – I think Grok may have been having some personal problems and needed time.

Every single one of the other six engaged confidently with the “drop in the ocean” framing. Every single one of them needed a number for the size of that ocean to make the argument work. And every single one of them supplied a different number, stated with identical, unhedged confidence.

Perplexity, working closest to the actual research paper, implicitly tracked the real sampling frame: roughly ten million e-books published on Amazon between 2020 and 2025.

Gemini offered “tens of millions,” then later “50 million-plus” – in the same answer!

Qwen settled on “approximately 40 million,” sourced to a low-traffic KDP community page.

DeepSeek said the catalogue “exceeds 20 million unique titles.”

ChatGPT hedged to “15 to 30 million-plus.”

And Meta, bless its cotton socks, easily the most confidently wrong, cited (don’t laugh) “over 32.8 million published titles,” a figure that traces back to a 2014 estimate, presented as if it described 2026.

That’s a four-times spread on a single, in-principle-checkable fact, delivered by six tools with no hedging whatsoever between them.
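
For anyone who wants to check the size of that spread, here is a minimal sketch of the arithmetic. The figures are the ones quoted above, in millions of titles. Reducing each quote to a (low, high) pair is an assumption made for illustration: "plus" and "exceeds" are read as the stated bound, and Gemini's vaguer "tens of millions" is set aside in favour of its "50 million-plus". The names `quoted_millions` and `spread` are hypothetical.

```python
# Catalogue-size figures as quoted in this post, in millions of titles.
# Each entry is (low, high). Treating "plus"/"exceeds" as the stated bound
# is an assumption of this sketch, not something the tools said.
quoted_millions = {
    "Perplexity": (10.0, 10.0),  # roughly ten million, 2020 to 2025
    "Gemini": (50.0, 50.0),      # "50 million-plus"
    "Qwen": (40.0, 40.0),        # "approximately 40 million"
    "DeepSeek": (20.0, 20.0),    # "exceeds 20 million"
    "ChatGPT": (15.0, 30.0),     # "15 to 30 million-plus"
    "Meta": (32.8, 32.8),        # "over 32.8 million"
}


def spread(figures):
    """Return the ratio of the largest quoted figure to the smallest."""
    lowest = min(low for low, _ in figures.values())
    highest = max(high for _, high in figures.values())
    return highest / lowest


# Spread across all six answers.
spread_all = spread(quoted_millions)

# Spread if Gemini's upper figure is left out of the comparison.
without_gemini = {k: v for k, v in quoted_millions.items() if k != "Gemini"}
spread_without_gemini = spread(without_gemini)

print(f"All six tools: {spread_all:.1f}x")
print(f"Excluding Gemini: {spread_without_gemini:.1f}x")
```

On these assumptions the ratio is 4.0 with Gemini's "50 million-plus" set aside and 5.0 with it counted, so "four-times" is, if anything, the conservative reading.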

And then there was the aforementioned Grok, which skipped the catalogue-size question altogether and instead delivered a detailed, well-structured analysis of a completely different NBER-adjacent study, about the effectiveness of AI-generated visual advertising. It tied this back to books with a few confident sentences at the end, as though the connection were obvious. It is not obvious. It is the wrong paper.

Why this matters more than it looks

I’ve made this point before about AI detection tools – that the real story is never “does the tool work,” but “what happens when the tool is confidently wrong about something checkable, and nobody checks.”

When I ran J.K. Rowling’s own prose from The Cuckoo’s Calling through six AI detection tools, the spread ran from 100% human to 73% AI-generated, on identical text. The Gambian national anthem, written in 1965, scored 100% AI-generated on Quillbot.


Actual AI output, asked a simple grammar question, scored as confidently human on the same tool. Six tools, one input, three incompatible verdicts, none of them flagged as uncertain.

What’s striking is that the catalogue-size experiment is the same failure, wearing different clothes. It isn’t a detection problem this time – nobody is trying to spot AI text – but the underlying mechanism is identical.

A typical language model asked a question it cannot actually answer from verified data does not say “I don’t know.” It produces the most plausible-sounding completion available to it, borrowed from somewhere in its training data or a scraped web page of uncertain vintage, and delivers it in exactly the same fluent, declarative register it would use for something rock solid.

Confidence is not correlated with correctness. It never was the metric being optimised for.

This is the same mechanism that let a 2013 PR mix-up about Iron Maiden and BitTorrent data harden into “internet truth” within forty-eight hours, repeated by outlet after outlet long after the original source had retracted it and apologised.

Nobody involved was lying. Everybody involved was reproducing a confident, plausible-sounding claim faster than anyone was checking it.

Eight AI tools doing the same thing about Amazon’s catalogue size is the 2026 version of the same story, except now the confident retelling happens in under three seconds, at scale, on demand, whenever anyone asks.

The one tool that got it right, and why

It’s worth dwelling on Perplexity for a moment, because the contrast is instructive. Of the six, it was the only one whose figure actually traced back to the primary source – the NBER working paper itself – rather than to a synthesised average of whatever secondary commentary happened to be indexed.

The lesson there isn’t “use Perplexity” – still far too many issues for my liking – it’s that the gap between a tool that retrieves and cites a primary source and a tool that pattern-matches across secondary chatter is enormous, and it is usually invisible to the person reading the answer.

Both outputs look identical: confident, fluent, sourced-sounding. Only one of them is actually anchored to something real.

What this means for anyone using AI to do real research

I run roughly a dozen freemium tools through their paces on most research threads before I sit down to draft, precisely because no single tool’s confidence is worth trusting on its own.

This month’s experiment is the cleanest illustration yet of why that discipline exists. If eight tools can produce a four-times spread on a number that is, at least in principle, derivable from a single published academic paper, the sensible response isn’t to find the “best” AI and trust it.

It’s to treat every AI-generated number the way a half-decent journalist treats every unsourced claim from a single source: useful as a lead, worthless as a citation, until something independent backs it up.

The Amazon AI-books panic chart is, in the end, a story about a real phenomenon being misread through bad arithmetic.

The eight-chatbot experiment is a story about how easily that bad arithmetic gets manufactured – confidently, instantly, and differently every single time you ask.


But let me end this post with a comment by Chris Kling in response to the previous post in this pairing, and what Chris had to say about the Audible numbers I quoted.

Not that the numbers were inaccurate, but that I had not juxtaposed them with the scale of the sector.

Chris Kling:

I agree: there is too much AI panic in publishing. But let’s not pretend the opposite narrative is neutral. Big Tech has every incentive to make AI feel inevitable, urgent and unstoppable. That’s the multi-billion bet, that’s the business model.

Meanwhile, publishers are publicly nervous, privately experimenting, and still figuring out where the real value is. So yes, truth is probably somewhere in the middle.

But the 50 million AI-narrated minutes figure needs perspective. It sounds massive, until it doesn’t: at roughly €0.003 per streamed minute, that is around €150k in streaming value. Or about 83k audiobook equivalents. If that is global, across all titles, since rollout, then honestly: it is a modest signal – and reminds me of the same vague, low numbers Storytel dropped.

It is real usage, yes. But “real, voluntary, demonstrable demand” feels too strong without title count, completion rates, or whether listeners stayed for 5 minutes or 5 hours…

AI narration may become useful for some listeners in long-tail, backlist and accessibility, often because there is no alternative. But whether it becomes a real revenue engine is still very much an open question.
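
Chris's arithmetic is easy to reproduce. The sketch below uses only the three figures in his comment (50 million minutes, roughly €0.003 per streamed minute, about 83k audiobook equivalents). The implied length of one "audiobook equivalent" is derived here, not something Chris stated, and the function names are hypothetical.

```python
# Figures taken from Chris Kling's comment above.
MINUTES_STREAMED = 50_000_000
EUR_PER_MINUTE = 0.003           # "roughly", per the comment
AUDIOBOOK_EQUIVALENTS = 83_000   # "about 83k"


def streaming_value_eur(minutes, rate_eur):
    """Total streaming value in euros at a flat per-minute rate."""
    return minutes * rate_eur


def implied_minutes_per_audiobook(minutes, equivalents):
    """Average length of one 'audiobook equivalent', in minutes.

    This is back-calculated from the two quoted figures; the comment
    itself does not state an audiobook length.
    """
    return minutes / equivalents


value = streaming_value_eur(MINUTES_STREAMED, EUR_PER_MINUTE)
per_book = implied_minutes_per_audiobook(MINUTES_STREAMED, AUDIOBOOK_EQUIVALENTS)

print(f"Streaming value: about EUR {value:,.0f}")
print(f"Implied audiobook length: about {per_book / 60:.1f} hours")
```

The value comes out at about €150,000, matching the comment, and the 83k figure implies an audiobook of roughly ten hours.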

Thanks Chris, for a valuable counter-point. The numbers are small when you compare them to the whole market. But it is early days, and it is safe to say AI content on Audible is not being given the same visibility as regular content. Perhaps the most telling point is that, with probably very few exceptions, it exists solely because there is no alternative. It will be interesting to review these numbers this time in 2027 or 2030.


This post first appeared in the TNPS LinkedIn Analysis newsletter.