The Death of the Demo

Why flashy AI demos don't tell the real story — and why we need measurable benchmarks for LLMs and TTS.

By Liel Villa (@lielvilla)

You don’t need to scroll long on X (formerly Twitter) before running into one of these:

RIP {profession or product}… this AI tool automatically does X/Y/Z!

or

This is crazy…
{Company} just dropped a {product/model name} a few days ago, and it’s mind-blowing!
{One-liner of what it does}
Here are 10 examples:

And of course, the video that follows looks unbelievable. Perfect music, magical output, and zero rough edges.

But nearly three years after ChatGPT launched, and thousands of “this changes everything” demos later, I (and I think many others) have stopped taking them too seriously.


The Reality Behind the Wow

Here’s what actually happens for most of these products once you move beyond the curated demo. While it’s not a scientific measure, my rough estimate from extensive testing is that:
3–10% of the outputs are incredible, 70–80% are mediocre, and 10–30% are just plain bad.

That may still be fine for some creative use cases, where you only need a handful of great results. But for products that need to work reliably 95% of the time, this simply doesn’t hold up.


Where It Breaks Down

At WonderPods💫, we create custom podcast episodes for children, turning any request from your kid into a unique, generated story. That means two AI components have to work almost flawlessly together:

  • Script generation (GPT-5)
  • Narration (ElevenLabs and others)

When I tested almost all of the top-tier text-to-speech models behind those stunning demos, I quickly realized how brittle they can be at scale.

Several recurring problems:

  • Artifacts – subtle distortions or robotic glitches that appear mid-sentence.
  • Volume drift – the narrator’s voice quietly fades or spikes for no reason.
  • Speed changes – the narrator’s voice speed suddenly increases for no reason.
  • Wrong pronunciation – just saying the wrong word here and there.

Here are two examples:

Example 1: Artifacts in Azure HD Neural TTS
Listen for subtle distortions and robotic glitches in this audio clip (Starts at 0:07).

Example 2: Volume Drift in ElevenLabs (Turbo Model)
Here are two cuts from the same recording: in the first, the volume slowly drops; in the second, three minutes later(!), it suddenly jumps back up.

When I reached out to ElevenLabs support about these volume and speed inconsistencies, they suggested two mitigations: increasing the stability value and splitting the narration into shorter segments.

Both help a little, but they come with clear tradeoffs. Higher stability flattens the voice, making it sound more monotonic and less expressive. Shorter files, meanwhile, break continuity and have to be processed one by one, which makes episode creation noticeably slower.

In other words: you can make it better, but only by sacrificing what made the demo sound great in the first place.
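
For illustration, here is a rough sketch of that workaround, not a definitive implementation. It assumes the public ElevenLabs v1 text-to-speech REST endpoint and its voice_settings.stability field; the voice ID, segment size, and stability value are placeholders you would tune yourself.

```python
# Sketch of the suggested mitigation: shorter segments + a higher stability value.
# Assumes the public ElevenLabs v1 REST endpoint; IDs and values are placeholders.
import requests

API_KEY = "YOUR_XI_API_KEY"
VOICE_ID = "YOUR_VOICE_ID"
URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

def split_script(script: str, max_chars: int = 1500) -> list[str]:
    """Break the narration on paragraph boundaries into shorter requests."""
    segments, current = [], ""
    for para in script.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            segments.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        segments.append(current.strip())
    return segments

def synthesize(script: str) -> list[bytes]:
    """Synthesize each segment with a higher stability setting; returns audio chunks."""
    chunks = []
    for segment in split_script(script):
        response = requests.post(
            URL,
            headers={"xi-api-key": API_KEY},
            json={
                "text": segment,
                "voice_settings": {
                    "stability": 0.75,        # steadier, but flatter delivery
                    "similarity_boost": 0.75,
                },
            },
        )
        response.raise_for_status()
        chunks.append(response.content)       # one audio chunk per segment
    return chunks
```

Stitching those per-segment chunks back into one seamless episode is left out here, and that seam is exactly where the continuity tradeoff shows up.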

These issues don’t show up in the demo because the demo is curated. But when you synthesize hundreds of narrations, you start seeing them everywhere.

The root issue? We don’t have standardized ways to measure these inconsistencies.


Why Benchmarks Matter

Large language models at least have benchmarks (MMLU, BIG-Bench, HELM) that give us some way to measure reasoning and factual accuracy.

They’re not perfect: models can be trained to optimize for them, and companies often chase leaderboard scores. Still, they provide a useful baseline, a shared way to compare progress and get a reasonable sense of a model’s quality, even if they don’t capture the full picture.

And in text-to-speech? While some metrics like Mean Opinion Score (MOS) can rate the subjective quality of a short clip, they fail to capture the long-form consistency issues that appear in production. There’s nothing close to the kind of robust, automated benchmarks we see for LLMs. Every launch comes with a few jaw-dropping clips and the usual “human-level” headline, but no consistent evaluation for at-scale reliability.


What TTS Benchmarks Should Look Like

Here’s a set of measurable metrics that could make text-to-speech evaluations more honest and production-ready. These are rough initial thoughts, designed to start a conversation, and would certainly need to be tweaked and refined by the community.

  1. Volume Stability (LUFS Variance): This measures the perceived loudness variance across long segments. Good models keep their volume steady, not fading in and out unexpectedly.

    • The Benchmark: Generate audio and split it into consecutive 10-second segments. Measure the Integrated Loudness of each segment in LUFS (Loudness Units Full Scale), a standard for perceived loudness. The score is the difference in LUFS between the 95th-percentile and 5th-percentile segments; a lower score indicates better stability. (A rough sketch follows this list.)
  2. Speech Rate Consistency (WPM Variation): This tracks words per minute (WPM) to ensure the pacing doesn’t speed up or drag unpredictably, which can be jarring for listeners.

    • The Benchmark: Using a time-stamped transcript, calculate the WPM over rolling 15-second windows. The score is the Coefficient of Variation (standard deviation / mean) of the WPM readings; a lower coefficient indicates more consistent pacing. (Sketched after this list.)
  3. Pronunciation Accuracy (Word Error Rate): This counts every mispronounced, dropped, or “hallucinated” word by comparing the generated audio’s transcript back to the original input text.

    • The Benchmark: Generate audio from a challenging text corpus (including proper nouns, acronyms, and technical jargon) and run a state-of-the-art ASR (Automatic Speech Recognition) model on the output. The score is the Word Error Rate (WER); the target is a WER as close to 0% as possible. (Sketched after this list.)
  4. Synthetic Detection Score (Classifier Accuracy): This objectively measures how “human” the voice sounds by testing how easily an AI model can identify it as synthetic.

    • The Benchmark: Use a standardized, pre-trained AI classifier to distinguish between real human speech and the generated audio. The score is the classifier’s accuracy in identifying the audio as synthetic. The ultimate goal for a TTS model is to achieve a detection score on par with the baseline scores for genuine human recordings when tested against the same classifier.
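
To make these proposals more concrete, here are rough Python sketches of the first three metrics. They are starting points under stated assumptions, not reference implementations. First, volume stability, assuming the soundfile and pyloudnorm packages and simple non-overlapping 10-second windows:

```python
# Volume Stability sketch: spread of per-window loudness (LUFS), p95 minus p5.
# Assumes: pip install numpy soundfile pyloudnorm, and a clip at least one window long.
import numpy as np
import soundfile as sf
import pyloudnorm as pyln

def volume_stability_score(path: str, window_s: float = 10.0) -> float:
    """Lower = steadier perceived loudness across the clip."""
    data, rate = sf.read(path)
    meter = pyln.Meter(rate)                    # ITU-R BS.1770 loudness meter
    hop = int(window_s * rate)
    loudness = []
    for start in range(0, len(data) - hop + 1, hop):
        lufs = meter.integrated_loudness(data[start:start + hop])
        if np.isfinite(lufs):                   # skip fully silent windows (-inf)
            loudness.append(lufs)
    return float(np.percentile(loudness, 95) - np.percentile(loudness, 5))
```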
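
Second, speech-rate consistency. This sketch assumes you already have word-level timestamps from an ASR model or forced aligner; the (word, start_sec, end_sec) tuple format is just an illustration.

```python
# Speech Rate Consistency sketch: coefficient of variation of rolling-window WPM.
# Assumes the clip is longer than one window and word timestamps are available.
import numpy as np

def speech_rate_cv(word_times: list[tuple[str, float, float]],
                   window_s: float = 15.0, hop_s: float = 5.0) -> float:
    """Lower = more consistent pacing. word_times holds (word, start_sec, end_sec)."""
    starts = np.array([start for _, start, _ in word_times])
    total = max(end for _, _, end in word_times)
    wpm = []
    t = 0.0
    while t + window_s <= total:
        words_in_window = np.sum((starts >= t) & (starts < t + window_s))
        wpm.append(words_in_window * 60.0 / window_s)
        t += hop_s
    wpm = np.array(wpm)
    return float(wpm.std() / wpm.mean())
```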
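
Third, pronunciation accuracy, sketched here with openai-whisper as the ASR model and jiwer for the WER computation. Any sufficiently strong ASR stack would work, and the crude text normalization is an assumption about how much formatting noise to ignore.

```python
# Pronunciation Accuracy sketch: WER between the input text and an ASR transcript.
# Assumes: pip install openai-whisper jiwer
import re
import jiwer
import whisper

def _normalize(text: str) -> str:
    """Lowercase and strip punctuation so WER reflects pronunciation, not formatting."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())

def pronunciation_wer(audio_path: str, reference_text: str) -> float:
    """Lower = fewer mispronounced, dropped, or hallucinated words."""
    model = whisper.load_model("base")   # small model for a quick pass; larger ones transcribe more accurately
    hypothesis = model.transcribe(audio_path)["text"]
    return jiwer.wer(_normalize(reference_text), _normalize(hypothesis))
```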

Crucially, none of these metrics are meaningful when run on a single, curated demo clip. To get a true sense of a model’s production-readiness, these benchmarks must be calculated across hundreds or even thousands of diverse audio generations, reflecting real-world usage.
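
As a sketch of what that aggregation could look like: a thin wrapper that runs any per-clip metric (for example, the volume_stability_score above) over a directory of generations and reports the distribution rather than a single cherry-picked number. The directory layout and WAV format are assumptions.

```python
# Corpus-level report sketch: run a per-clip metric over many generations
# and look at the distribution, especially the tail.
import glob
from typing import Callable
import numpy as np

def benchmark_corpus(audio_dir: str, metric: Callable[[str], float]) -> dict:
    scores = [metric(path) for path in sorted(glob.glob(f"{audio_dir}/*.wav"))]
    return {
        "n_clips": len(scores),
        "median": float(np.median(scores)),
        "p95": float(np.percentile(scores, 95)),  # the tail is where products break
        "worst": float(np.max(scores)),
    }

# e.g. benchmark_corpus("generations/", volume_stability_score)
```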


Measuring Instead of Hyping

Generative AI has already reshaped how we work, learn, and create. But if we want it to move from impressive demos to dependable products, we need to start judging it like any production system — with structured, percentile-based benchmarks.

Because the next time you see a viral “RIP designers” or “RIP podcasters” post, the real question isn’t “what can it do?” It’s “how often does it actually do it right?”

The next revolution won’t be televised — it’ll be benchmarked 💪🏼.