A local TTS production tool with integrated voice quality scoring. Built for consistency, not just generation — because a voice that sounds different every run is useless in production.
Modern TTS engines can generate impressive speech from text. But ask the same engine to say ten different sentences with the same voice — and you'll get ten slightly different speakers. Pitch shifts, timbre drift, pacing changes. For a single demo, that's fine. For a game with 200+ voice lines that need to sound like the same character, it's a dealbreaker.
AudioGen Studio was built to solve this: generate speech locally, measure consistency scientifically, and only keep what passes quality thresholds.
A 5-metric evaluation framework that replaces subjective "does this sound right?" with quantifiable, reproducible measurements. Based on 80-dimensional MFCC speaker embeddings, the Voice Consistency Score (VCS) answers one question: is this still the same voice?
VCS Dashboard — 99.6% overall score across 10 runs with all 5 sub-metrics visualized.
| Metric | Weight | What It Measures |
|---|---|---|
| Tightness | 25% | How closely clustered are the embeddings? Tight cluster = consistent voice. |
| Adherence | 25% | Does the output match the reference voice? Measures identity preservation. |
| Drift | 15% | Does the voice change over time? Detects gradual quality degradation. |
| Prompt-Robustness | 20% | Does the voice stay consistent across different text inputs? |
| Seed-Stability | 15% | How much does the random seed affect the voice identity? |
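The weighted combination of the five sub-metrics can be sketched as follows. This is a minimal illustration using the table's weights; the function name and the convention that every sub-metric is already a "higher is better" value in [0, 1] are assumptions, not the tool's actual API.

```python
def vcs_score(tightness, adherence, drift, prompt_robustness, seed_stability):
    """Combine the five VCS sub-metrics into one weighted score.

    Each input is assumed to be in [0, 1] with higher = better, so
    'drift' here means drift *stability* (1.0 = no drift at all).
    Weights follow the table above and sum to 1.0.
    """
    weights = {
        "tightness": 0.25,
        "adherence": 0.25,
        "drift": 0.15,
        "prompt_robustness": 0.20,
        "seed_stability": 0.15,
    }
    values = {
        "tightness": tightness,
        "adherence": adherence,
        "drift": drift,
        "prompt_robustness": prompt_robustness,
        "seed_stability": seed_stability,
    }
    return sum(weights[k] * values[k] for k in weights)
```

A perfect voice scores 1.0; dropping any single sub-metric to zero costs exactly that metric's weight.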
AudioGen Studio is a full production pipeline — from text input to mastered audio output, with quality measurement at every step.
Manual text-to-speech with multi-speaker support, 56 voices (EN/DE/IT), 8 built-in presets, seed locking, and real-time preview.
Load a game JSON, auto-detect format, queue all audio items. Step-through or batch processing with SSE progress streaming.
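The SSE progress stream boils down to emitting correctly framed `text/event-stream` messages. A minimal, framework-free sketch of that framing (the function name is illustrative, not taken from the codebase):

```python
import json

def sse_event(data, event=None):
    """Format one Server-Sent Events message.

    The EventSource protocol requires an optional 'event:' line, a
    'data:' line, and a blank line terminating the message.
    """
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```

In a Flask route, yielding such strings from a generator wrapped in a `Response` with `mimetype="text/event-stream"` gives the browser live batch progress.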
Generate custom voices from text descriptions. "A deep, rugged male voice with a slow cowboy drawl" → instant voice, auto-scored via VCS.
Blind comparison with spectrograms and cosine similarity. Data-driven voice selection instead of subjective guesswork.
Automated N-run evaluation with early abort. Tests a voice across multiple generations and recommends: Production Ready, Acceptable, or Needs Work.
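The N-run loop with early abort could look like the sketch below. Only the 92% Production Ready cutoff appears in this document; `generate_fn`, the 0.85 Acceptable threshold, and the 0.80 abort floor are illustrative assumptions.

```python
import numpy as np

def evaluate_voice(generate_fn, reference, n_runs=10, abort_below=0.80):
    """Score a voice across n_runs generations against a reference embedding.

    generate_fn(i) is assumed to return a speaker embedding for run i.
    Aborts early when a run scores below abort_below: a clearly broken
    voice shouldn't burn the remaining generations.
    """
    scores = []
    for i in range(n_runs):
        emb = generate_fn(i)
        sim = float(np.dot(emb, reference) /
                    (np.linalg.norm(emb) * np.linalg.norm(reference)))
        scores.append(sim)
        if sim < abort_below:
            return "Needs Work", scores   # early abort
    mean = float(np.mean(scores))
    if mean >= 0.92:                      # production-ready threshold
        return "Production Ready", scores
    if mean >= 0.85:                      # illustrative middle band
        return "Acceptable", scores
    return "Needs Work", scores
```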
Every generation is recorded: embedding, metadata, audio, prompt. Enables reproducible measurement across runs, sessions, and seeds.
Spectrograms reveal what the ear misses. The A/B test lets you vote blind, then reveals the cosine similarity score — often the numbers disagree with your gut feeling.
A/B Listening Test — blind voting with spectrogram comparison.
Spectrogram grid — 8 runs of the same voice, visually comparing consistency.
Outlier Analysis — 99.8% Trimmed Tightness with 1/10 outlier detected and excluded from scoring.
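A trimmed tightness like the one in the screenshot can be sketched as: score each run's embedding against the centroid, drop statistical outliers, and average the rest. The z-score cutoff and function name are assumptions for illustration.

```python
import numpy as np

def trimmed_tightness(embeddings, z_thresh=2.0):
    """Mean cosine similarity of each run's embedding to the centroid,
    after excluding runs more than z_thresh standard deviations below
    the mean similarity. Returns (score, outlier_indices)."""
    X = np.asarray(embeddings, dtype=float)
    centroid = X.mean(axis=0)
    sims = X @ centroid / (np.linalg.norm(X, axis=1) * np.linalg.norm(centroid))
    mu, sigma = sims.mean(), sims.std()
    if sigma > 0:
        outliers = np.where(sims < mu - z_thresh * sigma)[0]
    else:
        outliers = np.array([], dtype=int)
    kept = np.delete(sims, outliers)
    return float(kept.mean()), outliers.tolist()
```

With nine identical runs and one rogue run, the rogue is flagged and the score reflects only the consistent nine.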
Every generated audio goes through a deterministic post-processing chain. The goal: broadcast-ready output, normalized to -16 LUFS, with no manual editing required.
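The normalization step can be illustrated with a simplified gain calculation. True LUFS per ITU-R BS.1770 adds K-weighting and gating (typically via a dedicated library such as pyloudnorm); this sketch uses plain RMS level in dBFS as a stand-in.

```python
import numpy as np

def normalize_loudness(samples, target_db=-16.0):
    """Scale a mono float signal so its RMS level lands at target_db (dBFS).

    Simplified stand-in for LUFS normalization: real LUFS measurement
    (ITU-R BS.1770) applies K-weighting and gating before averaging.
    """
    rms = np.sqrt(np.mean(np.square(samples)))
    if rms == 0:
        return samples                       # silence: nothing to scale
    current_db = 20.0 * np.log10(rms)
    gain = 10.0 ** ((target_db - current_db) / 20.0)
    return samples * gain
```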
Runs entirely local on consumer hardware. No cloud APIs, no subscriptions, no data leaving the machine.
Python / Flask serving on localhost:5092. Single-file architecture for maximum portability.
VibeVoice (Microsoft, open-source). Q8 model at ~6GB VRAM, 7B Full at ~14GB for higher fidelity.
Qwen3-TTS VoiceDesign (Alibaba, 1.7B params). Text-to-voice generation running alongside TTS on a single GPU.
Resemble Enhance for neural upsampling and denoising. Adds ~10-15s per generation but significantly improves clarity.
80-dimensional MFCC speaker vectors via SciPy/NumPy. Cosine similarity for identity comparison. C0 (the energy coefficient) is skipped so that loudness normalization to -16 LUFS can't skew identity scores.
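The C0-skip comparison reduces to a few lines of NumPy. A minimal sketch (function name assumed; the actual MFCC extraction pipeline is not shown here):

```python
import numpy as np

def identity_similarity(mfcc_a, mfcc_b, skip_c0=True):
    """Cosine similarity between two MFCC speaker vectors.

    C0 encodes overall log-energy, so dropping it keeps the comparison
    invariant to the loudness normalization applied in post-processing.
    """
    a = np.asarray(mfcc_a, dtype=float)
    b = np.asarray(mfcc_b, dtype=float)
    if skip_c0:
        a, b = a[1:], b[1:]                  # drop the energy coefficient
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Two vectors that differ only in C0 score a perfect 1.0: same voice, different loudness.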
Vanilla HTML/CSS/JS — no frameworks. Professional UI with icon-based navigation, spectrograms, and real-time VRAM monitoring.
AudioGen Studio isn't a prototype — it has produced hundreds of voice lines for shipped products, with measurable quality at every step.
39 English, 16 German, 1 Italian — plus unlimited custom voices via Voice Designer.
99.6% peak consistency score achieved across 10 runs. The production-ready threshold is 92%.
Battle-tested on real game productions with narrator, NPC, and combined dialogue tracks.
The VCS framework is designed to become a standalone evaluation tool. The goal: `pip install vcs-voice`, then three lines of code for automated TTS quality evaluation at scale, replacing manual crowd-worker evaluation with reproducible, quantifiable metrics.