Research Project — Deep Dive

AudioGen Studio

A local TTS production tool with integrated voice quality scoring. Built for consistency, not just generation — because a voice that sounds different every run is useless in production.

The Problem

TTS Is Easy. Consistency Is Hard.

Modern TTS engines can generate impressive speech from text. But ask the same engine to say ten different sentences with the same voice — and you'll get ten slightly different speakers. Pitch shifts, timbre drift, pacing changes. For a single demo, that's fine. For a game with 200+ voice lines that need to sound like the same character, it's a dealbreaker.

AudioGen Studio was built to solve this: generate speech locally, measure consistency scientifically, and only keep what passes quality thresholds.

Core Innovation

Voice Consistency Score (VCS)

A 5-metric evaluation framework that replaces subjective "does this sound right?" with quantifiable, reproducible measurements. Based on MFCC speaker embeddings (80-dimensional vectors), the VCS answers one question: is this still the same voice?


VCS Dashboard — 99.6% overall score across 10 runs with all 5 sub-metrics visualized.

Metric              Weight   What It Measures
Tightness           25%      How closely clustered are the embeddings? A tight cluster means a consistent voice.
Adherence           25%      Does the output match the reference voice? Measures identity preservation.
Drift               15%      Does the voice change over time? Detects gradual quality degradation.
Prompt-Robustness   20%      Does the voice stay consistent across different text inputs?
Seed-Stability      15%      How much does the random seed affect the voice identity?

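The aggregation implied by the table is a weighted sum of the five sub-metrics. A minimal sketch, assuming each sub-metric scorer returns a value in [0, 1] (the key names below are illustrative; the weights come from the table):

```python
# Weights per the VCS metric table; sub-metric scoring functions are assumed
# to exist elsewhere and return values in [0, 1].
WEIGHTS = {
    "tightness": 0.25,
    "adherence": 0.25,
    "drift": 0.15,
    "prompt_robustness": 0.20,
    "seed_stability": 0.15,
}

def vcs_score(sub_scores: dict) -> float:
    """Weighted sum of the five sub-metrics, returned as a percentage."""
    assert set(sub_scores) == set(WEIGHTS), "all five sub-metrics required"
    return 100.0 * sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)

# Example: a near-perfect run
vcs_score({
    "tightness": 0.998, "adherence": 0.997, "drift": 0.995,
    "prompt_robustness": 0.994, "seed_stability": 0.996,
})  # → 99.62
```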
Capabilities

What It Does

AudioGen Studio is a full production pipeline — from text input to mastered audio output, with quality measurement at every step.

Free Mode

Manual text-to-speech with multi-speaker support, 56 voices (EN/DE/IT), 8 built-in presets, seed locking, and real-time preview.

Game Mode

Load a game JSON, auto-detect format, queue all audio items. Step-through or batch processing with SSE progress streaming.
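The SSE progress streaming can be sketched as a generator that emits one Server-Sent Events frame per processed queue item. The helper below is an illustration, not the actual implementation; `process` stands in for the real TTS generation step:

```python
import json

def sse_frames(items, process):
    """Yield one Server-Sent Events frame per processed queue item."""
    total = len(items)
    for i, item in enumerate(items, 1):
        process(item)  # the actual TTS generation step
        payload = json.dumps({"done": i, "total": total})
        yield f"data: {payload}\n\n"  # SSE wire format: "data: ..." plus a blank line

# In Flask, the generator would be wrapped as a streaming response:
#   Response(sse_frames(queue, generate_line), mimetype="text/event-stream")
```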

Voice Designer

Generate custom voices from text descriptions. "A deep, rugged male voice with a slow cowboy drawl" → instant voice, auto-scored via VCS.

A/B Listening Test

Blind comparison with spectrograms and cosine similarity. Data-driven voice selection instead of subjective guesswork.

Voice Score Mode

Automated N-run evaluation with early abort. Tests a voice across multiple generations and recommends: Production Ready, Acceptable, or Needs Work.
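The N-run loop with early abort might look like the sketch below. The 92% production-ready threshold comes from the Results section; the abort rule, the "Acceptable" cutoff, and `min_runs` are assumptions for illustration:

```python
def evaluate_voice(run_once, n=10, abort_below=0.5, min_runs=3):
    """run_once() -> per-run consistency score in [0, 1].

    Runs up to n generations; aborts early once enough runs show the
    voice cannot plausibly reach an acceptable mean.
    """
    scores = []
    for _ in range(n):
        scores.append(run_once())
        mean = sum(scores) / len(scores)
        if len(scores) >= min_runs and mean < abort_below:
            return "Needs Work", mean  # early abort: no point finishing
    mean = sum(scores) / len(scores)
    if mean >= 0.92:
        return "Production Ready", mean
    if mean >= 0.80:
        return "Acceptable", mean
    return "Needs Work", mean
```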

Experiment Tracking

Every generation is recorded: embedding, metadata, audio, prompt. Enables reproducible measurement across runs, sessions, and seeds.

Quality Assurance

Measure, Don't Guess

Spectrograms reveal what the ear misses. The A/B test lets you vote blind, then reveals the cosine similarity score — often the numbers disagree with your gut feeling.


A/B Listening Test — blind voting with spectrogram comparison.


Spectrogram grid — 8 runs of the same voice, visually comparing consistency.


Outlier Analysis — 99.8% Trimmed Tightness with 1/10 outlier detected and excluded from scoring.

Audio Pipeline

From Text to Mastered Output

Every generated audio file passes through a deterministic post-processing chain. The goal: broadcast-ready output, normalized to -16 LUFS, with no manual editing required.

TTS Generation → Noise Gate → HPF 80 Hz → Presence Boost 3 kHz → Compressor → Limiter -1 dB → LUFS -16 → Neural Enhance → MP3 128 kbps

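As one concrete stage from the chain, the 80 Hz high-pass filter can be sketched with SciPy. The 80 Hz corner is from the pipeline; the filter order and sample rate are assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def highpass_80(audio: np.ndarray, sr: int = 24000) -> np.ndarray:
    """Remove low-frequency rumble below an 80 Hz corner.

    4th-order Butterworth in second-order-sections form for numerical
    stability; order and sample rate are illustrative assumptions.
    """
    sos = butter(4, 80, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, audio)
```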
Under the Hood

Tech Stack

Runs entirely local on consumer hardware. No cloud APIs, no subscriptions, no data leaving the machine.

Backend

Python / Flask serving on localhost:5092. Single-file architecture for maximum portability.

TTS Engine

VibeVoice (Microsoft, open-source). The Q8 quantized model runs in ~6 GB VRAM; the full 7B model needs ~14 GB for higher fidelity.

Voice Design

Qwen3-TTS VoiceDesign (Alibaba, 1.7B params). Text-to-voice generation running alongside TTS on a single GPU.

Enhancement

Resemble Enhance for neural upsampling and denoising. Adds ~10-15s per generation but significantly improves clarity.

Embeddings

80-dimensional MFCC speaker vectors via SciPy/NumPy, compared by cosine similarity for identity. A custom C0-skip drops the energy coefficient, which carries no information after LUFS normalization.
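The identity comparison itself reduces to a cosine similarity between embedding vectors. A minimal sketch of that step, with the embedding extraction omitted (the function name and signature are illustrative):

```python
import numpy as np

def cosine_identity(emb_a: np.ndarray, emb_b: np.ndarray, skip_c0: bool = True) -> float:
    """Cosine similarity between two MFCC speaker embeddings.

    C0 encodes overall energy; after LUFS normalization it is
    uninformative, so it is skipped by default.
    """
    a, b = (emb_a[1:], emb_b[1:]) if skip_c0 else (emb_a, emb_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```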

Frontend

Vanilla HTML/CSS/JS — no frameworks. Professional UI with icon-based navigation, spectrograms, and real-time VRAM monitoring.

Python · Flask · PyTorch · CUDA · SciPy · NumPy · VibeVoice · Qwen3-TTS · Resemble Enhance · MFCC

Results

Production-Tested

AudioGen Studio isn't a prototype — it has produced hundreds of voice lines for shipped products, with measurable quality at every step.

56 Voices

39 English, 16 German, 1 Italian — plus unlimited custom voices via Voice Designer.

VCS 99.6%

Peak consistency score achieved across 10 runs. Production-ready threshold is 92%.

200+ Lines per Game

Battle-tested on real game productions with narrator, NPC, and combined dialogue tracks.

What's Next

Enterprise Vision

The VCS framework is designed to become a standalone evaluation tool. The goal: pip install vcs-voice — three lines of code for automated TTS quality evaluation at scale, replacing manual crowd-worker evaluation with reproducible, quantifiable metrics.
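As a purely hypothetical sketch of those three lines (the vcs_voice package and its API do not exist yet; every name below is invented to illustrate the vision):

```
# Illustrative pseudocode only — nothing here ships today.
import vcs_voice
report = vcs_voice.evaluate(tts_fn=my_tts_engine, prompts=test_sentences, runs=10)
print(report.overall_score, report.verdict)
```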
