Production Tool — Deep Dive

AudioGen Studio

A local multi-track audio production studio. Voice, music, and SFX generated and composed on a single timeline — written for one operator producing audio drama at scale. PySide6 desktop app, fully local inference on consumer GPUs, no cloud APIs, no subscriptions.

The Test Case

NOIR CALL — End-to-End in One Tool

A 16-line audio drama scene: Viktor, a tired NY detective, calls Laura from the streets of Manhattan. Two voices, a telephone filter on her side, rain and city atmosphere, slow jazz vinyl bleeding through her apartment, and a final outro of his footsteps walking off. Every asset on this page was generated, processed, and arranged inside AudioGen Studio; no external editor was involved.

NOIR CALL — Final Mix
Viktor & Laura · ~75s · Voice + SFX + Music + FX + Composer

The Pipeline

Seven Steps, One App

No tab-hopping between tools. The same desktop app moves the project from "voice description prompt" to "mastered MP4 with audio-reactive visuals" — and back, when you want to iterate.

Voice Designer · Script · SFX Studio · Music Studio · FX Studio · Composer · Export

STEP 01

Voice Designer

Describe a voice in plain English, get back a usable speaker identity. Qwen3-TTS turns a prose description ("A weathered American male voice in his early forties, deep baritone with a slight gravel from too many late nights…") plus a 25–30 second sample text into a custom voice that gets saved into the Voice Library with language, gender, and bucket. The library on the right shows speakers organized by project — TheBlackVeil, Charlie_Project, NOIR CALL — and you assign saved voices straight to the Script tab's Speaker 1 / Speaker 2 slots.

Voice Designer view — prose description, sample text, generation progress, library tree, speaker assignment

Voice Designer — generating "Laura" from a prose description, ready to assign to Speaker 2 in the Script tab.

Prose to Voice

Qwen3-TTS generate_voice_design converts a written voice description into a unique speaker identity — no reference recording required.

Sample Text Test

~25–30 second sample text confirms the voice on diverse vowels and consonants before saving — phonetic spread is what makes a voice clone hold up under variation.

Voice Library Buckets

Speakers saved by language (EN/DE), gender (Male/Female), and project bucket. The same Voice Library shows up in the Script tab, FX Studio, and Composer.

Seed Lock

Reproducible designs via locked seed + language hint. Same prompt + same seed = same voice across sessions, projects, and machines.

STEP 02

Script — Multi-Speaker Conversation

Paste a script, assign saved voices to speakers, pick a Chatterbox engine variant, and generate the whole conversation line by line. The Mood Pill bar applies inline [soft] / [firm] / [urgent] / [whisper] tags that Chatterbox's mood-aware variant respects per line. Auto-generate-then-approve walks through the conversation and lets the operator accept, regenerate, or re-mood any single line without re-running the rest.

Script tab: engine selector, mood pill bar, multi-line script, per-line approval queue, speakers panel

Script tab — NOIR CALL loaded with both speakers assigned, mood pills above the text, per-line approval queue below.

Chatterbox Variants

Three engine flavors: Chatterbox EN (best English fidelity), Multilingual (German/other), Turbo (~2× speed for batch). Switch per session; models lazy-load on first use and idle-unload after 5 minutes.

Mood Pill Bar

Inline [soft] / [firm] / [urgent] / [whisper] / [neutral] tags driven by Chatterbox's mood-aware path with dwain-barnes-tuned exaggeration values.
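To make the tag handling concrete, here is a minimal sketch of how a leading mood tag could be split from a script line. The function name, the regex, and every exaggeration value in the table are assumptions for illustration; the tuned values the app actually uses are not given on this page.

```python
import re

# Hypothetical mood -> exaggeration table. The numbers are placeholders,
# not the tuned values the page refers to.
MOOD_EXAGGERATION = {
    "neutral": 0.5,
    "soft": 0.3,
    "firm": 0.6,
    "urgent": 0.8,
    "whisper": 0.2,
}

MOOD_TAG = re.compile(r"^\s*\[(\w+)\]\s*")


def parse_mood(line: str, default: str = "neutral") -> tuple[str, str]:
    """Split a leading [mood] tag from a script line.

    Unknown tags stay in the text and the default mood is used.
    """
    match = MOOD_TAG.match(line)
    if match and match.group(1).lower() in MOOD_EXAGGERATION:
        return match.group(1).lower(), line[match.end():].strip()
    return default, line.strip()


# Illustrative line, not taken from the NOIR CALL script.
mood, text = parse_mood("[urgent] Laura, pick up. It's Viktor.")
exaggeration = MOOD_EXAGGERATION[mood]
```

Leaving unknown tags in the text means a paralinguistic tag such as [sigh] would pass through to the engine untouched instead of being swallowed as a mood.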

Auto-Generate + Approve

Walks through the script line by line. Approve, regenerate with new seed, or re-mood any single line without touching the rest of the session.
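The per-line queue can be pictured as a small state machine. The sketch below is an assumption about how such a queue might be modeled, not the app's actual data model; `generate` is a stand-in that only counts takes instead of calling a TTS engine.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LineState(Enum):
    PENDING = "pending"
    GENERATED = "generated"
    APPROVED = "approved"


@dataclass
class ScriptLine:
    speaker: str
    text: str
    mood: str = "neutral"
    seed: int = 0
    state: LineState = LineState.PENDING
    take: int = 0  # how many times this line has been rendered


def generate(line: ScriptLine) -> None:
    """Stand-in for the TTS call: records a new take and nothing else."""
    line.take += 1
    line.state = LineState.GENERATED


def regenerate(line: ScriptLine, new_seed: Optional[int] = None,
               new_mood: Optional[str] = None) -> None:
    """Re-render one line with a new seed and/or mood; other lines are untouched."""
    if new_seed is not None:
        line.seed = new_seed
    if new_mood is not None:
        line.mood = new_mood
    generate(line)


def approve(line: ScriptLine) -> None:
    if line.state is not LineState.GENERATED:
        raise ValueError("only a generated line can be approved")
    line.state = LineState.APPROVED


session = [ScriptLine("Viktor", "Line one."), ScriptLine("Laura", "Line two.")]
for line in session:
    generate(line)
approve(session[0])
regenerate(session[1], new_seed=1234, new_mood="soft")
```

Because each line carries its own seed, mood, and state, re-rolling Laura's line leaves Viktor's approved take exactly as it was.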

Paralinguistics

Inline [whispers] / [sigh] / [chuckle] tags, supported by the Paralinguistic engine variant, humanize lines that would otherwise sound staged.

STEP 03

SFX Studio

MMAudio for short concrete sound effects (phone pickup, taxi pass on wet asphalt, footsteps), AudioLDM2 in the same pipeline for longer atmosphere loops. Single-element prompts work better than stacked ones — "wet asphalt, tire swoosh" beats "rain AND traffic AND horn AND footsteps" every time. Generated clips save into the SFX Library with category tags (Ambience, Foley, Impact, UI) and drop straight onto Composer tracks.

SFX Studio — prompt input, negative prompt, duration/steps/guidance/seed parameters, waveform preview, save to library

SFX Studio — generating a tavern atmosphere clip with MMAudio, ready to save as a categorized library entry.

MMAudio

Best for short, concrete one-shots: phone_pickup, taxi_pass_wet, vinyl_crackle. 4–8 seconds, 25 steps default, guidance 4.5.

AudioLDM2

Sister engine in the same view for longer atmosphere loops — rain on rooftops, distant traffic, room tone. Hybrid-stack with MMAudio for layered scenes.

Prompt Discipline

Single-element prompts outperform stacked ones. "Wet asphalt, tire swoosh" yields clean output; "rain AND traffic AND horn AND footsteps" produces mud.

Library + Categories

SFX saved with category tags (Ambience, Foley, Impact, UI) and per-project buckets. Same library is drag-droppable from the Composer's SFX panel.

STEP 04

Music Studio

ACE-Step XL Turbo via Diffusers — MIT-licensed weights, 20× realtime, runs comfortably on a consumer GPU. Six modes: Generate a fresh track, Extend an existing one, Add Intro / Add Outro, Repaint a drag-selected region without touching the rest, or Cover an existing track in a different genre. Style presets ("Lo-Fi Cafe", "Cinematic Build", etc.) prefill prompt + parameters with proven defaults that you edit on top. Lyrics field is optional — leave blank for instrumental, add [verse] / [chorus] / [bridge] tags for sung output.

Music Studio — style preset, genre prompt, negative prompt, lyrics, mode selector, parameters panel, waveform preview

Music Studio — generating a lo-fi hip-hop loop with vinyl crackle, BPM 80, C minor. Preset "Lo-Fi Cafe" prefilled the defaults; the rest is overridden on top.

ACE-Step XL Turbo

MIT-licensed, 20× realtime on a 16 GB GPU. Loaded as a Diffusers pipeline (no ComfyUI dependency), eager dtype + bf16 with partial offload.

Six Modes

Generate · Extend · Add Intro · Add Outro · Repaint a drag-selected region · Cover an existing track in a new genre.

Style Presets

Curated prompt + parameter pairs that you start from and edit on top — "Lo-Fi Cafe", "Cinematic Build", "8-bit Action Theme", and more. Faster than building prompts from scratch every time.

Open in FX Studio

One-click handoff from the Music Studio's saved track straight into FX Studio's Music tab for mastering — no file management in between.

STEP 05

FX Studio — Voice

Per-speaker and per-line effect chains for the whole conversation. Pick a session from the dropdown, see every line organized by speaker on the left, build an effect chain on the right. Apply a chain to an entire speaker (Laura gets the Telephone-Filter — High-Pass 300 Hz, Low-Pass 3400 Hz, slight saturation, compressor, gain trim — for the whole call) or override one specific line for emphasis. Compare button A/Bs original vs. processed; the Export FX Combined button writes the rendered version back as a session-bound artifact.

FX Studio voice — session selector, per-speaker line tree, effect chain panel with High-Pass / Low-Pass / Compressor, properties for cutoff and order

FX Studio — Laura selected, Lo-Fi (telephone) preset loaded, chain ready to apply across all 7 of her lines.
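For readers who want to hear what the 300 Hz / 3400 Hz band does, here is a rough pure-Python sketch of just the two filters. It assumes second-order biquads built from the widely used Audio EQ Cookbook formulas; the app's real filter design, and the saturation, compressor, and gain-trim stages, are not shown and may differ.

```python
import math


def biquad_coeffs(kind, cutoff_hz, sample_rate, q=0.7071):
    """Audio EQ Cookbook coefficients for a 2nd-order filter, normalized by a0."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    if kind == "highpass":
        b0, b1, b2 = (1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2
    elif kind == "lowpass":
        b0, b1, b2 = (1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    a0, a1, a2 = 1 + alpha, -2 * cos_w0, 1 - alpha
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)


def biquad(samples, coeffs):
    """Direct Form I biquad over a list of float samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out


def telephone_band(samples, sample_rate=44100.0):
    """High-pass at 300 Hz followed by low-pass at 3400 Hz."""
    hp = biquad_coeffs("highpass", 300.0, sample_rate)
    lp = biquad_coeffs("lowpass", 3400.0, sample_rate)
    return biquad(biquad(samples, hp), lp)


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def band_gain(freq_hz, sample_rate=44100.0, seconds=0.5):
    """Gain of a sine at freq_hz through the band, measured after the transient."""
    n = int(sample_rate * seconds)
    dry = [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
    wet = telephone_band(dry, sample_rate)
    half = n // 2
    return rms(wet[half:]) / rms(dry[half:])
```

Calling `band_gain(1000)` should land close to 1.0, while `band_gain(50)` and `band_gain(12000)` should come out strongly attenuated, which is the narrow, boxy character the preset is after.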

Per-Speaker FX

Apply an effect chain to every line of one speaker at once. Telephone-filter on the partner in a phone-call scene, broadcast polish on the narrator, no per-line manual work.

Per-Line Override

Does a single line need an emphasis push? Apply presence_boost at 2 kHz to just that line. The tree shows * markers where overrides live, so you can see at a glance what is customized.
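One way to think about the speaker-versus-line precedence is a simple lookup with a fallback. The chain representation, the dictionary names, and the rule that an override replaces the speaker chain outright (rather than appending to it) are all assumptions made for this sketch.

```python
# Hypothetical chain representation: an ordered list of (effect_name, params).
Chain = list[tuple[str, dict]]

TELEPHONE: Chain = [
    ("high_pass", {"cutoff_hz": 300}),
    ("low_pass", {"cutoff_hz": 3400}),
]

speaker_chains: dict[str, Chain] = {"Laura": TELEPHONE}

# Keyed by (speaker, line_index); only customized lines appear here.
# Assumption: an override replaces the speaker chain, so it repeats the
# telephone filters and then adds the presence boost.
line_overrides: dict[tuple[str, int], Chain] = {
    ("Laura", 3): TELEPHONE + [("presence_boost", {"freq_hz": 2000})],
}


def effective_chain(speaker: str, line_index: int) -> Chain:
    """A line override wins; otherwise use the speaker chain, else no FX."""
    override = line_overrides.get((speaker, line_index))
    if override is not None:
        return override
    return speaker_chains.get(speaker, [])


def tree_label(speaker: str, line_index: int) -> str:
    """Mark overridden lines with '*', mirroring the tree marker."""
    marker = "*" if (speaker, line_index) in line_overrides else ""
    return f"{speaker} #{line_index}{marker}"
```

Storing only the customized lines keeps the default path cheap: every other line of Laura's resolves to the same shared chain object.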

Voice Presets

Curated chains: Voice Polish (Subtle), Voice Polish (Broadcast), Lo-Fi (telephone), Whisper Close-Mic, Radio. Save your own with one click.

Compare A/B

Click Compare to flip between original and processed in place — same waveform window, no separate player to manage.

STEP 06

FX Studio — Music & Stems

Same FX Studio shell, Music tab. Mastering chains use a different preset category — "Music Master Streaming (-14 LUFS · Spotify/Apple)" is the daily-driver default with High-Pass 30 Hz, two Presence Boosts, Compressor, Limiter, LUFS Normalize. Below the music library: Separate Stems runs Demucs at ~6× realtime, splitting any track into drums / bass / vocals / other. The stems become children of the master track in the library and can be dropped into the Composer as four parallel lanes for independent volume / FX / muting — or remixed back into a new master.

FX Studio music tab — music library with stems hierarchy, master streaming preset chain, separate stems / remix master / drop to composer buttons

FX Studio Music tab — Jazz_1940_Intro split into 4 Demucs stems (bass / drums / other / vocals), Music Master Streaming chain (6 effects) applied.

Demucs Stem Separation

Any music track → drums / bass / vocals / other at ~6× realtime. Stems show up as children of the master in the library, ready for independent processing.

Music Mastering Presets

Eight presets, including streaming (-14 LUFS, Spotify/Apple), cinematic (-16 LUFS), broadcast, and lo-fi character chains. Pick one, tweak the chain, save your own.
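The arithmetic behind a LUFS target is short enough to show. This sketch covers only the final gain step and assumes the integrated loudness has already been measured elsewhere; the measurement itself (K-weighting and gating) is not shown, and the function names are made up for illustration.

```python
def lufs_gain_db(measured_lufs: float, target_lufs: float = -14.0) -> float:
    """Gain in dB that moves a track from its measured loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain_db: float):
    """Scale every sample by the linear equivalent of gain_db."""
    factor = db_to_linear(gain_db)
    return [s * factor for s in samples]


# A track measured at -20 LUFS needs +6 dB to reach the -14 LUFS streaming target.
gain = lufs_gain_db(-20.0)
```

A +6 dB gain is roughly a doubling of amplitude, which is why a limiter usually sits before the normalize stage in a mastering chain: the boost can otherwise push peaks past full scale.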

Drop Stems → Composer

One button drops all four stems onto the Composer timeline as parallel tracks — mute the drums, duck just the vocals under voice lines, layer effects per stem.

Remix Master from Stems

After per-stem edits, remix back into a single mastered file. Original stays in the library; the remix becomes a new entry — non-destructive workflow.
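As a rough illustration of the non-destructive remix, the sketch below sums stems into a new buffer and leaves the inputs alone. The function name, the per-stem gain and mute parameters, and the plain-list sample format are assumptions; the app presumably works on real audio arrays and applies its mastering chain afterwards.

```python
def remix_from_stems(stems, gains_db=None, muted=()):
    """Sum stems sample by sample into a new list.

    stems: dict of stem name -> list of float samples.
    gains_db: optional dict of stem name -> gain in dB (default 0 dB).
    muted: stem names to leave out of the mix.
    """
    gains_db = gains_db or {}
    length = max(len(samples) for samples in stems.values())
    mix = [0.0] * length
    for name, samples in stems.items():
        if name in muted:
            continue
        factor = 10.0 ** (gains_db.get(name, 0.0) / 20.0)
        for i, sample in enumerate(samples):
            mix[i] += factor * sample
    return mix


stems = {"drums": [0.5, 0.5], "bass": [0.25, -0.25]}
full_mix = remix_from_stems(stems)
no_drums = remix_from_stems(stems, muted={"drums"})
```

Returning a fresh list is what keeps the original master and stems intact, so the remix can be stored as a new library entry.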

STEP 07

Composer — Multi-Track Timeline

Where everything comes together. Voice sessions auto-sync from the Script tab as two parallel speaker tracks with proper gaps between lines. SFX clips drag from the library onto their own lanes. Music sits on a dedicated music lane (or four lanes if you dropped stems). Per-track color, type icon, drag-handle rename, individual volume. Auto-snap to clip edges, BPM beats, and markers. Auto-crossfade on overlapping music / SFX / ambience clips. Multi-select with Alt+Drag, ripple-edit with Alt+Shift+Drag. Two export targets: a single WAV / MP3 of the audio mix, or an MP4 with the audio plus audio-reactive visuals (spectrum bars, waveform pulse) overlaid on a still cover image.

Composer view — NY skyline cover, multi-track timeline with voice / SFX / music lanes, video FX panel for audio-reactive overlays

Composer — NOIR CALL session laid out across voice lanes (Viktor / Laura), SFX track (footsteps, taxi), music lane with the vinyl loop. Video FX panel on the right configures audio-reactive overlays for MP4 export.

Auto-Sync from Script

Picks up the latest Script session, builds two voice tracks with proper gaps, drops every line as a clip in order. Zero manual timeline assembly for dialogue.
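The layout step is essentially a running cursor. Here is a minimal sketch, assuming each line's duration is known after generation and that one track per speaker is used; the `Clip` class, the function name, and the gap value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    track: str       # one track per speaker
    start: float     # seconds from timeline start
    duration: float  # seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def layout_dialogue(lines, gap=0.4):
    """Place lines back to back in script order with a fixed gap between them.

    lines: iterable of (speaker, duration_seconds).
    """
    clips, cursor = [], 0.0
    for speaker, duration in lines:
        clips.append(Clip(track=speaker, start=cursor, duration=duration))
        cursor += duration + gap
    return clips


clips = layout_dialogue(
    [("Viktor", 2.0), ("Laura", 1.5), ("Viktor", 3.0)], gap=0.5
)
```

Each clip lands on its speaker's track, but the cursor is shared, so the two lanes interleave like a real conversation instead of both starting at zero.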

Smart Snap

Clip edges, markers, and BPM beats all snap automatically. Disable per-operation by holding Shift while dragging.
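Snapping can be sketched as "find the nearest candidate, and use it only if it is close enough". The threshold value, the function signature, and the `snapping` flag standing in for the Shift key are assumptions for this example.

```python
def snap_time(t, clip_edges, markers, bpm=None, threshold=0.05, snapping=True):
    """Return t snapped to the nearest clip edge, marker, or beat.

    If nothing lies within `threshold` seconds, or snapping is off
    (for example while Shift is held), t is returned unchanged.
    """
    if not snapping:
        return t
    candidates = list(clip_edges) + list(markers)
    if bpm:
        beat = 60.0 / bpm
        candidates.append(round(t / beat) * beat)  # nearest beat on the grid
    if not candidates:
        return t
    nearest = min(candidates, key=lambda c: abs(c - t))
    return nearest if abs(nearest - t) <= threshold else t
```

For instance, `snap_time(2.03, [2.0, 5.0], [])` should pull the clip to the edge at 2.0, while a drop at 2.3 is too far from anything and stays where it was released.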

Auto-Crossfade

Overlapping music / SFX / ambience clips crossfade automatically — manual fade-curves only when you want to override them.
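A common way to crossfade two overlapping clips is an equal-power curve, sketched below. Whether the app uses equal-power, linear, or another curve is not stated on this page, so treat the curve choice and both function names as assumptions.

```python
import math


def equal_power_gains(n):
    """Return (fade_out, fade_in) gain lists of length n.

    At every step the squared gains sum to 1, which keeps perceived
    loudness roughly constant across the fade.
    """
    if n < 2:
        raise ValueError("need at least 2 samples to crossfade")
    fade_out, fade_in = [], []
    for i in range(n):
        x = i / (n - 1)
        fade_out.append(math.cos(x * math.pi / 2))
        fade_in.append(math.sin(x * math.pi / 2))
    return fade_out, fade_in


def crossfade(a, b, overlap):
    """Join clips a and b, blending the last `overlap` samples of a into b."""
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the clips")
    fade_out, fade_in = equal_power_gains(overlap)
    tail_start = len(a) - overlap
    blended = [
        a[tail_start + i] * fade_out[i] + b[i] * fade_in[i]
        for i in range(overlap)
    ]
    return a[:tail_start] + blended + b[overlap:]


joined = crossfade([1.0] * 10, [1.0] * 10, overlap=4)
```

The result is shorter than the two clips laid end to end by exactly the overlap length, since the overlapping region is shared.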

Ripple Edit & Multi-Select

Alt+Drag for multi-select, Alt+Shift+Drag for ripple. In ripple mode, moving a single clip slides every downstream clip with it, as in DaVinci Resolve, but for audio drama.
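Ripple editing boils down to shifting the dragged clip and everything that starts at or after it by the same amount. The sketch below assumes clips on one track are stored as dictionaries sorted by start time; the data shape and function name are illustrative only.

```python
def ripple_move(clips, index, delta):
    """Move clips[index] by delta seconds and slide all later clips with it.

    clips: list of dicts with a 'start' key, sorted by start time.
    Returns a new list; the input is left unchanged.
    """
    pivot = clips[index]["start"]
    moved = []
    for clip in clips:
        shift = delta if clip["start"] >= pivot else 0.0
        moved.append({**clip, "start": clip["start"] + shift})
    return moved


track = [
    {"name": "a", "start": 0.0},
    {"name": "b", "start": 2.0},
    {"name": "c", "start": 5.0},
]
rippled = ripple_move(track, index=1, delta=1.5)
```

Clips before the dragged one stay put, so the gaps between downstream clips are preserved while the gap in front of the dragged clip grows or shrinks.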

Track Customization

Per-track color (12 presets + custom), type icon (mic / music / speaker / wave), inline rename via drag handle, individual volume + mute / solo.

Audio Export

WAV (24-bit, source sample rate) or MP3 (variable bitrate). Master-bus FX applied, peak limiter at -1 dB, LUFS-normalized to the selected target.

Video Export

MP4 with the audio mix plus audio-reactive overlays — spectrum bars, waveform pulse, configurable resolution / FPS / brightness / glow — layered over a still cover image.
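Since FFmpeg handles muxing and video export, a waveform-over-cover export can be approximated with its stock `showwaves` and `overlay` filters. This is only a sketch of one possible command: the function name, file names, and layout are hypothetical, and the app's actual filter graph (spectrum bars, glow, brightness) is not documented here.

```python
def build_video_export_cmd(cover_image, audio_mix, output_path,
                           width=1280, height=720, fps=30):
    """Build an ffmpeg argument list: still cover + waveform overlay + audio."""
    wave_height = height // 4
    filter_graph = (
        f"[0:v]scale={width}:{height}[bg];"
        f"[1:a]showwaves=s={width}x{wave_height}:mode=cline:rate={fps}[wave];"
        f"[bg][wave]overlay=0:H-h:shortest=1[v]"
    )
    return [
        "ffmpeg", "-y",
        # Loop the still image so it behaves like a video stream.
        "-loop", "1", "-framerate", str(fps), "-i", cover_image,
        "-i", audio_mix,
        "-filter_complex", filter_graph,
        "-map", "[v]", "-map", "1:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",  # stop when the audio ends
        output_path,
    ]


# Hypothetical file names, for illustration only.
cmd = build_video_export_cmd("cover.png", "mix.wav", "noir_call.mp4")
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed; building it as a list avoids shell-quoting problems with the filter graph.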

Markers

Drop named markers on the timeline for chapter points, cue points, edit anchors. Snap targets and visual reference during playback.

The Result

Seven Steps, One Final Mix

Every step above, in one finished audio drama clip. Two voices, telephone filter on Laura's side, MMAudio rain / vinyl / footsteps, ACE-Step jazz score under it all, mastered to -14 LUFS via the FX Studio chain, exported from the Composer.

NOIR CALL — Final Mix
~75s · audio coming soon

Under the Hood

Tech Stack

Every model runs locally on a consumer GPU. No cloud APIs, no subscriptions, no telemetry — the whole pipeline is offline-capable.

Application

PySide6 desktop app (Qt 6), Python 3.14, single-process with a worker thread for ML calls. Tabbed view router, lazy engine loading, 5-minute idle-unload.
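Lazy loading with a 5-minute idle-unload can be sketched as a small wrapper around a loader function. The class name, its methods, and the injectable clock are assumptions for illustration; a real engine wrapper would also need to release GPU memory and coordinate with the worker thread.

```python
import time


class LazyEngine:
    """Load a model on first use and drop it after a period of inactivity."""

    def __init__(self, loader, idle_seconds=300.0, clock=time.monotonic):
        self._loader = loader          # callable that builds the model
        self._idle_seconds = idle_seconds
        self._clock = clock            # injectable so the timeout is testable
        self._model = None
        self._last_used = 0.0

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self):
        """Return the model, loading it on first access."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been idle long enough. Returns True if unloaded."""
        if self._model is not None and \
                self._clock() - self._last_used >= self._idle_seconds:
            self._model = None
            return True
        return False


# Demo with a fake clock instead of waiting five real minutes.
fake_now = [0.0]
engine = LazyEngine(loader=lambda: object(), clock=lambda: fake_now[0])
engine.get()
fake_now[0] = 299.0
unloaded_early = engine.unload_if_idle()
fake_now[0] = 300.0
unloaded_late = engine.unload_if_idle()
```

In a desktop app, `unload_if_idle` would typically be called from a periodic timer, so heavy models that nobody has touched for a while stop occupying VRAM.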

Voice Engines

Chatterbox EN / Multilingual / Turbo for synthesis · Qwen3-TTS for prose-to-voice design · Resemble Enhance for neural denoise + upsample.

Music Engine

ACE-Step XL Turbo via the Diffusers pipeline. MIT-licensed weights, bf16, partial offload — 20× realtime on a 16 GB GPU.

SFX Engines

MMAudio (large_44k_v2) for short concrete sounds at 44.1 kHz · AudioLDM2 for longer atmosphere loops · same prompt-based interface for both.

Stem Separation

Demucs running ~6× realtime, splits into drums / bass / vocals / other. Stems integrate into the Music Library as children of the master.

Audio Backbone

PyTorch + CUDA for inference, NumPy / SciPy for DSP, librosa for analysis, FFmpeg for muxing and video export.

PySide6 · Chatterbox EN / Multilingual / Turbo · Qwen3-TTS · ACE-Step XL Turbo (MIT) · MMAudio · AudioLDM2 · Demucs · Resemble Enhance · PyTorch / CUDA · Diffusers · NumPy / SciPy · librosa · FFmpeg

Local-First, By Design

Every engine on this page is downloaded once and runs offline from then on. No request leaves the machine. No vendor lock-in, no API quota, no surprise bill at the end of the month — and no licensing argument when a project goes commercial, because the music engine is MIT and the voice engines run on locally-stored weights.
