A local multi-track audio production studio. Voice, music, and SFX generated and composed on a single timeline — written for one operator producing audio drama at scale. PySide6 desktop app, fully local inference on consumer GPUs, no cloud APIs, no subscriptions.
A 16-line audio drama scene: Viktor, a tired New York detective, calls Laura from the streets of Manhattan. Two voices, a telephone filter on her side, rain and city atmosphere, slow jazz vinyl bleeding through her apartment, and a final outro of his footsteps as he walks off. Every asset on this page was generated, processed, and arranged inside AudioGen Studio; no external editor was involved.
No tab-hopping between tools. The same desktop app moves the project from "voice description prompt" to "mastered MP4 with audio-reactive visuals" — and back, when you want to iterate.
Describe a voice in plain English, get back a usable speaker identity. Qwen3-TTS turns a prose description ("A weathered American male voice in his early forties, deep baritone with a slight gravel from too many late nights…") plus a 25–30 second sample text into a custom voice that gets saved into the Voice Library with language, gender, and bucket. The library on the right shows speakers organized by project — TheBlackVeil, Charlie_Project, NOIR CALL — and you assign saved voices straight to the Script tab's Speaker 1 / Speaker 2 slots.
Voice Designer — generating "Laura" from a prose description, ready to assign to Speaker 2 in the Script tab.
Qwen3-TTS generate_voice_design converts a written voice description into a unique speaker identity — no reference recording required.
A ~25–30 second sample text exercises the voice across diverse vowels and consonants before saving. That phonetic spread is what makes a designed voice hold up under variation.
Speakers saved by language (EN/DE), gender (Male/Female), and project bucket. The same Voice Library shows up in the Script tab, FX Studio, and Composer.
Reproducible designs via locked seed + language hint. Same prompt + same seed = same voice across sessions, projects, and machines.
Paste a script, assign saved voices to speakers, pick a Chatterbox engine variant, and generate the whole conversation line by line. The Mood Pill bar applies inline [soft] / [firm] / [urgent] / [whisper] tags that Chatterbox's mood-aware variant respects per line. Auto-generate-then-approve walks through the conversation and lets the operator accept, regenerate, or re-mood any single line without re-running the rest.
Script tab — NOIR CALL loaded with both speakers assigned, mood pills above the text, per-line approval queue below.
Three engine flavors: Chatterbox EN (best English fidelity), Multilingual (German and other languages), Turbo (~2× speed for batch work). Switch per session; models lazy-load on first use and unload after 5 minutes idle.
Inline [soft] / [firm] / [urgent] / [whisper] / [neutral] tags driven by Chatterbox's mood-aware path with dwain-barnes-tuned exaggeration values.
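As a rough illustration of how inline mood tags could be separated from the spoken text, here is a minimal sketch. The `parse_line` helper, the `ScriptLine` type, and the numeric exaggeration values are all hypothetical placeholders, not the tuned values or the parser the app actually uses.

```python
import re
from dataclasses import dataclass

# Hypothetical mapping from mood tag to an exaggeration value. The numbers
# are placeholders for illustration, not the tuned values the app ships with.
MOOD_EXAGGERATION = {
    "neutral": 0.5,
    "soft": 0.35,
    "firm": 0.6,
    "urgent": 0.8,
    "whisper": 0.25,
}

# Matches one leading tag such as "[urgent]" at the start of a line.
_MOOD_TAG = re.compile(r"^\s*\[(\w+)\]\s*")


@dataclass
class ScriptLine:
    text: str
    mood: str
    exaggeration: float


def parse_line(raw: str, default_mood: str = "neutral") -> ScriptLine:
    """Split a leading mood tag from the spoken text of one script line."""
    match = _MOOD_TAG.match(raw)
    if match and match.group(1).lower() in MOOD_EXAGGERATION:
        mood = match.group(1).lower()
        text = raw[match.end():].strip()
    else:
        # Missing or unknown tags fall back to the default mood and
        # leave the text untouched.
        mood, text = default_mood, raw.strip()
    return ScriptLine(text=text, mood=mood, exaggeration=MOOD_EXAGGERATION[mood])
```

Because each line is parsed on its own, re-mooding one line only changes that line's tag and value; the rest of the session is unaffected.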
Walks through the script line by line. Approve, regenerate with new seed, or re-mood any single line without touching the rest of the session.
Inline [whispers] / [sigh] / [chuckle] tags supported by the Paralinguistic engine variant — humanizes lines that would otherwise sound staged.
MMAudio for short concrete sound effects (phone pickup, taxi pass on wet asphalt, footsteps), AudioLDM2 in the same pipeline for longer atmosphere loops. Single-element prompts work better than stacked ones — "wet asphalt, tire swoosh" beats "rain AND traffic AND horn AND footsteps" every time. Generated clips save into the SFX Library with category tags (Ambience, Foley, Impact, UI) and drop straight onto Composer tracks.
SFX Studio — generating a tavern atmosphere clip with MMAudio, ready to save as a categorized library entry.
Best for short, concrete one-shots: phone_pickup, taxi_pass_wet, vinyl_crackle. 4–8 seconds, 25 steps default, guidance 4.5.
Sister engine in the same view for longer atmosphere loops: rain on rooftops, distant traffic, room tone. Layer its output with MMAudio one-shots for denser scenes.
Single-element prompts outperform stacked ones. "Wet asphalt, tire swoosh" yields clean output; "rain AND traffic AND horn AND footsteps" produces mud.
SFX saved with category tags (Ambience, Foley, Impact, UI) and per-project buckets. Same library is drag-droppable from the Composer's SFX panel.
ACE-Step XL Turbo via Diffusers — MIT-licensed weights, 20× realtime, runs comfortably on a consumer GPU. Six modes: Generate a fresh track, Extend an existing one, Add Intro / Add Outro, Repaint a drag-selected region without touching the rest, or Cover an existing track in a different genre. Style presets ("Lo-Fi Cafe", "Cinematic Build", etc.) prefill prompt + parameters with proven defaults that you edit on top. Lyrics field is optional — leave blank for instrumental, add [verse] / [chorus] / [bridge] tags for sung output.
Music Studio — generating a lo-fi hip-hop loop with vinyl crackle, BPM 80, C minor. Preset "Lo-Fi Cafe" prefilled the defaults; the rest is overridden on top.
MIT-licensed, 20× realtime on a 16 GB GPU. Loaded as a Diffusers pipeline (no ComfyUI dependency), eager dtype + bf16 with partial offload.
Generate · Extend · Add Intro · Add Outro · Repaint a drag-selected region · Cover an existing track in a new genre.
Curated prompt + parameter pairs that you start from and edit on top — "Lo-Fi Cafe", "Cinematic Build", "8-bit Action Theme", and more. Faster than building prompts from scratch every time.
One-click handoff from the Music Studio's saved track straight into FX Studio's Music tab for mastering — no file management in between.
Per-speaker and per-line effect chains for the whole conversation. Pick a session from the dropdown, see every line organized by speaker on the left, build an effect chain on the right. Apply a chain to an entire speaker (Laura gets the Telephone-Filter — High-Pass 300 Hz, Low-Pass 3400 Hz, slight saturation, compressor, gain trim — for the whole call) or override one specific line for emphasis. Compare button A/Bs original vs. processed; the Export FX Combined button writes the rendered version back as a session-bound artifact.
FX Studio — Laura selected, Lo-Fi (telephone) preset loaded, chain ready to apply across all 7 of her lines.
Apply an effect chain to every line of one speaker at once. Telephone-filter on the partner in a phone-call scene, broadcast polish on the narrator, no per-line manual work.
Single line needs an emphasis push? Apply presence_boost at 2 kHz to just that line. The tree shows * markers where overrides live, so you can see at a glance what's customized.
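The speaker-wide chain and the per-line override described above can be modeled as a simple lookup. This is a minimal sketch under one assumption: an override replaces the speaker chain for that line rather than being appended to it. `Effect`, `SessionFX`, and the parameter names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Effect:
    name: str
    params: dict = field(default_factory=dict)


# Telephone chain as listed in the text. Only the two cutoff frequencies come
# from the source; the remaining settings are left unspecified here.
TELEPHONE = [
    Effect("high_pass", {"cutoff_hz": 300}),
    Effect("low_pass", {"cutoff_hz": 3400}),
    Effect("saturation"),
    Effect("compressor"),
    Effect("gain_trim"),
]


@dataclass
class SessionFX:
    speaker_chains: dict = field(default_factory=dict)  # speaker -> [Effect]
    line_overrides: dict = field(default_factory=dict)  # line index -> [Effect]

    def chain_for(self, speaker: str, line_index: int) -> list:
        """Return the chain for one line: override first, speaker chain otherwise."""
        if line_index in self.line_overrides:
            return self.line_overrides[line_index]
        return self.speaker_chains.get(speaker, [])

    def label(self, line_index: int, text: str) -> str:
        """Prefix overridden lines with '*', mirroring the tree marker."""
        return f"* {text}" if line_index in self.line_overrides else text
```

For example, `SessionFX(speaker_chains={"Laura": TELEPHONE})` gives every Laura line the telephone chain, and setting `line_overrides[4]` swaps in a different chain for that one line only.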
Curated chains: Voice Polish (Subtle), Voice Polish (Broadcast), Lo-Fi (telephone), Whisper Close-Mic, Radio. Save your own with one click.
Click Compare to flip between original and processed in place — same waveform window, no separate player to manage.
Same FX Studio shell, Music tab. Mastering chains use a different preset category — "Music Master Streaming (-14 LUFS · Spotify/Apple)" is the daily-driver default with High-Pass 30 Hz, two Presence Boosts, Compressor, Limiter, LUFS Normalize. Below the music library: Separate Stems runs Demucs at ~6× realtime, splitting any track into drums / bass / vocals / other. The stems become children of the master track in the library and can be dropped into the Composer as four parallel lanes for independent volume / FX / muting — or remixed back into a new master.
FX Studio Music tab — Jazz_1940_Intro split into 4 Demucs stems (bass / drums / other / vocals), Music Master Streaming chain (6 effects) applied.
Any music track → drums / bass / vocals / other at ~6× realtime. Stems show up as children of the master in the library, ready for independent processing.
Eight presets, including streaming (-14 LUFS, Spotify/Apple), cinematic (-16 LUFS), broadcast, and lo-fi character chains. Pick one, tweak the chain, save your own.
One button drops all four stems onto the Composer timeline as parallel tracks — mute the drums, duck just the vocals under voice lines, layer effects per stem.
After per-stem edits, remix back into a single mastered file. Original stays in the library; the remix becomes a new entry — non-destructive workflow.
Where everything comes together. Voice sessions auto-sync from the Script tab as two parallel speaker tracks with proper gaps between lines. SFX clips drag from the library onto their own lanes. Music sits on a dedicated music lane (or four lanes if you dropped stems). Per-track color, type icon, drag-handle rename, individual volume. Auto-snap to clip edges, BPM beats, and markers. Auto-crossfade on overlapping music / SFX / ambience clips. Multi-select with Alt+Drag, ripple-edit with Alt+Shift+Drag. Two export targets: a single WAV / MP3 of the audio mix, or an MP4 with the audio plus audio-reactive visuals (spectrum bars, waveform pulse) overlaid on a still cover image.
Composer — NOIR CALL session laid out across voice lanes (Viktor / Laura), SFX track (footsteps, taxi), music lane with the vinyl loop. Video FX panel on the right configures audio-reactive overlays for MP4 export.
Picks up the latest Script session, builds two voice tracks with proper gaps, drops every line as a clip in order. Zero manual timeline assembly for dialogue.
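The auto-sync step amounts to laying the lines end to end, each on its speaker's track, with a pause in between. Here is a minimal sketch; the `Clip` type, `layout_dialogue`, and the default gap length are assumptions for illustration, since the text does not state how gaps are chosen.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    track: str        # one track per speaker
    line_index: int   # position of the line in the script
    start: float      # seconds from the start of the timeline
    duration: float   # seconds


def layout_dialogue(lines: list, gap: float = 0.4) -> list:
    """Place (speaker, duration) pairs in script order, separated by `gap` seconds."""
    clips = []
    cursor = 0.0
    for index, (speaker, duration) in enumerate(lines):
        clips.append(Clip(track=speaker, line_index=index, start=cursor, duration=duration))
        cursor += duration + gap
    return clips
```

Each clip keeps its script index, so a regenerated line can be swapped in at the same position later.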
Clip edges, markers, and BPM beats all snap automatically. Disable per-operation by holding Shift while dragging.
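Snapping of this kind usually means pulling a dragged position to the nearest target when it is close enough. The sketch below shows that idea; `snap_time`, `beat_grid`, and the threshold value are hypothetical, and the `enabled` flag stands in for holding Shift.

```python
def beat_grid(bpm: float, length_seconds: float) -> list:
    """Return beat positions in seconds from 0 up to length_seconds."""
    beat = 60.0 / bpm
    count = int(length_seconds / beat) + 1
    return [i * beat for i in range(count)]


def snap_time(t: float, targets: list, threshold: float = 0.1, enabled: bool = True) -> float:
    """Snap t to the nearest target within `threshold` seconds, else return t unchanged."""
    if not enabled or not targets:
        return t
    nearest = min(targets, key=lambda target: abs(target - t))
    return nearest if abs(nearest - t) <= threshold else t
```

Clip edges and marker positions can simply be appended to the same `targets` list, so all three snap sources go through one function.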
Overlapping music / SFX / ambience clips crossfade automatically — manual fade-curves only when you want to override them.
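The text does not say which fade curve the automatic crossfade uses. A common choice is an equal-power curve, sketched below as an assumption; `overlap_region` and `equal_power_gains` are hypothetical helper names.

```python
import math


def overlap_region(a_start: float, a_end: float, b_start: float, b_end: float):
    """Return (start, end) of the overlap between two clips, or None if they do not overlap."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if end > start else None


def equal_power_gains(progress: float) -> tuple:
    """Return (fade_out, fade_in) gains for a position 0..1 inside the overlap.

    The squared gains always sum to 1, which keeps perceived loudness roughly
    constant across the fade for uncorrelated material.
    """
    progress = min(1.0, max(0.0, progress))
    fade_out = math.cos(progress * math.pi / 2)
    fade_in = math.sin(progress * math.pi / 2)
    return fade_out, fade_in
```

A manual fade curve would replace `equal_power_gains` for that one overlap while the overlap detection stays the same.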
Alt+Drag for multi-select, Alt+Shift+Drag for ripple. In ripple mode, move a single clip and the downstream clips slide with it, like DaVinci Resolve, but for audio drama.
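The ripple behavior can be sketched as shifting the dragged clip and everything after it by the same amount. This is an illustration for a single track whose clips are already sorted by start time; `ripple_move` and the clamping rule (a clip cannot be pushed into the one before it) are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Clip:
    name: str
    start: float     # seconds
    duration: float  # seconds


def ripple_move(clips: list, index: int, delta: float) -> list:
    """Shift clips[index] and all later clips by delta seconds.

    Assumes `clips` is sorted by start time. The shift is clamped so the moved
    clip never starts before the previous clip ends (or before time 0).
    """
    if index == 0:
        floor = 0.0
    else:
        previous = clips[index - 1]
        floor = previous.start + previous.duration
    delta = max(delta, floor - clips[index].start)
    return [
        replace(clip, start=clip.start + delta) if i >= index else clip
        for i, clip in enumerate(clips)
    ]
```

The function returns new `Clip` objects instead of mutating the originals, which keeps an undo step as simple as holding on to the previous list.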
Per-track color (12 presets + custom), type icon (mic / music / speaker / wave), inline rename via drag handle, individual volume + mute / solo.
WAV (24-bit, source sample rate) or MP3 (variable bitrate). Master-bus FX are applied, with a peak limiter at -1 dBFS and LUFS normalization to the selected target.
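The arithmetic behind loudness normalization plus a peak ceiling is short enough to show. The sketch assumes the integrated loudness and the peak level have already been measured elsewhere (measuring LUFS needs a K-weighted meter, which is out of scope here); `plan_normalization` is a hypothetical helper, and -14 is just one of the targets the page mentions.

```python
def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def plan_normalization(
    measured_lufs: float,
    peak_db: float,
    target_lufs: float = -14.0,
    ceiling_db: float = -1.0,
) -> tuple:
    """Return (gain_db, limiter_reduction_db) for a mix.

    gain_db moves the measured loudness onto the target. If the peak would
    then exceed the ceiling, limiter_reduction_db is how much the limiter has
    to take off the loudest peaks.
    """
    gain_db = target_lufs - measured_lufs
    peak_after_gain = peak_db + gain_db
    limiter_reduction_db = max(0.0, peak_after_gain - ceiling_db)
    return gain_db, limiter_reduction_db
```

A mix measured at -20 LUFS with peaks at -3 dB needs 6 dB of gain to reach -14 LUFS, which would put the peaks at +3 dB, so the limiter has to remove 4 dB at the top. Heavy limiting lowers loudness slightly, so real chains often measure again after limiting.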
MP4 with the audio mix plus audio-reactive overlays — spectrum bars, waveform pulse, configurable resolution / FPS / brightness / glow — layered over a still cover image.
Drop named markers on the timeline for chapter points, cue points, and edit anchors. They serve as snap targets and as a visual reference during playback.
Every step above, in one finished audio drama clip. Two voices, telephone filter on Laura's side, MMAudio rain / vinyl / footsteps, ACE-Step jazz score under it all, mastered to -14 LUFS via the FX Studio chain, exported from the Composer.
Every model runs locally on a consumer GPU. No cloud APIs, no subscriptions, no telemetry — the whole pipeline is offline-capable.
PySide6 desktop app (Qt 6), Python 3.14, single-process with a worker thread for ML calls. Tabbed view router, lazy engine loading, 5-minute idle-unload.
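Lazy loading with an idle timeout can be sketched as a small wrapper around a loader function. The `LazyEngine` class below is a hypothetical illustration, not the app's actual code; the clock is injectable so the timeout can be tested without waiting.

```python
import time
from typing import Any, Callable


class LazyEngine:
    """Load a model on first use and drop it after a period of inactivity."""

    def __init__(
        self,
        loader: Callable[[], Any],
        idle_seconds: float = 300.0,  # 5 minutes, as stated in the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._model: Any = None
        self._last_used = 0.0

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        """Return the model, loading it if needed, and reset the idle timer."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been unused for idle_seconds. Returns True if unloaded."""
        if self._model is not None and self._clock() - self._last_used >= self._idle_seconds:
            # A GPU-backed engine would also release device memory at this point.
            self._model = None
            return True
        return False
```

In a Qt application, `unload_if_idle` could be called from a periodic timer on the main thread while `get` is called from the worker thread; in that case the two methods would need a lock around the shared state.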
Chatterbox EN / Multilingual / Turbo for synthesis · Qwen3-TTS for prose-to-voice design · Resemble Enhance for neural denoise + upsample.
ACE-Step XL Turbo via the Diffusers pipeline. MIT-licensed weights, bf16, partial offload — 20× realtime on a 16 GB GPU.
MMAudio (large_44k_v2) for short concrete sounds at 44.1 kHz · AudioLDM2 for longer atmosphere loops · same prompt-based interface for both.
Demucs running ~6× realtime, splits into drums / bass / vocals / other. Stems integrate into the Music Library as children of the master.
PyTorch + CUDA for inference, NumPy / SciPy for DSP, librosa for analysis, FFmpeg for muxing and video export.
Every engine on this page is downloaded once and runs offline from then on. No request leaves the machine. No vendor lock-in, no API quota, no surprise bill at the end of the month, and no licensing argument when a project goes commercial, because the music engine is MIT-licensed and the voice engines run on locally stored weights.