Natural Storytelling with Piper TTS
What is Piper TTS?
Piper is a fast, local text-to-speech system that runs entirely offline. Unlike cloud-based TTS services, Piper processes speech locally with minimal latency and supports multiple languages and voices. It’s perfect for applications like ZestLoop where you need high-quality, natural-sounding narration.
Key benefits:
- 🚀 Fast synthesis (real-time on CPU)
- 🔒 Privacy-first (runs offline)
- 🌍 Multi-language support
- 💾 Small model sizes (20-40MB per voice)
Installation
First, install Piper and download voice models:
# Install Piper
pip install piper-tts
# Download voice models (the code below looks for the .onnx files in ./piper_models unless PIPER_MODELS_DIR is set)
python3 -m piper.download_voices fr_FR-siwis-medium
python3 -m piper.download_voices en_US-lessac-medium
Required dependencies:
pip install numpy scipy
The Problem with Basic TTS
Basic TTS systems generate continuous speech without natural pauses. This makes stories sound robotic and exhausting to listen to. The solution? Intelligent pause insertion based on punctuation.
Implementation
Here’s how to create natural-sounding narration with realistic pauses:
from piper.voice import PiperVoice
import numpy as np
from scipy.io import wavfile
import io
import wave
import re
import os

# Configure directories
MODELS_DIR = os.getenv("PIPER_MODELS_DIR", "./piper_models")
TTS_OUTPUT_DIR = os.getenv("TTS_OUTPUT_DIR", "./tts")
os.makedirs(TTS_OUTPUT_DIR, exist_ok=True)

# Load voice models for different languages
fr_voice = PiperVoice.load(os.path.join(MODELS_DIR, "fr_FR-siwis-medium.onnx"))
en_voice = PiperVoice.load(os.path.join(MODELS_DIR, "en_US-lessac-medium.onnx"))

# Map language codes to voices
voices = {
    "fr": fr_voice,
    "en": en_voice,
}

def synthesize_to_array(voice, text):
    """
    Convert text to a numpy array of audio samples.

    This function:
    1. Creates an in-memory WAV file
    2. Uses Piper to synthesize speech into that WAV
    3. Converts the WAV data back to a numpy array for processing
    """
    wav_io = io.BytesIO()

    # Synthesize speech to WAV format
    with wave.open(wav_io, 'wb') as wav_file:
        voice.synthesize_wav(text, wav_file)

    # Read WAV data as numpy array (16-bit PCM samples)
    wav_io.seek(0)
    with wave.open(wav_io, 'rb') as wav_file:
        frames = wav_file.readframes(wav_file.getnframes())
        audio_np = np.frombuffer(frames, dtype=np.int16)

    return audio_np

def add_silence(duration_ms, sample_rate):
    """
    Generate silence for natural pauses between sentences.

    Args:
        duration_ms: Length of silence in milliseconds
        sample_rate: Audio sample rate (typically 22050 or 16000 Hz)

    Returns:
        numpy array of zeros representing silence
    """
    samples = int(sample_rate * duration_ms / 1000)
    return np.zeros(samples, dtype=np.int16)

def synthesize_with_pauses(text, lang, output_file):
    """
    Generate natural-sounding speech with realistic pauses.

    This function:
    1. Splits text at punctuation marks (keeping the punctuation)
    2. Synthesizes each sentence fragment
    3. Adds appropriate silence based on punctuation type:
       - Questions/exclamations: 800ms pause
       - Periods: 500ms pause
       - Commas/semicolons: 300ms pause
    4. Concatenates everything into a single audio file

    Args:
        text: The story or text to narrate
        lang: Language code ('en' or 'fr')
        output_file: Path where the WAV file will be saved
    """
    text = text.strip()

    # Split text on runs of punctuation while keeping the punctuation marks
    # e.g., "Hello, world!" → ["Hello", ",", " world", "!", ""]
    parts = re.split(r'([.!?,;]+)', text)

    # Get the voice for the specified language (falls back to English)
    voice = voices.get(lang, en_voice)
    audio_chunks = []
    sample_rate = voice.config.sample_rate

    # Process (text, punctuation) pairs. A trailing fragment with no closing
    # punctuation is paired with an empty string so it is not dropped.
    for i in range(0, len(parts), 2):
        punct = parts[i+1] if i + 1 < len(parts) else ""
        sentence = (parts[i] + punct).strip()

        # Skip empty fragments and fragments that contain only punctuation
        if not sentence.strip('.!?,; '):
            continue

        # Synthesize this sentence
        audio = synthesize_to_array(voice, sentence)
        audio_chunks.append(audio)

        # Add appropriate pause based on punctuation type
        if '?' in punct or '!' in punct:
            # Longer pause for questions and exclamations
            audio_chunks.append(add_silence(800, sample_rate))
        elif '.' in punct:
            # Medium pause for periods
            audio_chunks.append(add_silence(500, sample_rate))
        elif ',' in punct or ';' in punct:
            # Short pause for commas and semicolons
            audio_chunks.append(add_silence(300, sample_rate))

    # np.concatenate raises on an empty list, so fail with a clearer message
    if not audio_chunks:
        raise ValueError("No speakable text found in input")

    # Combine all audio chunks into one file
    final_audio = np.concatenate(audio_chunks)
    wavfile.write(output_file, sample_rate, final_audio)
    print(f"[INFO] Saved: {output_file}")

# Example usage
if __name__ == "__main__":
    story = """
    Once upon a time, in a land far away, there lived a curious cat.
    She loved exploring! But one day, something unexpected happened.
    Would she find her way home? The adventure was just beginning.
    """

    # Write into TTS_OUTPUT_DIR, which was created at import time above
    synthesize_with_pauses(
        text=story,
        lang="en",
        output_file=os.path.join(TTS_OUTPUT_DIR, "story_narration.wav")
    )
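To get a feel for what the pause durations mean in samples, here is a small sketch that repeats the arithmetic from add_silence and reads the length of a finished WAV file from its header. It uses only the standard library; wav_duration_seconds is a helper name I made up for illustration, not part of Piper.

```python
import wave


def silence_samples(duration_ms, sample_rate):
    """Number of 16-bit samples needed for a pause of duration_ms."""
    return int(sample_rate * duration_ms / 1000)


def wav_duration_seconds(path):
    """Length of a WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


if __name__ == "__main__":
    # At 22050 Hz the three pauses are 17640, 11025 and 6615 samples long.
    for ms in (800, 500, 300):
        print(ms, "ms at 22050 Hz =", silence_samples(ms, 22050), "samples")
```

Comparing the duration of a file generated with pauses against one generated without them is a quick way to see how much silence a given story picks up.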
How It Works
- Text Splitting: Uses regex to split on punctuation while preserving the punctuation marks
- Sentence-by-Sentence: Each sentence fragment is synthesized independently
- Smart Pausing: Different punctuation types trigger different pause durations:
  - ? and ! → 800ms (dramatic effect)
  - . → 500ms (natural sentence break)
  - , and ; → 300ms (breathing room)
- Audio Concatenation: All fragments and pauses are merged into a single WAV file
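The split-and-pause idea can be tried without loading any voice model. The following is a minimal dry-run sketch, assuming the same punctuation set and pause values as above; plan_fragments and pause_for are hypothetical helper names, and no audio is produced.

```python
import re


def pause_for(punct):
    """Pause length in milliseconds for a punctuation mark."""
    if "?" in punct or "!" in punct:
        return 800
    if "." in punct:
        return 500
    if "," in punct or ";" in punct:
        return 300
    return 0


def plan_fragments(text):
    """Return (fragment, pause_ms) pairs without synthesizing anything."""
    # The capturing group keeps each punctuation mark in the result list,
    # so parts alternates: text, punctuation, text, punctuation, ...
    parts = re.split(r"([.!?,;])", text.strip())
    plan = []
    for i in range(0, len(parts) - 1, 2):
        fragment = (parts[i] + parts[i + 1]).strip()
        if fragment:
            plan.append((fragment, pause_for(parts[i + 1])))
    return plan


if __name__ == "__main__":
    for fragment, pause in plan_fragments("She loved exploring! But one day, it rained."):
        print(f"{pause:>4} ms after: {fragment}")
```

Printing the plan for a story before synthesizing it makes it easy to spot fragments that are too short or pauses that land in odd places.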
Recommendations
Voice Selection
- Medium models offer the best quality-to-size ratio (20-40MB)
- High-quality models are better for long-form content but slower
- Test different voices for your use case - some sound more natural than others
Performance Tips
- Pre-load voice models at startup (they take ~1-2 seconds to initialize)
- Cache generated audio for repeated content
- Process long texts in chunks to avoid memory issues
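Two of these tips can be sketched with the standard library alone. This is one possible approach, not the only one: cache_path_for, get_or_create and write_pcm_chunks are hypothetical names, the ./tts_cache directory is an assumption, and render stands for any function with the signature (text, lang, output_file), such as synthesize_with_pauses.

```python
import hashlib
import os
import wave


def cache_path_for(text, lang, cache_dir="./tts_cache"):
    """Derive a stable file name from the language and the exact text."""
    digest = hashlib.sha256(f"{lang}\n{text}".encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{lang}_{digest[:16]}.wav")


def get_or_create(text, lang, render, cache_dir="./tts_cache"):
    """Return a cached WAV path, calling render(text, lang, path) only on a miss."""
    path = cache_path_for(text, lang, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        render(text, lang, path)
    return path


def write_pcm_chunks(chunks, sample_rate, output_file):
    """Write 16-bit mono PCM chunks (bytes) to a WAV file one at a time.

    Accepting an iterable means the whole narration never has to sit in
    memory at once; a generator can yield each fragment as it is synthesized.
    """
    with wave.open(output_file, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = int16
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)
```

If you change the voice model or the pause durations, include them in the hashed string as well, otherwise stale audio will be served from the cache. With the numpy arrays from the listing above, each chunk would be passed as audio.tobytes().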
Pause Tuning
The pause durations (800ms, 500ms, 300ms) work well for storytelling, but adjust them based on:
- Content type (audiobooks vs. news vs. conversations)
- Voice speed and speaking rate
- Target audience preferences
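One way to make the pauses tunable is to keep the three base values in a table and scale them. The sketch below assumes made-up multipliers per content type (the numbers in PROFILE_SCALE are placeholders to tune by ear, not measured values) and shortens pauses proportionally when the voice speaks faster.

```python
# Base pauses in milliseconds, as used for storytelling above.
BASE_PAUSES_MS = {"strong": 800, "sentence": 500, "clause": 300}

# Hypothetical multipliers per content type; adjust after listening tests.
PROFILE_SCALE = {"storytelling": 1.0, "news": 0.6, "conversation": 0.4}


def pauses_for(content_type="storytelling", speaking_rate=1.0):
    """Scale the base pauses by content type and inversely by speaking rate."""
    if speaking_rate <= 0:
        raise ValueError("speaking_rate must be positive")
    scale = PROFILE_SCALE[content_type] / speaking_rate
    return {name: round(ms * scale) for name, ms in BASE_PAUSES_MS.items()}


if __name__ == "__main__":
    print(pauses_for("storytelling"))
    print(pauses_for("news"))
    print(pauses_for("storytelling", speaking_rate=2.0))
```

The returned dictionary could replace the hard-coded 800, 500 and 300 in the pause branches, so that a single argument switches the whole narration style.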
Production Considerations
- Use environment variables for model paths to keep configuration flexible
- Add error handling for missing models or invalid language codes
- Consider async processing for web applications to avoid blocking
- Implement a queue system for batch processing multiple stories
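Here is a rough sketch of how the error handling, async and queue points might fit together, assuming Python 3.9+ for asyncio.to_thread. NarrationError, resolve_model_path and narrate_batch are illustrative names I introduced; render again stands for a blocking function such as synthesize_with_pauses, and each job is a (text, lang, output_file) tuple.

```python
import asyncio
import os


class NarrationError(Exception):
    """Raised when a narration request cannot be fulfilled."""


def resolve_model_path(lang, models_dir, model_names):
    """Validate the language code and check that the model file exists."""
    if lang not in model_names:
        raise NarrationError(f"Unsupported language code: {lang!r}")
    path = os.path.join(models_dir, model_names[lang])
    if not os.path.isfile(path):
        raise NarrationError(f"Voice model not found: {path}")
    return path


async def _worker(queue, render, results):
    """Pull jobs off the queue and run each one in a worker thread."""
    while True:
        job = await queue.get()
        try:
            # to_thread keeps the event loop responsive during blocking synthesis
            await asyncio.to_thread(render, *job)
            results.append((job, None))
        except Exception as exc:
            # Record the failure and keep going with the remaining jobs
            results.append((job, exc))
        finally:
            queue.task_done()


async def narrate_batch(jobs, render, concurrency=2):
    """Process all jobs with a fixed number of workers; return (job, error) pairs."""
    queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)
    results = []
    workers = [
        asyncio.create_task(_worker(queue, render, results))
        for _ in range(concurrency)
    ]
    await queue.join()  # wait until every job has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling resolve_model_path at startup surfaces a missing model immediately instead of in the middle of a request. Keeping concurrency low is probably wise, since synthesis is CPU-bound and several voices running at once compete for the same cores.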
Real-World Usage
I use this approach in ZestLoop to generate bedtime stories, jokes, and affirmations. The natural pausing makes a huge difference in listener engagement and comprehension.
Try it yourself and experiment with pause timings to match your content style!