Natural Storytelling with Piper TTS


What is Piper TTS?

Piper is a fast, local text-to-speech system that runs entirely offline. Unlike cloud-based TTS services, Piper synthesizes speech on your own machine, so there is no network round trip, and it supports multiple languages and voices. It suits applications like ZestLoop that need high-quality, natural-sounding narration.

Key benefits:

  • 🚀 Fast synthesis (real-time on CPU)
  • 🔒 Privacy-first (runs offline)
  • 🌍 Multi-language support
  • 💾 Small model sizes (20-40MB per voice)

Installation

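First, install Piper and download the voice models. The script below loads them from ./piper_models by default (override with PIPER_MODELS_DIR), so make sure each .onnx file and its accompanying .onnx.json config end up in that directory: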

# Install Piper
pip install piper-tts

# Download voice models
python3 -m piper.download_voices fr_FR-siwis-medium
python3 -m piper.download_voices en_US-lessac-medium

The script below also uses NumPy and SciPy:

pip install numpy scipy
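
Before loading a voice, it can help to confirm that the model files are where the script will look for them. The following is a minimal sketch, assuming each voice consists of a `<name>.onnx` model plus a `<name>.onnx.json` config in the same directory; `missing_voice_files` is a hypothetical helper, not part of Piper:

```python
import os


def missing_voice_files(models_dir, voice_names):
    """Return the paths of expected voice files that are not on disk."""
    missing = []
    for name in voice_names:
        # Assumption: each voice ships as <name>.onnx plus <name>.onnx.json
        for suffix in (".onnx", ".onnx.json"):
            path = os.path.join(models_dir, name + suffix)
            if not os.path.isfile(path):
                missing.append(path)
    return missing


if __name__ == "__main__":
    models_dir = os.getenv("PIPER_MODELS_DIR", "./piper_models")
    wanted = ["fr_FR-siwis-medium", "en_US-lessac-medium"]
    for path in missing_voice_files(models_dir, wanted):
        print(f"[WARN] Missing: {path}")
```

An empty result means both files are present for every voice name passed in.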

The Problem with Basic TTS

Basic TTS systems generate continuous speech without natural pauses. This makes stories sound robotic and exhausting to listen to. The solution? Intelligent pause insertion based on punctuation.

Implementation

Here’s how to create natural-sounding narration with realistic pauses:

from piper.voice import PiperVoice
import numpy as np
from scipy.io import wavfile
import io
import wave
import re
import os

# Configure directories
MODELS_DIR = os.getenv("PIPER_MODELS_DIR", "./piper_models")
TTS_OUTPUT_DIR = os.getenv("TTS_OUTPUT_DIR", "./tts")
os.makedirs(TTS_OUTPUT_DIR, exist_ok=True)

# Load voice models for different languages
fr_voice = PiperVoice.load(os.path.join(MODELS_DIR, "fr_FR-siwis-medium.onnx"))
en_voice = PiperVoice.load(os.path.join(MODELS_DIR, "en_US-lessac-medium.onnx"))

# Map language codes to voices
voices = {
    "fr": fr_voice,
    "en": en_voice,
}


def synthesize_to_array(voice, text):
    """
    Convert text to a numpy array of audio samples.
    
    This function:
    1. Creates an in-memory WAV file
    2. Uses Piper to synthesize speech into that WAV
    3. Converts the WAV data back to a numpy array for processing
    """
    wav_io = io.BytesIO()
    
    # Synthesize speech to WAV format
    with wave.open(wav_io, 'wb') as wav_file:
        voice.synthesize_wav(text, wav_file)
    
    # Read WAV data as numpy array
    wav_io.seek(0)
    with wave.open(wav_io, 'rb') as wav_file:
        frames = wav_file.readframes(wav_file.getnframes())
        audio_np = np.frombuffer(frames, dtype=np.int16)
    
    return audio_np


def add_silence(duration_ms, sample_rate):
    """
    Generate silence for natural pauses between sentences.
    
    Args:
        duration_ms: Length of silence in milliseconds
        sample_rate: Audio sample rate (typically 22050 or 16000 Hz)
    
    Returns:
        numpy array of zeros representing silence
    """
    samples = int(sample_rate * duration_ms / 1000)
    return np.zeros(samples, dtype=np.int16)


def synthesize_with_pauses(text, lang, output_file):
    """
    Generate natural-sounding speech with realistic pauses.
    
    This function:
    1. Splits text at punctuation marks (keeping the punctuation)
    2. Synthesizes each sentence fragment
    3. Adds appropriate silence based on punctuation type:
       - Questions/exclamations: 800ms pause
       - Periods: 500ms pause
       - Commas/semicolons: 300ms pause
    4. Concatenates everything into a single audio file
    
    Args:
        text: The story or text to narrate
        lang: Language code ('en' or 'fr')
        output_file: Path where the WAV file will be saved
    """
    text = text.strip()
    
    # Split text on punctuation while keeping the punctuation marks
    # e.g., "Hello, world!" → ["Hello", ",", " world", "!", ""]
    parts = re.split(r'([.!?,;])', text)
    
    # Get the voice for the specified language (unknown codes fall back to English)
    voice = voices.get(lang, en_voice)
    
    audio_chunks = []
    sample_rate = voice.config.sample_rate
    
    # Process (text, punctuation) pairs. re.split with one capture group
    # returns an odd number of parts, so the last part is trailing text
    # with no punctuation after it; include it instead of dropping it.
    for i in range(0, len(parts), 2):
        fragment = parts[i].strip()
        punct = parts[i+1] if i + 1 < len(parts) else ""
        
        # Skip empty fragments (blank text, or repeated marks such as "...")
        if not fragment:
            continue
        
        # Synthesize this fragment together with its punctuation
        audio = synthesize_to_array(voice, fragment + punct)
        audio_chunks.append(audio)
        
        # Add appropriate pause based on punctuation type
        if punct in ('?', '!'):
            # Longer pause for questions and exclamations
            audio_chunks.append(add_silence(800, sample_rate))
        elif punct == '.':
            # Medium pause for periods
            audio_chunks.append(add_silence(500, sample_rate))
        elif punct in (',', ';'):
            # Short pause for commas and semicolons
            audio_chunks.append(add_silence(300, sample_rate))
    
    # np.concatenate raises on an empty list, so fail with a clearer message
    if not audio_chunks:
        raise ValueError("No speakable text found in input")
    
    # Combine all audio chunks into one file
    final_audio = np.concatenate(audio_chunks)
    wavfile.write(output_file, sample_rate, final_audio)
    print(f"[INFO] Saved: {output_file}")


# Example usage
if __name__ == "__main__":
    story = """
    Once upon a time, in a land far away, there lived a curious cat.
    She loved exploring! But one day, something unexpected happened.
    Would she find her way home? The adventure was just beginning.
    """
    
    # Write into the output directory created at the top of the script
    synthesize_with_pauses(
        text=story,
        lang="en",
        output_file=os.path.join(TTS_OUTPUT_DIR, "story_narration.wav")
    )

How It Works

  1. Text Splitting: Uses regex to split on punctuation while preserving the punctuation marks
  2. Sentence-by-Sentence: Each sentence fragment is synthesized independently
  3. Smart Pausing: Different punctuation types trigger different pause durations:
    • ? ! → 800ms (dramatic effect)
    • . → 500ms (natural sentence break)
    • , ; → 300ms (breathing room)
  4. Audio Concatenation: All fragments and pauses are merged into a single WAV file
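
The splitting and pause-selection steps above can be tried without loading any voice model. The sketch below is illustrative only: `plan_pauses` and `PAUSE_MS` are hypothetical names, and it returns the fragments with their pause lengths instead of producing audio. It also keeps any trailing text that has no closing punctuation (with no pause after it), which is an assumption about the desired behaviour.

```python
import re

# Pause lengths from the article, in milliseconds
PAUSE_MS = {"?": 800, "!": 800, ".": 500, ",": 300, ";": 300}


def plan_pauses(text):
    """Return (fragment, pause_ms) pairs without synthesizing any audio."""
    parts = re.split(r"([.!?,;])", text.strip())
    plan = []
    # parts alternates text, punctuation, text, ...; the final item is
    # whatever follows the last punctuation mark (possibly empty)
    for i in range(0, len(parts), 2):
        fragment = parts[i].strip()
        punct = parts[i + 1] if i + 1 < len(parts) else ""
        if not fragment:
            continue
        plan.append((fragment + punct, PAUSE_MS.get(punct, 0)))
    return plan


if __name__ == "__main__":
    for fragment, pause in plan_pauses("She loved exploring! But one day, it rained."):
        print(f"{pause:>4} ms after: {fragment}")
```

Printing the plan for a sample paragraph is a quick way to spot fragments that split in unwanted places, such as decimals or abbreviations containing a period.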

Recommendations

Voice Selection

  • Medium models offer the best quality-to-size ratio (20-40MB)
  • High-quality models are better for long-form content but slower
  • Test different voices for your use case - some sound more natural than others

Performance Tips

  • Pre-load voice models at startup (they take ~1-2 seconds to initialize)
  • Cache generated audio for repeated content
  • Process long texts in chunks to avoid memory issues
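
Caching can be as simple as deriving the output file name from the input. Here is a minimal sketch, assuming a file-based cache keyed on a SHA-256 hash of the language and text; `cache_path`, `get_or_synthesize`, and the `./tts_cache` directory are hypothetical names chosen for illustration:

```python
import hashlib
import os


def cache_path(cache_dir, text, lang):
    """Build a deterministic file path from the language and text."""
    key = hashlib.sha256(f"{lang}\n{text}".encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{key}.wav")


def get_or_synthesize(text, lang, synthesize, cache_dir="./tts_cache"):
    """Return the path to cached audio, synthesizing only on a cache miss.

    `synthesize` is any callable taking (text, lang, output_file), such as
    synthesize_with_pauses from the script above.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_path(cache_dir, text, lang)
    if not os.path.exists(path):
        synthesize(text, lang, path)
    return path
```

Note that the key covers only the text and language. If pause durations or voice models change, the cache directory would need to be cleared, or those settings added to the hashed string.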

Pause Tuning

The pause durations (800ms, 500ms, 300ms) work well for storytelling, but you should adjust them based on:

  • Content type (audiobooks vs. news vs. conversations)
  • Voice speed and speaking rate
  • Target audience preferences
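
One way to make these durations adjustable is to group them into a small profile object and scale it per content type. This is a sketch under stated assumptions: `PauseProfile` and the `NEWS` preset are hypothetical, and the 0.6 factor is only a placeholder to tune by ear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PauseProfile:
    """Pause lengths in milliseconds for each punctuation group."""
    strong: int = 800   # ? and !
    full: int = 500     # .
    light: int = 300    # , and ;

    def scaled(self, factor):
        """Return a new profile with every pause multiplied by factor."""
        return PauseProfile(
            strong=round(self.strong * factor),
            full=round(self.full * factor),
            light=round(self.light * factor),
        )

    def pause_for(self, punct):
        """Look up the pause for a single punctuation character."""
        if punct in ("?", "!"):
            return self.strong
        if punct == ".":
            return self.full
        if punct in (",", ";"):
            return self.light
        return 0


STORYTELLING = PauseProfile()        # the article's defaults
NEWS = STORYTELLING.scaled(0.6)      # hypothetical brisker preset
```

A profile like this could be passed into the synthesis function in place of the hard-coded 800, 500, and 300 values.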

Production Considerations

  • Use environment variables for model paths to keep configuration flexible
  • Add error handling for missing models or invalid language codes
  • Consider async processing for web applications to avoid blocking
  • Implement a queue system for batch processing multiple stories
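
Two of these points can be sketched briefly: rejecting unknown language codes explicitly (the script above silently falls back to English), and moving the blocking synthesis call off the event loop. The names `UnsupportedLanguageError`, `pick_voice`, and `narrate_async` are hypothetical, and the sketch assumes Python 3.9+ for `asyncio.to_thread`:

```python
import asyncio


class UnsupportedLanguageError(ValueError):
    """Raised when no voice is configured for a language code."""


def pick_voice(voices, lang):
    """Return the voice for lang, failing loudly instead of falling back."""
    try:
        return voices[lang]
    except KeyError:
        supported = ", ".join(sorted(voices))
        raise UnsupportedLanguageError(
            f"No voice for '{lang}'. Supported: {supported}"
        ) from None


async def narrate_async(synthesize, text, lang, output_file):
    """Run a blocking synthesize(text, lang, output_file) in a worker thread."""
    # asyncio.to_thread (Python 3.9+) keeps the event loop responsive
    await asyncio.to_thread(synthesize, text, lang, output_file)
    return output_file
```

In a web handler, `await narrate_async(synthesize_with_pauses, text, lang, path)` would let other requests proceed while the audio is generated. Whether a thread is enough or a separate worker queue is needed depends on load.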

Real-World Usage

I use this approach in ZestLoop to generate bedtime stories, jokes, and affirmations. The natural pausing makes a huge difference in listener engagement and comprehension.

Try it yourself and experiment with pause timings to match your content style!
