Building Multi-Speaker AI Games with Gemini Live

The Deep Sea Stories game, developed by Fishjam.io, shows one way to handle multi-speaker conversations in an AI game. It streams audio in real time through Gemini Live and places a custom Voice Activity Detection (VAD) filter in front of the model, so that several players in the same room can question the AI Riddle Master. This design works around the one-on-one chat architecture that most voice agents assume.

Why This Matters

Most AI voice agents are designed for one-on-one conversations, so they tend to perform poorly and respond slowly when several people speak in the same session. Deep Sea Stories addresses this with server-side VAD filtering, which controls whose audio reaches the AI agent so that it can process and respond to individual players’ inputs. The broader lesson is that voice interfaces for groups must handle multiple speakers explicitly, because the underlying models often assume a single one.
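To make the idea of a VAD filter concrete, here is a minimal sketch of an energy-based detector over 16-bit PCM frames. It is not the filter used in Deep Sea Stories, whose internals are not shown here; the names `rmsLevel` and `EnergyVad`, the threshold, and the hangover length are all illustrative assumptions.

```typescript
// Illustrative energy-based VAD; names and default values are assumptions, not the game's code.

/** Root-mean-square level of a 16-bit PCM frame, normalized to the range 0..1. */
function rmsLevel(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const normalized = frame[i] / 32768;
    sumSquares += normalized * normalized;
  }
  return Math.sqrt(sumSquares / frame.length);
}

/** Marks a stream as speech while frames are loud, plus a short hangover after they go quiet. */
class EnergyVad {
  private readonly threshold: number;
  private readonly hangoverFrames: number;
  private silentFrames = 0;
  private speaking = false;

  constructor(threshold = 0.02, hangoverFrames = 10) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  /** Feed one frame; returns true while the stream is considered speech. */
  process(frame: Int16Array): boolean {
    if (rmsLevel(frame) >= this.threshold) {
      this.speaking = true;
      this.silentFrames = 0;
    } else if (this.speaking) {
      // Tolerate brief pauses so a speaker is not cut off mid-sentence.
      this.silentFrames += 1;
      if (this.silentFrames > this.hangoverFrames) this.speaking = false;
    }
    return this.speaking;
  }
}
```

One detector instance per player keeps each stream's state separate. A production filter would likely use a trained VAD model rather than a fixed energy threshold, which is sensitive to background noise.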

Key Insights

The Gemini Live API provides a foundation for building voice AI agents, with real-time audio streaming and a Speech-to-Speech architecture.
Placing a custom VAD filter in front of the model can improve a multi-speaker AI interface by reducing latency and errors.
The Fishjam.io platform handles the real-time communication layer, streaming audio between the players and the AI agent.
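The server-side filtering idea can be sketched as a "floor" that only one player holds at a time: while the holder is speaking, other players' audio is not forwarded to the model. This is a hedged illustration, not the game's actual logic; `SpeakerFloor`, `shouldForward`, and the release timeout are hypothetical, and the `isSpeech` flag is assumed to come from whatever VAD runs on each player's stream.

```typescript
// Hypothetical single-speaker gate; not taken from the Deep Sea Stories source.
type PeerId = string;

class SpeakerFloor {
  private readonly releaseAfterMs: number;
  private holder: PeerId | null = null;
  private lastSpeechMs = 0;

  constructor(releaseAfterMs = 800) {
    this.releaseAfterMs = releaseAfterMs;
  }

  /** Returns true if this peer's current frame should be forwarded to the model. */
  shouldForward(peer: PeerId, isSpeech: boolean, nowMs: number): boolean {
    // Release the floor once the holder has been silent for long enough.
    if (this.holder !== null && nowMs - this.lastSpeechMs > this.releaseAfterMs) {
      this.holder = null;
    }
    // A free floor goes to the first peer detected speaking.
    if (this.holder === null && isSpeech) {
      this.holder = peer;
    }
    if (this.holder !== peer) return false;
    if (isSpeech) this.lastSpeechMs = nowMs;
    // The holder's short pauses are forwarded too, so the model hears natural gaps.
    return true;
  }
}
```

First-come-first-served is the simplest policy. Alternatives, such as preferring the loudest speaker or queuing the next player, trade fairness against responsiveness.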

Working Example

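// Imports the snippet relies on. The package paths are assumptions; verify them against your installed SDK versions.
import { FishjamClient } from "@fishjam-cloud/js-server-sdk";
import * as GeminiIntegration from "@fishjam-cloud/js-server-sdk/gemini";
import { Modality } from "@google/genai";

// GEMINI_MODEL is not defined in the original snippet; reading it from an environment variable is an assumption.
const GEMINI_MODEL = process.env.GEMINI_MODEL!;

// Initialize the Fishjam client and Gemini agent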
const fishjamClient = new FishjamClient({
  fishjamId: process.env.FISHJAM_ID!,
  managementToken: process.env.FISHJAM_TOKEN!,
});
const genAi = GeminiIntegration.createClient({
  apiKey: process.env.GOOGLE_API_KEY!,
});

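// Create the game room and Fishjam agent
const gameRoom = await fishjamClient.createRoom();
const { agent } = await fishjamClient.createAgent(gameRoom.id, {
  subscribeMode: "auto",
  output: GeminiIntegration.geminiInputAudioSettings,
});

// agentTrack is used below but never defined in the original snippet.
// Assumption: the agent publishes the Riddle Master's voice on an outgoing track created like this.
const agentTrack = agent.createTrack(GeminiIntegration.geminiOutputAudioSettings);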

// Configure and initialize the AI Riddle Master
const session = await genAi.live.connect({
  model: GEMINI_MODEL,
  config: {
    responseModalities: [Modality.AUDIO],
    systemInstruction: "here's the story: ..., and its solution: ... you should answer only yes or no questions about this story",
  },
  callbacks: {
    // Gemini -> Fishjam
    onmessage: (msg) => {
      if (msg.data) {
        // send Riddle Master's audio responses back to players
        const pcmData = Buffer.from(msg.data, "base64");
        agent.sendData(agentTrack.id, pcmData);
      }
      if (msg.serverContent?.interrupted) {
        console.log("Agent was interrupted by user.");
        // Clears the buffer on the Fishjam media server
        agent.interruptTrack(agentTrack.id);
      }
    },
  },
});
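The snippet above covers only the Gemini -> Fishjam direction. For the opposite direction, player audio has to be wrapped in the chunk shape that the Live session's `sendRealtimeInput` method accepts: base64-encoded PCM plus a MIME type such as `audio/pcm;rate=16000`. The helper below is a minimal sketch of that packaging step; `toRealtimeAudioChunk` is a hypothetical name, and the 16 kHz default assumes the agent's output format matches Gemini's expected input.

```typescript
// Hypothetical helper for the Fishjam -> Gemini direction; not part of either SDK.
interface RealtimeAudioChunk {
  audio: { data: string; mimeType: string };
}

/** Wraps raw 16-bit PCM bytes as a base64 audio chunk for a Gemini Live session. */
function toRealtimeAudioChunk(pcm: Uint8Array, sampleRate = 16000): RealtimeAudioChunk {
  // Build a binary string so btoa can encode it; Node code could use Buffer instead.
  let binary = "";
  for (let i = 0; i < pcm.length; i++) {
    binary += String.fromCharCode(pcm[i]);
  }
  return {
    audio: {
      data: btoa(binary),
      mimeType: `audio/pcm;rate=${sampleRate}`,
    },
  };
}

// Possible wiring (assumption: the Fishjam agent emits a "trackData" event carrying raw PCM bytes):
// agent.on("trackData", ({ data }) => session.sendRealtimeInput(toRealtimeAudioChunk(data)));
```

In a multi-speaker room, this is the point where a VAD filter would sit: only frames that pass the filter get packaged and sent to the session.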

Practical Applications

Use Case: Deep Sea Stories shows how a multi-speaker AI interface can work in gaming and interactive storytelling, where several players question the AI Riddle Master in one shared conversation.
Pitfall: Treating a group conversation like a one-on-one chat can lead to poor performance, latency issues, and a subpar user experience. Plan for multiple speakers from the start when designing an AI-powered voice interface.
