In the rapidly evolving landscape of artificial intelligence, text-to-speech (TTS) technology has made significant strides, enabling more natural and expressive synthetic voices. A notable advancement in this domain is Dia, an open-source TTS model developed by Nari Labs. This article delves into the features, capabilities, and applications of Dia, highlighting its contributions to the field of TTS.
Image credit: https://github.com/nari-labs/dia
Dia is a 1.6 billion parameter TTS model designed to generate highly realistic dialogue directly from text prompts. Unlike traditional TTS systems that focus on reading isolated sentences, Dia excels in producing coherent, multi-speaker conversations with nuanced emotional expressions and nonverbal cues.
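To give a sense of what "dialogue directly from text prompts" looks like, here is a minimal sketch of a Dia-style script. It assumes the `[S1]`/`[S2]` speaker-tag convention and parenthesized nonverbal cues (such as `(laughs)`) shown in the project's README; the helper function is purely illustrative and not part of Dia's API.

```python
# Build a Dia-style dialogue prompt from (speaker, line) pairs.
# Assumes the [S1]/[S2] tag and (laughs)-style cue convention
# from the nari-labs/dia README; build_script is a hypothetical helper.

def build_script(turns):
    """Join (speaker, line) pairs into a single Dia text prompt."""
    return " ".join(f"[{speaker}] {line}" for speaker, line in turns)

script = build_script([
    ("S1", "Have you tried the new open-source TTS model? (laughs)"),
    ("S2", "Not yet. Does it really handle two speakers?"),
    ("S1", "It does, nonverbal cues and all. (coughs)"),
])
print(script)
```

The whole conversation, including who speaks and where the laughter falls, is encoded in one plain string, which is what makes Dia's input format easy to script programmatically.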
The story of Dia begins not with funding rounds or big tech backers, but with two engineers who fell in love with AI-generated podcasts. Inspired by NotebookLM's audio feature and frustrated by the lack of realism in existing TTS APIs, the duo behind Nari Labs set out to build something better. As co-founder Toby Kim puts it, "None of them sounded like real human conversation."
With zero outside funding but a lot of passion, they trained Dia using Google's Tensor Processing Units (TPUs) through the Google Research Cloud. The result? A model that can rival or surpass proprietary alternatives in emotional realism, dialogue flow, and nonverbal expression.
Dia isn't just another voice generator. It's a full-blown dialogue engine that can simulate multiple speakers, mood shifts, and background behaviors like laughter or coughing — all from raw text.
While the model currently requires a GPU to run, Nari Labs is working on a quantized CPU version and easier-to-use deployment tools.
To set up Dia, follow these steps:
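A typical setup, based on the commands in the project's GitHub README (exact commands may have changed since publication, and a CUDA-capable GPU is required):

```shell
# Clone the repository and launch the bundled Gradio app.
# Assumes Python 3.10+ and the uv package manager are installed;
# the first run downloads the model weights from Hugging Face.
git clone https://github.com/nari-labs/dia.git
cd dia
uv run app.py
```

Once the app starts, it prints a local URL you can open in a browser to reach the Gradio interface.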
Dia provides a Gradio-based web interface for users to input text scripts and generate speech. This interface allows for easy experimentation with the model's capabilities.
For integration into custom applications, Dia can be used as a Python library:
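The sketch below follows the usage pattern shown in the project's README; treat the exact class and method names as subject to change, and note that running it downloads the 1.6B-parameter weights and requires a GPU.

```python
# Generate dialogue audio with Dia as a library.
# Usage pattern based on the nari-labs/dia README; requires a GPU,
# and the first call downloads model weights from Hugging Face.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

text = "[S1] Dia generates dialogue straight from text. [S2] Including cues like this. (laughs)"
output = model.generate(text)

# Write the generated waveform to disk at the model's 44.1 kHz sample rate.
sf.write("dialogue.wav", output, 44100)
```

Because the input is a single tagged string, you can assemble scripts from any source, such as chat logs or LLM output, before handing them to `generate`.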
Nari Labs isn't shy about comparisons. On their Notion page, they showcase side-by-side examples pitting Dia against ElevenLabs Studio and the open Sesame CSM-1B model, and across multiple categories, Dia shines.
In one test involving only nonverbal cues like coughs and sighs, Dia was the only model to interpret and vocalize them accurately.
Dia is open-source but not lawless. Nari Labs prohibits use cases involving impersonation, misinformation, or illegal activities. They maintain a Discord community for support, feedback, and contributions, and invite responsible AI experimentation.
Despite being built by just two engineers — one full-time and one part-time — Dia is quickly gaining traction. Nari Labs is working on a consumer-friendly version for remixing and sharing dialogues, with early access available via email sign-up.
Support from the Google TPU Research Cloud, Hugging Face's ZeroGPU grant, and inspiration from models like SoundStorm and Parakeet helped pave the way for Dia's development. But its success is rooted in open collaboration, a relentless pursuit of quality, and a vision for truly human-like AI voices.
A Gradio demo is available for in-browser testing. Dia also ships as a Python library with a command-line interface, and comes with example code to help developers get started immediately.