In the rapidly evolving landscape of artificial intelligence, text-to-speech (TTS) technology has made significant strides, enabling more natural and expressive synthetic voices. A notable advancement in this domain is Dia, an open-source TTS model developed by Nari Labs. This article delves into the features, capabilities, and applications of Dia, highlighting its contributions to the field of TTS.
Image credit: https://github.com/nari-labs/dia
Dia is a 1.6 billion parameter TTS model designed to generate highly realistic dialogue directly from text prompts. Unlike traditional TTS systems that focus on reading isolated sentences, Dia excels in producing coherent, multi-speaker conversations with nuanced emotional expressions and nonverbal cues.
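To give a sense of what "dialogue directly from text prompts" looks like, here is a minimal sketch of a Dia-style script. It assumes the `[S1]`/`[S2]` speaker-tag convention and parenthesized nonverbal cues (such as `(laughs)`) shown in the project's README; the helper function is purely illustrative and not part of Dia's API.

```python
# Build a Dia-style dialogue prompt from (speaker, line) pairs.
# Assumes the [S1]/[S2] tag and (laughs)-style cue convention
# from the nari-labs/dia README; build_script is a hypothetical helper.

def build_script(turns):
    """Join (speaker, line) pairs into a single Dia text prompt."""
    return " ".join(f"[{speaker}] {line}" for speaker, line in turns)

script = build_script([
    ("S1", "Have you tried the new open-source TTS model? (laughs)"),
    ("S2", "Not yet. Does it really handle two speakers?"),
    ("S1", "It does, nonverbal cues and all. (coughs)"),
])
print(script)
```

The whole conversation, including who speaks and where the laughter falls, is encoded in one plain string, which is what makes Dia's input format easy to script programmatically.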
The story of Dia begins not with funding rounds or big tech backers, but with two engineers who fell in love with AI-generated podcasts. Inspired by NotebookLM's audio feature and frustrated by the lack of realism in existing TTS APIs, the duo behind Nari Labs set out to build something better. As co-founder Toby Kim puts it, "None of them sounded like real human conversation."
With zero outside funding but a lot of passion, they trained Dia using Google's Tensor Processing Units (TPUs) through the Google Research Cloud. The result? A model that can rival or surpass proprietary alternatives in emotional realism, dialogue flow, and nonverbal expression.
Dia isn't just another voice generator. It's a full-blown dialogue engine that can simulate multiple speakers, mood shifts, and background behaviors like laughter or coughing — all from raw text.
While the model currently requires a GPU to run, Nari Labs is working on a quantized CPU version and easier-to-use deployment tools.
To set up Dia, follow these steps:
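A typical setup, based on the commands in the project's GitHub README (exact commands may have changed since publication, and a CUDA-capable GPU is required):

```shell
# Clone the repository and launch the bundled Gradio app.
# Assumes Python 3.10+ and the uv package manager are installed;
# the first run downloads the model weights from Hugging Face.
git clone https://github.com/nari-labs/dia.git
cd dia
uv run app.py
```

Once the app starts, it prints a local URL you can open in a browser to reach the Gradio interface.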
Dia provides a Gradio-based web interface for users to input text scripts and generate speech. This interface allows for easy experimentation with the model's capabilities.
For integration into custom applications, Dia can be used as a Python library:
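The sketch below follows the usage pattern shown in the project's README; treat the exact class and method names as subject to change, and note that running it downloads the 1.6B-parameter weights and requires a GPU.

```python
# Generate dialogue audio with Dia as a library.
# Usage pattern based on the nari-labs/dia README; requires a GPU,
# and the first call downloads model weights from Hugging Face.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

text = "[S1] Dia generates dialogue straight from text. [S2] Including cues like this. (laughs)"
output = model.generate(text)

# Write the generated waveform to disk at the model's 44.1 kHz sample rate.
sf.write("dialogue.wav", output, 44100)
```

Because the input is a single tagged string, you can assemble scripts from any source, such as chat logs or LLM output, before handing them to `generate`.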
Nari Labs isn't shy about comparisons. On their Notion page, they showcase side-by-side examples pitting Dia against ElevenLabs Studio and the open Sesame CSM-1B model, and across multiple categories, Dia shines.
In one test involving only nonverbal cues like coughs and sighs, Dia was the only model to interpret and vocalize them accurately.
Dia is open-source but not lawless. Nari Labs prohibits use cases involving impersonation, misinformation, or illegal activities. They maintain a Discord community for support, feedback, and contributions, and invite responsible AI experimentation.
Despite being built by just two engineers — one full-time and one part-time — Dia is quickly gaining traction. Nari Labs is working on a consumer-friendly version for remixing and sharing dialogues, with early access available via email sign-up.
Support from the Google TPU Research Cloud, Hugging Face's ZeroGPU grant, and inspiration from models like SoundStorm and Parakeet helped pave the way for Dia's development. But its success is rooted in open collaboration, a relentless pursuit of quality, and a vision for truly human-like AI voices.
A Gradio demo is available for in-browser testing. Dia also ships as a Python library with a command-line interface, and comes with example code to help developers get started immediately.