MAI-Voice-1 in Azure Speech (preview)

Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

MAI-Voice-1 is a neural text-to-speech model available through Azure Speech in Foundry Tools in public preview. It's built on Microsoft's in-house speech foundation models and produces expressive, natural speech output with consistent voice persona quality.

Similar to Azure Neural HD voices, MAI-Voice-1 understands input text holistically and automatically adapts tone, emotion, and speaking style. This enables more human-like and conversational speech without requiring extensive manual tuning.

Speech offers MAI-Voice-1 as an advanced neural voice model optimized for expressive, conversational, and long-form scenarios.

| Model | Voice count | Key characteristics | Best for |
|---|---|---|---|
| MAI-Voice-1 | Six prebuilt English (US) voices | Emotionally rich, highly expressive, consistent persona quality, SSML style control | Conversational AI, creative applications, long-form narration |

Key features

| Key feature | Description |
|---|---|
| Human-like speech generation | MAI-Voice-1 generates highly natural and emotionally rich speech. The model interprets input text holistically and automatically adjusts emotion, pace, and rhythm without manual configuration. |
| Conversational expressiveness | MAI-Voice-1 is optimized for conversational scenarios, producing engaging and context-aware speech suitable for assistants and interactive experiences. |
| Emotion and style control | Developers can influence speaking style using SSML with mstts:express-as, enabling control over emotions such as joy, excitement, empathy, and more. |
| Consistent voice persona | MAI-Voice-1 maintains a stable and consistent voice persona across long-form content while still allowing expressive variation. |
| High fidelity audio | The model produces high-quality neural speech with natural prosody and clarity suitable for production-grade applications. |
| Real-time synthesis | MAI-Voice-1 supports real-time speech synthesis using the Speech SDK and APIs. |

Use MAI-Voice-1

MAI-Voice-1 uses the same Azure Speech SDKs and APIs as other Azure Neural and HD voices.

Prerequisites

  • An Azure account. Create one for free.
  • A Speech resource in the East US region. See Get started with text to speech to create one.
  • Your Speech resource key and region from the Keys and Endpoint page in the Azure portal.
  • The Azure Speech SDK installed: pip install azure-cognitiveservices-speech
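Rather than hardcoding credentials in source, you can read the resource key and region from environment variables before constructing `SpeechConfig`. The variable names `SPEECH_KEY` and `SPEECH_REGION` below are a convention used for illustration, not something the SDK requires:

```python
import os

def load_speech_credentials():
    """Read the Speech resource key and region from environment
    variables, failing fast with a clear error if the key is missing."""
    key = os.environ.get("SPEECH_KEY")
    region = os.environ.get("SPEECH_REGION", "eastus")
    if not key:
        raise RuntimeError(
            "Set the SPEECH_KEY environment variable to your Speech resource key."
        )
    return key, region
```

You can then pass the returned values to `speechsdk.SpeechConfig(subscription=key, region=region)` instead of embedding the key in code.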

Python example

The following Python code synthesizes speech with the en-us-Jasper:MAI-Voice-1 voice and saves it to output.mp3. Replace <key> with your Speech resource key.

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<key>",
    region="eastus"
)

audio_config = speechsdk.audio.AudioOutputConfig(
    filename="output.mp3"
)

speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio24Khz160KBitRateMonoMp3
)

synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)

ssml = """
<speak version='1.0'
       xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts'
       xml:lang='en-US'>
  <voice name='en-us-Jasper:MAI-Voice-1'>
    <mstts:express-as style="excitement">hello world.</mstts:express-as>
  </voice>
</speak>
"""

synthesizer.speak_ssml_async(ssml).get()

On success, an output.mp3 file containing the synthesized speech is saved to the current directory.

Reference: SpeechConfig | AudioOutputConfig | SpeechSynthesizer | speak_ssml_async

SSML examples

Basic SSML

The following SSML synthesizes a greeting using the en-us-Jasper:MAI-Voice-1 voice.

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
    <voice name='en-us-Jasper:MAI-Voice-1'>
        <mstts:express-as style="excitement">hello world.</mstts:express-as>
    </voice>
</speak>

Submit this SSML to the Speech REST API or SDK to receive synthesized audio.

Reference: Speech Synthesis Markup Language (SSML) | <voice> element
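When generating SSML programmatically, escape any user-supplied text so characters such as & and < don't produce invalid XML. The helper below is a sketch (the function name and default arguments are ours, not part of the SDK); it builds the same express-as document shown above using the standard library's xml.sax.saxutils.escape:

```python
from xml.sax.saxutils import escape

def build_ssml(text, voice="en-us-Jasper:MAI-Voice-1", style="excitement"):
    """Build an SSML document that wraps `text` in an
    mstts:express-as element for the given voice and style."""
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='http://www.w3.org/2001/mstts' "
        "xml:lang='en-US'>"
        f"<voice name='{voice}'>"
        f"<mstts:express-as style='{style}'>{escape(text)}</mstts:express-as>"
        "</voice></speak>"
    )
```

Pass the returned string to `speak_ssml_async`, or send it as the body of a Speech REST API request with `Content-Type: application/ssml+xml`.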

Personal voice (MAI-Voice-1 prompt mode)

To access personal voice (voice cloning) with MAI-Voice-1:

  1. Apply for gated access via the Azure AI Custom Neural Voice and Custom Avatar Limited Access Review.
  2. After approval, use the personal voice APIs; samples are available at cognitive-services-speech-sdk/samples/custom-voice.
  3. Upload an audio consent statement and a prompt to create a personal voice.
  4. Synthesize text with the created voice and the MAI-Voice-1 model using the following SSML:
<speak version='1.0'
       xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='http://www.w3.org/2001/mstts'
       xml:lang='en-US'>
  <voice name='MAI-voice-1'>
    <mstts:ttsembedding speakerProfileId='your speaker profile ID here'>
      I'm happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
    </mstts:ttsembedding>
  </voice>
</speak>
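As with the prebuilt voices, the personal-voice SSML can be assembled in code. The following sketch (the helper name is ours) substitutes a speaker profile ID into the ttsembedding document above; the resulting string is what you would pass to `speak_ssml_async`:

```python
from xml.sax.saxutils import escape

def build_personal_voice_ssml(speaker_profile_id, text):
    """Build MAI-Voice-1 personal-voice SSML that embeds the given
    speaker profile ID around the (XML-escaped) text to speak."""
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='http://www.w3.org/2001/mstts' "
        "xml:lang='en-US'>"
        "<voice name='MAI-voice-1'>"
        f"<mstts:ttsembedding speakerProfileId='{speaker_profile_id}'>"
        f"{escape(text)}"
        "</mstts:ttsembedding></voice></speak>"
    )
```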

Prebuilt voices

| Voice ID | Gender | Recommended use case |
|---|---|---|
| en-us-Jasper:MAI-Voice-1 | Male | General conversation, sales, emotional styles |
| en-us-June:MAI-Voice-1 | Female | General conversation, customer service, professional, emotional styles |
| en-us-Grant:MAI-Voice-1 | Male | General conversation, professional, emotional styles |
| en-us-Iris:MAI-Voice-1 | Female | General conversation, narration, emotional styles |
| en-us-Reed:MAI-Voice-1 | Male | General conversation |
| en-us-Joy:MAI-Voice-1 | Female | General conversation |
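To compare the prebuilt voices, one approach is to generate the same sentence once per voice ID and feed each document to `speak_ssml_async` in turn. A minimal sketch (the helper function and voice list below are ours, built from the table above):

```python
from xml.sax.saxutils import escape

# The six prebuilt MAI-Voice-1 voice IDs from the table above.
MAI_VOICES = [
    "en-us-Jasper:MAI-Voice-1",
    "en-us-June:MAI-Voice-1",
    "en-us-Grant:MAI-Voice-1",
    "en-us-Iris:MAI-Voice-1",
    "en-us-Reed:MAI-Voice-1",
    "en-us-Joy:MAI-Voice-1",
]

def ssml_for_each_voice(text):
    """Return one plain SSML document per prebuilt voice, keyed by voice ID."""
    template = (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xml:lang='en-US'>"
        "<voice name='{voice}'>{text}</voice></speak>"
    )
    return {v: template.format(voice=v, text=escape(text)) for v in MAI_VOICES}
```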

Usage: Available to third-party developers. Microsoft holds full licensing rights for commercial use.

Next steps