Configure real-time voice agents

[This article is prerelease documentation and is subject to change.]

Configure a real-time voice agent by turning on real-time voice, setting core options, and then configuring features like topics, multilingual support, DTMF, and silence detection.

Set up and enable real-time voice

  1. Create a new agent, and configure its basic details, such as a descriptive name and the purpose of the agent in the description.

  2. Go to the agent's Voice settings and turn on Enable voice, and then in Voice type, select Realtime voice. Learn more in Choose how to handle speech.

    Important

    This is a one-time selection. After you select Realtime voice, you can't switch back to Basic voice. To use Basic voice, create a new agent.

    Screenshot of an agent's Settings, highlighting the Realtime voice setting.

  3. Go to the agent's Security settings and select No Authentication.

Knowledge and tools

You can configure your agent to use knowledge and tools. Learn more in Knowledge sources summary, Add tools to custom agents, and Tools, knowledge, MCP, and API.

Nested agents (preview)

For nested agents, real-time voice agents support child agents only.

Important

Ensure child agent descriptions don't overlap with topic descriptions. Explicitly define invocation order in the agent's instructions.

Topics

Real-time voice agents support all topics configured in Copilot Studio. Use topics to define deterministic behaviors such as greetings, business rules, and escalation, while the real-time voice model manages conversational responses at runtime. Learn more in Choose how to control the conversation.

Best practices when using topics with real-time voice agents

  • Use topics only when deterministic behavior is required.

  • Use static text in greeting messages for the fastest first response. Dynamic messages with variables and expressions increase initial latency.

  • Conversation Start is enabled by default. If you want the real-time voice model to handle the greeting, disable the Conversation Start topic; otherwise, the greeting configured in the Conversation Start topic plays instead of the voice model greeting.

  • Let the real-time voice model handle general conversation and follow-up questions.

  • The On Error topic should include an explicit action, such as transfer or end call. Message‑only error handling isn't sufficient. Without a deterministic next step, customers might experience silence or stuck calls, leading to confusion and poor voice experiences.

  • Use explicit topic and tool descriptions to declare ownership of data collection. Learn more in Write effective topic and tool descriptions.

Topic node support

The following tables describe topic node support in real-time voice agents:

Condition node

Feature | Support
If/Else branching | Supported
Power Fx expressions | Supported
Slot filling reprocessing | Supported

Message node

Feature | Support
Basic message | Supported
Message variations | Supported
Variable insertion | Supported
SSML | Supported
Rich Media/Adaptive Cards | Not applicable
Quick Replies | Not applicable

Question node

Feature | Support
Prompt text | Supported
Automatic hold | Not supported
Slot filling | Supported
Skip behavior/Greedy slot filling | Supported
Reprompt/Retry | Supported
Invalid response handling | Supported
Topic interruption | Supported
Barge‑in | Supported
Custom reprompt message | Supported
DTMF input | Supported
Silence detection | Supported

HTTP node

Feature | Support
HTTP methods: GET, POST, PUT, PATCH, DELETE | Supported
URL endpoints | Supported
Headers and payloads | Supported
Response parsing and schema | Supported
Variable mapping | Supported
Error handling | Supported

Tool node

Feature | Support
Power Automate flow | Supported
Tool invocation | Supported
Input/Output mapping | Supported
New prompt | Supported

Set variable value node

Feature | Support
Literal assignment | Supported
Expression assignment | Supported
Variable to variable | Supported

Topic management node

Feature | Support
End current topic | Supported
End all topics | Supported
End conversation | Supported
Go to step | Supported
User input for recognize intent | Supported
Go to another topic | Supported

Transfer conversation node

Feature | Support
Transfer to agent | Supported
External phone number transfer | Supported

Advanced

Feature | Support
Create generative answers | Supported

System trigger support

Trigger | Support | Details
On Conversation Start | Supported | Fires when a new conversation begins
On Talk to representative | Supported | Transfers to human agent
Unknown Topic/On Unknown Intent | Not supported | Fallback when no topic matches
OnSelectIntent (multiple topics matched) | Not supported | Disambiguation between similar topics
Reset Conversation (OnSystemRedirect) | Supported | Clears variables and restarts flow
On Sign in | Not supported |
Unknown DTMF key press | Supported | Unmapped keypad input
The agent chooses / User says a phrase | Supported | Agent selects topic based on intent
A message is received | Not supported | Increases latency
A custom client event occurs | Not supported | Only at session start
The conversation update | Not supported | Members added or removed, session changes
It's invoked | Not supported | Requires synchronous UI
It's redirected | Supported |
The User is inactive for a while/Silence detection | Supported | User inactive timeout
A plan completes | Not supported |
AI response generated | Not supported |
On Error | Supported | Handles orchestration errors

Pass variables between topics and the language model

When you use topics in a hybrid conversational flow, understanding how to pass variables between topics and the real-time language model is critical for building reliable, stateful interactions.

This functionality works through the following process:

  • Input variables defined on a topic are passed into the topic at invocation time, so the language model can provide structured data to the deterministic flow.

  • Output variables defined on a topic are returned to the language model at the end of topic execution as structured key-value pairs. The language model includes these outputs in the conversation context, and they can be referenced in subsequent turns.

  • Tool call outputs follow the same pattern: they're sent to the language model at the end of tool execution and remain available for future use within the conversation context window.

  • The language model is populated with conversational context, including tool call output key-value pairs. However, only explicitly defined output variables are returned as structured data. If you collect a value inside a topic, such as a verified account number, define that value as an output variable. Otherwise, the language model can't access it, and the agent might ask the caller for the same information again later.
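
The effect of declaring or omitting an output variable can be illustrated with a minimal Python sketch. This isn't Copilot Studio code. The TopicResult class, the run_verify_account_topic and merge_into_context functions, and all variable names are hypothetical and only model the behavior described in the preceding list.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TopicResult:
    """Hypothetical stand-in for what a topic returns to the language model."""
    outputs: Dict[str, str] = field(default_factory=dict)


def run_verify_account_topic(inputs: Dict[str, str], declare_output: bool) -> TopicResult:
    """Simulate a topic that collects and normalizes an account number."""
    # Input variable supplied at invocation time (hypothetical name).
    verified = inputs["spoken_account_number"].replace(" ", "")
    if declare_output:
        # Declared as an output: returned as a structured key-value pair.
        return TopicResult(outputs={"verified_account_number": verified})
    # Not declared as an output: the value never leaves the topic.
    return TopicResult()


def merge_into_context(context: Dict[str, str], result: TopicResult) -> Dict[str, str]:
    """Add topic outputs to the conversation context used in later turns."""
    return {**context, **result.outputs}


context = {"caller_intent": "check balance"}
inputs = {"spoken_account_number": "12 34 56"}

with_output = merge_into_context(context, run_verify_account_topic(inputs, True))
without_output = merge_into_context(context, run_verify_account_topic(inputs, False))

print("verified_account_number" in with_output)     # True
print("verified_account_number" in without_output)  # False
```

In the second case, the context contains nothing about the account number, which is the situation in which the agent might ask the caller for it again.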

Learn more in Manage topic inputs and outputs.

Multilingual support

Add all secondary languages you want. Localization strings aren't required for any real-time flows. However, for deterministic topic messages, you need to provide the translated messages. Learn more in Configure and create multilingual agents.

The real-time model can understand and respond in many languages. However, Microsoft doesn't formally validate all languages for general availability.

As of April 2026, the following languages are formally validated:

  • English (United States) (en-US)
  • Spanish (United States) (es-US)
  • Arabic
  • Portuguese (Brazil) (pt-BR)
  • Italian (Italy) (it-IT)
  • German (Germany) (de-DE)
  • Dutch (Netherlands) (nl-NL)
  • French (Canada) (fr-CA)

Microsoft continues to validate other languages and adds them after certification completion. You can add any language supported by Copilot Studio. However, languages that aren't fully certified for GA-level quality should be thoroughly tested before production deployment.

Important

Technical language capability doesn't equal a supported or certified language. If you intend to deploy agents in languages other than English, you should conduct extensive testing with real-world callers and call flows before going live.

Context variables

A real-time voice agent supports context variables that allow it to behave more intelligently by carrying information about the call, the caller, and the current conversation. The system automatically provides a limited set of call and conversation context to the model at runtime. This set includes:

Context variable | Description
Channel ID | Identifies the communication channel used for the interaction. This identification helps the model understand that the conversation is occurring over a speech-to-speech voice channel.
Caller phone number (ANI) | The originating phone number of the caller. The system can use this information to support caller identification scenarios.
Callee number (DNIS) | The destination phone number that the caller dialed. This information helps distinguish which business number or entry point was reached.
Conversation ID | A unique identifier for the active call session. Use this value to correlate and maintain continuity within a single conversation.
SIP headers | A set of supported SIP header key-value pairs associated with the call. The set includes only nonsensitive and supported headers.
Current date (UTC) | The current date in Coordinated Universal Time (UTC), provided at runtime to allow date-aware responses.
Current time (UTC) | The current time in Coordinated Universal Time (UTC), provided at runtime to allow time-aware responses.

For all other context variables, follow the steps outlined in Configure context variables for agents.

Agent voice

To select the voice your agent uses, select your agent and go to Settings > Voice > Select voice. Real-time voice agents support the following voices:

  • Alloy
  • Ash
  • Ballad
  • Coral
  • Echo
  • Sage
  • Shimmer
  • Verse
  • Marin
  • Cedar

Note

  • The agent voice is for your real-time voice agent and isn't the one configured in Copilot Service admin center.
  • To match your Dynamics system message voices with your real-time voice agent, use only the following supported voices: Alloy, Echo, Shimmer, or Ash.

Speech sensitivity

Speech sensitivity uses voice activity detection (VAD) to determine when the agent should respond after the caller finishes speaking.

Understanding VAD types

Real-time voice agents support two VAD approaches:

Screenshot of the Speech sensitivity dialog.

Server-based VAD - Based on sound (silence)

  • Detects end of speech based on audio signals (silence duration, volume)

  • Responds quickly once silence is detected

  • Deterministic, predictable behavior

  • Best for structured interactions, short responses, noisy environments

Semantic VAD - Based on sentence context

  • Determines turn completion based on meaning of what was said

  • Evaluates whether caller completed their thought

  • Adapts to natural pauses, filler words, trailing speech

  • Best for conversational interactions, complex questions, open-ended discussions

Select the right VAD

Use server-based VAD when all of the following conditions are true:

  • Interactions are structured (IVR-style menu navigation).

  • Responses are short and predictable.

  • Background noise is a concern (semantic VAD might wait too long).

  • You want fast, crisp turn-taking.

Use semantic VAD when all of the following conditions are true:

  • Conversations are open-ended.

  • Callers might hesitate or use filler words ("um", "let me think...").

  • Questions are complex (callers explain situations).

  • Natural conversation flow is prioritized.

Configure server-based VAD

Go to Settings > Voice > Phone Setup > Speech input > Sensitivity > Based on sound (silence).

Screenshot of the Speech sensitivity dialog when set to Based on sound (silence).

Parameter | Description | Default | Recommended range
Threshold | Sensitivity to voice versus noise (0-1 scale) | 0.6 | 0.5-0.7
Prefix padding (ms) | Audio captured before speech starts | 300 ms | 200-500 ms
Silence Duration (ms) | Silence required to end turn | 750 ms | 750-1000 ms
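
Before you change these values, it can help to check a planned configuration against the recommended ranges. The following Python sketch does that. The defaults and ranges are copied from the preceding table. The dictionary keys and the outside_recommended function are hypothetical names chosen for illustration and aren't part of any Copilot Studio API.

```python
from typing import Dict, List

# Defaults and recommended ranges copied from the parameter table.
DEFAULTS: Dict[str, float] = {
    "threshold": 0.6,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 750,
}
RECOMMENDED_RANGES = {
    "threshold": (0.5, 0.7),
    "prefix_padding_ms": (200, 500),
    "silence_duration_ms": (750, 1000),
}


def outside_recommended(overrides: Dict[str, float]) -> List[str]:
    """Return the names of settings that fall outside the recommended range."""
    settings = {**DEFAULTS, **overrides}  # unspecified values keep their default
    return sorted(
        name
        for name, (low, high) in RECOMMENDED_RANGES.items()
        if not low <= settings[name] <= high
    )


print(outside_recommended({}))
print(outside_recommended({"threshold": 0.3, "silence_duration_ms": 500}))
```

A value outside the recommended range isn't necessarily wrong. Treat the result as a prompt to test that setting with real calls.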

Threshold

  • Lower (0.3-0.4): More sensitive; picks up quiet speech, might trigger on background noise.

  • Higher (0.7-0.9): Less sensitive; requires louder speech, reduces false triggers.

  • Recommended: Start with 0.5; increase if background noise causes false triggers.

Prefix padding

  • Captures audio before speech detection (prevents cutting off first word).

  • Lower (200 ms): Faster response; might miss first syllable.

  • Higher (500 ms): Safer capture; slight delay.

  • Recommended: 300 ms (good balance).

Silence Duration

  • How long the caller must be silent before agent responds.

  • Lower (500 ms): Fast turn-taking; might interrupt if caller pauses mid-thought.

  • Higher (1000 ms): More patient; might feel slow.

  • Recommended: Start with 750 ms.
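
To see how the three parameters interact, consider the following simplified Python model of sound-based turn detection. It's a sketch for building intuition, not the algorithm the service uses. It assumes the audio is already reduced to one speech score per frame on the same 0-1 scale as Threshold. The ServerVadConfig class, the detect_turn function, and the 50 ms frame size are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ServerVadConfig:
    """Hypothetical holder for the three settings, using the documented defaults."""
    threshold: float = 0.6
    prefix_padding_ms: int = 300
    silence_duration_ms: int = 750


def detect_turn(
    levels: List[float], config: ServerVadConfig, frame_ms: int = 50
) -> Optional[Tuple[int, int]]:
    """Return (start_ms, end_ms) of the first caller turn, or None if no turn ends."""
    prefix_frames = config.prefix_padding_ms // frame_ms
    silence_frames_needed = config.silence_duration_ms // frame_ms
    start_frame: Optional[int] = None
    silent_run = 0
    for i, level in enumerate(levels):
        is_speech = level >= config.threshold
        if start_frame is None:
            if is_speech:
                # Prefix padding keeps audio from just before speech was detected.
                start_frame = max(0, i - prefix_frames)
            continue
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= silence_frames_needed:
            # Enough continuous silence: the turn is considered finished.
            return (start_frame * frame_ms, (i + 1) * frame_ms)
    return None


# Caller speaks, pauses for 600 ms mid-thought, speaks again, then stops.
caller_audio = [0.8] * 5 + [0.1] * 12 + [0.8] * 5 + [0.1] * 20

print(detect_turn(caller_audio, ServerVadConfig()))                         # (0, 1850)
print(detect_turn(caller_audio, ServerVadConfig(silence_duration_ms=500)))  # (0, 750)
```

With the 750 ms default, the 600 ms pause is tolerated and the whole utterance is captured. With 500 ms, the turn ends during the pause, which corresponds to the agent interrupting a caller who pauses mid-thought.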

Configure semantic VAD

Go to Settings > Voice > Phone Setup > Speech input > Sensitivity > Based on sentence context.

Screenshot of the Speech sensitivity dialog when set to Based on sentence context.

Parameter: Eagerness (how quickly the agent responds after semantic completion)

Setting | Behavior | Best for
Low | Waits longer, very patient | Callers who think out loud, frequent pauses
Medium | Balanced (default) | General conversations
High | Responds quickly | Fast-paced interactions, simple questions

DTMF configuration

Dual-Tone Multi-Frequency (DTMF) allows callers to enter information by using their phone keypad.

You can turn on DTMF for your agent at both the topic and global levels. To set it at the global level, select your agent and go to Settings > Voice > Conversation behavior > DTMF.

To set it per topic node, learn more in Turn on DTMF support for your voice-enabled agent.

To support reliable input completion, you can configure DTMF timing and termination behavior. This configuration includes an inter‑digit timeout, which defines how long the system waits between key presses, and an optional termination character (such as # or *) that explicitly signals the end of input. When you use a termination character, the system processes input immediately without waiting for a timeout.
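
The interplay between the inter-digit timeout and the termination character can be sketched in Python as follows. This is an illustrative model only. The collect_dtmf function, its parameters, and the 2,000 ms timeout used in the example are assumptions and don't reflect documented defaults or a Copilot Studio API.

```python
from typing import List, Optional, Tuple


def collect_dtmf(
    key_presses: List[Tuple[int, str]],
    inter_digit_timeout_ms: int,
    termination_char: Optional[str] = None,
) -> Tuple[str, Optional[int], str]:
    """Model DTMF collection from (timestamp_ms, key) pairs sorted by time.

    Returns (digits, completed_at_ms, reason).
    """
    digits: List[str] = []
    last_ms: Optional[int] = None
    for timestamp_ms, key in key_presses:
        if last_ms is not None and timestamp_ms - last_ms > inter_digit_timeout_ms:
            # The gap since the previous key exceeded the inter-digit timeout.
            return "".join(digits), last_ms + inter_digit_timeout_ms, "timeout"
        if termination_char is not None and key == termination_char:
            # The termination character ends input immediately.
            return "".join(digits), timestamp_ms, "terminator"
        digits.append(key)
        last_ms = timestamp_ms
    if last_ms is None:
        return "", None, "no_input"
    # No more keys: input completes once the timeout elapses after the last key.
    return "".join(digits), last_ms + inter_digit_timeout_ms, "timeout"


presses = [(0, "1"), (400, "2"), (800, "3"), (1000, "#")]

print(collect_dtmf(presses, 2000, termination_char="#"))  # ('123', 1000, 'terminator')
print(collect_dtmf(presses[:3], 2000))                    # ('123', 2800, 'timeout')
```

In this model, the termination character completes the same input 1,800 ms sooner than waiting for the timeout, which is the latency benefit the preceding paragraph describes.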

Silence detection

Silence detection allows real-time voice agents to recognize when a caller provides no input for a specified period. Set up silence detection as a global voice setting for the agent by going to Settings > Voice > Conversation behavior > Silence detection.

The silence timer starts when the agent finishes speaking and detects no speech or DTMF input from the caller. If the silence timeout is reached, the agent follows the configured silence detection topic.
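
The timer behavior can be modeled with a short Python sketch. The silence_trigger_time function and its parameters are hypothetical. The 7,000 ms default and the off-by-default state match the values stated in this section, and treating input that arrives exactly at the deadline as too late is an assumption of the sketch.

```python
from typing import Optional, Sequence


def silence_trigger_time(
    agent_finished_ms: int,
    caller_input_ms: Sequence[int],
    enabled: bool = False,
    timeout_ms: int = 7000,
) -> Optional[int]:
    """Return when the silence detection topic would run, or None if it wouldn't.

    caller_input_ms holds timestamps of caller speech or DTMF input.
    """
    if not enabled:
        # Off by default: the agent waits indefinitely without prompting.
        return None
    deadline = agent_finished_ms + timeout_ms
    for timestamp_ms in caller_input_ms:
        if agent_finished_ms <= timestamp_ms < deadline:
            # The caller responded in time, so the timer is cancelled.
            return None
    return deadline


print(silence_trigger_time(1000, [], enabled=True))      # 8000
print(silence_trigger_time(1000, [3000], enabled=True))  # None
print(silence_trigger_time(1000, []))                    # None
```

The last call shows the default configuration: with silence detection turned off, a silent caller never triggers the topic.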

Important

  • Silence detection isn't turned on by default. If the user doesn't speak, the agent waits indefinitely without prompting. Explicitly turn on silence detection and configure a reprompt message to handle silent callers.

  • The default silence detection timeout is 7,000 ms (7 seconds). Validate this value against your specific use case and caller environment before deploying to production. Seven seconds might feel too long for some callers or too short for others depending on the nature of the interaction, for example, complex questions or noisy environments. Test with real-world call data to determine the appropriate threshold for your scenario.

  • Before enabling silence detection, ensure that the behavior you configure in your silence detection topic (for example, Escalate, Hang Up, Reprompt) is intentional and appropriate for your use case. Misconfigured fallback behavior, such as inadvertently setting the fallback to Escalate when the intent is to hang up, or vice versa, can result in unexpected call outcomes.

Latency messaging

Add a latency message or music to your agent to play when background operations take longer than expected. To configure latency messaging, go to Settings > Voice > Conversation behavior > Latency messaging.

Screenshot of the Latency messaging dialog.

Real-time voice agent evaluation (preview)

Real-time voice agents support sending text during evaluation. However, audio processing isn't supported.