Configure real-time voice agents

[This article is prerelease documentation and is subject to change.]

Configure a real-time voice agent by turning on real-time voice, setting core options, and then configuring features like topics, multilingual support, DTMF, and silence detection.

Set up and enable real-time voice

Create a new agent, and configure its basic details, such as a descriptive name and the purpose of the agent in the description.
Go to the agent's Voice settings and turn on Enable voice, and then in Voice type, select Realtime voice. Learn more in Choose how to handle speech.

Important

This is a one-time selection. After you select Realtime voice, you can't switch back to Basic voice. To use Basic voice, create a new agent.

Go to the agent's Security settings and select No Authentication.

Knowledge and tools

You can configure your agent to use knowledge and tools. Learn more in Knowledge sources summary, Add tools to custom agents, and Tools, knowledge, MCP, and API.

Nested agents (preview)

Real-time voice agents support only child agents.

Important

Ensure child agent descriptions don't overlap with topic descriptions. Explicitly define invocation order in the agent's instructions.

Topics

Real-time voice agents support all topics configured in Copilot Studio. Use topics to define deterministic behaviors such as greetings, business rules, and escalation, while the real-time voice model manages conversational responses at runtime. Learn more in Choose how to control the conversation.

Best practices when using topics with real-time voice agents

Use topics only when deterministic behavior is required.
Use static text in greeting messages for the fastest first response. Dynamic messages with variables and expressions increase initial latency.
Conversation Start is enabled by default. If you want the real-time voice model to handle the greeting, disable the Conversation Start topic; otherwise, the greeting configured in the Conversation Start topic plays instead of the voice model greeting.
Let the real-time voice model handle general conversation and follow-up questions.
The On Error topic should include an explicit action, such as transfer or end call. Message‑only error handling isn't sufficient. Without a deterministic next step, customers might experience silence or stuck calls, leading to confusion and poor voice experiences.
Use explicit topic and tool descriptions to declare ownership of data collection. Learn more in Write effective topic and tool descriptions.

Topic node support

The following tables describe topic node support in real-time voice agents:

Condition node

Feature	Support
If/Else branching	Supported
Power Fx expressions	Supported
Slot filling reprocessing	Supported

Message node

Feature	Support
Basic message	Supported
Message variations	Supported
Variable insertion	Supported
SSML	Supported
Rich Media/Adaptive Cards	Not Applicable
Quick Replies	Not Applicable

Question node

Feature	Support
Prompt text	Supported
Automatic hold	Not supported
Slot filling	Supported
Skip behavior/Greedy slot filling	Supported
Reprompt/Retry	Supported
Invalid response handling	Supported
Topic interruption	Supported
Barge‑in	Supported
Custom reprompt message	Supported
DTMF input	Supported
Silence detection	Supported

HTTP node

Feature	Support
HTTP methods: GET, POST, PUT, PATCH, DELETE	Supported
URL endpoints	Supported
Headers and payloads	Supported
Response parsing and schema	Supported
Variable mapping	Supported
Error handling	Supported

Tool node

Feature	Support
Power Automate flow	Supported
Tool invocation	Supported
Input/Output mapping	Supported
New prompt	Supported

Set variable value node

Feature	Support
Literal assignment	Supported
Expression assignment	Supported
Variable to variable	Supported

Topic management node

Feature	Support
End current topic	Supported
End all topics	Supported
End conversation	Supported
Go to step	Supported
User input for recognize intent	Supported
Go to another topic	Supported

Transfer conversation node

Feature	Support
Transfer to agent	Supported
External phone number transfer	Supported

Advanced

Feature	Support
Create generative answers	Supported

System trigger support

Trigger	Support	Details
On Conversation Start	Supported	Fires when a new conversation begins
On Talk to representative	Supported	Transfers to human agent
Unknown Topic/On Unknown Intent	Not supported	Fallback when no topic matches
OnSelectIntent (multiple topics matched)	Not supported	Disambiguation between similar topics
Reset Conversation (OnSystemRedirect)	Supported	Clears variables and restarts flow
On Sign in	Not supported
Unknown DTMF key press	Supported	Unmapped keypad input
The agent chooses / User says a phrase	Supported	Agent selects topic based on intent
A message is received	Not supported	Increases latency
A custom client event occurs	Not supported	Only at session start
The conversation update	Not supported	Members added or removed, session changes
It's invoked	Not supported	Requires synchronous UI
It's redirected	Supported
The User is inactive for a while/Silence detection	Supported	User inactive timeout
A plan completes	Not supported
AI response generated	Not supported
On Error	Supported	Handles orchestration errors

Pass variables between topics and the language model

When you use topics in a hybrid conversational flow, understanding how to pass variables between topics and the real-time language model is critical for building reliable, stateful interactions.

This functionality works through the following process:

You pass input variables defined on a topic into the topic at invocation time, so the language model can provide structured data to the deterministic flow.
You return output variables defined on a topic to the language model at the end of topic execution as structured key-value pairs. The language model includes these outputs in the conversation context, and you can reference them in subsequent turns.
Tool call outputs follow the same pattern: you send outputs to the language model at the end of tool execution, and they're available for future use within the conversation context window.
The language model receives the conversational context, including tool call output key-value pairs. However, only explicitly defined output variables are returned as structured data. If you collect a value inside a topic, such as a verified account number, define that value as an output. Otherwise, the language model can't access it, and the agent might ask the caller for the same information again later.
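
The following minimal Python sketch illustrates why undeclared values don't reach the language model. It isn't Copilot Studio code: the `TopicResult` class, the `outputs_for_model` function, and the variable names are hypothetical and exist only to model the rule that only explicitly defined outputs are returned as key-value pairs.

```python
from dataclasses import dataclass


@dataclass
class TopicResult:
    """Hypothetical model of what a topic holds when it finishes running."""
    collected: dict         # every value gathered inside the topic
    declared_outputs: set   # names explicitly defined as topic outputs


def outputs_for_model(result: TopicResult) -> dict:
    """Return only the declared outputs as key-value pairs.

    Values that were collected but never declared as outputs are dropped,
    which is why the agent might ask the caller for them again.
    """
    return {
        name: value
        for name, value in result.collected.items()
        if name in result.declared_outputs
    }


result = TopicResult(
    collected={"account_number": "12345678", "verified": True, "pin_attempts": 1},
    declared_outputs={"account_number", "verified"},
)
context_update = outputs_for_model(result)
print(context_update)  # {'account_number': '12345678', 'verified': True}
```

In this sketch, `pin_attempts` was collected inside the topic but not declared as an output, so it never appears in the data handed back to the model.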

Learn more in Manage topic inputs and outputs.

Multilingual support

Add all secondary languages you want. Localization strings aren't required for any real-time flows. However, for deterministic topic messages, you need to provide the translated messages. Learn more in Configure and create multilingual agents.

The real-time model can understand and respond in many languages. However, Microsoft doesn't formally validate all languages for general availability.

As of April 2026, the following languages are formally validated:

English (United States) (en-US)
Spanish (United States) (es-US)
Arabic
Portuguese (Brazil) (pt-BR)
Italian (Italy) (it-IT)
German (Germany) (de-DE)
Dutch (Netherlands) (nl-NL)
French (Canada) (fr-CA)

Microsoft continues to validate other languages and adds them after certification completion. You can add any language supported by Copilot Studio. However, languages that aren't fully certified for GA-level quality should be thoroughly tested before production deployment.

Important

Technical language capability doesn't equal a supported or certified language. If you intend to deploy agents in languages other than English, you should conduct extensive testing with real-world callers and call flows before going live.

Context variables

A real-time voice agent supports context variables that allow it to behave more intelligently by carrying information about the call, the caller, and the current conversation. The system automatically provides a limited set of call and conversation context to the model at runtime. This set includes:

Context variable	Description
Channel ID	Identifies the communication channel used for the interaction. This identification helps the model understand that the conversation is occurring over a speech-to-speech voice channel.
Caller phone number (ANI)	The originating phone number of the caller. The system can use this information to support caller identification scenarios.
Callee number (DNIS)	The destination phone number that the caller dialed. This information helps distinguish which business number or entry point was reached.
Conversation ID	A unique identifier for the active call session. Use this value to correlate and maintain continuity within a single conversation.
SIP headers	A set of supported SIP header key-value pairs associated with the call. The set includes only nonsensitive and supported headers.
Current date (UTC)	The current date in Coordinated Universal Time (UTC), provided at runtime to allow date-aware responses.
Current time (UTC)	The current time in Coordinated Universal Time (UTC), provided at runtime to allow time-aware responses.

For all other context variables, follow the steps outlined in Configure context variables for agents.

Agent voice

To select the voice your agent uses, select your agent and go to Settings > Voice > Select voice. Real-time voice agents support the following voices:

Alloy
Ash
Ballad
Coral
Echo
Sage
Shimmer
Verse
Marin
Cedar

Note

The agent voice is for your real-time voice agent and isn't the one configured in Copilot Service admin center.
To match your Dynamics system message voices with your real-time voice agent, use only the following supported voices: Alloy, Echo, Shimmer, or Ash.

Speech sensitivity

Speech sensitivity uses voice activity detection (VAD) to determine when the agent should respond after the caller finishes speaking.

Understanding VAD types

Real-time voice agents support two VAD approaches:

Screenshot of the Speech sensitivity dialog.

Server-based VAD - Based on sound (silence)

Detects end of speech based on audio signals (silence duration, volume)
Responds quickly once silence is detected
Deterministic, predictable behavior
Best for structured interactions, short responses, noisy environments

Semantic VAD - Based on sentence context

Determines turn completion based on meaning of what was said
Evaluates whether caller completed their thought
Adapts to natural pauses, filler words, trailing speech
Best for conversational interactions, complex questions, open-ended discussions

Select the right VAD

Use server-based VAD when all of the following conditions are true:

Interactions are structured (IVR-style menu navigation).
Responses are short and predictable.
Background noise is a concern (semantic VAD might wait too long).
You want fast, crisp turn-taking.

Use semantic VAD when all of the following conditions are true:

Conversations are open-ended.
Callers might hesitate or use filler words ("um", "let me think...").
Questions are complex (callers explain situations).
Natural conversation flow is prioritized.

Configure server-based VAD

Go to Settings > Voice > Phone Setup > Speech input > Sensitivity > Based on sound (silence).

Screenshot of the Speech sensitivity dialog when set to Based on sound (silence).

Parameter	Description	Default	Recommended range
Threshold	Sensitivity to voice versus noise (0-1 scale)	0.6	0.5-0.7
Prefix padding (ms)	Audio captured before speech starts	300 ms	200-500 ms
Silence Duration (ms)	Silence required to end turn	750 ms	750-1000 ms

Threshold

Lower (0.3-0.4): More sensitive; picks up quiet speech, might trigger on background noise.
Higher (0.7-0.9): Less sensitive; requires louder speech, reduces false triggers.
Recommended: Start with 0.5; increase if background noise causes false triggers.

Prefix padding

Captures audio before speech detection (prevents cutting off first word).
Lower (200 ms): Faster response; might miss first syllable.
Higher (500 ms): Safer capture; slight delay.
Recommended: 300 ms (good balance).

Silence Duration

How long the caller must be silent before agent responds.
Lower (500 ms): Fast turn-taking; might interrupt if caller pauses mid-thought.
Higher (1000 ms): More patient; might feel slow.
Recommended: Start with 750 ms.
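
To see how the three parameters interact, consider the following simplified Python sketch of a sound-based detector. It's an illustration only, not the detector the service uses: the `ServerVadConfig` class, the `detect_turn` function, the per-frame loudness values, and the 250 ms frame size are all assumptions made for this example. The defaults match the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerVadConfig:
    threshold: float = 0.6          # loudness (0-1) that counts as speech
    prefix_padding_ms: int = 300    # audio kept from before speech starts
    silence_duration_ms: int = 750  # silence required to end the turn


def detect_turn(
    levels: list[float], frame_ms: int, cfg: ServerVadConfig
) -> Optional[tuple[int, int]]:
    """Return (start_ms, end_ms) of the first caller turn, or None.

    levels holds one normalized loudness value (0-1) per audio frame.
    """
    speech_start = None  # index of the first frame at or above threshold
    silent_ms = 0        # length of the current run of trailing silence
    for i, level in enumerate(levels):
        if level >= cfg.threshold:
            if speech_start is None:
                speech_start = i
            silent_ms = 0  # any speech resets the silence timer
        elif speech_start is not None:
            silent_ms += frame_ms
            if silent_ms >= cfg.silence_duration_ms:
                # Prefix padding pulls the start back to protect the first word.
                start_ms = max(0, speech_start * frame_ms - cfg.prefix_padding_ms)
                return start_ms, (i + 1) * frame_ms
    return None  # no completed turn in this audio


# Caller speaks for 1 s, pauses for 500 ms mid-thought, then speaks again.
levels = [0.1] * 2 + [0.8] * 4 + [0.2] * 2 + [0.8] * 2 + [0.1] * 4

print(detect_turn(levels, 250, ServerVadConfig()))                         # (200, 3250)
print(detect_turn(levels, 250, ServerVadConfig(silence_duration_ms=500)))  # (200, 2000)
```

With the default 750 ms silence duration, the 500 ms pause is tolerated and the turn ends after the caller finishes. Lowering the value to 500 ms ends the turn at the pause, which is the mid-thought interruption described above.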

Configure Semantic VAD

Go to Settings > Voice > Phone Setup > Speech input > Sensitivity > Based on sentence context.

Screenshot of the Speech sensitivity dialog when set to Based on sentence context.

Parameter: Eagerness (how quickly the agent responds after semantic completion)

Setting	Behavior	Best for
Low	Waits longer, very patient	Callers who think out loud, frequent pauses
Medium	Balanced (default)	General conversations
High	Responds quickly	Fast-paced interactions, simple questions

DTMF configuration

Dual-Tone Multi-Frequency (DTMF) allows callers to enter information by using their phone keypad.

You can turn on DTMF for your agent at both the topic and global levels. To set it at the global level, select your agent and go to Settings > Voice > Conversation behavior > DTMF.

To set it per topic node, learn more in Turn on DTMF support for your voice-enabled agent.

To support reliable input completion, you can configure DTMF timing and termination behavior. This configuration includes an inter‑digit timeout, which defines how long the system waits between key presses, and an optional termination character (such as # or *) that explicitly signals the end of input. When you use a termination character, the system processes input immediately without waiting for a timeout.
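
The interaction between the inter-digit timeout and the termination character can be sketched as follows. This Python example is a hypothetical model, not the platform's implementation: the `collect_dtmf` function, its parameters, and the timestamped key presses are assumptions used only to show the two ways input can complete.

```python
from typing import Optional


def collect_dtmf(
    key_presses: list[tuple[float, str]],
    inter_digit_timeout_s: float,
    terminator: Optional[str] = "#",
) -> tuple[str, str]:
    """Return (digits, reason) for a list of (timestamp_s, key) presses.

    reason is "terminator" when the termination character ended the input,
    or "timeout" when the caller stopped pressing keys for longer than the
    inter-digit timeout (or no more presses arrived).
    """
    digits = ""
    last_time = None  # time of the most recently accepted digit
    for timestamp, key in key_presses:
        if last_time is not None and timestamp - last_time > inter_digit_timeout_s:
            break  # gap too long: input was already complete before this key
        if terminator is not None and key == terminator:
            return digits, "terminator"  # processed immediately, no waiting
        digits += key
        last_time = timestamp
    return digits, "timeout"


# Caller ends input explicitly with "#".
print(collect_dtmf([(0.0, "4"), (0.8, "2"), (1.5, "7"), (2.0, "#")], 3.0))
# ('427', 'terminator')

# Caller pauses longer than the timeout; the late key isn't part of this input.
print(collect_dtmf([(0.0, "4"), (0.8, "2"), (6.0, "7")], 3.0))
# ('42', 'timeout')
```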

Silence detection

Silence detection allows real-time voice agents to recognize when a caller provides no input for a specified period. Set up silence detection as a global voice setting for the agent by going to Settings > Voice > Conversation behavior > Silence detection.

The silence timer starts when the agent finishes speaking and detects no speech or DTMF input from the caller. If the silence timeout is reached, the agent follows the configured silence detection topic.

Important

Silence detection isn't turned on by default. If the user doesn't speak, the agent waits indefinitely without prompting. Explicitly turn on silence detection and configure a reprompt message to handle silent callers.
The default silence detection timeout is 7,000 ms (7 seconds). Validate this value against your specific use case and caller environment before deploying to production. Seven seconds might feel too long for some callers or too short for others depending on the nature of the interaction, for example, complex questions or noisy environments. Test with real-world call data to determine the appropriate threshold for your scenario.
Before enabling silence detection, ensure that the behavior you configure in your silence detection topic (for example, Escalate, Hang Up, Reprompt) is intentional and appropriate for your use case. Misconfigured fallback behavior, such as inadvertently setting the fallback to Escalate when the intent is to hang up, or vice versa, can result in unexpected call outcomes.
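
The timer behavior described in this section can be modeled with a short Python sketch. The `silence_timeout_fires` function and its parameters are hypothetical and only illustrate the documented rules: the timer starts when the agent finishes speaking, detection is off unless you turn it on, and the default timeout is 7,000 ms.

```python
from typing import Optional

DEFAULT_SILENCE_TIMEOUT_MS = 7000  # default timeout stated in this section


def silence_timeout_fires(
    agent_finished_ms: int,
    first_input_ms: Optional[int],
    now_ms: int,
    enabled: bool = False,  # silence detection isn't turned on by default
    timeout_ms: int = DEFAULT_SILENCE_TIMEOUT_MS,
) -> bool:
    """Return True if the silence detection topic should have run by now_ms.

    first_input_ms is when the caller first spoke or pressed a key after
    the agent finished speaking, or None if there has been no input.
    """
    if not enabled:
        return False  # the agent waits indefinitely without prompting
    if first_input_ms is not None and first_input_ms - agent_finished_ms < timeout_ms:
        return False  # speech or DTMF arrived before the timeout
    return now_ms - agent_finished_ms >= timeout_ms


print(silence_timeout_fires(0, None, 8000))                # False: detection is off
print(silence_timeout_fires(0, None, 8000, enabled=True))  # True: 8 s of silence
print(silence_timeout_fires(0, 3000, 8000, enabled=True))  # False: caller replied at 3 s
```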

Latency messaging

Add a latency message or music to your agent for when background operations take longer than expected. To configure latency messaging, go to Settings > Voice > Conversation behavior > Latency messaging.

Real-time voice agent evaluation (preview)

Real-time voice agents support sending text during evaluation; however, audio processing isn't supported.


Last updated on 2026-04-23
