[This article is prerelease documentation and is subject to change.]
Configure a real-time voice agent by turning on real-time voice, setting core options, and then configuring features like topics, multilingual support, DTMF, and silence detection.
Set up and enable real-time voice
Create a new agent, and configure its basic details, such as a descriptive name and the purpose of the agent in the description.
Go to the agent's Voice settings and turn on Enable voice, and then in Voice type, select Realtime voice. Learn more in Choose how to handle speech.
Important
This is a one-time selection. After you select Realtime voice, you can't switch back to Basic voice. To use Basic voice, create a new agent.
Go to the agent's Security settings and select No Authentication.
Knowledge and tools
You can configure your agent to use knowledge and tools. Learn more in Knowledge sources summary, Add tools to custom agents, and Tools, knowledge, MCP, and API.
Nested agents (preview)
Real-time voice agents only support child agents.
Important
Ensure child agent descriptions don't overlap with topic descriptions. Explicitly define invocation order in the agent's instructions.
Topics
Real-time voice agents support all topics configured in Copilot Studio. Use topics to define deterministic behaviors such as greetings, business rules, and escalation, while the real-time voice model manages conversational responses at runtime. Learn more in Choose how to control the conversation.
Best practices when using topics with real-time voice agents
Use topics only when deterministic behavior is required.
Use static text in greeting messages for the fastest first response. Dynamic messages with variables and expressions increase initial latency.
Conversation Start is enabled by default. If you want the real-time voice model to handle the greeting, disable the Conversation Start topic; otherwise, the greeting configured in the Conversation Start topic plays instead of the voice model greeting.
Let the real-time voice model handle general conversation and follow-up questions.
The On Error topic should include an explicit action, such as transfer or end call. Message‑only error handling isn't sufficient. Without a deterministic next step, customers might experience silence or stuck calls, leading to confusion and poor voice experiences.
Use explicit topic and tool descriptions to declare ownership of data collection. Learn more in Write effective topic and tool descriptions.
Topic node support
The following tables describe topic node support in real-time voice agents:
Condition node
| Feature | Support |
|---|---|
| If/Else branching | Supported |
| Power Fx expressions | Supported |
| Slot filling reprocessing | Supported |
Message node
| Feature | Support |
|---|---|
| Basic message | Supported |
| Message variations | Supported |
| Variable insertion | Supported |
| SSML | Supported |
| Rich Media/Adaptive Cards | Not Applicable |
| Quick Replies | Not Applicable |
Question node
| Feature | Support |
|---|---|
| Prompt text | Supported |
| Automatic hold | Not supported |
| Slot filling | Supported |
| Skip behavior/Greedy slot filling | Supported |
| Reprompt/Retry | Supported |
| Invalid response handling | Supported |
| Topic interruption | Supported |
| Barge‑in | Supported |
| Custom reprompt message | Supported |
| DTMF input | Supported |
| Silence detection | Supported |
HTTP node
| Feature | Support |
|---|---|
| HTTP methods: GET, POST, PUT, PATCH, DELETE | Supported |
| URL endpoints | Supported |
| Headers and payloads | Supported |
| Response parsing and schema | Supported |
| Variable mapping | Supported |
| Error handling | Supported |
Tool node
| Feature | Support |
|---|---|
| Power Automate flow | Supported |
| Tool invocation | Supported |
| Input/Output mapping | Supported |
| New prompt | Supported |
Set variable value node
| Feature | Support |
|---|---|
| Literal assignment | Supported |
| Expression assignment | Supported |
| Variable to variable | Supported |
Topic management node
| Feature | Support |
|---|---|
| End current topic | Supported |
| End all topics | Supported |
| End conversation | Supported |
| Go to step | Supported |
| User input for recognize intent | Supported |
| Go to another topic | Supported |
Transfer conversation node
| Feature | Support |
|---|---|
| Transfer to agent | Supported |
| External phone number transfer | Supported |
Advanced
| Feature | Support |
|---|---|
| Create generative answers | Supported |
System trigger support
| Trigger | Support | Details |
|---|---|---|
| On Conversation Start | Supported | Fires when a new conversation begins |
| On Talk to representative | Supported | Transfers to human agent |
| Unknown Topic/On Unknown Intent | Not supported | Fallback when no topic matches |
| OnSelectIntent (multiple topics matched) | Not supported | Disambiguation between similar topics |
| Reset Conversation (OnSystemRedirect) | Supported | Clears variables and restarts flow |
| On Sign in | Not supported | |
| Unknown DTMF key press | Supported | Unmapped keypad input |
| The agent chooses / User says a phrase | Supported | Agent selects topic based on intent |
| A message is received | Not supported | Increases latency |
| A custom client event occurs | Not supported | Only at session start |
| The conversation update | Not supported | Members added or removed, session changes |
| It's invoked | Not supported | Requires synchronous UI |
| It's redirected | Supported | |
| The User is inactive for a while/Silence detection | Supported | User inactive timeout |
| A plan completes | Not supported | |
| AI response generated | Not supported | |
| On Error | Supported | Handles orchestration errors |
Pass variables between topics and the language model
When you use topics in a hybrid conversational flow, understanding how to pass variables between topics and the real-time language model is critical for building reliable, stateful interactions.
This functionality works through the following process:
You pass input variables defined on a topic into the topic at invocation time, so the language model can provide structured data to the deterministic flow.
You return output variables defined on a topic to the language model at the end of topic execution as structured key-value pairs. The language model includes these outputs in the conversation context, and you can reference them in subsequent turns.
Tool call outputs follow the same pattern: you send outputs to the language model at the end of tool execution, and they're available for future use within the conversation context window.
The language model receives conversational context, including tool call output key-value pairs. However, only explicitly defined output variables are returned as structured data. If you collect a value inside a topic, such as a verified account number, define that value as an output. Otherwise, the language model can't access it, and the agent might ask the caller for the same information again later.
Learn more in Manage topic inputs and outputs.
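The following Python sketch illustrates the output rule described above. It's a simplified model, not the Copilot Studio runtime: the names `run_topic`, `verify_account_topic`, and the variable names are hypothetical and exist only for illustration. It shows that a value collected inside a topic reaches the conversation context only when it's declared as an output.

```python
from typing import Any, Callable, Dict, Set


def run_topic(
    topic: Callable[[Dict[str, Any]], Dict[str, Any]],
    declared_outputs: Set[str],
    inputs: Dict[str, Any],
) -> Dict[str, Any]:
    """Run a topic and return only its explicitly declared output variables."""
    topic_variables = topic(inputs)
    return {
        name: value
        for name, value in topic_variables.items()
        if name in declared_outputs
    }


def verify_account_topic(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical deterministic topic that normalizes an account number."""
    account = inputs["spoken_account_number"].replace(" ", "")
    # internal_retry_count is a topic-local variable that is never declared
    # as an output, so the language model doesn't receive it.
    return {"verified_account_number": account, "internal_retry_count": 0}


# The language model passes structured input into the topic at invocation time.
conversation_context: Dict[str, Any] = {}
outputs = run_topic(
    verify_account_topic,
    declared_outputs={"verified_account_number"},
    inputs={"spoken_account_number": "12 34 56"},
)
# Declared outputs are added to the context for later turns.
conversation_context.update(outputs)
```

In this sketch, `conversation_context` ends up holding `verified_account_number` but not `internal_retry_count`, which mirrors why an undeclared value can lead the agent to ask for the same information again.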
Multilingual support
Add all secondary languages you want. Localization strings aren't required for any real-time flows. However, for deterministic topic messages, you need to provide the translated messages. Learn more in Configure and create multilingual agents.
The real-time model can understand and respond in many languages. However, Microsoft doesn't formally validate all languages for general availability.
As of April 2026, the following languages are formally validated:
- English (United States) (en-US)
- Spanish (United States) (es-US)
- Arabic
- Portuguese (Brazil) (pt-BR)
- Italian (Italy) (it-IT)
- German (Germany) (de-DE)
- Dutch (Netherlands) (nl-NL)
- French (Canada) (fr-CA)
Microsoft continues to validate other languages and adds them after certification completion. You can add any language supported by Copilot Studio. However, languages that aren't fully certified for GA-level quality should be thoroughly tested before production deployment.
Important
Technical language capability doesn't equal a supported or certified language. If you intend to deploy agents in languages other than English, you should conduct extensive testing with real-world callers and call flows before going live.
Context variables
A real-time voice agent supports context variables that allow it to behave more intelligently by carrying information about the call, the caller, and the current conversation. The system automatically provides a limited set of call and conversation context to the model at runtime. This set includes:
| Context variable | Description |
|---|---|
| Channel ID | Identifies the communication channel used for the interaction. This identification helps the model understand that the conversation is occurring over a speech-to-speech voice channel. |
| Caller phone number (ANI) | The originating phone number of the caller. The system can use this information to support caller identification scenarios. |
| Callee number (DNIS) | The destination phone number that the caller dialed. This information helps distinguish which business number or entry point was reached. |
| Conversation ID | A unique identifier for the active call session. Use this value to correlate and maintain continuity within a single conversation. |
| SIP headers | A set of supported SIP header key-value pairs associated with the call. The set includes only nonsensitive and supported headers. |
| Current date (UTC) | The current date in Coordinated Universal Time (UTC), provided at runtime to allow date-aware responses. |
| Current time (UTC) | The current time in Coordinated Universal Time (UTC), provided at runtime to allow time-aware responses. |
For all other context variables, follow the steps outlined in Configure context variables for agents.
Agent voice
To select the voice your agent uses, select your agent and go to Settings > Voice > Select voice. Real-time voice agents support the following voices:
- Alloy
- Ash
- Ballad
- Coral
- Echo
- Sage
- Shimmer
- Verse
- Marin
- Cedar
Note
- The agent voice applies only to your real-time voice agent. It's separate from the voice configured in Copilot Service admin center.
- To match your Dynamics system message voices with your real-time voice agent, use only the following supported voices: Alloy, Echo, Shimmer, or Ash.
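As a quick illustration, the following Python sketch checks a voice name against the two lists in this section. The function name `check_voice` and its return strings are hypothetical. The voice names come from the lists above.

```python
# Voices listed as supported by real-time voice agents in this article.
SUPPORTED_VOICES = {
    "Alloy", "Ash", "Ballad", "Coral", "Echo",
    "Sage", "Shimmer", "Verse", "Marin", "Cedar",
}

# Voices that can also match Dynamics system message voices.
DYNAMICS_MATCHING_VOICES = {"Alloy", "Echo", "Shimmer", "Ash"}


def check_voice(name: str) -> str:
    """Classify a voice name (hypothetical helper, for illustration only)."""
    if name not in SUPPORTED_VOICES:
        return "not supported"
    if name in DYNAMICS_MATCHING_VOICES:
        return "supported, matches Dynamics system message voices"
    return "supported, no matching Dynamics system message voice"
```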
Speech sensitivity
The speech sensitivity setting uses voice activity detection (VAD) to determine when the agent should respond after the caller finishes speaking.
Understanding VAD types
Real-time voice agents support two VAD approaches:
Server-based VAD - Based on sound (silence)
- Detects end of speech based on audio signals (silence duration, volume)
- Responds quickly once silence is detected
- Deterministic, predictable behavior
- Best for: Structured interactions, short responses, noisy environments
Semantic VAD - Based on sentence context
- Determines turn completion based on meaning of what was said
- Evaluates whether caller completed their thought
- Adapts to natural pauses, filler words, trailing speech
- Best for: Conversational interactions, complex questions, open-ended discussions
Select the right VAD
Use server-based VAD when all of the following conditions are true:
- Interactions are structured (IVR-style menu navigation).
- Responses are short and predictable.
- Background noise is a concern (semantic VAD might wait too long).
- You want fast, crisp turn-taking.
Use semantic VAD when all of the following conditions are true:
- Conversations are open-ended.
- Callers might hesitate or use filler words ("um", "let me think...").
- Questions are complex (callers explain situations).
- Natural conversation flow is prioritized.
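The two checklists above can be expressed as a small decision helper. The following Python sketch is only an illustration: the `CallProfile` class, its field names, and `suggest_vad` are hypothetical, and the fallback for mixed profiles is an assumption rather than product guidance.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    """Hypothetical description of the expected call traffic."""
    structured_interactions: bool
    short_predictable_responses: bool
    background_noise_concern: bool
    wants_fast_turn_taking: bool
    open_ended_conversations: bool
    callers_hesitate: bool
    complex_questions: bool
    natural_flow_prioritized: bool


def suggest_vad(profile: CallProfile) -> str:
    """Return the Sensitivity option that the checklists point to."""
    if all([
        profile.structured_interactions,
        profile.short_predictable_responses,
        profile.background_noise_concern,
        profile.wants_fast_turn_taking,
    ]):
        return "Based on sound (silence)"
    if all([
        profile.open_ended_conversations,
        profile.callers_hesitate,
        profile.complex_questions,
        profile.natural_flow_prioritized,
    ]):
        return "Based on sentence context"
    # Assumption: a mixed profile has no clear answer, so test both options.
    return "No clear match; test both options"
```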
Configure server-based VAD
Go to Settings > Voice > Phone Setup > Speech input > Sensitivity > Based on sound (silence).
| Parameter | Description | Default | Recommended range |
|---|---|---|---|
| Threshold | Sensitivity to voice versus noise (0-1 scale) | 0.6 | 0.5-0.7 |
| Prefix padding (ms) | Audio captured before speech starts | 300 ms | 200-500 ms |
| Silence Duration (ms) | Silence required to end turn | 750 ms | 750-1000 ms |
Threshold
- Lower (0.3-0.4): More sensitive; picks up quiet speech, but might trigger on background noise.
- Higher (0.7-0.9): Less sensitive; requires louder speech and reduces false triggers.
- Recommended: Start with 0.5; increase if background noise causes false triggers.
Prefix padding
- Captures audio before speech detection (prevents cutting off the first word).
- Lower (200 ms): Faster response; might miss the first syllable.
- Higher (500 ms): Safer capture; slight delay.
- Recommended: 300 ms (good balance).
Silence Duration
- How long the caller must be silent before the agent responds.
- Lower (500 ms): Fast turn-taking; might interrupt if the caller pauses mid-thought.
- Higher (1000 ms): More patient; might feel slow.
- Recommended: Start with 750 ms.
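When you track these settings outside the product, for example in a deployment checklist, a small validation helper can flag values that fall outside the recommended ranges. The following Python sketch uses the defaults and ranges from the table in this section. The class `ServerVadSettings`, the function `review_settings`, and the field names are hypothetical and don't correspond to a Copilot Studio API.

```python
from dataclasses import dataclass
from typing import List

# Recommended ranges taken from the parameter table in this section.
RECOMMENDED_RANGES = {
    "threshold": (0.5, 0.7),
    "prefix_padding_ms": (200, 500),
    "silence_duration_ms": (750, 1000),
}


@dataclass
class ServerVadSettings:
    """Hypothetical container; default values match the documented defaults."""
    threshold: float = 0.6
    prefix_padding_ms: int = 300
    silence_duration_ms: int = 750


def review_settings(settings: ServerVadSettings) -> List[str]:
    """Return warnings for values outside the recommended ranges."""
    if not 0 <= settings.threshold <= 1:
        raise ValueError("threshold must be on the 0-1 scale")
    warnings = []
    for name, (low, high) in RECOMMENDED_RANGES.items():
        value = getattr(settings, name)
        if not low <= value <= high:
            warnings.append(
                f"{name}={value} is outside the recommended range {low}-{high}"
            )
    return warnings
```

For example, `review_settings(ServerVadSettings(silence_duration_ms=500))` returns one warning, because 500 ms allows faster turn-taking but falls below the recommended range.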
Configure semantic VAD
Go to Settings > Voice > Phone Setup > Speech input > Sensitivity > Based on sentence context.
Parameter: Eagerness (how quickly the agent responds after semantic completion)
| Setting | Behavior | Best for |
|---|---|---|
| Low | Waits longer, very patient | Callers who think out loud, frequent pauses |
| Medium | Balanced (default) | General conversations |
| High | Responds quickly | Fast-paced interactions, simple questions |
DTMF configuration
Dual-Tone Multi-Frequency (DTMF) allows callers to enter information by using their phone keypad.
You can turn on DTMF for your agent at both the topic and global levels. To set it at the global level, select your agent and go to Settings > Voice > Conversation behavior > DTMF.
To set it per topic node, learn more in Turn on DTMF support for your voice-enabled agent.
To support reliable input completion, you can configure DTMF timing and termination behavior. This configuration includes an inter‑digit timeout, which defines how long the system waits between key presses, and an optional termination character (such as # or *) that explicitly signals the end of input. When you use a termination character, the system processes input immediately without waiting for a timeout.
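The following Python sketch models how an inter-digit timeout and a termination character interact. It's a simplified simulation over timestamped key presses, not the platform's implementation. The function `collect_dtmf`, its parameters, and the 3-second timeout used in the examples are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def collect_dtmf(
    key_presses: List[Tuple[float, str]],
    inter_digit_timeout_s: float,
    terminator: Optional[str] = "#",
) -> Tuple[str, str]:
    """Collect digits from (timestamp_seconds, key) pairs sorted by time.

    Returns the collected digits and the reason collection ended:
    "terminator" or "timeout".
    """
    digits: List[str] = []
    last_press: Optional[float] = None
    for timestamp, key in key_presses:
        # A gap longer than the inter-digit timeout ends the input before
        # this key press is considered.
        if last_press is not None and timestamp - last_press > inter_digit_timeout_s:
            return "".join(digits), "timeout"
        # The termination character ends the input immediately.
        if terminator is not None and key == terminator:
            return "".join(digits), "terminator"
        digits.append(key)
        last_press = timestamp
    # No more key presses: the timeout eventually elapses after the last digit.
    return "".join(digits), "timeout"
```

With a hypothetical 3-second timeout, the presses `1`, `2`, `#` at 0.0, 0.5, and 1.0 seconds complete as soon as `#` arrives, while the presses `1`, `2` followed by a 5-second pause complete only after the timeout elapses.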
Silence detection
Silence detection allows real-time voice agents to recognize when a caller provides no input for a specified period. Set up silence detection as a global voice setting for the agent by going to Settings > Voice > Conversation behavior > Silence detection.
The silence timer starts when the agent finishes speaking and detects no speech or DTMF input from the caller. If the silence timeout is reached, the agent follows the configured silence detection topic.
Important
Silence detection isn't turned on by default. If the user doesn't speak, the agent waits indefinitely without prompting. Explicitly turn on silence detection and configure a reprompt message to handle silent callers.
The default silence detection timeout is 7,000 ms (7 seconds). Validate this value against your specific use case and caller environment before deploying to production. Seven seconds might feel too long for some callers or too short for others depending on the nature of the interaction, for example, complex questions or noisy environments. Test with real-world call data to determine the appropriate threshold for your scenario.
Before enabling silence detection, ensure that the behavior you configure in your silence detection topic (for example, Escalate, Hang Up, Reprompt) is intentional and appropriate for your use case. Misconfigured fallback behavior, such as inadvertently setting the fallback to Escalate when the intent is to hang up, or vice versa, can result in unexpected call outcomes.
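The timer behavior described in this section can be sketched as follows. This Python example is a simplified model that assumes millisecond timestamps on a single clock. The function `silence_timeout_reached` and its parameter names are hypothetical. Only the 7,000 ms default comes from this article.

```python
from typing import Optional

DEFAULT_SILENCE_TIMEOUT_MS = 7000  # default documented in this section


def silence_timeout_reached(
    agent_finished_ms: int,
    first_caller_input_ms: Optional[int],
    now_ms: int,
    timeout_ms: int = DEFAULT_SILENCE_TIMEOUT_MS,
) -> bool:
    """Return True if the silence detection topic should run at now_ms.

    The timer starts when the agent finishes speaking. Speech or DTMF input
    that arrives before the deadline means the caller wasn't silent.
    """
    deadline_ms = agent_finished_ms + timeout_ms
    if first_caller_input_ms is not None and first_caller_input_ms < deadline_ms:
        return False
    return now_ms >= deadline_ms
```

A helper like this can be useful when you replay call logs to compare candidate timeout values before changing the setting.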
Latency messaging
Add a latency message or music to your agent for when background operations take longer than expected. To configure latency messaging, go to Settings > Voice > Conversation behavior > Latency messaging.
Real-time voice agent evaluation (preview)
Real-time voice agents support sending text during evaluation. However, audio processing isn't supported.