
Default Guardrail policies for Azure OpenAI

Azure OpenAI in Microsoft Foundry Models includes default safety policies applied to all models (excluding Azure OpenAI Whisper). These configurations provide you with a responsible experience by default, including content filtering models, blocklists, prompt transformation, content credentials, and other features.

Default safety policies aim to mitigate risks across categories such as hate and fairness, sexual, violence, self-harm, protected material content, and user prompt injection attacks. To learn more about guardrails and controls, see the documentation describing risk categories and severity levels.

All safety policies are configurable. To learn more about configurability, see the documentation on configuring Guardrails.

When content is detected that exceeds the severity threshold for a risk category, the API request is blocked and returns an error response indicating which category triggered the filter. This applies to both user prompts (input) and model completions (output).
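When input filtering blocks a request, the error body includes a `content_filter_result` object with a per-category verdict. The following sketch parses such a body to find the triggering categories; the payload here is illustrative (field names follow the Azure OpenAI content filter error format, but verify the exact shape against your API version):

```python
import json

# Illustrative error body returned when input content filtering blocks a
# request. Verify the exact payload shape against your API version.
error_body = json.loads("""
{
  "error": {
    "code": "content_filter",
    "message": "The response was filtered",
    "innererror": {
      "code": "ResponsibleAIPolicyViolation",
      "content_filter_result": {
        "hate": {"filtered": false, "severity": "safe"},
        "self_harm": {"filtered": false, "severity": "safe"},
        "sexual": {"filtered": false, "severity": "safe"},
        "violence": {"filtered": true, "severity": "medium"}
      }
    }
  }
}
""")

def triggered_categories(body: dict) -> list[str]:
    """Return the risk categories whose filter flag is set in an error body."""
    results = body["error"]["innererror"]["content_filter_result"]
    return [name for name, verdict in results.items() if verdict.get("filtered")]

print(triggered_categories(error_body))  # -> ['violence']
```

Logging the triggered categories rather than the raw error message makes it easier to decide whether a block reflects genuinely harmful input or an over-trigger that warrants a configurability review.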

Prerequisites

  • An Azure subscription with access to Azure OpenAI Service
  • Deployed Azure OpenAI models (excluding Whisper, which uses different safety configurations)

Text models

Text models in Azure OpenAI can take in and generate both text and code. These models use Azure's text content filters to detect and prevent harmful content. This filtering applies to both prompts and completions.

| Risk category | Applies to | Severity threshold |
|---|---|---|
| Hate and Fairness | Prompts and completions | Medium |
| Violence | Prompts and completions | Medium |
| Sexual | Prompts and completions | Medium |
| Self-Harm | Prompts and completions | Medium |
| User prompt injection attack (Jailbreak) | Prompts | N/A |
| Protected Material – Text | Completions | N/A |
| Protected Material – Code | Completions | N/A |

Vision models

Vision-enabled chat models

| Risk category | Applies to | Severity threshold |
|---|---|---|
| Hate and Fairness | Prompts and completions | Medium |
| Violence | Prompts and completions | Medium |
| Sexual | Prompts and completions | Medium |
| Self-Harm | Prompts and completions | Medium |
| Identification of Individuals and Inference of Sensitive Attributes | Prompts | N/A |
| User prompt injection attack (Jailbreak) | Prompts | N/A |

Image generation models

| Risk category | Applies to | Severity threshold |
|---|---|---|
| Hate and Fairness | Prompts and completions | Medium |
| Violence | Prompts and completions | Medium |
| Sexual | Prompts and completions | Medium |
| Self-Harm | Prompts and completions | Medium |
| Content Credentials | Completions | N/A |
| Deceptive Generation of Political Candidates | Prompts | N/A |
| Depictions of Public Figures | Prompts | N/A |
| User prompt injection attack (Jailbreak) | Prompts | N/A |
| Protected Material – Art and Studio Characters | Prompts | N/A |
| Profanity | Prompts | N/A |

Audio models

| Risk category | Applies to | Severity threshold |
|---|---|---|
| Hate and Fairness | Prompts and completions | Medium |
| Violence | Prompts and completions | Medium |
| Sexual | Prompts and completions | Medium |
| Self-Harm | Prompts and completions | Medium |
| User prompt injection attack (Jailbreak) | Prompts | N/A |
| Protected Material – Text | Completions | N/A |
| Protected Material – Code | Completions | N/A |

Severity levels

Guardrails and controls ensure that AI-generated outputs align with ethical guidelines and safety standards. Azure OpenAI provides Guardrail capabilities to help identify and mitigate risks associated with various categories of harmful or inappropriate content. This article outlines the key risk categories and their descriptions to help you better understand the built-in Guardrail system.
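Content severity is classified on a four-level scale (safe, low, medium, high), and a threshold filters everything at or above that level; for example, the Medium default in the tables above blocks content classified as medium or high severity. A minimal sketch of that comparison (the level names come from the content filtering documentation; the helper itself is illustrative):

```python
# The four severity levels used by Azure content filtering, ordered from
# least to most severe. The helper below is an illustrative sketch, not an SDK API.
SEVERITY_ORDER = ["safe", "low", "medium", "high"]

def is_filtered(detected: str, threshold: str) -> bool:
    """Content is filtered when its detected severity meets or exceeds the
    configured threshold (e.g. a Medium threshold blocks medium and high)."""
    return SEVERITY_ORDER.index(detected) >= SEVERITY_ORDER.index(threshold)

print(is_filtered("high", "medium"))  # True
print(is_filtered("low", "medium"))   # False
```

Raising the threshold to High permits medium-severity content; lowering it to Low blocks low-severity content as well.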

Note

The text content filtering models for the hate, sexual, violence, and self-harm categories are specifically trained and tested on the following languages: English, German, Japanese, Spanish, French, Italian, Portuguese, and Chinese. The service can work in many other languages, but quality might vary. In all cases, you should do your own testing to ensure that it works for your application.

Text content

Warning

The Severity definitions tab in this document contains examples of harmful content that may be disturbing to some readers.

Image content

Warning

The Severity definitions tab in this document contains examples of harmful content that may be disturbing to some readers.

Testing safety policies

To verify that default safety policies are active, send a test prompt that should trigger content filtering. For example:

```python
from openai import AzureOpenAI

# Assumes the openai Python SDK v1+; the client reads the API key, endpoint,
# and API version from the standard Azure OpenAI environment variables.
client = AzureOpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # your deployment name
    messages=[{"role": "user", "content": "[test prompt]"}]
)
```

If safety policies are active, the request is blocked and you'll receive an error response indicating which content filtering category was triggered.
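Output filtering behaves differently from input filtering: instead of raising an error, the completion ends with `finish_reason` set to `"content_filter"`. A hedged sketch of interpreting that field (the mapping below is illustrative; with a live response you would pass `response.choices[0].finish_reason`):

```python
def classify_finish(finish_reason: str) -> str:
    """Map a chat completion's finish_reason to what happened.
    'content_filter' means the model's output was filtered; 'stop' is a
    normal finish; 'length' means the token limit was reached."""
    return {
        "stop": "completed normally",
        "length": "truncated at token limit",
        "content_filter": "output blocked by content filtering",
    }.get(finish_reason, "unknown finish reason")

# With a live response:
#   classify_finish(response.choices[0].finish_reason)
print(classify_finish("content_filter"))  # output blocked by content filtering
```

Checking `finish_reason` on every completion lets you handle filtered outputs gracefully instead of returning a partial or empty message to the user.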

Next steps