Foundry Agent Service is a fully managed platform in Microsoft Foundry for hosting, scaling, and securing AI agents built with any supported framework or model.
The context does not provide any information about specific outages or service-side changes on March 17, 2026, so no statement can be made about that incident itself.
However, the behavior described matches a transient, service-side issue affecting runs that include images, which then self-resolved. For production, the focus should be on:
- Using run status and error fields for observability
- Adding retry and backoff around `failed` runs
- Adding telemetry around image-tool usage and content filtering
1. Use run status and error fields
Runs in Foundry Agent Service have well-defined statuses:
- `queued`
- `in_progress`
- `requires_action`
- `completed`
- `failed`
- `cancelled`
- `expired`
For each run, log at least:
- `run.id`
- `run.status`
- `run.lastError?.code` and `run.lastError?.message` (if present)
- `thread.id`
- Agent id / version
- Whether the message contained images (boolean flag)
This allows correlation of failures specifically to image-bearing messages.
Reference: run status values and error handling are described in the Foundry Agent Service documentation for threads, runs, and messages.
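The fields above can be collected into one structured record per run. A minimal sketch follows; the run object shape mirrors the fields listed above, while the function name `buildRunLogRecord` and the `hasImage` flag are illustrative choices, not part of any SDK:

```javascript
// Build one structured telemetry record per run. The caller computes
// hasImage per message so failures can be correlated to image-bearing runs.
function buildRunLogRecord(run, hasImage) {
  return {
    runId: run.id,
    threadId: run.threadId,
    agentId: run.assistantId ?? "unknown",
    status: run.status,
    errorCode: run.lastError?.code ?? null,
    errorMessage: run.lastError?.message ?? null,
    hasImage,
  };
}
```

Emitting this record on every terminal status makes it trivial to filter failed runs where `hasImage` is true.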
2. Implement robust retry and backoff
Best practices for agents include:
- Always checking for `failed` status
- Implementing retry logic with exponential backoff
- Using appropriate polling intervals (start around 500 ms and increase for longer operations)
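The polling-interval guidance can be expressed as a simple growing schedule. In this sketch, the 500 ms starting point comes from the guidance above; the 1.5× growth factor and 5 s cap are illustrative assumptions:

```javascript
// Produce a polling schedule that starts at 500 ms and grows gradually,
// capped so long-running operations are still polled at a reasonable rate.
function pollIntervals(count, startMs = 500, factor = 1.5, capMs = 5000) {
  const intervals = [];
  let current = startMs;
  for (let i = 0; i < count; i++) {
    intervals.push(Math.min(Math.round(current), capMs));
    current *= factor;
  }
  return intervals;
}
```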
In the polling loop, instead of immediately throwing on `failed`, consider:
- Classifying failures as transient vs. permanent (based on `lastError.code` if available)
- Retrying transient failures a small number of times with exponential backoff
Example pattern (conceptual; `_pollRunUntilComplete` and `logFailure` are application-defined helpers):

```javascript
const MAX_RETRIES = 3;
const MAX_WAIT_TIME_MS = 60_000; // upper bound for polling a single run

// Promise-based sleep used between retries.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRetry(projectClient, threadId, agentId) {
  let attempt = 0;
  while (true) {
    const run = await projectClient.agents.runs.create(threadId, agentId);
    const { run: completedRun } = await _pollRunUntilComplete(
      projectClient, threadId, run, MAX_WAIT_TIME_MS);
    if (completedRun.status === "completed") {
      return completedRun;
    }
    if (completedRun.status === "failed") {
      attempt++;
      logFailure(completedRun); // log run.id, lastError, hasImage, etc.
      if (attempt > MAX_RETRIES) {
        throw new Error(`Agent run failed after ${attempt} attempts: ${completedRun.lastError?.message}`);
      }
      const delayMs = 2 ** attempt * 1000; // exponential backoff: 2 s, 4 s, 8 s
      await delay(delayMs);
      continue;
    }
    throw new Error(`Unexpected run status: ${completedRun.status}`);
  }
}
```
This pattern mitigates transient platform issues without masking persistent bugs.
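The transient-vs.-permanent classification can be sketched as a small predicate. The specific `lastError.code` values shown (`server_error`, `rate_limit_exceeded`) are assumptions; verify them against the error codes your runs actually return before relying on them:

```javascript
// Error codes treated as transient (safe to retry). Illustrative values;
// confirm against the lastError.code values observed in production.
const TRANSIENT_CODES = new Set(["server_error", "rate_limit_exceeded"]);

function isTransientFailure(lastError) {
  // With no error code at all, a single retry is usually the safer default.
  if (!lastError?.code) return true;
  return TRANSIENT_CODES.has(lastError.code);
}
```

Plugged into the loop above, `isTransientFailure(completedRun.lastError)` gates whether a `failed` run is retried or surfaced immediately.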
3. Telemetry for image usage and content filtering
Image-related failures can also be caused by:
- Missing or misconfigured image-generation deployments
- Missing headers for image tools
- Content filtering blocking the request
- Regional or model limitations
The image-generation troubleshooting guidance recommends:
- Verifying that the orchestrator model and `gpt-image-1` deployments exist in the same Foundry project
- Ensuring the correct header (`x-ms-oai-image-generation-deployment`) is present when using the image-generation tool
- Checking content-filtering logs when prompts do not produce images or are blocked
- Confirming regional/model support for the tool
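These checks can be front-loaded as a local configuration validation step before the first run. In this sketch, the config shape and the name `validateImageToolConfig` are hypothetical; only the `x-ms-oai-image-generation-deployment` header name comes from the guidance above:

```javascript
// Return a list of configuration problems; an empty array means the local
// config passes the basic checks from the troubleshooting guidance.
// cfg: { orchestratorDeployment, imageDeployment, headers } (hypothetical shape)
function validateImageToolConfig(cfg) {
  const issues = [];
  if (!cfg.orchestratorDeployment) issues.push("missing orchestrator deployment");
  if (!cfg.imageDeployment) issues.push("missing image-generation deployment");
  const header = cfg.headers?.["x-ms-oai-image-generation-deployment"];
  if (!header) {
    issues.push("missing x-ms-oai-image-generation-deployment header");
  } else if (cfg.imageDeployment && header !== cfg.imageDeployment) {
    issues.push("header does not match image deployment name");
  }
  return issues;
}
```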
For observability, log:
- Whether the message uses image tools or inline `image_url` content
- The model/deployment names used (orchestrator + image model)
- Region / Foundry project
- Any content-filtering outcomes if exposed in the error payload
4. Thread-level and run-level logging
Threads and runs are the core primitives for tracing conversation flows:
- Threads persist messages and maintain history
- Runs process the thread and append agent responses
Recommended telemetry:
- On message send: log `thread.id`, message type(s) (text, image), and size metadata (e.g., image byte length)
- On run start: log `run.id`, `thread.id`, agent id/version
- On each poll: log state transitions (`queued` → `in_progress` → `completed`/`failed`/`expired`)
- On completion: log duration and final status
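The per-poll transition logging can be sketched as a small tracker that records an entry only when the status actually changes, rather than one entry per poll (the class name is illustrative):

```javascript
// Record run status transitions; call observe(status) on every poll.
class RunStatusTracker {
  constructor() {
    this.last = null;
    this.transitions = [];
  }

  observe(status) {
    if (status !== this.last) {
      // First observation records the initial state; later ones record edges.
      this.transitions.push(this.last === null ? status : `${this.last} -> ${status}`);
      this.last = status;
    }
  }
}
```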
This aligns with best practices:
- Clean up threads and agents when no longer needed
- Handle errors gracefully with retries
- Use appropriate polling intervals
- Keep conversations concise for performance
5. Handling data-loss or corruption scenarios
In rare cases, failures can be caused by underlying data issues (for example, thread or agent records lost or corrupted). The guidance notes:
- Failures may appear as inability to invoke an agent or interact with a thread
- There are generally no specific recovery capabilities for data-level loss; affected agents/threads may be unrecoverable
For production:
- Log when a run fails in a way that suggests thread/agent corruption (e.g., repeated failures on the same thread while new threads succeed)
- Implement logic to recreate threads or agents when such patterns are detected
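The "repeated failures on the same thread while new threads succeed" pattern can be detected with a per-thread failure counter. A sketch follows; the class name and the threshold of 3 consecutive failures are illustrative assumptions:

```javascript
// Track consecutive failures per thread; a thread crossing the threshold
// while others keep succeeding is treated as suspect and recreated.
class ThreadHealthTracker {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.failures = new Map();
  }

  recordResult(threadId, succeeded) {
    if (succeeded) {
      this.failures.delete(threadId); // a success resets the streak
    } else {
      this.failures.set(threadId, (this.failures.get(threadId) ?? 0) + 1);
    }
  }

  shouldRecreate(threadId) {
    return (this.failures.get(threadId) ?? 0) >= this.threshold;
  }
}
```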
6. Image-tool specific telemetry
When using the image-generation tool (rather than inline `image_url` content only), recommended checks and logs include:
- Whether both orchestrator and `gpt-image-1` deployments exist in the project
- Whether the `x-ms-oai-image-generation-deployment` header matches the deployment name
- Whether the tool is supported in the region and with the chosen orchestrator model
- Whether content filtering blocked the request
Logging these attributes per run makes it easier to distinguish platform issues from configuration or policy issues.