Foundry Agent Service is a fully managed platform in Microsoft Foundry for hosting, scaling, and securing AI agents built with any supported framework or model.
The context does not provide any information about specific outages or service-side changes on March 17, 2026, so no statement can be made about that incident itself.
However, the behavior described matches a transient, service-side issue affecting runs that include images, which then self-resolved. For production, the focus should be on:
- Using run status and error fields for observability
- Adding retry and backoff around `failed` runs
- Adding telemetry around image-tool usage and content filtering
1. Use run status and error fields
Runs in Foundry Agent Service have well-defined statuses:
- `queued`
- `in_progress`
- `requires_action`
- `completed`
- `failed`
- `cancelled`
- `expired`
For each run, log at least:
- `run.id`
- `run.status`
- `run.lastError?.code` and `run.lastError?.message` (if present)
- `thread.id`
- Agent id / version
- Whether the message contained images (boolean flag)
This allows correlation of failures specifically to image-bearing messages.
Reference: run status values and error handling are described in the Foundry Agent Service documentation for threads, runs, and messages.
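The fields above can be collected into one structured record per run. A minimal sketch follows; the run object shape mirrors the fields listed above, while the function name `buildRunLogRecord` and the `hasImage` flag are illustrative choices, not part of any SDK:

```javascript
// Build one structured telemetry record per run. The caller computes
// hasImage per message so failures can be correlated to image-bearing runs.
function buildRunLogRecord(run, hasImage) {
  return {
    runId: run.id,
    threadId: run.threadId,
    agentId: run.assistantId ?? "unknown",
    status: run.status,
    errorCode: run.lastError?.code ?? null,
    errorMessage: run.lastError?.message ?? null,
    hasImage,
  };
}
```

Emitting this record on every terminal status makes it trivial to filter failed runs where `hasImage` is true.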
2. Implement robust retry and backoff
Best practices for agents include:
- Always checking for `failed` status
- Implementing retry logic with exponential backoff
- Using appropriate polling intervals (start around 500 ms and increase for longer operations)
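The polling-interval guidance can be expressed as a simple growing schedule. In this sketch, the 500 ms starting point comes from the guidance above; the 1.5× growth factor and 5 s cap are illustrative assumptions:

```javascript
// Produce a polling schedule that starts at 500 ms and grows gradually,
// capped so long-running operations are still polled at a reasonable rate.
function pollIntervals(count, startMs = 500, factor = 1.5, capMs = 5000) {
  const intervals = [];
  let current = startMs;
  for (let i = 0; i < count; i++) {
    intervals.push(Math.min(Math.round(current), capMs));
    current *= factor;
  }
  return intervals;
}
```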
In the polling loop, instead of immediately throwing on `failed`, consider:
- Classifying failures as transient vs. permanent (based on `lastError.code` if available)
- Retrying transient failures a small number of times with exponential backoff
Example pattern (conceptual; `_pollRunUntilComplete` and `logFailure` are application-defined helpers):

```javascript
const MAX_RETRIES = 3;
const MAX_WAIT_TIME_MS = 60_000; // upper bound for polling a single run

// Promise-based sleep used between retries.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRetry(projectClient, threadId, agentId) {
  let attempt = 0;
  while (true) {
    const run = await projectClient.agents.runs.create(threadId, agentId);
    const { run: completedRun } = await _pollRunUntilComplete(
      projectClient, threadId, run, MAX_WAIT_TIME_MS);
    if (completedRun.status === "completed") {
      return completedRun;
    }
    if (completedRun.status === "failed") {
      attempt++;
      logFailure(completedRun); // log run.id, lastError, hasImage, etc.
      if (attempt > MAX_RETRIES) {
        throw new Error(`Agent run failed after ${attempt} attempts: ${completedRun.lastError?.message}`);
      }
      const delayMs = 2 ** attempt * 1000; // exponential backoff: 2 s, 4 s, 8 s
      await delay(delayMs);
      continue;
    }
    throw new Error(`Unexpected run status: ${completedRun.status}`);
  }
}
```
This pattern mitigates transient platform issues without masking persistent bugs.
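The transient-vs.-permanent classification can be sketched as a small predicate. The specific `lastError.code` values shown (`server_error`, `rate_limit_exceeded`) are assumptions; verify them against the error codes your runs actually return before relying on them:

```javascript
// Error codes treated as transient (safe to retry). Illustrative values;
// confirm against the lastError.code values observed in production.
const TRANSIENT_CODES = new Set(["server_error", "rate_limit_exceeded"]);

function isTransientFailure(lastError) {
  // With no error code at all, a single retry is usually the safer default.
  if (!lastError?.code) return true;
  return TRANSIENT_CODES.has(lastError.code);
}
```

Plugged into the loop above, `isTransientFailure(completedRun.lastError)` gates whether a `failed` run is retried or surfaced immediately.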
3. Telemetry for image usage and content filtering
Image-related failures can also be caused by:
- Missing or misconfigured image-generation deployments
- Missing headers for image tools
- Content filtering blocking the request
- Regional or model limitations
The image-generation troubleshooting guidance recommends:
- Verifying that the orchestrator model and `gpt-image-1` deployments exist in the same Foundry project
- Ensuring the correct header (`x-ms-oai-image-generation-deployment`) is present when using the image-generation tool
- Checking content-filtering logs when prompts do not produce images or are blocked
- Confirming regional/model support for the tool
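These checks can be front-loaded as a local configuration validation step before the first run. In this sketch, the config shape and the name `validateImageToolConfig` are hypothetical; only the `x-ms-oai-image-generation-deployment` header name comes from the guidance above:

```javascript
// Return a list of configuration problems; an empty array means the local
// config passes the basic checks from the troubleshooting guidance.
// cfg: { orchestratorDeployment, imageDeployment, headers } (hypothetical shape)
function validateImageToolConfig(cfg) {
  const issues = [];
  if (!cfg.orchestratorDeployment) issues.push("missing orchestrator deployment");
  if (!cfg.imageDeployment) issues.push("missing image-generation deployment");
  const header = cfg.headers?.["x-ms-oai-image-generation-deployment"];
  if (!header) {
    issues.push("missing x-ms-oai-image-generation-deployment header");
  } else if (cfg.imageDeployment && header !== cfg.imageDeployment) {
    issues.push("header does not match image deployment name");
  }
  return issues;
}
```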
For observability, log:
- Whether the message uses image tools or inline `image_url` content
- The model/deployment names used (orchestrator + image model)
- Region / Foundry project
- Any content-filtering outcomes if exposed in the error payload
4. Thread-level and run-level logging
Threads and runs are the core primitives for tracing conversation flows:
- Threads persist messages and maintain history
- Runs process the thread and append agent responses
Recommended telemetry:
- On message send: log `thread.id`, message type(s) (text, image), and size metadata (e.g., image byte length)
- On run start: log `run.id`, `thread.id`, agent id/version
- On each poll: log state transitions (`queued` → `in_progress` → `completed`/`failed`/`expired`)
- On completion: log duration and final status
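The per-poll transition logging can be sketched as a small tracker that records an entry only when the status actually changes, rather than one entry per poll (the class name is illustrative):

```javascript
// Record run status transitions; call observe(status) on every poll.
class RunStatusTracker {
  constructor() {
    this.last = null;
    this.transitions = [];
  }

  observe(status) {
    if (status !== this.last) {
      // First observation records the initial state; later ones record edges.
      this.transitions.push(this.last === null ? status : `${this.last} -> ${status}`);
      this.last = status;
    }
  }
}
```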
This aligns with best practices:
- Clean up threads and agents when no longer needed
- Handle errors gracefully with retries
- Use appropriate polling intervals
- Keep conversations concise for performance
5. Handling data-loss or corruption scenarios
In rare cases, failures can be caused by underlying data issues (for example, thread or agent records lost or corrupted). The guidance notes:
- Failures may appear as inability to invoke an agent or interact with a thread
- There are generally no specific recovery capabilities for data-level loss; affected agents/threads may be unrecoverable
For production:
- Log when a run fails in a way that suggests thread/agent corruption (e.g., repeated failures on the same thread while new threads succeed)
- Implement logic to recreate threads or agents when such patterns are detected
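The "repeated failures on the same thread while new threads succeed" pattern can be detected with a per-thread failure counter. A sketch follows; the class name and the threshold of 3 consecutive failures are illustrative assumptions:

```javascript
// Track consecutive failures per thread; a thread crossing the threshold
// while others keep succeeding is treated as suspect and recreated.
class ThreadHealthTracker {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.failures = new Map();
  }

  recordResult(threadId, succeeded) {
    if (succeeded) {
      this.failures.delete(threadId); // a success resets the streak
    } else {
      this.failures.set(threadId, (this.failures.get(threadId) ?? 0) + 1);
    }
  }

  shouldRecreate(threadId) {
    return (this.failures.get(threadId) ?? 0) >= this.threshold;
  }
}
```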
6. Image-tool specific telemetry
When using the image-generation tool (rather than inline `image_url` content only), recommended checks and logs include:
- Whether both orchestrator and `gpt-image-1` deployments exist in the project
- Whether the `x-ms-oai-image-generation-deployment` header matches the deployment name
- Whether the tool is supported in the region and with the chosen orchestrator model
- Whether content filtering blocked the request
Logging these attributes per run makes it easier to distinguish platform issues from configuration or policy issues.