AI Foundry Agent Run Failure When Sending Images (Node.js SDK)

Ramisetty Karuna Sree (MAQ LLC) 0 Reputation points Microsoft External Staff
2026-04-03T10:22:43.87+00:00

Hi everyone,

I’m working with AI Foundry (Node.js SDK) in an Azure Web App. My flow involves creating a thread, sending a user message (text + base64 images), and polling until the run completes.

Here’s the situation:

  • Until March 16, 2026, everything worked fine.
  • On March 17, 2026, the same code failed with:
      Error: Agent run failed: Sorry, something went wrong.
        at AIFoundryService._pollRunUntilComplete (...)
    
  • The failure only happened when an image was included in the message.
  • If I sent text-only messages, the agent worked fine.
  • From the next day onward, even with images, the same code started working again without changes.

Library used:

"@azure/ai-projects": "^1.0.1"

const { AIProjectClient } = require('@azure/ai-projects');
/** @param {AIProjectClient} projectClient */

Message Construction (text + image URL)

// Build message payload with text + image
const messageContent = [
  { type: 'text', text: customPrompt },
  {
    type: 'image_url',
    imageUrl: {
      url: `data:image/png;base64,${base64Image}`, // base64-encoded PNG
      detail: 'high'
    }
  }
];

Send Message and Run Agent

// Create a thread and send the user message
const thread = await projectClient.agents.threads.create();
await projectClient.agents.messages.create(thread.id, "user", messageContent);

// Start agent run
let run = await projectClient.agents.runs.create(thread.id, agent.id);

// Poll until run completes
const { run: completedRun } = await this._pollRunUntilComplete(
  projectClient,
  thread.id,
  run,
  maxWaitTime
);

// Retrieve agent response
const assistantResponse = await this._getAssistantResponse(projectClient, thread.id);

Polling Function

async _pollRunUntilComplete(projectClient, threadId, run, maxWaitTime = DEFAULT_TIMEOUT_MS) {
    const startTime = Date.now();
    
    while (run.status === "queued" || run.status === "in_progress") {
        if (Date.now() - startTime > maxWaitTime) {
            throw new Error(`Agent run timed out after ${maxWaitTime / 1000} seconds`);
        }
        await new Promise(resolve => setTimeout(resolve, POLL_INTERVAL_MS));
        run = await projectClient.agents.runs.get(threadId, run.id);
    }

    if (run.status === "failed") {
        throw new Error(`Agent run failed: ${run.lastError?.message}`);
    }

    return { run, durationMs: Date.now() - startTime };
}

This looks like a transient service-side anomaly rather than a bug in my implementation. I want to understand:

Were there any service-side changes or outages in AI Foundry around March 17, 2026 that could explain why image-based runs failed?

My goal is to ensure that in production environments, we don’t just see “something went wrong” but have full observability into the agent’s message flow, especially when images are involved.

Thanks in advance for any guidance on root cause analysis and recommended telemetry patterns!

Foundry Agent Service

A fully managed platform in Microsoft Foundry for hosting, scaling, and securing AI agents built with any supported framework or model

1 answer

  1. Q&A Assist
    2026-04-03T10:23:07.97+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    The context does not provide any information about specific outages or service-side changes on March 17, 2026, so no statement can be made about that incident itself.

    However, the behavior described matches a transient, service-side issue affecting runs that include images, which then self-resolved. For production, the focus should be on:

    1. Using run status and error fields for observability
    2. Adding retry and backoff around failed runs
    3. Adding telemetry around image-tool usage and content filtering

    1. Use run status and error fields

    Runs in Foundry Agent Service have well-defined statuses:

    • queued
    • in_progress
    • requires_action
    • completed
    • failed
    • cancelled
    • expired

    For each run, log at least:

    • run.id
    • run.status
    • run.lastError?.code and run.lastError?.message (if present)
    • thread.id
    • Agent id / version
    • Whether the message contained images (boolean flag)

    This allows correlation of failures specifically to image-bearing messages.
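A minimal sketch of such a log record follows. The helper name `buildRunLogEntry`, the output field names, and the sample ids are assumptions for illustration only; the `run.id`, `run.status`, and `run.lastError` fields are the ones already used in the question's polling code.

```javascript
// Hypothetical helper (not part of the SDK): flattens one run into a
// structured record so failures can be filtered by hasImage later.
function buildRunLogEntry({ run, threadId, agentId, hasImage }) {
  return {
    runId: run.id,
    status: run.status,
    errorCode: run.lastError?.code ?? null,       // null when no error is present
    errorMessage: run.lastError?.message ?? null,
    threadId,
    agentId,
    hasImage: Boolean(hasImage),
    loggedAt: new Date().toISOString(),
  };
}

// Usage with placeholder values: emit one JSON line per finished run.
const sampleEntry = buildRunLogEntry({
  run: {
    id: "run_123",
    status: "failed",
    lastError: { code: "server_error", message: "Sorry, something went wrong." },
  },
  threadId: "thread_abc",
  agentId: "agent_xyz",
  hasImage: true,
});
console.log(JSON.stringify(sampleEntry));
```

Emitting one JSON object per line keeps the records easy to query in whatever log sink the Web App already uses.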

    Reference: run status values and error handling are described in the Foundry Agent Service documentation for threads, runs, and messages.

    2. Implement robust retry and backoff

    Best practices for agents include:

    • Always checking for failed status
    • Implementing retry logic with exponential backoff
    • Using appropriate polling intervals (start around 500 ms and increase for longer operations)
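One way to apply the polling-interval advice is to grow the wait between polls up to a cap. The sketch below is illustrative: `nextPollIntervalMs`, the growth factor of 1.5, and the 5-second cap are assumptions, not values from the service documentation.

```javascript
// Hypothetical helper: returns the wait before the next poll.
// Starts at baseMs and grows geometrically, never exceeding maxMs.
function nextPollIntervalMs(pollCount, { baseMs = 500, factor = 1.5, maxMs = 5000 } = {}) {
  return Math.min(maxMs, Math.round(baseMs * factor ** pollCount));
}

// First few intervals for the default settings.
for (let i = 0; i < 4; i++) {
  console.log(`poll ${i}: wait ${nextPollIntervalMs(i)} ms`);
}
```

Inside a polling loop, this would replace a fixed `POLL_INTERVAL_MS` with `nextPollIntervalMs(pollCount++)`.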

    In the polling loop, instead of immediately throwing on failed, consider:

    • Classifying failures as transient vs. permanent (based on lastError.code if available)
    • Retrying transient failures a small number of times with exponential backoff

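    Example pattern (conceptual; pollRunUntilComplete stands for the _pollRunUntilComplete method from the question, and logFailure is a placeholder for your own logger):

    const MAX_RETRIES = 3;

    // Resolve after the given number of milliseconds.
    const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Placeholder logger: record run.id, status, and lastError; add hasImage etc. as needed.
    function logFailure(run) {
      console.error(JSON.stringify({ runId: run.id, status: run.status, lastError: run.lastError ?? null }));
    }

    async function runWithRetry(projectClient, threadId, agentId, pollRunUntilComplete, maxWaitTime) {
      let attempt = 0;
      while (true) {
        // Each attempt starts a new run on the same thread.
        const run = await projectClient.agents.runs.create(threadId, agentId);
        let completedRun;
        try {
          ({ run: completedRun } = await pollRunUntilComplete(projectClient, threadId, run, maxWaitTime));
        } catch (err) {
          // The question's poller throws on "failed"; if yours does too, handle the retry here instead.
          throw err;
        }

        if (completedRun.status === "completed") {
          return completedRun;
        }

        if (completedRun.status === "failed") {
          attempt++;
          logFailure(completedRun);
          if (attempt > MAX_RETRIES) {
            throw new Error(`Agent run failed after ${attempt} attempts: ${completedRun.lastError?.message}`);
          }
          const delayMs = 2 ** attempt * 1000; // exponential backoff: 2 s, 4 s, 8 s
          await delay(delayMs);
          continue;
        }

        // requires_action, cancelled, and expired are not retried here.
        throw new Error(`Unexpected run status: ${completedRun.status}`);
      }
    }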
    

    This pattern mitigates transient platform issues without masking persistent bugs.
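The transient vs. permanent classification mentioned above could be sketched as follows. The code strings in `TRANSIENT_ERROR_CODES` are assumptions for illustration; check the `lastError.code` values your failed runs actually return before relying on them. The helper name and the `retryUnknown` option are also hypothetical.

```javascript
// Assumption: these code strings are illustrative, not a documented list.
const TRANSIENT_ERROR_CODES = new Set(["server_error", "rate_limit_exceeded"]);

// Hypothetical helper: decides whether a failed run is worth retrying.
// Errors without a code are treated as permanent unless retryUnknown is set.
function isTransientFailure(lastError, { retryUnknown = false } = {}) {
  const code = lastError?.code;
  if (!code) {
    return retryUnknown;
  }
  return TRANSIENT_ERROR_CODES.has(code);
}

console.log(isTransientFailure({ code: "server_error", message: "Sorry, something went wrong." }));
console.log(isTransientFailure({ code: "some_other_code" }));
```

Failing fast on unknown codes is a deliberate default here: it avoids retrying a request that a configuration or policy problem would reject every time.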

    3. Telemetry for image usage and content filtering

    Image-related failures can also be caused by:

    • Missing or misconfigured image-generation deployments
    • Missing headers for image tools
    • Content filtering blocking the request
    • Regional or model limitations

    The image-generation troubleshooting guidance recommends:

    • Verifying orchestrator model and gpt-image-1 deployments exist in the same Foundry project
    • Ensuring the correct header (x-ms-oai-image-generation-deployment) is present when using the image-generation tool
    • Checking content filtering logs when prompts do not produce images or are blocked
    • Confirming regional/model support for the tool

    For observability, log:

    • Whether the message uses image tools or inline image_url content
    • The model/deployment names used (orchestrator + image model)
    • Region / Foundry project
    • Any content-filtering outcomes if exposed in the error payload
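For the inline `image_url` case from the question, the image flag and size metadata can be derived from the message payload itself. This is a sketch under assumptions: `describeMessageContent` and its output fields are hypothetical names, and the block shape (`type`, `imageUrl.url`) simply mirrors the question's `messageContent` array.

```javascript
// Hypothetical helper: summarizes a message payload for telemetry
// without logging the image data itself.
function describeMessageContent(messageContent) {
  const blocks = Array.isArray(messageContent)
    ? messageContent
    : [{ type: "text", text: String(messageContent) }];
  const summary = { textBlocks: 0, imageBlocks: 0, inlineImageBytes: 0, hasImage: false };
  const marker = ";base64,";

  for (const block of blocks) {
    if (block.type === "text") {
      summary.textBlocks++;
    } else if (block.type === "image_url") {
      summary.imageBlocks++;
      const url = block.imageUrl?.url ?? "";
      const idx = url.indexOf(marker);
      // Only data URLs carry the bytes inline; remote URLs contribute 0.
      if (url.startsWith("data:") && idx !== -1) {
        summary.inlineImageBytes += Buffer.from(url.slice(idx + marker.length), "base64").length;
      }
    }
  }
  summary.hasImage = summary.imageBlocks > 0;
  return summary;
}

// Usage with a tiny placeholder payload ("aGVsbG8=" is not a real PNG).
const contentSummary = describeMessageContent([
  { type: "text", text: "Describe this image" },
  { type: "image_url", imageUrl: { url: "data:image/png;base64,aGVsbG8=", detail: "high" } },
]);
console.log(JSON.stringify(contentSummary));
```

Decoding the base64 string allocates a buffer the size of the image; for very large images, estimating the size from the string length may be preferable.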

    4. Thread-level and run-level logging

    Threads and runs are the core primitives for tracing conversation flows:

    • Threads persist messages and maintain history
    • Runs process the thread and append agent responses

    Recommended telemetry:

    • On message send: log thread.id, message type(s) (text, image), and size metadata (e.g., image byte length)
    • On run start: log run.id, thread.id, agent id/version
    • On each poll: log state transitions (queued → in_progress → completed/failed/expired)
    • On completion: log duration and final status

    This aligns with best practices:

    • Clean up threads and agents when no longer needed
    • Handle errors gracefully with retries
    • Use appropriate polling intervals
    • Keep conversations concise for performance
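The per-poll state-transition logging recommended above can be sketched as a small tracker that logs only when the status changes. `createStatusTracker` and the record fields are hypothetical names; the injected `log` function is an assumption so the sketch works with any logger.

```javascript
// Hypothetical helper: returns an observe(run) function that logs a record
// only when run.status differs from the previously observed status.
function createStatusTracker(log = console.log) {
  let previous = null;
  const startedAt = Date.now();
  return function observe(run) {
    if (run.status === previous) {
      return false; // no transition, nothing logged
    }
    log(JSON.stringify({
      runId: run.id,
      from: previous,
      to: run.status,
      elapsedMs: Date.now() - startedAt,
    }));
    previous = run.status;
    return true;
  };
}

// Usage: call observe(run) after every poll; repeated statuses are skipped.
const transitionLines = [];
const observe = createStatusTracker((line) => transitionLines.push(line));
for (const status of ["queued", "queued", "in_progress", "in_progress", "completed"]) {
  observe({ id: "run_123", status });
}
console.log(transitionLines.join("\n"));
```

Logging transitions rather than every poll keeps log volume proportional to the number of state changes, not the polling frequency.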

    5. Handling data-loss or corruption scenarios

    In rare cases, failures can be caused by underlying data issues (for example, thread or agent records lost or corrupted). The guidance notes:

    • Failures may appear as inability to invoke an agent or interact with a thread
    • There are generally no specific recovery capabilities for data-level loss; affected agents/threads may be unrecoverable

    For production:

    • Log when a run fails in a way that suggests thread/agent corruption (e.g., repeated failures on the same thread while new threads succeed)
    • Implement logic to recreate threads or agents when such patterns are detected
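Detecting "repeated failures on the same thread" can be approximated with a per-thread counter of consecutive failures. This is a sketch only: `createThreadFailureTracker` and the threshold of 3 are assumptions, and the in-memory map would need to move to shared storage if the Web App runs multiple instances.

```javascript
// Hypothetical helper: counts consecutive failed runs per thread.
// recordFailure returns true once the thread reaches the threshold,
// signalling that the caller may want to recreate the thread.
function createThreadFailureTracker(threshold = 3) {
  const consecutiveFailures = new Map();
  return {
    recordFailure(threadId) {
      const count = (consecutiveFailures.get(threadId) ?? 0) + 1;
      consecutiveFailures.set(threadId, count);
      return count >= threshold;
    },
    recordSuccess(threadId) {
      consecutiveFailures.delete(threadId); // a success resets the streak
    },
  };
}

// Usage with a placeholder thread id.
const failureTracker = createThreadFailureTracker(3);
const flags = [
  failureTracker.recordFailure("thread_abc"),
  failureTracker.recordFailure("thread_abc"),
  failureTracker.recordFailure("thread_abc"),
];
console.log(flags);
```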

    6. Image-tool specific telemetry

    When using the image-generation tool (rather than inline image_url only), recommended checks and logs include:

    • Whether both orchestrator and gpt-image-1 deployments exist in the project
    • Whether the x-ms-oai-image-generation-deployment header matches the deployment name
    • Whether the tool is supported in the region and with the chosen orchestrator model
    • Whether content filtering blocked the request

    Logging these attributes per run makes it easier to distinguish platform issues from configuration or policy issues.

